Transfer learning based failure prediction for minority disks in large data centers of heterogeneous disk systems

Ji Zhang, Ke Zhou, Ping Huang, Xubin He, Zhili Xiao, Bin Cheng, Yongguang Ji, Yinhu Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

12 Citations (Scopus)

Abstract

The storage system in a large-scale data center is typically built upon thousands or even millions of disks, where disk failures occur constantly. A disk failure can lead to serious data loss, and thus to system unavailability or even catastrophic consequences if the lost data cannot be recovered. While replication and erasure coding techniques have been widely deployed to guarantee storage availability and reliability, disk failure prediction is gaining popularity because it has the potential to prevent disk failures from occurring in the first place. Recent work has turned toward applying machine learning approaches based on disk SMART attributes to disk failure prediction. However, traditional machine learning (ML) approaches require a large set of training data in order to deliver good predictive performance. In large-scale storage systems, new disks enter gradually to augment the storage capacity or to replace failed disks, so that over time the storage system comes to include small numbers of new disks from different vendors and/or different models from the same vendor. We refer to these relatively small populations of disks as minority disks. Due to the lack of sufficient training data, traditional ML approaches fail to deliver satisfactory predictive performance in evolving storage systems that consist of heterogeneous minority disks. To address this challenge and improve the predictive performance for minority disks in large data centers, we propose a minority disk failure prediction model named TLDFP, based on a transfer learning approach. Our evaluation results on two realistic datasets demonstrate that TLDFP delivers much more precise results than four popular prediction models based on traditional ML algorithms and two state-of-the-art transfer learning methods.

Original language: English
Title of host publication: Proceedings of the 48th International Conference on Parallel Processing, ICPP 2019
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450362955
Publication status: Published - 5 Aug 2019
Externally published: Yes
Event: 48th International Conference on Parallel Processing, ICPP 2019 - Kyoto, Japan
Duration: 5 Aug 2019 to 8 Aug 2019

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 48th International Conference on Parallel Processing, ICPP 2019
Country/Territory: Japan
City: Kyoto
Period: 5/08/19 to 8/08/19
