
HDDse: Enabling high-dimensional disk state embedding for generic failure detection system of heterogeneous disks in large data centers

  • Ji Zhang
  • Ping Huang
  • Ke Zhou
  • Ming Xie
  • Sebastian Schelter
  • Huazhong University of Science and Technology
  • University of Amsterdam
  • Temple University
  • Tencent

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

The reliability of a storage system is crucial in large data centers. Hard disks are widely used as primary storage devices in modern data centers, where disk failures constantly happen. Disk failures could lead to a serious system interruption or even permanent data loss. Many hard disk failure detection approaches have been proposed to solve this problem. However, existing approaches are not generic models for heterogeneous disks in large data centers, e.g., most of the approaches only consider datasets consisting of disks from the same manufacturer (and often of the same disk models). Moreover, some approaches achieve high detection performance in most cases but cannot deliver satisfactory results when the datasets cover a relatively small number of disks or when new datasets appear that were not seen during training. In this paper, we propose a novel generic disk failure detection approach for heterogeneous disks that not only delivers better detection performance but also adapts well to disks that have not appeared in training, even when dealing with imbalanced or relatively small disk datasets. We employ a Long Short-Term Memory (LSTM) based siamese network that can learn the dynamically changing long-term behavior of disk health status. Moreover, this structure can generate unified and efficient high-dimensional disk state embeddings for failure detection of heterogeneous disks. Our evaluation results on two real-world data centers confirm that the proposed system is effective and outperforms several state-of-the-art approaches. Furthermore, we have successfully applied the proposed system to improve the reliability of a data center, where it exhibits practical long-term availability.

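Original language: English
Title of host publication: Proceedings of the 2020 USENIX Annual Technical Conference, ATC 2020
Publisher: USENIX Association
Pages: 111-126
Number of pages: 16
ISBN (electronic): 9781939133144
Publication status: Published - 2020
Externally published: Yes
Event: 2020 USENIX Annual Technical Conference, ATC 2020 - Virtual, Online
Duration: 15 Jul 2020 to 17 Jul 2020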

Publication series

Name: Proceedings of the 2020 USENIX Annual Technical Conference, ATC 2020

Conference

Conference: 2020 USENIX Annual Technical Conference, ATC 2020
Location: Virtual, Online
Period: 15/07/20 to 17/07/20
