Fast Failure Recovery in Vertex-Centric Distributed Graph Processing Systems

Wei Lu; Yanyan Shen; Tongtong Wang; Meihui Zhang; H. V. Jagadish; Xiaoyong Du

doi:10.1109/TKDE.2018.2843361

Wei Lu, Yanyan Shen, Tongtong Wang, Meihui Zhang^*, H. V. Jagadish, Xiaoyong Du

^*Corresponding author for this work

School of Computer Science

Research output: Contribution to journal › Article › peer-review

12 Citations (Scopus)

Abstract

There is a growing need for distributed graph processing systems to have many more compute nodes processing graph-based Big Data applications, which, however, increases the chance of node failures. To address the issue, we propose a novel recovery scheme to accelerate the recovery process by parallelizing the recomputation. Once a failure occurs, all recomputations are confined to subgraphs that originally reside in the failed compute nodes. When the recovery starts, these subgraphs are reassigned to another set of compute nodes, where the recomputation over these subgraphs is conducted in parallel. To minimize the recovery latency, we also develop a reassignment strategy, from these subgraphs to the replaced compute nodes, by properly leveraging the computation and communication cost. We integrate the proposed recovery scheme into the Giraph system, a widely used graph processing system. The experimental results over a variety of real graph datasets demonstrate that our proposed recovery scheme outperforms existing recovery methods by up to 30x on a cluster of 40 compute nodes.
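
The reassignment idea in the abstract (spreading the failed nodes' subgraphs over several compute nodes while weighing computation against communication cost) can be illustrated with a small sketch. This is not the paper's algorithm: it is a plain greedy heuristic, and the function name `reassign_partitions`, the cost dictionaries, and the example numbers are all hypothetical, chosen only to show the trade-off.

```python
from typing import Dict, List, Tuple


def reassign_partitions(
    compute_cost: Dict[str, float],
    comm_cost: Dict[str, Dict[str, float]],
    nodes: List[str],
) -> Tuple[Dict[str, str], Dict[str, float]]:
    """Greedy sketch (an assumption, not the published method).

    compute_cost[p]  : estimated recomputation cost of failed partition p
    comm_cost[p][n]  : estimated communication cost if p is placed on node n
    Returns the partition -> node assignment and the resulting load per node.
    """
    load = {n: 0.0 for n in nodes}
    assignment: Dict[str, str] = {}
    # Place the most expensive partitions first so they spread across nodes.
    for part in sorted(compute_cost, key=compute_cost.get, reverse=True):
        # Pick the node whose load stays lowest after taking this partition.
        best = min(
            nodes,
            key=lambda n: load[n] + compute_cost[part] + comm_cost[part][n],
        )
        load[best] += compute_cost[part] + comm_cost[part][best]
        assignment[part] = best
    return assignment, load


# Hypothetical costs for four partitions lost with a failed node.
compute_cost = {"p0": 8.0, "p1": 5.0, "p2": 4.0, "p3": 3.0}
comm_cost = {
    "p0": {"n1": 1.0, "n2": 2.0},
    "p1": {"n1": 2.0, "n2": 1.0},
    "p2": {"n1": 1.0, "n2": 3.0},
    "p3": {"n1": 2.0, "n2": 1.0},
}
assignment, load = reassign_partitions(compute_cost, comm_cost, ["n1", "n2"])
print(assignment)
print(max(load.values()))  # estimated recovery latency = most loaded node
```

Because the surviving nodes recompute in parallel, the sketch treats the most heavily loaded node as the estimate of recovery latency, which is why each partition goes to the node that keeps that load lowest.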

Original language	English
Article number	8371278
Pages (from-to)	733-746
Number of pages	14
Journal	IEEE Transactions on Knowledge and Data Engineering
Volume	31
Issue number	4
DOI	https://doi.org/10.1109/TKDE.2018.2843361
Publication status	Published - 1 Apr 2019

Cite this

Lu, W., Shen, Y., Wang, T., Zhang, M., Jagadish, H. V., & Du, X. (2019). Fast Failure Recovery in Vertex-Centric Distributed Graph Processing Systems. IEEE Transactions on Knowledge and Data Engineering, 31(4), 733-746. Article 8371278. https://doi.org/10.1109/TKDE.2018.2843361

@article{9d51f70037cd4b17a0aa706ff98ccc48,

title = "Fast Failure Recovery in Vertex-Centric Distributed Graph Processing Systems",

abstract = "There is a growing need for distributed graph processing systems to have many more compute nodes processing graph-based Big Data applications, which, however, increases the chance of node failures. To address the issue, we propose a novel recovery scheme to accelerate the recovery process by parallelizing the recomputation. Once a failure occurs, all recomputations are confined to subgraphs that originally reside in the failed compute nodes. When the recovery starts, these subgraphs are reassigned to another set of compute nodes, where the recomputation over these subgraphs are conducted in parallel. To minimize the recovery latency, we also develop a reassignment strategy, from these subgraphs to the replaced compute nodes, by properly leveraging the computation and communication cost. We integrate the proposed recovery scheme into Giraph system, a widely used graph processing system. The experimental results over a variety of real graph datasets demonstrate that our proposed recovery scheme outperforms existing recovery methods by up to 30x on a cluster of 40 compute nodes.",

keywords = "Distributed graph processing systems, checkpoint, compression, failure recovery, log, partition-based recovery",

author = "Wei Lu and Yanyan Shen and Tongtong Wang and Meihui Zhang and Jagadish, {H. V.} and Xiaoyong Du",

note = "Publisher Copyright: {\textcopyright} 1989-2012 IEEE.",

year = "2019",

month = apr,

day = "1",

doi = "10.1109/TKDE.2018.2843361",

language = "English",

volume = "31",

pages = "733--746",

journal = "IEEE Transactions on Knowledge and Data Engineering",

issn = "1041-4347",

publisher = "IEEE Computer Society",

number = "4",

}

TY - JOUR

T1 - Fast Failure Recovery in Vertex-Centric Distributed Graph Processing Systems

AU - Lu, Wei

AU - Shen, Yanyan

AU - Wang, Tongtong

AU - Zhang, Meihui

AU - Jagadish, H. V.

AU - Du, Xiaoyong

PY - 2019/4/1

Y1 - 2019/4/1

AB - There is a growing need for distributed graph processing systems to have many more compute nodes processing graph-based Big Data applications, which, however, increases the chance of node failures. To address the issue, we propose a novel recovery scheme to accelerate the recovery process by parallelizing the recomputation. Once a failure occurs, all recomputations are confined to subgraphs that originally reside in the failed compute nodes. When the recovery starts, these subgraphs are reassigned to another set of compute nodes, where the recomputation over these subgraphs are conducted in parallel. To minimize the recovery latency, we also develop a reassignment strategy, from these subgraphs to the replaced compute nodes, by properly leveraging the computation and communication cost. We integrate the proposed recovery scheme into Giraph system, a widely used graph processing system. The experimental results over a variety of real graph datasets demonstrate that our proposed recovery scheme outperforms existing recovery methods by up to 30x on a cluster of 40 compute nodes.

KW - Distributed graph processing systems

KW - checkpoint

KW - compression

KW - failure recovery

KW - log

KW - partition-based recovery

UR - http://www.scopus.com/inward/record.url?scp=85047981402&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2018.2843361

DO - 10.1109/TKDE.2018.2843361

M3 - Article

AN - SCOPUS:85047981402

SN - 1041-4347

VL - 31

SP - 733

EP - 746

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

IS - 4

M1 - 8371278

ER -
