TY - JOUR
T1 - Fast Failure Recovery in Vertex-Centric Distributed Graph Processing Systems
AU - Lu, Wei
AU - Shen, Yanyan
AU - Wang, Tongtong
AU - Zhang, Meihui
AU - Jagadish, H. V.
AU - Du, Xiaoyong
N1 - Publisher Copyright:
© 1989-2012 IEEE.
PY - 2019/4/1
Y1 - 2019/4/1
N2 - There is a growing need for distributed graph processing systems to have many more compute nodes processing graph-based Big Data applications, which, however, increases the chance of node failures. To address the issue, we propose a novel recovery scheme to accelerate the recovery process by parallelizing the recomputation. Once a failure occurs, all recomputations are confined to subgraphs that originally reside in the failed compute nodes. When the recovery starts, these subgraphs are reassigned to another set of compute nodes, where the recomputation over these subgraphs are conducted in parallel. To minimize the recovery latency, we also develop a reassignment strategy, from these subgraphs to the replaced compute nodes, by properly leveraging the computation and communication cost. We integrate the proposed recovery scheme into Giraph system, a widely used graph processing system. The experimental results over a variety of real graph datasets demonstrate that our proposed recovery scheme outperforms existing recovery methods by up to 30x on a cluster of 40 compute nodes.
AB - There is a growing need for distributed graph processing systems to have many more compute nodes processing graph-based Big Data applications, which, however, increases the chance of node failures. To address the issue, we propose a novel recovery scheme to accelerate the recovery process by parallelizing the recomputation. Once a failure occurs, all recomputations are confined to subgraphs that originally reside in the failed compute nodes. When the recovery starts, these subgraphs are reassigned to another set of compute nodes, where the recomputation over these subgraphs are conducted in parallel. To minimize the recovery latency, we also develop a reassignment strategy, from these subgraphs to the replaced compute nodes, by properly leveraging the computation and communication cost. We integrate the proposed recovery scheme into Giraph system, a widely used graph processing system. The experimental results over a variety of real graph datasets demonstrate that our proposed recovery scheme outperforms existing recovery methods by up to 30x on a cluster of 40 compute nodes.
KW - Distributed graph processing systems
KW - checkpoint
KW - compression
KW - failure recovery
KW - log
KW - partition-based recovery
UR - http://www.scopus.com/inward/record.url?scp=85047981402&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2018.2843361
DO - 10.1109/TKDE.2018.2843361
M3 - Article
AN - SCOPUS:85047981402
SN - 1041-4347
VL - 31
SP - 733
EP - 746
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 4
M1 - 8371278
ER -