Fast Failure Recovery in Vertex-Centric Distributed Graph Processing Systems

  • Wei Lu
  • , Yanyan Shen
  • , Tongtong Wang
  • , Meihui Zhang*
  • , H. V. Jagadish
  • , Xiaoyong Du
  • *Corresponding author for this work

Research output: Contribution to journalArticlepeer-review

13 Citations (Scopus)

Abstract

There is a growing need for distributed graph processing systems to have many more compute nodes processing graph-based Big Data applications, which, however, increases the chance of node failures. To address the issue, we propose a novel recovery scheme to accelerate the recovery process by parallelizing the recomputation. Once a failure occurs, all recomputations are confined to subgraphs that originally reside in the failed compute nodes. When the recovery starts, these subgraphs are reassigned to another set of compute nodes, where the recomputation over these subgraphs are conducted in parallel. To minimize the recovery latency, we also develop a reassignment strategy, from these subgraphs to the replaced compute nodes, by properly leveraging the computation and communication cost. We integrate the proposed recovery scheme into Giraph system, a widely used graph processing system. The experimental results over a variety of real graph datasets demonstrate that our proposed recovery scheme outperforms existing recovery methods by up to 30x on a cluster of 40 compute nodes.

Original languageEnglish
Article number8371278
Pages (from-to)733-746
Number of pages14
JournalIEEE Transactions on Knowledge and Data Engineering
Volume31
Issue number4
DOIs
Publication statusPublished - 1 Apr 2019

Keywords

  • Distributed graph processing systems
  • checkpoint
  • compression
  • failure recovery
  • log
  • partition-based recovery

Fingerprint

Dive into the research topics of 'Fast Failure Recovery in Vertex-Centric Distributed Graph Processing Systems'. Together they form a unique fingerprint.

Cite this