面向Flink迭代计算的高效容错处理技术

Translated title of the contribution: Efficient Fault-Tolerant Processing Technology for Flink Iterative Computing

Wen Peng Guo, Yu Hai Zhao*, Guo Ren Wang, Liu Guo Wei

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

5 Citations (Scopus)

Abstract

Iterative computation is the repeated execution of the same logic and is widely used in machine learning and data mining methods. In big data processing and analysis, distributed iterative computing is an active research topic. Fault tolerance is a prerequisite for the high availability of distributed systems. Although the fault tolerance mechanisms of existing distributed systems deliver high availability, they overlook the efficiency of fault tolerance for iterative computing. This paper systematically studies the efficiency of iterative fault tolerance in Apache Flink, a hybrid batch and stream big data computing system. For stream processing tasks, Flink achieves fault tolerance with a "distributed snapshot" checkpoint mechanism; for iterative analysis of massive data, these checkpoints add unnecessary delay. For batch processing tasks, Flink achieves fault tolerance by re-executing the task from the beginning; this approach is simple to implement but incurs a large time overhead. To address these problems, this paper first proposes an optimistic iterative fault tolerance mechanism based on compensation functions. When an iterative task fails, this mechanism recovers it by optimistic compensation. It uses no additional fault tolerance measures during iterative execution and therefore introduces no additional fault tolerance overhead. On failure, a user-defined compensation function collects the iterative data held on healthy nodes and combines it with the initial iterative data to recover the partition data lost on the failed node; execution then continues until the iteration converges, ensuring efficient and smooth execution of the iterative task.
Because the optimistic iterative fault tolerance mechanism does not guarantee results fully consistent with those of a failure-free execution, this paper further proposes, for iterative tasks with higher accuracy requirements, a pessimistic iterative fault tolerance mechanism based on head-to-tail checkpoints, built on the iterative data flow model of the Flink system. Unlike traditional blocking checkpoints, which block downstream operators, this mechanism writes checkpoints in a non-blocking manner, exploits the characteristics of Flink's iterative data flow, and injects checkpoints of the variable data set into the iterative flow itself. Making the design iteration-aware simplifies the system architecture and reduces checkpoint cost and failure recovery time. Based on the Flink system, a comprehensive experimental study of the two proposed mechanisms is conducted on a large number of real and simulated data sets, covering both incremental and full iteration, and verifies the effectiveness and efficiency of the proposed iterative fault tolerance optimizations. The experimental results confirm that the optimistic and pessimistic fault tolerance mechanisms proposed for the Flink system outperform existing distributed iterative fault tolerance mechanisms in computational efficiency. The former improves running time by up to 22.8% in full iterative tasks and by up to 33.8% in incremental iterative tasks; the latter saves up to 15.3% of the time overhead in full iterative tasks and up to 18.5% in incremental iterative tasks.
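The optimistic mechanism described above can be illustrated with a minimal single-process sketch. It is not the paper's implementation and uses no Flink API: PageRank stands in for the iterative task, and the names `pagerank_step`, `compensate`, and `run`, the example graph, and the choice of renormalisation as the compensation rule are all assumptions made for illustration. The sketch writes no checkpoints during the iteration; when a simulated failure drops some vertices' ranks, the compensation function rebuilds them from the initial data and the surviving ranks, and the iteration simply continues to convergence.

```python
# Illustrative sketch of optimistic recovery with a compensation function.
# Assumption: PageRank is the iterative task; this is not Flink code.

def pagerank_step(ranks, links, damping=0.85):
    """One full iteration: every vertex spreads its rank over its out-links."""
    n = len(ranks)
    new = {v: (1.0 - damping) / n for v in ranks}
    for v, outs in links.items():
        share = damping * ranks[v] / len(outs)
        for u in outs:
            new[u] += share
    return new


def compensate(surviving, initial):
    """Hypothetical user-defined compensation function.

    Lost vertices fall back to their initial rank; all ranks are then
    rescaled so that they sum to 1 again (a consistent PageRank state).
    """
    restored = {v: surviving.get(v, initial[v]) for v in initial}
    total = sum(restored.values())
    return {v: r / total for v, r in restored.items()}


def run(links, fail_at=None, lost=(), tol=1e-10, max_iters=1000):
    """Iterate to convergence; optionally lose the ranks in `lost` once."""
    n = len(links)
    initial = {v: 1.0 / n for v in links}
    ranks = dict(initial)
    for i in range(max_iters):
        if i == fail_at:
            # Simulated node failure: the ranks of the `lost` vertices vanish.
            surviving = {v: r for v, r in ranks.items() if v not in lost}
            ranks = compensate(surviving, initial)
        new = pagerank_step(ranks, links)
        delta = sum(abs(new[v] - ranks[v]) for v in ranks)
        ranks = new
        if delta < tol:
            return ranks, i + 1
    return ranks, max_iters


# Example graph (assumed): every vertex has at least one out-link.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a", "c"]}
clean, clean_iters = run(links)
recovered, recovered_iters = run(links, fail_at=5, lost={"b", "d"})
```

In this particular example the iteration has a unique fixed point, so the recovered run ends within the convergence tolerance of the failure-free run. As the abstract notes, that agreement is not guaranteed for iterative tasks in general.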

Original language: Chinese (Traditional)
Pages (from-to): 2101-2118
Number of pages: 18
Journal: Jisuanji Xuebao/Chinese Journal of Computers
Volume: 43
Issue number: 11
Publication status: Published - Nov 2020
