A fault-tolerant optimization mechanism for spatiotemporal data analysis in Flink

Hangxu Ji; Gang Wu; Yuhai Zhao; Liuguo Wei; Guoren Wang; Yuchen Fan

doi:10.1007/s11280-022-01006-5

Hangxu Ji, Gang Wu^*, Yuhai Zhao, Liuguo Wei, Guoren Wang, Yuchen Fan

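^*Corresponding author for this work

School of Computer Science

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract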

Spatiotemporal data analysis plays a vital role in big data processing and is a research hotspot in location-aware and recommender systems. In these applications, graph modeling and distributed iterative computing form the basis of data querying and mining. Because they repeatedly execute the same computation logic, iterative jobs are time-consuming and place heavy pressure on system resources. Moreover, iterative jobs always face the risk of being stopped by a computing-node fault, which in turn causes serious economic losses. At present, the recovery strategy of Flink, a latest-generation distributed computing system, for node faults in batch-processing mode is to restart the job from the beginning, which is extremely time-consuming. If the checkpoint mechanism of Flink’s stream-processing mode is used to recover from batch-job failures, it greatly increases the running time and storage overhead in the failure-free state. Therefore, a lightweight fault-tolerant mechanism is needed to reduce failure recovery time while preserving the job efficiency of spatiotemporal data analysis. In view of this situation and the characteristics of the iterative computing model for graph computing, a single-node failure recovery mechanism that recovers only the failed node is proposed, which reduces failure recovery time by introducing lightweight checkpoints and local logs. Building on this single-node mechanism, a failure recovery mechanism for multi-node and associated faults is proposed, which can cope with more complex failure situations. Experimental results show that the proposed method recovers jobs quickly and effectively after failure, reducing the average recovery time by 37% for single-node faults and by 24% for multi-node faults.
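
The abstract gives no implementation details, so the following is only a rough sketch of the general idea it names: recovering a single failed node from a lightweight checkpoint plus locally kept logs, while healthy nodes keep their state. It is not the authors' mechanism and uses no Flink API. The `Worker` class, the `step` update rule, the all-to-all message pattern, and the checkpoint interval are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Worker:
    wid: int
    value: float
    # (iteration, value at the start of that iteration).
    # Assumption: checkpoints live on storage that survives a worker crash.
    checkpoint: Optional[Tuple[int, float]] = None
    # iteration -> message this worker sent; kept locally on the worker.
    sent_log: Dict[int, float] = field(default_factory=dict)


def step(value: float, incoming: List[float]) -> float:
    """Toy update rule: blend own value with the mean of incoming messages."""
    return 0.5 * value + 0.5 * sum(incoming) / len(incoming)


def run(workers: List[Worker], first_it: int, last_it: int, checkpoint_every: int) -> None:
    """Run iterations first_it..last_it, checkpointing every few iterations."""
    for it in range(first_it, last_it + 1):
        if (it - 1) % checkpoint_every == 0:
            for w in workers:
                w.checkpoint = (it, w.value)
        # Every worker logs the message it sends in this iteration.
        for w in workers:
            w.sent_log[it] = w.value
        # Every worker updates from the messages of all other workers.
        for w in workers:
            incoming = [o.sent_log[it] for o in workers if o is not w]
            w.value = step(w.value, incoming)


def recover(failed: Worker, workers: List[Worker], last_done_it: int) -> None:
    """Rebuild only the failed worker; healthy workers are not rolled back."""
    ckpt_it, value = failed.checkpoint
    failed.sent_log = {}  # the failed worker's local log is gone as well
    for it in range(ckpt_it, last_done_it + 1):
        failed.sent_log[it] = value  # rebuild its own log while replaying
        incoming = [o.sent_log[it] for o in workers if o is not failed]
        value = step(value, incoming)
    failed.value = value


def demo() -> Tuple[List[float], List[float]]:
    # Failure-free run of 8 iterations.
    baseline = [Worker(i, float(v)) for i, v in enumerate([1, 5, 9])]
    run(baseline, 1, 8, checkpoint_every=3)

    # Same job, but worker 1 loses its state after iteration 5.
    cluster = [Worker(i, float(v)) for i, v in enumerate([1, 5, 9])]
    run(cluster, 1, 5, checkpoint_every=3)
    cluster[1].value = float("nan")  # simulated crash
    recover(cluster[1], cluster, last_done_it=5)
    run(cluster, 6, 8, checkpoint_every=3)
    return [w.value for w in baseline], [w.value for w in cluster]
```

Calling `demo()` returns the final values of the failure-free run and of the run with a simulated crash. In this toy model the two lists match, because the failed worker replays only the iterations since its last checkpoint, using messages the healthy workers had already logged.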

Original language	English
Pages (from-to)	867-887
Number of pages	21
Journal	World Wide Web
Volume	26
Issue number	3
DOI	https://doi.org/10.1007/s11280-022-01006-5
Publication status	Published - May 2023

Access to document

10.1007/s11280-022-01006-5

Other files and links

Link to publication in Scopus

Cite this

Ji, H., Wu, G., Zhao, Y., Wei, L., Wang, G., & Fan, Y. (2023). A fault-tolerant optimization mechanism for spatiotemporal data analysis in Flink. World Wide Web, 26(3), 867–887. https://doi.org/10.1007/s11280-022-01006-5

@article{fdd7fd0d595f4f8c954c770fccc2eca0,

title = "A fault-tolerant optimization mechanism for spatiotemporal data analysis in {F}link",

abstract = "Spatiotemporal data analysis plays a vital role in big data processing and is a research hotspot in location-aware and recommender systems. In these applications, graph modeling and distributed iterative computing form the basis of data querying and mining. Because they repeatedly execute the same computation logic, iterative jobs are time-consuming and place heavy pressure on system resources. Moreover, iterative jobs always face the risk of being stopped by a computing-node fault, which in turn causes serious economic losses. At present, the recovery strategy of Flink, a latest-generation distributed computing system, for node faults in batch-processing mode is to restart the job from the beginning, which is extremely time-consuming. If the checkpoint mechanism of Flink{\textquoteright}s stream-processing mode is used to recover from batch-job failures, it greatly increases the running time and storage overhead in the failure-free state. Therefore, a lightweight fault-tolerant mechanism is needed to reduce failure recovery time while preserving the job efficiency of spatiotemporal data analysis. In view of this situation and the characteristics of the iterative computing model for graph computing, a single-node failure recovery mechanism that recovers only the failed node is proposed, which reduces failure recovery time by introducing lightweight checkpoints and local logs. Building on this single-node mechanism, a failure recovery mechanism for multi-node and associated faults is proposed, which can cope with more complex failure situations. Experimental results show that the proposed method recovers jobs quickly and effectively after failure, reducing the average recovery time by 37\% for single-node faults and by 24\% for multi-node faults.",

keywords = "Failure recovery, Fault-tolerant, Flink, Iterative computing, Spatiotemporal data analysis",

author = "Hangxu Ji and Gang Wu and Yuhai Zhao and Liuguo Wei and Guoren Wang and Yuchen Fan",

note = "Publisher Copyright: {\textcopyright} 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.",

year = "2023",

month = may,

doi = "10.1007/s11280-022-01006-5",

language = "English",

volume = "26",

pages = "867--887",

journal = "World Wide Web",

issn = "1386-145X",

publisher = "Springer New York",

number = "3",

}

TY - JOUR

T1 - A fault-tolerant optimization mechanism for spatiotemporal data analysis in Flink

AU - Ji, Hangxu

AU - Wu, Gang

AU - Zhao, Yuhai

AU - Wei, Liuguo

AU - Wang, Guoren

AU - Fan, Yuchen

PY - 2023/5

Y1 - 2023/5

N2 - Spatiotemporal data analysis plays a vital role in big data processing and is a research hotspot in location-aware and recommender systems. In these applications, graph modeling and distributed iterative computing form the basis of data querying and mining. Because they repeatedly execute the same computation logic, iterative jobs are time-consuming and place heavy pressure on system resources. Moreover, iterative jobs always face the risk of being stopped by a computing-node fault, which in turn causes serious economic losses. At present, the recovery strategy of Flink, a latest-generation distributed computing system, for node faults in batch-processing mode is to restart the job from the beginning, which is extremely time-consuming. If the checkpoint mechanism of Flink’s stream-processing mode is used to recover from batch-job failures, it greatly increases the running time and storage overhead in the failure-free state. Therefore, a lightweight fault-tolerant mechanism is needed to reduce failure recovery time while preserving the job efficiency of spatiotemporal data analysis. In view of this situation and the characteristics of the iterative computing model for graph computing, a single-node failure recovery mechanism that recovers only the failed node is proposed, which reduces failure recovery time by introducing lightweight checkpoints and local logs. Building on this single-node mechanism, a failure recovery mechanism for multi-node and associated faults is proposed, which can cope with more complex failure situations. Experimental results show that the proposed method recovers jobs quickly and effectively after failure, reducing the average recovery time by 37% for single-node faults and by 24% for multi-node faults.

AB - Spatiotemporal data analysis plays a vital role in big data processing and is a research hotspot in location-aware and recommender systems. In these applications, graph modeling and distributed iterative computing form the basis of data querying and mining. Because they repeatedly execute the same computation logic, iterative jobs are time-consuming and place heavy pressure on system resources. Moreover, iterative jobs always face the risk of being stopped by a computing-node fault, which in turn causes serious economic losses. At present, the recovery strategy of Flink, a latest-generation distributed computing system, for node faults in batch-processing mode is to restart the job from the beginning, which is extremely time-consuming. If the checkpoint mechanism of Flink’s stream-processing mode is used to recover from batch-job failures, it greatly increases the running time and storage overhead in the failure-free state. Therefore, a lightweight fault-tolerant mechanism is needed to reduce failure recovery time while preserving the job efficiency of spatiotemporal data analysis. In view of this situation and the characteristics of the iterative computing model for graph computing, a single-node failure recovery mechanism that recovers only the failed node is proposed, which reduces failure recovery time by introducing lightweight checkpoints and local logs. Building on this single-node mechanism, a failure recovery mechanism for multi-node and associated faults is proposed, which can cope with more complex failure situations. Experimental results show that the proposed method recovers jobs quickly and effectively after failure, reducing the average recovery time by 37% for single-node faults and by 24% for multi-node faults.

KW - Failure recovery

KW - Fault-tolerant

KW - Flink

KW - Iterative computing

KW - Spatiotemporal data analysis

UR - http://www.scopus.com/inward/record.url?scp=85127608977&partnerID=8YFLogxK

U2 - 10.1007/s11280-022-01006-5

DO - 10.1007/s11280-022-01006-5

M3 - Article

AN - SCOPUS:85127608977

SN - 1386-145X

VL - 26

SP - 867

EP - 887

JO - World Wide Web

JF - World Wide Web

IS - 3

ER -
