BS-Join: A novel and efficient mixed batch-stream join method for spatiotemporal data management in Flink

Hangxu Ji, Su Jiang, Yuhai Zhao*, Gang Wu, Guoren Wang, George Y. Yuan

*此作品的通讯作者

科研成果: 期刊稿件文章同行评审

摘要

The new computing model, mixed batch-stream data processing, plays a crucial role in big spatiotemporal data managements. As the core of the above computing method, mixed batch-stream data join has high requirements on the throughput and latency due to the coexistence of two types of data sources. Apache Flink is the most suitable distributed system for mixed batch-stream data join, with lower latency than the join calculation model based on Hadoop and Spark, and it simulates remote real-time reading of batch data sources and completes join calculation with the DataStream API. However, as the degree of parallelism increases, frequent remote data reads will cause huge disk and communication pressure, thereby reducing the job efficiency and scalability. To make things trickier, the above effects are further amplified when simulating complex operations such as range joins. Aiming at the above difficulties and the characteristics of mixed batch-stream data join, a cache-based framework supporting mixed batch-stream join computing natively is proposed, which increases the search speed in the process of data join by building indexes in batch data sources. Meanwhile, for equijoin and range join, an optimization mechanism based on hotspot awareness and an optimization mechanism based on skip list are proposed respectively to further improve the job efficiency. In summary, the advantages of our work are highlighted as follows: (1) The proposed framework enables Flink to natively support mixed batch-stream data join, and can improve throughput by 5 times and speedup by 4 times; (2) The optimization mechanism based on hotspot awareness can further improve the efficiency of equijoin; (3) Compared with range queries by traditional Operators in Flink, the throughput can be increased by 6 times while the latency is reduced by 45%.

源语言英语
页(从-至)67-80
页数14
期刊Future Generation Computer Systems
141
DOI
出版状态已出版 - 4月 2023

指纹

探究 'BS-Join: A novel and efficient mixed batch-stream join method for spatiotemporal data management in Flink' 的科研主题。它们共同构成独一无二的指纹。

引用此