BS-Join: A novel and efficient mixed batch-stream join method for spatiotemporal data management in Flink

Hangxu Ji, Su Jiang, Yuhai Zhao*, Gang Wu, Guoren Wang, George Y. Yuan

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Mixed batch-stream data processing is an emerging computing model that plays a crucial role in big spatiotemporal data management. Mixed batch-stream data join, the core operation of this model, places high demands on throughput and latency because two types of data sources coexist. Apache Flink is the most suitable distributed system for mixed batch-stream data join: it offers lower latency than join models based on Hadoop and Spark, and it completes the join with the DataStream API by simulating remote real-time reads of the batch data source. However, as the degree of parallelism increases, frequent remote data reads put heavy pressure on disks and on the network, reducing job efficiency and scalability. These effects are amplified further when simulating complex operations such as range joins. To address these difficulties, and based on the characteristics of mixed batch-stream data join, we propose a cache-based framework that natively supports mixed batch-stream join computing and speeds up lookups during the join by building indexes on the batch data sources. In addition, we propose two optimization mechanisms to further improve job efficiency: one based on hotspot awareness for equijoin and one based on a skip list for range join. In summary, the advantages of our work are as follows: (1) the proposed framework enables Flink to natively support mixed batch-stream data join, and can improve throughput by 5 times and speedup by 4 times; (2) the optimization mechanism based on hotspot awareness can further improve the efficiency of equijoin; (3) compared with range queries using traditional operators in Flink, throughput can be increased by 6 times while latency is reduced by 45%.
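The general idea of probing a locally cached, indexed copy of the batch data instead of re-reading it remotely can be illustrated with a minimal sketch. This is not the authors' implementation and does not use Flink. The names `BatchSideCache`, `equi_probe`, `range_probe`, and `range_join` are hypothetical, and a sorted key list searched with `bisect` stands in for the skip list, since both offer ordered lookup in logarithmic time. Hotspot awareness is not modeled.

```python
import bisect
from collections import defaultdict


class BatchSideCache:
    """Hypothetical in-memory cache of batch records, indexed by join key."""

    def __init__(self, records):
        # records: iterable of (key, payload) pairs, read once from the batch source.
        self._by_key = defaultdict(list)
        for key, payload in records:
            self._by_key[key].append(payload)
        # Assumption: a sorted key list replaces the skip list named in the abstract.
        self._sorted_keys = sorted(self._by_key)

    def equi_probe(self, key):
        """Return all cached payloads whose key equals `key`."""
        return list(self._by_key.get(key, ()))

    def range_probe(self, low, high):
        """Return (key, payload) pairs for all cached keys with low <= key <= high."""
        start = bisect.bisect_left(self._sorted_keys, low)
        end = bisect.bisect_right(self._sorted_keys, high)
        return [(k, p) for k in self._sorted_keys[start:end] for p in self._by_key[k]]


def range_join(cache, stream, radius):
    """Join each (key, event) stream element with cached keys within `radius`."""
    for key, event in stream:
        for batch_key, payload in cache.range_probe(key - radius, key + radius):
            yield (event, batch_key, payload)
```

Each stream element triggers only an in-memory lookup, which is the property the cache-based design relies on to avoid repeated remote reads as parallelism grows.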

Original language: English
Pages (from-to): 67-80
Number of pages: 14
Journal: Future Generation Computer Systems
Volume: 141
Publication status: Published - Apr 2023

Keywords

  • Cache
  • Flink
  • Hotspot awareness
  • Mixed batch-stream data join
  • Skip list
