TY - JOUR
T1 - BS-Join
T2 - A novel and efficient mixed batch-stream join method for spatiotemporal data management in Flink
AU - Ji, Hangxu
AU - Jiang, Su
AU - Zhao, Yuhai
AU - Wu, Gang
AU - Wang, Guoren
AU - Yuan, George Y.
N1 - Publisher Copyright:
© 2022 Elsevier B.V.
PY - 2023/4
Y1 - 2023/4
N2 - The new computing model, mixed batch-stream data processing, plays a crucial role in big spatiotemporal data managements. As the core of the above computing method, mixed batch-stream data join has high requirements on the throughput and latency due to the coexistence of two types of data sources. Apache Flink is the most suitable distributed system for mixed batch-stream data join, with lower latency than the join calculation model based on Hadoop and Spark, and it simulates remote real-time reading of batch data sources and completes join calculation with the DataStream API. However, as the degree of parallelism increases, frequent remote data reads will cause huge disk and communication pressure, thereby reducing the job efficiency and scalability. To make things trickier, the above effects are further amplified when simulating complex operations such as range joins. Aiming at the above difficulties and the characteristics of mixed batch-stream data join, a cache-based framework supporting mixed batch-stream join computing natively is proposed, which increases the search speed in the process of data join by building indexes in batch data sources. Meanwhile, for equijoin and range join, an optimization mechanism based on hotspot awareness and an optimization mechanism based on skip list are proposed respectively to further improve the job efficiency. In summary, the advantages of our work are highlighted as follows: (1) The proposed framework enables Flink to natively support mixed batch-stream data join, and can improve throughput by 5 times and speedup by 4 times; (2) The optimization mechanism based on hotspot awareness can further improve the efficiency of equijoin; (3) Compared with range queries by traditional Operators in Flink, the throughput can be increased by 6 times while the latency is reduced by 45%.
AB - The new computing model, mixed batch-stream data processing, plays a crucial role in big spatiotemporal data managements. As the core of the above computing method, mixed batch-stream data join has high requirements on the throughput and latency due to the coexistence of two types of data sources. Apache Flink is the most suitable distributed system for mixed batch-stream data join, with lower latency than the join calculation model based on Hadoop and Spark, and it simulates remote real-time reading of batch data sources and completes join calculation with the DataStream API. However, as the degree of parallelism increases, frequent remote data reads will cause huge disk and communication pressure, thereby reducing the job efficiency and scalability. To make things trickier, the above effects are further amplified when simulating complex operations such as range joins. Aiming at the above difficulties and the characteristics of mixed batch-stream data join, a cache-based framework supporting mixed batch-stream join computing natively is proposed, which increases the search speed in the process of data join by building indexes in batch data sources. Meanwhile, for equijoin and range join, an optimization mechanism based on hotspot awareness and an optimization mechanism based on skip list are proposed respectively to further improve the job efficiency. In summary, the advantages of our work are highlighted as follows: (1) The proposed framework enables Flink to natively support mixed batch-stream data join, and can improve throughput by 5 times and speedup by 4 times; (2) The optimization mechanism based on hotspot awareness can further improve the efficiency of equijoin; (3) Compared with range queries by traditional Operators in Flink, the throughput can be increased by 6 times while the latency is reduced by 45%.
KW - Cache
KW - Flink
KW - Hotspot awareness
KW - Mixed batch-stream data join
KW - Skip list
UR - http://www.scopus.com/inward/record.url?scp=85142725075&partnerID=8YFLogxK
U2 - 10.1016/j.future.2022.11.016
DO - 10.1016/j.future.2022.11.016
M3 - Article
AN - SCOPUS:85142725075
SN - 0167-739X
VL - 141
SP - 67
EP - 80
JO - Future Generation Computer Systems
JF - Future Generation Computer Systems
ER -