BS-Join: A novel and efficient mixed batch-stream join method for spatiotemporal data management in Flink

Hangxu Ji; Su Jiang; Yuhai Zhao; Gang Wu; Guoren Wang; George Y. Yuan

doi:10.1016/j.future.2022.11.016

BS-Join: A novel and efficient mixed batch-stream join method for spatiotemporal data management in Flink

Hangxu Ji, Su Jiang, Yuhai Zhao^*, Gang Wu, Guoren Wang, George Y. Yuan

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Contribution to journal › Article › peer-review

Abstract

The new computing model, mixed batch-stream data processing, plays a crucial role in big spatiotemporal data managements. As the core of the above computing method, mixed batch-stream data join has high requirements on the throughput and latency due to the coexistence of two types of data sources. Apache Flink is the most suitable distributed system for mixed batch-stream data join, with lower latency than the join calculation model based on Hadoop and Spark, and it simulates remote real-time reading of batch data sources and completes join calculation with the DataStream API. However, as the degree of parallelism increases, frequent remote data reads will cause huge disk and communication pressure, thereby reducing the job efficiency and scalability. To make things trickier, the above effects are further amplified when simulating complex operations such as range joins. Aiming at the above difficulties and the characteristics of mixed batch-stream data join, a cache-based framework supporting mixed batch-stream join computing natively is proposed, which increases the search speed in the process of data join by building indexes in batch data sources. Meanwhile, for equijoin and range join, an optimization mechanism based on hotspot awareness and an optimization mechanism based on skip list are proposed respectively to further improve the job efficiency. In summary, the advantages of our work are highlighted as follows: (1) The proposed framework enables Flink to natively support mixed batch-stream data join, and can improve throughput by 5 times and speedup by 4 times; (2) The optimization mechanism based on hotspot awareness can further improve the efficiency of equijoin; (3) Compared with range queries by traditional Operators in Flink, the throughput can be increased by 6 times while the latency is reduced by 45%.

Original language	English
Pages (from-to)	67-80
Number of pages	14
Journal	Future Generation Computer Systems
Volume	141
DOIs	https://doi.org/10.1016/j.future.2022.11.016
Publication status	Published - Apr 2023

Keywords

Cache
Flink
Hotspot awareness
Mixed batch-stream data join
Skip list

Access to Document

10.1016/j.future.2022.11.016

Cite this

@article{af059d86a5e94e39b23723467eef4566,

title = "BS-Join: A novel and efficient mixed batch-stream join method for spatiotemporal data management in Flink",

abstract = "The new computing model, mixed batch-stream data processing, plays a crucial role in big spatiotemporal data managements. As the core of the above computing method, mixed batch-stream data join has high requirements on the throughput and latency due to the coexistence of two types of data sources. Apache Flink is the most suitable distributed system for mixed batch-stream data join, with lower latency than the join calculation model based on Hadoop and Spark, and it simulates remote real-time reading of batch data sources and completes join calculation with the DataStream API. However, as the degree of parallelism increases, frequent remote data reads will cause huge disk and communication pressure, thereby reducing the job efficiency and scalability. To make things trickier, the above effects are further amplified when simulating complex operations such as range joins. Aiming at the above difficulties and the characteristics of mixed batch-stream data join, a cache-based framework supporting mixed batch-stream join computing natively is proposed, which increases the search speed in the process of data join by building indexes in batch data sources. Meanwhile, for equijoin and range join, an optimization mechanism based on hotspot awareness and an optimization mechanism based on skip list are proposed respectively to further improve the job efficiency. In summary, the advantages of our work are highlighted as follows: (1) The proposed framework enables Flink to natively support mixed batch-stream data join, and can improve throughput by 5 times and speedup by 4 times; (2) The optimization mechanism based on hotspot awareness can further improve the efficiency of equijoin; (3) Compared with range queries by traditional Operators in Flink, the throughput can be increased by 6 times while the latency is reduced by 45%.",

keywords = "Cache, Flink, Hotspot awareness, Mixed batch-stream data join, Skip list",

author = "Hangxu Ji and Su Jiang and Yuhai Zhao and Gang Wu and Guoren Wang and Yuan, {George Y.}",

note = "Publisher Copyright: {\textcopyright} 2022 Elsevier B.V.",

year = "2023",

month = apr,

doi = "10.1016/j.future.2022.11.016",

language = "English",

volume = "141",

pages = "67--80",

journal = "Future Generation Computer Systems",

issn = "0167-739X",

publisher = "Elsevier B.V.",

}

TY - JOUR

T1 - BS-Join

T2 - A novel and efficient mixed batch-stream join method for spatiotemporal data management in Flink

AU - Ji, Hangxu

AU - Jiang, Su

AU - Zhao, Yuhai

AU - Wu, Gang

AU - Wang, Guoren

AU - Yuan, George Y.

PY - 2023/4

Y1 - 2023/4

N2 - The new computing model, mixed batch-stream data processing, plays a crucial role in big spatiotemporal data managements. As the core of the above computing method, mixed batch-stream data join has high requirements on the throughput and latency due to the coexistence of two types of data sources. Apache Flink is the most suitable distributed system for mixed batch-stream data join, with lower latency than the join calculation model based on Hadoop and Spark, and it simulates remote real-time reading of batch data sources and completes join calculation with the DataStream API. However, as the degree of parallelism increases, frequent remote data reads will cause huge disk and communication pressure, thereby reducing the job efficiency and scalability. To make things trickier, the above effects are further amplified when simulating complex operations such as range joins. Aiming at the above difficulties and the characteristics of mixed batch-stream data join, a cache-based framework supporting mixed batch-stream join computing natively is proposed, which increases the search speed in the process of data join by building indexes in batch data sources. Meanwhile, for equijoin and range join, an optimization mechanism based on hotspot awareness and an optimization mechanism based on skip list are proposed respectively to further improve the job efficiency. In summary, the advantages of our work are highlighted as follows: (1) The proposed framework enables Flink to natively support mixed batch-stream data join, and can improve throughput by 5 times and speedup by 4 times; (2) The optimization mechanism based on hotspot awareness can further improve the efficiency of equijoin; (3) Compared with range queries by traditional Operators in Flink, the throughput can be increased by 6 times while the latency is reduced by 45%.

AB - The new computing model, mixed batch-stream data processing, plays a crucial role in big spatiotemporal data managements. As the core of the above computing method, mixed batch-stream data join has high requirements on the throughput and latency due to the coexistence of two types of data sources. Apache Flink is the most suitable distributed system for mixed batch-stream data join, with lower latency than the join calculation model based on Hadoop and Spark, and it simulates remote real-time reading of batch data sources and completes join calculation with the DataStream API. However, as the degree of parallelism increases, frequent remote data reads will cause huge disk and communication pressure, thereby reducing the job efficiency and scalability. To make things trickier, the above effects are further amplified when simulating complex operations such as range joins. Aiming at the above difficulties and the characteristics of mixed batch-stream data join, a cache-based framework supporting mixed batch-stream join computing natively is proposed, which increases the search speed in the process of data join by building indexes in batch data sources. Meanwhile, for equijoin and range join, an optimization mechanism based on hotspot awareness and an optimization mechanism based on skip list are proposed respectively to further improve the job efficiency. In summary, the advantages of our work are highlighted as follows: (1) The proposed framework enables Flink to natively support mixed batch-stream data join, and can improve throughput by 5 times and speedup by 4 times; (2) The optimization mechanism based on hotspot awareness can further improve the efficiency of equijoin; (3) Compared with range queries by traditional Operators in Flink, the throughput can be increased by 6 times while the latency is reduced by 45%.

KW - Cache

KW - Flink

KW - Hotspot awareness

KW - Mixed batch-stream data join

KW - Skip list

UR - http://www.scopus.com/inward/record.url?scp=85142725075&partnerID=8YFLogxK

U2 - 10.1016/j.future.2022.11.016

DO - 10.1016/j.future.2022.11.016

M3 - Article

AN - SCOPUS:85142725075

SN - 0167-739X

VL - 141

SP - 67

EP - 80

JO - Future Generation Computer Systems

JF - Future Generation Computer Systems

ER -

BS-Join: A novel and efficient mixed batch-stream join method for spatiotemporal data management in Flink

Abstract

Keywords

Access to Document

Other files and links

Fingerprint

Cite this