joinTree: A novel join-oriented multivariate operator for spatio-temporal data management in Flink

Hangxu Ji, Gang Wu*, Yuhai Zhao, Shiye Wang, Guoren Wang, George Y. Yuan

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

In the era of the intelligent Internet, managing and analyzing massive spatio-temporal data is a key step toward realizing intelligent applications and building smart cities, and the interaction of multi-source data is the basis of that management and analysis. As an important platform for interactive computation over massive data, Flink provides the high-level Join operator to ease user program development. In a Flink job that connects data from multiple sources, the choice of join sequence and the data communication in the repartition phase are both key factors in job efficiency. However, Flink provides no optimization mechanism for either factor, which leads to low job efficiency. Finding the optimal join sequence by enumeration cannot be done in polynomial time, so exhaustive search does not deliver a practical optimization. We investigate these problems, design and implement joinTree, a more advanced operator that supports multi-source data connection in Flink, and introduce two optimization strategies into the operator. The advantages of our work are as follows: (1) the operator enables Flink to support multi-source data connection, and it reduces the amount of computation and data communication through lightweight optimization strategies that improve job efficiency; (2) with the optimization strategy for join sequence, the total running time can be reduced by 29% and the data communication by 34% compared with traditional sequential execution; (3) the optimization strategy for data repartition can bring a further 35% performance improvement and, in the average case, can reduce the data communication by 43%.
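To illustrate why the join sequence matters and why enumeration scales poorly, here is a minimal sketch. It is not the joinTree algorithm from the paper. It uses a generic greedy heuristic over a toy cost model, where the cost is the sum of estimated intermediate result sizes of a left-deep plan. The input names, row counts, and selectivities are hypothetical values chosen for illustration.

```python
from itertools import permutations

# Hypothetical input sizes (rows) and pairwise join selectivities.
# Pairs without an entry are treated as cross products (selectivity 1.0).
sizes = {"trips": 1_000_000, "stations": 1_000, "weather": 10_000, "zones": 100}
selectivity = {
    frozenset(("trips", "stations")): 1 / 1_000,
    frozenset(("trips", "weather")): 1 / 10_000,
    frozenset(("stations", "zones")): 1 / 100,
}


def plan_cost(order, sizes, selectivity):
    """Toy cost: sum of estimated intermediate result sizes of a left-deep plan."""
    joined = [order[0]]
    rows = sizes[order[0]]
    cost = 0.0
    for name in order[1:]:
        rows *= sizes[name]
        # Apply every join predicate between the new input and the inputs joined so far.
        for prev in joined:
            rows *= selectivity.get(frozenset((prev, name)), 1.0)
        joined.append(name)
        cost += rows
    return cost


def greedy_order(sizes, selectivity):
    """Start from the smallest input, then always add the input that
    produces the smallest next intermediate result (ties broken by name)."""
    remaining = set(sizes)
    first = min(remaining, key=lambda n: (sizes[n], n))
    order = [first]
    remaining.remove(first)
    while remaining:
        nxt = min(
            remaining,
            key=lambda n: (plan_cost(order + [n], sizes, selectivity), n),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order


def exhaustive_order(sizes, selectivity):
    """Enumerate all n! left-deep orders; feasible only for very small n."""
    return min(
        permutations(sorted(sizes)),
        key=lambda o: plan_cost(o, sizes, selectivity),
    )


# "Sequential execution" here means joining inputs in the order they were declared.
sequential = list(sizes)
greedy = greedy_order(sizes, selectivity)
best = exhaustive_order(sizes, selectivity)

sequential_cost = plan_cost(sequential, sizes, selectivity)
greedy_cost = plan_cost(greedy, sizes, selectivity)
best_cost = plan_cost(best, sizes, selectivity)
```

With these assumed statistics, the greedy order joins the two small inputs first and has a lower estimated cost than the declared order. The exhaustive search inspects 4! = 24 orders here, and that count grows factorially with the number of inputs, which is why a lightweight strategy is attractive.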

Original language: English
Pages (from-to): 107-132
Number of pages: 26
Journal: GeoInformatica
Volume: 27
Issue number: 1
Publication status: Published - Jan 2023

Keywords

  • Data connection
  • Data repartition
  • Flink
  • Join sequence
  • Spatio-temporal data management
