Optimization for Large-Scale Dimension Table Connection Technology in Distributed Environment

Hengtai Zhao, Yuhai Zhao*, Ye Yuan, Hangxu Ji, Baiyou Qiao, Guoren Wang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Connecting large-scale dimension tables in a distributed environment is a key technology in online big data analysis and is widely used in real-time recommendation, real-time analysis, and other fields. A dimension table connection joins stream data with dimension tables stored offline so that the combined records can be processed. First, this paper studies existing dimension table connection technology and surveys the design of related optimization techniques and mainstream distributed engines. The traditional way to improve performance is to optimize dimension table queries, but this approach is limited by the scale of the dimension table and the rate of the data stream. Second, because existing optimization techniques do not take the whole cluster into account and therefore use it inefficiently in a distributed environment, this paper puts forward a computing model suited to hybrid computation over offline batch data and real-time stream data. It proposes a caching method for dimension-table-associated data that reads the dimension table data from a single node, segments it, and then distributes it for computation. It also optimizes the computing logic of the dimension table connection so that larger dimension tables can be handled and the data connection limitation is overcome. Finally, both the proposed dimension table connection technology and the traditional one are implemented in Apache Flink. The optimization is verified by experiments comparing throughput and latency on a dataset from Alibaba Group's Double 11 Shopping Carnival.
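The caching idea in the abstract (read the dimension table once, segment it, and distribute the segments so each worker joins against its own share) can be illustrated with a minimal single-process sketch. This is not the paper's implementation and does not use Apache Flink: the function names, the CRC32 hash partitioner, the partition count, and the sample records are all assumptions made for illustration only.

```python
from zlib import crc32


def partition_of(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioner shared by both inputs (an assumed choice)."""
    return crc32(key.encode("utf-8")) % num_partitions


def segment_dimension_table(rows, num_partitions):
    """Split a dimension table, read once, into per-worker cache shards."""
    shards = [dict() for _ in range(num_partitions)]
    for key, attributes in rows:
        shards[partition_of(key, num_partitions)][key] = attributes
    return shards


def join_stream(events, shards):
    """Route each event to the shard that owns its key and join locally."""
    num_partitions = len(shards)
    for key, payload in events:
        shard = shards[partition_of(key, num_partitions)]
        attributes = shard.get(key)  # None when the key has no dimension row
        yield key, payload, attributes


# Hypothetical sample data: a tiny dimension table and a short event stream.
dimension_rows = [
    ("item-1", {"category": "books"}),
    ("item-2", {"category": "toys"}),
    ("item-3", {"category": "food"}),
]
events = [("item-2", "click"), ("item-1", "buy"), ("item-9", "click")]

shards = segment_dimension_table(dimension_rows, num_partitions=2)
joined = list(join_stream(events, shards))
```

Because the stream and the dimension table are partitioned with the same function, each worker in this sketch holds only a fraction of the table, which is one plausible way a larger table could be supported than with a full per-node cache.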

Original language: English
Pages (from-to): 337-347
Number of pages: 11
Journal: Journal of Frontiers of Computer Science and Technology
Volume: 16
Issue number: 2
Publication status: Published - 1 Feb 2022

Keywords

  • Apache Flink
  • cache technology
  • dimension table connection
  • distributed computing
