TY - JOUR
T1 - A Survey of Approximate Quantile Computation on Large-Scale Data
AU - Chen, Zhiwei
AU - Zhang, Aoqian
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2020
Y1 - 2020
N2 - As data volume grows extensively, data profiling helps to extract metadata of large-scale data. However, one kind of metadata, order statistics, is difficult to be computed because they are not mergeable or incremental. Thus, the limitation of time and memory space does not support their computation on large-scale data. In this paper, we focus on an order statistic, quantiles, and present a comprehensive analysis of studies on approximate quantile computation. Both deterministic algorithms and randomized algorithms that compute approximate quantiles over streaming models or distributed models are covered. Then, multiple techniques for improving the efficiency and performance of approximate quantile algorithms in various scenarios, such as skewed data and high-speed data streams, are presented. Finally, we conclude with coverage of existing packages in different languages and with a brief discussion of the future direction in this area.
AB - As data volume grows extensively, data profiling helps to extract metadata of large-scale data. However, one kind of metadata, order statistics, is difficult to be computed because they are not mergeable or incremental. Thus, the limitation of time and memory space does not support their computation on large-scale data. In this paper, we focus on an order statistic, quantiles, and present a comprehensive analysis of studies on approximate quantile computation. Both deterministic algorithms and randomized algorithms that compute approximate quantiles over streaming models or distributed models are covered. Then, multiple techniques for improving the efficiency and performance of approximate quantile algorithms in various scenarios, such as skewed data and high-speed data streams, are presented. Finally, we conclude with coverage of existing packages in different languages and with a brief discussion of the future direction in this area.
KW - Data profiling
KW - approximate quantile
KW - distributed model
KW - order statistics
KW - streaming model
UR - http://www.scopus.com/inward/record.url?scp=85080900661&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2020.2974919
DO - 10.1109/ACCESS.2020.2974919
M3 - Article
AN - SCOPUS:85080900661
SN - 2169-3536
VL - 8
SP - 34585
EP - 34597
JO - IEEE Access
JF - IEEE Access
M1 - 9001104
ER -