A Survey of Approximate Quantile Computation on Large-Scale Data

Zhiwei Chen; Aoqian Zhang

doi:10.1109/ACCESS.2020.2974919

Aoqian Zhang is the corresponding author for this work.

Research output: Contribution to journal › Article › peer-review

13 Citations (Scopus)

Abstract

As data volume grows extensively, data profiling helps to extract metadata of large-scale data. However, one kind of metadata, order statistics, is difficult to compute because order statistics are neither mergeable nor incremental. As a result, limits on time and memory space make their computation impractical on large-scale data. In this paper, we focus on one order statistic, quantiles, and present a comprehensive analysis of studies on approximate quantile computation. Both deterministic and randomized algorithms that compute approximate quantiles over streaming models or distributed models are covered. Then, multiple techniques for improving the efficiency and performance of approximate quantile algorithms in various scenarios, such as skewed data and high-speed data streams, are presented. Finally, we conclude with coverage of existing packages in different languages and a brief discussion of future directions in this area.

Original language	English
Article number	9001104
Pages (from-to)	34585-34597
Number of pages	13
Journal	IEEE Access
Volume	8
DOIs	https://doi.org/10.1109/ACCESS.2020.2974919
Publication status	Published - 2020
Externally published	Yes

Keywords

Data profiling
approximate quantile
distributed model
order statistics
streaming model

Cite this

@article{60a8098b10534234bea998a603bf36cc,
title = "A Survey of Approximate Quantile Computation on Large-Scale Data",
abstract = "As data volume grows extensively, data profiling helps to extract metadata of large-scale data. However, one kind of metadata, order statistics, is difficult to be computed because they are not mergeable or incremental. Thus, the limitation of time and memory space does not support their computation on large-scale data. In this paper, we focus on an order statistic, quantiles, and present a comprehensive analysis of studies on approximate quantile computation. Both deterministic algorithms and randomized algorithms that compute approximate quantiles over streaming models or distributed models are covered. Then, multiple techniques for improving the efficiency and performance of approximate quantile algorithms in various scenarios, such as skewed data and high-speed data streams, are presented. Finally, we conclude with coverage of existing packages in different languages and with a brief discussion of the future direction in this area.",
keywords = "Data profiling, approximate quantile, distributed model, order statistics, streaming model",
author = "Zhiwei Chen and Aoqian Zhang",
note = "Publisher Copyright: {\textcopyright} 2013 IEEE.",
year = "2020",
doi = "10.1109/ACCESS.2020.2974919",
language = "English",
volume = "8",
pages = "34585--34597",
journal = "IEEE Access",
issn = "2169-3536",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
}

TY  - JOUR
T1  - A Survey of Approximate Quantile Computation on Large-Scale Data
AU  - Chen, Zhiwei
AU  - Zhang, Aoqian
PY  - 2020
Y1  - 2020
N2  - As data volume grows extensively, data profiling helps to extract metadata of large-scale data. However, one kind of metadata, order statistics, is difficult to be computed because they are not mergeable or incremental. Thus, the limitation of time and memory space does not support their computation on large-scale data. In this paper, we focus on an order statistic, quantiles, and present a comprehensive analysis of studies on approximate quantile computation. Both deterministic algorithms and randomized algorithms that compute approximate quantiles over streaming models or distributed models are covered. Then, multiple techniques for improving the efficiency and performance of approximate quantile algorithms in various scenarios, such as skewed data and high-speed data streams, are presented. Finally, we conclude with coverage of existing packages in different languages and with a brief discussion of the future direction in this area.
AB  - As data volume grows extensively, data profiling helps to extract metadata of large-scale data. However, one kind of metadata, order statistics, is difficult to be computed because they are not mergeable or incremental. Thus, the limitation of time and memory space does not support their computation on large-scale data. In this paper, we focus on an order statistic, quantiles, and present a comprehensive analysis of studies on approximate quantile computation. Both deterministic algorithms and randomized algorithms that compute approximate quantiles over streaming models or distributed models are covered. Then, multiple techniques for improving the efficiency and performance of approximate quantile algorithms in various scenarios, such as skewed data and high-speed data streams, are presented. Finally, we conclude with coverage of existing packages in different languages and with a brief discussion of the future direction in this area.
KW  - Data profiling
KW  - approximate quantile
KW  - distributed model
KW  - order statistics
KW  - streaming model
UR  - http://www.scopus.com/inward/record.url?scp=85080900661&partnerID=8YFLogxK
U2  - 10.1109/ACCESS.2020.2974919
DO  - 10.1109/ACCESS.2020.2974919
M3  - Article
AN  - SCOPUS:85080900661
SN  - 2169-3536
VL  - 8
SP  - 34585
EP  - 34597
JO  - IEEE Access
JF  - IEEE Access
M1  - 9001104
ER  -
