A Survey of Approximate Quantile Computation on Large-Scale Data

Zhiwei Chen; Aoqian Zhang

doi:10.1109/ACCESS.2020.2974919

Corresponding author for this work: Aoqian Zhang

Research output: Contribution to journal › Article › Peer-reviewed

13 citations (Scopus)

Abstract

As data volume grows extensively, data profiling helps to extract metadata of large-scale data. However, one kind of metadata, order statistics, is difficult to be computed because they are not mergeable or incremental. Thus, the limitation of time and memory space does not support their computation on large-scale data. In this paper, we focus on an order statistic, quantiles, and present a comprehensive analysis of studies on approximate quantile computation. Both deterministic algorithms and randomized algorithms that compute approximate quantiles over streaming models or distributed models are covered. Then, multiple techniques for improving the efficiency and performance of approximate quantile algorithms in various scenarios, such as skewed data and high-speed data streams, are presented. Finally, we conclude with coverage of existing packages in different languages and with a brief discussion of the future direction in this area.
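One way to read the abstract's claim that order statistics are "not mergeable" is sketched below. This is only an illustration under stated assumptions, not an example from the survey: the partition values and the helper name `merge_mean` are hypothetical. A mean can be rebuilt exactly from small per-partition summaries (sum and count), whereas combining per-partition medians does not in general give the median of the full dataset.

```python
from statistics import median

# Two hypothetical partitions of one dataset (illustrative values only).
part_a = [1, 2, 3, 4, 100]
part_b = [5, 6]


def merge_mean(parts):
    """Rebuild the global mean from per-partition (sum, count) summaries."""
    summaries = [(sum(p), len(p)) for p in parts]
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count


# The mean is mergeable: the summaries are enough to recover it exactly.
global_mean = merge_mean([part_a, part_b])

# The exact median needs the full, combined data.
exact_median = median(part_a + part_b)

# Naive attempt: take the median of the per-partition medians.
naive_median = median([median(part_a), median(part_b)])

print(global_mean, exact_median, naive_median)
```

On this input the naive combination differs from the exact median, which is the kind of gap that approximate quantile summaries are designed to bound.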

Original language	English
Article number	9001104
Pages (from-to)	34585-34597
Number of pages	13
Journal	IEEE Access
Volume	8
DOI	https://doi.org/10.1109/ACCESS.2020.2974919
Publication status	Published - 2020
Externally published	Yes

Cite this

@article{60a8098b10534234bea998a603bf36cc,
title = "A Survey of Approximate Quantile Computation on Large-Scale Data",
abstract = "As data volume grows extensively, data profiling helps to extract metadata of large-scale data. However, one kind of metadata, order statistics, is difficult to be computed because they are not mergeable or incremental. Thus, the limitation of time and memory space does not support their computation on large-scale data. In this paper, we focus on an order statistic, quantiles, and present a comprehensive analysis of studies on approximate quantile computation. Both deterministic algorithms and randomized algorithms that compute approximate quantiles over streaming models or distributed models are covered. Then, multiple techniques for improving the efficiency and performance of approximate quantile algorithms in various scenarios, such as skewed data and high-speed data streams, are presented. Finally, we conclude with coverage of existing packages in different languages and with a brief discussion of the future direction in this area.",
keywords = "Data profiling, approximate quantile, distributed model, order statistics, streaming model",
author = "Zhiwei Chen and Aoqian Zhang",
note = "Publisher Copyright: {\textcopyright} 2013 IEEE.",
year = "2020",
doi = "10.1109/ACCESS.2020.2974919",
language = "English",
volume = "8",
pages = "34585--34597",
journal = "IEEE Access",
issn = "2169-3536",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
}

TY - JOUR
T1 - A Survey of Approximate Quantile Computation on Large-Scale Data
AU - Chen, Zhiwei
AU - Zhang, Aoqian
PY - 2020
Y1 - 2020
N2 - As data volume grows extensively, data profiling helps to extract metadata of large-scale data. However, one kind of metadata, order statistics, is difficult to be computed because they are not mergeable or incremental. Thus, the limitation of time and memory space does not support their computation on large-scale data. In this paper, we focus on an order statistic, quantiles, and present a comprehensive analysis of studies on approximate quantile computation. Both deterministic algorithms and randomized algorithms that compute approximate quantiles over streaming models or distributed models are covered. Then, multiple techniques for improving the efficiency and performance of approximate quantile algorithms in various scenarios, such as skewed data and high-speed data streams, are presented. Finally, we conclude with coverage of existing packages in different languages and with a brief discussion of the future direction in this area.

KW - Data profiling
KW - approximate quantile
KW - distributed model
KW - order statistics
KW - streaming model
UR - http://www.scopus.com/inward/record.url?scp=85080900661&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2020.2974919
DO - 10.1109/ACCESS.2020.2974919
M3 - Article
AN - SCOPUS:85080900661
SN - 2169-3536
VL - 8
SP - 34585
EP - 34597
JO - IEEE Access
JF - IEEE Access
M1 - 9001104
ER -
