LogLog filter: Filtering cold items within a large range over high speed data streams

Peng Jia; Pinghui Wang; Junzhou Zhao; Ye Yuan; Jing Tao; Xiaohong Guan

doi:10.1109/ICDE51399.2021.00075

Peng Jia, Pinghui Wang^*, Junzhou Zhao, Ye Yuan, Jing Tao, Xiaohong Guan

^* Corresponding author for this work

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

14 Citations (Scopus)

Abstract

Many real-world datasets are given in the format of data streams, and processing these data streams is fundamental for many applications such as anomaly detection. In this paper, we study the problem of computing item frequencies, finding top-k hot items, and detecting heavy changes. However, the widely-used sketches cost large memory usage and their performance is easily affected by the unbalanced distribution of data streams. To solve this issue, a novel method Cold Filter (CF) is proposed to split cold items and hot items, and use a separate structure to record the frequencies of hot items. Typically, CF has a small filter range and is only effective for filtering cold items with small frequencies. For some real-world applications, however, the cold items' frequencies may also be greater than hundreds or even tens of thousands. To solve the above challenges, we exploit the "LogLog" structure and develop a memory-efficient method LogLog Filter (LLF) to accurately estimate the above three metrics. LLF builds a register array where each register approximately counts the sum of item frequencies hashed into it. Our method remarkably enlarges the filter range of CF with fewer bits and only requires 4 bits to filter cold items with frequencies up to 2^24. We conduct extensive experiments on real-world and synthetic datasets, and the experimental results demonstrate the efficiency and effectiveness of our method.
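
The abstract only states that each register "approximately counts the sum of item frequencies hashed into it". The sketch below illustrates that general idea with Morris-style probabilistic counters stored in small registers. It is not the paper's algorithm: the class name `ApproxRegisterArray`, its parameters (`num_registers`, `bits`, `base`, `seed`), the single hash function, and the `is_cold` helper are all assumptions made for illustration.

```python
import hashlib
import random


class ApproxRegisterArray:
    """Hypothetical array of small registers holding Morris-style counts."""

    def __init__(self, num_registers=1024, bits=4, base=2.0, seed=0):
        self.max_value = (1 << bits) - 1      # largest value a register can hold
        self.base = base                      # growth base of the approximate counter
        self.registers = [0] * num_registers
        self.rng = random.Random(seed)        # seeded for reproducible runs

    def _index(self, item):
        # Stable hash of the item, mapped to a register position.
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big") % len(self.registers)

    def add(self, item):
        # Increment the register with probability base^(-r), so the register
        # value grows roughly with the logarithm of the summed frequency.
        i = self._index(item)
        r = self.registers[i]
        if r < self.max_value and self.rng.random() < self.base ** (-r):
            self.registers[i] = r + 1

    def estimate(self, item):
        # Standard Morris estimate of the count behind register value r.
        r = self.registers[self._index(item)]
        return (self.base ** r - 1) / (self.base - 1)

    def is_cold(self, item, threshold):
        # Treat an item as cold while its register's estimate is below threshold.
        return self.estimate(item) < threshold


if __name__ == "__main__":
    llf_like = ApproxRegisterArray(num_registers=64)
    for _ in range(5000):
        llf_like.add("hot-flow")
    llf_like.add("rare-flow")
    print(llf_like.estimate("hot-flow"), llf_like.is_cold("hot-flow", 100))
```

With `bits=4` and `base=2.0`, the largest estimate this sketch can represent is 2^15 - 1 = 32767, which is below the 2^24 range the abstract reports, so the paper's actual structure necessarily differs from this simplification. Items that hash to the same register share one estimate, which matches the abstract's description of a register counting the sum of the frequencies hashed into it.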

Original language	English
Title of host publication	Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021
Publisher	IEEE Computer Society
Pages	804-815
Number of pages	12
ISBN (Electronic)	9781728191843
DOI	https://doi.org/10.1109/ICDE51399.2021.00075
Publication status	Published - Apr 2021
Event	37th IEEE International Conference on Data Engineering, ICDE 2021 - Virtual, Chania, Greece. Duration: 19 Apr 2021 → 22 Apr 2021

Publication series

Name	Proceedings - International Conference on Data Engineering
Volume	2021-April
ISSN (Print)	1084-4627

Conference

Conference	37th IEEE International Conference on Data Engineering, ICDE 2021
Country/Territory	Greece
City	Virtual, Chania
Period	19/04/21 → 22/04/21

Access to document

10.1109/ICDE51399.2021.00075

Other files and links

Link to publication in Scopus

Cite this

Jia, P., Wang, P., Zhao, J., Yuan, Y., Tao, J., & Guan, X. (2021). LogLog filter: Filtering cold items within a large range over high speed data streams. In Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021 (pp. 804-815). Article 9458928 (Proceedings - International Conference on Data Engineering; Vol. 2021-April). IEEE Computer Society. https://doi.org/10.1109/ICDE51399.2021.00075

@inproceedings{21da570f2b20495cb800a2340547c35d,

title = "LogLog filter: Filtering cold items within a large range over high speed data streams",

abstract = "Many real-world datasets are given in the format of data streams, and processing these data streams is fundamental for many applications such as anomaly detection. In this paper, we study the problem of computing item frequencies, finding top-k hot items, and detecting heavy changes. However, the widely-used sketches cost large memory usage and their performance is easily affected by the unbalanced distribution of data streams. To solve this issue, a novel method Cold Filter (CF) is proposed to split cold items and hot items, and use a separate structure to record the frequencies of hot items. Typically, CF has a small filter range and is only effective for filtering cold items with small frequencies. For some real-world applications, however, the cold items' frequencies may also be greater than hundreds or even tens of thousands. To solve the above challenges, we exploit the {"}LogLog{"} structure and develop a memory-efficient method LogLog Filter (LLF) to accurately estimate the above three metrics. LLF builds a register array where each register approximately counts the sum of item frequencies hashed into it. Our method remarkably enlarges the filter range of CF with fewer bits and only requires 4 bits to filter cold items with frequencies up to $2^{24}$. We conduct extensive experiments on real-world and synthetic datasets, and the experimental results demonstrate the efficiency and effectiveness of our method.",

author = "Peng Jia and Pinghui Wang and Junzhou Zhao and Ye Yuan and Jing Tao and Xiaohong Guan",

note = "Publisher Copyright: {\textcopyright} 2021 IEEE.; 37th IEEE International Conference on Data Engineering, ICDE 2021 ; Conference date: 19-04-2021 Through 22-04-2021",

year = "2021",

month = apr,

doi = "10.1109/ICDE51399.2021.00075",

language = "English",

series = "Proceedings - International Conference on Data Engineering",

publisher = "IEEE Computer Society",

pages = "804--815",

booktitle = "Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021",

address = "United States",

}

Jia, P, Wang, P, Zhao, J, Yuan, Y, Tao, J & Guan, X 2021, LogLog filter: Filtering cold items within a large range over high speed data streams. in Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021., 9458928, Proceedings - International Conference on Data Engineering, vol. 2021-April, IEEE Computer Society, pp. 804-815, 37th IEEE International Conference on Data Engineering, ICDE 2021, Virtual, Chania, Greece, 19/04/21. https://doi.org/10.1109/ICDE51399.2021.00075

LogLog filter: Filtering cold items within a large range over high speed data streams. / Jia, Peng; Wang, Pinghui; Zhao, Junzhou et al.
Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021. IEEE Computer Society, 2021. p. 804-815 9458928 (Proceedings - International Conference on Data Engineering; Vol. 2021-April).

TY - GEN

T1 - LogLog filter

T2 - 37th IEEE International Conference on Data Engineering, ICDE 2021

AU - Jia, Peng

AU - Wang, Pinghui

AU - Zhao, Junzhou

AU - Yuan, Ye

AU - Tao, Jing

AU - Guan, Xiaohong

PY - 2021/4

Y1 - 2021/4

N2 - Many real-world datasets are given in the format of data streams, and processing these data streams is fundamental for many applications such as anomaly detection. In this paper, we study the problem of computing item frequencies, finding top-k hot items, and detecting heavy changes. However, the widely-used sketches cost large memory usage and their performance is easily affected by the unbalanced distribution of data streams. To solve this issue, a novel method Cold Filter (CF) is proposed to split cold items and hot items, and use a separate structure to record the frequencies of hot items. Typically, CF has a small filter range and is only effective for filtering cold items with small frequencies. For some real-world applications, however, the cold items' frequencies may also be greater than hundreds or even tens of thousands. To solve the above challenges, we exploit the "LogLog" structure and develop a memory-efficient method LogLog Filter (LLF) to accurately estimate the above three metrics. LLF builds a register array where each register approximately counts the sum of item frequencies hashed into it. Our method remarkably enlarges the filter range of CF with fewer bits and only requires 4 bits to filter cold items with frequencies up to 2^24. We conduct extensive experiments on real-world and synthetic datasets, and the experimental results demonstrate the efficiency and effectiveness of our method.

UR - http://www.scopus.com/inward/record.url?scp=85112866842&partnerID=8YFLogxK

U2 - 10.1109/ICDE51399.2021.00075

DO - 10.1109/ICDE51399.2021.00075

M3 - Conference contribution

AN - SCOPUS:85112866842

T3 - Proceedings - International Conference on Data Engineering

SP - 804

EP - 815

BT - Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021

PB - IEEE Computer Society

Y2 - 19 April 2021 through 22 April 2021

ER -

Jia P, Wang P, Zhao J, Yuan Y, Tao J, Guan X. LogLog filter: Filtering cold items within a large range over high speed data streams. In Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021. IEEE Computer Society. 2021. p. 804-815. 9458928. (Proceedings - International Conference on Data Engineering). doi: 10.1109/ICDE51399.2021.00075
