TY - GEN
T1 - LogLog filter
T2 - 37th IEEE International Conference on Data Engineering, ICDE 2021
AU - Jia, Peng
AU - Wang, Pinghui
AU - Zhao, Junzhou
AU - Yuan, Ye
AU - Tao, Jing
AU - Guan, Xiaohong
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021/4
Y1 - 2021/4
N2 - Many real-world datasets are given in the format of data streams, and processing these data streams is fundamental for many applications such as anomaly detection. In this paper, we study the problem of computing item frequencies, finding top-k hot items, and detecting heavy changes. However, the widely-used sketches cost large memory usage and their performance is easily affected by the unbalanced distribution of data streams. To solve this issue, a novel method Cold Filter (CF) is proposed to split cold items and hot items, and use a separate structure to record the frequencies of hot items. Typically, CF has a small filter range and is only effective for filtering cold items with small frequencies. For some real-world applications, however, the cold items' frequencies may also be greater than hundreds or even tens of thousands. To solve the above challenges, we exploit the "LogLog"structure and develop a memory-efficient method LogLog Filter (LLF) to accurately estimate the above three metrics. LLF builds a register array where each register approximately counts the sum of item frequencies hashed into it. Our method remarkably enlarges the filter range of CF with fewer bits and only requires 4 bits to filter cold items with frequencies up to {2{{24}}}. We conduct extensive experiments on real-world and synthetic datasets, and the experimental results demonstrate the efficiency and effectiveness of our method.
AB - Many real-world datasets are given in the format of data streams, and processing these data streams is fundamental for many applications such as anomaly detection. In this paper, we study the problem of computing item frequencies, finding top-k hot items, and detecting heavy changes. However, the widely-used sketches cost large memory usage and their performance is easily affected by the unbalanced distribution of data streams. To solve this issue, a novel method Cold Filter (CF) is proposed to split cold items and hot items, and use a separate structure to record the frequencies of hot items. Typically, CF has a small filter range and is only effective for filtering cold items with small frequencies. For some real-world applications, however, the cold items' frequencies may also be greater than hundreds or even tens of thousands. To solve the above challenges, we exploit the "LogLog"structure and develop a memory-efficient method LogLog Filter (LLF) to accurately estimate the above three metrics. LLF builds a register array where each register approximately counts the sum of item frequencies hashed into it. Our method remarkably enlarges the filter range of CF with fewer bits and only requires 4 bits to filter cold items with frequencies up to {2{{24}}}. We conduct extensive experiments on real-world and synthetic datasets, and the experimental results demonstrate the efficiency and effectiveness of our method.
UR - http://www.scopus.com/inward/record.url?scp=85112866842&partnerID=8YFLogxK
U2 - 10.1109/ICDE51399.2021.00075
DO - 10.1109/ICDE51399.2021.00075
M3 - Conference contribution
AN - SCOPUS:85112866842
T3 - Proceedings - International Conference on Data Engineering
SP - 804
EP - 815
BT - Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021
PB - IEEE Computer Society
Y2 - 19 April 2021 through 22 April 2021
ER -