TY - JOUR
T1 - Finding needles in a hay stream
T2 - On persistent item lookup in data streams
AU - Chen, Lin
AU - Dai, Haipeng
AU - Meng, Lei
AU - Yu, Jihong
N1 - Publisher Copyright:
© 2020
PY - 2020/11/9
Y1 - 2020/11/9
N2 - In a data stream composed of an ordered sequence of data items, persistent items refer to those persisting to occur over a long timespan. Compared with ordinary items, persistent ones, though not necessarily occurring more frequently, typically convey more valuable information. Persistent item lookup, the functionality to identify all persistent items, emerges as a pivotal building block in many computing and network systems. In this paper, we devise a generic persistent item lookup algorithm supporting high-speed, high-accuracy lookup with limited memory cost. The key technicalities we propose in our design are two-fold. First, our algorithm attempts to record only persistent items seen so far based on the currently available information about the stream, thus significantly reducing memory overhead, especially for real-life highly skewed data streams. Second, our algorithm balances the recording load in both time and space domains: in the time domain, we partition persistent items into approximately equal-size subsets and record only one subset in each epoch; in the space domain, we apply the state-of-the-art load balancing technique to evenly distribute recorded items across the on-die memory. By holistically integrating these components, we iron out a persistent item lookup algorithm outperforming existing solutions in a wide range of practical settings.
AB - In a data stream composed of an ordered sequence of data items, persistent items refer to those persisting to occur over a long timespan. Compared with ordinary items, persistent ones, though not necessarily occurring more frequently, typically convey more valuable information. Persistent item lookup, the functionality to identify all persistent items, emerges as a pivotal building block in many computing and network systems. In this paper, we devise a generic persistent item lookup algorithm supporting high-speed, high-accuracy lookup with limited memory cost. The key technicalities we propose in our design are two-fold. First, our algorithm attempts to record only persistent items seen so far based on the currently available information about the stream, thus significantly reducing memory overhead, especially for real-life highly skewed data streams. Second, our algorithm balances the recording load in both time and space domains: in the time domain, we partition persistent items into approximately equal-size subsets and record only one subset in each epoch; in the space domain, we apply the state-of-the-art load balancing technique to evenly distribute recorded items across the on-die memory. By holistically integrating these components, we iron out a persistent item lookup algorithm outperforming existing solutions in a wide range of practical settings.
KW - Data stream mining
KW - Persistent item lookup
UR - http://www.scopus.com/inward/record.url?scp=85090299712&partnerID=8YFLogxK
U2 - 10.1016/j.comnet.2020.107518
DO - 10.1016/j.comnet.2020.107518
M3 - Article
AN - SCOPUS:85090299712
SN - 1389-1286
VL - 181
JO - Computer Networks
JF - Computer Networks
M1 - 107518
ER -