Finding needles in a hay stream: On persistent item lookup in data streams

Lin Chen; Haipeng Dai; Lei Meng; Jihong Yu

doi:10.1016/j.comnet.2020.107518

Lin Chen^*, Haipeng Dai, Lei Meng, Jihong Yu

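^*Corresponding author for this work

School of Information and Electronics

Research output: Contribution to journal › Article › peer-review

5 Citations (Scopus)

Abstract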

In a data stream composed of an ordered sequence of data items, persistent items refer to those persisting to occur over a long timespan. Compared with ordinary items, persistent ones, though not necessarily occurring more frequently, typically convey more valuable information. Persistent item lookup, the functionality to identify all persistent items, emerges as a pivotal building block in many computing and network systems. In this paper, we devise a generic persistent item lookup algorithm supporting high-speed, high-accuracy lookup with limited memory cost. The key technicalities we propose in our design are two-fold. First, our algorithm attempts to record only persistent items seen so far based on the currently available information about the stream, thus significantly reducing memory overhead, especially for real-life highly skewed data streams. Second, our algorithm balances the recording load in both time and space domains: in the time domain, we partition persistent items into approximately equal-size subsets and record only one subset in each epoch; in the space domain, we apply the state-of-the-art load balancing technique to evenly distribute recorded items across the on-die memory. By holistically integrating these components, we iron out a persistent item lookup algorithm outperforming existing solutions in a wide range of practical settings.
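The distinction the abstract draws between persistent and frequent items can be illustrated with a minimal sketch. This is not the algorithm proposed in the paper: it is a naive exact baseline, and the function name `persistent_items`, the `min_epochs` threshold, and the representation of the stream as a list of epochs are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Set


def persistent_items(epochs: Iterable[Iterable[Hashable]], min_epochs: int) -> Set[Hashable]:
    """Return the items that occur in at least `min_epochs` distinct epochs.

    Hypothetical baseline for illustration only; not the paper's method.
    """
    epochs_seen = defaultdict(int)
    for epoch in epochs:
        # set() drops repeats inside one epoch, so a burst counts only once.
        for item in set(epoch):
            epochs_seen[item] += 1
    return {item for item, count in epochs_seen.items() if count >= min_epochs}


# Toy stream split into four epochs (made-up data).
stream = [
    ["a", "b", "b", "b", "b"],
    ["a", "c"],
    ["a", "b"],
    ["a", "d"],
]
result = persistent_items(stream, min_epochs=3)
```

Here `b` occurs five times but in only two epochs, while `a` occurs four times across all four epochs, so only `a` is reported as persistent. The baseline keeps one counter per distinct item, so it does nothing to limit memory, which is the cost the abstract says the proposed design aims to reduce.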

Original language	English
Article number	107518
Journal	Computer Networks
Volume	181
DOI	https://doi.org/10.1016/j.comnet.2020.107518
Publication status	Published - 9 Nov 2020

Cite this

@article{3fb8e86e863c46a6add640b3a4976237,

title = "Finding needles in a hay stream: On persistent item lookup in data streams",

abstract = "In a data stream composed of an ordered sequence of data items, persistent items refer to those persisting to occur over a long timespan. Compared with ordinary items, persistent ones, though not necessarily occurring more frequently, typically convey more valuable information. Persistent item lookup, the functionality to identify all persistent items, emerges as a pivotal building block in many computing and network systems. In this paper, we devise a generic persistent item lookup algorithm supporting high-speed, high-accuracy lookup with limited memory cost. The key technicalities we propose in our design are two-fold. First, our algorithm attempts to record only persistent items seen so far based on the currently available information about the stream, thus significantly reducing memory overhead, especially for real-life highly skewed data streams. Second, our algorithm balances the recording load in both time and space domains: in the time domain, we partition persistent items into approximately equal-size subsets and record only one subset in each epoch; in the space domain, we apply the state-of-the-art load balancing technique to evenly distribute recorded items across the on-die memory. By holistically integrating these components, we iron out a persistent item lookup algorithm outperforming existing solutions in a wide range of practical settings.",

keywords = "Data stream mining, Persistent item lookup",

author = "Lin Chen and Haipeng Dai and Lei Meng and Jihong Yu",

note = "Publisher Copyright: {\textcopyright} 2020",

year = "2020",

month = nov,

day = "9",

doi = "10.1016/j.comnet.2020.107518",

language = "English",

volume = "181",

journal = "Computer Networks",

issn = "1389-1286",

publisher = "Elsevier B.V.",

}

TY - JOUR

T1 - Finding needles in a hay stream

T2 - On persistent item lookup in data streams

AU - Chen, Lin

AU - Dai, Haipeng

AU - Meng, Lei

AU - Yu, Jihong

PY - 2020/11/9

Y1 - 2020/11/9

AB - In a data stream composed of an ordered sequence of data items, persistent items refer to those persisting to occur over a long timespan. Compared with ordinary items, persistent ones, though not necessarily occurring more frequently, typically convey more valuable information. Persistent item lookup, the functionality to identify all persistent items, emerges as a pivotal building block in many computing and network systems. In this paper, we devise a generic persistent item lookup algorithm supporting high-speed, high-accuracy lookup with limited memory cost. The key technicalities we propose in our design are two-fold. First, our algorithm attempts to record only persistent items seen so far based on the currently available information about the stream, thus significantly reducing memory overhead, especially for real-life highly skewed data streams. Second, our algorithm balances the recording load in both time and space domains: in the time domain, we partition persistent items into approximately equal-size subsets and record only one subset in each epoch; in the space domain, we apply the state-of-the-art load balancing technique to evenly distribute recorded items across the on-die memory. By holistically integrating these components, we iron out a persistent item lookup algorithm outperforming existing solutions in a wide range of practical settings.

KW - Data stream mining

KW - Persistent item lookup

UR - http://www.scopus.com/inward/record.url?scp=85090299712&partnerID=8YFLogxK

U2 - 10.1016/j.comnet.2020.107518

DO - 10.1016/j.comnet.2020.107518

M3 - Article

AN - SCOPUS:85090299712

SN - 1389-1286

VL - 181

JO - Computer Networks

JF - Computer Networks

M1 - 107518

ER -
