TY - GEN
T1 - SCREEN
T2 - ACM SIGMOD International Conference on Management of Data, SIGMOD 2015
AU - Song, Shaoxu
AU - Zhang, Aoqian
AU - Wang, Jianmin
AU - Yu, Philip S.
N1 - Publisher Copyright:
Copyright © 2015 ACM.
PY - 2015/5/27
Y1 - 2015/5/27
N2 - Stream data are often dirty, for example, owing to unreliable sensor reading, or erroneous extraction of stock prices. Most stream data cleaning approaches employ a smoothing filter, which may seriously alter the data without preserving the original information. We argue that the cleaning should avoid changing those originally correct/clean data, a.k.a. the minimum change principle in data cleaning. To capture the knowledge about what is clean, we consider the (widely existing) constraints on the speed of data changes, such as fuel consumption per hour, or daily limit of stock prices. Guided by these semantic constraints, in this paper, we propose SCREEN, the first constraint-based approach for cleaning stream data. It is notable that existing data repair techniques clean (a sequence of) data as a whole and fail to support stream computation. To this end, we have to relax the global optimum over the entire sequence to the local optimum in a window. Rather than the commonly observed NP-hardness of general data repairing problems, our major contributions include (1) polynomial time algorithm for global optimum, (2) linear time algorithm towards local optimum under an efficient Median Principle, (3) support on out-of-order arrivals of data points, and (4) adaptive window size for balancing repair accuracy and efficiency. Experiments on real datasets demonstrate that SCREEN can show significantly higher repair accuracy than the existing approaches such as smoothing.
AB - Stream data are often dirty, for example, owing to unreliable sensor reading, or erroneous extraction of stock prices. Most stream data cleaning approaches employ a smoothing filter, which may seriously alter the data without preserving the original information. We argue that the cleaning should avoid changing those originally correct/clean data, a.k.a. the minimum change principle in data cleaning. To capture the knowledge about what is clean, we consider the (widely existing) constraints on the speed of data changes, such as fuel consumption per hour, or daily limit of stock prices. Guided by these semantic constraints, in this paper, we propose SCREEN, the first constraint-based approach for cleaning stream data. It is notable that existing data repair techniques clean (a sequence of) data as a whole and fail to support stream computation. To this end, we have to relax the global optimum over the entire sequence to the local optimum in a window. Rather than the commonly observed NP-hardness of general data repairing problems, our major contributions include (1) polynomial time algorithm for global optimum, (2) linear time algorithm towards local optimum under an efficient Median Principle, (3) support on out-of-order arrivals of data points, and (4) adaptive window size for balancing repair accuracy and efficiency. Experiments on real datasets demonstrate that SCREEN can show significantly higher repair accuracy than the existing approaches such as smoothing.
KW - Data repairing
KW - Speed constraints
UR - http://www.scopus.com/inward/record.url?scp=84957558643&partnerID=8YFLogxK
U2 - 10.1145/2723372.2723730
DO - 10.1145/2723372.2723730
M3 - Conference contribution
AN - SCOPUS:84957558643
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 827
EP - 841
BT - SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
PB - Association for Computing Machinery
Y2 - 31 May 2015 through 4 June 2015
ER -