TY - GEN
T1 - Sequential data cleaning
T2 - 2016 ACM SIGMOD International Conference on Management of Data, SIGMOD 2016
AU - Zhang, Aoqian
AU - Song, Shaoxu
AU - Wang, Jianmin
N1 - Publisher Copyright: © 2016 ACM.
PY - 2016/6/26
Y1 - 2016/6/26
AB - Errors are prevalent in data sequences, such as GPS trajectories or sensor readings. Existing methods on cleaning sequential data employ a constraint on value changing speeds and perform constraint-based repairing. While such speed constraints are effective in identifying large spike errors, the small errors that do not significantly deviate from the truth and indeed satisfy the speed constraints can hardly be identified and repaired. To handle such small errors, in this paper, we propose a statistics-based cleaning method. Rather than declaring a broad constraint of max/min speeds, we model the probability distribution of speed changes. The repairing problem is thus to maximize the likelihood of the sequence w.r.t. the probability of speed changes. We formalize the likelihood-based cleaning problem, show its NP-hardness, devise exact algorithms, and propose several approximate/heuristic methods to trade off effectiveness for efficiency. Experiments on real data sets (in various applications) demonstrate the superiority of our proposal.
UR - http://www.scopus.com/inward/record.url?scp=84979684895&partnerID=8YFLogxK
U2 - 10.1145/2882903.2915233
DO - 10.1145/2882903.2915233
M3 - Conference contribution
AN - SCOPUS:84979684895
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 909
EP - 924
BT - SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data
PB - Association for Computing Machinery
Y2 - 26 June 2016 through 1 July 2016
ER -