Sequential data cleaning: A statistical approach

Aoqian Zhang; Shaoxu Song; Jianmin Wang

doi:10.1145/2882903.2915233

Tsinghua University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

54 Citations (Scopus)

Abstract

Errors are prevalent in data sequences, such as GPS trajectories or sensor readings. Existing methods on cleaning sequential data employ a constraint on value changing speeds and perform constraint-based repairing. While such speed constraints are effective in identifying large spike errors, the small errors that do not significantly deviate from the truth and indeed satisfy the speed constraints can hardly be identified and repaired. To handle such small errors, in this paper, we propose a statistical based cleaning method. Rather than declaring a broad constraint of max/min speeds, we model the probability distribution of speed changes. The repairing problem is thus to maximize the likelihood of the sequence w.r.t. the probability of speed changes. We formalize the likelihood-based cleaning problem, show its NP-hardness, devise exact algorithms, and propose several approximate/heuristic methods to trade off effectiveness for efficiency. Experiments on real data sets (in various applications) demonstrate the superiority of our proposal.
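The contrast drawn in the abstract between a max/min speed constraint and a likelihood over speed changes can be illustrated with a minimal Python sketch. This is not the authors' algorithm. The function names, the toy sequences, the speed bounds, and the probability table `ASSUMED_PROB` are all hypothetical, and taking a speed change to be the difference between consecutive speeds is an assumed reading of the abstract.

```python
import math

def speeds(times, values):
    """Speed between each pair of consecutive points."""
    return [(values[i + 1] - values[i]) / (times[i + 1] - times[i])
            for i in range(len(values) - 1)]

def speed_changes(times, values):
    """Difference between consecutive speeds (assumed definition)."""
    v = speeds(times, values)
    return [v[i + 1] - v[i] for i in range(len(v) - 1)]

def satisfies_speed_constraint(times, values, s_min, s_max):
    """True if every consecutive speed lies within [s_min, s_max]."""
    return all(s_min <= s <= s_max for s in speeds(times, values))

def log_likelihood(times, values, prob, floor=1e-6):
    """Sum of log-probabilities of the (rounded) speed changes.

    Speed changes missing from `prob` get a small floor probability.
    """
    return sum(math.log(prob.get(round(u), floor))
               for u in speed_changes(times, values))

# Hypothetical distribution of speed changes; in practice it would be
# estimated from data. Values are illustrative only.
ASSUMED_PROB = {0: 0.6, 1: 0.15, -1: 0.15, 2: 0.05, -2: 0.05}

times = [1, 2, 3, 4, 5]
clean = [10, 11, 12, 13, 14]
dirty = [10, 11, 14, 13, 14]   # small error at the third point

# The error stays within the assumed speed bounds of [-3, 3] ...
within_bounds = satisfies_speed_constraint(times, dirty, -3, 3)

# ... but the sequence is less likely under the speed-change model.
ll_clean = log_likelihood(times, clean, ASSUMED_PROB)
ll_dirty = log_likelihood(times, dirty, ASSUMED_PROB)
```

In this toy case the erroneous sequence passes the speed check while scoring a lower log-likelihood than the clean one, which is the kind of small error the abstract says speed constraints miss. The repairing step itself, choosing new values that maximize the likelihood, is not sketched here.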

Original language	English
Title of host publication	SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data
Publisher	Association for Computing Machinery
Pages	909-924
Number of pages	16
ISBN (Electronic)	9781450335317
DOIs	https://doi.org/10.1145/2882903.2915233
Publication status	Published - 26 Jun 2016
Externally published	Yes
Event	2016 ACM SIGMOD International Conference on Management of Data, SIGMOD 2016 - San Francisco, United States Duration: 26 Jun 2016 → 1 Jul 2016

Publication series

Name	Proceedings of the ACM SIGMOD International Conference on Management of Data
Volume	26-June-2016
ISSN (Print)	0730-8078

Conference

Conference	2016 ACM SIGMOD International Conference on Management of Data, SIGMOD 2016
Country/Territory	United States
City	San Francisco
Period	26/06/16 → 1/07/16

Access to Document

10.1145/2882903.2915233

Cite this

@inproceedings{a71df7ebdfb4452f8a0b16d318a0de93,

title = "Sequential data cleaning: A statistical approach",

abstract = "Errors are prevalent in data sequences, such as GPS trajectories or sensor readings. Existing methods on cleaning sequential data employ a constraint on value changing speeds and perform constraint-based repairing. While such speed constraints are effective in identifying large spike errors, the small errors that do not significantly deviate from the truth and indeed satisfy the speed constraints can hardly be identified and repaired. To handle such small errors, in this paper, we propose a statistical based cleaning method. Rather than declaring a broad constraint of max/min speeds, we model the probability distribution of speed changes. The repairing problem is thus to maximize the likelihood of the sequence w.r.t. the probability of speed changes. We formalize the likelihood-based cleaning problem, show its NP-hardness, devise exact algorithms, and propose several approximate/heuristic methods to trade off effectiveness for efficiency. Experiments on real data sets (in various applications) demonstrate the superiority of our proposal.",

author = "Aoqian Zhang and Shaoxu Song and Jianmin Wang",

note = "Publisher Copyright: {\textcopyright} 2016 ACM.; 2016 ACM SIGMOD International Conference on Management of Data, SIGMOD 2016 ; Conference date: 26-06-2016 Through 01-07-2016",

year = "2016",

month = jun,

day = "26",

doi = "10.1145/2882903.2915233",

language = "English",

series = "Proceedings of the ACM SIGMOD International Conference on Management of Data",

publisher = "Association for Computing Machinery",

pages = "909--924",

booktitle = "SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data",

}

Zhang, A, Song, S & Wang, J 2016, Sequential data cleaning: A statistical approach. in SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data. Proceedings of the ACM SIGMOD International Conference on Management of Data, vol. 26-June-2016, Association for Computing Machinery, pp. 909-924, 2016 ACM SIGMOD International Conference on Management of Data, SIGMOD 2016, San Francisco, United States, 26/06/16. https://doi.org/10.1145/2882903.2915233

Sequential data cleaning: A statistical approach. / Zhang, Aoqian; Song, Shaoxu; Wang, Jianmin.
SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data. Association for Computing Machinery, 2016. p. 909-924 (Proceedings of the ACM SIGMOD International Conference on Management of Data; Vol. 26-June-2016).

TY - GEN

T1 - Sequential data cleaning: A statistical approach

T2 - 2016 ACM SIGMOD International Conference on Management of Data, SIGMOD 2016

AU - Zhang, Aoqian

AU - Song, Shaoxu

AU - Wang, Jianmin

PY - 2016/6/26

Y1 - 2016/6/26

N2 - Errors are prevalent in data sequences, such as GPS trajectories or sensor readings. Existing methods on cleaning sequential data employ a constraint on value changing speeds and perform constraint-based repairing. While such speed constraints are effective in identifying large spike errors, the small errors that do not significantly deviate from the truth and indeed satisfy the speed constraints can hardly be identified and repaired. To handle such small errors, in this paper, we propose a statistical based cleaning method. Rather than declaring a broad constraint of max/min speeds, we model the probability distribution of speed changes. The repairing problem is thus to maximize the likelihood of the sequence w.r.t. the probability of speed changes. We formalize the likelihood-based cleaning problem, show its NP-hardness, devise exact algorithms, and propose several approximate/heuristic methods to trade off effectiveness for efficiency. Experiments on real data sets (in various applications) demonstrate the superiority of our proposal.

AB - Errors are prevalent in data sequences, such as GPS trajectories or sensor readings. Existing methods on cleaning sequential data employ a constraint on value changing speeds and perform constraint-based repairing. While such speed constraints are effective in identifying large spike errors, the small errors that do not significantly deviate from the truth and indeed satisfy the speed constraints can hardly be identified and repaired. To handle such small errors, in this paper, we propose a statistical based cleaning method. Rather than declaring a broad constraint of max/min speeds, we model the probability distribution of speed changes. The repairing problem is thus to maximize the likelihood of the sequence w.r.t. the probability of speed changes. We formalize the likelihood-based cleaning problem, show its NP-hardness, devise exact algorithms, and propose several approximate/heuristic methods to trade off effectiveness for efficiency. Experiments on real data sets (in various applications) demonstrate the superiority of our proposal.

UR - http://www.scopus.com/inward/record.url?scp=84979684895&partnerID=8YFLogxK

U2 - 10.1145/2882903.2915233

DO - 10.1145/2882903.2915233

M3 - Conference contribution

AN - SCOPUS:84979684895

T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data

SP - 909

EP - 924

BT - SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data

PB - Association for Computing Machinery

Y2 - 26 June 2016 through 1 July 2016

ER -
