Enriching Data Imputation under Similarity Rule Constraints

Shaoxu Song*, Yu Sun, Aoqian Zhang, Lei Chen, Jianmin Wang

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-reviewed

39 Citations (Scopus)

Abstract

Incomplete information often occurs in many database applications, e.g., in data integration, data cleaning, or data exchange. The idea of data imputation is often to fill the missing data with the values of its neighbors that share the same or similar information. Such neighbors could either be identified with certainty by editing rules or more extensively by similarity relationships. Owing to data sparsity, the number of neighbors identified by editing rules w.r.t. value equality is rather limited, especially in the presence of data values with variances. To enrich the imputation candidates, a natural idea is to extensively consider the neighbors with similarity relationships. However, the candidates suggested by these (heterogeneous) similarity neighbors may conflict with each other. In this paper, we propose to utilize similarity rules with tolerance to small variations (instead of the aforesaid editing rules with strict equality constraints) to rule out the invalid candidates provided by similarity neighbors. To enrich the data imputation, i.e., to impute more of the missing values, we study the problem of maximizing the missing data imputation. Our major contributions include (1) the NP-hardness analysis on solving as well as approximating the problem, (2) exact algorithms for tackling the problem, and (3) efficient approximation with performance guarantees. Experiments on real and synthetic data sets demonstrate the superiority of our proposal in filling accuracy. We also demonstrate that the record matching application is indeed improved after applying the proposed imputation.
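
The core idea in the abstract, collecting candidates from similarity neighbors and then discarding those that a similarity rule forbids, can be illustrated with a toy sketch. It is not the paper's algorithm: the attribute names `x` and `y`, the absolute-difference distance, the tolerances, and the function `valid_candidates` are all assumptions made for illustration, and the sketch fills a single cell rather than maximizing imputation over many cells.

```python
def similar(a, b, tol):
    """Two numeric values are 'similar' if they differ by at most tol (assumed metric)."""
    return abs(a - b) <= tol


def valid_candidates(rows, target, neighbor_tol, rule_x_tol, rule_y_tol):
    """Return candidate fillings for rows[target]['y'] that survive the rule.

    Hypothetical similarity rule: if two rows are within rule_x_tol on 'x',
    they must be within rule_y_tol on 'y'.
    """
    x0 = rows[target]["x"]
    others = [r for k, r in enumerate(rows) if k != target and r["y"] is not None]

    # Step 1: similarity neighbors (looser tolerance) suggest candidate values.
    candidates = sorted({r["y"] for r in others if similar(r["x"], x0, neighbor_tol)})

    # Step 2: a candidate is kept only if it violates the rule against no row
    # that is close to the target on 'x'.
    def consistent(c):
        return all(
            similar(c, r["y"], rule_y_tol)
            for r in others
            if similar(r["x"], x0, rule_x_tol)
        )

    return [c for c in candidates if consistent(c)]


rows = [
    {"x": 10.0, "y": None},  # the cell to impute
    {"x": 10.5, "y": 100},
    {"x": 9.8, "y": 101},
    {"x": 14.0, "y": 150},   # a looser neighbor whose suggestion conflicts
]
result = valid_candidates(rows, 0, neighbor_tol=5.0, rule_x_tol=1.0, rule_y_tol=2)
print(result)
```

In this toy data the three neighbors suggest conflicting values, and the rule removes the suggestion from the distant neighbor while the two close ones remain.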

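Original language: English
Article number: 8543665
Pages (from-to): 275-287
Number of pages: 13
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 32
Issue number: 2
DOI
Publication status: Published - 1 Feb 2020
Externally published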
