Enriching Data Imputation under Similarity Rule Constraints

Shaoxu Song*, Yu Sun, Aoqian Zhang, Lei Chen, Jianmin Wang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

39 Citations (Scopus)

Abstract

Incomplete information arises in many database applications, e.g., data integration, data cleaning, and data exchange. Data imputation typically fills a missing value with the values of neighbors that share the same or similar information. Such neighbors can be identified either with certainty by editing rules or more broadly by similarity relationships. Owing to data sparsity, the number of neighbors identified by editing rules w.r.t. value equality is rather limited, especially when data values exhibit variance. To enrich the imputation candidates, a natural idea is to also consider neighbors with similarity relationships. However, the candidates suggested by these (heterogeneous) similarity neighbors may conflict with each other. In this paper, we propose to use similarity rules that tolerate small variations (instead of the aforesaid editing rules with strict equality constraints) to rule out the invalid candidates provided by similarity neighbors. To enrich the data imputation, i.e., to impute more of the missing values, we study the problem of maximizing the missing data imputation. Our major contributions include (1) an NP-hardness analysis of solving as well as approximating the problem, (2) exact algorithms for tackling the problem, and (3) efficient approximation with performance guarantees. Experiments on real and synthetic data sets demonstrate the superiority of our proposal in filling accuracy. We also demonstrate that a record matching application is indeed improved after applying the proposed imputation.
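
The filtering idea in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: the rule encoding (a dict of left-hand-side attribute thresholds plus one right-hand-side attribute threshold), the absolute-difference distance, and the names `violates` and `valid_candidates` are all assumptions made for illustration. Candidates for a missing cell are collected from similarity neighbors, and any candidate that would break a similarity rule against a complete tuple is ruled out.

```python
from typing import Optional

Row = dict[str, Optional[float]]
# Hypothetical rule encoding: ({lhs_attr: max_distance}, (rhs_attr, max_distance)).
Rule = tuple[dict[str, float], tuple[str, float]]


def violates(rule: Rule, a: Row, b: Row) -> bool:
    """True if a and b are close on every left-hand attribute but too far apart on the right-hand one."""
    lhs, (y, dy) = rule
    attrs = list(lhs) + [y]
    if any(a[x] is None or b[x] is None for x in attrs):
        return False  # a rule cannot be checked on missing values
    close = all(abs(a[x] - b[x]) <= dx for x, dx in lhs.items())
    return close and abs(a[y] - b[y]) > dy


def valid_candidates(rows: list[Row], target: int, attr: str, rules: list[Rule],
                     neighbor_attr: str, neighbor_dist: float) -> list[float]:
    """Candidates for rows[target][attr] from similarity neighbors, minus rule-violating ones."""
    t = rows[target]
    others = [r for i, r in enumerate(rows) if i != target and r[attr] is not None]
    # Similarity neighbors: tuples within neighbor_dist of the target on neighbor_attr.
    cands = {
        r[attr] for r in others
        if r[neighbor_attr] is not None and t[neighbor_attr] is not None
        and abs(r[neighbor_attr] - t[neighbor_attr]) <= neighbor_dist
    }
    valid = []
    for c in sorted(cands):
        filled = {**t, attr: c}  # tentatively impute the candidate
        if not any(violates(rule, filled, r) for rule in rules for r in others):
            valid.append(c)
    return valid


# Toy data (made up): the last tuple is missing "temp".
rows = [
    {"lat": 10.0, "temp": 20.0},
    {"lat": 10.2, "temp": 21.0},
    {"lat": 11.5, "temp": 30.0},
    {"lat": 10.1, "temp": None},
]
# Assumed rule: tuples within 0.5 on "lat" must be within 2.0 on "temp".
rules = [({"lat": 0.5}, ("temp", 2.0))]
result = valid_candidates(rows, 3, "temp", rules, neighbor_attr="lat", neighbor_dist=2.0)
```

With this toy data the loose neighbor radius proposes three candidates, and the rule discards 30.0 because it conflicts with the two tuples that are close on `lat`. The sketch handles a single missing cell in isolation; the maximization problem studied in the paper concerns imputing many cells jointly.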

Original language: English
Article number: 8543665
Pages (from-to): 275-287
Number of pages: 13
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 32
Issue number: 2
Publication status: Published - 1 Feb 2020
Externally published: Yes

Keywords

  • Similarity rules
  • data imputation
  • similarity neighbors
