Enriching Data Imputation under Similarity Rule Constraints

Shaoxu Song*; Yu Sun; Aoqian Zhang; Lei Chen; Jianmin Wang

*Corresponding author for this work

doi:10.1109/TKDE.2018.2883103

Research output: Contribution to journal › Article › peer-review

39 Citations (Scopus)

Abstract

Incomplete information often occurs along with many database applications, e.g., in data integration, data cleaning, or data exchange. The idea of data imputation is often to fill the missing data with the values of its neighbors who share the same/similar information. Such neighbors could either be identified certainly by editing rules or extensively by similarity relationships. Owing to data sparsity, the number of neighbors identified by editing rules w.r.t. value equality is rather limited, especially in the presence of data values with variances. To enrich the imputation candidates, a natural idea is to extensively consider the neighbors with similarity relationship. However, the candidates suggested by these (heterogeneous) similarity neighbors may conflict with each other. In this paper, we propose to utilize the similarity rules with tolerance to small variations (instead of the aforesaid editing rules with strict equality constraints) to rule out the invalid candidates provided by similarity neighbors. To enrich the data imputation, i.e., imputing the missing values more, we study the problem of maximizing the missing data imputation. Our major contributions include (1) the NP-hardness analysis on solving as well as approximating the problem, (2) exact algorithms for tackling the problem, and (3) efficient approximation with performance guarantees. Experiments on real and synthetic data sets demonstrate the superiority of our proposal in filling accuracy. We also demonstrate that the record matching application is indeed improved, after applying the proposed imputation.
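
The filtering idea described in the abstract can be illustrated with a short Python sketch. This is a toy illustration under stated assumptions, not the paper's algorithm: the rule format (x, dx, y, dy), the function names violates, valid_candidates and greedy_impute, and the sample records are all hypothetical and introduced only here. A rule is read as "two records within dx on attribute x must also be within dy on attribute y". The greedy fill is a simple heuristic and carries none of the exactness or approximation guarantees claimed in the paper.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[float]]
# Hypothetical rule (x, dx, y, dy): records within dx on x must be within dy on y.
Rule = Tuple[str, float, str, float]


def violates(rule: Rule, a: Record, b: Record) -> bool:
    """Return True if records a and b together break the similarity rule."""
    x, dx, y, dy = rule
    if None in (a[x], b[x], a[y], b[y]):
        return False  # the rule cannot be checked on missing values
    return abs(a[x] - b[x]) <= dx and abs(a[y] - b[y]) > dy


def valid_candidates(records: List[Record], i: int, attr: str,
                     rules: List[Rule]) -> List[float]:
    """Values of attr offered by other records that break no rule when
    placed into records[i]."""
    out: List[float] = []
    for j, donor in enumerate(records):
        value = donor[attr]
        if j == i or value is None or value in out:
            continue
        trial = {**records[i], attr: value}  # tentatively fill the cell
        others = (r for k, r in enumerate(records) if k != i)
        if not any(violates(rule, trial, o) for o in others for rule in rules):
            out.append(value)
    return out


def greedy_impute(records: List[Record], attr: str,
                  rules: List[Rule]) -> List[Record]:
    """Fill each missing attr with its first valid candidate (heuristic only).
    Works on a copy, so the input records are left unchanged."""
    filled = [dict(r) for r in records]
    for i, r in enumerate(filled):
        if r[attr] is None:
            cands = valid_candidates(filled, i, attr, rules)
            if cands:
                r[attr] = cands[0]
    return filled


records = [
    {"x": 1.0, "y": 10.0},
    {"x": 1.2, "y": None},   # missing value to impute
    {"x": 5.0, "y": 50.0},
]
rules = [("x", 0.5, "y", 1.0)]
print(valid_candidates(records, 1, "y", rules))  # prints [10.0]
```

In this toy data, the value 50.0 offered by the third record is ruled out because the filled record would then sit within 0.5 of the first record on x while differing from it by 40 on y. Only 10.0 survives as a candidate.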

Original language	English
Article number	8543665
Pages (from-to)	275-287
Number of pages	13
Journal	IEEE Transactions on Knowledge and Data Engineering
Volume	32
Issue number	2
DOIs	https://doi.org/10.1109/TKDE.2018.2883103
Publication status	Published - 1 Feb 2020
Externally published	Yes

Keywords

Similarity rules
data imputation
similarity neighbors

Access to Document

10.1109/TKDE.2018.2883103

Cite this

@article{145c9e7d788d4a67b989c612b2e4976c,
title = "Enriching Data Imputation under Similarity Rule Constraints",
abstract = "Incomplete information often occurs along with many database applications, e.g., in data integration, data cleaning, or data exchange. The idea of data imputation is often to fill the missing data with the values of its neighbors who share the same/similar information. Such neighbors could either be identified certainly by editing rules or extensively by similarity relationships. Owing to data sparsity, the number of neighbors identified by editing rules w.r.t. value equality is rather limited, especially in the presence of data values with variances. To enrich the imputation candidates, a natural idea is to extensively consider the neighbors with similarity relationship. However, the candidates suggested by these (heterogeneous) similarity neighbors may conflict with each other. In this paper, we propose to utilize the similarity rules with tolerance to small variations (instead of the aforesaid editing rules with strict equality constraints) to rule out the invalid candidates provided by similarity neighbors. To enrich the data imputation, i.e., imputing the missing values more, we study the problem of maximizing the missing data imputation. Our major contributions include (1) the NP-hardness analysis on solving as well as approximating the problem, (2) exact algorithms for tackling the problem, and (3) efficient approximation with performance guarantees. Experiments on real and synthetic data sets demonstrate the superiority of our proposal in filling accuracy. We also demonstrate that the record matching application is indeed improved, after applying the proposed imputation.",
keywords = "Similarity rules, data imputation, similarity neighbors",
author = "Shaoxu Song and Yu Sun and Aoqian Zhang and Lei Chen and Jianmin Wang",
note = "Publisher Copyright: {\textcopyright} 1989-2012 IEEE.",
year = "2020",
month = feb,
day = "1",
doi = "10.1109/TKDE.2018.2883103",
language = "English",
volume = "32",
pages = "275--287",
journal = "IEEE Transactions on Knowledge and Data Engineering",
issn = "1041-4347",
publisher = "IEEE Computer Society",
number = "2",
}

TY  - JOUR
T1  - Enriching Data Imputation under Similarity Rule Constraints
AU  - Song, Shaoxu
AU  - Sun, Yu
AU  - Zhang, Aoqian
AU  - Chen, Lei
AU  - Wang, Jianmin
PY  - 2020/2/1
Y1  - 2020/2/1
N2  - Incomplete information often occurs along with many database applications, e.g., in data integration, data cleaning, or data exchange. The idea of data imputation is often to fill the missing data with the values of its neighbors who share the same/similar information. Such neighbors could either be identified certainly by editing rules or extensively by similarity relationships. Owing to data sparsity, the number of neighbors identified by editing rules w.r.t. value equality is rather limited, especially in the presence of data values with variances. To enrich the imputation candidates, a natural idea is to extensively consider the neighbors with similarity relationship. However, the candidates suggested by these (heterogeneous) similarity neighbors may conflict with each other. In this paper, we propose to utilize the similarity rules with tolerance to small variations (instead of the aforesaid editing rules with strict equality constraints) to rule out the invalid candidates provided by similarity neighbors. To enrich the data imputation, i.e., imputing the missing values more, we study the problem of maximizing the missing data imputation. Our major contributions include (1) the NP-hardness analysis on solving as well as approximating the problem, (2) exact algorithms for tackling the problem, and (3) efficient approximation with performance guarantees. Experiments on real and synthetic data sets demonstrate the superiority of our proposal in filling accuracy. We also demonstrate that the record matching application is indeed improved, after applying the proposed imputation.
AB  - Incomplete information often occurs along with many database applications, e.g., in data integration, data cleaning, or data exchange. The idea of data imputation is often to fill the missing data with the values of its neighbors who share the same/similar information. Such neighbors could either be identified certainly by editing rules or extensively by similarity relationships. Owing to data sparsity, the number of neighbors identified by editing rules w.r.t. value equality is rather limited, especially in the presence of data values with variances. To enrich the imputation candidates, a natural idea is to extensively consider the neighbors with similarity relationship. However, the candidates suggested by these (heterogeneous) similarity neighbors may conflict with each other. In this paper, we propose to utilize the similarity rules with tolerance to small variations (instead of the aforesaid editing rules with strict equality constraints) to rule out the invalid candidates provided by similarity neighbors. To enrich the data imputation, i.e., imputing the missing values more, we study the problem of maximizing the missing data imputation. Our major contributions include (1) the NP-hardness analysis on solving as well as approximating the problem, (2) exact algorithms for tackling the problem, and (3) efficient approximation with performance guarantees. Experiments on real and synthetic data sets demonstrate the superiority of our proposal in filling accuracy. We also demonstrate that the record matching application is indeed improved, after applying the proposed imputation.
KW  - Similarity rules
KW  - data imputation
KW  - similarity neighbors
UR  - http://www.scopus.com/inward/record.url?scp=85057433626&partnerID=8YFLogxK
U2  - 10.1109/TKDE.2018.2883103
DO  - 10.1109/TKDE.2018.2883103
M3  - Article
AN  - SCOPUS:85057433626
SN  - 1041-4347
VL  - 32
SP  - 275
EP  - 287
JO  - IEEE Transactions on Knowledge and Data Engineering
JF  - IEEE Transactions on Knowledge and Data Engineering
IS  - 2
M1  - 8543665
ER  -
