Enriching Data Imputation under Similarity Rule Constraints

Shaoxu Song; Yu Sun; Aoqian Zhang; Lei Chen; Jianmin Wang

doi:10.1109/TKDE.2018.2883103

Shaoxu Song^*, Yu Sun, Aoqian Zhang, Lei Chen, Jianmin Wang

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

39 Citations (Scopus)

Abstract

Incomplete information often occurs along with many database applications, e.g., in data integration, data cleaning, or data exchange. The idea of data imputation is often to fill the missing data with the values of its neighbors who share the same/similar information. Such neighbors could either be identified certainly by editing rules or extensively by similarity relationships. Owing to data sparsity, the number of neighbors identified by editing rules w.r.t. value equality is rather limited, especially in the presence of data values with variances. To enrich the imputation candidates, a natural idea is to extensively consider the neighbors with similarity relationship. However, the candidates suggested by these (heterogeneous) similarity neighbors may conflict with each other. In this paper, we propose to utilize the similarity rules with tolerance to small variations (instead of the aforesaid editing rules with strict equality constraints) to rule out the invalid candidates provided by similarity neighbors. To enrich the data imputation, i.e., imputing the missing values more, we study the problem of maximizing the missing data imputation. Our major contributions include (1) the NP-hardness analysis on solving as well as approximating the problem, (2) exact algorithms for tackling the problem, and (3) efficient approximation with performance guarantees. Experiments on real and synthetic data sets demonstrate the superiority of our proposal in filling accuracy. We also demonstrate that the record matching application is indeed improved, after applying the proposed imputation.
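
The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it handles a single missing cell rather than the joint maximization problem the paper studies, and the record layout `(x, y)`, the function names, and the thresholds `radius`, `dx`, and `dy` are all assumptions made for illustration. The assumed similarity rule reads: if two records have `x` values within `dx`, their `y` values must be within `dy`.

```python
from typing import List, Optional, Tuple

# Hypothetical record layout: (x, y), where y is None when it is missing.
Record = Tuple[float, Optional[float]]


def violates(x1: float, y1: float, x2: float, y2: float,
             dx: float, dy: float) -> bool:
    """Assumed similarity rule: x within dx implies y within dy."""
    return abs(x1 - x2) <= dx and abs(y1 - y2) > dy


def valid_candidates(records: List[Record], i: int,
                     radius: float, dx: float, dy: float) -> List[float]:
    """Return candidate fills for records[i].y that satisfy the rule.

    Candidates are the y values of similarity neighbors (records whose x
    lies within `radius`); a candidate is kept only if filling it in would
    not violate the rule against any complete record.
    """
    x_i, _ = records[i]
    complete = [(x, y) for j, (x, y) in enumerate(records)
                if j != i and y is not None]
    # Step 1: collect candidates from similarity neighbors.
    candidates = sorted({y for x, y in complete if abs(x - x_i) <= radius})
    # Step 2: rule out candidates that conflict with the similarity rule.
    return [c for c in candidates
            if not any(violates(x_i, c, x, y, dx, dy) for x, y in complete)]


records = [(1.0, 10.0), (1.2, 10.5), (3.0, 20.0), (2.0, None)]
print(valid_candidates(records, 3, radius=1.5, dx=0.9, dy=1.0))
```

In this toy data the three complete records are all similarity neighbors of the incomplete one, so they suggest conflicting values; the candidate 20.0 is discarded because the record at `x = 1.2` is within `dx` yet its `y` differs by more than `dy`.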

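Original language	English
Article number	8543665
Pages (from-to)	275-287
Number of pages	13
Journal	IEEE Transactions on Knowledge and Data Engineering
Volume	32
Issue number	2
DOI	https://doi.org/10.1109/TKDE.2018.2883103
Publication status	Published - 1 Feb 2020
Externally published	Yes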

Access to Document

10.1109/TKDE.2018.2883103

Cite this

@article{145c9e7d788d4a67b989c612b2e4976c,

title = "Enriching Data Imputation under Similarity Rule Constraints",

abstract = "Incomplete information often occurs along with many database applications, e.g., in data integration, data cleaning, or data exchange. The idea of data imputation is often to fill the missing data with the values of its neighbors who share the same/similar information. Such neighbors could either be identified certainly by editing rules or extensively by similarity relationships. Owing to data sparsity, the number of neighbors identified by editing rules w.r.t. value equality is rather limited, especially in the presence of data values with variances. To enrich the imputation candidates, a natural idea is to extensively consider the neighbors with similarity relationship. However, the candidates suggested by these (heterogeneous) similarity neighbors may conflict with each other. In this paper, we propose to utilize the similarity rules with tolerance to small variations (instead of the aforesaid editing rules with strict equality constraints) to rule out the invalid candidates provided by similarity neighbors. To enrich the data imputation, i.e., imputing the missing values more, we study the problem of maximizing the missing data imputation. Our major contributions include (1) the NP-hardness analysis on solving as well as approximating the problem, (2) exact algorithms for tackling the problem, and (3) efficient approximation with performance guarantees. Experiments on real and synthetic data sets demonstrate the superiority of our proposal in filling accuracy. We also demonstrate that the record matching application is indeed improved, after applying the proposed imputation.",

keywords = "Similarity rules, data imputation, similarity neighbors",

author = "Shaoxu Song and Yu Sun and Aoqian Zhang and Lei Chen and Jianmin Wang",

note = "Publisher Copyright: {\textcopyright} 1989-2012 IEEE.",

year = "2020",

month = feb,

day = "1",

doi = "10.1109/TKDE.2018.2883103",

language = "English",

volume = "32",

pages = "275--287",

journal = "IEEE Transactions on Knowledge and Data Engineering",

issn = "1041-4347",

publisher = "IEEE Computer Society",

number = "2",

}

TY - JOUR

T1 - Enriching Data Imputation under Similarity Rule Constraints

AU - Song, Shaoxu

AU - Sun, Yu

AU - Zhang, Aoqian

AU - Chen, Lei

AU - Wang, Jianmin

PY - 2020/2/1

Y1 - 2020/2/1

N2 - Incomplete information often occurs along with many database applications, e.g., in data integration, data cleaning, or data exchange. The idea of data imputation is often to fill the missing data with the values of its neighbors who share the same/similar information. Such neighbors could either be identified certainly by editing rules or extensively by similarity relationships. Owing to data sparsity, the number of neighbors identified by editing rules w.r.t. value equality is rather limited, especially in the presence of data values with variances. To enrich the imputation candidates, a natural idea is to extensively consider the neighbors with similarity relationship. However, the candidates suggested by these (heterogeneous) similarity neighbors may conflict with each other. In this paper, we propose to utilize the similarity rules with tolerance to small variations (instead of the aforesaid editing rules with strict equality constraints) to rule out the invalid candidates provided by similarity neighbors. To enrich the data imputation, i.e., imputing the missing values more, we study the problem of maximizing the missing data imputation. Our major contributions include (1) the NP-hardness analysis on solving as well as approximating the problem, (2) exact algorithms for tackling the problem, and (3) efficient approximation with performance guarantees. Experiments on real and synthetic data sets demonstrate the superiority of our proposal in filling accuracy. We also demonstrate that the record matching application is indeed improved, after applying the proposed imputation.

AB - Incomplete information often occurs along with many database applications, e.g., in data integration, data cleaning, or data exchange. The idea of data imputation is often to fill the missing data with the values of its neighbors who share the same/similar information. Such neighbors could either be identified certainly by editing rules or extensively by similarity relationships. Owing to data sparsity, the number of neighbors identified by editing rules w.r.t. value equality is rather limited, especially in the presence of data values with variances. To enrich the imputation candidates, a natural idea is to extensively consider the neighbors with similarity relationship. However, the candidates suggested by these (heterogeneous) similarity neighbors may conflict with each other. In this paper, we propose to utilize the similarity rules with tolerance to small variations (instead of the aforesaid editing rules with strict equality constraints) to rule out the invalid candidates provided by similarity neighbors. To enrich the data imputation, i.e., imputing the missing values more, we study the problem of maximizing the missing data imputation. Our major contributions include (1) the NP-hardness analysis on solving as well as approximating the problem, (2) exact algorithms for tackling the problem, and (3) efficient approximation with performance guarantees. Experiments on real and synthetic data sets demonstrate the superiority of our proposal in filling accuracy. We also demonstrate that the record matching application is indeed improved, after applying the proposed imputation.

KW - Similarity rules

KW - data imputation

KW - similarity neighbors

UR - http://www.scopus.com/inward/record.url?scp=85057433626&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2018.2883103

DO - 10.1109/TKDE.2018.2883103

M3 - Article

AN - SCOPUS:85057433626

SN - 1041-4347

VL - 32

SP - 275

EP - 287

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

IS - 2

M1 - 8543665

ER -
