Enriching data imputation with extensive similarity neighbors

Shaoxu Song; Aoqian Zhang; Lei Chen; Jianmin Wang

doi:10.14778/2809974.2809989

Research output: Chapter in Book/Report/Conference proceeding › Chapter › peer-review

43 Citations (Scopus)

Abstract

Incomplete information often occurs along with many database applications, e.g., in data integration, data cleaning or data exchange. The idea of data imputation is to fill the missing data with the values of its neighbors who share the same information. Such neighbors could either be identified certainly by editing rules or statistically by relational dependency networks. Unfortunately, owing to data sparsity, the number of neighbors (identified w.r.t. value equality) is rather limited, especially in the presence of data values with variances. In this paper, we argue to extensively enrich similarity neighbors by similarity rules with tolerance to small variations. More fillings can thus be acquired that the aforesaid equality neighbors fail to reveal. To fill the missing values more, we study the problem of maximizing the missing data imputation. Our major contributions include (1) the NP-hardness analysis on solving and approximating the problem, (2) exact algorithms for tackling the problem, and (3) efficient approximation with performance guarantees. Experiments on real and synthetic data sets demonstrate that the filling accuracy can be improved.
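The contrast the abstract draws between equality neighbors and similarity neighbors can be illustrated with a minimal sketch. This is not the paper's algorithm (which maximizes the number of fillings under similarity rules); the toy relation, the `impute` helper, the absolute-difference distance, and the majority-vote filling are all assumptions made only for illustration.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Hypothetical toy relation: (numeric attribute X, label Y); None marks a missing Y.
Row = Tuple[int, Optional[str]]

def impute(rows: List[Row], target: int, tol: int) -> Optional[str]:
    """Suggest a value for rows[target]'s missing Y from its neighbors on X.

    tol = 0 keeps only equality neighbors (identical X);
    tol > 0 also admits similarity neighbors whose X differs by at most tol.
    """
    x, _ = rows[target]
    candidates = [
        y
        for i, (xi, y) in enumerate(rows)
        if i != target and y is not None and abs(xi - x) <= tol
    ]
    if not candidates:
        return None  # no neighbor found, so the value stays missing
    # Assumed filling strategy: majority vote among the neighbors.
    return Counter(candidates).most_common(1)[0][0]

rows: List[Row] = [(100, "A"), (102, "A"), (103, "B"), (250, "C"), (101, None)]

equality_fill = impute(rows, target=4, tol=0)    # no row has X == 101
similarity_fill = impute(rows, target=4, tol=1)  # rows with X in [100, 102]
```

With value equality, no neighbor exists for the incomplete row and nothing can be filled; tolerating a small variation on X admits two neighbors that agree on the filling.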

Original language	English
Title of host publication	Proceedings of the VLDB Endowment
Editors	Christophe Claramunt, Simonas Saltenis, Ki-Joune Li
Publisher	Association for Computing Machinery
Pages	1286-1297
Number of pages	12
Volume	8
Edition	11
DOIs	https://doi.org/10.14778/2809974.2809989
Publication status	Published - 2015
Externally published	Yes
Event	3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006 - Seoul, Korea, Republic of Duration: 11 Sept 2006 → 11 Sept 2006

Conference

Conference	3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006
Country/Territory	Korea, Republic of
City	Seoul
Period	11/09/06 → 11/09/06

Access to Document

10.14778/2809974.2809989

Cite this

Song, S., Zhang, A., Chen, L., & Wang, J. (2015). Enriching data imputation with extensive similarity neighbors. In C. Claramunt, S. Saltenis, & K.-J. Li (Eds.), Proceedings of the VLDB Endowment (11 ed., Vol. 8, pp. 1286-1297). Association for Computing Machinery. https://doi.org/10.14778/2809974.2809989

@inbook{27312a0d7a3849249400c964a9f93d0d,
  title = "Enriching data imputation with extensive similarity neighbors",
  abstract = "Incomplete information often occurs along with many database applications, e.g., in data integration, data cleaning or data exchange. The idea of data imputation is to fill the missing data with the values of its neighbors who share the same information. Such neighbors could either be identified certainly by editing rules or statistically by relational dependency networks. Unfortunately, owing to data sparsity, the number of neighbors (identified w.r.t. value equality) is rather limited, especially in the presence of data values with variances. In this paper, we argue to extensively enrich similarity neighbors by similarity rules with tolerance to small variations. More fillings can thus be acquired that the aforesaid equality neighbors fail to reveal. To fill the missing values more, we study the problem of maximizing the missing data imputation. Our major contributions include (1) the NP-hardness analysis on solving and approximating the problem, (2) exact algorithms for tackling the problem, and (3) efficient approximation with performance guarantees. Experiments on real and synthetic data sets demonstrate that the filling accuracy can be improved.",
  author = "Shaoxu Song and Aoqian Zhang and Lei Chen and Jianmin Wang",
  note = "Publisher Copyright: {\textcopyright} 2015 VLDB Endowment 21508097/15/07.; 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006 ; Conference date: 11-09-2006 Through 11-09-2006",
  year = "2015",
  doi = "10.14778/2809974.2809989",
  language = "English",
  volume = "8",
  pages = "1286--1297",
  editor = "Christophe Claramunt and Simonas Saltenis and Ki-Joune Li",
  booktitle = "Proceedings of the VLDB Endowment",
  publisher = "Association for Computing Machinery",
  edition = "11",
}

Song, S, Zhang, A, Chen, L & Wang, J 2015, Enriching data imputation with extensive similarity neighbors. in C Claramunt, S Saltenis & K-J Li (eds), Proceedings of the VLDB Endowment. 11 edn, vol. 8, Association for Computing Machinery, pp. 1286-1297, 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006, Seoul, Korea, Republic of, 11/09/06. https://doi.org/10.14778/2809974.2809989

TY - CHAP
T1 - Enriching data imputation with extensive similarity neighbors
AU - Song, Shaoxu
AU - Zhang, Aoqian
AU - Chen, Lei
AU - Wang, Jianmin
PY - 2015
Y1 - 2015
AB - Incomplete information often occurs along with many database applications, e.g., in data integration, data cleaning or data exchange. The idea of data imputation is to fill the missing data with the values of its neighbors who share the same information. Such neighbors could either be identified certainly by editing rules or statistically by relational dependency networks. Unfortunately, owing to data sparsity, the number of neighbors (identified w.r.t. value equality) is rather limited, especially in the presence of data values with variances. In this paper, we argue to extensively enrich similarity neighbors by similarity rules with tolerance to small variations. More fillings can thus be acquired that the aforesaid equality neighbors fail to reveal. To fill the missing values more, we study the problem of maximizing the missing data imputation. Our major contributions include (1) the NP-hardness analysis on solving and approximating the problem, (2) exact algorithms for tackling the problem, and (3) efficient approximation with performance guarantees. Experiments on real and synthetic data sets demonstrate that the filling accuracy can be improved.
UR - http://www.scopus.com/inward/record.url?scp=84953884895&partnerID=8YFLogxK
U2 - 10.14778/2809974.2809989
DO - 10.14778/2809974.2809989
M3 - Chapter
AN - SCOPUS:84953884895
VL - 8
SP - 1286
EP - 1297
BT - Proceedings of the VLDB Endowment
A2 - Claramunt, Christophe
A2 - Saltenis, Simonas
A2 - Li, Ki-Joune
PB - Association for Computing Machinery
T2 - 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006
Y2 - 11 September 2006 through 11 September 2006
ER -