TY - GEN
T1 - A hybrid machine-crowdsourcing system for matching web tables
AU - Fan, Ju
AU - Lu, Meiyu
AU - Ooi, Beng Chin
AU - Tan, Wang Chiew
AU - Zhang, Meihui
PY - 2014
Y1 - 2014
N2 - The Web is teeming with rich structured information in the form of HTML tables, which provides us with the opportunity to build a knowledge repository by integrating these tables. An essential problem of web data integration is to discover semantic correspondences between web table columns, and schema matching is a popular means to determine the semantic correspondences. However, conventional schema matching techniques are not always effective for web table matching due to the incompleteness in web tables. In this paper, we propose a two-pronged approach for web table matching that effectively addresses the above difficulties. First, we propose a concept-based approach that maps each column of a web table to the best concept, in a well-developed knowledge base, that represents it. This approach overcomes the problem that sometimes values of two web table columns may be disjoint, even though the columns are related, due to incompleteness in the column values. Second, we develop a hybrid machine-crowdsourcing framework that leverages human intelligence to discern the concepts for 'difficult' columns. Our overall framework assigns the most 'beneficial' column-to-concept matching tasks to the crowd under a given budget and utilizes the crowdsourcing result to help our algorithm infer the best matches for the rest of the columns. We validate the effectiveness of our framework through an extensive experimental study over two real-world web table data sets. The results show that our two-pronged approach outperforms existing schema matching techniques at only a low cost for crowdsourcing.
UR - https://www.scopus.com/pages/publications/84901770462
U2 - 10.1109/ICDE.2014.6816716
DO - 10.1109/ICDE.2014.6816716
M3 - Conference contribution
AN - SCOPUS:84901770462
SN - 9781479925544
T3 - Proceedings - International Conference on Data Engineering
SP - 976
EP - 987
BT - 2014 IEEE 30th International Conference on Data Engineering, ICDE 2014
PB - IEEE Computer Society
T2 - 30th IEEE International Conference on Data Engineering, ICDE 2014
Y2 - 31 March 2014 through 4 April 2014
ER -