TY - GEN
T1 - Automatic discovery of attributes in relational databases
AU - Zhang, Meihui
AU - Hadjieleftheriou, Marios
AU - Ooi, Beng Chin
AU - Procopiuc, Cecilia M.
AU - Srivastava, Divesh
PY - 2011
Y1 - 2011
N2 - In this work we design algorithms for clustering relational columns into attributes, i.e., for identifying strong relationships between columns based on the common properties and characteristics of the values they contain. For example, identifying whether a certain set of columns refers to telephone numbers versus social security numbers, or names of customers versus names of nations. Traditional relational database schema languages use very limited primitive data types and simple foreign key constraints to express relationships between columns. Object oriented schema languages allow the definition of custom data types; still, certain relationships between columns might be unknown at design time or they might appear only in a particular database instance. Nevertheless, these relationships are an invaluable tool for schema matching, and generally for better understanding and working with the data. Here, we introduce data oriented solutions (we do not consider solutions that assume the existence of any external knowledge) that use statistical measures to identify strong relationships between the values of a set of columns. Interpreting the database as a graph where nodes correspond to database columns and edges correspond to column relationships, we decompose the graph into connected components and cluster sets of columns into attributes. To test the quality of our solution, we also provide a comprehensive experimental evaluation using real and synthetic datasets.
AB - In this work we design algorithms for clustering relational columns into attributes, i.e., for identifying strong relationships between columns based on the common properties and characteristics of the values they contain. For example, identifying whether a certain set of columns refers to telephone numbers versus social security numbers, or names of customers versus names of nations. Traditional relational database schema languages use very limited primitive data types and simple foreign key constraints to express relationships between columns. Object oriented schema languages allow the definition of custom data types; still, certain relationships between columns might be unknown at design time or they might appear only in a particular database instance. Nevertheless, these relationships are an invaluable tool for schema matching, and generally for better understanding and working with the data. Here, we introduce data oriented solutions (we do not consider solutions that assume the existence of any external knowledge) that use statistical measures to identify strong relationships between the values of a set of columns. Interpreting the database as a graph where nodes correspond to database columns and edges correspond to column relationships, we decompose the graph into connected components and cluster sets of columns into attributes. To test the quality of our solution, we also provide a comprehensive experimental evaluation using real and synthetic datasets.
KW - attribute discovery
KW - schema matching
UR - https://www.scopus.com/pages/publications/79959934451
U2 - 10.1145/1989323.1989336
DO - 10.1145/1989323.1989336
M3 - Conference contribution
AN - SCOPUS:79959934451
SN - 9781450306614
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 109
EP - 120
BT - Proceedings of SIGMOD 2011 and PODS 2011
PB - Association for Computing Machinery
T2 - 2011 ACM SIGMOD and 30th PODS 2011 Conference
Y2 - 12 June 2011 through 16 June 2011
ER -