Exploring Heterogeneous Data Lake based on Unified Canonical Graphs

  • Qin Yuan
  • Ye Yuan*
  • Zhenyu Wen
  • He Wang
  • Chen Chen
  • Guoren Wang

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

7 Citations (Scopus)

Abstract

A data lake is a repository for massive raw and heterogeneous data, comprising multiple data models with different data schemas and query interfaces. Keyword search can extract valuable information for users who have no knowledge of the underlying schemas and query languages. However, conventional keyword search methods are restricted to a single data model and cannot easily adapt to a data lake. In this paper, we study a novel keyword search for data lakes. To achieve high accuracy and efficiency, we introduce canonical graphs and then integrate semantically related vertices based on vertex representations. A matching-entity-based keyword search algorithm is presented to find answers across multiple data sources. Finally, an extensive experimental study shows the effectiveness and efficiency of our solution.

Original language: English
Title of host publication: SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher: Association for Computing Machinery, Inc
Pages: 1834-1838
Number of pages: 5
ISBN (Electronic): 9781450387323
Publication status: Published - 7 Jul 2022
Event: 45th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2022 - Madrid, Spain
Duration: 11 Jul 2022 to 15 Jul 2022

Publication series

Name: SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

Conference

Conference: 45th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2022
Country/Territory: Spain
City: Madrid
Period: 11/07/22 to 15/07/22

Keywords

  • canonical graph
  • data lake
  • keyword search
  • matching entity
