AI-driven data lineage verification using temporal analysis with graph-based anomaly detection: a comparative approach of supervised and unsupervised learning

  • Gang Yuan
  • Jing Geng*
  • Shengjun Wen
  • *Corresponding author for this work
  • Beijing Institute of Technology
  • China Network Security Review Certification and Market Supervision Big Data Center

Research output: Contribution to journal, Article, peer-reviewed

Abstract

In data-centric industries such as finance, healthcare, and cybersecurity, maintaining the integrity and accuracy of data lineage is crucial because of compliance requirements. Current methods for verifying data lineage often struggle with the dynamic, multi-sourced nature of datasets and with their scale, which reduces their performance in detecting anomalies and validating lineage. In this article, we introduce and empirically assess two complementary artificial intelligence-based frameworks for data lineage verification and anomaly detection. In the first phase, we developed an unsupervised approach that uses Graph Attention Networks (GATs) for structural representation learning and an Isolation Forest for label-free anomaly detection. The surrogate model used for validation reported over 99.8% accuracy on replicated anomaly patterns in unseen test data for Olist, Transaction Processing Performance Council Benchmark H (TPC-H), and Medical Information Mart for Intensive Care III (MIMIC-III). In the second phase, we developed a supervised, multi-modal framework that integrates graph neural networks (GNNs), long short-term memory (LSTM)-Attention networks, Dynamic Time Warping (DTW) for automatic labeling, and contrastive learning. To counter the class imbalance inherent in anomaly detection, this framework incorporates the Synthetic Minority Over-Sampling Technique (SMOTE) as a fundamental component of its training. Across the three datasets, the unsupervised model outperforms the supervised model, owing to its ability to adapt dynamically to data without requiring labels. The supervised model achieves a maximum area under the curve (AUC) of 0.96, whereas the unsupervised model achieves an AUC of 1.00, indicating better predictive performance.
The multi-faceted comparison of performance, feature importance, and operational dashboards gives users practical insight and confirms the effectiveness of both the unsupervised model and the supervised multi-modal model, while the pair retains the explainability, governance, and scalability required of data lineage.
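
The abstract names Dynamic Time Warping (DTW) as the basis for automatic labeling but does not spell out the labeling rule. The following is a minimal sketch of how such a step might look, not the authors' implementation: the function names, the reference profile, the threshold value, and the rule "flag a series whose DTW distance to the reference exceeds the threshold" are all assumptions made for illustration.

```python
from math import inf


def dtw_distance(a, b):
    """Classic dynamic-programming DTW distance with absolute-difference cost."""
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # advance in a only (b[j-1] is reused)
                cost[i][j - 1],      # advance in b only (a[i-1] is reused)
                cost[i - 1][j - 1],  # advance in both series
            )
    return cost[n][m]


def auto_label(series_by_node, reference, threshold):
    """Hypothetical rule: label 1 (anomalous) when the DTW distance to the
    reference profile exceeds the threshold, otherwise 0 (normal)."""
    return {
        node: int(dtw_distance(series, reference) > threshold)
        for node, series in series_by_node.items()
    }


# Illustrative data only: per-node activity series from a lineage graph.
reference = [1, 2, 3]
series_by_node = {
    "orders": [1, 2, 2, 3],   # same shape as the reference, stretched in time
    "payments": [1, 9, 3],    # contains a spike
}
labels = auto_label(series_by_node, reference, threshold=1.0)
# labels == {"orders": 0, "payments": 1}
```

Because DTW aligns series that differ only in timing, the stretched "orders" series has distance 0 to the reference and stays unlabeled, while the spike in "payments" yields a large distance and is flagged. Labels produced this way could then serve as training targets for a supervised model, with oversampling such as SMOTE applied to the minority class.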

Original language: English
Article number: e3608
Journal: PeerJ Computer Science
Volume: 12
DOI
Publication status: Published - 2026
Externally published: Yes
