
AI-driven data lineage verification using temporal analysis with graph-based anomaly detection: a comparative approach of supervised and unsupervised learning

  • Gang Yuan
  • Jing Geng*
  • Shengjun Wen
  • *Corresponding author for this work
  • Beijing Institute of Technology
  • China Network Security Review Certification and Market Supervision Big Data Center

Research output: Contribution to journal › Article › peer-review

Abstract

In data-centric industries such as finance, healthcare, and cybersecurity, maintaining the integrity and accuracy of data lineage is crucial because of compliance requirements. Current methods for verifying data lineage often struggle with the dynamic, multi-sourced nature of datasets and with their scale, which reduces their performance in detecting anomalies and validating lineage. In this article, we introduce and empirically assess two complementary artificial intelligence-based frameworks for data lineage verification and anomaly detection. In the first phase, we developed an unsupervised approach that uses Graph Attention Networks (GATs) for structural representation learning and an Isolation Forest for label-free anomaly detection. A surrogate model used for validation reproduced the anomaly patterns on unseen test data with over 99.8% accuracy for the Olist, Transaction Processing Performance Council Benchmark H (TPC-H), and Medical Information Mart for Intensive Care III (MIMIC-III) datasets. In the second phase, we developed a supervised, multi-modal framework that integrates graph neural networks (GNNs), long short-term memory (LSTM)-Attention networks, Dynamic Time Warping (DTW) for automatic labeling, and contrastive learning. To counter the class imbalance inherent in anomaly detection, this framework incorporates the Synthetic Minority Over-Sampling Technique (SMOTE) as a fundamental component of its training. When both models are compared on the three datasets, the unsupervised model outperforms the supervised model owing to its ability to adapt dynamically to data without requiring labels. The supervised model achieves a maximum area under the curve (AUC) of 0.96, whereas the unsupervised model achieves an AUC of 1.00, indicating better predictive performance. A multi-faceted comparison of performance, feature importance, and operational dashboards gives users practical insight and confirms the effectiveness of both the unsupervised model and the supervised multi-modal model, while the pair together retains the explainability, governance, and scalability of data lineage.
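
Two of the ingredients named in the abstract, DTW-based automatic labeling and AUC evaluation, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the threshold rule in `auto_label`, and the choice of an absolute-difference cost are assumptions made purely for illustration, and the abstract does not specify how DTW distances are turned into labels.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic time warping distance with an absolute-difference cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] also aligns with an earlier b
                cost[i][j - 1],      # b[j-1] also aligns with an earlier a
                cost[i - 1][j - 1],  # one-to-one match
            )
    return cost[n][m]


def auto_label(series: Sequence[float], reference: Sequence[float],
               threshold: float) -> int:
    """Hypothetical rule: label 1 (anomalous) if the DTW distance to a
    reference series exceeds the threshold, otherwise 0."""
    return 1 if dtw_distance(series, reference) > threshold else 0


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

In this sketch, a series that is merely stretched in time stays close to its reference under DTW, while a series with a genuine spike does not. The pairwise AUC is quadratic in the number of samples, so it is suitable only for small illustrative inputs.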

Original language: English
Article number: e3608
Journal: PeerJ Computer Science
Volume: 12
Publication status: Published - 2026
Externally published: Yes

Keywords

  • Contrastive learning for anomaly detection
  • Data lineage verification
  • Lineage drift detection
  • Meta-learning for data lineage
  • Temporal analysis
