TY - JOUR
T1 - AI-driven data lineage verification using temporal analysis with graph-based anomaly detection
T2 - a comparative approach of supervised and unsupervised learning
AU - Yuan, Gang
AU - Geng, Jing
AU - Wen, Shengjun
N1 - Publisher Copyright: Copyright 2026 Yuan et al. Distributed under Creative Commons CC-BY 4.0
PY - 2026
Y1 - 2026
AB - In data-centric industries such as finance, healthcare, and cybersecurity, maintaining the integrity and accuracy of data lineage is crucial because of compliance requirements. Current methods for verifying data lineage often struggle with the scale and the dynamic, multi-sourced nature of datasets, which reduces their performance in detecting anomalies and validating lineage. In this article, we introduce and empirically assess two complementary artificial intelligence-based frameworks for data lineage verification and anomaly detection. In the first phase, we developed an unsupervised approach that uses Graph Attention Networks (GATs) for structural representation learning and an Isolation Forest for 'no-label' anomaly detection. The surrogate model used for validation achieved over 99.8% accuracy on replicated anomaly patterns in unseen test data for Olist, Transaction Processing Performance Council Benchmark H (TPC-H), and Medical Information Mart for Intensive Care III (MIMIC-III). In the second phase, we developed a supervised, multi-modal framework that integrates graph neural networks (GNNs), long short-term memory (LSTM)-Attention networks, Dynamic Time Warping (DTW) for automatic labeling, and Contrastive Learning. To counter the class imbalance inherent in anomaly detection, this framework incorporates the Synthetic Minority Over-Sampling Technique (SMOTE) as a fundamental component of its training. Across the three datasets, the unsupervised model outperforms the supervised model because it adapts dynamically to the data without requiring labels. The supervised model achieves a maximum area under the curve (AUC) of 0.96, whereas the unsupervised model achieves an AUC of 1.00, indicating better predictive performance. The multi-faceted comparison of performance, feature importance, and operational dashboards gives users practical insight and confirms the effectiveness of both the unsupervised model and the supervised multi-modal model, while retaining the explainability, governance, and scalability of data lineage.
KW - Contrastive learning for anomaly detection
KW - Data lineage verification
KW - Lineage drift detection
KW - Meta-learning for data lineage
KW - Temporal analysis
UR - https://www.scopus.com/pages/publications/105038895177
U2 - 10.7717/peerj-cs.3608
DO - 10.7717/peerj-cs.3608
M3 - Article
AN - SCOPUS:105038895177
SN - 2376-5992
VL - 12
JO - PeerJ Computer Science
JF - PeerJ Computer Science
M1 - e3608
ER -