FSD-CLCD: Functional semantic distillation graph learning for cross-language code clone detection

Linghao Zhang; Senlin Luo; Limin Pan; Zhouting Wu; Kun Gong

doi:10.1016/j.engappai.2024.108199

Linghao Zhang, Senlin Luo, Limin Pan^*, Zhouting Wu, Kun Gong

^*Corresponding author for this work

School of Information and Electronics

Beijing Institute of Technology

Research output: Contribution to journal › Article › peer-review

5 Citations (Scopus)

Abstract

Code clone detection finds identical or similar code snippets, which is important for analyzing homologous components, discovering redundant code, and improving the efficiency of software development and maintenance. A crucial challenge is to extract more functional semantic similarity from code under heterogeneous conditions, such as a cross-language scenario. Existing methods mainly use sequence models with only lexical and statistical features to compare code pairs; these models are susceptible to linguistic feature noise and misclassify code pairs that have similar structural dependencies such as control flow. Meanwhile, capturing structure-dependent features runs into inconsistent node types and large variation in node counts, which misaligns the distribution of clone pairs and weakens detection precision. This work presents a novel cross-language code clone detection method. It represents code as a graph based on abstract syntax trees and introduces a global node to strengthen the connection between control flows. It prunes the graph according to key-node protection rules to reduce the impact of linguistic feature noise. It also optimizes graph matching networks for cross-language abstract syntax trees, using a contrastive loss to align the functional semantic distribution of clone pairs. The method distills the invariant functional semantic similarity despite the large discrepancy between code graphs under heterogeneous cross-language conditions. Experimental results show that the proposed method achieves precision, recall, and F1-score of 0.95, 0.98, and 0.96, respectively, and substantially outperforms the state-of-the-art baselines.
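The graph representation described in the abstract (an abstract syntax tree plus a global node tied to control flow) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_ast_graph`, the `GLOBAL` label, and the choice of `If`, `For`, `While`, and `Try` as the control-flow node types are assumptions made here for illustration, and the sketch uses Python's standard `ast` module on Python source only, with no pruning or cross-language handling.

```python
import ast

# Assumption: these AST node types stand in for "control flow" nodes.
CONTROL_FLOW_TYPES = (ast.If, ast.For, ast.While, ast.Try)


def build_ast_graph(source):
    """Return (labels, edges) for the AST of `source` plus one global node.

    labels[i] is the type name of node i; edges are (parent, child) index
    pairs. The last node is a hypothetical global node linked to every
    control-flow node found in the tree.
    """
    tree = ast.parse(source)
    labels, edges, control_nodes = [], [], []

    def visit(node, parent_index):
        index = len(labels)
        labels.append(type(node).__name__)
        if parent_index is not None:
            edges.append((parent_index, index))
        if isinstance(node, CONTROL_FLOW_TYPES):
            control_nodes.append(index)
        for child in ast.iter_child_nodes(node):
            visit(child, index)

    visit(tree, None)

    # Add the global node and connect it to each control-flow node.
    global_index = len(labels)
    labels.append("GLOBAL")
    edges.extend((global_index, index) for index in control_nodes)
    return labels, edges


if __name__ == "__main__":
    example = (
        "def f(xs):\n"
        "    total = 0\n"
        "    for x in xs:\n"
        "        if x > 0:\n"
        "            total += x\n"
        "    return total\n"
    )
    labels, edges = build_ast_graph(example)
    print(len(labels), "nodes,", len(edges), "edges")
```

In this sketch the global node gives distant control-flow statements a two-hop path to each other, which is one plausible reading of "strengthen the connection between control flows"; the paper's actual edge construction may differ.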

Original language	English
Article number	108199
Journal	Engineering Applications of Artificial Intelligence
Volume	133
DOIs	https://doi.org/10.1016/j.engappai.2024.108199
Publication status	Published - Jul 2024

Keywords

Code clone detection
Contrastive learning
Cross-language
Graph similarity learning

Cite this

Zhang, L., Luo, S., Pan, L., Wu, Z., & Gong, K. (2024). FSD-CLCD: Functional semantic distillation graph learning for cross-language code clone detection. Engineering Applications of Artificial Intelligence, 133, Article 108199. https://doi.org/10.1016/j.engappai.2024.108199

@article{b4a11a439e1048249489b68456796ea9,

title = "{FSD-CLCD}: Functional semantic distillation graph learning for cross-language code clone detection",

abstract = "Code clone detection can find similar or the same code snippets, which is important in analyzing homologous components, discovering redundant code, and improving software system development and maintenance efficiency. A crucial challenge is to extract more functional semantic similarity from code in heterogeneous conditions, such as a cross-language scenario. Existing methods mainly exploit sequence models with only lexical and statistical features to compare code pairs, which are susceptible to linguistic feature noise and misclassify code pairs that have similar structure dependencies such as control flow. Meanwhile, there are issues with inconsistent node types and a great variation of node numbers while capturing structure-dependent features, resulting in a misaligned distribution of clone pairs, and weakening the detection precision. This work presents a novel cross-language code clone detection method. It represents code with a graph structure based on abstract syntax trees and introduces a global node to strengthen the connection between control flows. Prune the graph structure based on key node protection rules to reduce the impact of linguistic feature noise. Besides, optimize graph matching networks for cross-language abstract syntax trees by using contrastive loss to align the functional semantic distribution of clone pairs. The method distills the invariant functional semantic similarity with a huge discrepancy of the code graph in heterogeneous cross-language conditions. Experiment results show that the proposed method achieves scores of 0.95, 0.98, and 0.96 in terms of precision, recall and F1-score and substantially outperforms the state-of-the-art baselines.",

keywords = "Code clone detection, Contrastive learning, Cross-language, Graph similarity learning",

author = "Linghao Zhang and Senlin Luo and Limin Pan and Zhouting Wu and Kun Gong",

note = "Publisher Copyright: {\textcopyright} 2024 Elsevier Ltd",

year = "2024",

month = jul,

doi = "10.1016/j.engappai.2024.108199",

language = "English",

volume = "133",

pages = "108199",

journal = "Engineering Applications of Artificial Intelligence",

issn = "0952-1976",

publisher = "Elsevier Ltd.",

}

TY - JOUR

T1 - FSD-CLCD: Functional semantic distillation graph learning for cross-language code clone detection

AU - Zhang, Linghao

AU - Luo, Senlin

AU - Pan, Limin

AU - Wu, Zhouting

AU - Gong, Kun

PY - 2024/7

Y1 - 2024/7

N2 - Code clone detection can find similar or the same code snippets, which is important in analyzing homologous components, discovering redundant code, and improving software system development and maintenance efficiency. A crucial challenge is to extract more functional semantic similarity from code in heterogeneous conditions, such as a cross-language scenario. Existing methods mainly exploit sequence models with only lexical and statistical features to compare code pairs, which are susceptible to linguistic feature noise and misclassify code pairs that have similar structure dependencies such as control flow. Meanwhile, there are issues with inconsistent node types and a great variation of node numbers while capturing structure-dependent features, resulting in a misaligned distribution of clone pairs, and weakening the detection precision. This work presents a novel cross-language code clone detection method. It represents code with a graph structure based on abstract syntax trees and introduces a global node to strengthen the connection between control flows. Prune the graph structure based on key node protection rules to reduce the impact of linguistic feature noise. Besides, optimize graph matching networks for cross-language abstract syntax trees by using contrastive loss to align the functional semantic distribution of clone pairs. The method distills the invariant functional semantic similarity with a huge discrepancy of the code graph in heterogeneous cross-language conditions. Experiment results show that the proposed method achieves scores of 0.95, 0.98, and 0.96 in terms of precision, recall and F1-score and substantially outperforms the state-of-the-art baselines.

KW - Code clone detection

KW - Contrastive learning

KW - Cross-language

KW - Graph similarity learning

UR - http://www.scopus.com/inward/record.url?scp=85188610938&partnerID=8YFLogxK

U2 - 10.1016/j.engappai.2024.108199

DO - 10.1016/j.engappai.2024.108199

M3 - Article

AN - SCOPUS:85188610938

SN - 0952-1976

VL - 133

JO - Engineering Applications of Artificial Intelligence

JF - Engineering Applications of Artificial Intelligence

M1 - 108199

ER -
