IRC-CLVul: Cross-Programming-Language Vulnerability Detection with Intermediate Representations and Combined Features

Tianwei Lei; Jingfeng Xue; Yong Wang; Zhenyan Liu

doi:10.3390/electronics12143067

Tianwei Lei, Jingfeng Xue, Yong Wang, Zhenyan Liu^*

^*Corresponding author for this work

School of Computer Science

Beijing Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

The most severe problem in cross-programming languages is feature extraction due to different tokens in different programming languages. To solve this problem, we propose a cross-programming-language vulnerability detection method in this paper, IRC-CLVul, based on intermediate representation and combined features. Specifically, we first converted programs in different programming languages into a unified LLVM intermediate representation (LLVM-IR) to provide a classification basis for different programming languages. Afterwards, we extracted the code sequences and control flow graphs of the samples, used the semantic model to extract the program semantic information and graph structure information, and concatenated them into semantic vectors. Finally, we used Random Forest to learn the concatenated semantic vectors and obtained the classification results. We conducted experiments on 85,811 samples from the Juliet test suite in C, C++, and Java. The results show that our method improved the accuracy by 7% compared with the two baseline algorithms, and the F1 score showed a 12% increase.
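The feature-combination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: opcode counts and CFG size statistics stand in for the unspecified semantic model, the Random Forest stage is omitted, and every name below (OPCODES, sequence_features, graph_features, combined_vector) is hypothetical.

```python
import re
from collections import Counter

# Hypothetical fixed vocabulary of LLVM-IR opcodes (an assumption, not from the paper).
OPCODES = ["alloca", "load", "store", "call", "br", "ret", "icmp", "getelementptr"]


def sequence_features(ir_text):
    """Count vocabulary opcodes in LLVM-IR text.

    Crude tokenisation: any alphabetic word is a token, so an identifier
    that happens to be named like an opcode would also be counted.
    """
    counts = Counter(re.findall(r"[A-Za-z_]+", ir_text))
    return [counts[op] for op in OPCODES]


def graph_features(cfg):
    """Summarise a control flow graph given as {block: [successor blocks]}."""
    n_blocks = len(cfg)
    n_edges = sum(len(successors) for successors in cfg.values())
    return [n_blocks, n_edges]


def combined_vector(ir_text, cfg):
    """Concatenate sequence features and graph features into one vector."""
    return sequence_features(ir_text) + graph_features(cfg)


# Toy LLVM-IR function and its hand-written CFG (illustrative data only).
ir = """
define i32 @f(i32 %x) {
entry:
  %cmp = icmp sgt i32 %x, 0
  br i1 %cmp, label %pos, label %neg
pos:
  ret i32 1
neg:
  ret i32 0
}
"""
cfg = {"entry": ["pos", "neg"], "pos": [], "neg": []}

vector = combined_vector(ir, cfg)
print(vector)
```

Because the vector is built from LLVM-IR rather than source tokens, the same code path would apply to IR produced from C, C++, or Java front ends; the resulting vectors could then be passed to any classifier, such as a Random Forest.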

Original language	English
Article number	3067
Journal	Electronics (Switzerland)
Volume	12
Issue number	14
DOI	https://doi.org/10.3390/electronics12143067
Publication status	Published - Jul 2023

Cite this

Lei, T., Xue, J., Wang, Y., & Liu, Z. (2023). IRC-CLVul: Cross-Programming-Language Vulnerability Detection with Intermediate Representations and Combined Features. Electronics (Switzerland), 12(14), Article 3067. https://doi.org/10.3390/electronics12143067

@article{de35ea16d77e435aade8fb964ff20943,

title = "IRC-CLVul: Cross-Programming-Language Vulnerability Detection with Intermediate Representations and Combined Features",

abstract = "The most severe problem in cross-programming languages is feature extraction due to different tokens in different programming languages. To solve this problem, we propose a cross-programming-language vulnerability detection method in this paper, IRC-CLVul, based on intermediate representation and combined features. Specifically, we first converted programs in different programming languages into a unified LLVM intermediate representation (LLVM-IR) to provide a classification basis for different programming languages. Afterwards, we extracted the code sequences and control flow graphs of the samples, used the semantic model to extract the program semantic information and graph structure information, and concatenated them into semantic vectors. Finally, we used Random Forest to learn the concatenated semantic vectors and obtained the classification results. We conducted experiments on 85,811 samples from the Juliet test suite in C, C++, and Java. The results show that our method improved the accuracy by 7% compared with the two baseline algorithms, and the F1 score showed a 12% increase.",

keywords = "combined features, cross-programming language, intermediate representation, software vulnerability detection, source code vulnerability detection",

author = "Tianwei Lei and Jingfeng Xue and Yong Wang and Zhenyan Liu",

note = "Publisher Copyright: {\textcopyright} 2023 by the authors.",

year = "2023",

month = jul,

doi = "10.3390/electronics12143067",

language = "English",

volume = "12",

journal = "Electronics (Switzerland)",

issn = "2079-9292",

publisher = "Multidisciplinary Digital Publishing Institute (MDPI)",

number = "14",

}

TY - JOUR

T1 - IRC-CLVul

T2 - Cross-Programming-Language Vulnerability Detection with Intermediate Representations and Combined Features

AU - Lei, Tianwei

AU - Xue, Jingfeng

AU - Wang, Yong

AU - Liu, Zhenyan

PY - 2023/7

Y1 - 2023/7

N2 - The most severe problem in cross-programming languages is feature extraction due to different tokens in different programming languages. To solve this problem, we propose a cross-programming-language vulnerability detection method in this paper, IRC-CLVul, based on intermediate representation and combined features. Specifically, we first converted programs in different programming languages into a unified LLVM intermediate representation (LLVM-IR) to provide a classification basis for different programming languages. Afterwards, we extracted the code sequences and control flow graphs of the samples, used the semantic model to extract the program semantic information and graph structure information, and concatenated them into semantic vectors. Finally, we used Random Forest to learn the concatenated semantic vectors and obtained the classification results. We conducted experiments on 85,811 samples from the Juliet test suite in C, C++, and Java. The results show that our method improved the accuracy by 7% compared with the two baseline algorithms, and the F1 score showed a 12% increase.

AB - The most severe problem in cross-programming languages is feature extraction due to different tokens in different programming languages. To solve this problem, we propose a cross-programming-language vulnerability detection method in this paper, IRC-CLVul, based on intermediate representation and combined features. Specifically, we first converted programs in different programming languages into a unified LLVM intermediate representation (LLVM-IR) to provide a classification basis for different programming languages. Afterwards, we extracted the code sequences and control flow graphs of the samples, used the semantic model to extract the program semantic information and graph structure information, and concatenated them into semantic vectors. Finally, we used Random Forest to learn the concatenated semantic vectors and obtained the classification results. We conducted experiments on 85,811 samples from the Juliet test suite in C, C++, and Java. The results show that our method improved the accuracy by 7% compared with the two baseline algorithms, and the F1 score showed a 12% increase.

KW - combined features

KW - cross-programming language

KW - intermediate representation

KW - software vulnerability detection

KW - source code vulnerability detection

UR - http://www.scopus.com/inward/record.url?scp=85166175772&partnerID=8YFLogxK

U2 - 10.3390/electronics12143067

DO - 10.3390/electronics12143067

M3 - Article

AN - SCOPUS:85166175772

SN - 2079-9292

VL - 12

JO - Electronics (Switzerland)

JF - Electronics (Switzerland)

IS - 14

M1 - 3067

ER -
