TY - JOUR
T1 - IRC-CLVul
T2 - Cross-Programming-Language Vulnerability Detection with Intermediate Representations and Combined Features
AU - Lei, Tianwei
AU - Xue, Jingfeng
AU - Wang, Yong
AU - Liu, Zhenyan
N1 - Publisher Copyright:
© 2023 by the authors.
PY - 2023/7
Y1 - 2023/7
N2 - The most severe problem in cross-programming languages is feature extraction due to different tokens in different programming languages. To solve this problem, we propose a cross-programming-language vulnerability detection method in this paper, IRC-CLVul, based on intermediate representation and combined features. Specifically, we first converted programs in different programming languages into a unified LLVM intermediate representation (LLVM-IR) to provide a classification basis for different programming languages. Afterwards, we extracted the code sequences and control flow graphs of the samples, used the semantic model to extract the program semantic information and graph structure information, and concatenated them into semantic vectors. Finally, we used Random Forest to learn the concatenated semantic vectors and obtained the classification results. We conducted experiments on 85,811 samples from the Juliet test suite in C, C++, and Java. The results show that our method improved the accuracy by 7% compared with the two baseline algorithms, and the F1 score showed a 12% increase.
AB - The most severe problem in cross-programming languages is feature extraction due to different tokens in different programming languages. To solve this problem, we propose a cross-programming-language vulnerability detection method in this paper, IRC-CLVul, based on intermediate representation and combined features. Specifically, we first converted programs in different programming languages into a unified LLVM intermediate representation (LLVM-IR) to provide a classification basis for different programming languages. Afterwards, we extracted the code sequences and control flow graphs of the samples, used the semantic model to extract the program semantic information and graph structure information, and concatenated them into semantic vectors. Finally, we used Random Forest to learn the concatenated semantic vectors and obtained the classification results. We conducted experiments on 85,811 samples from the Juliet test suite in C, C++, and Java. The results show that our method improved the accuracy by 7% compared with the two baseline algorithms, and the F1 score showed a 12% increase.
KW - combined features
KW - cross-programming language
KW - intermediate representation
KW - software vulnerability detection
KW - source code vulnerability detection
UR - http://www.scopus.com/inward/record.url?scp=85166175772&partnerID=8YFLogxK
U2 - 10.3390/electronics12143067
DO - 10.3390/electronics12143067
M3 - Article
AN - SCOPUS:85166175772
SN - 2079-9292
VL - 12
JO - Electronics (Switzerland)
JF - Electronics (Switzerland)
IS - 14
M1 - 3067
ER -