TY - JOUR
T1 - Fast Cross-Platform Binary Code Similarity Detection Framework Based on CFGs Taking Advantage of NLP and Inductive GNN
AU - Peng, Jinxue
AU - Wang, Yong
AU - Xue, Jingfeng
AU - Liu, Zhenyan
N1 - Publisher Copyright: © 2015 Chinese Institute of Electronics.
PY - 2024/1/1
Y1 - 2024/1/1
AB - Cross-platform binary code similarity detection aims to determine whether two or more pieces of binary code are similar. Existing approaches that combine control flow graph (CFG)-based function representation with graph convolutional network (GCN)-based similarity analysis perform best. However, because of the large amount of convolutional computation and the loss of structural information, convolutional networks inevitably bring problems such as high overhead and occasional inaccuracy. To address these issues, we propose a fast cross-platform binary code similarity detection framework that uses natural language processing (NLP) for basic block embedding and an inductive graph neural network (GNN) for function representation, simulating the extraction of structural and temporal features. The node-centric, small-batch training of the GNN is well suited to large CFGs and can greatly reduce computational overhead. Various NLP basic block embedding models and GNNs are evaluated. Experimental results show that the scheme using long short-term memory (LSTM) for basic block embedding and inductive learning-based GraphSAGE (GAE) for function representation outperforms state-of-the-art works. Our framework takes only 45% overhead, improving efficiency significantly with a small performance trade-off.
KW - Binary code similarity detection
KW - Control flow graph
KW - Inductive graph neural network
KW - Natural language processing
UR - http://www.scopus.com/inward/record.url?scp=85184035544&partnerID=8YFLogxK
U2 - 10.23919/cje.2022.00.228
DO - 10.23919/cje.2022.00.228
M3 - Article
AN - SCOPUS:85184035544
SN - 1022-4653
VL - 33
SP - 128
EP - 138
JO - Chinese Journal of Electronics
JF - Chinese Journal of Electronics
IS - 1
ER -