TY - JOUR
T1 - BinDeep
T2 - A deep learning approach to binary code similarity detection
AU - Tian, Donghai
AU - Jia, Xiaoqi
AU - Ma, Rui
AU - Liu, Shuke
AU - Liu, Wenjing
AU - Hu, Changzhen
N1 - Publisher Copyright: © 2020 Elsevier Ltd
PY - 2021/4/15
Y1 - 2021/4/15
N2 - Binary code similarity detection (BCSD) plays an important role in malware analysis and vulnerability discovery. Existing methods mainly rely on expert knowledge for BCSD, which may not be reliable in some cases. More importantly, the detection accuracy (or performance) of these methods is not satisfactory. To address these issues, we propose BinDeep, a deep learning approach for binary code similarity detection. This method first extracts the instruction sequence from the binary function and then uses an instruction embedding model to vectorize the instruction features. Next, BinDeep applies a Recurrent Neural Network (RNN) deep learning model to identify the specific types of the two functions for later comparison. According to the type information, BinDeep selects the corresponding deep learning model for similarity comparison. Specifically, BinDeep uses Siamese neural networks, which combine LSTM and CNN, to measure the similarity of two target functions. Unlike traditional deep learning models, our hybrid model takes advantage of both CNN spatial structure learning and LSTM sequence learning. The evaluation shows that our approach achieves good BCSD results on cross-architecture, cross-compiler, cross-optimization, and cross-version binary code.
AB - Binary code similarity detection (BCSD) plays an important role in malware analysis and vulnerability discovery. Existing methods mainly rely on expert knowledge for BCSD, which may not be reliable in some cases. More importantly, the detection accuracy (or performance) of these methods is not satisfactory. To address these issues, we propose BinDeep, a deep learning approach for binary code similarity detection. This method first extracts the instruction sequence from the binary function and then uses an instruction embedding model to vectorize the instruction features. Next, BinDeep applies a Recurrent Neural Network (RNN) deep learning model to identify the specific types of the two functions for later comparison. According to the type information, BinDeep selects the corresponding deep learning model for similarity comparison. Specifically, BinDeep uses Siamese neural networks, which combine LSTM and CNN, to measure the similarity of two target functions. Unlike traditional deep learning models, our hybrid model takes advantage of both CNN spatial structure learning and LSTM sequence learning. The evaluation shows that our approach achieves good BCSD results on cross-architecture, cross-compiler, cross-optimization, and cross-version binary code.
KW - Binary code
KW - CNN
KW - Deep learning
KW - LSTM
KW - Siamese neural network
KW - Similarity comparison
UR - http://www.scopus.com/inward/record.url?scp=85097571806&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2020.114348
DO - 10.1016/j.eswa.2020.114348
M3 - Article
AN - SCOPUS:85097571806
SN - 0957-4174
VL - 168
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 114348
ER -