TY - GEN
T1 - Research on malicious code analysis method based on semi-supervised learning
AU - He, Tingting
AU - Xue, Jingfeng
AU - Fu, Jianwen
AU - Wang, Yong
AU - Shan, Chun
N1 - Publisher Copyright: © Springer Nature Singapore Pte Ltd. 2017.
PY - 2017
Y1 - 2017
N2 - Research on malicious code classification helps researchers understand attack characteristics quickly and helps reduce losses to users and even to states. Currently, most malware classification methods are based on supervised learning algorithms, which perform poorly when only a small number of labeled samples is available. In this paper, we therefore propose a new malware classification method based on a semi-supervised learning algorithm. First, we extract the most impactful static and dynamic features and serialize them to obtain high-dimensional feature vectors. Then, we select features with an Ensemble Feature Grader that combines Information Gain, Random Forest, and Logistic Regression with L1 and L2 regularization, and reduce the dimension further with PCA. Finally, we use the Learning with Local and Global Consistency (LLGC) algorithm combined with K-means to classify malware samples. Experiments comparing SVM, LLGC, and K-means + LLGC show that, with the proposed feature extraction, feature reduction, and classification method, K-means + LLGC outperforms LLGC in both classification accuracy and efficiency, improving accuracy by 2% to 3%, and is more accurate than SVM when the number of labeled samples is small.
KW - Feature processing
KW - K-means LLGC
KW - Malicious code
UR - http://www.scopus.com/inward/record.url?scp=85036466164&partnerID=8YFLogxK
U2 - 10.1007/978-981-10-7080-8_17
DO - 10.1007/978-981-10-7080-8_17
M3 - Conference contribution
AN - SCOPUS:85036466164
SN - 9789811070792
T3 - Communications in Computer and Information Science
SP - 227
EP - 241
BT - Trusted Computing and Information Security - 11th Chinese Conference, CTCIS 2017, Proceedings
A2 - Yan, Fei
A2 - Xu, Ming
A2 - Fu, Shaojing
A2 - Qin, Zheng
PB - Springer Verlag
T2 - 11th Chinese Conference on Trusted Computing and Information Security, CTCIS 2017
Y2 - 14 September 2017 through 17 September 2017
ER -