Research on malicious code analysis method based on semi-supervised learning

Tingting He; Jingfeng Xue; Jianwen Fu; Yong Wang; Chun Shan

doi:10.1007/978-981-10-7080-8_17

Tingting He, Jingfeng Xue, Jianwen Fu, Yong Wang^*, Chun Shan

^*Corresponding author for this work

School of Computer Science and Technology

Beijing Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

Research on malicious code classification helps researchers understand attack characteristics quickly and helps reduce losses to users and even to states. Most current malware classification methods are based on supervised learning algorithms, which perform poorly when only a small number of labeled samples is available. In this paper, we therefore propose a new malware classification method based on a semi-supervised learning algorithm. First, we extract impactful static and dynamic features and serialize them to obtain high-dimensional features. Then, we select features with an Ensemble Feature Grader consisting of Information Gain, Random Forest, and Logistic Regression with L₁ and L₂ regularization, and reduce the dimension further with PCA. Finally, we use the Learning with Local and Global Consistency (LLGC) algorithm combined with K-means to classify malware. Experiments comparing SVM, LLGC, and K-means + LLGC show that, with this feature extraction, feature reduction, and classification method, K-means + LLGC outperforms LLGC in both classification accuracy and efficiency, improving accuracy by 2% to 3%, and is more accurate than SVM when the number of labeled samples is small.
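To make the semi-supervised step concrete, the following is a minimal sketch of plain LLGC label propagation as it is commonly formulated (Gaussian affinities, symmetric normalization, and the iteration F ← αSF + (1 − α)Y). It is not the authors' implementation: the abstract does not say how K-means is combined with LLGC, so that step is omitted, and the function name `llgc`, the parameters `alpha`, `sigma`, and `n_iter`, and the toy data are all assumptions made for illustration.

```python
import math


def llgc(points, labels, n_classes, alpha=0.99, sigma=1.0, n_iter=200):
    """Propagate a few known labels to all points, LLGC style.

    points: list of feature vectors (lists of floats).
    labels: class index for labeled points, None for unlabeled ones.
    Returns one predicted class index per point.
    """
    n = len(points)
    # Gaussian affinity matrix W with a zero diagonal.
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                W[i][j] = math.exp(-d2 / (2 * sigma ** 2))
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    deg = [sum(row) for row in W]
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    # Y holds one-hot rows for labeled points and zero rows otherwise.
    Y = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
         for i in range(n)]
    F = [row[:] for row in Y]
    # Fixed number of propagation steps: F <- alpha * S F + (1 - alpha) * Y.
    for _ in range(n_iter):
        F = [[alpha * sum(S[i][k] * F[k][c] for k in range(n))
              + (1 - alpha) * Y[i][c]
              for c in range(n_classes)]
             for i in range(n)]
    # Each point takes the class with the largest propagated score.
    return [max(range(n_classes), key=lambda c: F[i][c]) for i in range(n)]


# Hypothetical toy data: two well-separated groups, one labeled point each.
toy_points = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
              [5.0, 5.0], [5.1, 4.8], [4.9, 5.2], [5.2, 5.1]]
toy_labels = [0, None, None, None, 1, None, None, None]
print(llgc(toy_points, toy_labels, n_classes=2))
```

With only one labeled sample per group, the unlabeled points inherit the label of the seed they are densely connected to, which illustrates why a method of this kind can remain usable when labeled samples are scarce.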

Original language	English
Title of host publication	Trusted Computing and Information Security - 11th Chinese Conference, CTCIS 2017, Proceedings
Editors	Fei Yan, Ming Xu, Shaojing Fu, Zheng Qin
Publisher	Springer Verlag
Pages	227-241
Number of pages	15
ISBN (Print)	9789811070792
DOIs	https://doi.org/10.1007/978-981-10-7080-8_17
Publication status	Published - 2017
Event	11th Chinese Conference on Trusted Computing and Information Security, CTCIS 2017 - Changsha, China Duration: 14 Sept 2017 → 17 Sept 2017

Publication series

Name	Communications in Computer and Information Science
Volume	704
ISSN (Print)	1865-0929

Conference

Conference	11th Chinese Conference on Trusted Computing and Information Security, CTCIS 2017
Country/Territory	China
City	Changsha
Period	14/09/17 → 17/09/17

Keywords

Feature processing
K-means LLGC
Malicious code

Access to Document

10.1007/978-981-10-7080-8_17

Cite this

He, T., Xue, J., Fu, J., Wang, Y., & Shan, C. (2017). Research on malicious code analysis method based on semi-supervised learning. In F. Yan, M. Xu, S. Fu, & Z. Qin (Eds.), Trusted Computing and Information Security - 11th Chinese Conference, CTCIS 2017, Proceedings (pp. 227-241). (Communications in Computer and Information Science; Vol. 704). Springer Verlag. https://doi.org/10.1007/978-981-10-7080-8_17

@inproceedings{40d01e83c9a04230b8f5644d6240e339,

title = "Research on malicious code analysis method based on semi-supervised learning",

keywords = "Feature processing, K-means LLGC, Malicious code",

author = "Tingting He and Jingfeng Xue and Jianwen Fu and Yong Wang and Chun Shan",

note = "Publisher Copyright: {\textcopyright} Springer Nature Singapore Pte Ltd. 2017.; 11th Chinese Conference on Trusted Computing and Information Security, CTCIS 2017 ; Conference date: 14-09-2017 Through 17-09-2017",

year = "2017",

doi = "10.1007/978-981-10-7080-8_17",

language = "English",

isbn = "9789811070792",

series = "Communications in Computer and Information Science",

publisher = "Springer Verlag",

pages = "227--241",

editor = "Fei Yan and Ming Xu and Shaojing Fu and Zheng Qin",

booktitle = "Trusted Computing and Information Security - 11th Chinese Conference, CTCIS 2017, Proceedings",

address = "Germany",

}

He, T, Xue, J, Fu, J, Wang, Y & Shan, C 2017, Research on malicious code analysis method based on semi-supervised learning. in F Yan, M Xu, S Fu & Z Qin (eds), Trusted Computing and Information Security - 11th Chinese Conference, CTCIS 2017, Proceedings. Communications in Computer and Information Science, vol. 704, Springer Verlag, pp. 227-241, 11th Chinese Conference on Trusted Computing and Information Security, CTCIS 2017, Changsha, China, 14/09/17. https://doi.org/10.1007/978-981-10-7080-8_17

Research on malicious code analysis method based on semi-supervised learning. / He, Tingting; Xue, Jingfeng; Fu, Jianwen et al.
Trusted Computing and Information Security - 11th Chinese Conference, CTCIS 2017, Proceedings. ed. / Fei Yan; Ming Xu; Shaojing Fu; Zheng Qin. Springer Verlag, 2017. p. 227-241 (Communications in Computer and Information Science; Vol. 704).

TY - GEN

T1 - Research on malicious code analysis method based on semi-supervised learning

AU - He, Tingting

AU - Xue, Jingfeng

AU - Fu, Jianwen

AU - Wang, Yong

AU - Shan, Chun

N1 - Publisher Copyright: © Springer Nature Singapore Pte Ltd. 2017.

PY - 2017

Y1 - 2017

KW - Feature processing

KW - K-means LLGC

KW - Malicious code

UR - http://www.scopus.com/inward/record.url?scp=85036466164&partnerID=8YFLogxK

U2 - 10.1007/978-981-10-7080-8_17

DO - 10.1007/978-981-10-7080-8_17

M3 - Conference contribution

AN - SCOPUS:85036466164

SN - 9789811070792

T3 - Communications in Computer and Information Science

SP - 227

EP - 241

BT - Trusted Computing and Information Security - 11th Chinese Conference, CTCIS 2017, Proceedings

A2 - Yan, Fei

A2 - Xu, Ming

A2 - Fu, Shaojing

A2 - Qin, Zheng

PB - Springer Verlag

T2 - 11th Chinese Conference on Trusted Computing and Information Security, CTCIS 2017

Y2 - 14 September 2017 through 17 September 2017

ER -
