Boosting training for PDF malware classifier via active learning

Yuanzhang Li; Xinxin Wang; Zhiwei Shi; Ruyun Zhang; Jingfeng Xue; Zhi Wang

doi:10.1002/int.22451

Corresponding authors: Jingfeng Xue, Zhi Wang

School of Computer Science

Research output: Contribution to journal › Article › peer-review

23 Citations (Scopus)

Abstract

Machine learning algorithms are widely used in cybersecurity applications, including spam and malware detection. In these applications, the machine learning model must withstand attacks by adversarial samples. Training a robust machine learning model from a small number of samples is therefore an active research problem. Portable Document Format (PDF) is a widely used file format that is often exploited as a vehicle for malicious behavior. Various machine-learning-based PDF malware detectors have been proposed. However, labeling large-scale data samples is time-consuming and laborious. This paper aims to reduce the size of the training set while maintaining detection performance. We propose a novel PDF malware detection method that uses active learning to boost training. Specifically, we first define what uncertain samples mean in this paper and theoretically explain why these uncertain samples are effective for malware detection. Second, we present an active-learning-based malware detection model that uses mutual agreement analysis to select uncertain samples for data augmentation. The detector is retrained on the ground truth of the uncertain samples rather than on all test samples from the previous epoch, which not only improves detection performance but also reduces the detector's training time. We conduct 10 epochs of retraining experiments for comparison, using the uncertain samples and all test samples from the previous epoch, respectively, to augment the training set. The experimental results show that our active-learning-based model achieves the same performance as the traditional model in the tenth epoch of retraining while using only one thirtieth of the latter's training samples.
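The selection step described in the abstract can be illustrated with a minimal sketch. It assumes one common formulation of ensemble mutual agreement, |v - 0.5| * 2, where v is the fraction of committee votes for "malicious"; the abstract does not give the paper's exact definition, so this formula, the toy committee, the two-feature samples, the 0.6 threshold, and the `oracle` labeler are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# A classifier maps a feature vector to 1 (malicious) or 0 (benign).
Classifier = Callable[[Sequence[float]], int]


def vote_fraction(committee: List[Classifier], x: Sequence[float]) -> float:
    """Fraction of committee members that label x as malicious."""
    return sum(clf(x) for clf in committee) / len(committee)


def mutual_agreement(fraction: float) -> float:
    """Map a vote fraction in [0, 1] to an agreement score in [0, 1].

    1.0 means the committee is unanimous; 0.0 means an even split.
    (Assumed formulation, not taken from the paper.)
    """
    return abs(fraction - 0.5) * 2


def select_uncertain(committee: List[Classifier],
                     samples: Sequence[Sequence[float]],
                     threshold: float = 0.6) -> List[int]:
    """Indices of samples whose agreement score falls below the threshold."""
    return [i for i, x in enumerate(samples)
            if mutual_agreement(vote_fraction(committee, x)) < threshold]


def augment_training_set(train: List[Tuple[Sequence[float], int]],
                         samples: Sequence[Sequence[float]],
                         oracle: Classifier,
                         uncertain_idx: List[int]):
    """Add only the uncertain samples, labeled by the oracle, to the training set."""
    return train + [(samples[i], oracle(samples[i])) for i in uncertain_idx]


# Toy committee: each member thresholds hypothetical features.
committee: List[Classifier] = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(x[0] + x[1] > 1.0),
    lambda x: int(x[0] > 0.8),
]

samples = [(0.9, 0.9), (0.1, 0.1), (0.6, 0.2), (0.3, 0.9)]
uncertain = select_uncertain(committee, samples)

# Hypothetical ground-truth labeler standing in for manual analysis.
oracle: Classifier = lambda x: int(x[1] > 0.5)
train = augment_training_set([], samples, oracle, uncertain)
```

In this toy run the committee is unanimous on the first two samples and split on the last two, so only the split samples would be sent for labeling and added to the training set before the next retraining epoch.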

Original language	English
Pages (from-to)	2803-2821
Number of pages	19
Journal	International Journal of Intelligent Systems
Volume	37
Issue number	4
DOI	https://doi.org/10.1002/int.22451
Publication status	Published - Apr 2022

Cite this

@article{aa672efe31aa492ca4de133b228b3209,

title = "Boosting training for PDF malware classifier via active learning",

abstract = "Machine learning algorithms are widely used in cybersecurity applications, including spam and malware detection. In these applications, the machine learning model must withstand attacks by adversarial samples. Training a robust machine learning model from a small number of samples is therefore an active research problem. Portable Document Format (PDF) is a widely used file format that is often exploited as a vehicle for malicious behavior. Various machine-learning-based PDF malware detectors have been proposed. However, labeling large-scale data samples is time-consuming and laborious. This paper aims to reduce the size of the training set while maintaining detection performance. We propose a novel PDF malware detection method that uses active learning to boost training. Specifically, we first define what uncertain samples mean in this paper and theoretically explain why these uncertain samples are effective for malware detection. Second, we present an active-learning-based malware detection model that uses mutual agreement analysis to select uncertain samples for data augmentation. The detector is retrained on the ground truth of the uncertain samples rather than on all test samples from the previous epoch, which not only improves detection performance but also reduces the detector's training time. We conduct 10 epochs of retraining experiments for comparison, using the uncertain samples and all test samples from the previous epoch, respectively, to augment the training set. The experimental results show that our active-learning-based model achieves the same performance as the traditional model in the tenth epoch of retraining while using only one thirtieth of the latter's training samples.",

keywords = "PDF, active learning, machine learning, malware detection",

author = "Yuanzhang Li and Xinxin Wang and Zhiwei Shi and Ruyun Zhang and Jingfeng Xue and Zhi Wang",

note = "Publisher Copyright: {\textcopyright} 2021 Wiley Periodicals LLC",

year = "2022",

month = apr,

doi = "10.1002/int.22451",

language = "English",

volume = "37",

pages = "2803--2821",

journal = "International Journal of Intelligent Systems",

issn = "0884-8173",

publisher = "John Wiley and Sons Inc.",

number = "4",

}

TY - JOUR

T1 - Boosting training for PDF malware classifier via active learning

AU - Li, Yuanzhang

AU - Wang, Xinxin

AU - Shi, Zhiwei

AU - Zhang, Ruyun

AU - Xue, Jingfeng

AU - Wang, Zhi

PY - 2022/4

Y1 - 2022/4

N2 - Machine learning algorithms are widely used in cybersecurity applications, including spam and malware detection. In these applications, the machine learning model must withstand attacks by adversarial samples. Training a robust machine learning model from a small number of samples is therefore an active research problem. Portable Document Format (PDF) is a widely used file format that is often exploited as a vehicle for malicious behavior. Various machine-learning-based PDF malware detectors have been proposed. However, labeling large-scale data samples is time-consuming and laborious. This paper aims to reduce the size of the training set while maintaining detection performance. We propose a novel PDF malware detection method that uses active learning to boost training. Specifically, we first define what uncertain samples mean in this paper and theoretically explain why these uncertain samples are effective for malware detection. Second, we present an active-learning-based malware detection model that uses mutual agreement analysis to select uncertain samples for data augmentation. The detector is retrained on the ground truth of the uncertain samples rather than on all test samples from the previous epoch, which not only improves detection performance but also reduces the detector's training time. We conduct 10 epochs of retraining experiments for comparison, using the uncertain samples and all test samples from the previous epoch, respectively, to augment the training set. The experimental results show that our active-learning-based model achieves the same performance as the traditional model in the tenth epoch of retraining while using only one thirtieth of the latter's training samples.

KW - PDF

KW - active learning

KW - machine learning

KW - malware detection

UR - http://www.scopus.com/inward/record.url?scp=85105707006&partnerID=8YFLogxK

U2 - 10.1002/int.22451

DO - 10.1002/int.22451

M3 - Article

AN - SCOPUS:85105707006

SN - 0884-8173

VL - 37

SP - 2803

EP - 2821

JO - International Journal of Intelligent Systems

JF - International Journal of Intelligent Systems

IS - 4

ER -
