TY - GEN
T1 - RL4Mal
T2 - 2nd International Conference on Computer Network and Cloud Computing, CNCC 2025
AU - Wang, Liuting
AU - Xue, Jingfeng
AU - Wen, Haoyuan
AU - Wang, Yong
AU - Zhang, Ji
AU - Liu, Zhenyan
N1 - Publisher Copyright:
© 2025 Copyright held by the owner/author(s).
PY - 2025/7/22
Y1 - 2025/7/22
AB - The increasing volume and variants of malware are posing substantial risks to the security of personal privacy and property, which gradually makes malware classification a hot research topic in machine learning. Compared with traditional malware classification methods requiring much prior knowledge to explicitly construct feature representation, CNN-based methods have more strength for implicitly exploring deep representation and achieved better performance. However, there is an obvious long-tailed distribution problem in existing malware datasets, i.e., a small fraction of families occupies most of the samples, while the sample distribution of most families is relatively sparse. The long-tailed distribution will lead to poor generalization performance on families with fewer sample numbers in conventional CNN-based malware classification. In this paper, we propose a malware classification method based on representation learning, called RL4Mal, which utilizes metric learning for better generalization on the tail families while keeping performance on the head families. Besides designing a weighted metric loss, we also introduce data augmentation and a memory bank to expand negative samples for the loss optimization, further promoting a better construction of the representation space for malware classification. We conduct sufficient experiments on the Malimg dataset, proving that RL4Mal effectively solves the long-tailed problem with 99.47% accuracy in tail families while keeping 99.78% accuracy on average.
KW - Long-tailed distribution
KW - Malware family classification
KW - Representation learning
UR - https://www.scopus.com/pages/publications/105013623206
U2 - 10.1145/3744451.3744458
DO - 10.1145/3744451.3744458
M3 - Conference contribution
AN - SCOPUS:105013623206
T3 - Proceedings of the 2025 2nd International Conference on Computer Network and Cloud Computing, CNCC 2025
SP - 41
EP - 51
BT - Proceedings of the 2025 2nd International Conference on Computer Network and Cloud Computing, CNCC 2025
PB - Association for Computing Machinery, Inc
Y2 - 11 April 2025 through 13 April 2025
ER -