RL4Mal: Representation learning-based malware classification under long-tailed distribution

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

Abstract

The growing volume and variety of malware pose substantial risks to personal privacy and property, which has made malware classification an active research topic in machine learning. Traditional malware classification methods require considerable prior knowledge to construct feature representations explicitly, whereas CNN-based methods learn deep representations implicitly and achieve better performance. However, existing malware datasets exhibit a pronounced long-tailed distribution: a small fraction of families accounts for most of the samples, while most families have relatively few. In conventional CNN-based malware classification, this distribution leads to poor generalization on families with few samples. In this paper, we propose RL4Mal, a malware classification method based on representation learning that uses metric learning to generalize better on the tail families while maintaining performance on the head families. In addition to designing a weighted metric loss, we introduce data augmentation and a memory bank to expand the set of negative samples used in optimizing the loss, which further improves the construction of the representation space for malware classification. Extensive experiments on the Malimg dataset show that RL4Mal effectively addresses the long-tailed problem, achieving 99.47% accuracy on tail families while maintaining 99.78% accuracy on average.
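The abstract does not give the form of the weighted metric loss or the memory bank, so the following is only a minimal sketch of the general idea, not the RL4Mal implementation. It assumes an InfoNCE-style contrastive loss, inverse-frequency family weights, and a memory bank that contributes negatives only; the names `family_weights`, `weighted_metric_loss`, and `temperature` are hypothetical. Data augmentation and the CNN encoder are omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def family_weights(labels, num_families):
    """Inverse-frequency weights (assumed scheme), averaging to 1 over observed families."""
    counts = np.bincount(labels, minlength=num_families).astype(float)
    weights = np.zeros(num_families)
    seen = counts > 0
    weights[seen] = 1.0 / counts[seen]
    weights[seen] *= seen.sum() / weights[seen].sum()
    return weights


def weighted_metric_loss(emb, labels, bank_emb, bank_labels, weights, temperature=0.1):
    """Family-weighted contrastive loss with extra negatives from a memory bank.

    Positives are same-family samples in the batch. Negatives are other-family
    samples in the batch plus other-family entries stored in the bank.
    """
    emb = l2_normalize(np.asarray(emb, dtype=float))
    bank_emb = l2_normalize(np.asarray(bank_emb, dtype=float))
    labels = np.asarray(labels)
    bank_labels = np.asarray(bank_labels)
    n = emb.shape[0]

    batch_logits = emb @ emb.T / temperature
    np.fill_diagonal(batch_logits, -np.inf)  # an anchor is never paired with itself

    bank_logits = emb @ bank_emb.T / temperature
    same_family_in_bank = labels[:, None] == bank_labels[None, :]
    bank_logits[same_family_in_bank] = -np.inf  # assumption: bank supplies negatives only

    # Numerically stable log-softmax over all candidates for each anchor.
    logits = np.concatenate([batch_logits, bank_logits], axis=1)
    row_max = logits.max(axis=1, keepdims=True)
    log_prob = logits - row_max - np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))

    pos_mask = labels[:, None] == labels[None, :]
    np.fill_diagonal(pos_mask, False)
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0  # anchors with at least one in-batch positive
    if not valid.any():
        return 0.0

    pos_log_prob = np.where(pos_mask, log_prob[:, :n], 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[valid] / pos_count[valid]
    w = weights[labels[valid]]  # tail families receive larger weights
    return float((w * per_anchor).sum() / w.sum())
```

Under these assumptions, a batch whose embeddings cluster by family yields a loss near zero, while a batch whose embeddings are mixed across families yields a large loss, and errors on sparsely populated families count for more in the weighted average.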

Original language: English
Title of host publication: Proceedings of the 2025 2nd International Conference on Computer Network and Cloud Computing, CNCC 2025
Publisher: Association for Computing Machinery, Inc
Pages: 41-51
Number of pages: 11
ISBN (Electronic): 9798400714061
Publication status: Published - 22 Jul 2025
Externally published: Yes
Event: 2nd International Conference on Computer Network and Cloud Computing, CNCC 2025 - Nanchang, China
Duration: 11 Apr 2025 to 13 Apr 2025

Publication series

Name: Proceedings of the 2025 2nd International Conference on Computer Network and Cloud Computing, CNCC 2025

Conference

Conference: 2nd International Conference on Computer Network and Cloud Computing, CNCC 2025
Country/Territory: China
City: Nanchang
Period: 11/04/25 to 13/04/25

Keywords

  • Long-tailed distribution
  • Malware family classification
  • Representation learning
