Multimodal Dual-Embedding Networks for Malware Open-Set Recognition

Jingcai Guo; Han Wang; Yuanyuan Xu; Wenchao Xu; Yufeng Zhan; Yuxia Sun; Song Guo

doi:10.1109/TNNLS.2024.3373809

Jingcai Guo, Han Wang, Yuanyuan Xu, Wenchao Xu, Yufeng Zhan, Yuxia Sun^*, Song Guo

^*Corresponding author for this work

School of Automation

Research output: Contribution to journal › Article › peer-review

Abstract

Malware open-set recognition (MOSR) is an emerging research domain that aims at jointly classifying malware samples from known families and detecting the ones from novel unknown families, respectively. Existing works mostly rely on a well-trained classifier considering the predicted probabilities of each known family with a threshold-based detection to achieve the MOSR. However, our observation reveals that the feature distributions of malware samples are extremely similar to each other even between known and unknown families. Thus, the obtained classifier may produce overly high probabilities of testing unknown samples toward known families and degrade the model performance. In this article, we propose the multimodal dual-embedding networks, dubbed MDENet, to take advantage of comprehensive malware features from different modalities to enhance the diversity of malware feature space, which is more representative and discriminative for down-stream recognition. Concretely, we first generate a malware image for each observed sample based on their numeric features using our proposed numeric encoder with a redesigned multiscale CNN structure, which can better explore their statistical and spatial correlations. Besides, we propose to organize tokenized malware features into a sentence for each sample considering its behaviors and dynamics, and utilize language models as the textual encoder to transform it into a representable and computable textual vector. Such parallel multimodal encoders can fuse the above two components to enhance the feature diversity. Last, to further guarantee the open-set recognition (OSR), we dually embed the fused multimodal representation into one primary space and an associated sub-space, i.e., discriminative and exclusive spaces, with contrastive sampling and ρ-bounded enclosing sphere regularizations, which resort to classification and detection, respectively.
Moreover, we also enrich our previously proposed large-scaled malware dataset MAL-100 with multimodal characteristics and contribute an improved version dubbed MAL-100+. Experimental results on the widely used malware dataset Mailing and the proposed MAL-100+ demonstrate the effectiveness of our method.
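
The threshold-based detection that the abstract attributes to existing works can be sketched as follows. This is a minimal illustration of that baseline idea only, not of MDENet; the function names, the example logits, and the threshold value of 0.9 are assumptions chosen for the example.

```python
import math


def softmax(logits):
    """Convert raw classifier scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def open_set_predict(logits, threshold=0.9):
    """Return the index of the predicted known family, or -1 for 'unknown'.

    A sample is rejected as unknown when the highest predicted probability
    falls below the (hypothetical) threshold.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else -1


# One score clearly dominates, so the sample is assigned to a known family.
confident = open_set_predict([8.0, 1.0, 0.5])

# Scores are close together, so the sample is rejected as unknown.
ambiguous = open_set_predict([1.2, 1.0, 0.9])
```

The weakness the abstract points out follows directly from this rule: if an unknown-family sample yields one dominant score, as in the first call, it passes the threshold and is misclassified as a known family.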

Original language	English
Pages (from-to)	4545-4559
Number of pages	15
Journal	IEEE Transactions on Neural Networks and Learning Systems
Volume	36
Issue number	3
DOIs	https://doi.org/10.1109/TNNLS.2024.3373809
Publication status	Published - 2025

Keywords

Classification
cyber-security
malware recognition
multimodal analysis
neural networks

Cite this

Guo, J., Wang, H., Xu, Y., Xu, W., Zhan, Y., Sun, Y., & Guo, S. (2025). Multimodal Dual-Embedding Networks for Malware Open-Set Recognition. IEEE Transactions on Neural Networks and Learning Systems, 36(3), 4545-4559. https://doi.org/10.1109/TNNLS.2024.3373809

@article{e72150d0ea234ca899685940cb80a663,

title = "Multimodal Dual-Embedding Networks for Malware Open-Set Recognition",

abstract = "Malware open-set recognition (MOSR) is an emerging research domain that aims at jointly classifying malware samples from known families and detecting the ones from novel unknown families, respectively. Existing works mostly rely on a well-trained classifier considering the predicted probabilities of each known family with a threshold-based detection to achieve the MOSR. However, our observation reveals that the feature distributions of malware samples are extremely similar to each other even between known and unknown families. Thus, the obtained classifier may produce overly high probabilities of testing unknown samples toward known families and degrade the model performance. In this article, we propose the multimodal dual-embedding networks, dubbed MDENet, to take advantage of comprehensive malware features from different modalities to enhance the diversity of malware feature space, which is more representative and discriminative for down-stream recognition. Concretely, we first generate a malware image for each observed sample based on their numeric features using our proposed numeric encoder with a redesigned multiscale CNN structure, which can better explore their statistical and spatial correlations. Besides, we propose to organize tokenized malware features into a sentence for each sample considering its behaviors and dynamics, and utilize language models as the textual encoder to transform it into a representable and computable textual vector. Such parallel multimodal encoders can fuse the above two components to enhance the feature diversity. Last, to further guarantee the open-set recognition (OSR), we dually embed the fused multimodal representation into one primary space and an associated sub-space, i.e., discriminative and exclusive spaces, with contrastive sampling and $\rho$-bounded enclosing sphere regularizations, which resort to classification and detection, respectively.
Moreover, we also enrich our previously proposed large-scaled malware dataset MAL-100 with multimodal characteristics and contribute an improved version dubbed MAL-100+. Experimental results on the widely used malware dataset Mailing and the proposed MAL-100+ demonstrate the effectiveness of our method.",

keywords = "Classification, cyber-security, malware recognition, multimodal analysis, neural networks",

author = "Jingcai Guo and Han Wang and Yuanyuan Xu and Wenchao Xu and Yufeng Zhan and Yuxia Sun and Song Guo",

note = "Publisher Copyright: {\textcopyright} 2024 IEEE.",

year = "2025",

doi = "10.1109/TNNLS.2024.3373809",

language = "English",

volume = "36",

pages = "4545--4559",

journal = "IEEE Transactions on Neural Networks and Learning Systems",

issn = "2162-237X",

publisher = "IEEE Computational Intelligence Society",

number = "3",

}

TY - JOUR

T1 - Multimodal Dual-Embedding Networks for Malware Open-Set Recognition

AU - Guo, Jingcai

AU - Wang, Han

AU - Xu, Yuanyuan

AU - Xu, Wenchao

AU - Zhan, Yufeng

AU - Sun, Yuxia

AU - Guo, Song

PY - 2025

Y1 - 2025

N2 - Malware open-set recognition (MOSR) is an emerging research domain that aims at jointly classifying malware samples from known families and detecting the ones from novel unknown families, respectively. Existing works mostly rely on a well-trained classifier considering the predicted probabilities of each known family with a threshold-based detection to achieve the MOSR. However, our observation reveals that the feature distributions of malware samples are extremely similar to each other even between known and unknown families. Thus, the obtained classifier may produce overly high probabilities of testing unknown samples toward known families and degrade the model performance. In this article, we propose the multimodal dual-embedding networks, dubbed MDENet, to take advantage of comprehensive malware features from different modalities to enhance the diversity of malware feature space, which is more representative and discriminative for down-stream recognition. Concretely, we first generate a malware image for each observed sample based on their numeric features using our proposed numeric encoder with a redesigned multiscale CNN structure, which can better explore their statistical and spatial correlations. Besides, we propose to organize tokenized malware features into a sentence for each sample considering its behaviors and dynamics, and utilize language models as the textual encoder to transform it into a representable and computable textual vector. Such parallel multimodal encoders can fuse the above two components to enhance the feature diversity. Last, to further guarantee the open-set recognition (OSR), we dually embed the fused multimodal representation into one primary space and an associated sub-space, i.e., discriminative and exclusive spaces, with contrastive sampling and ρ-bounded enclosing sphere regularizations, which resort to classification and detection, respectively.
Moreover, we also enrich our previously proposed large-scaled malware dataset MAL-100 with multimodal characteristics and contribute an improved version dubbed MAL-100+. Experimental results on the widely used malware dataset Mailing and the proposed MAL-100+ demonstrate the effectiveness of our method.

AB - Malware open-set recognition (MOSR) is an emerging research domain that aims at jointly classifying malware samples from known families and detecting the ones from novel unknown families, respectively. Existing works mostly rely on a well-trained classifier considering the predicted probabilities of each known family with a threshold-based detection to achieve the MOSR. However, our observation reveals that the feature distributions of malware samples are extremely similar to each other even between known and unknown families. Thus, the obtained classifier may produce overly high probabilities of testing unknown samples toward known families and degrade the model performance. In this article, we propose the multimodal dual-embedding networks, dubbed MDENet, to take advantage of comprehensive malware features from different modalities to enhance the diversity of malware feature space, which is more representative and discriminative for down-stream recognition. Concretely, we first generate a malware image for each observed sample based on their numeric features using our proposed numeric encoder with a redesigned multiscale CNN structure, which can better explore their statistical and spatial correlations. Besides, we propose to organize tokenized malware features into a sentence for each sample considering its behaviors and dynamics, and utilize language models as the textual encoder to transform it into a representable and computable textual vector. Such parallel multimodal encoders can fuse the above two components to enhance the feature diversity. Last, to further guarantee the open-set recognition (OSR), we dually embed the fused multimodal representation into one primary space and an associated sub-space, i.e., discriminative and exclusive spaces, with contrastive sampling and ρ-bounded enclosing sphere regularizations, which resort to classification and detection, respectively.
Moreover, we also enrich our previously proposed large-scaled malware dataset MAL-100 with multimodal characteristics and contribute an improved version dubbed MAL-100+. Experimental results on the widely used malware dataset Mailing and the proposed MAL-100+ demonstrate the effectiveness of our method.

KW - Classification

KW - cyber-security

KW - malware recognition

KW - multimodal analysis

KW - neural networks

UR - http://www.scopus.com/inward/record.url?scp=86000426064&partnerID=8YFLogxK

U2 - 10.1109/TNNLS.2024.3373809

DO - 10.1109/TNNLS.2024.3373809

M3 - Article

AN - SCOPUS:86000426064

SN - 2162-237X

VL - 36

SP - 4545

EP - 4559

JO - IEEE Transactions on Neural Networks and Learning Systems

JF - IEEE Transactions on Neural Networks and Learning Systems

IS - 3

ER -
