Efficient Classification of Malicious URLs: M-BERT - A Modified BERT Variant for Enhanced Semantic Understanding

Boyang Yu; Fei Tang; Daji Ergu; Rui Zeng; Bo Ma; Fangyao Liu

doi:10.1109/ACCESS.2024.3357095

Boyang Yu, Fei Tang, Daji Ergu, Rui Zeng, Bo Ma, Fangyao Liu*

*Corresponding author for this work

Southwest University for Nationalities

Research output: Contribution to journal › Article › peer-review

4 Citations (Scopus)

Abstract

Malicious websites present a substantial threat to the security and privacy of individuals using the internet. Traditional approaches for identifying these malicious sites have struggled to keep pace with evolving attack strategies. In recent years, language models have emerged as a potential solution for effectively detecting and categorizing malicious websites. This study introduces a novel Bidirectional Encoder Representations from Transformers (BERT) model, based on the Transformer encoder architecture, designed to capture pertinent characteristics of malicious web addresses. Additionally, large-scale language models are employed for training, dataset assessment, and interpretability analysis. The evaluation results demonstrate the effectiveness of the large language model in accurately classifying malicious websites, achieving an impressive precision rate of 94.42%. This performance surpasses that of existing language models. Furthermore, the interpretability analysis sheds light on the decision-making process of the model, enhancing our understanding of its classification outcomes. In conclusion, the proposed BERT model, built on the Transformer encoder architecture, exhibits robust performance and interpretability in the identification of malicious websites. It holds promise as a solution to bolster the security of network users and mitigate the risks associated with malicious online activities.

Original language	English
Pages (from-to)	13453-13468
Number of pages	16
Journal	IEEE Access
Volume	12
DOIs	https://doi.org/10.1109/ACCESS.2024.3357095
Publication status	Published - 2024
Externally published	Yes

Keywords

deep learning
fraudulent URL classification
Natural language processing

Cite this

@article{9aa20b91cc864dc4832bc47bcb3ca024,

title = "Efficient Classification of Malicious {URLs}: {M-BERT} - A Modified {BERT} Variant for Enhanced Semantic Understanding",

abstract = "Malicious websites present a substantial threat to the security and privacy of individuals using the internet. Traditional approaches for identifying these malicious sites have struggled to keep pace with evolving attack strategies. In recent years, language models have emerged as a potential solution for effectively detecting and categorizing malicious websites. This study introduces a novel Bidirectional Encoder Representations from Transformers (BERT) model, based on the Transformer encoder architecture, designed to capture pertinent characteristics of malicious web addresses. Additionally, large-scale language models are employed for training, dataset assessment, and interpretability analysis. The evaluation results demonstrate the effectiveness of the large language model in accurately classifying malicious websites, achieving an impressive precision rate of 94.42\%. This performance surpasses that of existing language models. Furthermore, the interpretability analysis sheds light on the decision-making process of the model, enhancing our understanding of its classification outcomes. In conclusion, the proposed BERT model, built on the Transformer encoder architecture, exhibits robust performance and interpretability in the identification of malicious websites. It holds promise as a solution to bolster the security of network users and mitigate the risks associated with malicious online activities.",

keywords = "deep learning, fraudulent URL classification, Natural language processing",

author = "Boyang Yu and Fei Tang and Daji Ergu and Rui Zeng and Bo Ma and Fangyao Liu",

note = "Publisher Copyright: {\textcopyright} 2013 IEEE.",

year = "2024",

doi = "10.1109/ACCESS.2024.3357095",

language = "English",

volume = "12",

pages = "13453--13468",

journal = "IEEE Access",

issn = "2169-3536",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Efficient Classification of Malicious URLs

T2 - M-BERT - A Modified BERT Variant for Enhanced Semantic Understanding

AU - Yu, Boyang

AU - Tang, Fei

AU - Ergu, Daji

AU - Zeng, Rui

AU - Ma, Bo

AU - Liu, Fangyao

PY - 2024

Y1 - 2024

AB - Malicious websites present a substantial threat to the security and privacy of individuals using the internet. Traditional approaches for identifying these malicious sites have struggled to keep pace with evolving attack strategies. In recent years, language models have emerged as a potential solution for effectively detecting and categorizing malicious websites. This study introduces a novel Bidirectional Encoder Representations from Transformers (BERT) model, based on the Transformer encoder architecture, designed to capture pertinent characteristics of malicious web addresses. Additionally, large-scale language models are employed for training, dataset assessment, and interpretability analysis. The evaluation results demonstrate the effectiveness of the large language model in accurately classifying malicious websites, achieving an impressive precision rate of 94.42%. This performance surpasses that of existing language models. Furthermore, the interpretability analysis sheds light on the decision-making process of the model, enhancing our understanding of its classification outcomes. In conclusion, the proposed BERT model, built on the Transformer encoder architecture, exhibits robust performance and interpretability in the identification of malicious websites. It holds promise as a solution to bolster the security of network users and mitigate the risks associated with malicious online activities.

KW - deep learning

KW - fraudulent URL classification

KW - Natural language processing

UR - http://www.scopus.com/inward/record.url?scp=85183962059&partnerID=8YFLogxK

U2 - 10.1109/ACCESS.2024.3357095

DO - 10.1109/ACCESS.2024.3357095

M3 - Article

AN - SCOPUS:85183962059

SN - 2169-3536

VL - 12

SP - 13453

EP - 13468

JO - IEEE Access

JF - IEEE Access

ER -
