TY - JOUR
T1 - Efficient Classification of Malicious URLs: M-BERT - A Modified BERT Variant for Enhanced Semantic Understanding
AU - Yu, Boyang
AU - Tang, Fei
AU - Ergu, Daji
AU - Zeng, Rui
AU - Ma, Bo
AU - Liu, Fangyao
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2024
Y1 - 2024
N2 - Malicious websites present a substantial threat to the security and privacy of individuals using the internet. Traditional approaches for identifying these malicious sites have struggled to keep pace with evolving attack strategies. In recent years, language models have emerged as a potential solution for effectively detecting and categorizing malicious websites. This study introduces a novel Bidirectional Encoder Representations from Transformers (BERT) model, based on the Transformer encoder architecture, designed to capture pertinent characteristics of malicious web addresses. Additionally, large-scale language models are employed for training, dataset assessment, and interpretability analysis. The evaluation results demonstrate the effectiveness of the large language model in accurately classifying malicious websites, achieving an impressive precision rate of 94.42%. This performance surpasses that of existing language models. Furthermore, the interpretability analysis sheds light on the decision-making process of the model, enhancing our understanding of its classification outcomes. In conclusion, the proposed BERT model, built on the Transformer encoder architecture, exhibits robust performance and interpretability in the identification of malicious websites. It holds promise as a solution to bolster the security of network users and mitigate the risks associated with malicious online activities.
KW - deep learning
KW - fraudulent URL classification
KW - natural language processing
UR - http://www.scopus.com/inward/record.url?scp=85183962059&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2024.3357095
DO - 10.1109/ACCESS.2024.3357095
M3 - Article
AN - SCOPUS:85183962059
SN - 2169-3536
VL - 12
SP - 13453
EP - 13468
JO - IEEE Access
JF - IEEE Access
ER -