TY - GEN
T1 - Bypassing LLM Safeguards
T2 - 14th International Conference on Computer Engineering and Networks, CENet 2024
AU - Wang, Shaohuang
AU - Geng, Ruijing
AU - Lei, Shuai
AU - Lv, Yanfei
AU - Zhang, Huaping
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
N2 - In the field of cybersecurity, the emergence of Large Language Models (LLMs) has opened up a new domain of potential risks. Although these models show an impressive level of capability across a range of applications, they are not free from the possibility of generating content that could negatively impact both social and digital safety. This paper examines the implications of In-Context Learning-based command attacks, a burgeoning threat to the security and ethical integrity of LLMs. We introduce the In-Context Tense Attack (ITA) framework, a novel approach that employs harmful examples to undermine the integrity of LLMs. Our theoretical analysis elucidates how a constrained set of context examples can significantly influence the security mechanisms of LLMs. Through rigorous experimentation, we have substantiated the potency of ITA in elevating the success rate of jailbreaking prompts. On the PKU-Alignment/SafeRLHF dataset, ITA achieved a remarkable 92.99% increase in Accuracy, a 73.36% improvement in Rouge-L, and a 27.01% enhancement in Bleu-4 scores. Similarly, on the NVIDIA/Aegis-Safety dataset, ITA demonstrated a 72.03% increase in Accuracy, an 80.87% rise in Rouge-L, and a 40.24% boost in Bleu-4 scores. These results underscore the effectiveness of ITA in manipulating LLMs to generate harmful outputs, thereby highlighting the necessity for more robust security measures.
AB - In the field of cybersecurity, the emergence of Large Language Models (LLMs) has opened up a new domain of potential risks. Although these models show an impressive level of capability across a range of applications, they are not free from the possibility of generating content that could negatively impact both social and digital safety. This paper examines the implications of In-Context Learning-based command attacks, a burgeoning threat to the security and ethical integrity of LLMs. We introduce the In-Context Tense Attack (ITA) framework, a novel approach that employs harmful examples to undermine the integrity of LLMs. Our theoretical analysis elucidates how a constrained set of context examples can significantly influence the security mechanisms of LLMs. Through rigorous experimentation, we have substantiated the potency of ITA in elevating the success rate of jailbreaking prompts. On the PKU-Alignment/SafeRLHF dataset, ITA achieved a remarkable 92.99% increase in Accuracy, a 73.36% improvement in Rouge-L, and a 27.01% enhancement in Bleu-4 scores. Similarly, on the NVIDIA/Aegis-Safety dataset, ITA demonstrated a 72.03% increase in Accuracy, an 80.87% rise in Rouge-L, and a 40.24% boost in Bleu-4 scores. These results underscore the effectiveness of ITA in manipulating LLMs to generate harmful outputs, thereby highlighting the necessity for more robust security measures.
KW - Cyber-Security
KW - In-Context Learning
KW - Jailbreak
KW - LLM Safety
UR - http://www.scopus.com/inward/record.url?scp=105006813070&partnerID=8YFLogxK
U2 - 10.1007/978-981-96-4245-8_23
DO - 10.1007/978-981-96-4245-8_23
M3 - Conference contribution
AN - SCOPUS:105006813070
SN - 9789819642441
T3 - Lecture Notes in Electrical Engineering
SP - 266
EP - 277
BT - Proceedings of the 14th International Conference on Computer Engineering and Networks - Volume III
A2 - Yin, Guangqiang
A2 - Liu, Xiaodong
A2 - Su, Jian
A2 - Yang, Yangzhao
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 18 October 2024 through 21 October 2024
ER -