A survey on the safety of large language model: Classification, evaluation, attribution, mitigation and prospect

  • Heyan Huang
  • Silin Li
  • Tianwei Lan
  • Yuli Qiu
  • Zeming Liu
  • Jiashu Yao
  • Li Zeng
  • Yingyu Shan
  • Xiaoming Shi
  • Yuhang Guo

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Large language models can provide answers comparable to human level in many fields and demonstrate a wealth of emergent capabilities on domains and tasks they have not been trained on. However, artificial intelligence systems built on large language models currently carry many potential safety hazards. For example, large language models are vulnerable to hard-to-detect attacks, including intricately concealed ones, and the content they generate may suffer from problems such as illegality, privacy leakage, hate speech, bias, and factual errors. What's more, in practical applications, the abuse of large language models is also an important issue: the content generated by these models may cause harm at multiple levels, from countries to social groups and individual domains. This paper aims to explore and classify the safety risks faced by large language models in depth, review existing evaluation methods, study the causal mechanisms behind the safety risks, and summarize existing solutions. Specifically, this paper identifies 10 safety risks of large language models and categorizes them into two aspects: the safety risks of the model itself and the safety risks of the generated content. In addition, this paper systematically analyzes the safety risks of large language models themselves from the two perspectives of life cycle and hazard level, and introduces existing methods for assessing the risks of large language models, the causes of their safety risks, and the corresponding mitigation methods. The safety of large language models is an important issue that urgently needs to be addressed.

Original language: English
Pages (from-to): 2-32
Number of pages: 31
Journal: CAAI Transactions on Intelligent Systems
Volume: 20
Issue number: 1
DOIs
Publication status: Published - 2025

Keywords

  • generated content safety
  • large language model
  • model safety
  • safety classification
  • safety research prospect
  • safety risk attribution
  • safety risk evaluation
  • safety risk mitigation measures
