
Tibetan-LLaMA 2: Large Language Model for Tibetan

  • Jiu Sha
  • Mengxiao Zhu*
  • Chong Feng
  • Jizhuoma Ci
  • *Corresponding author for this work
  • Minzu University of China
  • North China University of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Large language models (LLMs), such as ChatGPT and LLaMA, have shown remarkable capability across a wide range of natural language tasks. However, current LLMs are concentrated mainly on resource-rich languages such as English and Chinese. For low-resource languages such as Tibetan, research and applications related to LLMs are still in their infancy. To address this gap, we present a method to enhance LLaMA with the ability to understand and generate Tibetan text, as well as to follow instructions. This is achieved by creating large-scale unsupervised pre-training and supervised fine-tuning datasets, which mitigate the limited availability of Tibetan data. Additionally, we expand LLaMA's vocabulary by incorporating Tibetan tokens through Unigram tokenization, thereby improving both its encoding efficiency and its semantic understanding of Tibetan. Furthermore, we conduct secondary pre-training and fine-tune the model on the constructed datasets, enhancing its capability to interpret and execute instructions effectively. To verify the effectiveness of the model, we establish ten evaluation benchmarks for Tibetan. The experimental results indicate that the proposed model significantly improves LLaMA's proficiency in understanding and generating Tibetan content. To promote further research, we release our model and inference resources at https://github.com/Shajiu/Tibetan-LLaMA-2.
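The abstract does not give the exact recipe for the vocabulary-expansion step, but a common pattern for this technique is to train a Unigram tokenizer on a Tibetan corpus with SentencePiece and merge its pieces into LLaMA's existing tokenizer. The sketch below illustrates that pattern only; the corpus path tibetan_corpus.txt, the vocabulary size of 20,000, and the base checkpoint meta-llama/Llama-2-7b-hf are placeholder assumptions, not values from the paper.

```python
# A minimal sketch of Tibetan vocabulary expansion via Unigram tokenization.
# Assumptions (not from the paper): corpus path, vocab size, base checkpoint.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2
from transformers import LlamaTokenizer

# 1. Train a Unigram tokenizer on raw Tibetan text.
spm.SentencePieceTrainer.train(
    input="tibetan_corpus.txt",    # hypothetical corpus file
    model_prefix="tibetan_unigram",
    model_type="unigram",          # the abstract specifies Unigram tokenization
    vocab_size=20000,              # assumed size
    character_coverage=0.9995,
)

# 2. Load the base LLaMA tokenizer and the new Tibetan model as protobufs.
llama_tok = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama_proto = sp_pb2.ModelProto()
llama_proto.ParseFromString(llama_tok.sp_model.serialized_model_proto())

tibetan_proto = sp_pb2.ModelProto()
with open("tibetan_unigram.model", "rb") as f:
    tibetan_proto.ParseFromString(f.read())

# 3. Append Tibetan pieces that LLaMA's vocabulary does not already contain.
existing = {p.piece for p in llama_proto.pieces}
for piece in tibetan_proto.pieces:
    if piece.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = piece.piece
        new_piece.score = 0.0
        llama_proto.pieces.append(new_piece)

# 4. Save the merged tokenizer. Before secondary pre-training, the model's
#    embedding matrix must be resized to the new vocabulary, e.g. with
#    model.resize_token_embeddings(len(merged_tok)).
with open("merged.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
merged_tok = LlamaTokenizer(vocab_file="merged.model")
merged_tok.save_pretrained("tibetan-llama-tokenizer")
print(f"Vocabulary grown from {len(llama_tok)} to {len(merged_tok)} tokens")
```

Merging pieces rather than retraining from scratch preserves LLaMA's original token IDs, so the pretrained embeddings remain valid and only the newly appended Tibetan rows need to be learned during secondary pre-training.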

Original language: English
Article number: 141
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Volume: 24
Issue number: 12
DOI
Publication status: Published - 11 Dec 2025
