Tibetan-LLaMA 2: Large Language Model for Tibetan

  • Jiu Sha
  • Mengxiao Zhu*
  • Chong Feng
  • Jizhuoma Ci

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Large language models (LLMs), such as ChatGPT and LLaMA, have shown remarkable capability in a wide range of natural language tasks. However, current LLMs are concentrated mainly in resource-rich languages, such as English and Chinese. For low-resource languages such as Tibetan, research and applications related to LLMs are still in their infancy. To address this gap, we present a method to enhance LLaMA with the ability to understand and generate Tibetan text, as well as to follow instructions. This is achieved by creating large-scale unsupervised pre-training and supervised fine-tuning datasets, which mitigate the limited availability of Tibetan data. Additionally, we expand LLaMA's vocabulary by incorporating Tibetan tokens through Unigram tokenization, thereby improving both its encoding efficiency and its semantic understanding of Tibetan. Furthermore, we conduct secondary pre-training and fine-tune the model on the constructed datasets, thereby enhancing its capability to interpret and execute instructions effectively. To verify the effectiveness of the model, we establish ten evaluation benchmarks for Tibetan. The experimental results indicate that the proposed model significantly enhances LLaMA's proficiency in understanding and generating Tibetan content. To promote further research, we release our model and inference resources at https://github.com/Shajiu/Tibetan-LLaMA-2.
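The vocabulary-expansion idea above can be illustrated with a toy sketch. The paper's actual pipeline trains a Unigram model (e.g., with SentencePiece) on Tibetan corpora and merges the resulting pieces into LLaMA's tokenizer; the minimal example below instead uses a greedy longest-match encoder and a hypothetical hand-picked vocabulary, purely to show why adding multi-character Tibetan pieces reduces the number of tokens needed per text.

```python
# Toy illustration (not the paper's implementation): expanding a base
# vocabulary with multi-character Tibetan pieces and comparing encoding
# efficiency (token count) before and after.

def longest_match_encode(text, vocab):
    """Greedy longest-match tokenization against a vocabulary.
    Characters not covered by any piece fall back to single-character
    tokens, mimicking a byte/char fallback."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fallback token
            i += 1
    return tokens

# Base vocabulary: single Tibetan characters only, so every syllable
# costs several tokens (roughly what a tokenizer without Tibetan
# coverage produces).
base_vocab = set("བོད་ཡིག")

# Expanded vocabulary: adds multi-character Tibetan pieces, as a
# Unigram model trained on Tibetan text might learn.
expanded_vocab = base_vocab | {"བོད་", "ཡིག"}

text = "བོད་ཡིག"  # "Tibetan script"
print(len(longest_match_encode(text, base_vocab)))      # one token per character
print(len(longest_match_encode(text, expanded_vocab)))  # far fewer tokens
```

Fewer tokens per text means longer effective context and cheaper training and inference for Tibetan, which is the motivation for the vocabulary expansion described in the abstract.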

Original language: English
Article number: 141
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Volume: 24
Issue number: 12
Publication status: Published - 11 Dec 2025

Keywords

  • Large language models
  • Low-resource language
  • Tibetan data
