TY - JOUR
T1 - Tibetan-LLaMA 2: Large Language Model for Tibetan
AU - Sha, Jiu
AU - Zhu, Mengxiao
AU - Feng, Chong
AU - Ci, Jizhuoma
N1 - Publisher Copyright:
© 2025 Copyright held by the owner/author(s).
PY - 2025/12/11
Y1 - 2025/12/11
N2 - Large language models (LLMs), such as ChatGPT and LLaMA, have shown remarkable capability across a wide range of natural language tasks. However, current LLMs are concentrated mainly in resource-rich languages, such as English and Chinese. For low-resource languages such as Tibetan, research and applications related to LLMs are still in their infancy. To address this gap, we present a method to enhance LLaMA with the ability to understand and generate Tibetan text, as well as to follow instructions. This is achieved by creating large-scale unsupervised pre-training and supervised fine-tuning datasets, which mitigate the scarcity of Tibetan data. Additionally, we expand LLaMA’s vocabulary with Tibetan tokens obtained through Unigram tokenization, improving both its encoding efficiency and its semantic understanding of Tibetan. Furthermore, we conduct secondary pre-training and fine-tune the model on the constructed datasets, enhancing its capability to interpret and execute instructions effectively. To verify the effectiveness of the model, we establish ten evaluation benchmarks for Tibetan. The experimental results indicate that the proposed model significantly improves LLaMA’s proficiency in understanding and generating Tibetan content. To promote further research, we release our model and inference resources at https://github.com/Shajiu/Tibetan-LLaMA-2.
AB - Large language models (LLMs), such as ChatGPT and LLaMA, have shown remarkable capability across a wide range of natural language tasks. However, current LLMs are concentrated mainly in resource-rich languages, such as English and Chinese. For low-resource languages such as Tibetan, research and applications related to LLMs are still in their infancy. To address this gap, we present a method to enhance LLaMA with the ability to understand and generate Tibetan text, as well as to follow instructions. This is achieved by creating large-scale unsupervised pre-training and supervised fine-tuning datasets, which mitigate the scarcity of Tibetan data. Additionally, we expand LLaMA’s vocabulary with Tibetan tokens obtained through Unigram tokenization, improving both its encoding efficiency and its semantic understanding of Tibetan. Furthermore, we conduct secondary pre-training and fine-tune the model on the constructed datasets, enhancing its capability to interpret and execute instructions effectively. To verify the effectiveness of the model, we establish ten evaluation benchmarks for Tibetan. The experimental results indicate that the proposed model significantly improves LLaMA’s proficiency in understanding and generating Tibetan content. To promote further research, we release our model and inference resources at https://github.com/Shajiu/Tibetan-LLaMA-2.
KW - Large language models
KW - Low-resource language
KW - Tibetan data
UR - https://www.scopus.com/pages/publications/105024759069
U2 - 10.1145/3776748
DO - 10.1145/3776748
M3 - Article
AN - SCOPUS:105024759069
SN - 2375-4699
VL - 24
JO - ACM Transactions on Asian and Low-Resource Language Information Processing
JF - ACM Transactions on Asian and Low-Resource Language Information Processing
IS - 12
M1 - 141
ER -