TY - JOUR
T1 - Multi-level Chunk-based Constituent-To-Dependency Treebank Transformation for Tibetan Dependency Parsing
AU - Shi, Shumin
AU - Luo, Dan
AU - Wu, Xing
AU - Long, Congjun
AU - Huang, Heyan
N1 - Publisher Copyright:
© 2021 Association for Computing Machinery.
PY - 2021/4
Y1 - 2021/4
N2 - Dependency parsing is an important task for Natural Language Processing (NLP). However, a mature parser requires a large treebank for training, which is still extremely costly to create. Tibetan is a kind of extremely low-resource language for NLP, there is no available Tibetan dependency treebank, which is currently obtained by manual annotation. Furthermore, there are few related kinds of research on the construction of treebank. We propose a novel method of multi-level chunk-based syntactic parsing to complete constituent-To-dependency treebank conversion for Tibetan under scarce conditions. Our method mines more dependencies of Tibetan sentences, builds a high-quality Tibetan dependency tree corpus, and makes fuller use of the inherent laws of the language itself. We train the dependency parsing models on the dependency treebank obtained by the preliminary transformation. The model achieves 86.5% accuracy, 96% LAS, and 97.85% UAS, which exceeds the optimal results of existing conversion methods. The experimental results show that our method has the potential to use a low-resource setting, which means we not only solve the problem of scarce Tibetan dependency treebank but also avoid needless manual annotation. The method embodies the regularity of strong knowledge-guided linguistic analysis methods, which is of great significance to promote the research of Tibetan information processing.
AB - Dependency parsing is an important task for Natural Language Processing (NLP). However, a mature parser requires a large treebank for training, which is still extremely costly to create. Tibetan is a kind of extremely low-resource language for NLP, there is no available Tibetan dependency treebank, which is currently obtained by manual annotation. Furthermore, there are few related kinds of research on the construction of treebank. We propose a novel method of multi-level chunk-based syntactic parsing to complete constituent-To-dependency treebank conversion for Tibetan under scarce conditions. Our method mines more dependencies of Tibetan sentences, builds a high-quality Tibetan dependency tree corpus, and makes fuller use of the inherent laws of the language itself. We train the dependency parsing models on the dependency treebank obtained by the preliminary transformation. The model achieves 86.5% accuracy, 96% LAS, and 97.85% UAS, which exceeds the optimal results of existing conversion methods. The experimental results show that our method has the potential to use a low-resource setting, which means we not only solve the problem of scarce Tibetan dependency treebank but also avoid needless manual annotation. The method embodies the regularity of strong knowledge-guided linguistic analysis methods, which is of great significance to promote the research of Tibetan information processing.
KW - Knowledge-driven
KW - Low-resource dependency parsing
KW - Multi-level chunk mechanism
KW - Tibetan dependency trees
UR - http://www.scopus.com/inward/record.url?scp=85105725378&partnerID=8YFLogxK
U2 - 10.1145/3424247
DO - 10.1145/3424247
M3 - Article
AN - SCOPUS:85105725378
SN - 2375-4699
VL - 20
JO - ACM Transactions on Asian and Low-Resource Language Information Processing
JF - ACM Transactions on Asian and Low-Resource Language Information Processing
IS - 2
M1 - 26
ER -