TY - GEN
T1 - Addressing Syntactic Divergence in Low-Resource Neural Machine Translation via Language Independent Word Reordering
AU - Yixi, Jiangcan
AU - Su, Chao
AU - Shi, Shumin
AU - Zhao, Xiaobing
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
N2 - Neural machine translation using a combination of parallel and synthetic corpora has achieved impressive translation performance for several language pairs, where the synthetic corpus is typically generated by back-translating monolingual target sentences. However, the quality of the synthetic corpus is poor in low-resource scenarios, which reduces the contribution of data augmentation methods such as back-translation to translation quality, especially for syntactically distant language pairs. In this paper, we propose a novel solution that uses a language-independent word reordering method to address syntactic divergences between the target and source languages. The method indirectly converts the word order of the target language to that of the source language via an assisting language that has a word order similar to the source language and sufficient sentence pairs with the target language. A higher-quality synthetic corpus can then be obtained by translating the source-ordered monolingual target sentences with a bilingual dictionary. The synthetic corpus and the parallel corpus are merged to train a more powerful NMT model. Experiments on real low-resource Tibetan-Chinese, Uyghur-Chinese, and Mongolian-Chinese translation tasks show that our method achieves significant improvements over other semi-supervised methods. Our word reordering method avoids problems such as insufficient reordering training data and immature lexical analysers.
AB - Neural machine translation using a combination of parallel and synthetic corpora has achieved impressive translation performance for several language pairs, where the synthetic corpus is typically generated by back-translating monolingual target sentences. However, the quality of the synthetic corpus is poor in low-resource scenarios, which reduces the contribution of data augmentation methods such as back-translation to translation quality, especially for syntactically distant language pairs. In this paper, we propose a novel solution that uses a language-independent word reordering method to address syntactic divergences between the target and source languages. The method indirectly converts the word order of the target language to that of the source language via an assisting language that has a word order similar to the source language and sufficient sentence pairs with the target language. A higher-quality synthetic corpus can then be obtained by translating the source-ordered monolingual target sentences with a bilingual dictionary. The synthetic corpus and the parallel corpus are merged to train a more powerful NMT model. Experiments on real low-resource Tibetan-Chinese, Uyghur-Chinese, and Mongolian-Chinese translation tasks show that our method achieves significant improvements over other semi-supervised methods. Our word reordering method avoids problems such as insufficient reordering training data and immature lexical analysers.
KW - Language Independent
KW - Low-Resource Neural Machine Translation
KW - Syntactic Divergence
KW - Word Reordering
UR - http://www.scopus.com/inward/record.url?scp=105008368362&partnerID=8YFLogxK
U2 - 10.1007/978-981-96-5123-8_8
DO - 10.1007/978-981-96-5123-8_8
M3 - Conference contribution
AN - SCOPUS:105008368362
SN - 9789819651221
T3 - Communications in Computer and Information Science
SP - 103
EP - 124
BT - Intelligent Multilingual Information Processing - 1st International Conference, IMLIP 2024, Proceedings
A2 - Zhang, Huaping
A2 - Shang, Jianyun
A2 - Su, Jinsong
PB - Springer Science and Business Media Deutschland GmbH
T2 - 1st International Conference on Intelligent Multilingual Information Processing, IMLIP 2024
Y2 - 16 November 2024 through 17 November 2024
ER -