Abstract
Back-translation (BT) has been widely used and has become one of the standard techniques for data augmentation in Neural Machine Translation (NMT). BT has proven beneficial for improving translation performance, especially in low-resource scenarios. While most previous work on BT focuses on closely related European languages, few studies consider less-related languages in other parts of the world. In this paper, we choose a less-related language pair in Asia, Chinese and Vietnamese, to investigate the impact of BT on extremely low-resource machine translation between them. We first discuss the similarities and differences between the two languages, then evaluate and compare the effects of different sizes of back-translated data on NMT and Statistical Machine Translation (SMT) models for Chinese-Vietnamese and Vietnamese-Chinese, in both character-based and word-based settings, and further analyze the translation outputs from several aspects. Some conclusions from previous work are partially confirmed, and we also draw new findings and conclusions that help deepen the understanding of BT for translation between less-related low-resource languages.
Original language | English |
---|---|
Article number | 9129718 |
Pages (from-to) | 119931-119939 |
Number of pages | 9 |
Journal | IEEE Access |
Volume | 8 |
Publication status | Published - 2020 |
Keywords
- Back-translation
- Chinese
- Vietnamese
- low-resource languages
- machine translation