Deterministic Reversible Data Augmentation for Neural Machine Translation

Jiashu Yao; Heyan Huang; Zeming Liu; Yuhang Guo

School of Computer Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Data augmentation is an effective way to diversify corpora in machine translation, but previous methods may introduce semantic inconsistency between original and augmented data because of irreversible operations and random subword sampling procedures. To generate both symbolically diverse and semantically consistent augmentation data, we propose Deterministic Reversible Data Augmentation (DRDA), a simple but effective data augmentation method for neural machine translation. DRDA adopts deterministic segmentations and reversible operations to generate multi-granularity subword representations and pulls them closer together with multi-view techniques. With no extra corpora or model changes required, DRDA outperforms strong baselines on several translation tasks with a clear margin (up to 4.3 BLEU gain over Transformer) and exhibits good robustness in noisy, low-resource, and cross-domain datasets.

Original language	English
Title of host publication	62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Proceedings of the Conference
Editors	Lun-Wei Ku, Andre Martins, Vivek Srikumar
Publisher	Association for Computational Linguistics (ACL)
Pages	8075-8089
Number of pages	15
ISBN (Electronic)	9798891760998
Publication status	Published - 2024
Event	Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Hybrid, Bangkok, Thailand
Duration	11 Aug 2024 → 16 Aug 2024

Publication series

Name	Proceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print)	0736-587X

Conference

Conference	Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
Country/Territory	Thailand
City	Hybrid, Bangkok
Period	11/08/24 → 16/08/24

Cite this

Yao, J., Huang, H., Liu, Z., & Guo, Y. (2024). Deterministic Reversible Data Augmentation for Neural Machine Translation. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Proceedings of the Conference (pp. 8075-8089). (Proceedings of the Annual Meeting of the Association for Computational Linguistics). Association for Computational Linguistics (ACL).

Yao, Jiashu ; Huang, Heyan ; Liu, Zeming et al. / Deterministic Reversible Data Augmentation for Neural Machine Translation. 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Proceedings of the Conference. editor / Lun-Wei Ku ; Andre Martins ; Vivek Srikumar. Association for Computational Linguistics (ACL), 2024. pp. 8075-8089 (Proceedings of the Annual Meeting of the Association for Computational Linguistics).

@inproceedings{a4d40955f5f04e21938adc1730641885,
  title = "Deterministic Reversible Data Augmentation for Neural Machine Translation",
  abstract = "Data augmentation is an effective way to diversify corpora in machine translation, but previous methods may introduce semantic inconsistency between original and augmented data because of irreversible operations and random subword sampling procedures. To generate both symbolically diverse and semantically consistent augmentation data, we propose Deterministic Reversible Data Augmentation (DRDA), a simple but effective data augmentation method for neural machine translation. DRDA adopts deterministic segmentations and reversible operations to generate multi-granularity subword representations and pulls them closer together with multi-view techniques. With no extra corpora or model changes required, DRDA outperforms strong baselines on several translation tasks with a clear margin (up to 4.3 BLEU gain over Transformer) and exhibits good robustness in noisy, low-resource, and cross-domain datasets.",
  author = "Jiashu Yao and Heyan Huang and Zeming Liu and Yuhang Guo",
  note = "Publisher Copyright: {\textcopyright} 2024 Association for Computational Linguistics.; Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 ; Conference date: 11-08-2024 Through 16-08-2024",
  year = "2024",
  language = "English",
  series = "Proceedings of the Annual Meeting of the Association for Computational Linguistics",
  publisher = "Association for Computational Linguistics (ACL)",
  pages = "8075--8089",
  editor = "Lun-Wei Ku and Andre Martins and Vivek Srikumar",
  booktitle = "62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Proceedings of the Conference",
  address = "United States",
}

Yao, J, Huang, H, Liu, Z & Guo, Y 2024, Deterministic Reversible Data Augmentation for Neural Machine Translation. in L-W Ku, A Martins & V Srikumar (eds), 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Proceedings of the Conference. Proceedings of the Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics (ACL), pp. 8075-8089, Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024, Hybrid, Bangkok, Thailand, 11/08/24.

TY - GEN

T1 - Deterministic Reversible Data Augmentation for Neural Machine Translation

AU - Yao, Jiashu

AU - Huang, Heyan

AU - Liu, Zeming

AU - Guo, Yuhang

PY - 2024

Y1 - 2024

AB - Data augmentation is an effective way to diversify corpora in machine translation, but previous methods may introduce semantic inconsistency between original and augmented data because of irreversible operations and random subword sampling procedures. To generate both symbolically diverse and semantically consistent augmentation data, we propose Deterministic Reversible Data Augmentation (DRDA), a simple but effective data augmentation method for neural machine translation. DRDA adopts deterministic segmentations and reversible operations to generate multi-granularity subword representations and pulls them closer together with multi-view techniques. With no extra corpora or model changes required, DRDA outperforms strong baselines on several translation tasks with a clear margin (up to 4.3 BLEU gain over Transformer) and exhibits good robustness in noisy, low-resource, and cross-domain datasets.

UR - http://www.scopus.com/inward/record.url?scp=85205300547&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85205300547

T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics

SP - 8075

EP - 8089

BT - 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Proceedings of the Conference

A2 - Ku, Lun-Wei

A2 - Martins, Andre

A2 - Srikumar, Vivek

PB - Association for Computational Linguistics (ACL)

T2 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024

Y2 - 11 August 2024 through 16 August 2024

ER -

Yao J, Huang H, Liu Z, Guo Y. Deterministic Reversible Data Augmentation for Neural Machine Translation. In Ku LW, Martins A, Srikumar V, editors, 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Proceedings of the Conference. Association for Computational Linguistics (ACL). 2024. p. 8075-8089. (Proceedings of the Annual Meeting of the Association for Computational Linguistics).
