Deterministic Reversible Data Augmentation for Neural Machine Translation

Jiashu Yao; Heyan Huang; Zeming Liu; Yuhang Guo


School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Data augmentation is an effective way to diversify corpora in machine translation, but previous methods may introduce semantic inconsistency between original and augmented data because of irreversible operations and random subword sampling procedures. To generate both symbolically diverse and semantically consistent augmentation data, we propose Deterministic Reversible Data Augmentation (DRDA), a simple but effective data augmentation method for neural machine translation. DRDA adopts deterministic segmentations and reversible operations to generate multi-granularity subword representations and pulls them closer together with multi-view techniques. With no extra corpora or model changes required, DRDA outperforms strong baselines on several translation tasks with a clear margin (up to 4.3 BLEU gain over Transformer) and exhibits good robustness in noisy, low-resource, and cross-domain datasets.
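The idea of deterministic, reversible, multi-granularity segmentation mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the merge table `MERGES`, the helper names `segment` and `views`, and the simplified BPE-style procedure are all assumptions made for illustration, and the multi-view training objective is not covered.

```python
from typing import List, Tuple

# Hypothetical, hand-written merge table in priority order (not from the paper).
MERGES: List[Tuple[str, str]] = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]


def segment(word: str, num_merges: int) -> List[str]:
    """Segment `word` by applying only the first `num_merges` merge rules.

    Fewer merges give finer pieces, more merges give coarser pieces.
    No sampling is involved, so the same input always gives the same output.
    """
    pieces = list(word)  # start from single characters
    for left, right in MERGES[:num_merges]:
        merged: List[str] = []
        i = 0
        while i < len(pieces):
            # Join an adjacent (left, right) pair into one piece.
            if i + 1 < len(pieces) and pieces[i] == left and pieces[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces


def views(word: str) -> List[List[str]]:
    """Return one segmentation per granularity, from finest to coarsest."""
    return [segment(word, k) for k in range(len(MERGES) + 1)]


if __name__ == "__main__":
    for view in views("lower"):
        # Reversible: concatenating the pieces restores the original word.
        assert "".join(view) == "lower"
        print(view)
```

Every view concatenates back to the same string, so the augmented inputs differ only in how the text is split, not in what it says. That is the sense in which such augmentation stays semantically consistent with the original.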

Original language	English
Title of host publication	62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Proceedings of the Conference
Editors	Lun-Wei Ku, Andre Martins, Vivek Srikumar
Publisher	Association for Computational Linguistics (ACL)
Pages	8075-8089
Number of pages	15
ISBN (Electronic)	9798891760998
Publication status	Published - 2024
Event	Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Hybrid, Bangkok, Thailand. Duration: 11 Aug 2024 → 16 Aug 2024

Publication series

Name	Proceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print)	0736-587X

Conference

Conference	Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
Country/Territory	Thailand
City	Hybrid, Bangkok
Period	11/08/24 → 16/08/24

Cite this

Yao, J., Huang, H., Liu, Z., & Guo, Y. (2024). Deterministic Reversible Data Augmentation for Neural Machine Translation. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Proceedings of the Conference (pp. 8075-8089). (Proceedings of the Annual Meeting of the Association for Computational Linguistics). Association for Computational Linguistics (ACL).

Yao, Jiashu ; Huang, Heyan ; Liu, Zeming et al. / Deterministic Reversible Data Augmentation for Neural Machine Translation. 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Proceedings of the Conference. editor / Lun-Wei Ku ; Andre Martins ; Vivek Srikumar. Association for Computational Linguistics (ACL), 2024. pp. 8075-8089 (Proceedings of the Annual Meeting of the Association for Computational Linguistics).

@inproceedings{a4d40955f5f04e21938adc1730641885,

title = "Deterministic Reversible Data Augmentation for Neural Machine Translation",

abstract = "Data augmentation is an effective way to diversify corpora in machine translation, but previous methods may introduce semantic inconsistency between original and augmented data because of irreversible operations and random subword sampling procedures. To generate both symbolically diverse and semantically consistent augmentation data, we propose Deterministic Reversible Data Augmentation (DRDA), a simple but effective data augmentation method for neural machine translation. DRDA adopts deterministic segmentations and reversible operations to generate multi-granularity subword representations and pulls them closer together with multi-view techniques. With no extra corpora or model changes required, DRDA outperforms strong baselines on several translation tasks with a clear margin (up to 4.3 BLEU gain over Transformer) and exhibits good robustness in noisy, low-resource, and cross-domain datasets.",

author = "Jiashu Yao and Heyan Huang and Zeming Liu and Yuhang Guo",

note = "Publisher Copyright: {\textcopyright} 2024 Association for Computational Linguistics.; Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 ; Conference date: 11-08-2024 Through 16-08-2024",

year = "2024",

language = "English",

series = "Proceedings of the Annual Meeting of the Association for Computational Linguistics",

publisher = "Association for Computational Linguistics (ACL)",

pages = "8075--8089",

editor = "Lun-Wei Ku and Andre Martins and Vivek Srikumar",

booktitle = "62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Proceedings of the Conference",

address = "United States",

}

Yao, J, Huang, H, Liu, Z & Guo, Y 2024, Deterministic Reversible Data Augmentation for Neural Machine Translation. in L-W Ku, A Martins & V Srikumar (eds), 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Proceedings of the Conference. Proceedings of the Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics (ACL), pp. 8075-8089, Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024, Hybrid, Bangkok, Thailand, 11/08/24.

Deterministic Reversible Data Augmentation for Neural Machine Translation. / Yao, Jiashu; Huang, Heyan; Liu, Zeming et al.
62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Proceedings of the Conference. ed. / Lun-Wei Ku; Andre Martins; Vivek Srikumar. Association for Computational Linguistics (ACL), 2024. p. 8075-8089 (Proceedings of the Annual Meeting of the Association for Computational Linguistics).

TY - GEN

T1 - Deterministic Reversible Data Augmentation for Neural Machine Translation

AU - Yao, Jiashu

AU - Huang, Heyan

AU - Liu, Zeming

AU - Guo, Yuhang

PY - 2024

Y1 - 2024

N2 - Data augmentation is an effective way to diversify corpora in machine translation, but previous methods may introduce semantic inconsistency between original and augmented data because of irreversible operations and random subword sampling procedures. To generate both symbolically diverse and semantically consistent augmentation data, we propose Deterministic Reversible Data Augmentation (DRDA), a simple but effective data augmentation method for neural machine translation. DRDA adopts deterministic segmentations and reversible operations to generate multi-granularity subword representations and pulls them closer together with multi-view techniques. With no extra corpora or model changes required, DRDA outperforms strong baselines on several translation tasks with a clear margin (up to 4.3 BLEU gain over Transformer) and exhibits good robustness in noisy, low-resource, and cross-domain datasets.

AB - Data augmentation is an effective way to diversify corpora in machine translation, but previous methods may introduce semantic inconsistency between original and augmented data because of irreversible operations and random subword sampling procedures. To generate both symbolically diverse and semantically consistent augmentation data, we propose Deterministic Reversible Data Augmentation (DRDA), a simple but effective data augmentation method for neural machine translation. DRDA adopts deterministic segmentations and reversible operations to generate multi-granularity subword representations and pulls them closer together with multi-view techniques. With no extra corpora or model changes required, DRDA outperforms strong baselines on several translation tasks with a clear margin (up to 4.3 BLEU gain over Transformer) and exhibits good robustness in noisy, low-resource, and cross-domain datasets.

UR - http://www.scopus.com/inward/record.url?scp=85205300547&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85205300547

T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics

SP - 8075

EP - 8089

BT - 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Proceedings of the Conference

A2 - Ku, Lun-Wei

A2 - Martins, Andre

A2 - Srikumar, Vivek

PB - Association for Computational Linguistics (ACL)

T2 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024

Y2 - 11 August 2024 through 16 August 2024

ER -

Yao J, Huang H, Liu Z, Guo Y. Deterministic Reversible Data Augmentation for Neural Machine Translation. In Ku LW, Martins A, Srikumar V, editors, 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Proceedings of the Conference. Association for Computational Linguistics (ACL). 2024. p. 8075-8089. (Proceedings of the Annual Meeting of the Association for Computational Linguistics).
