Unifying Cross-lingual Summarization and Machine Translation with Compression Rate

Yu Bai; Heyan Huang; Kai Fan; Yang Gao; Yiming Zhu; Jiaao Zhan; Zewen Chi; Boxing Chen

doi:10.1145/3477495.3532071

Corresponding author for this work: Kai Fan

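School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-review

9 Citations (Scopus)

Abstract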

Cross-Lingual Summarization (CLS) is a task that extracts important information from a source document and summarizes it into a summary in another language. It is a challenging task that requires a system to understand, summarize, and translate at the same time, making it highly related to Monolingual Summarization (MS) and Machine Translation (MT). In practice, the training resources for Machine Translation are far more than that for cross-lingual and monolingual summarization. Thus incorporating the Machine Translation corpus into CLS would be beneficial for its performance. However, the present work only leverages a simple multi-task framework to bring Machine Translation in, lacking deeper exploration. In this paper, we propose a novel task, Cross-lingual Summarization with Compression rate (CSC), to benefit Cross-Lingual Summarization by large-scale Machine Translation corpus. Through introducing compression rate, the information ratio between the source and the target text, we regard the MT task as a special CLS task with a compression rate of 100%. Hence they can be trained as a unified task, sharing knowledge more effectively. However, a huge gap exists between the MT task and the CLS task, where samples with compression rates between 30% and 90% are extremely rare. Hence, to bridge these two tasks smoothly, we propose an effective data augmentation method to produce document-summary pairs with different compression rates. The proposed method not only improves the performance of the CLS task, but also provides controllability to generate summaries in desired lengths. Experiments demonstrate that our method outperforms various strong baselines in three cross-lingual summarization datasets. We released our code and data at https://github.com/ybai-nlp/CLS_CR.
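
The idea of treating MT as CLS with a 100% compression rate can be illustrated with a minimal Python sketch. It is not the authors' implementation (their released code is the authoritative source). It assumes, purely for illustration, that the "information ratio" is approximated by the ratio of target to source token counts, and that the rate is passed to a model as a bucketed control tag. The function names and the `<cr_k>` tag format are hypothetical.

```python
def compression_rate(source_tokens, target_tokens):
    """Length-based proxy for compression rate: target length / source length.

    Assumption: token counts stand in for "information"; the abstract does
    not specify how the ratio is measured.
    """
    if not source_tokens:
        raise ValueError("source must be non-empty")
    return len(target_tokens) / len(source_tokens)


def rate_bucket(rate, num_buckets=10):
    """Map a rate to a bucket index in [0, num_buckets - 1].

    Rates are clipped to [0, 1], so a translation-like pair (rate near or
    above 1.0) lands in the last bucket.
    """
    clipped = min(max(rate, 0.0), 1.0)
    return min(int(clipped * num_buckets), num_buckets - 1)


def tag_example(source_tokens, target_tokens):
    """Prepend a hypothetical control tag encoding the bucketed rate."""
    rate = compression_rate(source_tokens, target_tokens)
    tag = f"<cr_{rate_bucket(rate)}>"
    return [tag] + list(source_tokens), list(target_tokens)
```

Under this proxy, a sentence-level MT pair and a document-summary pair differ only in the tag they receive, which is what would let both kinds of data feed a single model. A length ratio is a crude stand-in across languages, since token counts for the same content vary by language and tokenizer.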

Original language	English
Title of host publication	SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher	Association for Computing Machinery, Inc
Pages	1087-1097
Number of pages	11
ISBN (Electronic)	9781450387323
DOI	https://doi.org/10.1145/3477495.3532071
Publication status	Published - 6 Jul 2022
Event	45th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2022 - Madrid, Spain; Duration: 11 Jul 2022 → 15 Jul 2022

Publication series

Name	SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

Conference

Conference	45th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2022
Country/Territory	Spain
City	Madrid
Period	11/07/22 → 15/07/22

Cite this

Bai, Y., Huang, H., Fan, K., Gao, Y., Zhu, Y., Zhan, J., Chi, Z., & Chen, B. (2022). Unifying Cross-lingual Summarization and Machine Translation with Compression Rate. In SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1087-1097). (SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval). Association for Computing Machinery, Inc. https://doi.org/10.1145/3477495.3532071

Bai, Yu ; Huang, Heyan ; Fan, Kai et al. / Unifying Cross-lingual Summarization and Machine Translation with Compression Rate. SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, Inc, 2022. pp. 1087-1097 (SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval).

@inproceedings{fc99b1e439a64d78ba135fa22425cc13,
  title = "Unifying Cross-lingual Summarization and Machine Translation with Compression Rate",
  abstract = "Cross-Lingual Summarization (CLS) is a task that extracts important information from a source document and summarizes it into a summary in another language. It is a challenging task that requires a system to understand, summarize, and translate at the same time, making it highly related to Monolingual Summarization (MS) and Machine Translation (MT). In practice, the training resources for Machine Translation are far more than that for cross-lingual and monolingual summarization. Thus incorporating the Machine Translation corpus into CLS would be beneficial for its performance. However, the present work only leverages a simple multi-task framework to bring Machine Translation in, lacking deeper exploration. In this paper, we propose a novel task, Cross-lingual Summarization with Compression rate (CSC), to benefit Cross-Lingual Summarization by large-scale Machine Translation corpus. Through introducing compression rate, the information ratio between the source and the target text, we regard the MT task as a special CLS task with a compression rate of 100%. Hence they can be trained as a unified task, sharing knowledge more effectively. However, a huge gap exists between the MT task and the CLS task, where samples with compression rates between 30% and 90% are extremely rare. Hence, to bridge these two tasks smoothly, we propose an effective data augmentation method to produce document-summary pairs with different compression rates. The proposed method not only improves the performance of the CLS task, but also provides controllability to generate summaries in desired lengths. Experiments demonstrate that our method outperforms various strong baselines in three cross-lingual summarization datasets. We released our code and data at https://github.com/ybai-nlp/CLS_CR.",
  keywords = "compression rate, cross-lingual summarization, machine translation",
  author = "Yu Bai and Heyan Huang and Kai Fan and Yang Gao and Yiming Zhu and Jiaao Zhan and Zewen Chi and Boxing Chen",
  note = "Publisher Copyright: {\textcopyright} 2022 ACM.; 45th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2022 ; Conference date: 11-07-2022 Through 15-07-2022",
  year = "2022",
  month = jul,
  day = "6",
  doi = "10.1145/3477495.3532071",
  language = "English",
  series = "SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval",
  publisher = "Association for Computing Machinery, Inc",
  pages = "1087--1097",
  booktitle = "SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval",
}

Bai, Y, Huang, H, Fan, K, Gao, Y, Zhu, Y, Zhan, J, Chi, Z & Chen, B 2022, Unifying Cross-lingual Summarization and Machine Translation with Compression Rate. In SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, Inc, pp. 1087-1097, 45th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2022, Madrid, Spain, 11/07/22. https://doi.org/10.1145/3477495.3532071

Unifying Cross-lingual Summarization and Machine Translation with Compression Rate. / Bai, Yu; Huang, Heyan; Fan, Kai et al.
SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, Inc, 2022. pp. 1087-1097 (SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval).

TY - GEN
T1 - Unifying Cross-lingual Summarization and Machine Translation with Compression Rate
AU - Bai, Yu
AU - Huang, Heyan
AU - Fan, Kai
AU - Gao, Yang
AU - Zhu, Yiming
AU - Zhan, Jiaao
AU - Chi, Zewen
AU - Chen, Boxing
PY - 2022/7/6
Y1 - 2022/7/6
AB - Cross-Lingual Summarization (CLS) is a task that extracts important information from a source document and summarizes it into a summary in another language. It is a challenging task that requires a system to understand, summarize, and translate at the same time, making it highly related to Monolingual Summarization (MS) and Machine Translation (MT). In practice, the training resources for Machine Translation are far more than that for cross-lingual and monolingual summarization. Thus incorporating the Machine Translation corpus into CLS would be beneficial for its performance. However, the present work only leverages a simple multi-task framework to bring Machine Translation in, lacking deeper exploration. In this paper, we propose a novel task, Cross-lingual Summarization with Compression rate (CSC), to benefit Cross-Lingual Summarization by large-scale Machine Translation corpus. Through introducing compression rate, the information ratio between the source and the target text, we regard the MT task as a special CLS task with a compression rate of 100%. Hence they can be trained as a unified task, sharing knowledge more effectively. However, a huge gap exists between the MT task and the CLS task, where samples with compression rates between 30% and 90% are extremely rare. Hence, to bridge these two tasks smoothly, we propose an effective data augmentation method to produce document-summary pairs with different compression rates. The proposed method not only improves the performance of the CLS task, but also provides controllability to generate summaries in desired lengths. Experiments demonstrate that our method outperforms various strong baselines in three cross-lingual summarization datasets. We released our code and data at https://github.com/ybai-nlp/CLS_CR.
KW - compression rate
KW - cross-lingual summarization
KW - machine translation
UR - http://www.scopus.com/inward/record.url?scp=85135048457&partnerID=8YFLogxK
U2 - 10.1145/3477495.3532071
DO - 10.1145/3477495.3532071
M3 - Conference contribution
AN - SCOPUS:85135048457
T3 - SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 1087
EP - 1097
BT - SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
PB - Association for Computing Machinery, Inc
T2 - 45th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2022
Y2 - 11 July 2022 through 15 July 2022
ER -

Bai Y, Huang H, Fan K, Gao Y, Zhu Y, Zhan J et al. Unifying Cross-lingual Summarization and Machine Translation with Compression Rate. In SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, Inc. 2022. p. 1087-1097. (SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval). doi: 10.1145/3477495.3532071