TY - GEN
T1 - Unifying Cross-lingual Summarization and Machine Translation with Compression Rate
AU - Bai, Yu
AU - Huang, Heyan
AU - Fan, Kai
AU - Gao, Yang
AU - Zhu, Yiming
AU - Zhan, Jiaao
AU - Chi, Zewen
AU - Chen, Boxing
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/7/6
Y1 - 2022/7/6
N2 - Cross-Lingual Summarization (CLS) is a task that extracts important information from a source document and summarizes it into a summary in another language. It is a challenging task that requires a system to understand, summarize, and translate at the same time, making it highly related to Monolingual Summarization (MS) and Machine Translation (MT). In practice, the training resources for Machine Translation are far more than that for cross-lingual and monolingual summarization. Thus incorporating the Machine Translation corpus into CLS would be beneficial for its performance. However, the present work only leverages a simple multi-task framework to bring Machine Translation in, lacking deeper exploration. In this paper, we propose a novel task, Cross-lingual Summarization with Compression rate (CSC), to benefit Cross-Lingual Summarization by large-scale Machine Translation corpus. Through introducing compression rate, the information ratio between the source and the target text, we regard the MT task as a special CLS task with a compression rate of 100%. Hence they can be trained as a unified task, sharing knowledge more effectively. However, a huge gap exists between the MT task and the CLS task, where samples with compression rates between 30% and 90% are extremely rare. Hence, to bridge these two tasks smoothly, we propose an effective data augmentation method to produce document-summary pairs with different compression rates. The proposed method not only improves the performance of the CLS task, but also provides controllability to generate summaries in desired lengths. Experiments demonstrate that our method outperforms various strong baselines in three cross-lingual summarization datasets. We released our code and data at https: //github.com/ybai-nlp/CLS_CR.
AB - Cross-Lingual Summarization (CLS) is a task that extracts important information from a source document and summarizes it into a summary in another language. It is a challenging task that requires a system to understand, summarize, and translate at the same time, making it highly related to Monolingual Summarization (MS) and Machine Translation (MT). In practice, the training resources for Machine Translation are far more than that for cross-lingual and monolingual summarization. Thus incorporating the Machine Translation corpus into CLS would be beneficial for its performance. However, the present work only leverages a simple multi-task framework to bring Machine Translation in, lacking deeper exploration. In this paper, we propose a novel task, Cross-lingual Summarization with Compression rate (CSC), to benefit Cross-Lingual Summarization by large-scale Machine Translation corpus. Through introducing compression rate, the information ratio between the source and the target text, we regard the MT task as a special CLS task with a compression rate of 100%. Hence they can be trained as a unified task, sharing knowledge more effectively. However, a huge gap exists between the MT task and the CLS task, where samples with compression rates between 30% and 90% are extremely rare. Hence, to bridge these two tasks smoothly, we propose an effective data augmentation method to produce document-summary pairs with different compression rates. The proposed method not only improves the performance of the CLS task, but also provides controllability to generate summaries in desired lengths. Experiments demonstrate that our method outperforms various strong baselines in three cross-lingual summarization datasets. We released our code and data at https: //github.com/ybai-nlp/CLS_CR.
KW - compression rate
KW - cross-lingual summarization
KW - machine translation
UR - http://www.scopus.com/inward/record.url?scp=85135048457&partnerID=8YFLogxK
U2 - 10.1145/3477495.3532071
DO - 10.1145/3477495.3532071
M3 - Conference contribution
AN - SCOPUS:85135048457
T3 - SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 1087
EP - 1097
BT - SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
PB - Association for Computing Machinery, Inc
T2 - 45th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2022
Y2 - 11 July 2022 through 15 July 2022
ER -