Introducing bidirectional attention for autoregressive models in abstractive summarization

Jianfei Zhao, Xin Sun, Chong Feng*

*Corresponding author of this work

Research output: Contribution to journal › Article › peer-review

Abstract

Abstractive summarization methods typically follow the autoregressive paradigm, using causal masks in the decoder for training and inference efficiency. However, this approach imposes a fixed, left-to-right context throughout the generation process, which conflicts with the bidirectional characteristics of natural language. Although previous attempts have been made to incorporate bidirectional attention into the decoding process through non-autoregressive approaches, their evaluation results are not comparable to those of autoregressive methods. To bring bidirectional attention to the autoregressive process while maintaining superior performance, we propose the global autoregressive paradigm, which takes the outputs of the autoregressive process as additional inputs in the subsequent global iteration. Specifically, we build a bidirectional decoder alongside the original encoder and decoder to capture the bidirectional context of the outputs. This context is updated after each autoregressive decoding iteration. The decoder then integrates the updated context into subsequent autoregressive decoding steps, enhancing the generative process with a more comprehensive and authentic context. Additionally, we use contrastive learning to train the model to extract reliable features from the bidirectional context and apply reinforcement learning to improve the model's utilization of this context. We evaluate our method on the CNN/DM, XSum, and NYT datasets, and the results highlight the significance of the bidirectional context. Our method achieves the best performance in terms of ROUGE-2 on CNN/DM (23.96), and performs comparably on XSum (25.45) and NYT (27.91). It also outperforms all the baselines in terms of BERTScore, with scores of 89.96 on CNN/DM, 92.70 on XSum, and 90.04 on NYT. Furthermore, our method performs better with a larger beam size.
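To make the masking distinction concrete, the following is a minimal NumPy sketch (not the paper's implementation; all function names are hypothetical) contrasting the causal mask of a standard autoregressive decoder with the full attention a bidirectional decoder could apply over a completed draft from a previous iteration.

```python
import numpy as np

def causal_mask(seq_len):
    # Autoregressive decoding: position i may attend only to positions <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def bidirectional_mask(seq_len):
    # Re-encoding a completed draft: every position may attend to every other,
    # giving each token both left and right context.
    return np.ones((seq_len, seq_len), dtype=bool)

def masked_attention(q, k, mask):
    # Scaled dot-product attention weights; disallowed positions are set to
    # -inf before the softmax so they receive zero probability mass.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

Under the causal mask, the first token can only attend to itself, whereas the bidirectional mask lets every token of the draft condition on the full sequence before the next autoregressive pass.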

Original language: English
Article number: 121497
Journal: Information Sciences
Volume: 689
DOI
Publication status: Published - Jan 2025
