Introducing bidirectional attention for autoregressive models in abstractive summarization

Jianfei Zhao, Xin Sun, Chong Feng*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Abstractive summarization methods typically follow the autoregressive paradigm, using causal masks in the decoder for training and inference efficiency. However, this approach leads to a constant context throughout the generation process, which conflicts with the bidirectional characteristics of natural language. Although previous work has incorporated bidirectional attention into the decoding process through non-autoregressive approaches, the resulting evaluation scores fall short of autoregressive methods. To bring bidirectional attention to the autoregressive process while maintaining superior performance, we propose the global autoregressive paradigm, which feeds the outputs of the autoregressive process back as additional inputs to the subsequent global iteration. Specifically, we build a bidirectional decoder alongside the original encoder and decoder to capture the bidirectional context of the outputs. This context is updated after each autoregressive decoding iteration, and the decoder integrates the updated context into subsequent autoregressive decoding steps, enriching the generative process with a more comprehensive and authentic context. Additionally, we use contrastive learning to train the model to extract reliable features from the bidirectional context, and we apply reinforcement learning to improve the model's utilization of this context. We evaluate our method on the CNN/DM, XSum, and NYT datasets, and the results highlight the significance of the bidirectional context. Our method achieves the best ROUGE-2 score on CNN/DM (23.96) and performs comparably on XSum (25.45) and NYT (27.91). It also outperforms all baselines in terms of BERTScore, scoring 89.96 on CNN/DM, 92.70 on XSum, and 90.04 on NYT. Furthermore, our method performs better with a larger beam size.
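The global iteration loop described in the abstract can be pictured with a short sketch. This is a minimal, hypothetical rendering of the procedure as it reads from the abstract; the object names (encoder, decoder, bidir_decoder), their methods, and all parameters are illustrative assumptions, not the authors' actual implementation.

```python
def global_autoregressive_summarize(encoder, decoder, bidir_decoder,
                                    source_ids, num_global_iters=2,
                                    beam_size=4, max_len=128):
    """Hypothetical sketch of the global autoregressive paradigm:
    the output of each autoregressive pass is re-encoded with full
    (non-causal) attention and fed into the next pass as extra context."""
    enc_states = encoder(source_ids)      # encode the source document once
    bidir_context = None                  # no output context on the first pass
    summary_ids = None

    for _ in range(num_global_iters):
        # Standard autoregressive (beam-search) decoding; the decoder may
        # additionally attend to the bidirectional context of the previous output.
        summary_ids = decoder.beam_search(
            enc_states,
            extra_context=bidir_context,
            beam_size=beam_size,
            max_len=max_len,
        )
        # Re-encode the generated summary so that every position sees both
        # its left and right context, updating the context for the next pass.
        bidir_context = bidir_decoder(summary_ids, enc_states)

    return summary_ids
```

Under this reading, the first pass is ordinary autoregressive decoding, and each subsequent global iteration conditions on a bidirectional view of the previous draft; the beam-size sensitivity noted in the abstract would enter through the `beam_size` argument of the inner search.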

Original language: English
Article number: 121497
Journal: Information Sciences
Volume: 689
DOIs
Publication status: Published - Jan 2025

Keywords

  • Abstractive summarization
  • Autoregressive model
  • Contrastive learning
  • Reinforcement learning
