Jointly Learning Topics in Sentence Embedding for Document Summarization

Yang Gao; Yue Xu; Heyan Huang; Qian Liu; Linjing Wei; Luyang Liu

doi:10.1109/TKDE.2019.2892430

Yang Gao^*, Yue Xu, Heyan Huang, Qian Liu, Linjing Wei, Luyang Liu

^*Corresponding author for this work

School of Computer Science

Research output: Contribution to journal › Article › peer-review

26 Citations (Scopus)

Abstract

Summarization systems for various applications, such as opinion mining, online news services, and answering questions, have attracted increasing attention in recent years. These tasks are complicated, and a classic representation using bag-of-words does not adequately meet the comprehensive needs of applications that rely on sentence extraction. In this paper, we focus on representing sentences as continuous vectors as a basis for measuring relevance between user needs and candidate sentences in source documents. Embedding models based on distributed vector representations are often used in the summarization community because, through cosine similarity, they simplify sentence relevance when comparing two sentences or a sentence/query and a document. However, the vector-based embedding models do not typically account for the salience of a sentence, and this is a very necessary part of document summarization. To incorporate sentence salience, we developed a model, called CCTSenEmb, that learns latent discriminative Gaussian topics in the embedding space and extended the new framework by seamlessly incorporating both topic and sentence embedding into one summarization system. To facilitate the semantic coherence between sentences in the framework of prediction-based tasks for sentence embedding, the CCTSenEmb further considers the associations between neighboring sentences. As a result, this novel sentence embedding framework combines sentence representations, word-based content, and topic assignments to predict the representation of the next sentence. A series of experiments with the DUC datasets validate CCTSenEmb's efficacy in document summarization in a query-focused extraction-based setting and an unsupervised ILP-based setting.

Original language	English
Article number	8611098
Pages (from-to)	688-699
Number of pages	12
Journal	IEEE Transactions on Knowledge and Data Engineering
Volume	32
Issue number	4
DOI	https://doi.org/10.1109/TKDE.2019.2892430
Publication status	Published - 1 Apr 2020

Cite this

Gao, Y., Xu, Y., Huang, H., Liu, Q., Wei, L., & Liu, L. (2020). Jointly Learning Topics in Sentence Embedding for Document Summarization. IEEE Transactions on Knowledge and Data Engineering, 32(4), 688-699. Article 8611098. https://doi.org/10.1109/TKDE.2019.2892430

@article{2dea0639a1e7447a82634a6a52e2a381,

title = "Jointly Learning Topics in Sentence Embedding for Document Summarization",

abstract = "Summarization systems for various applications, such as opinion mining, online news services, and answering questions, have attracted increasing attention in recent years. These tasks are complicated, and a classic representation using bag-of-words does not adequately meet the comprehensive needs of applications that rely on sentence extraction. In this paper, we focus on representing sentences as continuous vectors as a basis for measuring relevance between user needs and candidate sentences in source documents. Embedding models based on distributed vector representations are often used in the summarization community because, through cosine similarity, they simplify sentence relevance when comparing two sentences or a sentence/query and a document. However, the vector-based embedding models do not typically account for the salience of a sentence, and this is a very necessary part of document summarization. To incorporate sentence salience, we developed a model, called CCTSenEmb, that learns latent discriminative Gaussian topics in the embedding space and extended the new framework by seamlessly incorporating both topic and sentence embedding into one summarization system. To facilitate the semantic coherence between sentences in the framework of prediction-based tasks for sentence embedding, the CCTSenEmb further considers the associations between neighboring sentences. As a result, this novel sentence embedding framework combines sentence representations, word-based content, and topic assignments to predict the representation of the next sentence. A series of experiments with the DUC datasets validate CCTSenEmb's efficacy in document summarization in a query-focused extraction-based setting and an unsupervised ILP-based setting.",

keywords = "Gaussian topics, Sentence embedding, salience, relevance, summarization",

author = "Yang Gao and Yue Xu and Heyan Huang and Qian Liu and Linjing Wei and Luyang Liu",

note = "Publisher Copyright: {\textcopyright} 1989-2012 IEEE.",

year = "2020",

month = apr,

day = "1",

doi = "10.1109/TKDE.2019.2892430",

language = "English",

volume = "32",

pages = "688--699",

journal = "IEEE Transactions on Knowledge and Data Engineering",

issn = "1041-4347",

publisher = "IEEE Computer Society",

number = "4",

}

TY - JOUR

T1 - Jointly Learning Topics in Sentence Embedding for Document Summarization

AU - Gao, Yang

AU - Xu, Yue

AU - Huang, Heyan

AU - Liu, Qian

AU - Wei, Linjing

AU - Liu, Luyang

PY - 2020/4/1

Y1 - 2020/4/1

N2 - Summarization systems for various applications, such as opinion mining, online news services, and answering questions, have attracted increasing attention in recent years. These tasks are complicated, and a classic representation using bag-of-words does not adequately meet the comprehensive needs of applications that rely on sentence extraction. In this paper, we focus on representing sentences as continuous vectors as a basis for measuring relevance between user needs and candidate sentences in source documents. Embedding models based on distributed vector representations are often used in the summarization community because, through cosine similarity, they simplify sentence relevance when comparing two sentences or a sentence/query and a document. However, the vector-based embedding models do not typically account for the salience of a sentence, and this is a very necessary part of document summarization. To incorporate sentence salience, we developed a model, called CCTSenEmb, that learns latent discriminative Gaussian topics in the embedding space and extended the new framework by seamlessly incorporating both topic and sentence embedding into one summarization system. To facilitate the semantic coherence between sentences in the framework of prediction-based tasks for sentence embedding, the CCTSenEmb further considers the associations between neighboring sentences. As a result, this novel sentence embedding framework combines sentence representations, word-based content, and topic assignments to predict the representation of the next sentence. A series of experiments with the DUC datasets validate CCTSenEmb's efficacy in document summarization in a query-focused extraction-based setting and an unsupervised ILP-based setting.

KW - Gaussian topics

KW - Sentence embedding

KW - salience

KW - relevance

KW - summarization

UR - http://www.scopus.com/inward/record.url?scp=85081652165&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2019.2892430

DO - 10.1109/TKDE.2019.2892430

M3 - Article

AN - SCOPUS:85081652165

SN - 1041-4347

VL - 32

SP - 688

EP - 699

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

IS - 4

M1 - 8611098

ER -
