TY - JOUR
T1 - PONE: A Novel Automatic Evaluation Metric for Open-domain Generative Dialogue Systems
T2 - ACM Transactions on Information Systems
AU - Lan, Tian
AU - Mao, Xian-Ling
AU - Wei, Wei
AU - Gao, Xiaoyan
AU - Huang, Heyan
N1 - Publisher Copyright: © 2020 ACM.
PY - 2020/11
Y1 - 2020/11
AB - Open-domain generative dialogue systems have attracted considerable attention over the past few years, yet how to evaluate them automatically remains a major challenge. To the best of our knowledge, there are three kinds of automatic evaluation metrics for open-domain generative dialogue systems: (1) word-overlap-based metrics; (2) embedding-based metrics; (3) learning-based metrics. Because a systematic comparison has been lacking, it is not clear which kind of metric is most effective. In this article, we first systematically compare all three kinds of metrics, and extensive experiments demonstrate that learning-based metrics are the most effective evaluation metrics for open-domain generative dialogue systems. Moreover, we observe that nearly all learning-based metrics depend on a negative sampling mechanism, which yields extremely imbalanced, low-quality samples for training a scoring model. To address this issue, we propose PONE, a novel learning-based metric that significantly improves the correlation with human judgments by using augmented POsitive samples and valuable NEgative samples. Extensive experiments demonstrate that PONE significantly outperforms the state-of-the-art learning-based evaluation method. In addition, we have publicly released the code of our proposed metric and the state-of-the-art baselines.
KW - Open-domain
KW - automatic evaluation
KW - generative dialogue systems
UR - http://www.scopus.com/inward/record.url?scp=85097352565&partnerID=8YFLogxK
U2 - 10.1145/3423168
DO - 10.1145/3423168
M3 - Article
AN - SCOPUS:85097352565
SN - 1046-8188
VL - 39
JO - ACM Transactions on Information Systems
JF - ACM Transactions on Information Systems
IS - 1
M1 - 3423168
ER -