PONE: A Novel Automatic Evaluation Metric for Open-domain Generative Dialogue Systems

Tian Lan, Xian Ling Mao, Wei Wei, Xiaoyan Gao, Heyan Huang

Research output: Contribution to journal › Article › peer-review

21 Citations (Scopus)

Abstract

Open-domain generative dialogue systems have attracted considerable attention over the past few years. Currently, how to automatically evaluate them is still a big challenge. As far as we know, there are three kinds of automatic evaluations for open-domain generative dialogue systems: (1) word-overlap-based metrics; (2) embedding-based metrics; (3) learning-based metrics. Due to the lack of systematic comparison, it is not clear which kind of metric is more effective. In this article, we first systematically measure all three kinds of metrics to determine which is best. Extensive experiments demonstrate that learning-based metrics are the most effective evaluation metrics for open-domain generative dialogue systems. Moreover, we observe that nearly all learning-based metrics depend on the negative sampling mechanism, which yields extremely imbalanced and low-quality samples for training a score model. To address this issue, we propose a novel learning-based metric that significantly improves the correlation with human judgments by using augmented POsitive samples and valuable NEgative samples, called PONE. Extensive experiments demonstrate that PONE significantly outperforms the state-of-the-art learning-based evaluation method. In addition, we have publicly released the code of our proposed metric and the state-of-the-art baselines.
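For context, the negative sampling mechanism the abstract criticizes is conventionally implemented by pairing each dialogue context with its ground-truth reply as a positive sample and a reply drawn at random from another dialogue as a negative sample. The following minimal sketch illustrates that conventional scheme (all function and variable names are hypothetical, not taken from the paper's released code); randomly drawn replies are often trivially off-topic, which is the low-quality-sample problem PONE targets:

```python
import random

def build_training_pairs(dialogues, seed=0):
    """Naive negative sampling for training a score model:
    the ground-truth reply is the positive sample (label 1),
    and a reply randomly drawn from a different dialogue is
    the negative sample (label 0)."""
    rng = random.Random(seed)
    replies = [reply for _, reply in dialogues]
    pairs = []
    for i, (context, reply) in enumerate(dialogues):
        pairs.append((context, reply, 1))  # positive: true reply
        # negative: any reply except this dialogue's own
        neg = rng.choice(replies[:i] + replies[i + 1:])
        pairs.append((context, neg, 0))
    return pairs

dialogues = [
    ("How are you?", "I'm fine, thanks."),
    ("What's the weather like?", "It's sunny today."),
    ("Do you like coffee?", "Yes, I drink it every morning."),
]
pairs = build_training_pairs(dialogues)
```

Because the negatives are sampled uniformly at random, most of them are obviously unrelated to the context, so the resulting score model learns little about distinguishing good replies from plausible-but-wrong ones.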

Original language: English
Article number: 3423168
Journal: ACM Transactions on Information Systems
Volume: 39
Issue: 1
DOI
Publication status: Published - Nov 2020
