PONE: A Novel Automatic Evaluation Metric for Open-domain Generative Dialogue Systems

Tian Lan, Xian-Ling Mao, Wei Wei, Xiaoyan Gao, Heyan Huang

Research output: Contribution to journal › Article › peer-review

26 Citations (Scopus)

Abstract

Open-domain generative dialogue systems have attracted considerable attention over the past few years, yet how to evaluate them automatically remains a major challenge. To the best of our knowledge, there are three kinds of automatic evaluation methods for open-domain generative dialogue systems: (1) word-overlap-based metrics; (2) embedding-based metrics; and (3) learning-based metrics. Because these have not been systematically compared, it has been unclear which kind of metric is most effective. In this article, we first systematically compare all three kinds of metrics, and extensive experiments demonstrate that learning-based metrics are the most effective for evaluating open-domain generative dialogue systems. Moreover, we observe that nearly all learning-based metrics depend on a negative sampling mechanism, which yields extremely imbalanced and low-quality samples for training a score model. To address this issue, we propose PONE, a novel learning-based metric that significantly improves the correlation with human judgments by using augmented POsitive samples and valuable NEgative samples. Extensive experiments demonstrate that PONE significantly outperforms the state-of-the-art learning-based evaluation method. In addition, we have publicly released the code of our proposed metric and of the state-of-the-art baselines.
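
The abstract does not spell out PONE's actual training procedure, so the following is only a minimal Python sketch of the general idea it describes: a learning-based metric is trained on (context, response) pairs labeled positive or negative, and vanilla negative sampling (random responses from other dialogues, which are often trivially unrelated) is replaced by a filter that keeps more informative negatives. Every name here (`embed`, `build_training_pairs`, the toy character-level encoder, and the similarity threshold) is an illustrative assumption, not the authors' released code.

```python
# Illustrative sketch of sample construction for a learning-based dialogue
# metric, under the assumptions stated above -- NOT the PONE implementation.
import math
import random

def embed(text):
    # Toy bag-of-letters embedding standing in for a real sentence encoder.
    vec = [0.0] * 26
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            vec[ord(ch) - ord('a')] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(u, v):
    # Vectors are already normalized, so the dot product is the cosine.
    return sum(a * b for a, b in zip(u, v))

def build_training_pairs(dialogues, num_negatives=1, hard_threshold=0.3):
    """Build (context, response, label) pairs for a score model.

    Plain negative sampling draws random responses from other dialogues,
    which produces many trivially unrelated (low-quality) negatives. The
    similarity filter below keeps only negatives that are at least somewhat
    related to the context -- a stand-in for "valuable negative samples".
    """
    pairs = []
    responses = [r for _, r in dialogues]
    for context, response in dialogues:
        pairs.append((context, response, 1))  # ground-truth positive
        ctx_vec = embed(context)
        kept = 0
        for candidate in random.sample(responses, len(responses)):
            if candidate == response:
                continue
            # Keep "hard" negatives: related enough to be informative.
            if cosine(ctx_vec, embed(candidate)) >= hard_threshold:
                pairs.append((context, candidate, 0))
                kept += 1
                if kept == num_negatives:
                    break
    return pairs

dialogues = [
    ("how is the weather today", "it is sunny and warm"),
    ("what is your favorite food", "i really like pasta"),
    ("do you enjoy hiking", "yes the mountains are beautiful"),
]
for context, resp, label in build_training_pairs(dialogues):
    print(label, "|", context, "->", resp)
```

A real implementation would use a trained sentence encoder rather than this toy one and, per the abstract, would also augment the positive side with additional high-quality responses, which this sketch omits.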

Original language: English
Article number: 3423168
Journal: ACM Transactions on Information Systems
Volume: 39
Issue number: 1
DOI: 10.1145/3423168
Publication status: Published - Nov 2020

Keywords

  • Open-domain
  • automatic evaluation
  • generative dialogue systems
