TY - JOUR
T1 - PONE: A Novel Automatic Evaluation Metric for Open-domain Generative Dialogue Systems
T2 - ACM Transactions on Information Systems
AU - Lan, Tian
AU - Mao, Xian-Ling
AU - Wei, Wei
AU - Gao, Xiaoyan
AU - Huang, Heyan
N1 - Publisher Copyright: © 2020 ACM.
PY - 2020/11
Y1 - 2020/11
AB - Open-domain generative dialogue systems have attracted considerable attention over the past few years, yet how to evaluate them automatically remains a major challenge. To the best of our knowledge, there are three kinds of automatic evaluation metrics for open-domain generative dialogue systems: (1) word-overlap-based metrics; (2) embedding-based metrics; (3) learning-based metrics. Because a systematic comparison has been lacking, it is not clear which kind of metric is most effective. In this article, we first systematically compare all three kinds of metrics, and extensive experiments demonstrate that learning-based metrics are the most effective evaluation metrics for open-domain generative dialogue systems. Moreover, we observe that nearly all learning-based metrics depend on a negative sampling mechanism, which yields extremely imbalanced, low-quality samples for training a scoring model. To address this issue, we propose PONE, a novel learning-based metric that significantly improves the correlation with human judgments by using augmented POsitive samples and valuable NEgative samples. Extensive experiments demonstrate that PONE significantly outperforms the state-of-the-art learning-based evaluation method. In addition, we have publicly released the code of our proposed metric and the state-of-the-art baselines.
KW - Open-domain
KW - automatic evaluation
KW - generative dialogue systems
UR - http://www.scopus.com/inward/record.url?scp=85097352565&partnerID=8YFLogxK
U2 - 10.1145/3423168
DO - 10.1145/3423168
M3 - Article
AN - SCOPUS:85097352565
SN - 1046-8188
VL - 39
JO - ACM Transactions on Information Systems
JF - ACM Transactions on Information Systems
IS - 1
M1 - 3423168
ER -