Revisiting Grammatical Error Correction Evaluation and Beyond

Peiyuan Gong; Xuebo Liu; Heyan Huang; Min Zhang

Revisiting Grammatical Error Correction Evaluation and Beyond

Peiyuan Gong, Xuebo Liu^*, Heyan Huang, Min Zhang

^*此作品的通讯作者

计算机学院

科研成果: 会议稿件 › 论文 › 同行评审

6 引用（Scopus）

摘要

Pretraining-based (PT-based) automatic evaluation metrics (e.g., BERTScore and BARTScore) have been widely used in several sentence generation tasks (e.g., machine translation and text summarization) due to their better correlation with human judgments over traditional overlap-based methods. Although PT-based methods have become the de facto standard for training grammatical error correction (GEC) systems, GEC evaluation still does not benefit from pretrained knowledge. This paper takes the first step towards understanding and improving GEC evaluation with pretraining. We first find that arbitrarily applying PT-based metrics to GEC evaluation brings unsatisfactory correlation results because of the excessive attention to inessential systems outputs (e.g., unchanged parts). To alleviate the limitation, we propose a novel GEC evaluation metric to achieve the best of both worlds, namely PT-M², which only uses PT-based metrics to score those corrected parts. Experimental results on the CoNLL14 evaluation task show that PT-M² significantly outperforms existing methods, achieving a new state-of-the-art result of 0.949 Pearson correlation. Further analysis reveals that PT-M² is robust to evaluate competitive GEC systems. Source code and scripts are freely available at https://github.com/pygongnlp/PT-M2.

源语言	英语
页	6891-6902
页数	12
出版状态	已出版 - 2022
活动	2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 - Abu Dhabi, 阿拉伯联合酋长国期限: 7 12月 2022 → 11 12月 2022

会议

会议	2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022
国家/地区	阿拉伯联合酋长国
市	Abu Dhabi
时期	7/12/22 → 11/12/22

其它文件与链接

链接到 Scopus 的出版物

引用此

@conference{7133ecb895584954a51b7d0fed51f8d1,

title = "Revisiting Grammatical Error Correction Evaluation and Beyond",

abstract = "Pretraining-based (PT-based) automatic evaluation metrics (e.g., BERTScore and BARTScore) have been widely used in several sentence generation tasks (e.g., machine translation and text summarization) due to their better correlation with human judgments over traditional overlap-based methods. Although PT-based methods have become the de facto standard for training grammatical error correction (GEC) systems, GEC evaluation still does not benefit from pretrained knowledge. This paper takes the first step towards understanding and improving GEC evaluation with pretraining. We first find that arbitrarily applying PT-based metrics to GEC evaluation brings unsatisfactory correlation results because of the excessive attention to inessential systems outputs (e.g., unchanged parts). To alleviate the limitation, we propose a novel GEC evaluation metric to achieve the best of both worlds, namely PT-M2, which only uses PT-based metrics to score those corrected parts. Experimental results on the CoNLL14 evaluation task show that PT-M2 significantly outperforms existing methods, achieving a new state-of-the-art result of 0.949 Pearson correlation. Further analysis reveals that PT-M2 is robust to evaluate competitive GEC systems. Source code and scripts are freely available at https://github.com/pygongnlp/PT-M2.",

author = "Peiyuan Gong and Xuebo Liu and Heyan Huang and Min Zhang",

note = "Publisher Copyright: {\textcopyright} 2022 Association for Computational Linguistics.; 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 ; Conference date: 07-12-2022 Through 11-12-2022",

year = "2022",

language = "English",

pages = "6891--6902",

}

TY - CONF

T1 - Revisiting Grammatical Error Correction Evaluation and Beyond

AU - Gong, Peiyuan

AU - Liu, Xuebo

AU - Huang, Heyan

AU - Zhang, Min

PY - 2022

Y1 - 2022

N2 - Pretraining-based (PT-based) automatic evaluation metrics (e.g., BERTScore and BARTScore) have been widely used in several sentence generation tasks (e.g., machine translation and text summarization) due to their better correlation with human judgments over traditional overlap-based methods. Although PT-based methods have become the de facto standard for training grammatical error correction (GEC) systems, GEC evaluation still does not benefit from pretrained knowledge. This paper takes the first step towards understanding and improving GEC evaluation with pretraining. We first find that arbitrarily applying PT-based metrics to GEC evaluation brings unsatisfactory correlation results because of the excessive attention to inessential systems outputs (e.g., unchanged parts). To alleviate the limitation, we propose a novel GEC evaluation metric to achieve the best of both worlds, namely PT-M2, which only uses PT-based metrics to score those corrected parts. Experimental results on the CoNLL14 evaluation task show that PT-M2 significantly outperforms existing methods, achieving a new state-of-the-art result of 0.949 Pearson correlation. Further analysis reveals that PT-M2 is robust to evaluate competitive GEC systems. Source code and scripts are freely available at https://github.com/pygongnlp/PT-M2.

AB - Pretraining-based (PT-based) automatic evaluation metrics (e.g., BERTScore and BARTScore) have been widely used in several sentence generation tasks (e.g., machine translation and text summarization) due to their better correlation with human judgments over traditional overlap-based methods. Although PT-based methods have become the de facto standard for training grammatical error correction (GEC) systems, GEC evaluation still does not benefit from pretrained knowledge. This paper takes the first step towards understanding and improving GEC evaluation with pretraining. We first find that arbitrarily applying PT-based metrics to GEC evaluation brings unsatisfactory correlation results because of the excessive attention to inessential systems outputs (e.g., unchanged parts). To alleviate the limitation, we propose a novel GEC evaluation metric to achieve the best of both worlds, namely PT-M2, which only uses PT-based metrics to score those corrected parts. Experimental results on the CoNLL14 evaluation task show that PT-M2 significantly outperforms existing methods, achieving a new state-of-the-art result of 0.949 Pearson correlation. Further analysis reveals that PT-M2 is robust to evaluate competitive GEC systems. Source code and scripts are freely available at https://github.com/pygongnlp/PT-M2.

UR - http://www.scopus.com/inward/record.url?scp=85149442636&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:85149442636

SP - 6891

EP - 6902

T2 - 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022

Y2 - 7 December 2022 through 11 December 2022

ER -

Revisiting Grammatical Error Correction Evaluation and Beyond

摘要

会议

其它文件与链接

指纹

引用此