Automatic Commit Message Generation: A Critical Review and Directions for Future Work

Yuxia Zhang; Zhiqing Qiu; Klaas Jan Stol; Wenhui Zhu; Jiaxin Zhu; Yingchen Tian; Hui Liu

doi:10.1109/TSE.2024.3364675

Automatic Commit Message Generation: A Critical Review and Directions for Future Work

Yuxia Zhang, Zhiqing Qiu, Klaas Jan Stol, Wenhui Zhu, Jiaxin Zhu, Yingchen Tian, Hui Liu^*

^*此作品的通讯作者

计算机学院

科研成果: 期刊稿件 › 文章 › 同行评审

2 引用（Scopus）

摘要

Commit messages are critical for code comprehension and software maintenance. Writing a high-quality message requires skill and effort. To support developers and reduce their effort on this task, several approaches have been proposed to automatically generate commit messages. Despite the promising performance reported, we have identified three significant and prevalent threats in these automated approaches: 1) the datasets used to train and evaluate these approaches contain a considerable amount of 'noise'; 2) current approaches only consider commits of a limited diff size; and 3) current approaches can only generate the subject of a commit message, not the message body. The first limitation may let the models 'learn' inappropriate messages in the training stage, and also lead to inflated performance results in their evaluation. The other two threats can considerably weaken the practical usability of these approaches. Further, with the rapid emergence of large language models (LLMs) that show superior performance in many software engineering tasks, it is worth asking: can LLMs address the challenge of long diffs and whole message generation? This article first reports the results of an empirical study to assess the impact of these three threats on the performance of the state-of-the-art auto generators of commit messages. We collected commit data of the Top 1,000 most-starred Java projects in GitHub and systematically removed noisy commits with bot-submitted and meaningless messages. We then compared the performance of four approaches representative of the state-of-the-art before and after the removal of noisy messages, or with different lengths of commit diffs. We also conducted a qualitative survey with developers to investigate their perspectives on simply generating message subjects. Finally, we evaluate the performance of two representative LLMs, namely UniXcoder and ChatGPT, in generating more practical commit messages. The results demonstrate that generating commit messages is of great practical value, considerable work is needed to mature the current state-of-the-art, and LLMs can be an avenue worth trying to address the current limitations. Our analyses provide insights for future work to achieve better performance in practice.

源语言	英语
页（从-至）	816-835
页数	20
期刊	IEEE Transactions on Software Engineering
卷	50
期	4
DOI	https://doi.org/10.1109/TSE.2024.3364675
出版状态	已出版 - 1 4月 2024

访问文件

10.1109/TSE.2024.3364675

其它文件与链接

链接到 Scopus 的出版物

引用此

@article{a410e2095c59483ab07582f98169499e,

title = "Automatic Commit Message Generation: A Critical Review and Directions for Future Work",

abstract = "Commit messages are critical for code comprehension and software maintenance. Writing a high-quality message requires skill and effort. To support developers and reduce their effort on this task, several approaches have been proposed to automatically generate commit messages. Despite the promising performance reported, we have identified three significant and prevalent threats in these automated approaches: 1) the datasets used to train and evaluate these approaches contain a considerable amount of 'noise'; 2) current approaches only consider commits of a limited diff size; and 3) current approaches can only generate the subject of a commit message, not the message body. The first limitation may let the models 'learn' inappropriate messages in the training stage, and also lead to inflated performance results in their evaluation. The other two threats can considerably weaken the practical usability of these approaches. Further, with the rapid emergence of large language models (LLMs) that show superior performance in many software engineering tasks, it is worth asking: can LLMs address the challenge of long diffs and whole message generation? This article first reports the results of an empirical study to assess the impact of these three threats on the performance of the state-of-the-art auto generators of commit messages. We collected commit data of the Top 1,000 most-starred Java projects in GitHub and systematically removed noisy commits with bot-submitted and meaningless messages. We then compared the performance of four approaches representative of the state-of-the-art before and after the removal of noisy messages, or with different lengths of commit diffs. We also conducted a qualitative survey with developers to investigate their perspectives on simply generating message subjects. Finally, we evaluate the performance of two representative LLMs, namely UniXcoder and ChatGPT, in generating more practical commit messages. The results demonstrate that generating commit messages is of great practical value, considerable work is needed to mature the current state-of-the-art, and LLMs can be an avenue worth trying to address the current limitations. Our analyses provide insights for future work to achieve better performance in practice.",

keywords = "Commit-based software development, benchmark, commit message generation, open collaboration",

author = "Yuxia Zhang and Zhiqing Qiu and Stol, {Klaas Jan} and Wenhui Zhu and Jiaxin Zhu and Yingchen Tian and Hui Liu",

note = "Publisher Copyright: {\textcopyright} 1976-2012 IEEE.",

year = "2024",

month = apr,

day = "1",

doi = "10.1109/TSE.2024.3364675",

language = "English",

volume = "50",

pages = "816--835",

journal = "IEEE Transactions on Software Engineering",

issn = "0098-5589",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "4",

}

TY - JOUR

T1 - Automatic Commit Message Generation

T2 - A Critical Review and Directions for Future Work

AU - Zhang, Yuxia

AU - Qiu, Zhiqing

AU - Stol, Klaas Jan

AU - Zhu, Wenhui

AU - Zhu, Jiaxin

AU - Tian, Yingchen

AU - Liu, Hui

PY - 2024/4/1

Y1 - 2024/4/1

N2 - Commit messages are critical for code comprehension and software maintenance. Writing a high-quality message requires skill and effort. To support developers and reduce their effort on this task, several approaches have been proposed to automatically generate commit messages. Despite the promising performance reported, we have identified three significant and prevalent threats in these automated approaches: 1) the datasets used to train and evaluate these approaches contain a considerable amount of 'noise'; 2) current approaches only consider commits of a limited diff size; and 3) current approaches can only generate the subject of a commit message, not the message body. The first limitation may let the models 'learn' inappropriate messages in the training stage, and also lead to inflated performance results in their evaluation. The other two threats can considerably weaken the practical usability of these approaches. Further, with the rapid emergence of large language models (LLMs) that show superior performance in many software engineering tasks, it is worth asking: can LLMs address the challenge of long diffs and whole message generation? This article first reports the results of an empirical study to assess the impact of these three threats on the performance of the state-of-the-art auto generators of commit messages. We collected commit data of the Top 1,000 most-starred Java projects in GitHub and systematically removed noisy commits with bot-submitted and meaningless messages. We then compared the performance of four approaches representative of the state-of-the-art before and after the removal of noisy messages, or with different lengths of commit diffs. We also conducted a qualitative survey with developers to investigate their perspectives on simply generating message subjects. Finally, we evaluate the performance of two representative LLMs, namely UniXcoder and ChatGPT, in generating more practical commit messages. The results demonstrate that generating commit messages is of great practical value, considerable work is needed to mature the current state-of-the-art, and LLMs can be an avenue worth trying to address the current limitations. Our analyses provide insights for future work to achieve better performance in practice.

AB - Commit messages are critical for code comprehension and software maintenance. Writing a high-quality message requires skill and effort. To support developers and reduce their effort on this task, several approaches have been proposed to automatically generate commit messages. Despite the promising performance reported, we have identified three significant and prevalent threats in these automated approaches: 1) the datasets used to train and evaluate these approaches contain a considerable amount of 'noise'; 2) current approaches only consider commits of a limited diff size; and 3) current approaches can only generate the subject of a commit message, not the message body. The first limitation may let the models 'learn' inappropriate messages in the training stage, and also lead to inflated performance results in their evaluation. The other two threats can considerably weaken the practical usability of these approaches. Further, with the rapid emergence of large language models (LLMs) that show superior performance in many software engineering tasks, it is worth asking: can LLMs address the challenge of long diffs and whole message generation? This article first reports the results of an empirical study to assess the impact of these three threats on the performance of the state-of-the-art auto generators of commit messages. We collected commit data of the Top 1,000 most-starred Java projects in GitHub and systematically removed noisy commits with bot-submitted and meaningless messages. We then compared the performance of four approaches representative of the state-of-the-art before and after the removal of noisy messages, or with different lengths of commit diffs. We also conducted a qualitative survey with developers to investigate their perspectives on simply generating message subjects. Finally, we evaluate the performance of two representative LLMs, namely UniXcoder and ChatGPT, in generating more practical commit messages. The results demonstrate that generating commit messages is of great practical value, considerable work is needed to mature the current state-of-the-art, and LLMs can be an avenue worth trying to address the current limitations. Our analyses provide insights for future work to achieve better performance in practice.

KW - Commit-based software development

KW - benchmark

KW - commit message generation

KW - open collaboration

UR - http://www.scopus.com/inward/record.url?scp=85187275304&partnerID=8YFLogxK

U2 - 10.1109/TSE.2024.3364675

DO - 10.1109/TSE.2024.3364675

M3 - Article

AN - SCOPUS:85187275304

SN - 0098-5589

VL - 50

SP - 816

EP - 835

JO - IEEE Transactions on Software Engineering

JF - IEEE Transactions on Software Engineering

IS - 4

ER -

Automatic Commit Message Generation: A Critical Review and Directions for Future Work

摘要

访问文件

其它文件与链接

指纹

引用此