🔍🤖 Evaluating Generated Commit Messages with Large Language Models

Commit messages are essential in software development as they serve to document and explain code changes. Yet, their quality often falls short in practice, with studies showing significant proportions of empty or inadequate messages. While automated commit message generation has advanced significantly, particularly with Large Language Models (LLMs), evaluation of generated messages remains challenging. Traditional reference-based automatic metrics like BLEU, ROUGE-L, and METEOR have notable limitations in assessing commit message quality, as they assume a one-to-one mapping between code changes and commit messages, leading researchers to rely on resource-intensive human evaluation. This study investigates the potential of LLMs as automated evaluators for commit message quality. Through systematic experimentation with various prompt strategies and state-of-the-art LLMs, we demonstrate that LLMs combining Chain-of-Thought reasoning with few-shot demonstrations achieve near human-level evaluation proficiency. Our LLM-based evaluator significantly outperforms traditional metrics while maintaining acceptable reproducibility, robustness, and fairness levels despite some inherent variability.

Usage as an evaluation package

To facilitate future evaluation of generated commit messages, we have developed this replication package as a user-friendly Python package, making it easy to apply the LLM-based metrics introduced in the paper to your own commit messages.

First, install this package (PyPI support will be available in the future):

pip3 install -e .

Next, use the interface provided by cmeval.evaluate_commits to apply the LLM-based metrics to your own commit message dataset. For example, suppose we have 10 commit messages that need to be evaluated:

commit_data = [
    {"diff": "+++ a/file.py", "message": "fix: error handling"},
    ...
    {"diff": "--- b/test.py", "message": "docs: update readme"},
]

They can then be evaluated with Python code similar to the following:

from cmeval import evaluate_commits

what_scores, why_scores = evaluate_commits(
    commit_data,
    num_runs=3,
    commit_diff_key="diff",
    commit_message_key="message",
    openai_model="gpt-4-0613",
    openai_apikey="openai-api-key",
)

print(what_scores, why_scores)

In this example, the 10 commit messages are evaluated by GPT-4 with the default Few-shot+CoT prompting strategy (shown to be effective in the paper). Each commit is evaluated 3 times, and the results are averaged to produce the final What and Why scores for each commit.

The package also supports reading data from various file formats, configuring different models, selecting among the prompting strategies, and supplying user-customized prompts.
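
As a rough illustration, configuring these options from Python might look like the sketch below. The keyword names prompt_strategy, openai_url, and openai_temperature are assumptions mirroring the CLI flags shown later in this README; check the actual evaluate_commits signature before relying on them.

from cmeval import evaluate_commits

# Assumed keyword arguments mirroring the CLI flags (--prompt_strategy,
# --openai_url, --openai_temperature); verify against the real signature.
what_scores, why_scores = evaluate_commits(
    commit_data,
    num_runs=3,
    commit_diff_key="diff",
    commit_message_key="message",
    prompt_strategy="few-shot-cot",   # assumed identifier for the default Few-shot+CoT strategy
    openai_model="gpt-4-0613",
    openai_url="https://api.openai.com/v1",
    openai_apikey="openai-api-key",
    openai_temperature=0.25,
)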

Reproduction

We provide all the data, code, and scripts to facilitate reproducing the results of this study.

First, you need to install the necessary dependencies:

pip3 install -e .

Dataset

Our dataset contains 200 Python and 200 Java commits sampled from commitbench_long. We use LLMs to generate a message for each commit with the following script:

cd cli
python3 llm_generate_cli.py \
    --prompt_strategy random \
    --data_path /opt/tiger/cmg_eval/datasets/commits.csv \
    --commit_message_key developer_message \
    --commit_diff_key git_diff_u3 \
    --save_dir /opt/tiger/cmg_eval/datasets/commits_with_generated.csv

The prompt used for generation is available at cmeval/llm_metrics/prompts/generate_prompts.py.
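
For orientation, the sketch below shows the kind of request the generation step issues through the OpenAI-compatible API. The prompt text and model name are only placeholders; the actual prompts are defined in cmeval/llm_metrics/prompts/generate_prompts.py.

from openai import OpenAI

client = OpenAI(base_url="https://api.openai.com/v1", api_key="your_api_key")

diff = "+++ a/file.py"  # one commit diff from the dataset

# Placeholder prompt and model; the real wording lives in generate_prompts.py.
response = client.chat.completions.create(
    model="gpt-4-0613",
    messages=[{"role": "user",
               "content": f"Write a concise commit message for this diff:\n{diff}"}],
)
print(response.choices[0].message.content)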

We release the detailed ratings in datasets/human_ratings.csv, which includes the source of each commit (repo, hash), the scores given by each volunteer on the 'What' and 'Why' dimensions, and the final average score of each commit message on both dimensions. datasets/spearman_matrix.py is the script we used to calculate the Spearman correlation between pairs of volunteers and to export the corresponding figure.

datasets/commits.csv is simply a different view of the data in datasets/human_ratings.csv. It contains 400 rows, each describing a commit: the source of the commit, the commit's diff, the developer-authored message (after our cleaning process, detailed in Section 3.1 of the paper), the LLM-generated message, and the final average scores given by human experts to both the developer-authored and the LLM-generated message on the 'What' and 'Why' dimensions.
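
A minimal sketch of the kind of inter-rater check spearman_matrix.py performs, using pandas and scipy; the volunteer column names below are placeholders, so inspect the CSV header for the real ones:

import pandas as pd
from scipy.stats import spearmanr

ratings = pd.read_csv("datasets/human_ratings.csv")

# "volunteer1_what" and "volunteer2_what" are placeholder column names;
# the actual headers in human_ratings.csv may differ.
rho, p_value = spearmanr(ratings["volunteer1_what"], ratings["volunteer2_what"])
print(f"Spearman rho (What dimension): {rho:.3f} (p={p_value:.3g})")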

RQ1: Are LLM-based evaluators comparable to human evaluators in commit message evaluation?

The directory RQs/RQ1/ contains scripts such as run-gpt-4-0613.sh, run-Llama-3.3-70B-Instruct.sh, etc., which are used to score the developer-authored messages in datasets/commits.csv using different LLMs and different prompt strategies. Each combination is evaluated three times, and the average value is taken.

Different prompt strategies (Few-shot, CoT, Few-shot+CoT) are available at cmeval/llm_metrics/prompts/unified_prompts.py, cmeval/llm_metrics/prompts/what_prompts.py and cmeval/llm_metrics/prompts/why_prompts.py. In unified_prompts.py, we ask the LLM to evaluate 'What' and 'Why' dimensions in a single run. what_prompts.py and why_prompts.py are for separate evaluations where we ask the LLM to evaluate one dimension at a time.

To reproduce these results with local GPUs, we deploy open-source models with vLLM and send requests through its OpenAI-compatible API server. First, start the vLLM server locally:

vllm serve Qwen/QwQ-32B --port 8000 --dtype auto --api-key abc123 --tensor-parallel-size 8

Here, we use Qwen/QwQ-32B as an example. Set --tensor-parallel-size according to your available GPU count; you can then run the scripts to evaluate commit messages with open-source LLMs.
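
Once the server is up, a minimal sketch of pointing the package at it, using the cmeval.config settings described in the next paragraph; commit_data is the same structure as in the Usage section, and passing a local model name via openai_model this way is an assumption to verify against the package.

import cmeval
from cmeval import evaluate_commits

# Point the package at the local vLLM OpenAI-compatible server started above.
cmeval.config.openai_url = "http://localhost:8000/v1"
cmeval.config.openai_apikey = "abc123"  # the --api-key passed to vllm serve

what_scores, why_scores = evaluate_commits(
    commit_data,                 # same format as in the Usage section above
    num_runs=3,
    commit_diff_key="diff",
    commit_message_key="message",
    openai_model="Qwen/QwQ-32B",
)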

If you want to use GPT-4 or other models through an API provider, set cmeval.config.openai_url and cmeval.config.openai_apikey according to your provider. You can modify the default values in cmeval/config.py directly, or pass additional arguments to the CLI. For example, to evaluate with OpenAI's GPT-4 using a Zero-shot strategy:

python3 llm_evaluate_cli.py --dimension unified \
  --prompt_strategy zero-shot-noncot \
  --data_path /opt/tiger/cmg_eval/datasets/commits.csv \
  --commit_message_key developer_message \
  --commit_diff_key git_diff_u3 \
  --save_dir /opt/tiger/cmg_eval/RQs/RQ1/gpt-4-0613/unified_zero-shot-noncot_git_diff_u3_v1 \
  --openai_model gpt-4-0613 \
  --openai_temperature 0.25 \
  --openai_url https://api.openai.com/v1 \
  --openai_apikey your_api_key

We also include the outputs in RQs/RQ1/results, where each file contains the results of one model with a specific prompt strategy, including all three runs and the final average scores. RQs/RQ1/analyse_and_export_latex.py is the script we used to analyze the correlation between the different LLM/prompt-strategy combinations and human experts, and to export Table 2 in the paper.

RQ2: What is the performance of the LLM-based evaluator compared to automatic evaluation metrics?

RQs/RQ2/evaluate_auto_metrics.sh evaluates 400 commits using traditional automatic metrics (BLEU, ROUGE, METEOR). RQs/RQ2/evaluate_llm_metrics.sh evaluates 400 commits using GPT-4 with Few-shot+CoT prompt strategy under unified evaluation. RQs/RQ2/all_metrics_scores.csv provides the scores for each commit under all different metrics. RQs/RQ2/analyse_and_export_latex.py performs correlation analysis on all metrics and exports Tables 3 and 4 in the paper.

To reproduce these results, you can simply run:

cd RQs/RQ2
bash evaluate_auto_metrics.sh
bash evaluate_llm_metrics.sh
python3 analyse_and_export_latex.py
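
If you want to sanity-check the traditional metrics outside the provided scripts, the sketch below computes BLEU, METEOR, and ROUGE-L with nltk and rouge-score on a toy pair of messages; it is not the repo's own implementation, and tokenization or smoothing details may differ from ours.

import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet")  # METEOR needs WordNet

reference = "fix: handle missing config file gracefully".split()
candidate = "fix: avoid crash when config file is missing".split()

bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)
meteor = meteor_score([reference], candidate)
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(
    " ".join(reference), " ".join(candidate))["rougeL"].fmeasure

print(f"BLEU={bleu:.3f}  METEOR={meteor:.3f}  ROUGE-L={rouge_l:.3f}")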

We conducted a thematic analysis on 102 examples that received high human ratings but low METEOR scores. The thematic analysis results and related scripts are available in RQs/RQ2/thematic_analysis.

RQ3: Can the performance of the LLM-based evaluator remain stable?

In RQ3.1, we tested the stability of different LLMs under different temperature settings. The related scripts are available in RQs/RQ3/s1_stable. To reproduce, you can simply run:

cd RQs/RQ3/s1_stable
bash run-gpt-4-0613.sh
bash run-Llama-3.3-70B-Instruct.sh
bash run-Qwen2.5-72B-Instruct.sh
bash run-QwQ-32B.sh
python3 analyse_and_export_latex.py

Note: when running different LLMs, make sure your local environment already serves the corresponding model through an OpenAI-compatible server (e.g., via vLLM or another framework). If you use an API provider, make sure the relevant URL and API key are correctly configured.

In RQ3.2, we tested the stability of different LLMs when evaluating commit messages that share the same semantics but differ in expression. The related scripts are available in RQs/RQ3/s2_disturb. To reproduce, you can simply run:

cd RQs/RQ3/s2_disturb
bash run-gpt-4-0613.sh
bash run-Llama-3.3-70B-Instruct.sh
bash run-Qwen2.5-72B-Instruct.sh
bash run-QwQ-32B.sh
python3 analyse_and_export_latex.py

RQs/RQ3/s2_disturb/generate_triplets.sh is the script we used to generate two semantically equivalent but differently expressed versions of a developer commit message. The detailed prompts are available in cmeval/llm_metrics/prompts/disturb_prompts.py.

In RQ3.3, we conducted a potential bias analysis based on the results of RQ3.2. To reproduce RQ3.3, ensure that the RQ3.2 scripts have been executed successfully, then simply run:

cd RQs/RQ3/s3_bias
python3 analyse_and_export_latex.py

RQs/RQ3/s3_bias/analyse_and_export_latex.py is the script we used to analyze potential bias. Additional analysis for RQ3.3 is available in extra_analysis.py.

Appendix: Automated Metrics Used in Commit Message Evaluation

As detailed in Section 2.2 of the paper, the automated metrics used for commit message evaluation include BLEU, ROUGE, METEOR, and CIDEr. Although the calculation principles and formulas of these metrics are well documented in their respective papers, we provide an appendix with a more detailed introduction than Section 2.2 for reference. The appendix is available in auto_metrics.pdf. We recommend downloading the file and opening it in a local PDF viewer, since the PDF rendering on anonymous.4open.science has some issues.
