Enhancing the ECSER pipeline: Evaluating Large Language Models in SE Research for Classification

Publication date

DOI

Document Type

Master Thesis

Collections

License

CC-BY-NC-ND

Abstract

Research papers in the field of Software Engineering (SE) that use classification algorithms, LLMs, or other machine learning methods differ in how many and which evaluation metrics they report, whether significance tests are performed, and which steps are taken to aid reproducibility. The ECSER pipeline was designed to mitigate this issue by providing a step-by-step procedure that researchers can follow when reporting results obtained with classification algorithms; it was empirically shown to be effective in replicating the findings of several studies, as well as producing additional findings and occasionally contradicting the conclusions of the original papers. However, ECSER was designed for evaluating simple classifiers and gives no specific recommendations for LLM classifiers, even though LLMs are an increasingly popular choice for classification tasks. This thesis extends the ECSER pipeline to LLMs by adding recommendations from LLM4SE research and related fields. First, an exploratory mapping study was conducted of SE studies released in 2024 that use LLM classifiers, summarising which steps were taken and which metrics were (or were not) reported; it revealed lacklustre reporting standards and a pattern of unsubstantiated conclusions. To address these issues, we designed ECSER-LLM, an enhanced version of the ECSER pipeline for the use of LLMs, which includes recommendations for prompt design and comparison, and for evaluating the calibration, fairness, robustness, and sustainability of LLMs. Lastly, to evaluate the applicability of the new pipeline, we conducted a replication study on the prediction of Python tests without execution. Using the pipeline improved the replicability and comprehensiveness of the results compared to the original paper, although we were unable to corroborate some of the original conclusions.

Keywords

LLMs; classification; Software Engineering; SE

Citation