AI-Driven Code Quality Assessment and Enhancement
Document Type
Master Thesis
License
CC-BY-NC-ND
Abstract
This thesis investigates whether large language models (LLMs) can improve the quality of existing industrial code and, more specifically, whether making an LLM “project-aware” leads to more contextually relevant refactoring suggestions than those produced by a general-purpose model. It addresses a gap in systematic refactoring evaluations for project-adapted LLMs in enterprise settings where behaviour preservation and organisation-specific coding standards constrain acceptable changes.

A custom supervised benchmark is constructed from a proprietary VB.NET codebase by collaboratively annotating 530 snippet-level refactoring tasks with a codebook and gold refactorings across five issue categories: project “house rule” violations (blending_issue), common_smell, cyclomatic_complexity, naming_consistency, and readability. A project-aware model is obtained by parameter-efficiently fine-tuning StarCoder2-3B with LoRA on project refactorings, and it is compared against three baselines: a deterministic rule-based refactoring engine implementing selected house rules, a prompt-only StarCoder2Base configuration (to isolate the effect of fine-tuning), and the general-purpose Gemini model prompted with the same VB.NET refactoring instructions.

Systems are evaluated on a held-out test set of 100 labelled snippets using VB.NET-aware text similarity to the gold standards, a suite of lightweight static indicators (including cyclomatic complexity, nesting depth, import-hygiene violations, inline separator usage, and whitespace-related patterns), per-category success rates defined by improvements in primary indicators, and paired non-parametric significance tests on per-snippet metric deltas. The results show that the rule-based baseline delivers strong and predictable gains for mechanical hygiene, achieving a 95% success rate on blending_issue items while remaining essentially neutral on deeper structural metrics.
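Lightweight static indicators of the kind described above can be approximated by keyword counting. The following is a minimal sketch of a cyclomatic-complexity proxy for VB.NET snippets; the keyword set and scoring are illustrative assumptions, not the thesis's exact indicator definition:

```python
import re

# Decision-point keywords that open a branch in VB.NET (illustrative subset,
# not the thesis's exact rule set). "Case Else" is excluded via lookahead.
_DECISION_RE = re.compile(
    r"^\s*(?:If\b|ElseIf\b|For\b|While\b|Do\s+(?:While|Until)\b|"
    r"Case\b(?!\s+Else)|Catch\b)",
    re.IGNORECASE,
)

def cyclomatic_complexity_proxy(snippet: str) -> int:
    """Approximate McCabe complexity: 1 + the number of lines that
    begin with a decision-point keyword."""
    return 1 + sum(
        bool(_DECISION_RE.match(line)) for line in snippet.splitlines()
    )
```

Such a proxy is cheap enough to run on every candidate refactoring, which is what makes per-snippet metric deltas (and paired significance tests over them) tractable at benchmark scale.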
Gemini produces modest but targeted structural improvements, including being the only system with non-zero success on cyclomatic_complexity items (15%) and the only model that lowers maximum nesting depth on average, but it also tends to introduce inline separators and does not match the baseline's import-order hygiene. The project-aware StarCoder2 configuration rarely improves the primary smell indicators and often increases inline separator metrics; paired tests further indicate that project-specific fine-tuning yields only limited, statistically non-significant changes relative to the prompt-only StarCoder2Base setup. Across models, proxy indicators show no consistent resolution of naming_consistency, readability, or common_smell items. Overall, the thesis shows that, under the evaluated conditions, learned refactorers are not drop-in replacements for rule-based tools, and it argues for hybrid refactoring pipelines that use rule-based passes for reliable style enforcement and apply LLM-based transformations selectively, gated by static analysis and supported by human review.
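The hybrid pipeline argued for above could be gated so that an LLM-proposed refactoring is accepted only when no static indicator regresses. A minimal sketch follows; the indicator interface, indicator functions, and names are hypothetical illustrations, not the thesis's implementation:

```python
def line_count(snippet: str) -> int:
    """Hypothetical indicator: total line count (lower is better)."""
    return len(snippet.splitlines())

def max_indent(snippet: str) -> int:
    """Hypothetical nesting proxy: deepest leading-whitespace width."""
    widths = [len(l) - len(l.lstrip())
              for l in snippet.splitlines() if l.strip()]
    return max(widths, default=0)

def gate_llm_refactoring(original: str, suggestion: str, indicators) -> str:
    """Accept the suggested refactoring only if no indicator worsens;
    otherwise fall back to the original snippet unchanged."""
    for _name, score in indicators:
        if score(suggestion) > score(original):
            return original  # reject: this indicator regressed
    return suggestion

# Example: flattening nested Ifs passes the gate; mere padding does not.
original = "If a Then\n    If b Then\n        x = 1\n    End If\nEnd If"
flattened = "If a AndAlso b Then\n    x = 1\nEnd If"
indicators = [("lines", line_count), ("nesting", max_indent)]
accepted = gate_llm_refactoring(original, flattened, indicators)
```

The design point is that the deterministic rule-based pass runs unconditionally for style enforcement, while learned transformations only land when the gate (and, ultimately, human review) clears them.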