Automated Scoring Systems for Open-Ended Questions in Dutch Education
Publication date
Authors
DOI
Document Type
Master Thesis
License
CC-BY-NC-ND
Abstract
Automated Essay Scoring (AES) systems are increasingly used in educational settings, yet most are limited to assigning holistic scores and focus on English-language contexts. This thesis investigates the feasibility of building explainable, multi-aspect AES models that align with rubric-defined criteria across three aspects (content, language, and presentation) for Dutch open-ended responses.
We compare fine-tuned BERT-based transformer models with GPT-4o used in a prompt-based inference setting. RobBERT, a Dutch-specific model, performed best on the language aspect, while content and presentation were more difficult to model due to rubric variability and data limitations. SHAP analysis confirmed that RobBERT's predictions were aligned with the appropriate rubric elements. GPT-4o achieved competitive results, particularly on content and language, with few-shot prompting and justification requests improving alignment. Expert review revealed inconsistencies in human annotations, calling into question the reliability of the gold-standard labels and highlighting the need for explainable AES systems.
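To make the fine-tuned setup concrete, the sketch below shows how a Dutch RobBERT checkpoint could be wrapped as an aspect classifier and explained with SHAP at the token level. This is an illustrative sketch, not the thesis's code: the checkpoint name, label count, and example answer are assumptions.

```python
# Minimal sketch: rubric-aspect scoring with a Dutch transformer plus SHAP token attributions.
import shap
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Assumed base checkpoint; in the thesis setting this would be fine-tuned on rubric-scored responses.
model_name = "pdelobelle/robbert-v2-dutch-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels is an assumption (e.g. a 3-point scale for a single aspect such as "language").
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Wrap as a text-classification pipeline that returns scores for all labels.
clf = pipeline("text-classification", model=model, tokenizer=tokenizer, top_k=None)

# SHAP's text explainer attributes each predicted label score to input tokens,
# which is how alignment of a prediction with rubric elements can be inspected.
explainer = shap.Explainer(clf)
shap_values = explainer(["De hoofdpersoon verandert omdat hij leert zijn fouten toe te geven."])
print(shap_values)  # shap.plots.text(shap_values) renders token-level attributions in a notebook
```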
The results suggest that a hybrid approach, combining the robustness of fine-tuned models for well-defined aspects with the flexibility of prompt-based generative LLMs for more interpretive dimensions, may be best suited for practical deployment in Dutch educational settings.
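For the prompt-based side of such a hybrid, a minimal sketch of few-shot, rubric-grounded scoring with GPT-4o is shown below. The prompt wording, rubric aspect, and example answer are invented for illustration; the request for a short justification mirrors the justification-request setup described above.

```python
# Minimal sketch: few-shot aspect scoring with a generative LLM via the OpenAI SDK.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Illustrative few-shot prompt (invented, not the thesis prompt): one worked example
# plus a new answer to score on the "content" aspect, with a short justification.
prompt = (
    "Beoordeel het antwoord op het aspect 'inhoud' met een score van 1 tot 5 "
    "en geef een korte motivering.\n\n"
    "Voorbeeld\n"
    "Antwoord: De opwarming van de aarde komt vooral door de uitstoot van broeikasgassen.\n"
    "Score: 4\n"
    "Motivering: Het antwoord noemt de kernoorzaak, maar werkt deze niet uit.\n\n"
    "Nu beoordelen\n"
    "Antwoord: De industriële revolutie leidde tot urbanisatie omdat fabrieken arbeiders naar de stad trokken.\n"
)

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {"role": "system", "content": "Je bent een beoordelaar die Nederlandse open antwoorden scoort volgens een rubric."},
        {"role": "user", "content": prompt},
    ],
)
print(response.choices[0].message.content)  # expected: a score plus a brief justification
```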
Keywords
NLP; Transformer models; LLMs; Automated Essay Scoring; Multi-aspect Scoring; SHAP