Reclaiming the Clinician’s Time: Refining Prompt Engineering for Automated Medical Reporting: The Case of Pre-Operative Screening in Anesthesiology
Document Type
Master Thesis
License
CC-BY-NC-ND
Abstract
Healthcare workers face increasing burnout, driven by the administrative burden of report
writing, which consumes more than half of their working time. The rapid digitalization of
data and the widespread adoption of electronic medical records (EMRs) offer a chance
to automate medical documentation. Recent advances in Artificial Intelligence (AI),
particularly the rise of Large Language Models (LLMs), offer promising solutions for the
complex task of automated medical reporting.
This study explored how prompt templates can be designed and validated to guide
LLMs in generating high-quality medical reports from doctor–patient consultation
transcripts in the context of pre-operative screening (POS) in anesthesiology. Following
the Design Science Methodology, prompt templates were iteratively designed and refined,
resulting in a multi-layered prompt template that combines five prompting techniques.
The dataset consisted of 143 anonymized transcripts from POS consultations at UMCU.
The generated reports were evaluated using both automated quantitative metrics and a
qualitative assessment conducted by an anesthesiologist from UMC Utrecht.
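The thesis's exact template is not reproduced in this abstract; the following is a minimal sketch of how a multi-layered prompt might be assembled from stacked instruction layers. All layer names, headings, and wording here are illustrative assumptions, not the study's actual template or its five techniques.

```python
# Illustrative sketch of a multi-layered prompt template.
# Layer contents are hypothetical, not the thesis's template.

def build_prompt(transcript: str) -> str:
    layers = [
        # Layer 1: role — fix a clinical persona.
        "You are a clinical documentation assistant for pre-operative screening.",
        # Layer 2: task instruction — state the reporting goal.
        "Summarise the consultation transcript below into a structured POS report.",
        # Layer 3: output structuring — constrain the report format.
        "Use these headings: History, Medication, Airway, ASA Classification.",
        # Layer 4: grounding constraint — discourage fabricated content.
        "Only include information explicitly stated in the transcript.",
        # Layer 5: the input itself.
        f"Transcript:\n{transcript}",
    ]
    return "\n\n".join(layers)

prompt = build_prompt("Doctor: Any allergies? Patient: Penicillin.")
print(prompt)
```

Keeping each concern in its own layer is what makes iterative refinement practical: a single layer can be reworded and re-evaluated without disturbing the rest of the template.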
Results showed that carefully designed prompts effectively guided LLMs to
produce clinically accurate and structured medical reports. BERTScore results for free-
text fields ranged between 0.81 and 0.88, and classification tasks exceeded 80% accuracy
in most fields. Expert evaluation rated most outputs as “very” or “extremely” helpful,
confirming their clinical usefulness.
Several challenges emerged during the study. The model sometimes produced excess
or redundant information (e.g., negative symptoms). Because it relied solely on consultation
transcripts, it missed details found in other data sources, such as patient questionnaires.
Variability in doctors' documentation styles complicated evaluation. Lastly, some metrics,
such as BLEURT and ROUGE-L, were less effective at capturing the quality of abstractive
summaries.
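The ROUGE-L limitation noted above follows from how the metric is computed: it scores the longest common subsequence (LCS) of tokens between candidate and reference, so a faithful paraphrase that rewords the reference scores poorly. A minimal self-contained sketch (not the evaluation code used in the thesis, and the example sentences are invented):

```python
# Minimal ROUGE-L (F-measure) via longest common subsequence on tokens.
# Shows why abstractive summaries score low: LCS rewards exact in-order
# word overlap, not preserved meaning.

def lcs_len(a, b):
    # Classic dynamic-programming LCS length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

reference = "patient reports penicillin allergy"
print(rouge_l("patient reports penicillin allergy", reference))  # 1.0
print(rouge_l("allergic to penicillin", reference))  # well below 1.0 despite same meaning
```

Embedding-based metrics such as BERTScore mitigate this by matching tokens in contextual-embedding space rather than by exact string overlap, which is consistent with the abstract's preference for BERTScore on free-text fields.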
This research highlights the effectiveness of a layered, iterative approach to prompt
engineering for automated medical reporting. It emphasizes the importance of domain
expertise and clear instructions in guiding LLM outputs, and demonstrates the feasibility
of using structured prompting techniques for complex medical reporting tasks.
Keywords
automated medical reporting; medical dialogue summarisation; prompt engineering; LLM; evaluation metrics; pre-operative screening (POS); anesthesiology