Reclaiming the Clinician’s Time: Refining Prompt Engineering for Automated Medical Reporting: the case of Pre-Operative Screening in Anesthesiology


Document Type

Master Thesis


License

CC-BY-NC-ND

Abstract

Healthcare workers face increasing burnout due to the administrative burden of report writing, which consumes more than half of their working time. The rapid digitalization of data and the widespread adoption of electronic medical records (EMRs) offer an opportunity to automate medical documentation. Recent advancements in Artificial Intelligence (AI), particularly the rise of Large Language Models (LLMs), offer promising solutions for the complex task of automated medical reporting. This study explored how prompt templates can be designed and validated to guide LLMs in generating high-quality medical reports from doctor–patient consultation transcripts in the context of pre-operative screening (POS) in anesthesiology. We followed the Design Science Methodology, iteratively designing and refining prompt templates; the final result was a multi-layered prompt template combining five prompting techniques. The dataset consisted of 143 anonymized transcripts from POS consultations at University Medical Center Utrecht (UMCU). The generated reports were evaluated using both automated quantitative metrics and a qualitative assessment conducted by an anesthesiologist from UMCU. Results showed that carefully designed prompt engineering effectively guided LLMs to produce clinically accurate and structured medical reports. BERTScore results for free-text fields ranged between 0.81 and 0.88, and classification tasks exceeded 80% accuracy in most fields. Expert evaluation rated most outputs as “very” or “extremely” helpful, confirming their clinical usefulness. Several challenges emerged during the study. The model sometimes produced excess or redundant information (e.g., negative symptoms). Since it used only consultation transcripts, it missed details found in other data sources, such as patient questionnaires. Variability in doctors’ documentation styles made evaluation harder. Lastly, some metrics, such as BLEURT and ROUGE-L, were less effective at capturing the quality of abstractive summaries.
This research highlights the effectiveness of a layered, iterative approach to prompt engineering for automated medical reporting. It emphasizes the importance of domain expertise and clear instructions in guiding LLM outputs, and demonstrates the feasibility of using structured prompting techniques for complex medical reporting tasks.
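The layered template approach described above can be sketched as follows. This is a minimal illustration only: the layer names, report sections, and wording are hypothetical placeholders, since the abstract does not disclose the actual template contents or the five prompting techniques used.

```python
# Hypothetical sketch of a multi-layered prompt template for automated
# POS reporting. Every layer and field name below is an illustrative
# assumption, not the template developed in the thesis.

def build_prompt(transcript: str) -> str:
    """Assemble a prompt from separate, independently refinable layers."""
    role_layer = (
        "You are a clinical documentation assistant supporting an "
        "anesthesiologist during pre-operative screening (POS)."
    )
    task_layer = (
        "Summarize the consultation transcript below into a structured "
        "POS report. Use only information stated in the transcript; "
        "do not infer findings that were not discussed."
    )
    format_layer = (
        "Return the report with these sections:\n"
        "- Medical history\n"
        "- Current medication\n"
        "- Anesthesia plan"
    )
    constraint_layer = (
        "Omit negative findings unless explicitly relevant. "
        "Write in concise clinical language."
    )
    # Layers are joined in a fixed order so that each one can be
    # adjusted in isolation across design iterations.
    return "\n\n".join([
        role_layer,
        task_layer,
        format_layer,
        constraint_layer,
        "Transcript:\n" + transcript,
    ])

prompt = build_prompt("Doctor: Any allergies? Patient: Penicillin.")
```

Keeping each layer as a separate string mirrors the iterative design process the thesis describes: a single layer (for example, the output format) can be revised and re-evaluated without disturbing the rest of the template.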

Keywords

automated medical reporting; medical dialogue summarisation; prompt engineering; LLM; evaluation metrics; preoperative screening (POS); anesthesiology
