Is it feasible to use large language models (LLMs) and embedding models for automatically mapping therapeutic indications and orphan designations of EMA approved medicines between 1995-2024 to disease terminology systems (SNOMED CT, ICD-11 and Orphanet Nomenclature)?
Publication date
Authors
DOI
Document Type
Master Thesis
License
CC-BY-NC-ND
Abstract
In the European Union, information on medicines is offered by different organisations on online platforms, with the European Medicines Agency (EMA) website as the primary source. A key component of one regulatory document, the Summary of Product Characteristics (SmPC), is the therapeutic indication, which outlines for which diseases and under which conditions a medicine is authorised. A medicine can also target a rare disease (“orphan condition”), which gives it an orphan designation. However, the therapeutic indication can also contain eight other elements that vary in wording, which makes it difficult to parse and standardise and reduces its accessibility. Well-structured information on orphan conditions and the contents of therapeutic indications is often lacking. To address this issue, a structured insight into therapeutic indications with orphan designations is needed, which can be obtained by mapping them to standardised disease terminology systems. These systems allow for a structured overview of diseases by linking medical descriptions to disease codes. However, the high number of approved medicines makes manual mapping time-consuming, so an automatic approach to mapping therapeutic indications would be beneficial. Natural Language Processing (NLP) techniques could be used to achieve this. The goal of this research is therefore to investigate the feasibility of using NLP techniques to automatically map therapeutic indications of medicines with orphan designations to disease terminology systems.
We investigated two main tasks to achieve our goal: information extraction and mapping. We selected a number of models to perform these two tasks and three disease terminology systems to map the indications to: ICD-11, Orphanet and SNOMED CT. Feasible models were tested on both tasks using therapeutic indications from four different sets of medicines. Two test sets focused on extracting and mapping the target disease/condition, while the other two sets focused on extracting and mapping the eight other elements with the selected model. We evaluated the performance of the models by comparing the extracted and mapped results with manually constructed reference sets. Where no reference set could be constructed, we compared the overlap between model results to analyse their consistency.
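The two evaluation modes described above can be illustrated with a minimal sketch. It is not the thesis's own evaluation code: the function names, the medicine identifiers and the placeholder codes are hypothetical, and the Jaccard index is only one possible way to quantify "overlap in results", assumed here for illustration.

```python
def accuracy(predicted: dict[str, str], reference: dict[str, str]) -> float:
    """Fraction of reference items whose predicted code equals the reference code."""
    if not reference:
        raise ValueError("reference set is empty")
    correct = sum(1 for key, code in reference.items() if predicted.get(key) == code)
    return correct / len(reference)


def jaccard_overlap(a: set[str], b: set[str]) -> float:
    """Agreement between two sets of mapped codes: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # two empty outputs are treated as fully consistent
    return len(a & b) / len(a | b)


# Hypothetical reference mapping (medicine -> disease code); placeholder codes only.
reference = {"med1": "CODE-A", "med2": "CODE-B", "med3": "CODE-C", "med4": "CODE-D"}
# Hypothetical model output; med3 is mapped incorrectly and med4 is missing.
predicted = {"med1": "CODE-A", "med2": "CODE-B", "med3": "CODE-X"}

reference_accuracy = accuracy(predicted, reference)

# Without a reference set, compare the codes produced by two hypothetical runs.
run_1 = {"CODE-A", "CODE-B", "CODE-C"}
run_2 = {"CODE-B", "CODE-C", "CODE-D"}
run_consistency = jaccard_overlap(run_1, run_2)
```

In this sketch a missing prediction counts as incorrect, which is a design choice rather than something the abstract specifies.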
We found that the ‘bigger’ models, such as Claude.ai-3.5_Sonnet, ChatGPT-4o and Gemini-Flash_2.0, achieved similar performance and correctly extracted over 80% of the target diseases/conditions. On average, the mapping models correctly mapped 65-70% of the target diseases/conditions. ICD-11 and Orphanet had the highest consistency in mapping output, even when the mapping input differed. The other elements were, however, more challenging to map. We found that long answers containing many distinct terms were mapped incorrectly more often, while shorter answers produced more correct results.
Overall, the selected models show high potential for developing an automatic mapping pipeline that can help structure and parse therapeutic indications with orphan designations. While it is impractical to select one ‘best’ performing model, due to the rapid evolution of artificial intelligence (AI) technology, our research identified feasible methods and recommendations for future improvement. These include prompt optimisation, breaking down complex elements, exploring multiple possible mapping outputs, training models and designing an adaptive pipeline that can accommodate the constant changes in AI models.
Keywords
mapping, therapeutic indication, orphan designation, Orphanet, SNOMED CT, ICD-11, NLP