In a recent study published in JAMA Network Open, researchers evaluated the accuracy and safety of large language models (LLMs) in answering medical oncology examination questions.
Study: Performance of Large Language Models on Medical Oncology Examination Questions.
Background
LLMs have the potential to revolutionize healthcare by assisting clinicians with tasks and interacting with patients. These models, trained on vast text corpora, can be fine-tuned to answer questions with human-like responses.
LLMs encode extensive medical knowledge and have shown the ability to pass the United States Medical Licensing Examination (USMLE), demonstrating comprehension and reasoning. However, their performance varies across medical subspecialties.
With its rapidly evolving knowledge base and high publication volume, medical oncology presents a unique challenge.
Further research is needed to ensure that LLMs can reliably and safely apply their medical knowledge to dynamic and specialized fields like medical oncology, improving clinician support and patient care.
About the study
The present study, conducted from May 28 to October 11, 2023, followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines; ethics board approval and informed consent were not required because no human participants were involved.
The publicly accessible question bank of the American Society of Clinical Oncology (ASCO) provided 52 multiple-choice questions, each with one correct answer and explanatory references. Similarly, the European Society for Medical Oncology (ESMO) Examination Trial Questions from 2021 and 2022 provided 75 questions after image-based items were excluded, with answers developed by oncologists.
To reduce the risk that publicly available questions had already appeared in the models' training data, oncologists also wrote 20 original questions in the same multiple-choice format.
ChatGPT-3.5 and ChatGPT-4 were used to answer these questions and are referred to here as proprietary LLM 1 and proprietary LLM 2, respectively. Six open-source LLMs were also evaluated, including BioMistral-7B DARE, a Mistral-7B variant adapted to the biomedical domain.
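The article does not reproduce the study's prompts, so as a rough sketch of how such an evaluation can be scripted, the snippet below poses one invented multiple-choice question to a chat model through the OpenAI Python client. The model name, prompt wording, and question are illustrative assumptions, not the study's materials.

```python
# Illustrative only: the study's actual prompts, model settings, and
# questions are not given in this article; everything below is assumed.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "A patient with metastatic cancer presents with an acute pulmonary "
    "embolism. What is the most appropriate initial anticoagulation?\n"
    "A) Warfarin\n"
    "B) Low molecular weight heparin\n"
    "C) Aspirin\n"
    "D) No anticoagulation\n"
)

response = client.chat.completions.create(
    model="gpt-4",   # assumed stand-in for "proprietary LLM 2"
    temperature=0,   # deterministic output simplifies repeated scoring
    messages=[
        {
            "role": "system",
            "content": (
                "Answer the multiple-choice question. Name the single best "
                "option (A-D) and briefly explain your reasoning."
            ),
        },
        {"role": "user", "content": question},
    ],
)

print(response.choices[0].message.content)
```

A complete harness would loop over the question bank, parse the chosen option letter from each reply, and score it against the answer key.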
Responses were recorded, and the accompanying explanations were graded on a four-level error scale. Statistical analysis, conducted in R version 4.3.0, assessed accuracy, error distribution, and agreement between the grading oncologists.
The analyses used tests based on the binomial distribution, the McNemar test, the Fisher exact test, weighted κ, and the Wilcoxon rank sum test, with a two-sided P value below .05 indicating statistical significance.
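The authors ran these analyses in R; as a rough illustration of what the tests look like in code, the sketch below runs Python analogues from scipy, statsmodels, and scikit-learn. The cross-tabulations and rater gradings are invented for demonstration (only the totals of 125 correct, 22 incorrect, and 41 recent-evidence questions echo figures reported below), and linear weighting for κ is an assumption.

```python
# Illustrative Python analogues of the reported R analyses. The cell counts
# and ratings below are invented for demonstration; they are not study data.
from scipy.stats import fisher_exact, ranksums
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

# McNemar test: paired correct/incorrect outcomes when two models answer the
# same 147 items (rows = model A correct/incorrect, columns = model B).
paired = [[80, 30],
          [10, 27]]
print(mcnemar(paired, exact=True).pvalue)

# Fisher exact test: correct vs. incorrect answers on pre-2018 questions
# versus questions needing evidence from 2018 onward (the split is assumed).
_, p_fisher = fisher_exact([[100, 6],    # pre-2018: correct, incorrect
                            [25, 16]])   # 2018 onward: correct, incorrect
print(p_fisher)

# Weighted kappa: agreement between two oncologists grading answers on the
# four-level error scale (linear weights are an assumption).
rater_a = [0, 1, 2, 3, 1, 2, 0, 3]
rater_b = [0, 1, 1, 3, 2, 2, 0, 3]
print(cohen_kappa_score(rater_a, rater_b, weights="linear"))

# Wilcoxon rank sum test: compare an ordinal grade between two groups.
print(ranksums([1, 2, 2, 3], [0, 1, 1, 2]).pvalue)
```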
Study results
The evaluation of LLMs across 147 examination questions included 52 from ASCO, 75 from ESMO, and 20 original questions. Hematology was the most common category (15.0%), but the questions spanned various topics.
ESMO questions were more general, addressing mechanisms and toxic effects of systemic therapies. Notably, 27.9% of questions required knowledge of evidence published from 2018 onward. The LLMs provided prose answers to all questions, although proprietary LLM 2 required additional prompting before committing to a specific answer in 22.4% of cases.
A selected ASCO question involved a 62-year-old woman with metastatic breast cancer presenting with symptoms of a pulmonary embolism. Proprietary LLM 2 correctly identified the best treatment as low molecular weight heparin or a direct oral anticoagulant, considering the patient’s cancer and travel history.
Another ASCO question described a 61-year-old woman with metastatic colon cancer experiencing neuropathy from her chemotherapy regimen. Given the tumor's BRAF (B-Raf proto-oncogene, serine/threonine kinase) V600E mutation and the neuropathy caused by her current regimen, the LLM recommended switching to targeted therapy with encorafenib and cetuximab.
Proprietary LLM 2 demonstrated the highest accuracy, correctly answering 85.0% of questions (125 of 147) and significantly outperforming both random answering and the other models. Its performance was consistent across ASCO (80.8%), ESMO (88.0%), and original questions (85.0%).
When given a second attempt, the model corrected 54.5% of its initially incorrect answers. Proprietary LLM 1 and the best open-source LLM, Mixtral-8x7B-v0.1, a Mistral mixture-of-experts model, had lower accuracies of 60.5% and 59.2%, respectively. BioMistral-7B DARE, despite its biomedical tuning, reached an accuracy of only 33.6%.
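As a quick sanity check on the comparison with random answering, the snippet below tests the reported 125 of 147 correct answers against a guessing baseline. The one-in-four chance rate assumes four options per question, which the article does not state.

```python
# Check the reported accuracy against a guessing baseline.
# Assumption: four answer options per question (chance = 0.25).
from scipy.stats import binomtest

result = binomtest(125, n=147, p=0.25, alternative="greater")
print(f"accuracy = {125 / 147:.1%}")           # 85.0%, matching the article
print(f"P vs. chance = {result.pvalue:.1e}")   # far below the .05 threshold
```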
Qualitative evaluation of the prose answers by clinicians showed that proprietary LLM 2 provided correct and error-free answers for 83.7% of the questions.
Incorrect answers were more frequent on questions requiring knowledge of recent publications, and the identified errors involved knowledge recall, reasoning, and reading comprehension.
Clinicians classified 63.6% of errors as having a medium likelihood of causing harm, with a high likelihood in 18.2% of cases. No hallucinations were observed in the LLM responses.
Conclusions
In this study, the strongest LLMs performed exceptionally well on medical oncology exam-style questions intended for trainees nearing clinical practice. Proprietary LLM 2 correctly answered 85.0% of multiple-choice questions and provided accurate explanations, showcasing substantial medical oncology knowledge and reasoning ability.
However, incorrect answers, particularly those involving recent publications, raised significant safety concerns. Proprietary LLM 2 outperformed its predecessor, proprietary LLM 1, and demonstrated superior accuracy compared to other LLMs.
The study revealed that while LLMs’ capabilities are improving, errors in information retrieval, especially with newer evidence, pose risks. Enhanced training and frequent updates are essential for maintaining up-to-date medical oncology knowledge in LLMs.