-
Retrieval-Augmented Generation in Medicine: A Scoping Review of Technical Implementations, Clinical Applications, and Ethical Considerations
Authors:
Rui Yang,
Matthew Yu Heng Wong,
Huitao Li,
Xin Li,
Wentao Zhu,
Jingchi Liao,
Kunyu Yu,
Jonathan Chong Kai Liew,
Weihao Xuan,
Yingjian Chen,
Yuhe Ke,
Jasmine Chiat Ling Ong,
Douglas Teodoro,
Chuan Hong,
Daniel Shi Wei Ting,
Nan Liu
Abstract:
The rapid growth of medical knowledge and the increasing complexity of clinical practice pose challenges. In this context, large language models (LLMs) have demonstrated value; however, inherent limitations remain. Retrieval-augmented generation (RAG) technologies show potential to enhance their clinical applicability. This study reviewed RAG applications in medicine. We found that research relied primarily on publicly available data, with limited application to private data. For retrieval, approaches commonly relied on English-centric embedding models, while the LLMs used were mostly generic, with limited use of medical-specific LLMs. For evaluation, automated metrics assessed generation quality and task performance, whereas human evaluation focused on accuracy, completeness, relevance, and fluency, with insufficient attention to bias and safety. RAG applications were concentrated on question answering, report generation, text summarization, and information extraction. Overall, medical RAG remains at an early stage, requiring advances in clinical validation, cross-linguistic adaptation, and support for low-resource settings to enable trustworthy and responsible global use.
Submitted 13 November, 2025; v1 submitted 8 November, 2025;
originally announced November 2025.
-
Gender Bias in Large Language Models for Healthcare: Assignment Consistency and Clinical Implications
Authors:
Mingxuan Liu,
Yuhe Ke,
Wentao Zhu,
Mayli Mertens,
Yilin Ning,
Jingchi Liao,
Chuan Hong,
Daniel Shu Wei Ting,
Yifan Peng,
Danielle S. Bitterman,
Marcus Eng Hock Ong,
Nan Liu
Abstract:
The integration of large language models (LLMs) into healthcare holds promise to enhance clinical decision-making, yet their susceptibility to biases remains a critical concern. Gender has long influenced physician behaviors and patient outcomes, raising concerns that LLMs assuming human-like roles, such as clinicians or medical educators, may replicate or amplify gender-related biases. Using case studies from the New England Journal of Medicine Challenge (NEJM), we assigned genders (female, male, or unspecified) to multiple open-source and proprietary LLMs. We evaluated their response consistency across LLM-gender assignments regarding both LLM-based diagnosis and models' judgments on the clinical relevance or necessity of patient gender. In our findings, diagnoses were relatively consistent across LLM genders for most models. However, for patient gender's relevance and necessity in LLM-based diagnosis, all models demonstrated substantial inconsistency across LLM genders, particularly for relevance judgements. Some models even displayed a systematic female-male disparity in their interpretation of patient gender. These findings present an underexplored bias that could undermine the reliability of LLMs in clinical practice, underscoring the need for routine checks of identity-assignment consistency when interacting with LLMs to ensure reliable and equitable AI-supported clinical care.
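The assignment-consistency check described above can be outlined in a few lines. The sketch below is illustrative only, not the study's protocol; ask_llm() is a hypothetical stand-in for the chat-completion clients of the open-source and proprietary models evaluated.

```python
# Minimal sketch of an identity-assignment consistency check: pose the same
# case under different assigned LLM genders and compare the responses.
from collections import Counter

GENDERS = ["female", "male", "unspecified"]

def ask_llm(persona: str, case_text: str) -> str:
    # Hypothetical LLM call; replace with a real chat-completion client.
    return "diagnosis-A"  # placeholder response

def consistency(case_text: str) -> float:
    answers = []
    for g in GENDERS:
        persona = "You are a clinician." if g == "unspecified" \
                  else f"You are a {g} clinician."
        answers.append(ask_llm(persona, case_text))
    # Fraction of responses agreeing with the modal answer
    return Counter(answers).most_common(1)[0][1] / len(answers)

print(consistency("Case: 54-year-old with acute chest pain ..."))
```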
Submitted 7 October, 2025;
originally announced October 2025.
-
EVLF-FM: Explainable Vision Language Foundation Model for Medicine
Authors:
Yang Bai,
Haoran Cheng,
Yang Zhou,
Jun Zhou,
Arun Thirunavukarasu,
Yuhe Ke,
Jie Yao,
Kanae Fukutsu,
Chrystie Wan Ning Quek,
Ashley Hong,
Laura Gutierrez,
Zhen Ling Teo,
Darren Shu Jeng Ting,
Brian T. Soetikno,
Christopher S. Nielsen,
Tobias Elze,
Zengxiang Li,
Linh Le Dinh,
Hiok Hong Chan,
Victor Koh,
Marcus Tan,
Kelvin Z. Li,
Leonard Yip,
Ching Yu Cheng,
Yih Chung Tham
, et al. (18 additional authors not shown)
Abstract:
Despite the promise of foundation models in medical AI, current systems remain limited: they are modality-specific and lack transparent reasoning processes, hindering clinical adoption. To address this gap, we present EVLF-FM, a multimodal vision-language foundation model (VLM) designed to unify broad diagnostic capability with fine-grained explainability. The development and testing of EVLF-FM encompassed over 1.3 million samples from 23 global datasets across eleven imaging modalities related to six clinical specialties: dermatology, hepatology, ophthalmology, pathology, pulmonology, and radiology. External validation employed 8,884 independent test samples from 10 additional datasets across five imaging modalities. Technically, EVLF-FM is developed to assist with multiple disease diagnosis and visual question answering with pixel-level visual grounding and reasoning capabilities. In internal validation for disease diagnostics, EVLF-FM achieved the highest average accuracy (0.858) and F1-score (0.797), outperforming leading generalist and specialist models. In medical visual grounding, EVLF-FM also performed strongly across nine modalities, with an average mIoU of 0.743 and Acc@0.5 of 0.837. External validations further confirmed strong zero-shot and few-shot performance, with competitive F1-scores despite a smaller model size. Through a hybrid training strategy combining supervised and visual reinforcement fine-tuning, EVLF-FM not only achieves state-of-the-art accuracy but also exhibits step-by-step reasoning, aligning outputs with visual evidence. EVLF-FM is an early multi-disease VLM with explainability and reasoning capabilities that could advance adoption of, and trust in, foundation models for real-world clinical deployment.
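The grounding metrics reported here (mIoU and Acc@0.5) are standard. As a reference point, a minimal sketch with made-up boxes, not EVLF-FM outputs, could look like:

```python
# Mean IoU between predicted and reference boxes, and Acc@0.5: the fraction
# of predictions whose IoU with the ground truth exceeds 0.5.
def iou(box_a, box_b):
    # boxes as (x1, y1, x2, y2)
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

preds = [(10, 10, 50, 50), (0, 0, 20, 20)]    # hypothetical predicted boxes
refs  = [(12, 12, 48, 52), (30, 30, 60, 60)]  # hypothetical ground truth
ious = [iou(p, r) for p, r in zip(preds, refs)]
print(f"mIoU = {sum(ious) / len(ious):.3f}, "
      f"Acc@0.5 = {sum(i > 0.5 for i in ious) / len(ious):.2f}")
```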
Submitted 28 September, 2025;
originally announced September 2025.
-
Multimodal, Multi-Disease Medical Imaging Foundation Model (MerMED-FM)
Authors:
Yang Zhou,
Chrystie Wan Ning Quek,
Jun Zhou,
Yan Wang,
Yang Bai,
Yuhe Ke,
Jie Yao,
Laura Gutierrez,
Zhen Ling Teo,
Darren Shu Jeng Ting,
Brian T. Soetikno,
Christopher S. Nielsen,
Tobias Elze,
Zengxiang Li,
Linh Le Dinh,
Lionel Tim-Ee Cheng,
Tran Nguyen Tuan Anh,
Chee Leong Cheng,
Tien Yin Wong,
Nan Liu,
Iain Beehuat Tan,
Tony Kiat Hon Lim,
Rick Siow Mong Goh,
Yong Liu,
Daniel Shu Wei Ting
Abstract:
Current artificial intelligence models for medical imaging are predominantly single modality and single disease. Attempts to create multimodal and multi-disease models have resulted in inconsistent clinical accuracy. Furthermore, training these models typically requires large, labour-intensive, well-labelled datasets. We developed MerMED-FM, a state-of-the-art multimodal, multi-specialty foundation model trained using self-supervised learning and a memory module. MerMED-FM was trained on 3.3 million medical images from over ten specialties and seven modalities, including computed tomography (CT), chest X-rays (CXR), ultrasound (US), pathology patches, color fundus photography (CFP), optical coherence tomography (OCT) and dermatology images. MerMED-FM was evaluated across multiple diseases and compared against existing foundational models. Strong performance was achieved across all modalities, with AUROCs of 0.988 (OCT); 0.982 (pathology); 0.951 (US); 0.943 (CT); 0.931 (skin); 0.894 (CFP); 0.858 (CXR). MerMED-FM has the potential to be a highly adaptable, versatile, cross-specialty foundation model that enables robust medical imaging interpretation across diverse medical disciplines.
Submitted 30 June, 2025;
originally announced July 2025.
-
The Evolving Landscape of Generative Large Language Models and Traditional Natural Language Processing in Medicine
Authors:
Rui Yang,
Huitao Li,
Matthew Yu Heng Wong,
Yuhe Ke,
Xin Li,
Kunyu Yu,
Jingchi Liao,
Jonathan Chong Kai Liew,
Sabarinath Vinod Nair,
Jasmine Chiat Ling Ong,
Irene Li,
Douglas Teodoro,
Chuan Hong,
Daniel Shu Wei Ting,
Nan Liu
Abstract:
Natural language processing (NLP) has traditionally been applied in medicine, and generative large language models (LLMs) have recently become prominent. However, the differences between them across medical tasks remain underexplored. We analyzed 19,123 studies, finding that generative LLMs demonstrate advantages in open-ended tasks, while traditional NLP dominates in information extraction and analysis tasks. As these technologies advance, their ethical use is essential to realizing their potential in medical applications.
Submitted 15 May, 2025;
originally announced May 2025.
-
Real-world Deployment and Evaluation of PErioperative AI CHatbot (PEACH) -- a Large Language Model Chatbot for Perioperative Medicine
Authors:
Yu He Ke,
Liyuan Jin,
Kabilan Elangovan,
Bryan Wen Xi Ong,
Chin Yang Oh,
Jacqueline Sim,
Kenny Wei-Tsen Loh,
Chai Rick Soh,
Jonathan Ming Hua Cheng,
Aaron Kwang Yang Lee,
Daniel Shu Wei Ting,
Nan Liu,
Hairil Rizal Abdullah
Abstract:
Large Language Models (LLMs) are emerging as powerful tools in healthcare, particularly for complex, domain-specific tasks. This study describes the development and evaluation of the PErioperative AI CHatbot (PEACH), a secure LLM-based system integrated with local perioperative guidelines to support preoperative clinical decision-making. PEACH was embedded with 35 institutional perioperative protocols in the secure Claude 3.5 Sonnet LLM framework within Pair Chat (developed by the Singapore Government) and tested in a silent deployment with real-world data. Accuracy, safety, and usability were assessed. Deviations and hallucinations were categorized by potential harm, and user feedback was evaluated using the Technology Acceptance Model (TAM). Updates were made after the initial silent deployment to amend one protocol.
In 240 real-world clinical iterations, PEACH achieved a first-generation accuracy of 97.5% (78/80) and an overall accuracy of 96.7% (232/240) across three iterations. The updated PEACH demonstrated improved accuracy of 97.9% (235/240), a statistically significant difference from the null hypothesis of 95% accuracy (p = 0.018, 95% CI: 0.952-0.991). Hallucinations and deviations were minimal (1/240 and 2/240, respectively). Clinicians reported that PEACH expedited decisions in 95% of cases, and inter-rater reliability ranged from kappa 0.772-0.893 within PEACH and 0.610-0.784 among attendings.
PEACH is an accurate, adaptable tool that enhances consistency and efficiency in perioperative decision-making. Future research should explore its scalability across specialties and its impact on clinical outcomes.
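The headline comparison against the 95% accuracy null can be reproduced approximately with an exact binomial test. A minimal sketch using scipy, assuming a one-sided alternative (our reading of the reported test, not a detail stated in the abstract):

```python
from scipy.stats import binomtest

# One-sided exact test of 235/240 correct responses against the 95% null
test = binomtest(k=235, n=240, p=0.95, alternative="greater")
print(f"observed accuracy = {235 / 240:.3f}")   # 0.979
print(f"one-sided p = {test.pvalue:.3f}")       # ~0.02, in line with the reported p = 0.018

# Two-sided exact (Clopper-Pearson) confidence interval for the proportion
ci = binomtest(k=235, n=240).proportion_ci(confidence_level=0.95)
print(f"95% CI: {ci.low:.3f}-{ci.high:.3f}")    # close to the reported 0.952-0.991
```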
Submitted 23 December, 2024;
originally announced December 2024.
-
Retrieval Augmented Generation for 10 Large Language Models and its Generalizability in Assessing Medical Fitness
Authors:
Yu He Ke,
Liyuan Jin,
Kabilan Elangovan,
Hairil Rizal Abdullah,
Nan Liu,
Alex Tiong Heng Sia,
Chai Rick Soh,
Joshua Yi Min Tung,
Jasmine Chiat Ling Ong,
Chang-Fu Kuo,
Shao-Chun Wu,
Vesela P. Kovacheva,
Daniel Shu Wei Ting
Abstract:
Large Language Models (LLMs) show potential for medical applications but often lack specialized clinical knowledge. Retrieval Augmented Generation (RAG) allows customization with domain-specific information, making it suitable for healthcare. This study evaluates the accuracy, consistency, and safety of RAG models in determining fitness for surgery and providing preoperative instructions. We developed LLM-RAG models using 35 local and 23 international preoperative guidelines and tested them against human-generated responses. A total of 3,682 responses were evaluated. Clinical documents were processed using Llamaindex, and 10 LLMs, including GPT3.5, GPT4, and Claude-3, were assessed. Fourteen clinical scenarios were analyzed, focusing on seven aspects of preoperative instructions. Established guidelines and expert judgment were used to determine correct responses, with human-generated answers serving as comparisons. The LLM-RAG models generated responses within 20 seconds, significantly faster than clinicians (10 minutes). The GPT4 LLM-RAG model achieved the highest accuracy (96.4% vs. 86.6%, p=0.016), with no hallucinations, producing correct instructions comparable to those of clinicians. Results were consistent across both local and international guidelines. This study demonstrates the potential of LLM-RAG models for preoperative healthcare tasks, highlighting their efficiency, scalability, and reliability.
Submitted 10 October, 2024;
originally announced October 2024.
-
From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning
Authors:
Yang Bai,
Yang Zhou,
Jun Zhou,
Rick Siow Mong Goh,
Daniel Shu Wei Ting,
Yong Liu
Abstract:
Large vision language models (VLMs) combine large language models with vision encoders, demonstrating promise across various tasks. However, they often underperform in task-specific applications due to domain gaps between pre-training and fine-tuning. We introduce VITask, a novel framework that enhances task-specific adaptability of VLMs by integrating task-specific models (TSMs). VITask employs three key strategies: exemplar prompting (EP), response distribution alignment (RDA), and contrastive response tuning (CRT) to improve the task-specific performance of VLMs by adjusting their response distributions. EP allows TSM features to guide VLMs, while RDA enables VLMs to adapt without TSMs during inference by learning from exemplar-prompted models. CRT further optimizes the ranking of correct image-response pairs, thereby reducing the risk of generating undesired responses. Experiments on 12 medical diagnosis datasets across 9 imaging modalities show that VITask outperforms both vanilla instruction-tuned VLMs and TSMs, showcasing its ability to integrate complementary features from both models effectively. Additionally, VITask offers practical advantages such as flexible TSM integration and robustness to incomplete instructions, making it a versatile and efficient solution for task-specific VLM tuning. Our code is available at https://github.com/baiyang4/VITask.
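Of the three strategies, contrastive response tuning lends itself to a compact illustration. The sketch below shows one plausible form, a margin ranking loss that prefers the correct image-response pair over a mismatched one; the margin formulation is our assumption for illustration, not necessarily the paper's exact objective.

```python
# Margin-based sketch of the CRT idea: push the log-likelihood of the correct
# image-response pair above that of a mismatched pair by a fixed margin.
import torch
import torch.nn.functional as F

def crt_loss(logp_correct: torch.Tensor, logp_wrong: torch.Tensor,
             margin: float = 1.0) -> torch.Tensor:
    # logp_*: per-example log-likelihoods of the response given the image
    return F.relu(margin - (logp_correct - logp_wrong)).mean()

loss = crt_loss(torch.tensor([-2.0, -1.5]), torch.tensor([-2.5, -3.0]))
print(loss)  # tensor(0.2500)
```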
Submitted 8 October, 2024;
originally announced October 2024.
-
Lightweight Large Language Model for Medication Enquiry: Med-Pal
Authors:
Kabilan Elangovan,
Jasmine Chiat Ling Ong,
Liyuan Jin,
Benjamin Jun Jie Seng,
Yu Heng Kwan,
Lit Soo Tan,
Ryan Jian Zhong,
Justina Koi Li Ma,
YuHe Ke,
Nan Liu,
Kathleen M Giacomini,
Daniel Shu Wei Ting
Abstract:
Large Language Models (LLMs) have emerged as a potential solution for supporting digital health development through patient education, commonly for medication-related enquiries. We trained and validated Med-Pal, a medication domain-specific LLM chatbot fine-tuned with a fine-grained, expert-curated dataset. Its backbone was selected from five lightweight open-source LLMs of smaller parameter size (7 billion or less), reflecting computational constraints and prioritizing operational efficiency. A multi-disciplinary team performed a clinical evaluation of LLM responses using the SCORE criteria, focusing on safety, accuracy, bias, reproducibility, and ease of understanding. The best-performing lightweight LLM was chosen as Med-Pal for further engineering with guard-railing using adversarial prompting. Med-Pal and existing lightweight LLMs, including pretrained Biomistral and finetuned Meerkat, were validated on an independent dataset covering a broad range of medication-related questions (231 in total), spanning 12 question types across 14 medication classes. Mistral-7b emerged as the top performer among the selected lightweight LLMs, achieving the highest median score of 14 and 71.9% high-quality responses in the accuracy and safety domains, and was hence chosen as the backbone LLM for Med-Pal. When compared against Biomistral, Med-Pal generated more appropriate responses for patient communication, with significant reductions in the bias and errors typical of general LLMs. Comparable performance was observed when comparing Med-Pal with Meerkat. Med-Pal demonstrates the feasibility of developing and deploying fine-tuned lightweight LLMs to enhance digital health communications.
Submitted 1 July, 2024;
originally announced July 2024.
-
Retrieval-Augmented Generation for Generative Artificial Intelligence in Medicine
Authors:
Rui Yang,
Yilin Ning,
Emilia Keppo,
Mingxuan Liu,
Chuan Hong,
Danielle S Bitterman,
Jasmine Chiat Ling Ong,
Daniel Shu Wei Ting,
Nan Liu
Abstract:
Generative artificial intelligence (AI) has brought revolutionary innovations in various fields, including medicine. However, it also exhibits limitations. In response, retrieval-augmented generation (RAG) provides a potential solution, enabling models to generate more accurate content by leveraging the retrieval of external knowledge. With the rapid advancement of generative AI, RAG can pave the way for connecting this transformative technology with medical applications and is expected to bring innovations in equity, reliability, and personalization to health care.
Submitted 18 June, 2024;
originally announced June 2024.
-
Towards Clinical AI Fairness: Filling Gaps in the Puzzle
Authors:
Mingxuan Liu,
Yilin Ning,
Salinelat Teixayavong,
Xiaoxuan Liu,
Mayli Mertens,
Yuqing Shang,
Xin Li,
Di Miao,
Jie Xu,
Daniel Shu Wei Ting,
Lionel Tim-Ee Cheng,
Jasmine Chiat Ling Ong,
Zhen Ling Teo,
Ting Fang Tan,
Narrendar RaviChandran,
Fei Wang,
Leo Anthony Celi,
Marcus Eng Hock Ong,
Nan Liu
Abstract:
The ethical integration of Artificial Intelligence (AI) in healthcare necessitates addressing fairness, a concept that is highly context-specific across medical fields. Extensive studies have been conducted to expand the technical components of AI fairness, while tremendous calls for AI fairness have been raised from within healthcare. Despite this, a significant disconnect persists between technical advancements and their practical clinical applications, resulting in a lack of contextualized discussion of AI fairness in clinical settings. Through a detailed evidence gap analysis, our review systematically pinpoints several deficiencies concerning both healthcare data and the provided AI fairness solutions. We highlight the scarcity of research on AI fairness in many medical domains where AI technology is increasingly utilized. Additionally, our analysis highlights a substantial reliance on group fairness, which aims to ensure equality among demographic groups from a macro healthcare system perspective; in contrast, individual fairness, focusing on equity at a more granular level, is frequently overlooked. To bridge these gaps, our review advances actionable strategies for both the healthcare and AI research communities. Beyond applying existing AI fairness methods in healthcare, we further emphasize the importance of involving healthcare professionals in refining AI fairness concepts and methods to ensure contextually relevant and ethically sound AI applications in healthcare.
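For readers unfamiliar with the distinction, the two notions contrasted above have standard formalizations (textbook definitions, not specific to this review):

```latex
% Group fairness (demographic parity): equal positive-prediction rates
% across demographic groups a and a':
\[
  P(\hat{Y} = 1 \mid A = a) = P(\hat{Y} = 1 \mid A = a')
\]
% Individual fairness: similar patients receive similar predictions, for a
% task-specific patient metric d, output metric D, and Lipschitz constant L:
\[
  D\bigl(f(x), f(x')\bigr) \le L \, d(x, x') \quad \forall\, x, x'
\]
```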
Submitted 28 May, 2024;
originally announced May 2024.
-
Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots in Ophthalmology and LLM-based evaluation using GPT-4
Authors:
Ting Fang Tan,
Kabilan Elangovan,
Liyuan Jin,
Yao Jie,
Li Yong,
Joshua Lim,
Stanley Poh,
Wei Yan Ng,
Daniel Lim,
Yuhe Ke,
Nan Liu,
Daniel Shu Wei Ting
Abstract:
Purpose: To assess the alignment of GPT-4-based evaluation with human clinician experts in evaluating responses to ophthalmology-related patient queries generated by fine-tuned LLM chatbots. Methods: 400 ophthalmology questions and paired answers were created by ophthalmologists to represent commonly asked patient questions, divided into fine-tuning (368; 92%) and testing (40; 8%) sets. We fine-tuned 5 different LLMs, including GPT-3.5, LLAMA2-7b, LLAMA2-7b-Chat, LLAMA2-13b, and LLAMA2-13b-Chat. For the testing dataset, an additional 8 glaucoma QnA pairs were included. 200 responses to the testing dataset were generated by the 5 fine-tuned LLMs for evaluation. A customized clinical evaluation rubric was used to guide GPT-4 evaluation, grounded in clinical accuracy, relevance, patient safety, and ease of understanding. GPT-4 evaluation was then compared against rankings by 5 clinicians for clinical alignment. Results: Among all fine-tuned LLMs, GPT-3.5 scored the highest (87.1%), followed by LLAMA2-13b (80.9%), LLAMA2-13b-chat (75.5%), LLAMA2-7b-Chat (70%) and LLAMA2-7b (68.8%), based on the GPT-4 evaluation. GPT-4 evaluation demonstrated significant agreement with human clinician rankings, with Spearman and Kendall Tau correlation coefficients of 0.90 and 0.80 respectively, while correlation based on Cohen's Kappa was more modest at 0.50. Notably, qualitative analysis and the glaucoma sub-analysis revealed clinical inaccuracies in the LLM-generated responses, which were appropriately identified by the GPT-4 evaluation. Conclusion: The notable clinical alignment of GPT-4 evaluation highlights its potential to streamline the clinical evaluation of LLM chatbot responses to healthcare-related queries. By complementing existing clinician-dependent manual grading, this efficient and automated evaluation could assist the validation of future developments in LLM applications for healthcare.
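The agreement statistics reported above are all standard. A minimal sketch with made-up ranks and grades (not the study data) shows how each is computed; the toy ranks below happen to give rho = 0.90 and tau = 0.80, matching the reported values.

```python
# Spearman and Kendall tau on rankings, Cohen's kappa on categorical grades.
from scipy.stats import spearmanr, kendalltau
from sklearn.metrics import cohen_kappa_score

gpt4_ranks      = [1, 2, 3, 4, 5]   # hypothetical ranking of 5 fine-tuned LLMs
clinician_ranks = [1, 2, 4, 3, 5]   # hypothetical consensus clinician ranking

rho, _ = spearmanr(gpt4_ranks, clinician_ranks)
tau, _ = kendalltau(gpt4_ranks, clinician_ranks)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")

# Cohen's kappa needs categorical labels, e.g. per-response quality grades
gpt4_grades      = ["good", "good", "poor", "good", "poor"]
clinician_grades = ["good", "poor", "poor", "good", "poor"]
print(f"Cohen kappa = {cohen_kappa_score(gpt4_grades, clinician_grades):.2f}")
```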
Submitted 15 February, 2024;
originally announced February 2024.
-
Development and Testing of a Novel Large Language Model-Based Clinical Decision Support Systems for Medication Safety in 12 Clinical Specialties
Authors:
Jasmine Chiat Ling Ong,
Liyuan Jin,
Kabilan Elangovan,
Gilbert Yong San Lim,
Daniel Yan Zheng Lim,
Gerald Gui Ren Sng,
Yuhe Ke,
Joshua Yi Min Tung,
Ryan Jian Zhong,
Christopher Ming Yao Koh,
Keane Zhi Hao Lee,
Xiang Chen,
Jack Kian Chng,
Aung Than,
Ken Junyang Goh,
Daniel Shu Wei Ting
Abstract:
Importance: We introduce a novel Retrieval Augmented Generation (RAG)-Large Language Model (LLM) framework as a Clinical Decision Support Systems (CDSS) to support safe medication prescription.
Objective: To evaluate the efficacy of LLM-based CDSS in correctly identifying medication errors in different patient case vignettes from diverse medical and surgical sub-disciplines, against a human expert panel derived ground truth. We compared performance under two different practical healthcare integration modalities: LLM-based CDSS alone (fully autonomous mode) vs. junior pharmacist + LLM-based CDSS (co-pilot, assistive mode).
Design, Setting, and Participants: Utilizing a RAG model with state-of-the-art medically-related LLMs (GPT-4, Gemini Pro 1.0 and Med-PaLM 2), this study used 61 prescribing error scenarios embedded into 23 complex clinical vignettes across 12 different medical and surgical specialties. A multidisciplinary expert panel assessed these cases for Drug-Related Problems (DRPs) using the PCNE classification and graded severity / potential for harm using the revised NCC MERP medication error index.
Results: The RAG-LLM performed better than the LLM alone. When employed in co-pilot mode, accuracy, recall, and F1 scores were optimized, indicating effectiveness in identifying moderate to severe DRPs. The accuracy of DRP detection with the RAG-LLM improved in several categories, but at the expense of lower precision.
Conclusions: This study established that a RAG-LLM based CDSS significantly boosts the accuracy of medication error identification when used alongside junior pharmacists (co-pilot mode), with notable improvements in detecting severe DRPs. This study also illuminates the comparative performance of current state-of-the-art LLMs in RAG-based CDSS systems.
Submitted 17 February, 2024; v1 submitted 29 January, 2024;
originally announced February 2024.
-
Development and Testing of Retrieval Augmented Generation in Large Language Models -- A Case Study Report
Authors:
YuHe Ke,
Liyuan Jin,
Kabilan Elangovan,
Hairil Rizal Abdullah,
Nan Liu,
Alex Tiong Heng Sia,
Chai Rick Soh,
Joshua Yi Min Tung,
Jasmine Chiat Ling Ong,
Daniel Shu Wei Ting
Abstract:
Purpose: Large Language Models (LLMs) hold significant promise for medical applications. Retrieval Augmented Generation (RAG) emerges as a promising approach for customizing domain knowledge in LLMs. This case study presents the development and evaluation of an LLM-RAG pipeline tailored for healthcare, focusing specifically on preoperative medicine.
Methods: We developed an LLM-RAG model using 35 preoperative guidelines and tested it against human-generated responses, with a total of 1260 responses evaluated. The RAG process involved converting clinical documents into text using Python-based frameworks such as LangChain and Llamaindex, and processing these texts into chunks for embedding and retrieval. Vector storage techniques and embedding models were selected to optimize data retrieval, using Pinecone for vector storage with a dimensionality of 1536 and cosine similarity as the similarity metric. Human-generated answers, provided by junior doctors, were used as a comparison.
Results: The LLM-RAG model generated answers within an average of 15-20 seconds, significantly faster than the 10 minutes typically required by humans. Among the basic LLMs, GPT4.0 exhibited the best accuracy of 80.1%. This accuracy was further increased to 91.4% when the model was enhanced with RAG. Compared to the human-generated instructions, which had an accuracy of 86.3%, the performance of the GPT4.0 RAG model demonstrated non-inferiority (p=0.610).
Conclusions: In this case study, we demonstrated an LLM-RAG model for healthcare implementation. The pipeline shows the advantages of grounded knowledge, upgradability, and scalability as important aspects of healthcare LLM deployment.
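A minimal, framework-free sketch of the retrieval step described in the Methods, with a toy embedding function in place of the real embedding model and Pinecone index; embed and llm_generate are illustrative stand-ins, not the paper's code.

```python
# Chunk guideline text, embed to 1536-dim unit vectors, retrieve by cosine
# similarity, and prepend the top chunks to the LLM prompt.
import zlib
import numpy as np

def embed(text: str, dim: int = 1536) -> np.ndarray:
    """Toy deterministic stand-in for the real embedding model."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def llm_generate(prompt: str) -> str:
    """Stub for the LLM call (e.g., a chat-completion client)."""
    return f"<answer grounded in {prompt.count('[chunk]')} retrieved chunks>"

def chunk(text: str, size: int = 500) -> list[str]:
    """Split a guideline document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank chunks by cosine similarity to the query embedding."""
    q = embed(query)
    scores = [float(q @ embed(c)) for c in chunks]  # unit vectors: dot = cosine
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

guideline = "Patients on anticoagulants should ... " * 100  # placeholder text
retrieved = retrieve("When should warfarin be stopped before surgery?", chunk(guideline))
context = "\n".join("[chunk] " + c for c in retrieved)
print(llm_generate(f"Use only these excerpts:\n{context}\n\nQuestion: ..."))
```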
Submitted 29 January, 2024;
originally announced February 2024.
-
Enhancing Diagnostic Accuracy through Multi-Agent Conversations: Using Large Language Models to Mitigate Cognitive Bias
Authors:
Yu He Ke,
Rui Yang,
Sui An Lie,
Taylor Xin Yi Lim,
Hairil Rizal Abdullah,
Daniel Shu Wei Ting,
Nan Liu
Abstract:
Background: Cognitive biases in clinical decision-making significantly contribute to errors in diagnosis and suboptimal patient outcomes. Addressing these biases presents a formidable challenge in the medical field.
Objective: This study explores the role of large language models (LLMs) in mitigating these biases through the utilization of a multi-agent framework. We simulate the clinical decision-making processes through multi-agent conversation and evaluate its efficacy in improving diagnostic accuracy.
Methods: A total of 16 published and unpublished case reports in which cognitive biases resulted in misdiagnoses were identified from the literature. In the multi-agent framework, we leveraged GPT-4 to facilitate interactions among four simulated agents to replicate clinical team dynamics. Each agent had a distinct role: 1) making the final diagnosis after considering the discussion, 2) acting as devil's advocate to correct confirmation and anchoring bias, 3) tutoring and facilitating the discussion to reduce premature closure bias, and 4) recording and summarizing the findings. A total of 80 simulations were evaluated for the accuracy of the initial diagnosis, the top differential diagnosis, and the final two differential diagnoses.
Results: In a total of 80 responses evaluating both initial and final diagnoses, the initial diagnosis had an accuracy of 0% (0/80), but following multi-agent discussions, the accuracy for the top differential diagnosis increased to 71.3% (57/80), and for the final two differential diagnoses, to 80.0% (64/80).
Conclusions: The framework demonstrated an ability to re-evaluate and correct misconceptions, even in scenarios with misleading initial investigations. The LLM-driven multi-agent conversation framework shows promise in enhancing diagnostic accuracy in diagnostically challenging medical scenarios.
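The four-role conversation loop described in the Methods can be outlined compactly. This is a minimal sketch, not the authors' implementation; call_llm() is a hypothetical stand-in for a GPT-4 chat-completion call, and each agent sees the shared transcript plus its own role instruction.

```python
# Multi-agent discussion loop: four role-conditioned agents take turns
# appending to a shared transcript; the diagnostician concludes.
ROLES = {
    "diagnostician": "Propose the final diagnosis after weighing the discussion.",
    "devils_advocate": "Challenge the leading diagnosis; counter confirmation and anchoring bias.",
    "facilitator": "Keep the discussion open; guard against premature closure.",
    "recorder": "Summarize the findings and the differential diagnoses so far.",
}

def call_llm(instruction: str, transcript: str) -> str:
    # Hypothetical LLM call; replace with a real chat-completion client.
    return f"[{instruction[:24]}...] response to: {transcript[-60:]}"

def run_case(case_description: str, rounds: int = 2) -> str:
    transcript = f"Case: {case_description}"
    for _ in range(rounds):
        for role, instruction in ROLES.items():
            transcript += f"\n{role}: {call_llm(instruction, transcript)}"
    # The diagnostician has the last word
    return call_llm(ROLES["diagnostician"], transcript)

print(run_case("Fever and confusion after recent travel ..."))
```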
Submitted 12 May, 2024; v1 submitted 25 January, 2024;
originally announced January 2024.
-
Generative Artificial Intelligence in Healthcare: Ethical Considerations and Assessment Checklist
Authors:
Yilin Ning,
Salinelat Teixayavong,
Yuqing Shang,
Julian Savulescu,
Vaishaanth Nagaraj,
Di Miao,
Mayli Mertens,
Daniel Shu Wei Ting,
Jasmine Chiat Ling Ong,
Mingxuan Liu,
Jiuwen Cao,
Michael Dunn,
Roger Vaughan,
Marcus Eng Hock Ong,
Joseph Jao-Yiu Sung,
Eric J Topol,
Nan Liu
Abstract:
The widespread use of ChatGPT and other emerging technology powered by generative artificial intelligence (GenAI) has drawn much attention to potential ethical issues, especially in high-stakes applications such as healthcare, but ethical discussions are yet to translate into operationalisable solutions. Furthermore, ongoing ethical discussions often neglect other types of GenAI that have been used to synthesise data (e.g., images) for research and practical purposes, which resolved some ethical issues and exposed others. We conduct a scoping review of ethical discussions on GenAI in healthcare to comprehensively analyse gaps in the current research, and further propose to reduce the gaps by developing a checklist for comprehensive assessment and transparent documentation of ethical discussions in GenAI research. The checklist can be readily integrated into the current peer review and publication system to enhance GenAI research, and may be used for ethics-related disclosures for GenAI-powered products, healthcare applications of such products and beyond.
Submitted 23 February, 2024; v1 submitted 2 November, 2023;
originally announced November 2023.
-
Towards clinical AI fairness: A translational perspective
Authors:
Mingxuan Liu,
Yilin Ning,
Salinelat Teixayavong,
Mayli Mertens,
Jie Xu,
Daniel Shu Wei Ting,
Lionel Tim-Ee Cheng,
Jasmine Chiat Ling Ong,
Zhen Ling Teo,
Ting Fang Tan,
Ravi Chandran Narrendar,
Fei Wang,
Leo Anthony Celi,
Marcus Eng Hock Ong,
Nan Liu
Abstract:
Artificial intelligence (AI) has demonstrated the ability to extract insights from data, but fairness remains a concern in high-stakes fields such as healthcare. Despite extensive discussion and efforts in algorithm development, AI fairness and clinical concerns have not been adequately addressed. In this paper, we discuss the misalignment between technical and clinical perspectives on AI fairness, highlight the barriers to translating AI fairness into healthcare, advocate multidisciplinary collaboration to bridge the knowledge gap, and provide possible solutions to address the clinical concerns pertaining to AI fairness.
Submitted 26 April, 2023;
originally announced April 2023.
-
Federated and distributed learning applications for electronic health records and structured medical data: A scoping review
Authors:
Siqi Li,
Pinyan Liu,
Gustavo G. Nascimento,
Xinru Wang,
Fabio Renato Manzolli Leite,
Bibhas Chakraborty,
Chuan Hong,
Yilin Ning,
Feng Xie,
Zhen Ling Teo,
Daniel Shu Wei Ting,
Hamed Haddadi,
Marcus Eng Hock Ong,
Marco Aurélio Peres,
Nan Liu
Abstract:
Federated learning (FL) has gained popularity in clinical research in recent years to facilitate privacy-preserving collaboration. Structured data, one of the most prevalent forms of clinical data, has experienced significant growth in volume concurrently, notably with the widespread adoption of electronic health records in clinical practice. This review examines FL applications on structured medical data, identifies contemporary limitations and discusses potential innovations. We searched five databases, SCOPUS, MEDLINE, Web of Science, Embase, and CINAHL, to identify articles that applied FL to structured medical data and reported results following the PRISMA guidelines. Each selected publication was evaluated from three primary perspectives, including data quality, modeling strategies, and FL frameworks. Out of the 1160 papers screened, 34 met the inclusion criteria, with each article consisting of one or more studies that used FL to handle structured clinical/medical data. Of these, 24 utilized data acquired from electronic health records, with clinical predictions and association studies being the most common clinical research tasks that FL was applied to. Only one article exclusively explored the vertical FL setting, while the remaining 33 explored the horizontal FL setting, with only 14 discussing comparisons between single-site (local) and FL (global) analysis. The existing FL applications on structured medical data lack sufficient evaluations of clinically meaningful benefits, particularly when compared to single-site analyses. Therefore, it is crucial for future FL applications to prioritize clinical motivations and develop designs and methodologies that can effectively support and aid clinical practice and research.
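For context, the horizontal setting that dominates the reviewed literature can be illustrated with a few lines of FedAvg-style code over synthetic site data. This is illustrative only; none of the reviewed systems are implied to use exactly this scheme.

```python
# Horizontal FedAvg sketch: each site fits on its own patients (same features,
# different individuals); only model weights are shared and averaged.
import numpy as np

def local_update(weights: np.ndarray, X: np.ndarray, y: np.ndarray,
                 lr: float = 0.1, epochs: int = 5) -> np.ndarray:
    w = weights.copy()
    for _ in range(epochs):                       # logistic-regression SGD steps
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(0)
sites = [(rng.standard_normal((100, 5)), rng.integers(0, 2, 100)) for _ in range(3)]

global_w = np.zeros(5)
for _ in range(10):                               # federated rounds
    local_ws = [local_update(global_w, X, y) for X, y in sites]
    sizes = [len(y) for _, y in sites]
    global_w = np.average(local_ws, axis=0, weights=sizes)  # size-weighted average
print(global_w.round(3))
```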
Submitted 14 April, 2023;
originally announced April 2023.
-
A novel interpretable machine learning system to generate clinical risk scores: An application for predicting early mortality or unplanned readmission in a retrospective cohort study
Authors:
Yilin Ning,
Siqi Li,
Marcus Eng Hock Ong,
Feng Xie,
Bibhas Chakraborty,
Daniel Shu Wei Ting,
Nan Liu
Abstract:
Risk scores are widely used for clinical decision making and commonly generated from logistic regression models. Machine-learning-based methods may work well for identifying important predictors, but such 'black box' variable selection limits interpretability, and variable importance evaluated from a single model can be biased. We propose a robust and interpretable variable selection approach using the recently developed Shapley variable importance cloud (ShapleyVIC) that accounts for variability across models. Our approach evaluates and visualizes overall variable contributions for in-depth inference and transparent variable selection, and filters out non-significant contributors to simplify model building steps. We derive an ensemble variable ranking from variable contributions, which is easily integrated with an automated and modularized risk score generator, AutoScore, for convenient implementation. In a study of early death or unplanned readmission, ShapleyVIC selected 6 of 41 candidate variables to create a well-performing model, which had similar performance to a 16-variable model from machine-learning-based ranking.
Submitted 10 January, 2022;
originally announced January 2022.
-
Shapley variable importance clouds for interpretable machine learning
Authors:
Yilin Ning,
Marcus Eng Hock Ong,
Bibhas Chakraborty,
Benjamin Alan Goldstein,
Daniel Shu Wei Ting,
Roger Vaughan,
Nan Liu
Abstract:
Interpretable machine learning has focused on explaining final models that optimize performance. The current state of the art is Shapley additive explanations (SHAP), which locally explains variable impact on individual predictions and has recently been extended to global assessment across the dataset. Dong and Rudin proposed extending the investigation to models from the same class as the final model that are "good enough", and identified a previous overclaim of variable importance based on a single model. However, this method does not directly integrate with existing Shapley-based interpretations. We close this gap by proposing a Shapley variable importance cloud that pools information across good models to avoid biased assessments in SHAP analyses of final models, and communicates the findings via novel visualizations. We demonstrate the additional insights gained compared to conventional explanations and Dong and Rudin's method using criminal justice and electronic medical records data.
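A minimal sketch of the core idea: compute per-variable mean |SHAP| importances for several near-optimal models and pool them, so importance reflects variability across models rather than one final model. Here bootstrap-refit logistic regressions stand in for the paper's sampling of the "good" model class; this is not the authors' package.

```python
# Pool SHAP importances across an ensemble of refitted models.
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=6, random_state=0)

importances = []
rng = np.random.default_rng(0)
for _ in range(20):
    idx = rng.integers(0, len(X), len(X))        # bootstrap resample = one "good" model
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    shap_values = shap.LinearExplainer(model, X).shap_values(X)
    importances.append(np.abs(shap_values).mean(axis=0))  # per-variable mean |SHAP|

cloud = np.array(importances)                    # models x variables
print("mean importance:", cloud.mean(axis=0).round(3))
print("spread (std):   ", cloud.std(axis=0).round(3))
```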
Submitted 5 October, 2021;
originally announced October 2021.
-
Multi-Instance Multi-Scale CNN for Medical Image Classification
Authors:
Shaohua Li,
Yong Liu,
Xiuchao Sui,
Cheng Chen,
Gabriel Tjio,
Daniel Shu Wei Ting,
Rick Siow Mong Goh
Abstract:
Deep learning for medical image classification faces three major challenges: 1) the number of annotated medical images available for training is usually small; 2) regions of interest (ROIs) are relatively small with unclear boundaries within the whole image, and may appear in arbitrary positions across the x, y (and, in 3D images, z) dimensions, yet often only whole-image labels are annotated and localized ROIs are unavailable; and 3) ROIs in medical images often appear at varying sizes (scales). We approach these three challenges with a Multi-Instance Multi-Scale (MIMS) CNN: 1) We propose a multi-scale convolutional layer, which extracts patterns of different receptive fields with a shared set of convolutional kernels, so that scale-invariant patterns are captured by this compact set of kernels. As this layer contains only a small number of parameters, training on small datasets becomes feasible; 2) We propose a "top-k pooling" to aggregate the feature maps at varying scales from multiple spatial dimensions, allowing the model to be trained using weak annotations within the multiple instance learning (MIL) framework. Our method is shown to perform well on three classification tasks involving two 3D and two 2D medical image datasets.
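A minimal PyTorch sketch (not the authors' code) of the two ideas described above: one set of kernels reused at several dilation rates to capture multiple receptive fields, and a top-k pooling that averages the k largest activations per channel so small ROIs can drive a weakly supervised prediction. The dilated-convolution realization of "shared kernels across scales" is our assumption for illustration.

```python
import torch
import torch.nn.functional as F
from torch import nn

class MultiScaleConv(nn.Module):
    """One shared 3x3 kernel bank applied at several dilation rates."""
    def __init__(self, in_ch: int, out_ch: int, dilations=(1, 2, 3)):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.02)
        self.dilations = dilations

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Same kernels, different receptive fields; concatenate along channels.
        outs = [F.conv2d(x, self.weight, padding=d, dilation=d)
                for d in self.dilations]
        return torch.cat(outs, dim=1)

def topk_pool(feat: torch.Tensor, k: int = 4) -> torch.Tensor:
    # feat: (batch, channels, H, W) -> (batch, channels),
    # mean of the k largest activations per channel
    flat = feat.flatten(start_dim=2)
    return flat.topk(k, dim=2).values.mean(dim=2)

x = torch.randn(2, 3, 64, 64)        # e.g. a small batch of 2D image patches
feats = MultiScaleConv(3, 8)(x)      # (2, 24, 64, 64): 3 scales x 8 channels
logits = nn.Linear(24, 2)(topk_pool(feats))
print(logits.shape)                  # torch.Size([2, 2])
```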
Submitted 22 October, 2019; v1 submitted 4 July, 2019;
originally announced July 2019.