RadOnc-GPT: An Autonomous LLM Agent for Real-Time Patient Outcomes Labeling at Scale
Authors:
Jason Holmes,
Yuexing Hao,
Mariana Borras-Osorio,
Federico Mastroleo,
Santiago Romero Brufau,
Valentina Carducci,
Katie M Van Abel,
David M Routman,
Andrew Y. K. Foong,
Liv M Muller,
Satomi Shiraishi,
Daniel K Ebner,
Daniel J Ma,
Sameer R Keole,
Samir H Patel,
Mirek Fatyga,
Martin Bues,
Brad J Stish,
Yolanda I Garces,
Michelle A Neben Wittich,
Robert L Foote,
Sujay A Vora,
Nadia N Laack,
Mark R Waddle,
Wei Liu
Abstract:
Manual labeling limits the scale, accuracy, and timeliness of patient outcomes research in radiation oncology. We present RadOnc-GPT, an autonomous large language model (LLM)-based agent capable of independently retrieving patient-specific information, iteratively assessing evidence, and returning structured outcomes. Our evaluation validates RadOnc-GPT across two tiers of increasing complexity: (1) a structured quality assurance (QA) tier that assesses accurate retrieval of demographic and radiotherapy treatment plan details, followed by (2) a complex clinical outcomes labeling tier covering determination of mandibular osteoradionecrosis (ORN) in head-and-neck cancer patients and detection of cancer recurrence in independent prostate and head-and-neck cancer cohorts, tasks that require combined interpretation of structured and unstructured patient data. The QA tier establishes foundational trust in structured-data retrieval, a critical prerequisite for successful complex clinical outcome labeling.
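To make the agent design concrete, the following is a minimal sketch of the retrieve-assess-label loop the abstract describes. All names, stub functions, and the stopping threshold are hypothetical illustrations, not the paper's implementation:

```python
# Hypothetical sketch of an outcomes-labeling agent loop (not the paper's code).
# The agent retrieves patient records, iteratively assesses the accumulated
# evidence, and returns a structured outcome label.
from dataclasses import dataclass, field

@dataclass
class OutcomeLabel:
    patient_id: str
    outcome: str                 # e.g. "ORN" or "no ORN"
    confidence: float
    evidence: list[str] = field(default_factory=list)

def retrieve_records(patient_id: str, query: str) -> list[str]:
    """Stub for EHR retrieval; a real agent would query structured tables
    and free-text clinical notes."""
    return [f"note matching '{query}' for {patient_id}"]

def assess(evidence: list[str]) -> tuple[str, float]:
    """Stub for the LLM judgment step; returns (outcome, confidence)."""
    return ("no ORN", 0.9) if evidence else ("insufficient data", 0.0)

def label_patient(patient_id: str, max_rounds: int = 3) -> OutcomeLabel:
    evidence: list[str] = []
    outcome, confidence = "insufficient data", 0.0
    for round_ in range(max_rounds):
        # Each round, the agent decides what to look for, fetches it,
        # and re-assesses the full evidence set.
        evidence += retrieve_records(patient_id, f"osteoradionecrosis, round {round_}")
        outcome, confidence = assess(evidence)
        if confidence >= 0.85:   # stop once the evidence is judged decisive
            break
    return OutcomeLabel(patient_id, outcome, confidence, evidence)

print(label_patient("patient-001"))
```

The early-exit condition mirrors the iterative evidence assessment described above: the agent keeps retrieving only until its judgment is confident enough to commit to a structured label.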
Submitted 12 December, 2025; v1 submitted 29 September, 2025;
originally announced September 2025.
BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text
Authors:
Jiageng Wu,
Bowen Gu,
Ren Zhou,
Kevin Xie,
Doug Snyder,
Yixing Jiang,
Valentina Carducci,
Richard Wyss,
Rishi J Desai,
Emily Alsentzer,
Leo Anthony Celi,
Adam Rodman,
Sebastian Schneeweiss,
Jonathan H. Chen,
Santiago Romero-Brufau,
Kueiyu Joshua Lin,
Jie Yang
Abstract:
Large language models (LLMs) hold great promise for medical applications and are evolving rapidly, with new models being released at an accelerated pace. However, benchmarking on large-scale real-world data such as electronic health records (EHRs) is critical, as clinical decisions are directly informed by these sources, yet current evaluations remain limited. Most existing benchmarks rely on medical exam-style questions or PubMed-derived text, failing to capture the complexity of real-world clinical data. Others focus narrowly on specific application scenarios, limiting their generalizability to broader clinical use. To address this gap, we present BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks drawn from real-world clinical data sources across nine languages. It covers eight major task types spanning the entire continuum of patient care across six clinical stages and 20 representative applications, including triage and referral, consultation, information extraction, diagnosis, prognosis, and billing coding, and involves 14 clinical specialties. We systematically evaluated 95 LLMs (including DeepSeek-R1, GPT-4o, Gemini series, and Qwen3 series) under various inference strategies. Our results reveal substantial performance variation across model sizes, languages, natural language processing tasks, and clinical specialties. Notably, we demonstrate that open-source LLMs can achieve performance comparable to proprietary models, while medically fine-tuned LLMs based on older architectures often underperform relative to newer general-purpose models. BRIDGE and its corresponding leaderboard serve as a foundational resource and a unique reference for the development and evaluation of new LLMs in real-world clinical text understanding.
The BRIDGE leaderboard: https://huggingface.co/spaces/YLab-Open/BRIDGE-Medical-Leaderboard
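For illustration, here is a minimal sketch of a multi-task, multilingual evaluation harness in the spirit of the benchmark described above. The task schema, stub inference call, and exact-match scoring are assumptions for the sketch, not BRIDGE's published interface:

```python
# Hypothetical multi-task benchmark harness (not BRIDGE's actual code).
# Each task carries a language tag and scored (input, reference) examples;
# per-task accuracy is then aggregated into per-language averages.
from collections import defaultdict

tasks = [
    {"name": "triage-es", "language": "es", "examples": [("symptoms...", "urgent")]},
    {"name": "icd-coding-en", "language": "en", "examples": [("note...", "C61")]},
]

def run_model(model: str, prompt: str) -> str:
    """Stub for model inference; a real harness would call the model's API."""
    return "urgent"

def evaluate(model: str) -> dict[str, float]:
    """Exact-match accuracy per task (a stand-in for task-specific metrics)."""
    scores = {}
    for task in tasks:
        correct = sum(run_model(model, x) == y for x, y in task["examples"])
        scores[task["name"]] = correct / len(task["examples"])
    return scores

# Aggregate per-language averages across tasks, as a leaderboard might.
results = evaluate("demo-model")
by_lang = defaultdict(list)
for task in tasks:
    by_lang[task["language"]].append(results[task["name"]])
for lang, vals in sorted(by_lang.items()):
    print(lang, sum(vals) / len(vals))
```

A real harness would additionally vary inference strategies (e.g., zero-shot versus chain-of-thought prompting) and break results out by task type and clinical specialty, as the evaluation above reports.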
Submitted 29 March, 2026; v1 submitted 28 April, 2025;
originally announced April 2025.