A Principle-based Framework for the Development and Evaluation of Large Language Models for Health and Wellness
Authors: Brent Winslow, Jacqueline Shreibati, Javier Perez, Hao-Wei Su, Nichole Young-Lin, Nova Hammerquist, Daniel McDuff, Jason Guss, Jenny Vafeiadou, Nick Cain, Alex Lin, Erik Schenck, Shiva Rajagopal, Jia-Ru Chung, Anusha Venkatakrishnan, Amy Armento Lee, Maryam Karimzadehgan, Qingyou Meng, Rythm Agarwal, Aravind Natarajan, Tracy Giest
Abstract:
The incorporation of generative artificial intelligence into personal health applications presents a transformative opportunity for personalized, data-driven health and fitness guidance, yet also poses challenges related to user safety, model accuracy, and personal privacy. To address these challenges, a novel, principle-based framework was developed and validated for the systematic evaluation of LLMs applied to personal health and wellness. First, the development of the Fitbit Insights explorer, a large language model (LLM)-powered system designed to help users interpret their personal health data, is described. Subsequently, the safety, helpfulness, accuracy, relevance, and personalization (SHARP) principle-based framework is introduced as an end-to-end operational methodology that integrates comprehensive evaluation techniques including human evaluation by generalists and clinical specialists, autorater assessments, and adversarial testing, into an iterative development lifecycle. Through the application of this framework to the Fitbit Insights explorer in a staged deployment involving over 13,000 consented users, challenges not apparent during initial testing were systematically identified. This process guided targeted improvements to the system and demonstrated the necessity of combining isolated technical evaluations with real-world user feedback. Finally, a comprehensive, actionable approach is established for the responsible development and deployment of LLM-powered health applications, providing a standardized methodology to foster innovation while ensuring emerging technologies are safe, effective, and trustworthy for users.
Submitted 23 October, 2025;
originally announced December 2025.
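The abstract above describes scoring a system against five principles (safety, helpfulness, accuracy, relevance, personalization) using several kinds of raters. A minimal sketch of how such ratings might be aggregated per principle follows. It is not the authors' implementation: the `Rating` record, the rater labels, the 1-5 scale, and the 4.0 threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The five SHARP principles named in the abstract.
SHARP_PRINCIPLES = ("safety", "helpfulness", "accuracy", "relevance", "personalization")


@dataclass(frozen=True)
class Rating:
    """One rating of one response (hypothetical record layout)."""
    principle: str  # one of SHARP_PRINCIPLES
    source: str     # assumed label, e.g. "generalist", "clinician", "autorater"
    score: float    # assumed 1-5 scale; the abstract does not specify one


def summarize(ratings):
    """Return the mean score per principle (None when a principle has no ratings)."""
    by_principle = defaultdict(list)
    for r in ratings:
        if r.principle not in SHARP_PRINCIPLES:
            raise ValueError(f"unknown principle: {r.principle}")
        by_principle[r.principle].append(r.score)
    return {
        p: (mean(by_principle[p]) if by_principle[p] else None)
        for p in SHARP_PRINCIPLES
    }


def needs_attention(summary, threshold=4.0):
    """List principles whose mean is below an assumed threshold or that lack ratings."""
    return [p for p, s in summary.items() if s is None or s < threshold]
```

In an iterative lifecycle like the one the abstract outlines, a summary of this kind could be recomputed after each evaluation round, with the flagged principles pointing to where further improvement or more ratings are needed.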
The Anatomy of a Personal Health Agent
Authors: A. Ali Heydari, Ken Gu, Vidya Srinivas, Hong Yu, Zhihan Zhang, Yuwei Zhang, Akshay Paruchuri, Qian He, Hamid Palangi, Nova Hammerquist, Ahmed A. Metwally, Brent Winslow, Yubin Kim, Kumar Ayush, Yuzhe Yang, Girish Narayanswamy, Maxwell A. Xu, Jake Garrison, Amy Armento Lee, Jenny Vafeiadou, Ben Graef, Isaac R. Galatzer-Levy, Erik Schenck, Andrew Barakat, Javier Perez, et al. (13 additional authors not shown)
Abstract:
Health is a fundamental pillar of human wellness, and the rapid advancements in large language models (LLMs) have driven the development of a new generation of health agents. However, the application of health agents to fulfill the diverse needs of individuals in daily non-clinical settings is underexplored. In this work, we aim to build a comprehensive personal health agent that is able to reason about multimodal data from everyday consumer wellness devices and common personal health records, and provide personalized health recommendations. To understand end-users' needs when interacting with such an assistant, we conducted an in-depth analysis of web search and health forum queries, alongside qualitative insights from users and health experts gathered through a user-centered design process. Based on these findings, we identified three major categories of consumer health needs, each of which is supported by a specialist sub-agent: (1) a data science agent that analyzes personal time-series wearable and health record data, (2) a health domain expert agent that integrates users' health and contextual data to generate accurate, personalized insights, and (3) a health coach agent that synthesizes data insights, guiding users using a specified psychological strategy and tracking users' progress. Furthermore, we propose and develop the Personal Health Agent (PHA), a multi-agent framework that enables dynamic, personalized interactions to address individual health needs. To evaluate each sub-agent and the multi-agent system, we conducted automated and human evaluations across 10 benchmark tasks, involving more than 7,000 annotations and 1,100 hours of effort from health experts and end-users. Our work represents the most comprehensive evaluation of a health agent to date and establishes a strong foundation towards the futuristic vision of a personal health agent accessible to everyone.
Submitted 18 September, 2025; v1 submitted 27 August, 2025;
originally announced August 2025.
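The abstract above describes three specialist sub-agents (data science, health domain expert, health coach) coordinated by a multi-agent framework. The sketch below illustrates only the general idea of dispatching a query to one of three specialists. It is not the PHA design: the keyword-based router is a deliberately simple stand-in for whatever orchestration the paper uses, and every function name, keyword, and return string here is hypothetical.

```python
from typing import Callable, Dict

# Placeholder sub-agents; real agents would call models and tools.
def data_science_agent(query: str) -> str:
    return f"[data science] analysis for: {query}"


def domain_expert_agent(query: str) -> str:
    return f"[domain expert] insight for: {query}"


def health_coach_agent(query: str) -> str:
    return f"[health coach] guidance for: {query}"


AGENTS: Dict[str, Callable[[str], str]] = {
    "data_science": data_science_agent,
    "domain_expert": domain_expert_agent,
    "health_coach": health_coach_agent,
}

# Hypothetical keyword hints, checked in insertion order.
ROUTING_HINTS = {
    "data_science": ("trend", "average", "correlat", "compare"),
    "health_coach": ("goal", "motivat", "habit", "plan"),
}


def route(query: str) -> str:
    """Pick a sub-agent name by keyword match; default to the domain expert."""
    q = query.lower()
    for agent, hints in ROUTING_HINTS.items():
        if any(h in q for h in hints):
            return agent
    return "domain_expert"


def answer(query: str) -> str:
    """Dispatch the query to the routed sub-agent and return its reply."""
    return AGENTS[route(query)](query)
```

For example, `route("What is my average sleep this month?")` selects the data science agent, while a question with no matching keyword falls through to the domain expert.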