-
LLMs can construct powerful representations and streamline sample-efficient supervised learning
Authors:
Ilker Demirel,
Lawrence Shi,
Zeshan Hussain,
David Sontag
Abstract:
As real-world datasets become increasingly complex and heterogeneous, supervised learning is often bottlenecked by input representation design. Modeling multimodal data for downstream tasks, such as time-series, free text, and structured records, often requires non-trivial domain-specific engineering. We propose an agentic pipeline to streamline this process. First, an LLM analyzes a small but diverse subset of text-serialized input examples in-context to synthesize a global rubric, which acts as a programmatic specification for extracting and organizing evidence. This rubric is then used to transform naive text-serializations of inputs into a more standardized format for downstream models. We also describe local rubrics, which are task-conditioned summaries generated by an LLM. Across 15 clinical tasks from the EHRSHOT benchmark, our rubric-based approaches significantly outperform traditional count-feature models, naive text-serialization-based LLM baselines, and a clinical foundation model, which is pretrained on orders of magnitude more data. Beyond performance, rubrics offer several advantages for operational healthcare settings: they are easy to audit, cost-effective to deploy at scale, and convertible to tabular representations that unlock a swath of machine learning techniques.
Submitted 21 March, 2026; v1 submitted 12 March, 2026;
originally announced March 2026.
-
DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
Authors:
Rulin Shao,
Akari Asai,
Shannon Zejiang Shen,
Hamish Ivison,
Varsha Kishore,
Jingming Zhuo,
Xinran Zhao,
Molly Park,
Samuel G. Finlayson,
David Sontag,
Tyler Murray,
Sewon Min,
Pradeep Dasigi,
Luca Soldaini,
Faeze Brahman,
Wen-tau Yih,
Tongshuang Wu,
Luke Zettlemoyer,
Yoon Kim,
Hannaneh Hajishirzi,
Pang Wei Koh
Abstract:
Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.
Submitted 26 November, 2025; v1 submitted 24 November, 2025;
originally announced November 2025.
-
Autoencoding Dynamics: Topological Limitations and Capabilities
Authors:
Matthew D. Kvalheim,
Eduardo D. Sontag
Abstract:
Given a "data manifold" $M\subset \mathbb{R}^n$ and "latent space" $\mathbb{R}^\ell$, an autoencoder is a pair of continuous maps consisting of an "encoder" $E\colon \mathbb{R}^n\to \mathbb{R}^\ell$ and "decoder" $D\colon \mathbb{R}^\ell\to \mathbb{R}^n$ such that the "round trip" map $D\circ E$ is as close as possible to the identity map $\mbox{id}_M$ on $M$. We present various topological limitations and capabilities inherent to the search for an autoencoder, and describe capabilities for autoencoding dynamical systems having $M$ as an invariant manifold.
Submitted 10 November, 2025; v1 submitted 6 November, 2025;
originally announced November 2025.
-
Completion $\neq$ Collaboration: Scaling Collaborative Effort with Agents
Authors:
Shannon Zejiang Shen,
Valerie Chen,
Ken Gu,
Alexis Ross,
Zixian Ma,
Jillian Ross,
Alex Gu,
Chenglei Si,
Wayne Chi,
Andi Peng,
Jocelyn J Shen,
Ameet Talwalkar,
Tongshuang Wu,
David Sontag
Abstract:
Current evaluations of agents remain centered around one-shot task completion, failing to account for the inherently iterative and collaborative nature of many real-world problems, where human goals are often underspecified and evolve. We argue for a shift from building and assessing task completion agents to developing collaborative agents, assessed not only by the quality of their final outputs but by how well they engage with and enhance human effort throughout the problem-solving process. To support this shift, we introduce collaborative effort scaling, a framework that captures how an agent's utility grows with increasing user involvement. Through case studies and simulated evaluations, we show that state-of-the-art agents often underperform in multi-turn, real-world scenarios, revealing a missing ingredient in agent design: the ability to sustain engagement and scaffold user understanding. Collaborative effort scaling offers a lens for diagnosing agent behavior and guiding development toward more effective interactions.
Submitted 30 October, 2025; v1 submitted 29 October, 2025;
originally announced October 2025.
-
Learning Genetic Circuit Modules with Neural Networks: Full Version
Authors:
Jichi Wang,
Eduardo D. Sontag,
Domitilla Del Vecchio
Abstract:
In several applications, including in synthetic biology, one often has input/output data on a system composed of many modules, and although the modules' input/output functions and signals may be unknown, knowledge of the composition architecture can significantly reduce the amount of training data required to learn the system's input/output mapping. Learning the modules' input/output functions is also necessary for designing new systems from different composition architectures. Here, we propose a modular learning framework, which incorporates prior knowledge of the system's compositional structure to (a) identify the composing modules' input/output functions from the system's input/output data and (b) achieve this by using a reduced amount of data compared to what would be required without knowledge of the compositional structure. To achieve this, we introduce the notion of modular identifiability, which allows recovery of modules' input/output functions from a subset of the system's input/output data, and provide theoretical guarantees on a class of systems motivated by genetic circuits. We demonstrate the theory on computational studies showing that a neural network (NNET) that accounts for the compositional structure can learn the composing modules' input/output functions and predict the system's output on inputs outside of the training set distribution. By contrast, a neural network that is agnostic of the structure is unable to predict on inputs that fall outside of the training set distribution. By reducing the need for experimental data and allowing module identification, this framework offers the potential to ease the design of synthetic biological circuits and of multi-module systems more generally.
Submitted 29 March, 2026; v1 submitted 23 September, 2025;
originally announced September 2025.
-
Some remarks on gradient dominance and LQR policy optimization
Authors:
Eduardo D. Sontag
Abstract:
Solutions of optimization problems, including policy optimization in reinforcement learning, typically rely upon some variant of gradient descent. There has been much recent work in the machine learning, control, and optimization communities applying the Polyak-Łojasiewicz Inequality (PLI) to such problems in order to establish an exponential rate of convergence (a.k.a. ``linear convergence'' in the local-iteration language of numerical analysis) of loss functions to their minima under the gradient flow. Often, as is the case of policy iteration for the continuous-time LQR problem, this rate vanishes for large initial conditions, resulting in a mixed globally linear / locally exponential behavior. This is in sharp contrast with the discrete-time LQR problem, where there is global exponential convergence. That gap between CT and DT behaviors motivates the search for various generalized PLI-like conditions, and this talk will address that topic. Moreover, these generalizations are key to understanding the transient and asymptotic effects of errors in the estimation of the gradient, errors which might arise from adversarial attacks, wrong evaluation by an oracle, early stopping of a simulation, inaccurate and very approximate digital twins, stochastic computations (algorithm ``reproducibility''), or learning by sampling from limited data. We describe an ``input to state stability'' (ISS) analysis of this issue. The second part discusses convergence and PLI-like properties of ``linear feedforward neural networks'' in feedback control. Much of the work described here was done in collaboration with Arthur Castello B. de Oliveira, Leilei Cui, Zhong-Ping Jiang, and Milad Siami.
Submitted 15 July, 2025; v1 submitted 14 July, 2025;
originally announced July 2025.
-
Diagnosing our datasets: How does my language model learn clinical information?
Authors:
Furong Jia,
David Sontag,
Monica Agrawal
Abstract:
Large language models (LLMs) have performed well across various clinical natural language processing tasks, despite not being directly trained on electronic health record (EHR) data. In this work, we examine how popular open-source LLMs learn clinical information from large mined corpora through two crucial but understudied lenses: (1) their interpretation of clinical jargon, a foundational ability for understanding real-world clinical notes, and (2) their responses to unsupported medical claims. For both use cases, we investigate the frequency of relevant clinical information in their corresponding pretraining corpora, the relationship between pretraining data composition and model outputs, and the sources underlying this data. To isolate clinical jargon understanding, we evaluate LLMs on a new dataset MedLingo. Unsurprisingly, we find that the frequency of clinical jargon mentions across major pretraining corpora correlates with model performance. However, jargon frequently appearing in clinical notes often rarely appears in pretraining corpora, revealing a mismatch between available data and real-world usage. Similarly, we find that a non-negligible portion of documents support disputed claims that can then be parroted by models. Finally, we classified and analyzed the types of online sources in which clinical jargon and unsupported medical claims appear, with implications for future dataset composition.
Submitted 22 May, 2025; v1 submitted 20 May, 2025;
originally announced May 2025.
-
Remarks on the Polyak-Lojasiewicz inequality and the convergence of gradient systems
Authors:
Arthur Castello B. de Oliveira,
Leilei Cui,
Eduardo D. Sontag
Abstract:
This work explores generalizations of the Polyak-Lojasiewicz inequality (PLI) and their implications for the convergence behavior of gradient flows in optimization problems. Motivated by the continuous-time linear quadratic regulator (CT-LQR) policy optimization problem -- where only a weaker version of the PLI is characterized in the literature -- this work shows that while weaker conditions are sufficient for global convergence to, and optimality of, the set of critical points of the cost function, the "profile" of the gradient flow solution can change significantly depending on which "flavor" of inequality the cost satisfies. After a general theoretical analysis, we focus on fitting the CT-LQR policy optimization problem to the proposed framework, showing that, in fact, it can never satisfy a PLI in its strongest form. We follow up our analysis with a brief discussion on the difference between continuous- and discrete-time LQR policy optimization, and end the paper with some intuition on the extension of this framework to optimization problems with L1 regularization and solved through proximal gradient flows.
Submitted 30 March, 2025;
originally announced March 2025.
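For orientation, the textbook argument linking a PLI to exponential convergence under gradient flow can be sketched as follows. This is the standard strongest-form inequality written in notation chosen here for illustration ($f$ for the cost, $f^*$ for its minimum value, $\mu > 0$ for the PLI constant); it is not taken from the paper, whose weaker "flavors" of the inequality may be stated differently.

```latex
% Assumed (standard) PLI, with illustrative notation f, f^*, \mu:
\tfrac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu\,\bigl(f(x) - f^*\bigr)
  \qquad \text{for all } x .

% Along the gradient flow \dot{x}(t) = -\nabla f(x(t)):
\frac{d}{dt}\bigl(f(x(t)) - f^*\bigr)
  \;=\; -\,\|\nabla f(x(t))\|^2
  \;\le\; -\,2\mu\,\bigl(f(x(t)) - f^*\bigr).

% Gronwall's inequality then gives exponential decay of the cost gap:
f(x(t)) - f^* \;\le\; e^{-2\mu t}\,\bigl(f(x(0)) - f^*\bigr).
```

When only a weaker inequality holds, the second step no longer yields a uniform rate $2\mu$, which is consistent with the abstract's remark that the "profile" of the solution depends on which inequality the cost satisfies.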
-
CodingGenie: A Proactive LLM-Powered Programming Assistant
Authors:
Sebastian Zhao,
Alan Zhu,
Hussein Mozannar,
David Sontag,
Ameet Talwalkar,
Valerie Chen
Abstract:
While developers increasingly adopt tools powered by large language models (LLMs) in day-to-day workflows, these tools still require explicit user invocation. To seamlessly integrate LLM capabilities into a developer's workflow, we introduce CodingGenie, a proactive assistant integrated into the code editor. CodingGenie autonomously provides suggestions, ranging from bug fixing to unit testing, based on the current code context and allows users to customize suggestions by providing a task description and selecting what suggestions are shown. We demonstrate multiple use cases to show how proactive suggestions from CodingGenie can improve developer experience, and also analyze the cost of adding proactivity. We believe this open-source tool will enable further research into proactive assistants. CodingGenie is open-sourced at https://github.com/sebzhao/CodingGenie/ and video demos are available at https://sebzhao.github.io/CodingGenie/.
Submitted 18 March, 2025;
originally announced March 2025.
-
Large Language Models are Powerful Electronic Health Record Encoders
Authors:
Stefan Hegselmann,
Georg von Arnim,
Tillmann Rheude,
Noel Kronenberg,
David Sontag,
Gerhard Hindricks,
Roland Eils,
Benjamin Wild
Abstract:
Electronic Health Records (EHRs) offer considerable potential for clinical prediction, but their complexity and heterogeneity present significant challenges for traditional machine learning methods. Recently, domain-specific EHR foundation models trained on large volumes of unlabeled EHR data have shown improved predictive accuracy and generalization. However, their development is constrained by limited access to diverse, high-quality datasets, and inconsistencies in coding standards and clinical practices. In this study, we explore the use of general-purpose Large Language Models (LLMs) to encode EHR into high-dimensional representations for downstream clinical prediction tasks. We convert structured EHR data into Markdown-formatted plain-text documents by replacing medical codes with natural language descriptions. This enables the use of LLMs and their extensive semantic understanding and generalization capabilities as effective encoders of EHRs without requiring access to private medical training data. We show that LLM-based embeddings can often match or even surpass the performance of a specialized EHR foundation model, CLMBR-T-Base, across 15 diverse clinical tasks from the EHRSHOT benchmark. Critically, our approach requires no institution-specific training and can incorporate any medical code with a text description, whereas existing EHR foundation models operate on fixed vocabularies and can only process codes seen during pretraining. To demonstrate generalizability, we further evaluate the approach on the UK Biobank (UKB) cohort, out-of-domain for CLMBR-T-Base, whose fixed vocabulary covers only 16% of UKB codes. Notably, an LLM-based model achieves superior performance for prediction of disease onset, hospitalization, and mortality, indicating robustness to population and coding shifts.
Submitted 19 October, 2025; v1 submitted 24 February, 2025;
originally announced February 2025.
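The serialization step described in this abstract (structured EHR events rendered as Markdown, with medical codes replaced by natural-language descriptions) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `serialize_ehr` function, the event tuple layout, and the placeholder codes `CODE_A` and `CODE_B` are hypothetical and are not the authors' implementation or a real ontology.

```python
from typing import Dict, List, Optional, Tuple

# One event: (date, medical code, optional numeric value). Hypothetical layout.
Event = Tuple[str, str, Optional[float]]

# Hypothetical lookup table; a real pipeline would draw on a full code ontology.
CODE_DESCRIPTIONS: Dict[str, str] = {
    "CODE_A": "Hypothetical diagnosis A",
    "CODE_B": "Hypothetical lab test B",
}


def serialize_ehr(events: List[Event], descriptions: Dict[str, str]) -> str:
    """Render events as a Markdown bullet list under a single heading."""
    lines = ["# Patient record", ""]
    for date, code, value in events:
        # Fall back to the raw code when no description is known.
        text = descriptions.get(code, code)
        entry = f"- {date}: {text}"
        if value is not None:
            entry += f" = {value}"
        lines.append(entry)
    return "\n".join(lines)


sample_events: List[Event] = [
    ("2024-01-05", "CODE_A", None),
    ("2024-01-05", "CODE_B", 7.2),
    ("2024-02-01", "CODE_X", None),  # unknown code, kept verbatim
]
markdown_document = serialize_ehr(sample_events, CODE_DESCRIPTIONS)
```

The resulting string is what would then be passed to a general-purpose LLM encoder. Falling back to the raw code for unknown entries is a design choice of this sketch, not something the abstract specifies.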
-
Need Help? Designing Proactive AI Assistants for Programming
Authors:
Valerie Chen,
Alan Zhu,
Sebastian Zhao,
Hussein Mozannar,
David Sontag,
Ameet Talwalkar
Abstract:
While current chat-based AI assistants primarily operate reactively, responding only when prompted by users, there is significant potential for these systems to proactively assist in tasks without explicit invocation, enabling a mixed-initiative interaction. This work explores the design and implementation of proactive AI assistants powered by large language models. We first outline the key design considerations for building effective proactive assistants. As a case study, we propose a proactive chat-based programming assistant that automatically provides suggestions and facilitates their integration into the programmer's code. The programming context provides a shared workspace enabling the assistant to offer more relevant suggestions. We conducted a randomized experimental study examining the impact of various design elements of the proactive assistant on programmer productivity and user experience. Our findings reveal significant benefits of incorporating proactive chat assistants into coding environments and uncover important nuances that influence their usage and effectiveness.
Submitted 28 February, 2025; v1 submitted 6 October, 2024;
originally announced October 2024.
-
Exact Recovery Guarantees for Parameterized Nonlinear System Identification Problem under Sparse Disturbances or Semi-Oblivious Attacks
Authors:
Haixiang Zhang,
Baturalp Yalcin,
Javad Lavaei,
Eduardo D. Sontag
Abstract:
In this work, we study the problem of learning a nonlinear dynamical system by parameterizing its dynamics using basis functions. We assume that disturbances occur at each time step with an arbitrary probability $p$, which models the sparsity level of the disturbance vectors over time. These disturbances are drawn from an arbitrary, unknown probability distribution, which may depend on past disturbances, provided that it satisfies a zero-mean assumption. The primary objective of this paper is to learn the system's dynamics within a finite time and analyze the sample complexity as a function of $p$. To achieve this, we examine a LASSO-type non-smooth estimator, and establish necessary and sufficient conditions for its well-specifiedness and the uniqueness of the global solution to the underlying optimization problem. We then provide exact recovery guarantees for the estimator under two distinct conditions: boundedness and Lipschitz continuity of the basis functions. We show that finite-time exact recovery is achieved with high probability, even when $p$ approaches 1. Unlike prior works, which primarily focus on independent and identically distributed (i.i.d.) disturbances and provide only asymptotic guarantees for system learning, this study presents the first finite-time analysis of nonlinear dynamical systems under a highly general disturbance model. Our framework allows for possible temporal correlations in the disturbances and accommodates semi-oblivious adversarial attacks, significantly broadening the scope of existing theoretical results.
Submitted 20 March, 2025; v1 submitted 30 August, 2024;
originally announced September 2024.
-
Seq-to-Final: A Benchmark for Tuning from Sequential Distributions to a Final Time Point
Authors:
Christina X Ji,
Ahmed M Alaa,
David Sontag
Abstract:
Distribution shift over time occurs in many settings. Leveraging historical data is necessary to learn a model for the last time point when limited data is available in the final period, yet few methods have been developed specifically for this purpose. In this work, we construct a benchmark with different sequences of synthetic shifts to evaluate the effectiveness of 3 classes of methods that 1) learn from all data without adapting to the final period, 2) learn from historical data with no regard to the sequential nature and then adapt to the final period, and 3) leverage the sequential nature of historical data when tailoring a model to the final period. We call this benchmark Seq-to-Final to highlight the focus on using a sequence of time periods to learn a model for the final time point. Our synthetic benchmark allows users to construct sequences with different types of shift and compare different methods. We focus on image classification tasks using CIFAR-10 and CIFAR-100 as the base images for the synthetic sequences. We also evaluate the same methods on the Portraits dataset to explore the relevance to real-world shifts over time. Finally, we create a visualization to contrast the initializations and updates from different methods at the final time step. Our results suggest that, for the sequences in our benchmark, methods that disregard the sequential structure and adapt to the final time point tend to perform well. The approaches we evaluate that leverage the sequential nature do not offer any improvement. We hope that this benchmark will inspire the development of new algorithms that are better at leveraging sequential historical data or a deeper understanding of why methods that disregard the sequential nature are able to perform well.
Submitted 12 July, 2024;
originally announced July 2024.
-
Prediction-powered Generalization of Causal Inferences
Authors:
Ilker Demirel,
Ahmed Alaa,
Anthony Philippakis,
David Sontag
Abstract:
Causal inferences from a randomized controlled trial (RCT) may not pertain to a target population where some effect modifiers have a different distribution. Prior work studies generalizing the results of a trial to a target population with no outcome but covariate data available. We show how the limited size of trials makes generalization a statistically infeasible task, as it requires estimating complex nuisance functions. We develop generalization algorithms that supplement the trial data with a prediction model learned from an additional observational study (OS), without making any assumptions on the OS. We theoretically and empirically show that our methods facilitate better generalization when the OS is high-quality, and remain robust when it is not, e.g., when it has unmeasured confounding.
Submitted 4 June, 2024;
originally announced June 2024.
-
Theoretical Analysis of Weak-to-Strong Generalization
Authors:
Hunter Lang,
David Sontag,
Aravindan Vijayaraghavan
Abstract:
Strong student models can learn from weaker teachers: when trained on the predictions of a weaker model, a strong pretrained student can learn to correct the weak model's errors and generalize to examples where the teacher is not confident, even when these examples are excluded from training. This enables learning from cheap, incomplete, and possibly incorrect label information, such as coarse logical rules or the generations of a language model. We show that existing weak supervision theory fails to account for both of these effects, which we call pseudolabel correction and coverage expansion, respectively. We give a new bound based on expansion properties of the data distribution and student hypothesis class that directly accounts for pseudolabel correction and coverage expansion. Our bounds capture the intuition that weak-to-strong generalization occurs when the strong model is unable to fit the mistakes of the weak teacher without incurring additional error. We show that these expansion properties can be checked from finite data and give empirical evidence that they hold in practice.
Submitted 24 May, 2024;
originally announced May 2024.
-
Evaluating Physician-AI Interaction for Cancer Management: Paving the Path towards Precision Oncology
Authors:
Zeshan Hussain,
Barbara D. Lam,
Fernando A. Acosta-Perez,
Irbaz Bin Riaz,
Maia Jacobs,
Andrew J. Yee,
David Sontag
Abstract:
We evaluated how clinicians approach clinical decision-making when given findings from both randomized controlled trials (RCTs) and machine learning (ML) models. To do so, we designed a clinical decision support system (CDSS) that displays survival curves and adverse event information from a synthetic RCT and ML model for 12 patients with multiple myeloma. We conducted an interventional study in a simulated setting to evaluate how clinicians synthesized the available data to make treatment decisions. Participants were invited to participate in a follow-up interview to discuss their choices in an open-ended format. When ML model results were concordant with RCT results, physicians had increased confidence in treatment choice compared to when they were given RCT results alone. When ML model results were discordant with RCT results, the majority of physicians followed the ML model recommendation in their treatment selection. Perceived reliability of the ML model was consistently higher after physicians were provided with data on how it was trained and validated. Follow-up interviews revealed four major themes: (1) variability in what variables participants used for decision-making, (2) perceived advantages to an ML model over RCT data, (3) uncertainty around decision-making when the ML model quality was poor, and (4) perception that this type of study is an important thought exercise for clinicians. Overall, ML-based CDSSs have the potential to change treatment decisions in cancer management. However, meticulous development and validation of these systems as well as clinician training are required before deployment.
Submitted 23 April, 2024;
originally announced April 2024.
-
The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers
Authors:
Hussein Mozannar,
Valerie Chen,
Mohammed Alsobay,
Subhro Das,
Sebastian Zhao,
Dennis Wei,
Manish Nagireddy,
Prasanna Sattigeri,
Ameet Talwalkar,
David Sontag
Abstract:
Evaluation of large language models for code has primarily relied on static benchmarks, including HumanEval (Chen et al., 2021), or more recently using human preferences of LLM responses. As LLMs are increasingly used as programmer assistants, we study whether gains on existing benchmarks or more preferred LLM responses translate to programmer productivity when coding with LLMs, including time spent coding. We introduce RealHumanEval, a web interface to measure the ability of LLMs to assist programmers, through either autocomplete or chat support. We conducted a user study (N=243) using RealHumanEval in which users interacted with seven LLMs of varying base model performance. Despite static benchmarks not incorporating humans-in-the-loop, we find that improvements in benchmark performance lead to increased programmer productivity; however gaps in benchmark versus human performance are not proportional -- a trend that holds across both forms of LLM support. In contrast, we find that programmer preferences do not correlate with their actual performance, motivating the need for better proxy signals. We open-source RealHumanEval to enable human-centric evaluation of new models and the study data to facilitate efforts to improve code models.
Submitted 14 October, 2024; v1 submitted 3 April, 2024;
originally announced April 2024.
-
Learning to Decode Collaboratively with Multiple Language Models
Authors:
Shannon Zejiang Shen,
Hunter Lang,
Bailin Wang,
Yoon Kim,
David Sontag
Abstract:
We propose a method to teach multiple large language models (LLMs) to collaborate by interleaving their generations at the token level. We model the decision of which LLM generates the next token as a latent variable. By optimizing the marginal likelihood of a training set under our latent variable model, the base LLM automatically learns when to generate itself and when to call on one of the ``assistant'' language models to generate, all without direct supervision. Token-level collaboration during decoding allows for a fusion of each model's expertise in a manner tailored to the specific task at hand. Our collaborative decoding is especially useful in cross-domain settings where a generalist base LLM learns to invoke domain expert models. On instruction-following, domain-specific QA, and reasoning tasks, we show that the performance of the joint system exceeds that of the individual models. Through qualitative analysis of the learned latent decisions, we show models trained with our method exhibit several interesting collaboration patterns, e.g., template-filling. Our code is available at https://github.com/clinicalml/co-llm.
Submitted 27 August, 2024; v1 submitted 6 March, 2024;
originally announced March 2024.
-
Med-Real2Sim: Non-Invasive Medical Digital Twins using Physics-Informed Self-Supervised Learning
Authors:
Keying Kuang,
Frances Dean,
Jack B. Jedlicki,
David Ouyang,
Anthony Philippakis,
David Sontag,
Ahmed M. Alaa
Abstract:
A digital twin is a virtual replica of a real-world physical phenomenon that uses mathematical modeling to characterize and simulate its defining features. By constructing digital twins for disease processes, we can perform in-silico simulations that mimic patients' health conditions and counterfactual outcomes under hypothetical interventions in a virtual setting. This eliminates the need for invasive procedures or uncertain treatment decisions. In this paper, we propose a method to identify digital twin model parameters using only noninvasive patient health data. We approach the digital twin modeling as a composite inverse problem, and observe that its structure resembles pretraining and finetuning in self-supervised learning (SSL). Leveraging this, we introduce a physics-informed SSL algorithm that initially pretrains a neural network on the pretext task of learning a differentiable simulator of a physiological process. Subsequently, the model is trained to reconstruct physiological measurements from noninvasive modalities while being constrained by the physical equations learned in pretraining. We apply our method to identify digital twins of cardiac hemodynamics using noninvasive echocardiogram videos, and demonstrate its utility in unsupervised disease detection and in-silico clinical trials.
Submitted 31 October, 2024; v1 submitted 29 February, 2024;
originally announced March 2024.
-
A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models
Authors:
Stefan Hegselmann,
Shannon Zejiang Shen,
Florian Gierse,
Monica Agrawal,
David Sontag,
Xiaoyi Jiang
Abstract:
Patients often face difficulties in understanding their hospitalizations, while healthcare workers have limited resources to provide explanations. In this work, we investigate the potential of large language models to generate patient summaries based on doctors' notes and study the effect of training data on the faithfulness and quality of the generated summaries. To this end, we release (i) a rigorous labeling protocol for errors in medical texts and (ii) a publicly available dataset of annotated hallucinations in 100 doctor-written and 100 generated summaries. We show that fine-tuning on hallucination-free data effectively reduces hallucinations from 2.60 to 1.55 per summary for Llama 2, while preserving relevant information. We observe a similar effect on GPT-4 (0.70 to 0.40), when the few-shot examples are hallucination-free. We also conduct a qualitative evaluation using hallucination-free and improved training data. We find that common quantitative metrics do not correlate well with faithfulness and quality. Finally, we test GPT-4 for automatic hallucination detection, which clearly outperforms common baselines.
Submitted 25 June, 2024; v1 submitted 23 February, 2024;
originally announced February 2024.
-
Impact of Large Language Model Assistance on Patients Reading Clinical Notes: A Mixed-Methods Study
Authors:
Niklas Mannhardt,
Elizabeth Bondi-Kelly,
Barbara Lam,
Hussein Mozannar,
Chloe O'Connell,
Mercy Asiedu,
Alejandro Buendia,
Tatiana Urman,
Irbaz B. Riaz,
Catherine E. Ricciardi,
Monica Agrawal,
Marzyeh Ghassemi,
David Sontag
Abstract:
Large language models (LLMs) have immense potential to make information more accessible, particularly in medicine, where complex medical jargon can hinder patient comprehension of clinical notes. We developed a patient-facing tool using LLMs to make clinical notes more readable by simplifying, extracting information from, and adding context to the notes. We piloted the tool with clinical notes donated by patients with a history of breast cancer and synthetic notes from a clinician. Participants (N=200, healthy, female-identifying patients) were randomly assigned three clinical notes in our tool with varying levels of augmentations and answered quantitative and qualitative questions evaluating their understanding of follow-up actions. Augmentations significantly increased their quantitative understanding scores. In-depth interviews were conducted with participants (N=7, patients with a history of breast cancer), revealing both positive sentiments about the augmentations and concerns about AI. We also performed a qualitative clinician-driven analysis of the model's error modes.
Submitted 14 October, 2024; v1 submitted 17 January, 2024;
originally announced January 2024.
-
Towards Verifiable Text Generation with Symbolic References
Authors:
Lucas Torroba Hennigen,
Shannon Shen,
Aniruddha Nrusimha,
Bernhard Gapp,
David Sontag,
Yoon Kim
Abstract:
LLMs are vulnerable to hallucinations, and thus their outputs generally require laborious human verification for high-stakes applications. To this end, we propose symbolically grounded generation (SymGen) as a simple approach for enabling easier manual validation of an LLM's output. SymGen prompts an LLM to interleave its regular output text with explicit symbolic references to fields present in some conditioning data (e.g., a table in JSON format). The references can be used to display the provenance of different spans of text in the generation, reducing the effort required for manual verification. Across a range of data-to-text and question-answering experiments, we find that LLMs are able to directly output text that makes use of accurate symbolic references while maintaining fluency and factuality. In a human study we further find that such annotations can streamline human verification of machine-generated text. Our code will be available at http://symgen.github.io.
Submitted 15 April, 2024; v1 submitted 15 November, 2023;
originally announced November 2023.
-
Effective Human-AI Teams via Learned Natural Language Rules and Onboarding
Authors:
Hussein Mozannar,
Jimin J Lee,
Dennis Wei,
Prasanna Sattigeri,
Subhro Das,
David Sontag
Abstract:
People are relying on AI agents to assist them with various tasks. The human must know when to rely on the agent, collaborate with the agent, or ignore its suggestions. In this work, we propose to learn rules, grounded in data regions and described in natural language, that illustrate how the human should collaborate with the AI. Our novel region discovery algorithm finds local regions in the data as neighborhoods in an embedding space where prior human behavior should be corrected. Each region is then described using a large language model in an iterative and contrastive procedure. We then teach these rules to the human via an onboarding stage. Through user studies on object detection and question-answering tasks, we show that our method can lead to more accurate human-AI teams. We also evaluate our region discovery and description algorithms separately.
Submitted 7 November, 2023; v1 submitted 2 November, 2023;
originally announced November 2023.
-
Why should autoencoders work?
Authors:
Matthew D. Kvalheim,
Eduardo D. Sontag
Abstract:
Deep neural network autoencoders are routinely used computationally for model reduction. They allow recognizing the intrinsic dimension of data that lie in a $k$-dimensional subset $K$ of an input Euclidean space $\mathbb{R}^n$. The underlying idea is to obtain both an encoding layer that maps $\mathbb{R}^n$ into $\mathbb{R}^k$ (called the bottleneck layer or the space of latent variables) and a decoding layer that maps $\mathbb{R}^k$ back into $\mathbb{R}^n$, in such a way that the input data from the set $K$ is recovered when composing the two maps. This is achieved by adjusting parameters (weights) in the network to minimize the discrepancy between the input and the reconstructed output. Since neural networks (with continuous activation functions) compute continuous maps, the existence of a network that achieves perfect reconstruction would imply that $K$ is homeomorphic to a $k$-dimensional subset of $\mathbb{R}^k$, so clearly there are topological obstructions to finding such a network. On the other hand, in practice the technique is found to "work" well, which leads one to ask if there is a way to explain this effectiveness. We show that, up to small errors, indeed the method is guaranteed to work. This is done by appealing to certain facts from differential topology. A computational example is also included to illustrate the ideas.
Submitted 17 February, 2024; v1 submitted 3 October, 2023;
originally announced October 2023.
-
Conceptualizing Machine Learning for Dynamic Information Retrieval of Electronic Health Record Notes
Authors:
Sharon Jiang,
Shannon Shen,
Monica Agrawal,
Barbara Lam,
Nicholas Kurtzman,
Steven Horng,
David Karger,
David Sontag
Abstract:
The large amount of time clinicians spend sifting through patient notes and documenting in electronic health records (EHRs) is a leading cause of clinician burnout. By proactively and dynamically retrieving relevant notes during the documentation process, we can reduce the effort required to find relevant patient history. In this work, we conceptualize the use of EHR audit logs for machine learning as a source of supervision of note relevance in a specific clinical context, at a particular point in time. Our evaluation focuses on the dynamic retrieval in the emergency department, a high acuity setting with unique patterns of information retrieval and note writing. We show that our methods can achieve an AUC of 0.963 for predicting which notes will be read in an individual note writing session. We additionally conduct a user study with several clinicians and find that our framework can help clinicians retrieve relevant information more efficiently. Demonstrating that our framework and methods can perform well in this demanding setting is a promising proof of concept that they will translate to other clinical settings and data modalities (e.g., labs, medications, imaging).
Submitted 9 August, 2023;
originally announced August 2023.
-
Closing the Gap in High-Risk Pregnancy Care Using Machine Learning and Human-AI Collaboration
Authors:
Hussein Mozannar,
Yuria Utsumi,
Irene Y. Chen,
Stephanie S. Gervasi,
Michele Ewing,
Aaron Smith-McLallen,
David Sontag
Abstract:
A high-risk pregnancy is a pregnancy complicated by factors that can adversely affect the outcomes of the mother or the infant. Health insurers use algorithms to identify members who would benefit from additional clinical support. This work presents the implementation of a real-world ML-based system to assist care managers in identifying pregnant patients at risk of complications. In this retrospective evaluation study, we developed a novel hybrid-ML classifier to predict whether patients are pregnant and trained a standard classifier using claims data from a health insurance company in the US to predict whether a patient will develop pregnancy complications. These models were developed in cooperation with the care management team and integrated into a user interface with explanations for the nurses. The proposed models outperformed commonly used claim codes for the identification of pregnant patients at the expense of a manageable false positive rate. Our risk complication classifier shows that we can accurately triage patients by risk of complication. Our approach and evaluation are guided by human-centric design. In user studies with the nurses, they preferred the proposed models over existing approaches.
Submitted 22 April, 2024; v1 submitted 26 May, 2023;
originally announced May 2023.
-
On the ISS Property of the Gradient Flow for Single Hidden-Layer Neural Networks with Linear Activations
Authors:
Arthur Castello B. de Oliveira,
Milad Siami,
Eduardo D. Sontag
Abstract:
Recent research in neural networks and machine learning suggests that using many more parameters than strictly required by the initial complexity of a regression problem can result in more accurate or faster-converging models -- contrary to classical statistical belief. This phenomenon, sometimes known as ``benign overfitting'', raises questions regarding in what other ways overparameterization might affect the properties of a learning problem. In this work, we investigate the effects of overfitting on the robustness of gradient-descent training when subject to uncertainty on the gradient estimation. This uncertainty arises naturally if the gradient is estimated from noisy data or directly measured. Our object of study is a linear neural network with a single, arbitrarily wide, hidden layer and an arbitrary number of inputs and outputs. In this paper we solve the problem for the case where the input and output of our neural network are one-dimensional, deriving sufficient conditions for robustness of our system based on necessary and sufficient conditions for convergence in the undisturbed case. We then show that the general overparameterized formulation introduces a set of spurious equilibria which lie outside the set where the loss function is minimized, and discuss directions of future work that might extend our current results for more general formulations.
Submitted 16 May, 2023;
originally announced May 2023.
-
Large-Scale Study of Temporal Shift in Health Insurance Claims
Authors:
Christina X Ji,
Ahmed M Alaa,
David Sontag
Abstract:
Most machine learning models for predicting clinical outcomes are developed using historical data. Yet, even if these models are deployed in the near future, dataset shift over time may result in less than ideal performance. To capture this phenomenon, we consider a task--that is, an outcome to be predicted at a particular time point--to be non-stationary if a historical model is no longer optimal for predicting that outcome. We build an algorithm to test for temporal shift either at the population level or within a discovered sub-population. Then, we construct a meta-algorithm to perform a retrospective scan for temporal shift on a large collection of tasks. Our algorithms enable us to perform the first comprehensive evaluation of temporal shift in healthcare to our knowledge. We create 1,010 tasks by evaluating 242 healthcare outcomes for temporal shift from 2015 to 2020 on a health insurance claims dataset. 9.7% of the tasks show temporal shifts at the population level, and 93.0% have some sub-population affected by shifts. We dive into case studies to understand the clinical implications. Our analysis highlights the widespread prevalence of temporal shifts in healthcare.
Submitted 18 June, 2023; v1 submitted 8 May, 2023;
originally announced May 2023.
-
Beyond Summarization: Designing AI Support for Real-World Expository Writing Tasks
Authors:
Zejiang Shen,
Tal August,
Pao Siangliulue,
Kyle Lo,
Jonathan Bragg,
Jeff Hammerbacher,
Doug Downey,
Joseph Chee Chang,
David Sontag
Abstract:
Large language models have introduced exciting new opportunities and challenges in designing and developing new AI-assisted writing support tools. Recent work has shown that leveraging this new technology can transform writing in many scenarios such as ideation during creative writing, editing support, and summarization. However, AI-supported expository writing--including real-world tasks like scholars writing literature reviews or doctors writing progress notes--is relatively understudied. In this position paper, we argue that developing AI supports for expository writing has unique and exciting research challenges and can lead to high real-world impacts. We characterize expository writing as evidence-based and knowledge-generating: it contains summaries of external documents as well as new information or knowledge. It can be seen as the product of authors' sensemaking process over a set of source documents, and the interplay between reading, reflection, and writing opens up new opportunities for designing AI support. We sketch three components for AI support design and discuss considerations for future research.
Submitted 5 April, 2023;
originally announced April 2023.
-
Conformalized Unconditional Quantile Regression
Authors:
Ahmed M. Alaa,
Zeshan Hussain,
David Sontag
Abstract:
We develop a predictive inference procedure that combines conformal prediction (CP) with unconditional quantile regression (QR) -- a commonly used tool in econometrics that involves regressing the recentered influence function (RIF) of the quantile functional over input covariates. Unlike the more widely-known conditional QR, unconditional QR explicitly captures the impact of changes in covariate distribution on the quantiles of the marginal distribution of outcomes. Leveraging this property, our procedure issues adaptive predictive intervals with localized frequentist coverage guarantees. It operates by fitting a machine learning model for the RIFs using training data, and then applying the CP procedure for any test covariate with respect to a ``hypothetical'' covariate distribution localized around the new instance. Experiments show that our procedure is adaptive to heteroscedasticity, provides transparent coverage guarantees that are relevant to the test instance at hand, and performs competitively with existing methods in terms of efficiency.
Submitted 3 April, 2023;
originally announced April 2023.
-
Falsification of Internal and External Validity in Observational Studies via Conditional Moment Restrictions
Authors:
Zeshan Hussain,
Ming-Chieh Shih,
Michael Oberst,
Ilker Demirel,
David Sontag
Abstract:
Randomized Controlled Trials (RCTs) are relied upon to assess new treatments, but suffer from limited power to guide personalized treatment decisions. On the other hand, observational (i.e., non-experimental) studies have large and diverse populations, but are prone to various biases (e.g., residual confounding). To safely leverage the strengths of observational studies, we focus on the problem of falsification, whereby RCTs are used to validate causal effect estimates learned from observational data. In particular, we show that, given data from both an RCT and an observational study, assumptions on internal and external validity have an observable, testable implication in the form of a set of Conditional Moment Restrictions (CMRs). Further, we show that expressing these CMRs with respect to the causal effect, or "causal contrast", as opposed to individual counterfactual means, provides a more reliable falsification test. In addition to giving guarantees on the asymptotic properties of our test, we demonstrate superior power and type I error of our approach on semi-synthetic and real world datasets. Our approach is interpretable, allowing a practitioner to visualize which subgroups in the population lead to falsification of an observational study.
Submitted 6 March, 2023; v1 submitted 30 January, 2023;
originally announced January 2023.
-
Who Should Predict? Exact Algorithms For Learning to Defer to Humans
Authors:
Hussein Mozannar,
Hunter Lang,
Dennis Wei,
Prasanna Sattigeri,
Subhro Das,
David Sontag
Abstract:
Automated AI classifiers should be able to defer the prediction to a human decision maker to ensure more accurate predictions. In this work, we jointly train a classifier with a rejector, which decides on each data point whether the classifier or the human should predict. We show that prior approaches can fail to find a human-AI system with low misclassification error even when there exists a linear classifier and rejector that have zero error (the realizable setting). We prove that obtaining a linear pair with low error is NP-hard even when the problem is realizable. To complement this negative result, we give a mixed-integer-linear-programming (MILP) formulation that can optimally solve the problem in the linear setting. However, the MILP only scales to moderately-sized problems. Therefore, we provide a novel surrogate loss function that is realizable-consistent and performs well empirically. We test our approaches on a comprehensive set of datasets and compare to a wide range of baselines.
Submitted 11 April, 2023; v1 submitted 15 January, 2023;
originally announced January 2023.
-
TabLLM: Few-shot Classification of Tabular Data with Large Language Models
Authors:
Stefan Hegselmann,
Alejandro Buendia,
Hunter Lang,
Monica Agrawal,
Xiaoyi Jiang,
David Sontag
Abstract:
We study the application of large language models to zero-shot and few-shot classification of tabular data. We prompt the large language model with a serialization of the tabular data to a natural-language string, together with a short description of the classification problem. In the few-shot setting, we fine-tune the large language model using some labeled examples. We evaluate several serialization methods including templates, table-to-text models, and large language models. Despite its simplicity, we find that this technique outperforms prior deep-learning-based tabular classification methods on several benchmark datasets. In most cases, even zero-shot classification obtains non-trivial performance, illustrating the method's ability to exploit prior knowledge encoded in large language models. Unlike many deep learning methods for tabular datasets, this approach is also competitive with strong traditional baselines like gradient-boosted trees, especially in the very-few-shot setting.
Submitted 17 March, 2023; v1 submitted 19 October, 2022;
originally announced October 2022.
-
Falsification before Extrapolation in Causal Effect Estimation
Authors:
Zeshan Hussain,
Michael Oberst,
Ming-Chieh Shih,
David Sontag
Abstract:
Randomized Controlled Trials (RCTs) represent a gold standard when developing policy guidelines. However, RCTs are often narrow, and lack data on broader populations of interest. Causal effects in these populations are often estimated using observational datasets, which may suffer from unobserved confounding and selection bias. Given a set of observational estimates (e.g. from multiple studies), we propose a meta-algorithm that attempts to reject observational estimates that are biased. We do so using validation effects, causal effects that can be inferred from both RCT and observational data. After rejecting estimators that do not pass this test, we generate conservative confidence intervals on the extrapolated causal effects for subgroups not observed in the RCT. Under the assumption that at least one observational estimator is asymptotically normal and consistent for both the validation and extrapolated effects, we provide guarantees on the coverage probability of the intervals output by our algorithm. To facilitate hypothesis testing in settings where causal effect transportation across datasets is necessary, we give conditions under which a doubly-robust estimator of group average treatment effects is asymptotically normal, even when flexible machine learning methods are used for estimation of nuisance parameters. We illustrate the properties of our approach on semi-synthetic and real world datasets, and show that it compares favorably to standard meta-analysis techniques.
Submitted 6 March, 2023; v1 submitted 27 September, 2022;
originally announced September 2022.
-
Sample Efficient Learning of Predictors that Complement Humans
Authors:
Mohammad-Amin Charusaie,
Hussein Mozannar,
David Sontag,
Samira Samadi
Abstract:
One of the goals of learning algorithms is to complement and reduce the burden on human decision makers. The expert deferral setting, wherein an algorithm can either predict on its own or defer the decision to a downstream expert, helps accomplish this goal. A fundamental aspect of this setting is the need to learn complementary predictors that improve on the human's weaknesses rather than learning predictors optimized for average error. In this work, we provide the first theoretical analysis of the benefit of learning complementary predictors in expert deferral. To enable efficiently learning such predictors, we consider a family of consistent surrogate loss functions for expert deferral and analyze their theoretical properties. Finally, we design active learning schemes that require a minimal amount of data of human expert predictions in order to learn accurate deferral systems.
Submitted 19 July, 2022;
originally announced July 2022.
-
Training Subset Selection for Weak Supervision
Authors:
Hunter Lang,
Aravindan Vijayaraghavan,
David Sontag
Abstract:
Existing weak supervision approaches use all the data covered by weak signals to train a classifier. We show both theoretically and empirically that this is not always optimal. Intuitively, there is a tradeoff between the amount of weakly-labeled data and the precision of the weak labels. We explore this tradeoff by combining pretrained data representations with the cut statistic (Muhlenbach et al., 2004) to select (hopefully) high-quality subsets of the weakly-labeled training data. Subset selection applies to any label model and classifier and is very simple to plug in to existing weak supervision pipelines, requiring just a few lines of code. We show our subset selection method improves the performance of weak supervision for a wide range of label models, classifiers, and datasets. Using less weakly-labeled data improves the accuracy of weak supervision pipelines by up to 19% (absolute) on benchmark tasks.
Submitted 6 March, 2023; v1 submitted 6 June, 2022;
originally announced June 2022.
-
Evaluating Robustness to Dataset Shift via Parametric Robustness Sets
Authors:
Nikolaj Thams,
Michael Oberst,
David Sontag
Abstract:
We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. These shifts are defined via parametric changes in the causal mechanisms of observed variables, where constraints on parameters yield a "robustness set" of plausible distributions and a corresponding worst-case loss over the set. While the loss under an individual parametric shift can be estimated via reweighting techniques such as importance sampling, the resulting worst-case optimization problem is non-convex, and the estimate may suffer from large variance. For small shifts, however, we can construct a local second-order approximation to the loss under shift and cast the problem of finding a worst-case shift as a particular non-convex quadratic optimization problem, for which efficient algorithms are available. We demonstrate that this second-order approximation can be estimated directly for shifts in conditional exponential family models, and we bound the approximation error. We apply our approach to a computer vision task (classifying gender from images), revealing sensitivity to shifts in non-causal attributes.
Submitted 15 January, 2023; v1 submitted 31 May, 2022;
originally announced May 2022.
-
Large Language Models are Few-Shot Clinical Information Extractors
Authors:
Monica Agrawal,
Stefan Hegselmann,
Hunter Lang,
Yoon Kim,
David Sontag
Abstract:
A long-running goal of the clinical NLP community is the extraction of important variables trapped in clinical notes. However, roadblocks have included dataset shift from the general domain and a lack of public clinical corpora and annotations. In this work, we show that large language models, such as InstructGPT, perform well at zero- and few-shot information extraction from clinical text despite not being trained specifically for the clinical domain. Whereas text classification and generation performance have already been studied extensively in such models, here we additionally demonstrate how to leverage them to tackle a diverse set of NLP tasks which require more structured outputs, including span identification, token-level sequence classification, and relation extraction. Further, due to the dearth of available data to evaluate these systems, we introduce new datasets for benchmarking few-shot clinical information extraction based on a manual re-annotation of the CASI dataset for new tasks. On the clinical extraction tasks we studied, the GPT-3 systems significantly outperform existing zero- and few-shot baselines.
Submitted 30 November, 2022; v1 submitted 25 May, 2022;
originally announced May 2022.
-
Co-training Improves Prompt-based Learning for Large Language Models
Authors:
Hunter Lang,
Monica Agrawal,
Yoon Kim,
David Sontag
Abstract:
We demonstrate that co-training (Blum & Mitchell, 1998) can improve the performance of prompt-based learning by using unlabeled data. While prompting has emerged as a promising paradigm for few-shot and zero-shot learning, it is often brittle and requires much larger models compared to the standard supervised setup. We find that co-training makes it possible to improve the original prompt model and at the same time learn a smaller, downstream task-specific model. In the case where we only have partial access to a prompt model (e.g., output probabilities from GPT-3 (Brown et al., 2020)) we learn a calibration model over the prompt outputs. When we have full access to the prompt model's gradients but full finetuning remains prohibitively expensive (e.g., T0 (Sanh et al., 2021)), we learn a set of soft prompt continuous vectors to iteratively update the prompt model. We find that models trained in this manner can significantly improve performance on challenging datasets where there is currently a large gap between prompt-based learning and fully-supervised models.
Submitted 1 February, 2022;
originally announced February 2022.
-
Teaching Humans When To Defer to a Classifier via Exemplars
Authors:
Hussein Mozannar,
Arvind Satyanarayan,
David Sontag
Abstract:
Expert decision makers are starting to rely on data-driven automated agents to assist them with various tasks. For this collaboration to perform properly, the human decision maker must have a mental model of when and when not to rely on the agent. In this work, we aim to ensure that human decision makers learn a valid mental model of the agent's strengths and weaknesses. To accomplish this goal, we propose an exemplar-based teaching strategy where humans solve the task with the help of the agent and try to formulate a set of guidelines of when and when not to defer. We present a novel parameterization of the human's mental model of the AI that applies a nearest neighbor rule in local regions surrounding the teaching examples. Using this model, we derive a near-optimal strategy for selecting a representative teaching set. We validate the benefits of our teaching strategy on a multi-hop question answering task using crowd workers and find that when workers draw the right lessons from the teaching stage, their task performance improves. We furthermore validate our method on a set of synthetic experiments.
Submitted 13 December, 2021; v1 submitted 22 November, 2021;
originally announced November 2021.
-
Leveraging Time Irreversibility with Order-Contrastive Pre-training
Authors:
Monica Agrawal,
Hunter Lang,
Michael Offin,
Lior Gazit,
David Sontag
Abstract:
Label-scarce, high-dimensional domains such as healthcare present a challenge for modern machine learning techniques. To overcome the difficulties posed by a lack of labeled data, we explore an "order-contrastive" method for self-supervised pre-training on longitudinal data. We sample pairs of time segments, switch the order for half of them, and train a model to predict whether a given pair is in the correct order. Intuitively, the ordering task allows the model to attend to the least time-reversible features (for example, features that indicate progression of a chronic disease). The same features are often useful for downstream tasks of interest. To quantify this, we study a simple theoretical setting where we prove a finite-sample guarantee for the downstream error of a representation learned with order-contrastive pre-training. Empirically, in synthetic and longitudinal healthcare settings, we demonstrate the effectiveness of order-contrastive pre-training in the small-data regime over supervised learning and other self-supervised pre-training baselines. Our results indicate that pre-training methods designed for particular classes of distributions and downstream tasks can improve the performance of self-supervised learning.
Submitted 29 March, 2022; v1 submitted 3 November, 2021;
originally announced November 2021.
-
Using Time-Series Privileged Information for Provably Efficient Learning of Prediction Models
Authors:
Rickard K. A. Karlsson,
Martin Willbo,
Zeshan Hussain,
Rahul G. Krishnan,
David Sontag,
Fredrik D. Johansson
Abstract:
We study prediction of future outcomes with supervised models that use privileged information during learning. The privileged information comprises samples of time series observed between the baseline time of prediction and the future outcome; this information is only available at training time, which differs from traditional supervised learning. Our question is when using this privileged data leads to more sample-efficient learning of models that use only baseline data for predictions at test time. We give an algorithm for this setting and prove that when the time series are drawn from a non-stationary Gaussian-linear dynamical system of fixed horizon, learning with privileged information is more efficient than learning without it. On synthetic data, we test the limits of our algorithm and theory, both when our assumptions hold and when they are violated. On three diverse real-world datasets, we show that our approach is generally preferable to classical learning, particularly when data is scarce. Finally, we relate our estimator to a distillation approach both theoretically and empirically.
Submitted 5 May, 2022; v1 submitted 28 October, 2021;
originally announced October 2021.
-
Finding Regions of Heterogeneity in Decision-Making via Expected Conditional Covariance
Authors:
Justin Lim,
Christina X Ji,
Michael Oberst,
Saul Blecker,
Leora Horwitz,
David Sontag
Abstract:
Individuals often make different decisions when faced with the same context, due to personal preferences and background. For instance, judges may vary in their leniency towards certain drug-related offenses, and doctors may vary in their preference for how to start treatment for certain types of patients. With these examples in mind, we present an algorithm for identifying types of contexts (e.g., types of cases or patients) with high inter-decision-maker disagreement. We formalize this as a causal inference problem, seeking a region where the assignment of decision-maker has a large causal effect on the decision. Our algorithm finds such a region by maximizing an empirical objective, and we give a generalization bound for its performance. In a semi-synthetic experiment, we show that our algorithm recovers the correct region of heterogeneity accurately compared to baselines. Finally, we apply our algorithm to real-world healthcare datasets, recovering variation that aligns with existing clinical knowledge.
Submitted 27 October, 2021;
originally announced October 2021.
-
MedKnowts: Unified Documentation and Information Retrieval for Electronic Health Records
Authors:
Luke Murray,
Divya Gopinath,
Monica Agrawal,
Steven Horng,
David Sontag,
David R. Karger
Abstract:
Clinical documentation can be transformed by Electronic Health Records, yet documentation is still a tedious, time-consuming, and error-prone process. Clinicians are faced with multi-faceted requirements and fragmented interfaces for information exploration and documentation. These challenges are only exacerbated in the Emergency Department -- clinicians often see 35 patients in one shift, during which they have to synthesize an often previously unknown patient's medical records in order to reach a tailored diagnosis and treatment plan. To better support this information synthesis, clinical documentation tools must enable rapid contextual access to the patient's medical record. MedKnowts is an integrated note-taking editor and information retrieval system which unifies the documentation and search process and provides concise synthesized concept-oriented slices of the patient's medical record. MedKnowts automatically captures structured data while still allowing users the flexibility of natural language. MedKnowts leverages this structure to enable easier parsing of long notes, auto-populated text, and proactive information retrieval, easing the documentation burden.
Submitted 23 September, 2021;
originally announced September 2021.
-
CLIP: A Dataset for Extracting Action Items for Physicians from Hospital Discharge Notes
Authors:
James Mullenbach,
Yada Pruksachatkun,
Sean Adler,
Jennifer Seale,
Jordan Swartz,
T. Greg McKelvey,
Hui Dai,
Yi Yang,
David Sontag
Abstract:
Continuity of care is crucial to ensuring positive health outcomes for patients discharged from an inpatient hospital setting, and improved information sharing can help. To share information, caregivers write discharge notes containing action items to share with patients and their future caregivers, but these action items are easily lost due to the lengthiness of the documents. In this work, we describe our creation of a dataset of clinical action items annotated over MIMIC-III, the largest publicly available dataset of real clinical notes. This dataset, which we call CLIP, is annotated by physicians and covers 718 documents representing 100K sentences. We describe the task of extracting the action items from these documents as multi-aspect extractive summarization, with each aspect representing a type of action to be taken. We evaluate several machine learning models on this task, and show that the best models exploit in-domain language model pre-training on 59K unannotated documents, and incorporate context from neighboring sentences. We also propose an approach to pre-training data selection that allows us to explore the trade-off between size and domain-specificity of pre-training datasets for this task.
Submitted 4 June, 2021;
originally announced June 2021.
-
Assessing the Impact of Automated Suggestions on Decision Making: Domain Experts Mediate Model Errors but Take Less Initiative
Authors:
Ariel Levy,
Monica Agrawal,
Arvind Satyanarayan,
David Sontag
Abstract:
Automated decision support can accelerate tedious tasks as users can focus their attention where it is needed most. However, a key concern is whether users overly trust or cede agency to automation. In this paper, we investigate the effects of introducing automation to annotating clinical texts--a multi-step, error-prone task of identifying clinical concepts (e.g., procedures) in medical notes, and mapping them to labels in a large ontology. We consider two forms of decision aid: recommending which labels to map concepts to, and pre-populating annotation suggestions. Through laboratory studies, we find that 18 clinicians generally build intuition of when to rely on automation and when to exercise their own judgement. However, when presented with fully pre-populated suggestions, these expert users exhibit less agency: accepting improper mentions, and taking less initiative in creating additional annotations. Our findings inform how systems and algorithms should be designed to mitigate the observed issues.
Submitted 29 March, 2021; v1 submitted 8 March, 2021;
originally announced March 2021.
-
Regularizing towards Causal Invariance: Linear Models with Proxies
Authors:
Michael Oberst,
Nikolaj Thams,
Jonas Peters,
David Sontag
Abstract:
We propose a method for learning linear models whose predictive performance is robust to causal interventions on unobserved variables, when noisy proxies of those variables are available. Our approach takes the form of a regularization term that trades off between in-distribution performance and robustness to interventions. Under the assumption of a linear structural causal model, we show that a single proxy can be used to create estimators that are prediction optimal under interventions of bounded strength. This strength depends on the magnitude of the measurement noise in the proxy, which is, in general, not identifiable. In the case of two proxy variables, we propose a modified estimator that is prediction optimal under interventions up to a known strength. We further show how to extend these estimators to scenarios where additional information about the "test time" intervention is available during training. We evaluate our theoretical findings in synthetic experiments and using real data of hourly pollution levels across several cities in China.
Submitted 27 June, 2021; v1 submitted 3 March, 2021;
originally announced March 2021.
-
Beyond Perturbation Stability: LP Recovery Guarantees for MAP Inference on Noisy Stable Instances
Authors:
Hunter Lang,
Aravind Reddy,
David Sontag,
Aravindan Vijayaraghavan
Abstract:
Several works have shown that perturbation stable instances of the MAP inference problem in Potts models can be solved exactly using a natural linear programming (LP) relaxation. However, most of these works give few (or no) guarantees for the LP solutions on instances that do not satisfy the relatively strict perturbation stability definitions. In this work, we go beyond these stability results by showing that the LP approximately recovers the MAP solution of a stable instance even after the instance is corrupted by noise. This "noisy stable" model realistically fits with practical MAP inference problems: we design an algorithm for finding "close" stable instances, and show that several real-world instances from computer vision have nearby instances that are perturbation stable. These results suggest a new theoretical explanation for the excellent performance of this LP relaxation in practice.
Submitted 26 February, 2021;
originally announced March 2021.
-
Neural Pharmacodynamic State Space Modeling
Authors:
Zeshan Hussain,
Rahul G. Krishnan,
David Sontag
Abstract:
Modeling the time-series of high-dimensional, longitudinal data is important for predicting patient disease progression. However, existing neural network based approaches that learn representations of patient state, while very flexible, are susceptible to overfitting. We propose a deep generative model that makes use of a novel attention-based neural architecture inspired by the physics of how treatments affect disease state. The result is a scalable and accurate model of high-dimensional patient biomarkers as they vary over time. Our proposed model yields significant improvements in generalization and, on real-world clinical data, provides interpretable insights into the dynamics of cancer progression.
Submitted 17 June, 2021; v1 submitted 22 February, 2021;
originally announced February 2021.
-
Clustering Interval-Censored Time-Series for Disease Phenotyping
Authors:
Irene Y. Chen,
Rahul G. Krishnan,
David Sontag
Abstract:
Unsupervised learning is often used to uncover clusters in data. However, different kinds of noise may impede the discovery of useful patterns from real-world time-series data. In this work, we focus on mitigating the interference of interval censoring in the task of clustering for disease phenotyping. We develop a deep generative, continuous-time model of time-series data that clusters time-series while correcting for censorship time. We provide conditions under which clusters and the amount of delayed entry may be identified from data under a noiseless model. On synthetic data, we demonstrate accurate, stable, and interpretable results that outperform several benchmarks. On real-world clinical datasets of heart failure and Parkinson's disease patients, we study how interval censoring can adversely affect the task of disease phenotyping. Our model corrects for this source of error and recovers known clinical subtypes.
Submitted 5 December, 2021; v1 submitted 13 February, 2021;
originally announced February 2021.