
Showing 1–50 of 97 results for author: Talwalkar, A

Searching in archive cs.
  1. arXiv:2603.24586  [pdf, ps, other]

    cs.SE cs.CL

    Comparing Developer and LLM Biases in Code Evaluation

    Authors: Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donahue, Ameet Talwalkar, Wayne Chi, Valerie Chen

    Abstract: As LLMs are increasingly used as judges in code applications, they should be evaluated in realistic interactive settings that capture partial context and ambiguous intent. We present TRACE (Tool for Rubric Analysis in Code Evaluation), a framework that evaluates LLM judges' ability to predict human preferences and automatically extracts rubric items to reveal systematic biases in how humans and mo…

    Submitted 25 March, 2026; originally announced March 2026.

  2. arXiv:2603.01223  [pdf, ps, other]

    cs.LG cs.CL

    Learn Hard Problems During RL with Reference Guided Fine-tuning

    Authors: Yangzhen Wu, Shanda Li, Zixin Wen, Xin Zhou, Ameet Talwalkar, Yiming Yang, Wenhao Huang, Tianle Cai

    Abstract: Reinforcement learning (RL) for mathematical reasoning can suffer from reward sparsity: for challenging problems, the LLM fails to sample any correct trajectories, preventing RL from receiving meaningful positive feedback. At the same time, there often exist human-written reference solutions along with the problem (e.g., problems from AoPS), but directly fine-tuning on these solutions offers no benefi…

    Submitted 5 March, 2026; v1 submitted 1 March, 2026; originally announced March 2026.

    Comments: 15 pages, 5 figures

  3. arXiv:2602.11103  [pdf, ps, other]

    cs.AI cs.CL cs.SE

    GameDevBench: Evaluating Agentic Capabilities Through Game Development

    Authors: Wayne Chi, Yixiong Fang, Arnav Yayavaram, Siddharth Yayavaram, Seth Karten, Qiuhong Anna Wei, Runkun Chen, Alexander Wang, Valerie Chen, Ameet Talwalkar, Chris Donahue

    Abstract: Despite rapid progress on coding agents, progress on their multimodal counterparts has lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed as agents must navigate large, dense codebases while manipulating intrinsically multimodal assets suc…

    Submitted 11 February, 2026; originally announced February 2026.

  4. arXiv:2602.03593  [pdf, ps, other]

    cs.SE

    Beyond the Commit: Developer Perspectives on Productivity with AI Coding Assistants

    Authors: Valerie Chen, Jasmyn He, Behnjamin Williams, Jason Valentino, Ameet Talwalkar

    Abstract: Measuring developer productivity is a topic that has attracted attention from both academic research and industrial practice. In the age of AI coding assistants, it has become even more important for both academia and industry to understand how to measure their impact on developer productivity, and to reconsider whether earlier measures and frameworks still apply. This study analyzes the validity…

    Submitted 3 February, 2026; originally announced February 2026.

    Comments: ICSE SEIP

  5. arXiv:2512.00564  [pdf, ps, other]

    cs.LG physics.flu-dyn

    Pre-Generating Multi-Difficulty PDE Data for Few-Shot Neural PDE Solvers

    Authors: Naman Choudhary, Vedant Singh, Ameet Talwalkar, Nicholas Matthew Boffi, Mikhail Khodak, Tanya Marwah

    Abstract: A key aspect of learned partial differential equation (PDE) solvers is that the main cost often comes from generating training data with classical solvers rather than learning the model itself. Another is that there are clear axes of difficulty--e.g., more complex geometries and higher Reynolds numbers--along which problems become (1) harder for classical solvers and thus (2) more likely to benefi…

    Submitted 22 January, 2026; v1 submitted 29 November, 2025; originally announced December 2025.

    Comments: 10 pages, 11 figures

  6. arXiv:2511.04486  [pdf, ps, other]

    cs.SE

    EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits

    Authors: Wayne Chi, Valerie Chen, Ryan Shar, Aditya Mittal, Jenny Liang, Wei-Lin Chiang, Anastasios Nikolas Angelopoulos, Ion Stoica, Graham Neubig, Ameet Talwalkar, Chris Donahue

    Abstract: Instructed code editing, where LLMs directly modify a developer's existing code based on a user instruction, is becoming a widely used interaction mode in AI coding assistants. However, few benchmarks directly evaluate this capability and current datasets often rely on artificial sources. We introduce EDIT-Bench, a benchmark for evaluating LLM code editing capabilities grounded in real-world usage…

    Submitted 17 November, 2025; v1 submitted 6 November, 2025; originally announced November 2025.

  7. arXiv:2510.25744  [pdf]

    cs.CL cs.AI

    Completion $\neq$ Collaboration: Scaling Collaborative Effort with Agents

    Authors: Shannon Zejiang Shen, Valerie Chen, Ken Gu, Alexis Ross, Zixian Ma, Jillian Ross, Alex Gu, Chenglei Si, Wayne Chi, Andi Peng, Jocelyn J Shen, Ameet Talwalkar, Tongshuang Wu, David Sontag

    Abstract: Current evaluations of agents remain centered around one-shot task completion, failing to account for the inherently iterative and collaborative nature of many real-world problems, where human goals are often underspecified and evolve. We argue for a shift from building and assessing task completion agents to developing collaborative agents, assessed not only by the quality of their final outputs…

    Submitted 30 October, 2025; v1 submitted 29 October, 2025; originally announced October 2025.

    Comments: 22 pages, 5 figures, 3 tables

  8. arXiv:2510.13756   

    cs.CV cs.AI cs.LG

    RECODE: Reasoning Through Code Generation for Visual Question Answering

    Authors: Junhong Shen, Mu Cai, Bo Hu, Ameet Talwalkar, David A Ross, Cordelia Schmid, Alireza Fathi

    Abstract: Multimodal Large Language Models (MLLMs) struggle with precise reasoning for structured visuals like charts and diagrams, as pixel-based perception lacks a mechanism for verification. To address this, we propose to leverage derendering -- the process of reverse-engineering visuals into executable code -- as a new modality for verifiable visual reasoning. Specifically, we propose RECODE, an agentic…

    Submitted 10 March, 2026; v1 submitted 15 October, 2025; originally announced October 2025.

    Comments: The authors are withdrawing this manuscript temporarily to conduct additional checks of the experimental setup and implementation. We plan to post an updated version after completing these checks

  9. arXiv:2510.09801  [pdf, ps, other]

    cs.AI

    How can we assess human-agent interactions? Case studies in software agent design

    Authors: Valerie Chen, Rohit Malhotra, Xingyao Wang, Juan Michelini, Xuhui Zhou, Aditya Bharat Soni, Hoang H. Tran, Calvin Smith, Ameet Talwalkar, Graham Neubig

    Abstract: LLM-powered agents are both a promising new technology and a source of complexity, where choices about models, tools, and prompting can affect their usefulness. While numerous benchmarks measure agent accuracy across domains, they mostly assume full automation, failing to represent the collaborative nature of real-world use cases. In this paper, we make two major steps towards the rigorous assessm…

    Submitted 4 November, 2025; v1 submitted 10 October, 2025; originally announced October 2025.

  10. arXiv:2508.00723  [pdf, ps, other]

    cs.HC

    Why Do Decision Makers (Not) Use AI? A Cross-Domain Analysis of Factors Impacting AI Adoption

    Authors: Rebecca Yu, Valerie Chen, Ameet Talwalkar, Hoda Heidari

    Abstract: Growing excitement around deploying AI across various domains calls for a careful assessment of how human decision-makers interact with AI-powered systems. In particular, it is essential to understand when decision-makers voluntarily choose to consult AI tools, which we term decision-maker adoption. We interviewed experts across four domains -- medicine, law, journalism, and the public sector -- t…

    Submitted 1 August, 2025; originally announced August 2025.

    Comments: To be published in Proceedings of the Eighth AAAI/ACM Conference on AI, Ethics, and Society (AIES-25). 10 pages, 4 figures, 1 table

  11. arXiv:2507.08149  [pdf, ps, other]

    cs.SE

    Code with Me or for Me? How Increasing AI Automation Transforms Developer Workflows

    Authors: Valerie Chen, Ameet Talwalkar, Robert Brennan, Graham Neubig

    Abstract: Developers now have access to a growing array of increasingly autonomous AI tools for software development. While many studies examine copilots that provide chat assistance or code completions, evaluations of coding agents -- which can automatically write files and run code -- still rely on static benchmarks. We present the first controlled study of developer interactions with coding agents, chara…

    Submitted 13 September, 2025; v1 submitted 10 July, 2025; originally announced July 2025.

  12. arXiv:2506.20640  [pdf, ps, other]

    cs.AI cs.LG

    CoMind: Towards Community-Driven Agents for Machine Learning Engineering

    Authors: Sijie Li, Weiwei Sun, Shanda Li, Ameet Talwalkar, Yiming Yang

    Abstract: Large language model (LLM) agents show promise in automating machine learning (ML) engineering. However, existing agents typically operate in isolation on a given research problem, without engaging with the broader research community, where human researchers often gain insights and contribute by sharing knowledge. To bridge this gap, we introduce MLE-Live, a live evaluation framework designed to a…

    Submitted 27 February, 2026; v1 submitted 25 June, 2025; originally announced June 2025.

    Comments: ICLR 2026. Code available at https://github.com/comind-ml/CoMind

  13. arXiv:2506.07976  [pdf, ps, other]

    cs.LG cs.AI

    Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction

    Authors: Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar

    Abstract: The current paradigm of test-time scaling relies on generating long reasoning traces ("thinking" more) before producing a response. In agent problems that require interaction, this can be done by generating thinking traces before acting in the world. However, this process does not allow agents to acquire new information from the environment or adapt their behavior over time. In this work, we propo…

    Submitted 10 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

    Comments: Fixed typo in Figure 6 and Conclusion

  14. arXiv:2506.05295  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    Sample Complexity and Representation Ability of Test-time Scaling Paradigms

    Authors: Baihe Huang, Shanda Li, Tianhao Wu, Yiming Yang, Ameet Talwalkar, Kannan Ramchandran, Michael I. Jordan, Jiantao Jiao

    Abstract: Test-time scaling paradigms have significantly advanced the capabilities of large language models (LLMs) on complex tasks. Despite their empirical success, theoretical understanding of the sample efficiency of various test-time strategies -- such as self-consistency, best-of-$n$, and self-correction -- remains limited. In this work, we first establish a separation result between two repeated sampl…

    Submitted 12 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.
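    Two of the test-time strategies this abstract names, self-consistency and best-of-$n$, can be sketched in a few lines. The toy stochastic sampler and scoring function below are illustrative stand-ins of my own, not the paper's setup:

    ```python
    import random
    from collections import Counter

    def sample_answer(rng):
        # Toy stochastic "model": returns the correct answer "42" with probability 0.6.
        return "42" if rng.random() < 0.6 else "17"

    def self_consistency(n, rng):
        """Majority vote over n independent samples (needs no external verifier)."""
        votes = Counter(sample_answer(rng) for _ in range(n))
        return votes.most_common(1)[0][0]

    def best_of_n(n, rng, score):
        """Return the highest-scoring of n samples under a verifier/reward function."""
        return max((sample_answer(rng) for _ in range(n)), key=score)

    print(self_consistency(25, random.Random(0)))
    print(best_of_n(200, random.Random(1), score=lambda a: a == "42"))
    ```

    The contrast the paper studies shows up even in this sketch: self-consistency only aggregates the sampler's own outputs, while best-of-$n$ relies on an external scoring signal to select among them.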

  15. arXiv:2505.16952  [pdf, ps, other]

    cs.LG

    FrontierCO: Real-World and Large-Scale Evaluation of Machine Learning Solvers for Combinatorial Optimization

    Authors: Shengyu Feng, Weiwei Sun, Shanda Li, Ameet Talwalkar, Yiming Yang

    Abstract: Machine learning (ML) has shown promise for tackling combinatorial optimization (CO), but much of the reported progress relies on small-scale, synthetic benchmarks that fail to capture real-world structure and scale. A core limitation is that ML methods are typically trained and evaluated on synthetic instance generators, leaving open how they perform on irregular, competition-grade, or industrial…

    Submitted 9 March, 2026; v1 submitted 22 May, 2025; originally announced May 2025.

    Comments: ICLR 2026

  16. arXiv:2505.14766  [pdf, ps, other]

    cs.LG cs.AI

    This Time is Different: An Observability Perspective on Time Series Foundation Models

    Authors: Ben Cohen, Emaad Khwaja, Youssef Doubli, Salahidine Lemaachi, Chris Lettieri, Charles Masson, Hugo Miccinilli, Elise Ramé, Qiqi Ren, Afshin Rostamizadeh, Jean Ogier du Terrail, Anna-Monica Toon, Kan Wang, Stephan Xie, Zongzhe Xu, Viktoriya Zhukova, David Asker, Ameet Talwalkar, Othmane Abou-Amal

    Abstract: We introduce Toto, a time series forecasting foundation model with 151 million parameters. Toto uses a modern decoder-only architecture coupled with architectural innovations designed to account for specific challenges found in multivariate observability time series data. Toto's pre-training corpus is a mixture of observability data, open datasets, and synthetic data, and is 4-10$\times$ larger th…

    Submitted 4 November, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

  17. arXiv:2505.08783  [pdf, ps, other]

    cs.LG cs.AI cs.CL math.NA

    CodePDE: An Inference Framework for LLM-driven PDE Solver Generation

    Authors: Shanda Li, Tanya Marwah, Junhong Shen, Weiwei Sun, Andrej Risteski, Yiming Yang, Ameet Talwalkar

    Abstract: Partial differential equations (PDEs) are fundamental to modeling physical systems, yet solving them remains a complex challenge. Traditional numerical solvers rely on expert knowledge to implement and are computationally expensive, while neural-network-based solvers require large training datasets and often lack interpretability. In this work, we frame PDE solving as a code generation task and in…

    Submitted 22 February, 2026; v1 submitted 13 May, 2025; originally announced May 2025.

    Comments: TMLR. Code available at https://github.com/LithiumDA/CodePDE

  18. arXiv:2503.14724  [pdf, other]

    cs.HC

    CodingGenie: A Proactive LLM-Powered Programming Assistant

    Authors: Sebastian Zhao, Alan Zhu, Hussein Mozannar, David Sontag, Ameet Talwalkar, Valerie Chen

    Abstract: While developers increasingly adopt tools powered by large language models (LLMs) in day-to-day workflows, these tools still require explicit user invocation. To seamlessly integrate LLM capabilities into a developer's workflow, we introduce CodingGenie, a proactive assistant integrated into the code editor. CodingGenie autonomously provides suggestions, ranging from bug fixing to unit testing, base…

    Submitted 18 March, 2025; originally announced March 2025.

    Comments: FSE Demo 2025

  19. arXiv:2502.18413  [pdf, other]

    cs.HC

    When Benchmarks Talk: Re-Evaluating Code LLMs with Interactive Feedback

    Authors: Jane Pan, Ryan Shar, Jacob Pfau, Ameet Talwalkar, He He, Valerie Chen

    Abstract: Programming is a fundamentally interactive process, yet coding assistants are often evaluated using static benchmarks that fail to measure how well models collaborate with users. We introduce an interactive evaluation pipeline to examine how LLMs incorporate different types of feedback in a collaborative setting. Specifically, we perturb static coding benchmarks so that the code model must interac…

    Submitted 25 February, 2025; originally announced February 2025.

  20. arXiv:2502.09328  [pdf, other]

    cs.SE

    Copilot Arena: A Platform for Code LLM Evaluation in the Wild

    Authors: Wayne Chi, Valerie Chen, Anastasios Nikolas Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, Chris Donahue, Ameet Talwalkar

    Abstract: Evaluating in-the-wild coding capabilities of large language models (LLMs) is a challenging endeavor with no clear solution. We introduce Copilot Arena, a platform to collect user preferences for code generation through native integration into a developer's working environment. Copilot Arena comprises a novel interface for comparing pairs of model outputs, a sampling strategy optimized to reduce l…

    Submitted 13 February, 2025; originally announced February 2025.

  21. arXiv:2411.15004  [pdf, other]

    cs.CL cs.AI

    ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data

    Authors: Junhong Shen, Atishay Jain, Zedian Xiao, Ishan Amlekar, Mouad Hadji, Aaron Podolny, Ameet Talwalkar

    Abstract: Large Language Model (LLM) agents are rapidly improving to handle increasingly complex web-based tasks. Most of these agents rely on general-purpose, proprietary models like GPT-4 and focus on designing better prompts to improve their planning abilities. However, general-purpose LLMs are not specifically trained to understand specialized web contexts such as HTML, and they often struggle with long…

    Submitted 4 December, 2024; v1 submitted 22 November, 2024; originally announced November 2024.

  22. arXiv:2411.02796  [pdf, other]

    cs.LG cs.AI cs.CV q-bio.GN

    Specialized Foundation Models Struggle to Beat Supervised Baselines

    Authors: Zongzhe Xu, Ritvik Gupta, Wenduo Cheng, Alexander Shen, Junhong Shen, Ameet Talwalkar, Mikhail Khodak

    Abstract: Following its success for vision and text, the "foundation model" (FM) paradigm -- pretraining large models on massive data, then fine-tuning on target tasks -- has rapidly expanded to domains in the sciences, engineering, healthcare, and beyond. Has this achieved what the original FMs accomplished, i.e. the supplanting of traditional supervised learning in their domains? To answer, we look at thre…

    Submitted 20 March, 2025; v1 submitted 4 November, 2024; originally announced November 2024.

    Comments: The first two authors contributed equally. The order was determined by coin flip

  23. arXiv:2410.24206  [pdf, ps, other]

    cs.LG cs.AI math.OC stat.ML

    Understanding Optimization in Deep Learning with Central Flows

    Authors: Jeremy M. Cohen, Alex Damian, Ameet Talwalkar, J. Zico Kolter, Jason D. Lee

    Abstract: Traditional theories of optimization cannot describe the dynamics of optimization in deep learning, even in the simple setting of deterministic training. The challenge is that optimizers typically operate in a complex, oscillatory regime called the "edge of stability." In this paper, we develop theory that can describe the dynamics of optimization in this regime. Our key insight is that while the…

    Submitted 25 September, 2025; v1 submitted 31 October, 2024; originally announced October 2024.

    Comments: First two authors contributed equally; author order determined by coin flip. This is the full version of a paper published at ICLR 2025. We encourage readers to explore the blog version of this paper, with animated optimization trajectories, at https://centralflows.github.io. Our code can be found at https://github.com/centralflows/centralflows

  24. arXiv:2410.04596  [pdf, other]

    cs.HC

    Need Help? Designing Proactive AI Assistants for Programming

    Authors: Valerie Chen, Alan Zhu, Sebastian Zhao, Hussein Mozannar, David Sontag, Ameet Talwalkar

    Abstract: While current chat-based AI assistants primarily operate reactively, responding only when prompted by users, there is significant potential for these systems to proactively assist in tasks without explicit invocation, enabling a mixed-initiative interaction. This work explores the design and implementation of proactive AI assistants powered by large language models. We first outline the key design…

    Submitted 28 February, 2025; v1 submitted 6 October, 2024; originally announced October 2024.

    Comments: CHI 2025

  25. arXiv:2409.12089  [pdf, other]

    cs.LG

    The Impact of Element Ordering on LM Agent Performance

    Authors: Wayne Chi, Ameet Talwalkar, Chris Donahue

    Abstract: There has been a surge of interest in language model agents that can navigate virtual environments such as the web or desktop. To navigate such environments, agents benefit from information on the various elements (e.g., buttons, text, or images) present. It remains unclear which element attributes have the greatest impact on agent performance, especially in environments that only provide a graphi…

    Submitted 6 October, 2024; v1 submitted 18 September, 2024; originally announced September 2024.

  26. arXiv:2407.12804  [pdf, other]

    cs.HC cs.AI cs.LG

    Modulating Language Model Experiences through Frictions

    Authors: Katherine M. Collins, Valerie Chen, Ilia Sucholutsky, Hannah Rose Kirk, Malak Sadek, Holli Sargeant, Ameet Talwalkar, Adrian Weller, Umang Bhatt

    Abstract: Language models are transforming the ways that their users engage with the world. Despite impressive capabilities, over-consumption of language model outputs risks propagating unchecked errors in the short-term and damaging human capabilities for critical thinking in the long-term. How can we develop scaffolding around language models to curate more appropriate use? We propose selective frictions…

    Submitted 18 November, 2024; v1 submitted 24 June, 2024; originally announced July 2024.

    Comments: NeurIPS Workshop on Behavioral ML; non-archival

  27. arXiv:2407.02348  [pdf, ps, other]

    cs.LG

    Agreement-Based Cascading for Efficient Inference

    Authors: Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith

    Abstract: Adaptive inference schemes reduce the cost of machine learning inference by assigning smaller models to easier examples, attempting to avoid invocation of larger models when possible. In this work we explore a simple, effective adaptive inference technique we term Agreement-Based Cascading (ABC). ABC builds a cascade of models of increasing size/complexity, and uses agreement between ensembles of…

    Submitted 24 September, 2025; v1 submitted 2 July, 2024; originally announced July 2024.

    Comments: Published at TMLR (July 2025)

    Journal ref: TMLR 2025
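    The cascading scheme this entry describes can be illustrated with a minimal sketch. The toy models and the agreement threshold below are my own illustrative choices, not the paper's implementation:

    ```python
    # Agreement-Based Cascading (ABC), illustrative sketch:
    # query an ensemble of small models first; if they agree,
    # return their answer, otherwise escalate to a larger model.

    def abc_predict(x, small_ensemble, large_model, min_agreement=1.0):
        """Return a label, invoking the large model only when the small ensemble disagrees."""
        votes = [m(x) for m in small_ensemble]
        top = max(set(votes), key=votes.count)
        agreement = votes.count(top) / len(votes)
        if agreement >= min_agreement:
            return top            # cheap path: ensemble is confident
        return large_model(x)     # expensive path: defer to the big model

    # Toy models: classify a number as "pos"/"neg".
    small = [lambda x: "pos" if x > 0 else "neg",
             lambda x: "pos" if x > -1 else "neg"]   # slightly biased member
    big = lambda x: "pos" if x >= 0 else "neg"

    print(abc_predict(5, small, big))     # -> "pos" (ensemble unanimous, big model never called)
    print(abc_predict(-0.5, small, big))  # -> "neg" (disagreement, escalated to large model)
    ```

    Lowering `min_agreement` trades accuracy for cost: more inputs take the cheap path and fewer invocations of the large model occur.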

  28. arXiv:2404.02806  [pdf, other]

    cs.SE cs.AI cs.HC

    The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers

    Authors: Hussein Mozannar, Valerie Chen, Mohammed Alsobay, Subhro Das, Sebastian Zhao, Dennis Wei, Manish Nagireddy, Prasanna Sattigeri, Ameet Talwalkar, David Sontag

    Abstract: Evaluation of large language models for code has primarily relied on static benchmarks, including HumanEval (Chen et al., 2021), or more recently using human preferences of LLM responses. As LLMs are increasingly used as programmer assistants, we study whether gains on existing benchmarks or more preferred LLM responses translate to programmer productivity when coding with LLMs, including time spe…

    Submitted 14 October, 2024; v1 submitted 3 April, 2024; originally announced April 2024.

  29. arXiv:2403.07187  [pdf, other]

    cs.LG

    UPS: Efficiently Building Foundation Models for PDE Solving via Cross-Modal Adaptation

    Authors: Junhong Shen, Tanya Marwah, Ameet Talwalkar

    Abstract: We present Unified PDE Solvers (UPS), a data- and compute-efficient approach to developing unified neural operators for diverse families of spatiotemporal PDEs from various domains, dimensions, and resolutions. UPS embeds different PDEs into a shared representation space and processes them using a FNO-transformer architecture. Rather than training the network from scratch, which is data-demanding…

    Submitted 23 November, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

    Comments: TMLR 2024; ICML 2024 AI for Science Workshop (Spotlight)

  30. arXiv:2402.05406  [pdf, ps, other]

    cs.LG cs.CL

    Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes

    Authors: Steven Kolawole, Lucio Dery, Jean-François Kagy, Virginia Smith, Graham Neubig, Ameet Talwalkar

    Abstract: Structured pruning is a promising approach to create smaller, faster large language models. However, existing methods typically rely on computing the gradient via backward passes, which can inflate memory requirements and compute costs. In this work we introduce Bonsai, a gradient-free structured pruning method that eliminates the need for backpropagation, significantly reducing memory requirement…

    Submitted 22 January, 2026; v1 submitted 7 February, 2024; originally announced February 2024.

    Comments: 19 pages, 6 figures, 16 tables

  31. arXiv:2312.03151  [pdf, other]

    cs.LG

    Multitask Learning Can Improve Worst-Group Outcomes

    Authors: Atharva Kulkarni, Lucio Dery, Amrith Setlur, Aditi Raghunathan, Ameet Talwalkar, Graham Neubig

    Abstract: In order to create machine learning systems that serve a variety of users well, it is vital to not only achieve high average performance but also ensure equitable outcomes across diverse groups. However, most machine learning methods are designed to improve a model's average performance on a chosen end task without consideration for their impact on worst group error. Multitask learning (MTL) is on…

    Submitted 28 February, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

    Comments: 20 pages, 7 tables, 6 figures

  32. arXiv:2311.04076  [pdf, other]

    cs.CL

    Do LLMs exhibit human-like response biases? A case study in survey design

    Authors: Lindia Tjuatja, Valerie Chen, Sherry Tongshuang Wu, Ameet Talwalkar, Graham Neubig

    Abstract: As large language models (LLMs) become more capable, there is growing excitement about the possibility of using LLMs as proxies for humans in real-world tasks where subjective labels are desired, such as in surveys and opinion polling. One widely-cited barrier to the adoption of LLMs as proxies for humans in subjective tasks is their sensitivity to prompt wording - but interestingly, humans also d…

    Submitted 5 February, 2024; v1 submitted 7 November, 2023; originally announced November 2023.

  33. arXiv:2310.02246  [pdf, other]

    cs.LG cs.AI math.NA stat.ML

    Learning to Relax: Setting Solver Parameters Across a Sequence of Linear System Instances

    Authors: Mikhail Khodak, Edmond Chow, Maria-Florina Balcan, Ameet Talwalkar

    Abstract: Solving a linear system $Ax=b$ is a fundamental scientific computing primitive for which numerous solvers and preconditioners have been developed. These come with parameters whose optimal values depend on the system being solved and are often impossible or too expensive to identify; thus in practice sub-optimal heuristics are used. We consider the common setting in which many related linear system…

    Submitted 2 May, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

    Comments: ICLR 2024 Spotlight
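    The setting this abstract describes, choosing solver parameters across a sequence of related linear systems, can be sketched as follows. The weighted-Jacobi solver, the toy 2x2 systems, and the grid of relaxation values are illustrative assumptions of mine; the paper's actual algorithm and guarantees are more involved:

    ```python
    def jacobi_iters(A, b, omega, tol=1e-8, max_iter=500):
        """Count iterations of weighted Jacobi, x <- x + omega * D^{-1} (b - A x)."""
        n = len(b)
        x = [0.0] * n
        for k in range(1, max_iter + 1):
            r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
            if max(abs(v) for v in r) < tol:
                return k
            x = [x[i] + omega * r[i] / A[i][i] for i in range(n)]
        return max_iter

    def tune_omega(A, b, grid):
        # Pick the relaxation parameter that converged fastest on this instance.
        return min(grid, key=lambda w: jacobi_iters(A, b, w))

    # Tune on one instance, then reuse the tuned parameter on a similar instance;
    # the paper studies when and how well this kind of transfer works.
    A1, b1 = [[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]
    A2, b2 = [[4.0, 0.8], [0.9, 3.1]], [0.5, 1.0]
    omega = tune_omega(A1, b1, [0.5, 0.8, 1.0, 1.2])
    print(omega, jacobi_iters(A2, b2, omega))
    ```

    The point of the sketch is that the tuning cost (a grid sweep) is paid on one instance, while later, similar instances only pay for a single solve with the already-chosen parameter.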

  34. arXiv:2307.15475  [pdf, other]

    cs.HC cs.AI cs.LG

    FeedbackLogs: Recording and Incorporating Stakeholder Feedback into Machine Learning Pipelines

    Authors: Matthew Barker, Emma Kallina, Dhananjay Ashok, Katherine M. Collins, Ashley Casovan, Adrian Weller, Ameet Talwalkar, Valerie Chen, Umang Bhatt

    Abstract: Even though machine learning (ML) pipelines affect an increasing array of stakeholders, there is little work on how input from stakeholders is recorded and incorporated. We propose FeedbackLogs, addenda to existing documentation of ML pipelines, to track the input of multiple stakeholders. Each log records important details about the feedback collection process, the feedback itself, and how the fe…

    Submitted 28 July, 2023; originally announced July 2023.

  35. Where Does My Model Underperform? A Human Evaluation of Slice Discovery Algorithms

    Authors: Nari Johnson, Ángel Alexander Cabrera, Gregory Plumb, Ameet Talwalkar

    Abstract: Machine learning (ML) models that achieve high average accuracy can still underperform on semantically coherent subsets ("slices") of data. This behavior can have significant societal consequences for the safety or bias of the model in deployment, but identifying these underperforming slices can be difficult in practice, especially in domains where practitioners lack access to group annotations to…

    Submitted 9 February, 2024; v1 submitted 13 June, 2023; originally announced June 2023.

    Comments: Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 11(1), 65-76. Best Paper Award

  36. arXiv:2304.06701  [pdf, other]

    cs.LG cs.AI cs.CY cs.HC

    Learning Personalized Decision Support Policies

    Authors: Umang Bhatt, Valerie Chen, Katherine M. Collins, Parameswaran Kamalaruban, Emma Kallina, Adrian Weller, Ameet Talwalkar

    Abstract: Individual human decision-makers may benefit from different forms of support to improve decision outcomes, but when will each form of support yield better outcomes? In this work, we posit that personalizing access to decision support tools can be an effective mechanism for instantiating the appropriate use of AI assistance. Specifically, we propose the general problem of learning a decision suppor…

    Submitted 23 January, 2025; v1 submitted 13 April, 2023; originally announced April 2023.

    Comments: AAAI 2025

  37. arXiv:2302.08450  [pdf, other]

    cs.LG cs.HC

    Assisting Human Decisions in Document Matching

    Authors: Joon Sik Kim, Valerie Chen, Danish Pruthi, Nihar B. Shah, Ameet Talwalkar

    Abstract: Many practical applications, ranging from paper-reviewer assignment in peer review to job-applicant matching for hiring, require human decision makers to identify relevant matches by combining their expertise with predictions from machine learning models. In many such model-assisted document matching tasks, the decision makers have stressed the need for assistive information about the model output…

    Submitted 16 February, 2023; originally announced February 2023.

  38. arXiv:2302.05738  [pdf, other]

    cs.LG

    Cross-Modal Fine-Tuning: Align then Refine

    Authors: Junhong Shen, Liam Li, Lucio M. Dery, Corey Staten, Mikhail Khodak, Graham Neubig, Ameet Talwalkar

    Abstract: Fine-tuning large-scale pretrained models has led to tremendous progress in well-studied modalities such as vision and NLP. However, similar gains have not been observed in many other modalities due to a lack of relevant pretrained models. In this work, we propose ORCA, a general cross-modal fine-tuning framework that extends the applicability of a single large-scale pretrained model to diverse mo…

    Submitted 18 March, 2023; v1 submitted 11 February, 2023; originally announced February 2023.

  39. Zeno: An Interactive Framework for Behavioral Evaluation of Machine Learning

    Authors: Ángel Alexander Cabrera, Erica Fu, Donald Bertucci, Kenneth Holstein, Ameet Talwalkar, Jason I. Hong, Adam Perer

    Abstract: Machine learning models with high accuracy on test data can still produce systematic failures, such as harmful biases and safety issues, when deployed in the real world. To detect and mitigate such failures, practitioners run behavioral evaluation of their models, checking model outputs for specific types of inputs. Behavioral evaluation is important but challenging, requiring that practitioners d…

    Submitted 9 February, 2023; originally announced February 2023.

  40. arXiv:2212.08930  [pdf, other]

    cs.LG

    On Noisy Evaluation in Federated Hyperparameter Tuning

    Authors: Kevin Kuo, Pratiksha Thaker, Mikhail Khodak, John Nguyen, Daniel Jiang, Ameet Talwalkar, Virginia Smith

    Abstract: Hyperparameter tuning is critical to the success of federated learning applications. Unfortunately, appropriately selecting hyperparameters is challenging in federated networks. Issues of scale, privacy, and heterogeneity introduce noise in the tuning process and make it difficult to evaluate the performance of various hyperparameters. In this work, we perform the first systematic study on the eff…

    Submitted 15 May, 2023; v1 submitted 17 December, 2022; originally announced December 2022.

    Comments: v1: 19 pages, 15 figures, submitted to MLSys2023; v2: Fixed citation formatting; v3: Fixed typo, update acks; v4: MLSys2023 camera-ready
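    The noise problem this entry studies can be sketched in a toy form (this is an illustrative assumption, not code or data from the paper): each hyperparameter configuration's true validation score is observed only through noisy evaluations, and averaging repeated evaluations makes selecting the truly best configuration more reliable. All names and numbers below are hypothetical.

```python
import random

def noisy_eval(true_score, noise_sd, rng):
    """One evaluation round: the true validation score of a hyperparameter
    config, corrupted by Gaussian noise (a stand-in for client sampling,
    privacy noise, and heterogeneity in a federated network)."""
    return true_score + rng.gauss(0, noise_sd)

def select_best(true_scores, noise_sd, repeats, seed=0):
    """Pick the config whose average over `repeats` noisy evaluations is highest."""
    rng = random.Random(seed)
    means = [sum(noisy_eval(s, noise_sd, rng) for _ in range(repeats)) / repeats
             for s in true_scores]
    return max(range(len(true_scores)), key=means.__getitem__)

true_scores = [0.70, 0.72, 0.80]  # config 2 is truly best
# How often is the best config identified, over 200 independent trials?
hits_single = sum(select_best(true_scores, noise_sd=0.1, repeats=1, seed=s) == 2
                  for s in range(200))
hits_avg = sum(select_best(true_scores, noise_sd=0.1, repeats=10, seed=s) == 2
               for s in range(200))
print(hits_single, hits_avg)
```

    With a single noisy evaluation per config, the best configuration is frequently misranked; averaging ten evaluations recovers it far more often, at the cost of more evaluation rounds.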

  41. arXiv:2210.03324  [pdf, other

    cs.LG cs.AI stat.ML

    AutoML for Climate Change: A Call to Action

    Authors: Renbo Tu, Nicholas Roberts, Vishak Prasad, Sibasis Nayak, Paarth Jain, Frederic Sala, Ganesh Ramakrishnan, Ameet Talwalkar, Willie Neiswanger, Colin White

    Abstract: The challenge that climate change poses to humanity has spurred a rapidly developing field of artificial intelligence research focused on climate change applications. The climate change AI (CCAI) community works on a diverse, challenging set of problems which often involve physics-constrained ML or heterogeneous spatiotemporal data. It would be desirable to use automated machine learning (AutoML)…

    Submitted 7 October, 2022; originally announced October 2022.

  42. arXiv:2208.12218  [pdf, other

    cs.LG

    SONAR: Joint Architecture and System Optimization Search

    Authors: Elias Jääsaari, Michelle Ma, Ameet Talwalkar, Tianqi Chen

    Abstract: There is a growing need to deploy machine learning for different tasks on a wide array of new hardware platforms. Such deployment scenarios require tackling multiple challenges, including identifying a model architecture that can achieve a suitable predictive accuracy (architecture search), and finding an efficient implementation of the model to satisfy underlying hardware-specific systems constra…

    Submitted 25 August, 2022; originally announced August 2022.

  43. arXiv:2207.10199  [pdf, other

    cs.LG stat.ML

    Provably tuning the ElasticNet across instances

    Authors: Maria-Florina Balcan, Mikhail Khodak, Dravyansh Sharma, Ameet Talwalkar

    Abstract: An important unresolved challenge in the theory of regularization is to set the regularization coefficients of popular techniques like the ElasticNet with general provable guarantees. We consider the problem of tuning the regularization parameters of Ridge regression, LASSO, and the ElasticNet across multiple problem instances, a setting that encompasses both cross-validation and multi-task hyperp…

    Submitted 15 January, 2024; v1 submitted 20 July, 2022; originally announced July 2022.
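    As a toy illustration of the tuning problem this entry studies (a sketch under simplifying assumptions, not the paper's provable algorithm): in one dimension the ElasticNet objective (1/2n)·Σ(yᵢ − βxᵢ)² + α(ρ|β| + (1−ρ)β²/2) has a closed-form minimizer via soft-thresholding, so the two regularization parameters (α, ρ) can be tuned by simple held-out validation. All function names and data below are hypothetical.

```python
import random

def elastic_net_1d(xs, ys, alpha, rho):
    """Closed-form ElasticNet fit for a single coefficient y ~ beta * x.
    Minimizes (1/2n)*sum((y - beta*x)^2) + alpha*(rho*|beta| + (1-rho)/2*beta^2)."""
    n = len(xs)
    xy = sum(x * y for x, y in zip(xs, ys)) / n
    xx = sum(x * x for x in xs) / n
    # Soft-thresholding handles the L1 term; the L2 term enlarges the denominator.
    mag = max(abs(xy) - alpha * rho, 0.0)
    return (mag if xy >= 0 else -mag) / (xx + alpha * (1 - rho))

def tune(train, val, alphas, rhos):
    """Grid-search (alpha, rho) by squared error on a held-out split."""
    best = None
    for alpha in alphas:
        for rho in rhos:
            beta = elastic_net_1d(*train, alpha, rho)
            err = sum((y - beta * x) ** 2 for x, y in zip(*val))
            if best is None or err < best[0]:
                best = (err, alpha, rho, beta)
    return best[1:]

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(200)]
ys = [2.0 * x + random.gauss(0, 0.3) for x in xs]  # true slope 2.0
train, val = (xs[:100], ys[:100]), (xs[100:], ys[100:])
alpha, rho, beta = tune(train, val, [0.0, 0.01, 0.1, 1.0], [0.0, 0.5, 1.0])
print(alpha, rho, round(beta, 2))
```

    The paper's setting replaces this per-instance grid search with tuning across a sequence of problem instances under provable guarantees; the sketch only shows what the parameters being tuned control.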

  44. arXiv:2207.04104  [pdf, other

    cs.LG cs.CV

    Towards a More Rigorous Science of Blindspot Discovery in Image Classification Models

    Authors: Gregory Plumb, Nari Johnson, Ángel Alexander Cabrera, Ameet Talwalkar

    Abstract: A growing body of work studies Blindspot Discovery Methods (BDMs): methods that use an image embedding to find semantically meaningful (i.e., united by a human-understandable concept) subsets of the data where an image classifier performs significantly worse. Motivated by observed gaps in prior work, we introduce a new framework for evaluating BDMs, SpotCheck, that uses synthetic image datasets…

    Submitted 11 July, 2023; v1 submitted 8 July, 2022; originally announced July 2022.

    Comments: reviewed on OpenReview: https://openreview.net/forum?id=MaDvbLaBiF

    Journal ref: TMLR 2023

  45. arXiv:2206.13503  [pdf, other

    cs.LG cs.HC

    On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods

    Authors: Kasun Amarasinghe, Kit T. Rodolfa, Sérgio Jesus, Valerie Chen, Vladimir Balayan, Pedro Saleiro, Pedro Bizarro, Ameet Talwalkar, Rayid Ghani

    Abstract: Most existing evaluations of explainable machine learning (ML) methods rely on simplifying assumptions or proxies that do not reflect real-world use cases; the handful of more robust evaluations on real-world settings have shortcomings in their design, resulting in limited conclusions of methods' real-world utility. In this work, we seek to bridge this gap by conducting a study that evaluates thre…

    Submitted 21 February, 2023; v1 submitted 24 June, 2022; originally announced June 2022.

  46. arXiv:2206.02256  [pdf, other

    cs.HC cs.AI cs.LG

    Use-Case-Grounded Simulations for Explanation Evaluation

    Authors: Valerie Chen, Nari Johnson, Nicholay Topin, Gregory Plumb, Ameet Talwalkar

    Abstract: A growing body of research runs human subject evaluations to study whether providing users with explanations of machine learning models can help them with practical real-world use cases. However, running user studies is challenging and costly, and consequently each study typically only evaluates a limited number of different settings, e.g., studies often only evaluate a few arbitrarily selected ex…

    Submitted 20 August, 2022; v1 submitted 5 June, 2022; originally announced June 2022.

  47. arXiv:2205.14082  [pdf, other

    cs.LG cs.AI

    AANG: Automating Auxiliary Learning

    Authors: Lucio M. Dery, Paul Michel, Mikhail Khodak, Graham Neubig, Ameet Talwalkar

    Abstract: Auxiliary objectives, supplementary learning signals that are introduced to aid learning on data-starved or highly complex end-tasks, are commonplace in machine learning. Whilst much work has been done to formulate useful auxiliary objectives, their construction is still an art which proceeds by slow and tedious hand-design. Intuition for how and when these objectives improve end-task perform…

    Submitted 27 February, 2023; v1 submitted 27 May, 2022; originally announced May 2022.

    Comments: Accepted to ICLR 2023. 22 pages, 7 tables and 5 figures

  48. arXiv:2205.06905  [pdf, other

    cs.LG

    Perspectives on Incorporating Expert Feedback into Model Updates

    Authors: Valerie Chen, Umang Bhatt, Hoda Heidari, Adrian Weller, Ameet Talwalkar

    Abstract: Machine learning (ML) practitioners are increasingly tasked with developing models that are aligned with non-technical experts' values and goals. However, there has been insufficient consideration on how practitioners should translate domain expertise into ML updates. In this paper, we consider how to capture interactions between practitioners and experts systematically. We devise a taxonomy to ma…

    Submitted 16 July, 2022; v1 submitted 13 May, 2022; originally announced May 2022.

  49. arXiv:2204.07554  [pdf, other

    cs.LG cs.AI

    Efficient Architecture Search for Diverse Tasks

    Authors: Junhong Shen, Mikhail Khodak, Ameet Talwalkar

    Abstract: While neural architecture search (NAS) has enabled automated machine learning (AutoML) for well-researched areas, its application to tasks beyond computer vision is still under-explored. As less-studied domains are precisely those where we expect AutoML to have the greatest impact, in this work we study NAS for efficiently solving diverse problems. Seeking an approach that is fast, simple, and bro…

    Submitted 9 October, 2022; v1 submitted 15 April, 2022; originally announced April 2022.

    Comments: NeurIPS 2022 Camera-Ready; code available at https://github.com/sjunhongshen/DASH

  50. arXiv:2202.09312  [pdf, other

    cs.LG cs.AI cs.DS stat.ML

    Learning Predictions for Algorithms with Predictions

    Authors: Mikhail Khodak, Maria-Florina Balcan, Ameet Talwalkar, Sergei Vassilvitskii

    Abstract: A burgeoning paradigm in algorithm design is the field of algorithms with predictions, in which algorithms can take advantage of a possibly-imperfect prediction of some aspect of the problem. While much work has focused on using predictions to improve competitive ratios, running times, or other performance measures, less effort has been devoted to the question of how to obtain the predictions them…

    Submitted 17 October, 2022; v1 submitted 18 February, 2022; originally announced February 2022.

    Comments: NeurIPS 2022 camera-ready
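    A canonical instance of the algorithms-with-predictions paradigm (a standard textbook illustration, not taken from this paper) is binary search with a predicted index: exponential search outward from a possibly wrong prediction costs O(log η) comparisons, where η is the prediction error, versus O(log n) for plain binary search. All names below are illustrative.

```python
import bisect

def search_with_prediction(arr, target, predicted_idx):
    """Find target in sorted arr, starting from a (possibly wrong) predicted index.
    Returns the index of target, or -1 if absent."""
    n = len(arr)
    lo = hi = max(0, min(predicted_idx, n - 1))
    # Grow a window around the prediction with doubling steps until it
    # brackets the target; the window size is proportional to the error.
    step = 1
    while lo > 0 and arr[lo] > target:
        lo = max(0, lo - step)
        step *= 2
    step = 1
    while hi < n - 1 and arr[hi] < target:
        hi = min(n - 1, hi + step)
        step *= 2
    # Finish with binary search inside the small bracketed window.
    i = bisect.bisect_left(arr, target, lo, hi + 1)
    return i if i < n and arr[i] == target else -1

arr = list(range(0, 1000, 2))  # 500 sits at index 250
print(search_with_prediction(arr, 500, 240))  # prediction off by 10
```

    The question the abstract raises is how to *learn* such predictions so that downstream guarantees like this one actually hold.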