
Showing 1–43 of 43 results for author: Nushi, B

Searching in archive cs.
  1. arXiv:2512.20856  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    NVIDIA Nemotron 3: Efficient and Open Intelligence

    Authors: NVIDIA: Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Aleksandr Shaposhnikov, Alex Kondratenko, Alexander Bukharin, Alexandre Milesi, Ali Taghibakhshi, Alisa Liu, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Amy Shen, Anahita Bhiwandiwalla, et al. (334 additional authors not shown)

    Abstract: We introduce the Nemotron 3 family of models - Nano, Super, and Ultra. These models deliver strong agentic, reasoning, and conversational capabilities. The Nemotron 3 family uses a Mixture-of-Experts hybrid Mamba-Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens. Super and Ultra models are trained with NVFP4 and incorporate LatentMoE, a novel appro…

    Submitted 23 December, 2025; originally announced December 2025.

  2. arXiv:2512.20848  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

    Authors: NVIDIA: Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Aleksandr Shaposhnikov, Alex Kondratenko, Alexander Bukharin, Alexandre Milesi, Ali Taghibakhshi, Alisa Liu, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Amy Shen, Anahita Bhiwandiwalla, et al. (289 additional authors not shown)

    Abstract: We present Nemotron 3 Nano 30B-A3B, a Mixture-of-Experts hybrid Mamba-Transformer language model. Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2, followed by supervised fine tuning and large-scale RL on diverse environments. Nemotron 3 Nano achieves better accuracy than our previous generation Nemotron 2 Nano while activa…

    Submitted 23 December, 2025; originally announced December 2025.

  3. arXiv:2510.27055  [pdf, ps, other]

    cs.CL cs.AI

    Detecting Data Contamination in LLMs via In-Context Learning

    Authors: Michał Zawalski, Meriem Boubdir, Klaudia Bałazy, Besmira Nushi, Pablo Ribalta

    Abstract: We present Contamination Detection via Context (CoDeC), a practical and accurate method to detect and quantify training data contamination in large language models. CoDeC distinguishes between data memorized during training and data outside the training distribution by measuring how in-context learning affects model performance. We find that in-context examples typically boost confidence for unsee…

    Submitted 30 October, 2025; originally announced October 2025.

    ACM Class: I.2.7
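
    A minimal sketch of the general signal described in the abstract above (comparing answer confidence with and without in-context examples); this is not the paper's released code, and the model, prompt format, and example choice are illustrative assumptions:

      # Sketch: contamination signal from in-context learning, assuming the idea in the
      # abstract (memorized data benefits little from in-context examples). Model name,
      # prompt format, and any decision threshold are placeholders.
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      model_name = "gpt2"  # placeholder model
      tok = AutoTokenizer.from_pretrained(model_name)
      model = AutoModelForCausalLM.from_pretrained(model_name).eval()

      def answer_logprob(prompt: str, answer: str) -> float:
          """Average log-probability the model assigns to `answer` given `prompt`."""
          prompt_ids = tok(prompt, return_tensors="pt").input_ids
          full_ids = tok(prompt + answer, return_tensors="pt").input_ids
          with torch.no_grad():
              logits = model(full_ids).logits
          answer_len = full_ids.shape[1] - prompt_ids.shape[1]  # approximate boundary
          logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # row i predicts token i+1
          targets = full_ids[0, 1:]
          answer_lp = logprobs[-answer_len:].gather(1, targets[-answer_len:, None]).mean()
          return answer_lp.item()

      def contamination_signal(question, answer, in_context_examples):
          zero_shot = answer_logprob(f"Q: {question}\nA: ", answer)
          few_shot_prompt = "".join(f"Q: {q}\nA: {a}\n" for q, a in in_context_examples)
          few_shot = answer_logprob(few_shot_prompt + f"Q: {question}\nA: ", answer)
          # Small gain from in-context examples -> candidate for memorized/contaminated data.
          return few_shot - zero_shot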

  4. arXiv:2510.10494  [pdf, ps, other]

    cs.AI

    Tracing the Traces: Latent Temporal Signals for Efficient and Accurate Reasoning

    Authors: Martina G. Vilas, Safoora Yousefi, Besmira Nushi, Eric Horvitz, Vidhisha Balachandran

    Abstract: Reasoning models improve their problem-solving ability through inference-time scaling, allocating more compute via longer token budgets. Identifying which reasoning traces are likely to succeed remains a key opportunity: reliably predicting productive paths can substantially reduce wasted computation and improve overall efficiency. We introduce Latent-Trajectory signals that characterize the tempo…

    Submitted 12 October, 2025; originally announced October 2025.

  5. arXiv:2510.01670  [pdf, ps, other]

    cs.AI cs.CL cs.CR cs.CY cs.LG

    Just Do It!? Computer-Use Agents Exhibit Blind Goal-Directedness

    Authors: Erfan Shayegani, Keegan Hines, Yue Dong, Nael Abu-Ghazaleh, Roman Lutz, Spencer Whitehead, Vidhisha Balachandran, Besmira Nushi, Vibhav Vineet

    Abstract: Computer-Use Agents (CUAs) are an increasingly deployed class of agents that take actions on GUIs to accomplish user goals. In this paper, we show that CUAs consistently exhibit Blind Goal-Directedness (BGD): a bias to pursue goals regardless of feasibility, safety, reliability, or context. We characterize three prevalent patterns of BGD: (i) lack of contextual reasoning, (ii) assumptions and deci…

    Submitted 2 October, 2025; originally announced October 2025.

  6. arXiv:2506.20702  [pdf]

    cs.AI cs.CY

    The Singapore Consensus on Global AI Safety Research Priorities

    Authors: Yoshua Bengio, Tegan Maharaj, Luke Ong, Stuart Russell, Dawn Song, Max Tegmark, Lan Xue, Ya-Qin Zhang, Stephen Casper, Wan Sie Lee, Sören Mindermann, Vanessa Wilfred, Vidhisha Balachandran, Fazl Barez, Michael Belinsky, Imane Bello, Malo Bourgon, Mark Brakel, Siméon Campos, Duncan Cass-Beggs, Jiahao Chen, Rumman Chowdhury, Kuan Chua Seah, Jeff Clune, Juntao Dai , et al. (63 additional authors not shown)

    Abstract: Rapidly improving AI capabilities and autonomy hold significant promise of transformation, but are also driving vigorous debate on how to ensure that AI is safe, i.e., trustworthy, reliable, and secure. Building a trusted ecosystem is therefore essential -- it helps people embrace AI with confidence and gives maximal space for innovation while avoiding backlash. The "2025 Singapore Conference on…

    Submitted 30 June, 2025; v1 submitted 25 June, 2025; originally announced June 2025.

    Comments: Final report from the "2025 Singapore Conference on AI (SCAI)" held April 26: https://www.scai.gov.sg/2025/scai2025-report

  7. arXiv:2506.10527  [pdf, ps, other]

    cs.AI cs.PF

    LogiPlan: A Structured Benchmark for Logical Planning and Relational Reasoning in LLMs

    Authors: Yanan Cai, Ahmed Salem, Besmira Nushi, Mark Russinovich

    Abstract: We introduce LogiPlan, a novel benchmark designed to evaluate the capabilities of large language models (LLMs) in logical planning and reasoning over complex relational structures. Logical relational reasoning is important for applications that may rely on LLMs to generate and query structured graphs of relations such as network infrastructure, knowledge bases, or business process schema. Our fram…

    Submitted 12 June, 2025; originally announced June 2025.

  8. arXiv:2504.21318  [pdf, other]

    cs.AI cs.CL

    Phi-4-reasoning Technical Report

    Authors: Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, Piero Kauffmann, Yash Lara, Caio César Teodoro Mendes, Arindam Mitra, Besmira Nushi, Dimitris Papailiopoulos, Olli Saarikivi, Shital Shah, Vaishnavi Shrivastava, Vibhav Vineet, Yue Wu, Safoora Yousefi, Guoqing Zheng

    Abstract: We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on a carefully curated set of "teachable" prompts, selected for the right level of complexity and diversity, and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectivel…

    Submitted 30 April, 2025; originally announced April 2025.

  9. arXiv:2504.00294  [pdf, other]

    cs.LG cs.AI cs.CL

    Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead

    Authors: Vidhisha Balachandran, Jingya Chen, Lingjiao Chen, Shivam Garg, Neel Joshi, Yash Lara, John Langford, Besmira Nushi, Vibhav Vineet, Yue Wu, Safoora Yousefi

    Abstract: Inference-time scaling can enhance the reasoning capabilities of large language models (LLMs) on complex problems that benefit from step-by-step problem solving. Although lengthening generated scratchpads has proven effective for mathematical tasks, the broader impact of this approach on other tasks remains less clear. In this work, we investigate the benefits and limitations of scaling methods ac…

    Submitted 31 March, 2025; originally announced April 2025.

    ACM Class: I.2

  10. arXiv:2501.04155  [pdf, other]

    cs.CV cs.CL cs.LG

    MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation

    Authors: Siddharth Joshi, Besmira Nushi, Vidhisha Balachandran, Varun Chandrasekaran, Vibhav Vineet, Neel Joshi, Baharan Mirzasoleiman

    Abstract: Vision-language models (VLMs) are highly effective but often underperform on specialized tasks; for example, Llava-1.5 struggles with chart and diagram understanding due to scarce task-specific training data. Existing training data, sourced from general-purpose datasets, fails to capture the nuanced details needed for these tasks. We introduce MM-Gen, a scalable method that generates task-specific…

    Submitted 7 January, 2025; originally announced January 2025.

  11. arXiv:2410.22584  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    BenchAgents: Multi-Agent Systems for Structured Benchmark Creation

    Authors: Natasha Butt, Varun Chandrasekaran, Neel Joshi, Besmira Nushi, Vidhisha Balachandran

    Abstract: Evaluation insights are limited by the availability of high-quality benchmarks. As models evolve, there is a need to create benchmarks that can measure progress on new and complex generative capabilities. However, manually creating new benchmarks is slow and expensive, restricting comprehensive evaluations for any capability. We introduce BenchAgents, a multi-agent framework that methodically leve…

    Submitted 7 October, 2025; v1 submitted 29 October, 2024; originally announced October 2024.

  12. arXiv:2410.22517  [pdf, other]

    cs.CL cs.AI cs.LG

    Attention Speaks Volumes: Localizing and Mitigating Bias in Language Models

    Authors: Rishabh Adiga, Besmira Nushi, Varun Chandrasekaran

    Abstract: We explore the internal mechanisms of how bias emerges in large language models (LLMs) when provided with ambiguous comparative prompts: inputs that compare or enforce choosing between two or more entities without providing clear context for preference. Most approaches for bias mitigation focus on either post-hoc analysis or data augmentation. However, these are transient solutions, without addres…

    Submitted 29 October, 2024; originally announced October 2024.

  13. arXiv:2410.13826  [pdf, other]

    cs.LG cs.AI cs.CV

    Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models

    Authors: Mazda Moayeri, Vidhisha Balachandran, Varun Chandrasekaran, Safoora Yousefi, Thomas Fel, Soheil Feizi, Besmira Nushi, Neel Joshi, Vibhav Vineet

    Abstract: With models getting stronger, evaluations have grown more complex, testing multiple skills in one benchmark and even in the same instance at once. However, skill-wise performance is obscured when inspecting aggregate accuracy, under-utilizing the rich signal modern benchmarks contain. We propose an automatic approach to recover the underlying skills relevant for any evaluation instance, by way of…

    Submitted 24 October, 2024; v1 submitted 17 October, 2024; originally announced October 2024.

    Comments: Code at: github.com/microsoft/skill-slice-insights

  14. arXiv:2410.12877  [pdf, other]

    cs.CL cs.AI cs.LG

    Improving Instruction-Following in Language Models through Activation Steering

    Authors: Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi

    Abstract: The ability to follow instructions is crucial for numerous real-world applications of language models. In pursuit of deeper insights and more powerful capabilities, we derive instruction-specific vector representations from language models and use them to steer models accordingly. These vectors are computed as the difference in activations between inputs with and without instructions, enabling a m…

    Submitted 14 April, 2025; v1 submitted 15 October, 2024; originally announced October 2024.

    Comments: ICLR 2025
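
    A minimal sketch of the steering idea described in the abstract above: compute a vector as the difference in activations between inputs with and without an instruction, then add it back during generation. The layer index, scale, and prompts are assumptions, not the paper's setup:

      # Sketch of activation steering: the steering vector is the activation difference
      # between a prompt with an instruction and the same prompt without it.
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tok = AutoTokenizer.from_pretrained("gpt2")
      model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
      layer = model.transformer.h[6]            # a middle layer (assumption)

      def last_token_activation(text):
          ids = tok(text, return_tensors="pt").input_ids
          captured = {}
          def hook(_, __, output):
              captured["h"] = output[0][:, -1, :].detach()   # last-token hidden state
          handle = layer.register_forward_hook(hook)
          with torch.no_grad():
              model(ids)
          handle.remove()
          return captured["h"]

      with_instr = "Answer in French. What is the capital of Italy?"
      without_instr = "What is the capital of Italy?"
      steer = last_token_activation(with_instr) - last_token_activation(without_instr)

      def steering_hook(_, __, output):
          return (output[0] + 4.0 * steer,) + tuple(output[1:])   # scale 4.0 is a guess

      handle = layer.register_forward_hook(steering_hook)
      ids = tok(without_instr, return_tensors="pt").input_ids
      generated = model.generate(ids, max_new_tokens=20)
      handle.remove()
      print(tok.decode(generated[0]))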

  15. arXiv:2409.10566  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    Eureka: Evaluating and Understanding Large Foundation Models

    Authors: Vidhisha Balachandran, Jingya Chen, Neel Joshi, Besmira Nushi, Hamid Palangi, Eduardo Salinas, Vibhav Vineet, James Woffinden-Luey, Safoora Yousefi

    Abstract: Rigorous and reproducible evaluation is critical for assessing the state of the art and for guiding scientific advances in Artificial Intelligence. Evaluation is challenging in practice due to several reasons, including benchmark saturation, lack of transparency in methods used for measurement, development challenges in extracting measurements for generative tasks, and, more generally, the extensi…

    Submitted 13 September, 2024; originally announced September 2024.

    ACM Class: I.2

  16. arXiv:2406.04236  [pdf, other]

    cs.CV

    Understanding Information Storage and Transfer in Multi-modal Large Language Models

    Authors: Samyadeep Basu, Martin Grayson, Cecily Morrison, Besmira Nushi, Soheil Feizi, Daniela Massiceti

    Abstract: Understanding the mechanisms of information storage and transfer in Transformer-based models is important for driving model understanding progress. Recent work has studied these mechanisms for Large Language Models (LLMs), revealing insights on how information is stored in a model's parameters and how information flows to and from these parameters in response to specific prompts. However, these st…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: 20 pages

  17. arXiv:2404.12241  [pdf, other]

    cs.CL cs.AI

    Introducing v0.5 of the AI Safety Benchmark from MLCommons

    Authors: Bertie Vidgen, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Max Bartolo, Borhane Blili-Hamelin, Kurt Bollacker, Rishi Bomassani, Marisa Ferrara Boston, Siméon Campos, Kal Chakra, Canyu Chen, Cody Coleman, Zacharie Delpierre Coudert, Leon Derczynski, Debojyoti Dutta, Ian Eisenberg, James Ezick, Heather Frase, Brian Fuller , et al. (75 additional authors not shown)

    Abstract: This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-pu…

    Submitted 13 May, 2024; v1 submitted 18 April, 2024; originally announced April 2024.

  18. arXiv:2404.06209  [pdf, other]

    cs.LG cs.AI cs.CL

    Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models

    Authors: Sebastian Bordt, Harsha Nori, Vanessa Rodrigues, Besmira Nushi, Rich Caruana

    Abstract: While many have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Specifically, we introduce a variety of different techniques to assess whether a language model has seen a tabular dataset during training. This investigation revea…

    Submitted 4 December, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

    Comments: COLM camera ready, fix typo
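
    One way to probe tabular memorization, in the spirit of the abstract above: show the model the first rows of a public dataset and check whether it reproduces later rows verbatim. This is an illustrative sketch, not necessarily one of the paper's exact tests; the dataset, prompt format, and `call_llm` helper are placeholders:

      # Row-completion style memorization check (illustrative assumptions throughout).
      import csv

      def row_completion_prompt(csv_path, n_context_rows=10):
          with open(csv_path) as f:
              rows = [",".join(r) for r in csv.reader(f)]
          header = rows[0]
          context = rows[1:1 + n_context_rows]
          target = rows[1 + n_context_rows]          # row the model is asked to continue
          prompt = "\n".join([header] + context) + "\n"
          return prompt, target

      def verbatim_match(model_completion: str, target_row: str) -> bool:
          # Verbatim reproduction of a row not shown in the prompt suggests the model
          # saw this dataset during training rather than generalizing from the context.
          return model_completion.strip().startswith(target_row)

      # usage (call_llm is a hypothetical completion function):
      # prompt, target = row_completion_prompt("adult.csv")
      # completion = call_llm(prompt)
      # print("memorization signal:", verbatim_match(completion, target))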

  19. arXiv:2310.15511  [pdf, other]

    cs.LG cs.AI cs.CL cs.IR

    KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval

    Authors: Marah I Abdin, Suriya Gunasekar, Varun Chandrasekaran, Jerry Li, Mert Yuksekgonul, Rahee Ghosh Peshawaria, Ranjita Naik, Besmira Nushi

    Abstract: We study the ability of state-of-the-art models to answer constraint satisfaction queries for information retrieval (e.g., 'a list of ice cream shops in San Diego'). In the past, such queries were considered to be tasks that could only be solved via web-search or knowledge bases. More recently, large language models (LLMs) have demonstrated initial emergent abilities in this task. However, many cu…

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: 23 pages

    ACM Class: I.2.7

  20. arXiv:2310.07088  [pdf, other]

    cs.CL cs.AI

    Diversity of Thought Improves Reasoning Abilities of LLMs

    Authors: Ranjita Naik, Varun Chandrasekaran, Mert Yuksekgonul, Hamid Palangi, Besmira Nushi

    Abstract: Large language models (LLMs) are documented to struggle in settings that require complex reasoning. Nevertheless, instructing the model to break down the problem into smaller reasoning steps, or ensembling various generations through modifying decoding steps, boosts performance. However, these methods assume that the input prompt is fixed and expect the decoding strategies to introduce the diversit…

    Submitted 23 February, 2024; v1 submitted 10 October, 2023; originally announced October 2023.
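
    A minimal sketch of prompt-diversity ensembling in the spirit of the abstract above: vary the prompt framing rather than only the decoding, then aggregate by majority vote. The prompt templates and the `call_llm`/`extract_answer` helpers are illustrative placeholders, not the paper's exact approaches:

      # Ensemble over diverse prompt framings instead of diverse decodings.
      from collections import Counter

      APPROACHES = [
          "Think step by step.",
          "Work backwards from the answer choices.",
          "Restate the problem in simpler words first, then solve it.",
      ]

      def diverse_ensemble(question, call_llm, extract_answer):
          answers = []
          for approach in APPROACHES:
              completion = call_llm(f"{approach}\n\nQuestion: {question}\nAnswer:")
              answers.append(extract_answer(completion))
          winner, count = Counter(answers).most_common(1)[0]
          return winner, count / len(answers)   # answer plus agreement ratio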

  21. arXiv:2309.15098  [pdf, other]

    cs.CL cs.AI cs.LG

    Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models

    Authors: Mert Yuksekgonul, Varun Chandrasekaran, Erik Jones, Suriya Gunasekar, Ranjita Naik, Hamid Palangi, Ece Kamar, Besmira Nushi

    Abstract: We investigate the internal behavior of Transformer-based Large Language Models (LLMs) when they generate factually incorrect text. We propose modeling factual queries as constraint satisfaction problems and use this framework to investigate how the LLM interacts internally with factual constraints. We find a strong positive relationship between the LLM's attention to constraint tokens and the fac…

    Submitted 17 April, 2024; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Published at ICLR 2024
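
    A minimal sketch of the measurement described in the abstract above: aggregate attention mass from the last position to the prompt tokens that express the factual constraint, and treat low attention as a warning sign. Which layers and heads to aggregate, and the example prompt, are assumptions for illustration:

      # Attention-to-constraint-tokens signal (illustrative aggregation choices).
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tok = AutoTokenizer.from_pretrained("gpt2")
      model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True).eval()

      def attention_to_constraint(prompt, constraint_span):
          """Mean attention the last token pays to the tokens of `constraint_span`."""
          ids = tok(prompt, return_tensors="pt").input_ids
          # assumes the constraint appears mid-sentence, preceded by a space
          span_ids = tok(" " + constraint_span, add_special_tokens=False).input_ids
          prompt_list = ids[0].tolist()
          for i in range(len(prompt_list) - len(span_ids) + 1):
              if prompt_list[i:i + len(span_ids)] == span_ids:
                  span_positions = list(range(i, i + len(span_ids)))
                  break
          else:
              raise ValueError("constraint span not found in prompt")
          with torch.no_grad():
              attentions = model(ids).attentions           # tuple of [1, heads, seq, seq]
          att = torch.stack(attentions).mean(dim=(0, 2))   # average over layers and heads
          return att[0, -1, span_positions].sum().item()   # mass from last token to constraint

      score = attention_to_constraint("The director of the movie Titanic is", "Titanic")
      print(score)   # per the abstract, lower attention to the constraint tends to co-occur with factual errors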

  22. arXiv:2304.06034  [pdf, other]

    cs.CY cs.AI cs.CL cs.CV

    Social Biases through the Text-to-Image Generation Lens

    Authors: Ranjita Naik, Besmira Nushi

    Abstract: Text-to-Image (T2I) generation is enabling new applications that support creators, designers, and general end users of productivity software by generating illustrative content with high photorealism starting from a given descriptive text as a prompt. Such models are however trained on massive amounts of web data, which surfaces the peril of potential harmful biases that may leak in the generation…

    Submitted 30 March, 2023; originally announced April 2023.

  23. arXiv:2304.03916  [pdf, other]

    cs.LG cs.AI

    Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning

    Authors: Yu Yang, Besmira Nushi, Hamid Palangi, Baharan Mirzasoleiman

    Abstract: Spurious correlations that degrade model generalization or lead the model to be right for the wrong reasons are one of the main robustness concerns for real-world deployments. However, mitigating these correlations during pre-training for large-scale models can be costly and impractical, particularly for those without access to high-performance computing resources. This paper proposes a novel appr…

    Submitted 30 May, 2023; v1 submitted 8 April, 2023; originally announced April 2023.

  24. arXiv:2212.10015  [pdf, other]

    cs.CV cs.AI cs.CL

    Benchmarking Spatial Relationships in Text-to-Image Generation

    Authors: Tejas Gokhale, Hamid Palangi, Besmira Nushi, Vibhav Vineet, Eric Horvitz, Ece Kamar, Chitta Baral, Yezhou Yang

    Abstract: Spatial understanding is a fundamental aspect of computer vision and integral for human-level reasoning about images, making it an important component for grounded language understanding. While recent text-to-image synthesis (T2I) models have shown unprecedented improvements in photorealism, it is unclear whether they have reliable spatial understanding capabilities. We investigate the ability of…

    Submitted 27 October, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: preprint; Code and Data at https://github.com/microsoft/VISOR and https://huggingface.co/datasets/tgokhale/sr2d_visor
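
    One way such spatial relationships can be checked automatically, in the spirit of the benchmark described above: detect both objects in the generated image and compare their centroids against the relation stated in the prompt. The detector call and the centroid rule below are illustrative assumptions, not necessarily the benchmark's exact metric:

      # Centroid-based spatial relation check for generated images (illustrative).
      def centroid(box):
          x0, y0, x1, y1 = box
          return (x0 + x1) / 2.0, (y0 + y1) / 2.0

      def relation_holds(box_a, box_b, relation):
          (ax, ay), (bx, by) = centroid(box_a), centroid(box_b)
          if relation == "left of":
              return ax < bx
          if relation == "right of":
              return ax > bx
          if relation == "above":
              return ay < by    # image y-coordinates grow downward
          if relation == "below":
              return ay > by
          raise ValueError(relation)

      # usage with a hypothetical detector returning {label: (x0, y0, x1, y1)}:
      # boxes = detect_objects(image)   # placeholder detector
      # correct = ("dog" in boxes and "chair" in boxes
      #            and relation_holds(boxes["dog"], boxes["chair"], "left of"))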

  25. arXiv:2208.07960  [pdf, other]

    cs.HC cs.AI

    Advancing Human-AI Complementarity: The Impact of User Expertise and Algorithmic Tuning on Joint Decision Making

    Authors: Kori Inkpen, Shreya Chappidi, Keri Mallari, Besmira Nushi, Divya Ramesh, Pietro Michelucci, Vani Mandava, Libuše Hannah Vepřek, Gabrielle Quinn

    Abstract: Human-AI collaboration for decision-making strives to achieve team performance that exceeds the performance of humans or AI alone. However, many factors can impact the success of Human-AI teams, including a user's domain expertise, mental models of an AI system, trust in recommendations, and more. This work examines users' interaction with three simulated algorithmic models, all with similar accuracy…

    Submitted 16 August, 2022; originally announced August 2022.

    Comments: Paper accepted and to be published in Transactions on Computer Human Interaction

  26. arXiv:2205.09696  [pdf, other]

    cs.HC cs.AI

    Who Goes First? Influences of Human-AI Workflow on Decision Making in Clinical Imaging

    Authors: Riccardo Fogliato, Shreya Chappidi, Matthew Lungren, Michael Fitzke, Mark Parkinson, Diane Wilson, Paul Fisher, Eric Horvitz, Kori Inkpen, Besmira Nushi

    Abstract: Details of the designs and mechanisms in support of human-AI collaboration must be considered in the real-world fielding of AI technologies. A critical aspect of interaction design for AI-assisted human decision making is policies about the display and sequencing of AI inferences within larger decision-making workflows. We have a poor understanding of the influences of making AI inferences availa…

    Submitted 19 May, 2022; originally announced May 2022.

    Comments: Accepted at ACM Conference on Fairness, Accountability, and Transparency (FAccT), 2022

  27. arXiv:2202.11812  [pdf, other]

    cs.HC cs.AI

    Investigations of Performance and Bias in Human-AI Teamwork in Hiring

    Authors: Andi Peng, Besmira Nushi, Emre Kiciman, Kori Inkpen, Ece Kamar

    Abstract: In AI-assisted decision-making, effective hybrid (human-AI) teamwork is not dependent on AI performance alone, but also on its impact on human decision-making. While prior work studies the effects of model accuracy on humans, we endeavour here to investigate the complex dynamics of how both a model's predictive performance and bias may transfer to humans in a recommendation-aided decision t…

    Submitted 21 February, 2022; originally announced February 2022.

    Comments: Accepted at AAAI 2022

  28. arXiv:2107.06618  [pdf, other]

    eess.IV cs.CV cs.LG

    Hierarchical Analysis of Visual COVID-19 Features from Chest Radiographs

    Authors: Shruthi Bannur, Ozan Oktay, Melanie Bernhardt, Anton Schwaighofer, Rajesh Jena, Besmira Nushi, Sharan Wadhwani, Aditya Nori, Kal Natarajan, Shazad Ashraf, Javier Alvarez-Valle, Daniel C. Castro

    Abstract: Chest radiography has been a recommended procedure for patient triaging and resource management in intensive care units (ICUs) throughout the COVID-19 pandemic. The machine learning efforts to augment this workflow have been long challenged due to deficiencies in reporting, model evaluation, and failure mode analysis. To address some of those shortcomings, we model radiological features with a hum…

    Submitted 14 July, 2021; originally announced July 2021.

    Comments: Presented at ICML 2021 Workshop on Interpretable Machine Learning in Healthcare

  29. arXiv:2012.01750  [pdf, other]

    cs.CV

    Understanding Failures of Deep Networks via Robust Feature Extraction

    Authors: Sahil Singla, Besmira Nushi, Shital Shah, Ece Kamar, Eric Horvitz

    Abstract: Traditional evaluation metrics for learned models that report aggregate scores over a test set are insufficient for surfacing important and informative patterns of failure over features and instances. We introduce and study a method aimed at characterizing and explaining failures by identifying visual attributes whose presence or absence results in poor performance. In distinction to previous work…

    Submitted 12 June, 2021; v1 submitted 3 December, 2020; originally announced December 2020.

    Comments: Accepted at CVPR, 2021

  30. arXiv:2008.04572  [pdf, other]

    cs.LG cs.SE stat.ML

    An Empirical Analysis of Backward Compatibility in Machine Learning Systems

    Authors: Megha Srivastava, Besmira Nushi, Ece Kamar, Shital Shah, Eric Horvitz

    Abstract: In many applications of machine learning (ML), updates are performed with the goal of enhancing model performance. However, current practices for updating models rely solely on isolated, aggregate performance analyses, overlooking important dependencies, expectations, and needs in real-world deployments. We consider how updates, intended to improve ML models, can introduce new errors that can sign…

    Submitted 11 August, 2020; originally announced August 2020.

    Comments: KDD 2020, 9 pages, 7 figures
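
    One common way to quantify the update regressions this abstract describes is a backward-compatibility score: among examples the old model got right, the fraction the updated model still gets right. A minimal sketch follows; the metric name and the toy inputs are illustrative, not lifted from the paper:

      # Backward-compatibility measurement for a model update (illustrative sketch).
      def backward_trust_compatibility(y_true, old_preds, new_preds):
          old_correct = [i for i, (y, p) in enumerate(zip(y_true, old_preds)) if y == p]
          if not old_correct:
              return 1.0
          still_correct = sum(1 for i in old_correct if new_preds[i] == y_true[i])
          return still_correct / len(old_correct)

      # usage
      y     = [0, 1, 1, 0, 1]
      old_p = [0, 1, 0, 0, 1]   # old model: 4/5 correct
      new_p = [0, 0, 0, 0, 1]   # update regresses on one example the old model handled
      print(backward_trust_compatibility(y, old_p, new_p))   # 0.75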

  31. arXiv:2006.14779  [pdf, other]

    cs.AI cs.CL cs.HC cs.LG

    Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance

    Authors: Gagan Bansal, Tongshuang Wu, Joyce Zhou, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Tulio Ribeiro, Daniel S. Weld

    Abstract: Many researchers motivate explainable AI with studies showing that human-AI team performance on decision-making tasks improves when the AI explains its recommendations. However, prior studies observed improvements from explanations only when the AI, alone, outperformed both the human and the best team. Can explanations help lead to complementary performance, where team accuracy is higher than eith…

    Submitted 12 January, 2021; v1 submitted 25 June, 2020; originally announced June 2020.

    Comments: CHI'21

  32. arXiv:2004.13102  [pdf, other]

    cs.AI cs.HC cs.LG

    Is the Most Accurate AI the Best Teammate? Optimizing AI for Teamwork

    Authors: Gagan Bansal, Besmira Nushi, Ece Kamar, Eric Horvitz, Daniel S. Weld

    Abstract: AI practitioners typically strive to develop the most accurate systems, making an implicit assumption that the AI system will function autonomously. However, in practice, AI systems often are used to provide advice to people in domains ranging from criminal justice and finance to healthcare. In such AI-advised decision making, humans and machines form a team, where the human is responsible for mak…

    Submitted 19 February, 2021; v1 submitted 27 April, 2020; originally announced April 2020.

    Comments: v2

  33. arXiv:2001.06927  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions

    Authors: Ramprasaath R. Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Ribeiro, Besmira Nushi, Ece Kamar

    Abstract: Existing VQA datasets contain questions with varying levels of complexity. While the majority of questions in these datasets require perception for recognizing existence, properties, and spatial relationships of entities, a significant portion of questions pose challenges that correspond to reasoning tasks - tasks that can only be answered through a synthesis of perception and knowledge about the…

    Submitted 16 June, 2020; v1 submitted 19 January, 2020; originally announced January 2020.

    Comments: Accepted to CVPR'20 as an Oral Presentation

  34. arXiv:1909.03567  [pdf, other]

    cs.HC cs.AI cs.CY

    What You See Is What You Get? The Impact of Representation Criteria on Human Bias in Hiring

    Authors: Andi Peng, Besmira Nushi, Emre Kiciman, Kori Inkpen, Siddharth Suri, Ece Kamar

    Abstract: Although systematic biases in decision-making are widely documented, the ways in which they emerge from different sources are less understood. We present a controlled experimental platform to study gender bias in hiring by decoupling the effect of world distribution (the gender breakdown of candidates in a specific profession) from bias in human decision-making. We explore the effectiveness of \tex…

    Submitted 8 September, 2019; originally announced September 2019.

    Comments: This paper has been accepted for publication at HCOMP 2019

  35. arXiv:1906.01148  [pdf, other]

    cs.HC cs.LG stat.ML

    A Case for Backward Compatibility for Human-AI Teams

    Authors: Gagan Bansal, Besmira Nushi, Ece Kamar, Dan Weld, Walter Lasecki, Eric Horvitz

    Abstract: AI systems are being deployed to support human decision making in high-stakes domains. In many cases, the human and AI form a team, in which the human makes decisions after reviewing the AI's inferences. A successful partnership requires that the human develops insights into the performance of the AI system, including its failures. We study the influence of updates to an AI system in this setting.…

    Submitted 3 June, 2019; originally announced June 2019.

    Comments: presented at 2019 ICML Workshop on Human in the Loop Learning (HILL 2019), Long Beach, USA

  36. arXiv:1905.05179  [pdf, other]

    cs.LG cs.AI cs.SE stat.ML

    Metareasoning in Modular Software Systems: On-the-Fly Configuration using Reinforcement Learning with Rich Contextual Representations

    Authors: Aditya Modi, Debadeepta Dey, Alekh Agarwal, Adith Swaminathan, Besmira Nushi, Sean Andrist, Eric Horvitz

    Abstract: Assemblies of modular subsystems are being pressed into service to perform sensing, reasoning, and decision making in high-stakes, time-critical tasks in such areas as transportation, healthcare, and industrial automation. We address the opportunity to maximize the utility of an overall computing system by employing reinforcement learning to guide the configuration of the set of interacting module…

    Submitted 12 May, 2019; originally announced May 2019.

    Comments: 12 pages, 7 figures, 2 tables

  37. arXiv:1810.10033  [pdf, other]

    cs.SI

    Analysis of Strategy and Spread of Russia-sponsored Content in the US in 2017

    Authors: Alexander Spangher, Gireeja Ranade, Besmira Nushi, Adam Fourney, Eric Horvitz

    Abstract: The Russia-based Internet Research Agency (IRA) carried out a broad information campaign in the U.S. before and after the 2016 presidential election. The organization created an expansive set of internet properties: web domains, Facebook pages, and Twitter bots, which received traffic via purchased Facebook ads, tweets, and search engines indexing their domains. We investigate the scope of IRA act…

    Submitted 23 October, 2018; originally announced October 2018.

  38. arXiv:1809.07424  [pdf, other]

    cs.LG cs.AI cs.HC stat.ML

    Towards Accountable AI: Hybrid Human-Machine Analyses for Characterizing System Failure

    Authors: Besmira Nushi, Ece Kamar, Eric Horvitz

    Abstract: As machine learning systems move from computer-science laboratories into the open world, their accountability becomes a high priority problem. Accountability requires deep understanding of system behavior and its failures. Current evaluation methods such as single-score error metrics and confusion matrices provide aggregate views of system performance that hide important shortcomings. Understandin…

    Submitted 19 September, 2018; originally announced September 2018.

    Journal ref: AAAI Conference on Human Computation and Crowdsourcing 2018

  39. arXiv:1611.08309  [pdf, other]

    cs.LG

    On Human Intellect and Machine Failures: Troubleshooting Integrative Machine Learning Systems

    Authors: Besmira Nushi, Ece Kamar, Eric Horvitz, Donald Kossmann

    Abstract: We study the problem of troubleshooting machine learning systems that rely on analytical pipelines of distinct components. Understanding and fixing errors that arise in such integrative systems is difficult as failures can occur at multiple points in the execution workflow. Moreover, errors can propagate, become amplified or be suppressed, making blame assignment difficult. We propose a human-in-t…

    Submitted 24 November, 2016; originally announced November 2016.

    Comments: 11 pages, Thirty-First AAAI conference on Artificial Intelligence

    ACM Class: I.2; H.1.2; I.4; I.2.7

  40. arXiv:1512.00537  [pdf, other]

    cs.DB

    Fault-Tolerant Entity Resolution with the Crowd

    Authors: Anja Gruenheid, Besmira Nushi, Tim Kraska, Wolfgang Gatterbauer, Donald Kossmann

    Abstract: In recent years, crowdsourcing has increasingly been applied as a means to enhance data quality. Although the crowd generates insightful information, especially for complex problems such as entity resolution (ER), the output quality of crowd workers is often noisy. That is, workers may unintentionally generate false or contradicting data even for simple tasks. The challenge that we address in this paper…

    Submitted 1 December, 2015; originally announced December 2015.

  41. arXiv:1508.01951  [pdf, other]

    cs.LG cs.DB

    Crowd Access Path Optimization: Diversity Matters

    Authors: Besmira Nushi, Adish Singla, Anja Gruenheid, Erfan Zamanian, Andreas Krause, Donald Kossmann

    Abstract: Quality assurance is one of the most important challenges in crowdsourcing. Assigning tasks to several workers to increase quality through redundant answers can be expensive when asking homogeneous sources. This limitation has been overlooked by current crowdsourcing platforms, therefore resulting in costly solutions. In order to achieve desirable cost-quality tradeoffs it is essential to apply efficien…

    Submitted 11 August, 2015; v1 submitted 8 August, 2015; originally announced August 2015.

    Comments: 10 pages, 3rd AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2015)

    ACM Class: H.1.2; I.2.6; H.2.5

  42. arXiv:1407.6714  [pdf, other]

    cs.SI

    CrowdSTAR: A Social Task Routing Framework for Online Communities

    Authors: Besmira Nushi, Omar Alonso, Martin Hentschel, Vasileios Kandylas

    Abstract: The online communities available on the Web have been shown to be significantly interactive and capable of collectively solving difficult tasks. Nevertheless, it is still a challenge to decide how a task should be dispatched through the network due to the high diversity of the communities and the dynamically changing expertise and social availability of their members. We introduce CrowdSTAR, a framewor…

    Submitted 24 July, 2014; originally announced July 2014.

    ACM Class: H.4.m; H.5.3

  43. arXiv:1208.1931  [pdf, other]

    cs.DB

    Uncertain Time-Series Similarity: Return to the Basics

    Authors: Michele Dallachiesa, Besmira Nushi, Katsiaryna Mirylenka, Themis Palpanas

    Abstract: In recent years there has been a considerable increase in the availability of continuous sensor measurements in a wide range of application domains, such as Location-Based Services (LBS), medical monitoring systems, manufacturing plants and engineering facilities to ensure efficiency, product quality and safety, hydrologic and geologic observing systems, pollution management, and others. Due to…

    Submitted 9 August, 2012; originally announced August 2012.

    Comments: VLDB2012

    Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 11, pp. 1662-1673 (2012)