
Showing 1–50 of 231 results for author: Smith, N A

Searching in archive cs.
  1. arXiv:2512.15586  [pdf, ps, other]

    cs.CL

    Bolmo: Byteifying the Next Generation of Language Models

    Authors: Benjamin Minixhofer, Tyler Murray, Tomasz Limisiewicz, Anna Korhonen, Luke Zettlemoyer, Noah A. Smith, Edoardo M. Ponti, Luca Soldaini, Valentin Hofmann

    Abstract: We introduce Bolmo, the first family of competitive fully open byte-level language models (LMs) at the 1B and 7B parameter scales. In contrast to prior research on byte-level LMs, which focuses predominantly on training from scratch, we train Bolmo by byteifying existing subword-level LMs. Byteification enables overcoming the limitations of subword tokenization - such as insufficient character und…

    Submitted 17 December, 2025; originally announced December 2025.

  2. arXiv:2512.13961  [pdf, ps, other]

    cs.CL cs.LG

    Olmo 3

    Authors: Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng, et al. (44 additional authors not shown)

    Abstract: We introduce Olmo 3, a family of state-of-the-art, fully-open language models at the 7B and 32B parameter scales. Olmo 3 model construction targets long-context reasoning, function calling, coding, instruction following, general chat, and knowledge recall. This release includes the entire model flow, i.e., the full lifecycle of the family of models, including every stage, checkpoint, data point, a…

    Submitted 15 December, 2025; originally announced December 2025.

  3. arXiv:2511.03056  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Reading Between the Lines: The One-Sided Conversation Problem

    Authors: Victoria Ebert, Rishabh Singh, Tuochao Chen, Noah A. Smith, Shyamnath Gollakota

    Abstract: Conversational AI is constrained in many real-world settings where only one side of a dialogue can be recorded, such as telemedicine, call centers, and smart glasses. We formalize this as the one-sided conversation problem (1SC): inferring and learning from one side of a conversation. We study two tasks: (1) reconstructing the missing speaker's turns for real-time use cases, and (2) generating sum…

    Submitted 4 November, 2025; originally announced November 2025.

    Comments: 8 pages, 6 figures, 4 tables

  4. arXiv:2510.14261  [pdf, ps, other]

    cs.CL

    Rewriting History: A Recipe for Interventional Analyses to Study Data Effects on Model Behavior

    Authors: Rahul Nadkarni, Yanai Elazar, Hila Gonen, Noah A. Smith

    Abstract: We present an experimental recipe for studying the relationship between training data and language model (LM) behavior. We outline steps for intervening on data batches -- i.e., "rewriting history" -- and then retraining model checkpoints over that data to test hypotheses relating data to behavior. Our recipe breaks down such an intervention into stages that include selecting evaluation items fr…

    Submitted 15 October, 2025; originally announced October 2025.

  5. arXiv:2509.11106  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Fluid Language Model Benchmarking

    Authors: Valentin Hofmann, David Heineman, Ian Magnusson, Kyle Lo, Jesse Dodge, Maarten Sap, Pang Wei Koh, Chun Wang, Hannaneh Hajishirzi, Noah A. Smith

    Abstract: Language model (LM) benchmarking faces several challenges: comprehensive evaluations are costly, benchmarks often fail to measure the intended capabilities, and evaluation quality can degrade due to labeling errors and benchmark saturation. Although various strategies have been proposed to mitigate these issues, they tend to address individual aspects in isolation, neglecting broader questions abo…

    Submitted 14 September, 2025; originally announced September 2025.

    Comments: COLM 2025

  6. arXiv:2508.13144  [pdf, ps, other]

    cs.CL cs.LG

    Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation

    Authors: David Heineman, Valentin Hofmann, Ian Magnusson, Yuling Gu, Noah A. Smith, Hannaneh Hajishirzi, Kyle Lo, Jesse Dodge

    Abstract: Developing large language models is expensive and involves making decisions with small experiments, typically by evaluating on large, multi-task evaluation suites. In this work, we analyze specific properties which make a benchmark more reliable for such decisions, and interventions to design higher-quality evaluation benchmarks. We introduce two key metrics that show differences in current benchm…

    Submitted 18 August, 2025; originally announced August 2025.

  7. arXiv:2507.07024  [pdf, ps, other]

    cs.CL cs.AI

    FlexOlmo: Open Language Models for Flexible Data Use

    Authors: Weijia Shi, Akshita Bhagia, Kevin Farhat, Niklas Muennighoff, Pete Walsh, Jacob Morrison, Dustin Schwenk, Shayne Longpre, Jake Poznanski, Allyson Ettinger, Daogao Liu, Margaret Li, Dirk Groeneveld, Mike Lewis, Wen-tau Yih, Luca Soldaini, Kyle Lo, Noah A. Smith, Luke Zettlemoyer, Pang Wei Koh, Hannaneh Hajishirzi, Ali Farhadi, Sewon Min

    Abstract: We introduce FlexOlmo, a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where these parameters along with their associated data can be flexibly included or excluded from model inferences with no further training. FlexOlmo employs a mixture…

    Submitted 22 August, 2025; v1 submitted 9 July, 2025; originally announced July 2025.

  8. arXiv:2506.19065  [pdf, ps, other]

    cs.CV cs.DL

    LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR

    Authors: Guang Yang, Victoria Ebert, Nazif Tamer, Brian Siyuan Zheng, Luiza Pozzobon, Noah A. Smith

    Abstract: We propose Legato, a new end-to-end model for optical music recognition (OMR), a task of converting music score images to machine-readable documents. Legato is the first large-scale pretrained OMR model capable of recognizing full-page or multi-page typeset music scores and the first to generate documents in ABC notation, a concise, human-readable format for symbolic music. Bringing together a pre…

    Submitted 1 October, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

  9. arXiv:2506.19004  [pdf, ps, other]

    cs.CL

    Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations

    Authors: Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, Noah A. Smith

    Abstract: Modern tokenizers employ deterministic algorithms to map text into a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer vocabulary. In this work, we investigate the robustness of LMs to text encoded with non-canonical tokenizations entirely unseen during training. Surprisingly, when evaluated across 20 benchmarks, we find t…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: preprint
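
    As a concrete illustration of the ambiguity this abstract exploits, the toy sketch below enumerates every segmentation of a string over a fixed vocabulary; only one would be the tokenizer's canonical output. The vocabulary here is hypothetical, and the enumeration is ours rather than the paper's evaluation setup.

        def all_tokenizations(text, vocab):
            """Enumerate every segmentation of `text` into tokens drawn from `vocab`."""
            if not text:
                return [[]]
            results = []
            for end in range(1, len(text) + 1):
                prefix = text[:end]
                if prefix in vocab:
                    for rest in all_tokenizations(text[end:], vocab):
                        results.append([prefix] + rest)
            return results

        # Hypothetical vocabulary; real BPE vocabularies admit the same ambiguity.
        vocab = {"un", "believ", "able", "believable", "unbeliev"}
        for seg in all_tokenizations("unbelievable", vocab):
            print(seg)  # e.g., ['un', 'believable'] vs. ['unbeliev', 'able']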

  10. arXiv:2506.14123  [pdf, ps, other]

    cs.CL cs.FL cs.LG

    Sampling from Your Language Model One Byte at a Time

    Authors: Jonathan Hayase, Alisa Liu, Noah A. Smith, Sewoong Oh

    Abstract: Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model's generations, an issue known as the Prompt Boundary Problem (PBP). For example, users are often advised not to end their prompts with a space because it prevents…

    Submitted 11 July, 2025; v1 submitted 16 June, 2025; originally announced June 2025.

    Comments: 23 pages, 8 figures
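
    The core token-to-byte conversion is easy to sketch: a next-token distribution induces a next-byte distribution by summing the probability of every token whose encoding begins with that byte. This is a minimal sketch of one step only (the paper also handles byte prefixes that cross token boundaries), and the names are ours.

        from collections import defaultdict

        def next_byte_distribution(token_probs, token_bytes):
            """Marginalize P(next token) into P(next byte): each token contributes
            its probability mass to the first byte of its encoding."""
            byte_probs = defaultdict(float)
            for tok_id, p in token_probs.items():
                byte_probs[token_bytes[tok_id][0]] += p
            return dict(byte_probs)

        # Toy example: three tokens, two distinct first bytes (32 = space, 97 = 'a').
        token_bytes = {0: b" the", 1: b" this", 2: b"a"}
        token_probs = {0: 0.5, 1: 0.3, 2: 0.2}
        print(next_byte_distribution(token_probs, token_bytes))  # {32: 0.8, 97: 0.2}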

  11. arXiv:2506.12229  [pdf, ps, other]

    cs.CL

    Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index

    Authors: Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi

    Abstract: Language models are trained mainly on massive text data from the Internet, and it becomes increasingly important to understand this data source. Exact-match search engines enable searching in large text corpora - counting string appearances and retrieving the enclosing documents - yet the high storage overhead hinders their application on Internet-scale data. We present infini-gram mini, an effici…

    Submitted 4 October, 2025; v1 submitted 13 June, 2025; originally announced June 2025.
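
    To make the FM-index mechanism concrete, here is a self-contained toy version of exact-match counting via backward search. The paper's contribution is an engineered, Internet-scale variant of this idea; nothing below is from the paper's codebase.

        def build_fm_index(text):
            """Build the two FM-index tables: the C array and the occ table over the BWT."""
            text += "\0"  # unique sentinel, lexicographically smallest
            sa = sorted(range(len(text)), key=lambda i: text[i:])
            bwt = [text[i - 1] for i in sa]
            chars = sorted(set(text))
            C, total = {}, 0
            for c in chars:  # C[c] = number of characters in text smaller than c
                C[c] = total
                total += text.count(c)
            occ = {c: [0] * (len(bwt) + 1) for c in chars}
            for i, ch in enumerate(bwt):  # occ[c][i] = occurrences of c in bwt[:i]
                for c in chars:
                    occ[c][i + 1] = occ[c][i] + (ch == c)
            return C, occ, len(bwt)

        def count_occurrences(pattern, C, occ, n):
            """Backward search: how many times `pattern` occurs in the indexed text."""
            lo, hi = 0, n
            for c in reversed(pattern):
                if c not in C:
                    return 0
                lo = C[c] + occ[c][lo]
                hi = C[c] + occ[c][hi]
                if lo >= hi:
                    return 0
            return hi - lo

        index = build_fm_index("the cat sat on the mat")
        print(count_occurrences("the", *index))  # 2
        print(count_occurrences("at", *index))   # 3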

  12. arXiv:2506.01937  [pdf, ps, other]

    cs.CL

    RewardBench 2: Advancing Reward Model Evaluation

    Authors: Saumya Malik, Valentina Pyatkin, Sander Land, Jacob Morrison, Noah A. Smith, Hannaneh Hajishirzi, Nathan Lambert

    Abstract: Reward models are used throughout the post-training of language models to capture nuanced signals from preference data and provide a training target for optimization across instruction following, reasoning, safety, and more domains. The community has begun establishing best practices for evaluating reward models, from the development of benchmarks that test capabilities in specific skill areas to…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: Data, models, and leaderboard available at https://huggingface.co/collections/allenai/reward-bench-2-683d2612a4b3e38a3e53bb51

  13. arXiv:2505.17613  [pdf, ps, other]

    cs.AI cs.CL cs.CV

    MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation

    Authors: Jihan Yao, Yushi Hu, Yujie Yi, Bin Han, Shangbin Feng, Guang Yang, Bingbing Wen, Ranjay Krishna, Lucy Lu Wang, Yulia Tsvetkov, Noah A. Smith, Banghua Zhu

    Abstract: Automatically evaluating multimodal generation presents a significant challenge, as automated metrics often struggle to align reliably with human evaluation, especially for complex tasks that involve multiple modalities. To address this, we present MMMG, a comprehensive and human-aligned benchmark for multimodal generation across 4 modality combinations (image, audio, interleaved text and image, i…

    Submitted 23 May, 2025; originally announced May 2025.

  14. arXiv:2505.09990  [pdf, other]

    cs.CV

    PointArena: Probing Multimodal Grounding Through Language-Guided Pointing

    Authors: Long Cheng, Jiafei Duan, Yi Ru Wang, Haoquan Fang, Boyang Li, Yushan Huang, Elvis Wang, Ainaz Eftekhar, Jason Lee, Wentao Yuan, Rose Hendrix, Noah A. Smith, Fei Xia, Dieter Fox, Ranjay Krishna

    Abstract: Pointing serves as a fundamental and intuitive mechanism for grounding language within visual contexts, with applications spanning robotics, assistive technologies, and interactive AI systems. While recent multimodal models have started to support pointing capabilities, existing benchmarks typically focus only on referential object localization tasks. We introduce PointArena, a comprehensive platf…

    Submitted 16 May, 2025; v1 submitted 15 May, 2025; originally announced May 2025.

    Comments: 10 pages; dataset and code: https://pointarena.github.io/

  15. arXiv:2505.03054  [pdf, other]

    cs.AI cs.CL cs.SD eess.AS

    BLAB: Brutally Long Audio Bench

    Authors: Orevaoghene Ahia, Martijn Bartelds, Kabir Ahuja, Hila Gonen, Valentin Hofmann, Siddhant Arora, Shuyue Stella Li, Vishal Puttagunta, Mofetoluwa Adeyemi, Charishma Buchireddy, Ben Walls, Noah Bennett, Shinji Watanabe, Noah A. Smith, Yulia Tsvetkov, Sachin Kumar

    Abstract: Developing large audio language models (LMs) capable of understanding diverse spoken interactions is essential for accommodating the multimodal nature of human communication and can increase the accessibility of language technologies across different user populations. Recent work on audio LMs has primarily evaluated their performance on short audio segments, typically under 30 seconds, with limite…

    Submitted 12 May, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

  16. arXiv:2504.20879  [pdf, other]

    cs.AI cs.CL cs.LG stat.ME

    The Leaderboard Illusion

    Authors: Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'Souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah A. Smith, Beyza Ermis, Marzieh Fadaee, Sara Hooker

    Abstract: Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private test…

    Submitted 12 May, 2025; v1 submitted 29 April, 2025; originally announced April 2025.

    Comments: 68 pages, 18 figures, 9 tables

  17. arXiv:2504.18509  [pdf, other]

    cs.CV

    Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation

    Authors: Shivam Duggal, Yushi Hu, Oscar Michel, Aniruddha Kembhavi, William T. Freeman, Noah A. Smith, Ranjay Krishna, Antonio Torralba, Ali Farhadi, Wei-Chiu Ma

    Abstract: Despite the unprecedented progress in the field of 3D generation, current systems still often fail to produce high-quality 3D assets that are visually appealing and geometrically and semantically consistent across multiple viewpoints. To effectively assess the quality of the generated 3D data, there is a need for a reliable 3D evaluation tool. Unfortunately, existing 3D evaluation metrics often ov…

    Submitted 25 April, 2025; originally announced April 2025.

    Comments: CVPR 2025. Project page and codes: https://eval3d.github.io/

  18. arXiv:2504.12459  [pdf, other]

    cs.CL cs.AI

    On Linear Representations and Pretraining Data Frequency in Language Models

    Authors: Jack Merullo, Noah A. Smith, Sarah Wiegreffe, Yanai Elazar

    Abstract: Pretraining data has a direct impact on the behaviors and quality of language models (LMs), but we only understand the most basic principles of this relationship. While most work focuses on pretraining data's effect on downstream task behavior, we investigate its relationship to LM representations. Previous work has discovered that, in language models, some concepts are encoded 'linearly' in the r…

    Submitted 16 April, 2025; originally announced April 2025.

    Comments: ICLR 2025

  19. arXiv:2504.11393  [pdf, ps, other]

    cs.LG cs.CL

    DataDecide: How to Predict Best Pretraining Data with Small Experiments

    Authors: Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge

    Abstract: Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and eval…

    Submitted 13 July, 2025; v1 submitted 15 April, 2025; originally announced April 2025.

    Comments: ICML 2025

  20. arXiv:2504.07096  [pdf, ps, other]

    cs.CL

    OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

    Authors: Jiacheng Liu, Taylor Blanton, Yanai Elazar, Sewon Min, YenSung Chen, Arnavi Chheda-Kothary, Huy Tran, Byron Bischoff, Eric Marsh, Michael Schmitz, Cassidy Trier, Aaron Sarnat, Jenna James, Jon Borchardt, Bailey Kuehl, Evie Cheng, Karen Farley, Sruthi Sreeram, Taira Anderson, David Albright, Carissa Schoenick, Luca Soldaini, Dirk Groeneveld, Rock Yuren Pang, Pang Wei Koh, et al. (6 additional authors not shown)

    Abstract: We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few second…

    Submitted 7 July, 2025; v1 submitted 9 April, 2025; originally announced April 2025.

    Comments: ACL 2025 demo track
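
    A naive sketch of the verbatim-matching primitive the demo exposes (our simplification: plain substring scans over a toy corpus, where OLMoTrace instead queries an extended infini-gram index over trillions of tokens):

        def maximal_verbatim_spans(output_tokens, corpus_text, min_len=3):
            """Greedily find long token spans of model output that occur verbatim
            in the training text. Substring checks stand in for indexed lookups."""
            spans, i = [], 0
            while i < len(output_tokens):
                j, best = i + min_len, None
                while j <= len(output_tokens) and " ".join(output_tokens[i:j]) in corpus_text:
                    best, j = (i, j), j + 1
                if best:
                    spans.append(" ".join(output_tokens[best[0]:best[1]]))
                    i = best[1]
                else:
                    i += 1
            return spans

        corpus = "the quick brown fox jumps over the lazy dog"
        output = "a quick brown fox jumps happily".split()
        print(maximal_verbatim_spans(output, corpus))  # ['quick brown fox jumps']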

  21. arXiv:2504.03790  [pdf, ps, other]

    cs.CL cs.LG stat.ML

    Sample, Don't Search: Rethinking Test-Time Alignment for Language Models

    Authors: Gonçalo Faria, Noah A. Smith

    Abstract: Increasing test-time computation has emerged as a promising direction for improving language model performance, particularly in scenarios where model finetuning is impractical or impossible due to computational constraints or private model weights. However, existing test-time search methods using a reward model (RM) often degrade in quality as compute scales, due to the over-optimization of what a…

    Submitted 18 December, 2025; v1 submitted 3 April, 2025; originally announced April 2025.
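
    One simple instantiation of sampling rather than searching at test time, sketched with a hypothetical reward function (the paper's actual algorithm may differ): rather than returning the reward-maximizing candidate, draw a candidate in proportion to exp(reward/tau), which tempers over-optimization of an imperfect reward model.

        import math
        import random

        def sample_dont_search(candidates, reward_fn, tau=1.0):
            """Pick one of n sampled generations with probability proportional to
            exp(reward / tau), instead of the best-of-n argmax."""
            rewards = [reward_fn(c) for c in candidates]
            m = max(rewards)  # subtract the max for numerical stability
            weights = [math.exp((r - m) / tau) for r in rewards]
            return random.choices(candidates, weights=weights, k=1)[0]

        # Hypothetical usage; `generate` and `reward_model` are stand-ins.
        # best = sample_dont_search([generate(prompt) for _ in range(16)], reward_model)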

  22. arXiv:2503.13423  [pdf, ps, other]

    cs.CL cs.LG

    SuperBPE: Space Travel for Language Models

    Authors: Alisa Liu, Jonathan Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, Yejin Choi

    Abstract: The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the potential of modern LMs? Whitespace is not a reliable delimiter of meaning, as evidenced by multi-word expressions (e.g., "by the way"), crosslingual variation…

    Submitted 26 August, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

    Comments: COLM 2025 camera-ready
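
    To see why dropping the whitespace constraint changes what BPE can learn, here is a toy character-level BPE trainer with no pretokenization, so merges may cross word boundaries and produce multi-word tokens. This sketches the underlying idea only, not the paper's actual training recipe.

        from collections import Counter

        def train_bpe_no_pretokenization(text, num_merges):
            """Character-level BPE over raw text WITHOUT splitting on whitespace,
            so learned tokens can span word boundaries."""
            seq = list(text)
            merges = []
            for _ in range(num_merges):
                pairs = Counter(zip(seq, seq[1:]))
                if not pairs:
                    break
                (a, b), _ = pairs.most_common(1)[0]
                merges.append(a + b)
                merged, i = [], 0
                while i < len(seq):
                    if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                        merged.append(a + b)
                        i += 2
                    else:
                        merged.append(seq[i])
                        i += 1
                seq = merged
            return merges

        print(train_bpe_no_pretokenization("by the way, by the way, by the way", 10))
        # later merges contain spaces, ending with the multi-word token 'by the way'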

  23. arXiv:2501.00656  [pdf, ps, other]

    cs.CL cs.LG

    2 OLMo 2 Furious

    Authors: Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, Michal Guerquin, David Heineman, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, et al. (18 additional authors not shown)

    Abstract: We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes a family of dense autoregressive language models at 7B, 13B and 32B scales with fully released artifacts -- model weights, full training data, training code and recipes, training logs and thousands of intermediate checkpoints. In this work, we describe our modified model architecture and training recipe, focu…

    Submitted 8 October, 2025; v1 submitted 31 December, 2024; originally announced January 2025.

    Comments: Shorter version accepted to COLM 2025. Updated to include 32B results. Model demo available at playground.allenai.org

  24. arXiv:2412.04403  [pdf, ps, other]

    cs.CL cs.AI

    Establishing Task Scaling Laws via Compute-Efficient Model Ladders

    Authors: Akshita Bhagia, Jiacheng Liu, Alexander Wettig, David Heineman, Oyvind Tafjord, Ananya Harsh Jha, Luca Soldaini, Noah A. Smith, Dirk Groeneveld, Pang Wei Koh, Jesse Dodge, Hannaneh Hajishirzi

    Abstract: We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting. Standard power laws for language modeling loss cannot accurately model task performance. Therefore, we leverage a two-step prediction approach: (1) use model and data size to predict an intermediate loss, then (2) use it to predict task performan…

    Submitted 22 August, 2025; v1 submitted 5 December, 2024; originally announced December 2024.

    Comments: COLM 2025

  25. arXiv:2411.15124  [pdf, other]

    cs.CL

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Authors: Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, Hannaneh Hajishirzi

    Abstract: Language model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind proprietary ones. The underlying training data and recipes for post-training are simultaneously the most important pieces of the puzzle and the portion with the least transparency. To bridge this gap, we introduce…

    Submitted 14 April, 2025; v1 submitted 22 November, 2024; originally announced November 2024.

    Comments: Added Tulu 3 405B results and additional analyses

  26. arXiv:2410.19133  [pdf, ps, other]

    cs.CL

    Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback

    Authors: Lester James V. Miranda, Yizhong Wang, Yanai Elazar, Sachin Kumar, Valentina Pyatkin, Faeze Brahman, Noah A. Smith, Hannaneh Hajishirzi, Pradeep Dasigi

    Abstract: Learning from human feedback has enabled the alignment of language models (LMs) with human preferences. However, collecting human preferences is expensive and time-consuming, with highly variable annotation quality. An appealing alternative is to distill preferences from LMs as a source of synthetic annotations, offering a cost-effective and scalable alternative, albeit susceptible to other biases…

    Submitted 30 May, 2025; v1 submitted 24 October, 2024; originally announced October 2024.

    Comments: Code in https://github.com/allenai/hybrid-preferences, MultiPref dataset in https://huggingface.co/datasets/allenai/multipref, Updated related work and acknowledgments

  27. arXiv:2410.16560  [pdf, ps, other]

    cs.HC cs.AI cs.CL

    How Performance Pressure Influences AI-Assisted Decision Making

    Authors: Nikita Haduong, Noah A. Smith

    Abstract: Many domains now employ AI-based decision-making aids, and although the potential for AI systems to assist with decision making is much discussed, human-AI collaboration often underperforms due to factors such as (mis)trust in the AI system and beliefs about AI being incapable of completing subjective tasks. One potential tool for influencing human decision making is performance pressure, which ha…

    Submitted 21 August, 2025; v1 submitted 21 October, 2024; originally announced October 2024.

  28. arXiv:2410.16027  [pdf, other]

    cs.CL

    ComPO: Community Preferences for Language Model Personalization

    Authors: Sachin Kumar, Chan Young Park, Yulia Tsvetkov, Noah A. Smith, Hannaneh Hajishirzi

    Abstract: Conventional algorithms for training language models (LMs) with human feedback rely on preferences that are assumed to account for an "average" user, disregarding subjectivity and finer-grained variations. Recent studies have raised concerns that aggregating such diverse and often contradictory human feedback to finetune models results in generic models that generate outputs not preferred by many…

    Submitted 21 October, 2024; originally announced October 2024.

  29. arXiv:2410.12937  [pdf, other]

    cs.CL cs.LG

    Merge to Learn: Efficiently Adding Skills to Language Models with Model Merging

    Authors: Jacob Morrison, Noah A. Smith, Hannaneh Hajishirzi, Pang Wei Koh, Jesse Dodge, Pradeep Dasigi

    Abstract: Adapting general-purpose language models to new skills is currently an expensive process that must be repeated as new instruction datasets targeting new skills are created, or can cause the models to forget older skills. In this work, we investigate the effectiveness of adding new skills to preexisting models by training on the new skills in isolation and later merging with the general model (e.g.…

    Submitted 16 October, 2024; originally announced October 2024.

    Comments: Findings of EMNLP 2024

  30. arXiv:2409.17146  [pdf, other]

    cs.CV cs.CL cs.LG

    Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    Authors: Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, et al. (25 additional authors not shown)

    Abstract: Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open ones. As a result, the community has been missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs t…

    Submitted 5 December, 2024; v1 submitted 25 September, 2024; originally announced September 2024.

    Comments: Updated with ablations and more technical details

  31. arXiv:2409.02060  [pdf, other]

    cs.CL cs.AI cs.LG

    OLMoE: Open Mixture-of-Experts Language Models

    Authors: Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, Hannaneh Hajishirzi

    Abstract: We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat an…

    Submitted 2 March, 2025; v1 submitted 3 September, 2024; originally announced September 2024.

    Comments: 63 pages (24 main), 36 figures, 17 tables

  32. arXiv:2409.00316  [pdf, other]

    cs.CV cs.AI

    Toward a More Complete OMR Solution

    Authors: Guang Yang, Muru Zhang, Lin Qiu, Yanming Wan, Noah A. Smith

    Abstract: Optical music recognition (OMR) aims to convert music notation into digital formats. One approach to tackle OMR is through a multi-stage pipeline, where the system first detects visual music notation elements in the image (object detection) and then assembles them into a music notation (notation assembly). Most previous work on notation assembly unrealistically assumes perfect object detection. In…

    Submitted 30 August, 2024; originally announced September 2024.

  33. Risks and NLP Design: A Case Study on Procedural Document QA

    Authors: Nikita Haduong, Alice Gao, Noah A. Smith

    Abstract: As NLP systems are increasingly deployed at scale, concerns about their potential negative impacts have attracted the attention of the research community, yet discussions of risk have mostly been at an abstract level and focused on generic AI or NLP applications. We argue that clearer assessments of risks and harms to users--and concrete strategies to mitigate them--will be possible when we specia…

    Submitted 16 August, 2024; originally announced August 2024.

    Journal ref: Findings of the Association for Computational Linguistics ACL (2023) 1248-1269

  34. arXiv:2408.08853  [pdf, other]

    cs.HC

    CPS-TaskForge: Generating Collaborative Problem Solving Environments for Diverse Communication Tasks

    Authors: Nikita Haduong, Irene Wang, Bo-Ru Lu, Prithviraj Ammanabrolu, Noah A. Smith

    Abstract: Teams can outperform individuals; could adding AI teammates further bolster performance of teams solving problems collaboratively? Collaborative problem solving (CPS) research commonly studies teams with two agents (human-human or human-AI), but team research literature finds that, for complex tasks, larger teams are more effective. Progress in studying collaboration with more than two agents, thr…

    Submitted 21 October, 2024; v1 submitted 16 August, 2024; originally announced August 2024.

  35. arXiv:2408.06518  [pdf, other]

    cs.CL

    Does Liking Yellow Imply Driving a School Bus? Semantic Leakage in Language Models

    Authors: Hila Gonen, Terra Blevins, Alisa Liu, Luke Zettlemoyer, Noah A. Smith

    Abstract: Despite their wide adoption, the biases and unintended behaviors of language models remain poorly understood. In this paper, we identify and characterize a phenomenon never discussed before, which we call semantic leakage, where models leak irrelevant information from the prompt into the generation in unexpected ways. We propose an evaluation setting to detect semantic leakage both by humans and a…

    Submitted 15 May, 2025; v1 submitted 12 August, 2024; originally announced August 2024.

  36. arXiv:2407.16607  [pdf, other]

    cs.CL cs.LG

    Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

    Authors: Jonathan Hayase, Alisa Liu, Yejin Choi, Sewoong Oh, Noah A. Smith

    Abstract: The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information: byte-pair enc…

    Submitted 30 November, 2024; v1 submitted 23 July, 2024; originally announced July 2024.

    Comments: NeurIPS camera-ready, code at https://github.com/alisawuffles/tokenizer-attack

  37. arXiv:2407.12043  [pdf, other]

    cs.CL cs.AI cs.HC

    The Art of Saying No: Contextual Noncompliance in Language Models

    Authors: Faeze Brahman, Sachin Kumar, Vidhisha Balachandran, Pradeep Dasigi, Valentina Pyatkin, Abhilasha Ravichander, Sarah Wiegreffe, Nouha Dziri, Khyathi Chandu, Jack Hessel, Yulia Tsvetkov, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi

    Abstract: Chat-based language models are designed to be helpful, yet they should not comply with every user request. While most existing work primarily focuses on refusal of "unsafe" queries, we posit that the scope of noncompliance should be broadened. We introduce a comprehensive taxonomy of contextual noncompliance describing when and how models should not comply with user requests. Our taxonomy spans a…

    Submitted 22 November, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

    Comments: The first two authors are co-first authors; Accepted at NeurIPS 2024 Track on Datasets and Benchmarks

  38. arXiv:2407.08818  [pdf]

    cs.CL

    MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization

    Authors: Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Valentin Hofmann, Tomasz Limisiewicz, Yulia Tsvetkov, Noah A. Smith

    Abstract: In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost. Specifically, previous studies have reported multiple modeling biases that the current tokenization algorithms introduce to non-Latin script languages, the main one being over-segmentation. In this work, we propose MAGNET; multilingual adaptiv…

    Submitted 16 November, 2024; v1 submitted 11 July, 2024; originally announced July 2024.

  39. arXiv:2407.06460  [pdf, other]

    cs.CL cs.AI

    MUSE: Machine Unlearning Six-Way Evaluation for Language Models

    Authors: Weijia Shi, Jaechan Lee, Yangsibo Huang, Sadhika Malladi, Jieyu Zhao, Ari Holtzman, Daogao Liu, Luke Zettlemoyer, Noah A. Smith, Chiyuan Zhang

    Abstract: Language models (LMs) are trained on vast amounts of text data, which may include private and copyrighted content. Data owners may request the removal of their data from a trained model due to privacy or copyright concerns. However, exactly unlearning only these datapoints (i.e., retraining with the data removed) is intractable in modern-day models. This has led to the development of many approxim…

    Submitted 14 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

  40. arXiv:2406.19564  [pdf, other]

    cs.CL

    Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects

    Authors: Orevaoghene Ahia, Anuoluwapo Aremu, Diana Abagyan, Hila Gonen, David Ifeoluwa Adelani, Daud Abolade, Noah A. Smith, Yulia Tsvetkov

    Abstract: Yorùbá, an African language with roughly 47 million speakers, encompasses a continuum with several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities for dialects and varieties for which there are little to no resources or tools. We take steps towards bridging this gap by introducing a new high-quality parallel…

    Submitted 27 June, 2024; originally announced June 2024.

  41. arXiv:2406.18853  [pdf, other]

    cs.LG

    Decoding-Time Language Model Alignment with Multiple Objectives

    Authors: Ruizhe Shi, Yifang Chen, Yushi Hu, Alisa Liu, Hannaneh Hajishirzi, Noah A. Smith, Simon S. Du

    Abstract: Aligning language models (LMs) to human preferences has emerged as a critical pursuit, enabling these models to better serve diverse user needs. Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives. Here, we propose $\textbf{multi-objective decoding (MOD)}$, a decoding-time algorithm that outputs the next token from a lin…

    Submitted 27 October, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

    Comments: NeurIPS accepted version
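
    A sketch of the decoding-time combination the abstract describes, under our assumption that each single-objective model exposes next-token log-probabilities: mix them linearly with user-chosen weights and renormalize before sampling. The paper's exact weighting scheme may differ.

        import numpy as np

        def mod_next_token_logprobs(per_model_logprobs, weights):
            """Combine next-token log-probabilities from several single-objective
            models with a weighted sum, then renormalize to a distribution."""
            combined = sum(w * lp for w, lp in zip(weights, per_model_logprobs))
            return combined - np.logaddexp.reduce(combined)  # log-normalize

        # Toy vocabulary of size 4; two models aligned to different objectives.
        lp_helpful = np.log(np.array([0.70, 0.10, 0.10, 0.10]))
        lp_concise = np.log(np.array([0.10, 0.70, 0.10, 0.10]))
        mixed = mod_next_token_logprobs([lp_helpful, lp_concise], [0.5, 0.5])
        token = np.random.choice(4, p=np.exp(mixed))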

  42. arXiv:2406.18664  [pdf, other]

    cs.CL cs.LG

    Evaluating Copyright Takedown Methods for Language Models

    Authors: Boyi Wei, Weijia Shi, Yangsibo Huang, Noah A. Smith, Chiyuan Zhang, Luke Zettlemoyer, Kai Li, Peter Henderson

    Abstract: Language models (LMs) derive their capabilities from extensive training on diverse data, including potentially copyrighted material. These models can memorize and generate content similar to their training data, posing potential concerns. Therefore, model creators are motivated to develop mitigation methods that prevent generating protected content. We term this procedure as copyright takedowns fo…

    Submitted 11 October, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

    Comments: 31 pages, 9 figures, 14 tables

  43. arXiv:2406.13069  [pdf, ps, other]

    cs.CL cs.AI

    Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG

    Authors: William Merrill, Noah A. Smith, Yanai Elazar

    Abstract: How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate $n$-grams from their training data, evaluating both (i) the probability LMs assign to complete training $n$-grams and (ii) $n$-novelty, the proportion of $n$-grams generated by an LM that did not appear in the training data (for arbitrarily…

    Submitted 22 August, 2025; v1 submitted 18 June, 2024; originally announced June 2024.

    Comments: To appear at EMNLP 2024
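
    The $n$-novelty statistic the abstract defines is easy to state in code. Below is a minimal version over an in-memory n-gram set; the paper's point is answering the same membership queries against full pretraining corpora with a compressed DAWG, which this sketch does not attempt.

        def n_novelty(generated, training_ngrams, n):
            """Proportion of n-grams in `generated` that never occur in training.
            `training_ngrams` is a set of token tuples; Rusty-DAWG replaces it
            with an index over the entire corpus."""
            grams = [tuple(generated[i:i + n]) for i in range(len(generated) - n + 1)]
            if not grams:
                return 0.0
            return sum(g not in training_ngrams for g in grams) / len(grams)

        train = "the cat sat on the mat".split()
        train_bigrams = {tuple(train[i:i + 2]) for i in range(len(train) - 1)}
        print(n_novelty("the cat ate the mat".split(), train_bigrams, 2))  # 0.5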

  44. arXiv:2406.09403  [pdf, other]

    cs.CV cs.CL

    Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

    Authors: Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, Ranjay Krishna

    Abstract: Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such actions are missing in current multimodal language models (LMs). Current chain-of-thought and tool-use paradigms only use text as intermediate reasoning steps. In t…

    Submitted 10 November, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted to NeurIPS 2024. Project and codes url: https://visualsketchpad.github.io/

  45. arXiv:2406.09279  [pdf, other]

    cs.CL

    Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

    Authors: Hamish Ivison, Yizhong Wang, Jiacheng Liu, Zeqiu Wu, Valentina Pyatkin, Nathan Lambert, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi

    Abstract: Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models (LMs). Despite its widespread use, the way preference-based learning is applied varies wildly, with differing data, learning algorithms, and evaluations used, making disentangling the impact of each aspect difficult. In this work, we identify four core a…

    Submitted 7 October, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: NeurIPS 2024 camera-ready

  46. arXiv:2405.06563  [pdf, other]

    cs.CL

    What Can Natural Language Processing Do for Peer Review?

    Authors: Ilia Kuznetsov, Osama Mohammed Afzal, Koen Dercksen, Nils Dycke, Alexander Goldberg, Tom Hope, Dirk Hovy, Jonathan K. Kummerfeld, Anne Lauscher, Kevin Leyton-Brown, Sheng Lu, Mausam, Margot Mieskes, Aurélie Névéol, Danish Pruthi, Lizhen Qu, Roy Schwartz, Noah A. Smith, Thamar Solorio, Jingyan Wang, Xiaodan Zhu, Anna Rogers, Nihar B. Shah, Iryna Gurevych

    Abstract: The number of scientific articles produced every year is growing rapidly. Providing quality control over them is crucial for scientists and, ultimately, for the public good. In modern science, this process is largely delegated to peer review -- a distributed procedure in which each submission is evaluated by several independent experts in the field. Peer review is widely used, yet it is hard, time…

    Submitted 10 May, 2024; originally announced May 2024.

  47. arXiv:2404.16367  [pdf, other]

    cs.CL cs.LG

    Learning Syntax Without Planting Trees: Understanding Hierarchical Generalization in Transformers

    Authors: Kabir Ahuja, Vidhisha Balachandran, Madhur Panwar, Tianxing He, Noah A. Smith, Navin Goyal, Yulia Tsvetkov

    Abstract: Transformers trained on natural language data have been shown to learn its hierarchical structure and generalize to sentences with unseen syntactic structures without explicitly encoding any structural bias. In this work, we investigate sources of inductive bias in transformer models and their training that could cause such generalization behavior to emerge. We extensively experiment with transfor…

    Submitted 16 March, 2025; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: Accepted in TACL. Code now available: https://github.com/kabirahuja2431/transformers-hg

  48. arXiv:2404.12390  [pdf, other]

    cs.CV cs.AI cs.CL

    BLINK: Multimodal Large Language Models Can See but Not Perceive

    Authors: Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, Ranjay Krishna

    Abstract: We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks cast significant challeng…

    Submitted 3 July, 2024; v1 submitted 18 April, 2024; originally announced April 2024.

    Comments: Multimodal Benchmark, Project Url: https://zeyofu.github.io/blink/, ECCV 2024

  49. arXiv:2403.14072  [pdf, other]

    cs.CL

    A Taxonomy of Ambiguity Types for NLP

    Authors: Margaret Y. Li, Alisa Liu, Zhaofeng Wu, Noah A. Smith

    Abstract: Ambiguity is a critical component of language that allows for more effective communication between speakers, but is often ignored in NLP. Recent work suggests that NLP systems may struggle to grasp certain elements of human language understanding because they may not handle ambiguities at the level that humans naturally do in communication. Additionally, different types of ambiguity may serve dif…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: To appear at the UnImplicit workshop at EACL 2024

  50. arXiv:2403.13787  [pdf, other]

    cs.LG

    RewardBench: Evaluating Reward Models for Language Modeling

    Authors: Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi

    Abstract: Reward models (RMs) are at the crux of successfully using RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of language models and which values are embedded in them. Resources for reward model training a…

    Submitted 8 June, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

    Comments: 44 pages, 19 figures, 12 tables