
Showing 1–23 of 23 results for author: Wettig, A

  1. arXiv:2603.24477  [pdf, ps, other]

    cs.SE cs.LG

    Composer 2 Technical Report

    Authors: Cursor Research: Aaron Chan, Ahmed Shalaby, Alexander Wettig, Aman Sanger, Andrew Zhai, Anurag Ajay, Ashvin Nair, Charlie Snell, Chen Lu, Chen Shen, Emily Jia, Federico Cassano, Hanpeng Liu, Haoyu Chen, Henry Wildermuth, Jacob Jackson, Janet Li, Jediah Katz, Jiajun Yao, Joey Hejna, Josh Warner, Julius Vering, Kevin Frans, et al. (31 additional authors not shown)

    Abstract: Composer 2 is a specialized model designed for agentic software engineering. The model demonstrates strong long-term planning and coding intelligence while maintaining the ability to efficiently solve problems for interactive use. The model is trained in two phases: first, continued pretraining to improve the model's knowledge and latent coding ability, followed by large-scale reinforcement learni…

    Submitted 25 March, 2026; v1 submitted 25 March, 2026; originally announced March 2026.

  2. arXiv:2512.13961  [pdf, ps, other]

    cs.CL cs.LG

    Olmo 3

    Authors: Team Olmo: Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng, et al. (44 additional authors not shown)

    Abstract: We introduce Olmo 3, a family of state-of-the-art, fully-open language models at the 7B and 32B parameter scales. Olmo 3 model construction targets long-context reasoning, function calling, coding, instruction following, general chat, and knowledge recall. This release includes the entire model flow, i.e., the full lifecycle of the family of models, including every stage, checkpoint, data point, a…

    Submitted 14 April, 2026; v1 submitted 15 December, 2025; originally announced December 2025.

    Comments: minor edit updates

  3. arXiv:2510.18148  [pdf, ps, other]

    cs.CL cs.LG

    Extracting Rule-based Descriptions of Attention Features in Transformers

    Authors: Dan Friedman, Adithya Bhaskar, Alexander Wettig, Danqi Chen

    Abstract: Mechanistic interpretability strives to explain model behavior in terms of bottom-up primitives. The leading paradigm is to express hidden states as a sparse linear combination of basis vectors, called features. However, this only identifies which text sequences (exemplars) activate which features; the actual interpretation of features requires subjective inspection of these exemplars. This paper… (a toy sparse-reconstruction sketch follows this entry)

    Submitted 20 October, 2025; originally announced October 2025.

    Comments: Our code is available at https://github.com/princeton-nlp/AttentionRules
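
    The abstract's premise is that hidden states are decomposed into sparse combinations of feature vectors. The toy sketch below illustrates only that generic decomposition, using greedy matching pursuit over a random dictionary; all sizes are invented and this is not the paper's rule-extraction method.

        # Toy illustration (not the paper's method): reconstruct a hidden state
        # as a sparse combination of feature (dictionary) vectors via greedy
        # matching pursuit. All sizes are made up.
        import numpy as np

        rng = np.random.default_rng(0)
        d, n_features, k = 64, 512, 8            # hidden size, dictionary size, sparsity
        features = rng.standard_normal((n_features, d))
        features /= np.linalg.norm(features, axis=1, keepdims=True)
        h = rng.standard_normal(d)               # stand-in for a hidden state

        residual, coeffs = h.copy(), np.zeros(n_features)
        for _ in range(k):
            scores = features @ residual         # correlation with the residual
            j = int(np.argmax(np.abs(scores)))   # pick the best-matching feature
            coeffs[j] += scores[j]
            residual -= scores[j] * features[j]

        recon = coeffs @ features
        print("active features:", np.flatnonzero(coeffs))
        print("reconstruction error:", np.linalg.norm(h - recon))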

  4. arXiv:2506.17121  [pdf, ps, other]

    cs.CL

    Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?

    Authors: Adithya Bhaskar, Alexander Wettig, Tianyu Gao, Yihe Dong, Danqi Chen

    Abstract: Language models handle increasingly long contexts for tasks such as book summarization, but this leads to growing memory costs for the key-value (KV) cache. Many prior works have proposed ways of discarding KVs from memory, but their approaches are tailored to favorable settings, obscuring caveats like high peak memory and performance degradation, and a fair comparison between methods is difficult… (a generic KV-eviction sketch follows this entry)

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: We release our code publicly at https://github.com/princeton-pli/PruLong
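
    As a rough illustration of the KV-discarding setting the abstract describes, the sketch below evicts cached key-value pairs down to a fixed budget by keeping a recency window plus the most-attended entries. It is a generic heuristic with made-up shapes, not the paper's PruLong method.

        # Generic KV-cache eviction under a fixed budget (not PruLong itself):
        # keep the most recent tokens plus the past entries that received the
        # most attention mass from the current query.
        import numpy as np

        rng = np.random.default_rng(0)
        d, seq_len, budget, recent = 32, 200, 64, 16
        keys = rng.standard_normal((seq_len, d))
        values = rng.standard_normal((seq_len, d))
        query = rng.standard_normal(d)                        # current decoding step

        scores = keys @ query / np.sqrt(d)
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()

        keep = set(range(seq_len - recent, seq_len))          # always keep a recency window
        for idx in np.argsort(-attn):                         # then the most-attended KVs
            if len(keep) >= budget:
                break
            keep.add(int(idx))

        kept = np.array(sorted(keep))
        keys, values = keys[kept], values[kept]
        print(f"kept {len(kept)}/{seq_len} KV pairs")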

  5. arXiv:2504.21798  [pdf, other]

    cs.SE cs.AI cs.CL

    SWE-smith: Scaling Data for Software Engineering Agents

    Authors: John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, Diyi Yang

    Abstract: Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up sever…

    Submitted 21 May, 2025; v1 submitted 30 April, 2025; originally announced April 2025.

    Comments: All assets available at https://swesmith.com

  6. arXiv:2504.06536  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Lugha-Llama: Adapting Large Language Models for African Languages

    Authors: Happy Buzaaba, Alexander Wettig, David Ifeoluwa Adelani, Christiane Fellbaum

    Abstract: Large language models (LLMs) have achieved impressive results in a wide range of natural language applications. However, they often struggle to recognize low-resource languages, in particular African languages, which are not well represented in large training corpora. In this paper, we consider how to adapt LLMs to low-resource African languages. We find that combining curated data from African la…

    Submitted 8 April, 2025; originally announced April 2025.

  7. arXiv:2502.10341  [pdf, ps, other]

    cs.CL

    Organize the Web: Constructing Domains Enhances Pre-Training Data Curation

    Authors: Alexander Wettig, Kyle Lo, Sewon Min, Hannaneh Hajishirzi, Danqi Chen, Luca Soldaini

    Abstract: Modern language models are trained on large, unstructured datasets consisting of trillions of tokens and obtained by crawling the web. The unstructured nature makes it difficult to reason about their contents and develop systematic approaches to data curation. In this paper, we unpack monolithic web corpora by developing taxonomies of their contents and organizing them into domains. We introduce W… (a toy domain-bucketing sketch follows this entry)

    Submitted 16 July, 2025; v1 submitted 14 February, 2025; originally announced February 2025.

    Comments: Accepted at ICML 2025. Project page: https://weborganizer.allen.ai
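
    The sketch below illustrates only the general recipe of bucketing web documents into domains and re-weighting the training mixture. The classifier, domain names, and mixture weights are placeholders, not WebOrganizer's actual taxonomy or learned classifiers.

        # Toy domain bucketing and mixture re-weighting (all names and weights
        # are made up; the real system uses learned taxonomies/classifiers).
        from collections import defaultdict
        import random

        def classify(doc: str) -> str:
            # Hypothetical stand-in for a learned domain classifier.
            if "def " in doc or "import " in doc:
                return "code"
            if "theorem" in doc or "equation" in doc:
                return "science"
            return "general_web"

        corpus = ["import numpy as np ...", "We prove the theorem ...", "Top 10 travel tips ..."]
        domains = defaultdict(list)
        for doc in corpus:
            domains[classify(doc)].append(doc)

        target_mix = {"code": 0.3, "science": 0.3, "general_web": 0.4}
        random.seed(0)
        sampled_domain = random.choices(list(target_mix), weights=list(target_mix.values()), k=1)[0]
        print("next training doc drawn from:", sampled_domain)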

  8. arXiv:2501.01956  [pdf, ps, other]

    cs.CL

    Metadata Conditioning Accelerates Language Model Pre-training

    Authors: Tianyu Gao, Alexander Wettig, Luxi He, Yihe Dong, Sadhika Malladi, Danqi Chen

    Abstract: The vast diversity of styles, domains, and quality levels present in language model pre-training corpora is essential in developing general model capabilities, but efficiently learning and deploying the correct behaviors exemplified in each of these heterogeneous data sources is challenging. To address this, we propose a new method, termed Metadata Conditioning then Cooldown (MeCo), to incorporate… (a minimal metadata-conditioning sketch follows this entry)

    Submitted 27 June, 2025; v1 submitted 3 January, 2025; originally announced January 2025.

    Comments: Accepted to ICML 2025. Code available at https://github.com/princeton-pli/MeCo
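
    Based only on the abstract, MeCo conditions pre-training on document metadata and then drops it for a final cooldown. The sketch below shows that two-phase formatting idea with a hypothetical URL field; it is a minimal sketch, not the released implementation.

        # Minimal sketch: prepend source metadata during most of pre-training,
        # then format examples without it in a final cooldown so the model also
        # works when no metadata is available. Field names are illustrative.
        def format_example(doc: dict, with_metadata: bool) -> str:
            prefix = f"URL: {doc['url']}\n\n" if with_metadata else ""
            return prefix + doc["text"]

        doc = {"url": "en.wikipedia.org", "text": "The Nile is a river in Africa."}
        main_phase_example = format_example(doc, with_metadata=True)    # bulk of training
        cooldown_example = format_example(doc, with_metadata=False)     # final cooldown
        print(main_phase_example)
        print(cooldown_example)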

  9. arXiv:2412.04403  [pdf, ps, other]

    cs.CL cs.AI

    Establishing Task Scaling Laws via Compute-Efficient Model Ladders

    Authors: Akshita Bhagia, Jiacheng Liu, Alexander Wettig, David Heineman, Oyvind Tafjord, Ananya Harsh Jha, Luca Soldaini, Noah A. Smith, Dirk Groeneveld, Pang Wei Koh, Jesse Dodge, Hannaneh Hajishirzi

    Abstract: We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting. Standard power laws for language modeling loss cannot accurately model task performance. Therefore, we leverage a two-step prediction approach: (1) use model and data size to predict an intermediate loss, then (2) use it to predict task performan… (a worked two-step sketch follows this entry)

    Submitted 22 August, 2025; v1 submitted 5 December, 2024; originally announced December 2024.

    Comments: COLM 2025
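
    The two-step prediction described in the abstract can be illustrated with a loss power law in model size N and data size D, followed by a sigmoid mapping from loss to task accuracy. The functional forms below follow common scaling-law practice, and every coefficient is a made-up placeholder rather than a fitted value from the paper.

        # Worked sketch of a two-step task prediction:
        # (1) predict an intermediate loss from model size N and data size D,
        # (2) map that loss to task accuracy. All coefficients are placeholders.
        import math

        def predicted_loss(N, D, A=400.0, alpha=0.34, B=2000.0, beta=0.28, E=1.7):
            return E + A / N**alpha + B / D**beta                  # step 1: loss from (N, D)

        def predicted_accuracy(loss, a=0.75, b=4.0, c=2.2, floor=0.25):
            return floor + a / (1.0 + math.exp(b * (loss - c)))    # step 2: sigmoid in loss

        for N, D in [(7e9, 2e12), (32e9, 6e12)]:                   # hypothetical ladder points
            L = predicted_loss(N, D)
            print(f"N={N:.0e}, D={D:.0e} -> loss {L:.3f}, acc {predicted_accuracy(L):.3f}")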

  10. arXiv:2410.02660  [pdf, ps, other]

    cs.CL cs.LG

    How to Train Long-Context Language Models (Effectively)

    Authors: Tianyu Gao, Alexander Wettig, Howard Yen, Danqi Chen

    Abstract: We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. We first establish a reliable evaluation protocol to guide model development -- instead of perplexity or simple needle-in-a-haystack (NIAH) tests, we use a broad set of long-context downstream tasks, and we evaluate models after SFT as this better reveals long-co…

    Submitted 3 December, 2025; v1 submitted 3 October, 2024; originally announced October 2024.

    Comments: Accepted to ACL 2025. Our code, data, and models are available at https://github.com/princeton-nlp/ProLong

  11. arXiv:2409.02060  [pdf, other]

    cs.CL cs.AI cs.LG

    OLMoE: Open Mixture-of-Experts Language Models

    Authors: Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, Hannaneh Hajishirzi

    Abstract: We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat an… (a toy top-k routing sketch follows this entry)

    Submitted 2 March, 2025; v1 submitted 3 September, 2024; originally announced September 2024.

    Comments: 63 pages (24 main), 36 figures, 17 tables
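
    To make the idea of 7B total parameters with only about 1B active per token concrete, the sketch below implements a generic top-k routed Mixture-of-Experts layer in PyTorch. Sizes, the number of experts, and the top-k value are illustrative and do not describe OLMoE's actual architecture.

        # Generic top-k MoE layer: each token is routed to a few experts, so only
        # a fraction of the total parameters is used per token.
        import torch
        import torch.nn as nn

        class ToyMoE(nn.Module):
            def __init__(self, d_model=128, n_experts=8, top_k=2):
                super().__init__()
                self.router = nn.Linear(d_model, n_experts)
                self.experts = nn.ModuleList(
                    nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                  nn.Linear(4 * d_model, d_model))
                    for _ in range(n_experts))
                self.top_k = top_k

            def forward(self, x):                        # x: (tokens, d_model)
                logits = self.router(x)
                weights, idx = logits.topk(self.top_k, dim=-1)
                weights = weights.softmax(dim=-1)
                out = torch.zeros_like(x)
                for slot in range(self.top_k):           # only top-k experts run per token
                    for e in range(len(self.experts)):
                        mask = idx[:, slot] == e
                        if mask.any():
                            out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
                return out

        y = ToyMoE()(torch.randn(4, 128))
        print(y.shape)                                   # torch.Size([4, 128])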

  12. arXiv:2406.16778  [pdf, other]

    cs.CL

    Finding Transformer Circuits with Edge Pruning

    Authors: Adithya Bhaskar, Alexander Wettig, Dan Friedman, Danqi Chen

    Abstract: The path to interpreting a language model often proceeds via analysis of circuits -- sparse computational subgraphs of the model that capture specific aspects of its behavior. Recent work has automated the task of discovering circuits. Yet, these methods have practical limitations, as they rely either on inefficient search algorithms or inaccurate approximations. In this paper, we frame automated… (a toy edge-masking sketch follows this entry)

    Submitted 2 April, 2025; v1 submitted 24 June, 2024; originally announced June 2024.

    Comments: NeurIPS 2024 (Spotlight), code available at https://github.com/princeton-nlp/Edge-Pruning
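
    The abstract frames circuit discovery as an optimization problem. The toy below learns one sigmoid gate per edge and trades off faithfulness to a target output against the number of edges kept; the "model" is a random linear stand-in, not the paper's implementation.

        # Conceptual sketch: continuous masks over edges, optimized for
        # faithfulness plus a sparsity penalty on the gates.
        import torch

        torch.manual_seed(0)
        n_edges = 20
        edge_logits = torch.zeros(n_edges, requires_grad=True)   # one learnable gate per edge
        target = torch.randn(8)                                   # behavior we want to preserve

        def pruned_model_output(gates):
            # Stand-in for running the model with edges scaled by their gates.
            weights = torch.randn(8, n_edges, generator=torch.Generator().manual_seed(1))
            return weights @ gates

        opt = torch.optim.Adam([edge_logits], lr=0.1)
        for step in range(200):
            gates = torch.sigmoid(edge_logits)
            faithfulness = (pruned_model_output(gates) - target).pow(2).mean()
            sparsity = gates.sum()                                # encourages few active edges
            loss = faithfulness + 0.05 * sparsity
            opt.zero_grad()
            loss.backward()
            opt.step()

        print("edges kept:", (torch.sigmoid(edge_logits) > 0.5).sum().item(), "of", n_edges)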

  13. arXiv:2405.15793  [pdf, other]

    cs.SE cs.AI cs.CL cs.HC cs.LG

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    Authors: John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, Ofir Press

    Abstract: Language model (LM) agents are increasingly being used to automate complicated tasks in digital environments. Just as humans benefit from powerful software applications, such as integrated development environments, for complex tasks like software engineering, we posit that LM agents represent a new category of end users with their own needs and abilities, and would benefit from specially-built int…

    Submitted 11 November, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

    Comments: Code, data, and demo available at https://swe-agent.com

  14. arXiv:2402.11111  [pdf, other]

    cs.CL

    Language Models as Science Tutors

    Authors: Alexis Chevalier, Jiayi Geng, Alexander Wettig, Howard Chen, Sebastian Mizera, Toni Annala, Max Jameson Aragon, Arturo Rodríguez Fanlo, Simon Frieder, Simon Machado, Akshara Prabhakar, Ellie Thieu, Jiachen T. Wang, Zirui Wang, Xindi Wu, Mengzhou Xia, Wenhan Xia, Jiatong Yu, Jun-Jie Zhu, Zhiyong Jason Ren, Sanjeev Arora, Danqi Chen

    Abstract: NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use-cases of LMs for science, including applications in education that require processing long scientific documents. To address this, we introduce TutorEval and TutorChat. TutorEval is a diverse question-answering bench…

    Submitted 21 July, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

    Comments: 8 pages without bibliography and appendix, 26 pages total

  15. arXiv:2402.09739  [pdf, other]

    cs.CL cs.LG

    QuRating: Selecting High-Quality Data for Training Language Models

    Authors: Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen

    Abstract: Selecting high-quality pre-training data is important for creating capable language models, but existing methods rely on simple heuristics. We introduce QuRating, a method for selecting pre-training data that can capture human intuitions about data quality. In this paper, we investigate four qualities - writing style, required expertise, facts & trivia, and educational value - and find that LLMs a… (a toy quality-sampling sketch follows this entry)

    Submitted 17 July, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

    Comments: Accepted at ICML 2024. The results for top-k selection have been corrected. The code, models and data are available at https://github.com/princeton-nlp/QuRating
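
    A generic way to turn per-document quality scores into a training set is to sample documents with probability proportional to exp(score / temperature). The sketch below shows only that sampler with invented scores; it is not a claim about QuRating's exact selection procedure.

        # Toy quality-weighted data selection: higher-rated documents are more
        # likely to be sampled, with a temperature controlling how aggressive
        # the selection is. Scores and temperature are made up.
        import numpy as np

        rng = np.random.default_rng(0)
        docs = ["doc_a", "doc_b", "doc_c", "doc_d"]
        quality = np.array([0.2, 1.5, -0.3, 0.9])        # higher = judged more educational
        temperature = 1.0

        probs = np.exp(quality / temperature)
        probs /= probs.sum()
        selected = rng.choice(docs, size=3, replace=False, p=probs)
        print("selected for training:", list(selected))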

  16. arXiv:2310.19156  [pdf, other]

    cs.CL cs.IR

    Poisoning Retrieval Corpora by Injecting Adversarial Passages

    Authors: Zexuan Zhong, Ziqing Huang, Alexander Wettig, Danqi Chen

    Abstract: Dense retrievers have achieved state-of-the-art performance in various information retrieval tasks, but to what extent can they be safely deployed in real-world applications? In this work, we propose a novel attack for dense retrieval systems in which a malicious user generates a small number of adversarial passages by perturbing discrete tokens to maximize similarity with a provided set of traini…

    Submitted 29 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023. Our code is available at https://github.com/princeton-nlp/corpus-poisoning

  17. arXiv:2310.06770  [pdf, other]

    cs.CL cs.AI cs.SE

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Authors: Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan

    Abstract: Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 softw…

    Submitted 11 November, 2024; v1 submitted 10 October, 2023; originally announced October 2023.

    Comments: ICLR 2024 (https://openreview.net/forum?id=VTF8yNQM66). Data, code, and leaderboard are available at https://www.swebench.com

  18. arXiv:2306.01128  [pdf, other]

    cs.LG cs.CL

    Learning Transformer Programs

    Authors: Dan Friedman, Alexander Wettig, Danqi Chen

    Abstract: Recent research in mechanistic interpretability has attempted to reverse-engineer Transformer models by carefully inspecting network weights and activations. However, these approaches require considerable manual effort and still fall short of providing complete, faithful descriptions of the underlying algorithms. In this work, we introduce a procedure for training Transformers that are mechanistic…

    Submitted 30 October, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: NeurIPS 2023 (oral). Our code is available at https://github.com/princeton-nlp/TransformerPrograms

  19. arXiv:2305.14788  [pdf, other]

    cs.CL

    Adapting Language Models to Compress Contexts

    Authors: Alexis Chevalier, Alexander Wettig, Anirudh Ajith, Danqi Chen

    Abstract: Transformer-based language models (LMs) are powerful and widely-applicable tools, but their usefulness is constrained by a finite context window and the expensive computational cost of processing long text documents. We propose to adapt pre-trained LMs into AutoCompressors. These language models are capable of compressing long contexts into compact summary vectors, which are then accessible to the…

    Submitted 4 November, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: Accepted to EMNLP 2023; added results for Llama-2-7B model

  20. arXiv:2210.11560  [pdf, other]

    cs.CL

    Finding Dataset Shortcuts with Grammar Induction

    Authors: Dan Friedman, Alexander Wettig, Danqi Chen

    Abstract: Many NLP datasets have been found to contain shortcuts: simple decision rules that achieve surprisingly high accuracy. However, it is difficult to discover shortcuts automatically. Prior work on automatic shortcut detection has focused on enumerating features like unigrams or bigrams, which can find only low-level shortcuts, or relied on post-hoc model interpretability methods like saliency maps,…

    Submitted 20 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022. Our code is publicly available at https://github.com/princeton-nlp/ShortcutGrammar

  21. arXiv:2210.05643  [pdf, other]

    cs.LG cs.CL

    A Kernel-Based View of Language Model Fine-Tuning

    Authors: Sadhika Malladi, Alexander Wettig, Dingli Yu, Danqi Chen, Sanjeev Arora

    Abstract: It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK) - which originated as a mode… (an empirical-NTK sketch follows this entry)

    Submitted 6 June, 2023; v1 submitted 11 October, 2022; originally announced October 2022.

    Comments: Accepted at ICML 2023. Code and pre-computed kernels are publicly available at https://github.com/princeton-nlp/LM-Kernel-FT
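
    The Neural Tangent Kernel mentioned in the abstract is the inner product of parameter gradients, K(x, x') = <grad_theta f(x), grad_theta f(x')>. The sketch below computes a single empirical NTK entry for a tiny toy network rather than for a pre-trained LM.

        # One empirical NTK entry for a toy network: the dot product of the
        # parameter gradients of the scalar output at two inputs.
        import torch
        import torch.nn as nn

        torch.manual_seed(0)
        model = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 1))

        def grad_vector(x):
            out = model(x).squeeze()
            grads = torch.autograd.grad(out, list(model.parameters()))
            return torch.cat([g.flatten() for g in grads])

        x1, x2 = torch.randn(16), torch.randn(16)
        g1, g2 = grad_vector(x1), grad_vector(x2)
        ntk_value = torch.dot(g1, g2)
        print("empirical NTK entry K(x1, x2):", ntk_value.item())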

  22. arXiv:2202.08005  [pdf, other]

    cs.CL cs.LG

    Should You Mask 15% in Masked Language Modeling?

    Authors: Alexander Wettig, Tianyu Gao, Zexuan Zhong, Danqi Chen

    Abstract: Masked language models (MLMs) conventionally mask 15% of tokens due to the belief that more masking would leave insufficient context to learn good representations; this masking rate has been widely used, regardless of model sizes or masking strategies. In this work, we revisit this important choice of MLM pre-training. We first establish that 15% is not universally optimal, and larger models shoul… (a toy masking sketch follows this entry)

    Submitted 10 February, 2023; v1 submitted 16 February, 2022; originally announced February 2022.

    Comments: Accepted to EACL 2023. The code and pre-trained models are available at https://github.com/princeton-nlp/DinkyTrain
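
    The choice being revisited is the masking rate in BERT-style masked language modeling. The sketch below applies standard 80/10/10 corruption at a configurable rate; the vocabulary size and the [MASK] token id are made up.

        # Toy MLM masking with a configurable rate instead of a fixed 15%.
        import random

        MASK_ID, VOCAB_SIZE = 103, 30000

        def mask_tokens(token_ids, mask_rate=0.15, seed=0):
            rng = random.Random(seed)
            corrupted, labels = list(token_ids), [-100] * len(token_ids)
            for i, tok in enumerate(token_ids):
                if rng.random() < mask_rate:
                    labels[i] = tok                          # predict the original token here
                    r = rng.random()
                    if r < 0.8:
                        corrupted[i] = MASK_ID               # 80%: replace with [MASK]
                    elif r < 0.9:
                        corrupted[i] = rng.randrange(VOCAB_SIZE)   # 10%: random token
                    # remaining 10%: keep the original token
            return corrupted, labels

        print(mask_tokens([2040, 11, 412, 998, 55, 671, 7, 3021], mask_rate=0.40))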

  23. arXiv:2109.08133  [pdf, other]

    cs.CL cs.IR

    Phrase Retrieval Learns Passage Retrieval, Too

    Authors: Jinhyuk Lee, Alexander Wettig, Danqi Chen

    Abstract: Dense retrieval methods have shown great promise over sparse retrieval methods in a range of NLP problems. Among them, dense phrase retrieval (the most fine-grained retrieval unit) is appealing because phrases can be directly used as the output for question answering and slot filling tasks. In this work, we follow the intuition that retrieving phrases naturally entails retrieving larger text blocks…

    Submitted 16 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021. Code available at https://github.com/princeton-nlp/DensePhrases