Showing 1–31 of 31 results for author: Bawden, R

  1. arXiv:2512.17738  [pdf, ps, other]

    cs.CL

    When the Gold Standard isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content

    Authors: Lydia Nishimwe, Benoît Sagot, Rachel Bawden

    Abstract: User-generated content (UGC) is characterised by frequent use of non-standard language, from spelling errors to expressive choices such as slang, character repetitions, and emojis. This makes evaluating UGC translation particularly challenging: what counts as a "good" translation depends on the level of standardness desired in the output. To explore this, we examine the human translation guideline…

    Submitted 19 December, 2025; originally announced December 2025.

    Comments: 10 pages, 19 pages with references and appendices

  2. arXiv:2510.25771  [pdf, ps, other]

    cs.CL cs.AI

    Gaperon: A Peppered English-French Generative Language Model Suite

    Authors: Nathan Godey, Wissam Antoun, Rian Touchent, Rachel Bawden, Éric de la Clergerie, Benoît Sagot, Djamé Seddah

    Abstract: We release Gaperon, a fully open suite of French-English-coding language models designed to advance transparency and reproducibility in large-scale model training. The Gaperon family includes 1.5B, 8B, and 24B parameter models trained on 2-4 trillion tokens, released with all elements of the training pipeline: French and English datasets filtered with a neural quality classifier, an efficient data…

    Submitted 29 October, 2025; originally announced October 2025.

  3. arXiv:2510.11919  [pdf, ps, other]

    cs.CL

    LLM Reasoning for Machine Translation: Synthetic Data Generation over Thinking Tokens

    Authors: Armel Zebaze, Rachel Bawden, Benoît Sagot

    Abstract: Large reasoning models (LRMs) have led to new possibilities in terms of problem-solving, through the devising of a natural language thought process prior to answering a query. While their capabilities are well known across mathematics and coding tasks, their impact on the task of machine translation (MT) remains underexplored. In this work, we explore the benefits of the generation of intermediate…

    Submitted 13 October, 2025; originally announced October 2025.

  4. arXiv:2508.14909  [pdf, ps, other]

    cs.CL

    Preliminary Ranking of WMT25 General Machine Translation Systems

    Authors: Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Konstantin Dranch, Anton Dvorkovich, Sergey Dukanov, Natalia Fedorova, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Howard Lakougna, Jessica Lundin, Kenton Murray, Masaaki Nagata, Stefano Perrella, Lorenzo Proietti, Martin Popel, Maja Popović, Parker Riley, Mariya Shmatova, et al. (3 additional authors not shown)

    Abstract: We present the preliminary rankings of machine translation (MT) systems submitted to the WMT25 General Machine Translation Shared Task, as determined by automatic evaluation metrics. Because these rankings are derived from automatic evaluation, they may exhibit a bias toward systems that employ re-ranking techniques, such as Quality Estimation or Minimum Bayes Risk decoding. The official WMT25 ran…

    Submitted 24 August, 2025; v1 submitted 11 August, 2025; originally announced August 2025.

  5. arXiv:2508.08680  [pdf, ps, other]

    cs.CL

    TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation

    Authors: Armel Zebaze, Benoît Sagot, Rachel Bawden

    Abstract: LLMs have been shown to perform well in machine translation (MT) with the use of in-context learning (ICL), rivaling supervised models when translating into high-resource languages (HRLs). However, they lag behind when translating into low-resource languages (LRLs). Example selection via similarity search and supervised fine-tuning help. However, the improvements they give are limited by the size, q…

    Submitted 12 August, 2025; originally announced August 2025.

  6. arXiv:2508.02290  [pdf, ps, other]

    cs.CL

    A French Version of the OLDI Seed Corpus

    Authors: Malik Marmonier, Benoît Sagot, Rachel Bawden

    Abstract: We present the first French partition of the OLDI Seed Corpus, our submission to the WMT 2025 Open Language Data Initiative (OLDI) shared task. We detail its creation process, which involved using multiple machine translation systems and a custom-built interface for post-editing by qualified native speakers. We also highlight the unique translation challenges presented by the source data, which co…

    Submitted 4 August, 2025; originally announced August 2025.

  7. arXiv:2503.09454  [pdf, ps, other]

    cs.CL

    Explicit Learning and the LLM in Machine Translation

    Authors: Malik Marmonier, Rachel Bawden, Benoît Sagot

    Abstract: This study explores an LLM's ability to learn new languages using explanations found in a grammar book, a process we term "explicit learning." To rigorously assess this ability, we design controlled translation experiments between English and constructed languages generated, through specific cryptographic means, from Latin or French. Contrary to previous studies, our results demonstrate that LLMs…

    Submitted 4 September, 2025; v1 submitted 12 March, 2025; originally announced March 2025.

  8. arXiv:2503.04554  [pdf, other]

    cs.CL

    Compositional Translation: A Novel LLM-based Approach for Low-resource Machine Translation

    Authors: Armel Zebaze, Benoît Sagot, Rachel Bawden

    Abstract: The ability of generative large language models (LLMs) to perform in-context learning has given rise to a large body of research into how best to prompt models for various natural language processing tasks. Machine Translation (MT) has been shown to benefit from in-context examples, in particular when they are semantically similar to the sentence to translate. In this paper, we propose a new LLM-b…

    Submitted 6 March, 2025; originally announced March 2025.

  9. arXiv:2501.06374  [pdf, ps, other]

    cs.CL

    AFRIDOC-MT: Document-level MT Corpus for African Languages

    Authors: Jesujoba O. Alabi, Israel Abebe Azime, Miaoran Zhang, Cristina España-Bonet, Rachel Bawden, Dawei Zhu, David Ifeoluwa Adelani, Clement Oyeleke Odoje, Idris Akinade, Iffat Maab, Davis David, Shamsuddeen Hassan Muhammad, Neo Putini, David O. Ademuyiwa, Andrew Caines, Dietrich Klakow

    Abstract: This paper introduces AFRIDOC-MT, a document-level multi-parallel translation dataset covering English and five African languages: Amharic, Hausa, Swahili, Yorùbá, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages. We conduct document-level translation benchmark experiments by evaluating neural machine tra…

    Submitted 13 October, 2025; v1 submitted 10 January, 2025; originally announced January 2025.

    Comments: EMNLP 2025

  10. arXiv:2412.17592  [pdf, other]

    cs.CL

    Investigating Length Issues in Document-level Machine Translation

    Authors: Ziqian Peng, Rachel Bawden, François Yvon

    Abstract: Transformer architectures are increasingly effective at processing and generating very long chunks of texts, opening new perspectives for document-level machine translation (MT). In this work, we challenge the ability of MT systems to handle texts comprising up to several thousands of tokens. We design and implement a new approach designed to precisely measure the effect of length increments on MT…

    Submitted 28 April, 2025; v1 submitted 23 December, 2024; originally announced December 2024.

    Comments: Accepted at the MT Summit 2025

  11. arXiv:2410.06634  [pdf, other]

    cs.CL

    Tree of Problems: Improving structured problem solving with compositionality

    Authors: Armel Zebaze, Benoît Sagot, Rachel Bawden

    Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across multiple tasks through in-context learning. For complex reasoning tasks that require step-by-step thinking, Chain-of-Thought (CoT) prompting has given impressive results, especially when combined with self-consistency. Nonetheless, some tasks remain particularly difficult for LLMs to solve. Tree of Thoughts (ToT) and Grap…

    Submitted 9 October, 2024; originally announced October 2024.

  12. arXiv:2408.00397  [pdf, other]

    cs.CL

    In-Context Example Selection via Similarity Search Improves Low-Resource Machine Translation

    Authors: Armel Zebaze, Benoît Sagot, Rachel Bawden

    Abstract: The ability of generative large language models (LLMs) to perform in-context learning has given rise to a large body of research into how best to prompt models for various natural language processing tasks. In this paper, we focus on machine translation (MT), a task that has been shown to benefit from in-context translation examples. However, no systematic studies have been published on how best to…

    Submitted 1 August, 2024; originally announced August 2024.

  13. arXiv:2407.19884  [pdf, other]

    cs.CL

    Preliminary WMT24 Ranking of General MT Systems and LLMs

    Authors: Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondrej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Benjamin Marie, Kenton Murray, Masaaki Nagata, Martin Popel, Maja Popovic, Mariya Shmatova, Steinþór Steingrímsson, Vilém Zouhar

    Abstract: This is the preliminary ranking of WMT24 General MT systems based on automatic metrics. The official ranking will be a human evaluation, which is superior to the automatic ranking and supersedes it. The purpose of this report is not to interpret any findings but only provide preliminary results to the participants of the General MT task that may be useful during the writing of the system submissio…

    Submitted 29 July, 2024; originally announced July 2024.

  14. arXiv:2407.13579  [pdf, other]

    cs.CL

    Towards Zero-Shot Multimodal Machine Translation

    Authors: Matthieu Futeral, Cordelia Schmid, Benoît Sagot, Rachel Bawden

    Abstract: Current multimodal machine translation (MMT) systems rely on fully supervised data (i.e. models are trained on sentences with their translations and accompanying images). However, this type of data is costly to collect, limiting the extension of MMT to other language pairs for which such data does not exist. In this work, we propose a method to bypass the need for fully supervised data to train MMT…

    Submitted 11 March, 2025; v1 submitted 18 July, 2024; originally announced July 2024.

    Comments: NAACL 2025 (Findings)

  15. arXiv:2406.08707  [pdf, ps, other]

    cs.CL cs.CV

    mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus

    Authors: Matthieu Futeral, Armel Zebaze, Pedro Ortiz Suarez, Julien Abadji, Rémi Lacroix, Cordelia Schmid, Rachel Bawden, Benoît Sagot

    Abstract: Multimodal Large Language Models (mLLMs) are trained on a large amount of text-image data. While most mLLMs are trained on caption-like data only, Alayrac et al. (2022) showed that additionally training them on interleaved sequences of text and images can lead to the emergence of in-context learning capabilities. However, the dataset they used, M3W, is not public and is only in English. There have…

    Submitted 29 May, 2025; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: ACL 2025 (Findings)

  16. arXiv:2403.17220  [pdf, other]

    cs.CL

    Making Sentence Embeddings Robust to User-Generated Content

    Authors: Lydia Nishimwe, Benoît Sagot, Rachel Bawden

    Abstract: NLP models have been known to perform poorly on user-generated content (UGC), mainly because it presents a lot of lexical variations and deviates from the standard texts on which most of these models were trained. In this work, we focus on the robustness of LASER, a sentence embedding model, to UGC data. We evaluate this robustness by LASER's ability to represent non-standard sentences and their s…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: Accepted at LREC-COLING 2024

  17. arXiv:2305.14012  [pdf, other]

    cs.CL

    When your Cousin has the Right Connections: Unsupervised Bilingual Lexicon Induction for Related Data-Imbalanced Languages

    Authors: Niyati Bafna, Cristina España-Bonet, Josef van Genabith, Benoît Sagot, Rachel Bawden

    Abstract: Most existing approaches for unsupervised bilingual lexicon induction (BLI) depend on good quality static or contextual embeddings requiring large monolingual corpora for both languages. However, unsupervised BLI is most likely to be useful for low-resource languages (LRLs), where large datasets are not available. Often we are interested in building bilingual resources for LRLs against related hig…

    Submitted 25 March, 2024; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: 9 pages, Accepted at LREC-COLING 2024

  18. arXiv:2305.03207  [pdf, other]

    cs.CL cs.AI

    Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages

    Authors: Sonal Sannigrahi, Rachel Bawden

    Abstract: Multilingual language models have shown impressive cross-lingual transfer ability across a diverse set of languages and tasks. To improve the cross-lingual ability of these models, some strategies include transliteration and finer-grained segmentation into characters as opposed to subwords. In this work, we investigate lexical sharing in multilingual machine translation (MT) from Hindi, Gujarati,…

    Submitted 4 May, 2023; originally announced May 2023.

    Comments: Accepted at the EAMT main conference

  19. arXiv:2303.01911  [pdf, ps, other]

    cs.CL

    Investigating the Translation Performance of a Large Multilingual Language Model: the Case of BLOOM

    Authors: Rachel Bawden, François Yvon

    Abstract: The NLP community recently saw the release of a new large open-access multilingual language model, BLOOM (BigScience et al., 2022) covering 46 languages. We focus on BLOOM's multilingual ability by evaluating its machine translation performance across several datasets (WMT, Flores-101 and DiaBLa) and language pairs (high- and low-resourced). Our results show that 0-shot performance suffers from ov…

    Submitted 9 May, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

    Comments: Accepted at EAMT 2023

  20. Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation

    Authors: Matthieu Futeral, Cordelia Schmid, Ivan Laptev, Benoît Sagot, Rachel Bawden

    Abstract: One of the major challenges of machine translation (MT) is ambiguity, which can in some cases be resolved by accompanying context such as images. However, recent work in multimodal MT (MMT) has shown that obtaining improvements from images is challenging, limited not only by the difficulty of building effective cross-modal representations, but also by the lack of specific evaluation and training d…

    Submitted 26 May, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: Accepted to ACL 2023

  21. arXiv:2211.05100  [pdf, other]

    cs.CL

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Authors: BigScience Workshop, :, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, et al. (369 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access…

    Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

  22. arXiv:2205.12394  [pdf, other]

    cs.CL

    MaskEval: Weighted MLM-Based Evaluation for Text Summarization and Simplification

    Authors: Yu Lu Liu, Rachel Bawden, Thomas Scialom, Benoît Sagot, Jackie Chi Kit Cheung

    Abstract: In text summarization and simplification, system outputs must be evaluated along multiple dimensions such as relevance, factual consistency, fluency, and grammaticality, and a wide range of possible outputs could be of high quality. These properties make the development of an adaptable, reference-less evaluation metric both necessary and challenging. We introduce MaskEval, a reference-less metric…

    Submitted 13 October, 2022; v1 submitted 24 May, 2022; originally announced May 2022.

  23. arXiv:2202.09452  [pdf, other]

    cs.CL

    From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French

    Authors: Simon Gabay, Pedro Ortiz Suarez, Alexandre Bartz, Alix Chagué, Rachel Bawden, Philippe Gambette, Benoît Sagot

    Abstract: Language models for historical states of language are becoming increasingly important to allow the optimal digitisation and analysis of old textual sources. Because these historical states are at the same time more complex to process and more scarce in the corpora available, specific efforts are necessary to train natural language processing (NLP) tools adapted to the data. In this paper, we prese…

    Submitted 18 February, 2022; originally announced February 2022.

    Comments: 8 pages, 2 figures, 4 tables

  24. arXiv:2110.08207  [pdf, other]

    cs.LG cs.CL

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Authors: Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, et al. (16 additional authors not shown)

    Abstract: Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language models' pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale,…

    Submitted 17 March, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

    Comments: ICLR 2022 Spotlight (with extended discussion)

  25. arXiv:2109.00486  [pdf, other]

    cs.CL

    Survey of Low-Resource Machine Translation

    Authors: Barry Haddow, Rachel Bawden, Antonio Valerio Miceli Barone, Jindřich Helcl, Alexandra Birch

    Abstract: We present a survey covering the state of the art in low-resource machine translation research. There are currently around 7000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models. There has been increasing interest in research addressing the challenge of producing useful translation models when very little translated train…

    Submitted 7 February, 2022; v1 submitted 1 September, 2021; originally announced September 2021.

  26. arXiv:2103.16911  [pdf, other]

    cs.CL

    Few-shot learning through contextual data augmentation

    Authors: Farid Arthaud, Rachel Bawden, Alexandra Birch

    Abstract: Machine translation (MT) models used in industries with constantly changing topics, such as translation or news agencies, need to adapt to new data to maintain their performance over time. Our aim is to teach a pre-trained MT model to translate previously unseen words accurately, based on very few examples. We propose (i) an experimental setup allowing us to simulate novel vocabulary appearing in…

    Submitted 31 March, 2021; originally announced March 2021.

    Comments: 14 pages, including 3 pages of appendices

  27. arXiv:2004.14989  [pdf, ps, other]

    cs.CL

    A Study in Improving BLEU Reference Coverage with Diverse Automatic Paraphrasing

    Authors: Rachel Bawden, Biao Zhang, Lisa Yankovskaya, Andre Tättar, Matt Post

    Abstract: We investigate a long-perceived shortcoming in the typical use of BLEU: its reliance on a single reference. Using modern neural paraphrasing techniques, we study whether automatically generating additional diverse references can provide better coverage of the space of valid translations and thereby improve its correlation with human judgments. Our experiments on the into-English language direction…

    Submitted 8 October, 2020; v1 submitted 30 April, 2020; originally announced April 2020.

    Comments: Accepted in the Findings of EMNLP 2020

  28. arXiv:1912.06598  [pdf, other]

    cs.CL

    Document Sub-structure in Neural Machine Translation

    Authors: Radina Dobreva, Jie Zhou, Rachel Bawden

    Abstract: Current approaches to machine translation (MT) either translate sentences in isolation, disregarding the context they appear in, or model context at the level of the full document, without a notion of any internal structure the document may have. In this work we consider the fact that documents are rarely homogeneous blocks of text, but rather consist of parts covering different topics. Some docum…

    Submitted 10 March, 2020; v1 submitted 13 December, 2019; originally announced December 2019.

    Comments: Accepted at LREC 2020

  29. arXiv:1907.05854  [pdf, other]

    cs.CL

    The University of Edinburgh's Submissions to the WMT19 News Translation Task

    Authors: Rachel Bawden, Nikolay Bogoychev, Ulrich Germann, Roman Grundkiewicz, Faheem Kirefu, Antonio Valerio Miceli Barone, Alexandra Birch

    Abstract: The University of Edinburgh participated in the WMT19 Shared Task on News Translation in six language directions: English-to-Gujarati, Gujarati-to-English, English-to-Chinese, Chinese-to-English, German-to-English, and English-to-Czech. For all translation directions, we created or used back-translations of monolingual data in the target language as additional synthetic training data. For English-…

    Submitted 12 July, 2019; originally announced July 2019.

    Comments: To appear in the Proceedings of WMT19: Shared Task Papers

  30. arXiv:1905.13354  [pdf, other]

    cs.CL

    DiaBLa: A Corpus of Bilingual Spontaneous Written Dialogues for Machine Translation

    Authors: Rachel Bawden, Sophie Rosset, Thomas Lavergne, Eric Bilinski

    Abstract: We present a new English-French test set for the evaluation of Machine Translation (MT) for informal, written bilingual dialogue. The test set contains 144 spontaneous dialogues (5,700+ sentences) between native English and French speakers, mediated by one of two neural MT systems in a range of role-play settings. The dialogues are accompanied by fine-grained sentence-level judgments of MT quality…

    Submitted 30 May, 2019; originally announced May 2019.

  31. arXiv:1711.00513  [pdf, other]

    cs.CL

    Evaluating Discourse Phenomena in Neural Machine Translation

    Authors: Rachel Bawden, Rico Sennrich, Alexandra Birch, Barry Haddow

    Abstract: For machine translation to tackle discourse phenomena, models must have access to extra-sentential linguistic context. There has been recent interest in modelling context in neural machine translation (NMT), but models have been principally evaluated with standard automatic metrics, poorly adapted to evaluating discourse phenomena. In this article, we present hand-crafted, discourse test sets, des…

    Submitted 20 April, 2018; v1 submitted 1 November, 2017; originally announced November 2017.

    Comments: Final version of paper to appear in Proceedings of NAACL 2018