-
Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies
Authors:
Ekaterina Artemova,
Laurie Burchell,
Daryna Dementieva,
Shu Okabe,
Mariya Shmatova,
Pedro Ortiz Suarez
Abstract:
This tutorial (https://tum-nlp.github.io/low-resource-tutorial) is designed for NLP practitioners, researchers, and developers working with multilingual and low-resource languages who seek to create more equitable and socially impactful language technologies. Participants will walk away with a practical toolkit for building end-to-end NLP pipelines for underrepresented languages -- from data collection and web crawling to parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning. The tutorial presents strategies for tackling the challenges of data scarcity and cultural variance, offering hands-on methods and modeling frameworks. We will focus on fair, reproducible, and community-informed development approaches, grounded in real-world scenarios. We will showcase a diverse set of use cases covering over 10 languages from different language families and geopolitical contexts, including both digitally resource-rich and severely underrepresented languages.
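As a taste of one pipeline stage mentioned above, the sketch below shows embedding-based parallel sentence mining with a multilingual encoder. The LaBSE checkpoint, the toy sentences, and the 0.7 similarity threshold are illustrative assumptions, not material from the tutorial.

    # Illustrative sketch only: model choice, data, and threshold are assumptions.
    from sentence_transformers import SentenceTransformer
    import numpy as np

    model = SentenceTransformer("sentence-transformers/LaBSE")  # multilingual encoder

    src = ["The weather is nice today.", "Cats sleep a lot."]
    tgt = ["Les chats dorment beaucoup.", "Il fait beau aujourd'hui."]

    # Normalized embeddings make the dot product a cosine similarity.
    src_emb = model.encode(src, normalize_embeddings=True)
    tgt_emb = model.encode(tgt, normalize_embeddings=True)
    sim = src_emb @ tgt_emb.T

    # Greedy mining: keep each source's best target if it clears the threshold.
    for i, row in enumerate(sim):
        j = int(np.argmax(row))
        if row[j] > 0.7:
            print(f"{src[i]} ||| {tgt[j]} (sim={row[j]:.2f})")

In practice, mined pairs like these are filtered further and then fed into machine translation or classification training, which is the pipeline shape the tutorial describes.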
Submitted 16 December, 2025;
originally announced December 2025.
-
Preliminary Ranking of WMT25 General Machine Translation Systems
Authors:
Tom Kocmi,
Eleftherios Avramidis,
Rachel Bawden,
Ondřej Bojar,
Konstantin Dranch,
Anton Dvorkovich,
Sergey Dukanov,
Natalia Fedorova,
Mark Fishel,
Markus Freitag,
Thamme Gowda,
Roman Grundkiewicz,
Barry Haddow,
Marzena Karpinska,
Philipp Koehn,
Howard Lakougna,
Jessica Lundin,
Kenton Murray,
Masaaki Nagata,
Stefano Perrella,
Lorenzo Proietti,
Martin Popel,
Maja Popović,
Parker Riley,
Mariya Shmatova
et al. (3 additional authors not shown)
Abstract:
We present the preliminary rankings of machine translation (MT) systems submitted to the WMT25 General Machine Translation Shared Task, as determined by automatic evaluation metrics. Because these rankings are derived from automatic evaluation, they may exhibit a bias toward systems that employ re-ranking techniques, such as Quality Estimation or Minimum Bayes Risk decoding. The official WMT25 ranking will be based on human evaluation, which is more reliable and will supersede these results. The purpose of releasing these findings now is to assist task participants with their system description papers, not to provide final findings.
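For context, Minimum Bayes Risk (MBR) decoding re-ranks candidate translations by their agreement with one another under an automatic metric, which is why metric-based rankings can favor systems that use it. The sketch below illustrates the idea with chrF as the utility; the candidate list is invented for illustration, not WMT25 output.

    # Toy sketch of MBR re-ranking with chrF as the utility function;
    # the candidates are invented examples, not shared-task data.
    from sacrebleu.metrics import CHRF

    chrf = CHRF()
    candidates = [
        "The cat sits on the mat.",
        "A cat is sitting on the mat.",
        "The cat sat on a mat.",
    ]

    def expected_utility(hyp: str, pseudo_refs: list[str]) -> float:
        # Average agreement of a hypothesis with all other candidates,
        # which act as pseudo-references.
        return sum(chrf.sentence_score(hyp, [r]).score
                   for r in pseudo_refs) / len(pseudo_refs)

    # Pick the most "central" candidate: the one other candidates agree with most.
    best = max(candidates,
               key=lambda h: expected_utility(h, [c for c in candidates if c != h]))
    print("MBR choice:", best)

Because the selected output is, by construction, the one automatic metrics like best, such systems can look stronger under automatic evaluation than under human judgment.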
Submitted 24 August, 2025; v1 submitted 11 August, 2025;
originally announced August 2025.
-
Preliminary WMT24 Ranking of General MT Systems and LLMs
Authors:
Tom Kocmi,
Eleftherios Avramidis,
Rachel Bawden,
Ondřej Bojar,
Anton Dvorkovich,
Christian Federmann,
Mark Fishel,
Markus Freitag,
Thamme Gowda,
Roman Grundkiewicz,
Barry Haddow,
Marzena Karpinska,
Philipp Koehn,
Benjamin Marie,
Kenton Murray,
Masaaki Nagata,
Martin Popel,
Maja Popović,
Mariya Shmatova,
Steinþór Steingrímsson,
Vilém Zouhar
Abstract:
This is the preliminary ranking of WMT24 General MT systems based on automatic metrics. The official ranking will be a human evaluation, which is superior to the automatic ranking and supersedes it. The purpose of this report is not to interpret any findings, but only to provide preliminary results to the participants of the General MT task that may be useful when writing their system submissions.
Submitted 29 July, 2024;
originally announced July 2024.
-
Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation
Authors:
Tom Kocmi,
Vilém Zouhar,
Eleftherios Avramidis,
Roman Grundkiewicz,
Marzena Karpinska,
Maja Popović,
Mrinmaya Sachan,
Mariya Shmatova
Abstract:
High-quality Machine Translation (MT) evaluation relies heavily on human judgments. Comprehensive error classification methods, such as Multidimensional Quality Metrics (MQM), are expensive because they are time-consuming and can only be performed by experts, whose availability may be limited, especially for low-resource languages. On the other hand, assigning overall scores, as in Direct Assessment (DA), is simpler and faster and can be done by translators of any level, but is less reliable. In this paper, we introduce Error Span Annotation (ESA), a human evaluation protocol which combines the continuous rating of DA with the high-level error severity span marking of MQM. We validate ESA by comparing it to MQM and DA for 12 MT systems and one human reference translation (English to German) from WMT23. The results show that ESA offers faster and cheaper annotations than MQM at the same quality level, without requiring expensive MQM experts.
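To make the protocol concrete, the sketch below shows one plausible shape for a single ESA judgment: MQM-style severity spans plus a DA-style 0-100 rating. The field names and values are hypothetical illustrations, not the paper's actual annotation schema.

    # Hypothetical record for one ESA judgment; field names and values
    # are illustrative assumptions, not the paper's schema.
    from dataclasses import dataclass, field

    @dataclass
    class ErrorSpan:
        start: int      # character offset in the MT output
        end: int
        severity: str   # "minor" or "major"

    @dataclass
    class ESAJudgment:
        segment_id: str
        spans: list[ErrorSpan] = field(default_factory=list)
        overall_score: int = 100  # DA-style continuous rating, 0-100

    judgment = ESAJudgment(
        segment_id="wmt23-en-de-0001",
        spans=[ErrorSpan(start=10, end=17, severity="major")],
        overall_score=62,
    )
    print(judgment)

The appeal of this combination is that span marking keeps the rating grounded in concrete errors, while the single score stays cheap enough for non-expert annotators.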
Submitted 18 October, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks
Authors:
Andrey Malinin,
Neil Band,
Alexander Ganshin,
German Chesnokov,
Yarin Gal,
Mark J. F. Gales,
Alexey Noskov,
Andrey Ploskonosov,
Liudmila Prokhorenkova,
Ivan Provilkov,
Vatsal Raina,
Vyas Raina,
Denis Roginskiy,
Mariya Shmatova,
Panos Tigas,
Boris Yangel
Abstract:
There has been significant research on developing methods for improving robustness to distributional shift and uncertainty estimation. In contrast, only limited work has examined developing standard datasets and benchmarks for assessing these approaches. Additionally, most work on uncertainty estimation and robustness has developed new techniques based on small-scale regression or image classification tasks. However, many tasks of practical interest have different modalities, such as tabular data, audio, text, or sensor data, which pose significant challenges involving regression and discrete or continuous structured prediction. Given the current state of the field, a standardized large-scale dataset of tasks across a range of modalities affected by distributional shift is therefore necessary. It will enable researchers to meaningfully evaluate the plethora of recently developed uncertainty quantification methods, as well as assessment criteria and state-of-the-art baselines. In this work, we propose the Shifts Dataset for evaluation of uncertainty estimates and robustness to distributional shift. The dataset, collected from industrial sources and services, is composed of three tasks, each corresponding to a particular data modality: tabular weather prediction, machine translation, and self-driving car (SDC) vehicle motion prediction. All of these data modalities and tasks are affected by real, "in-the-wild" distributional shifts and pose interesting challenges with respect to uncertainty estimation. We provide a description of the dataset and baseline results for all tasks.
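One common way to evaluate uncertainty estimates of this kind is an error-retention curve: examples are rejected in order of decreasing uncertainty, and a good estimator removes the largest errors first, so the mean error over the retained set falls quickly. The sketch below illustrates this on synthetic data; it is an assumption-laden illustration, not the benchmark's official scoring code.

    # Error-retention sketch on synthetic data; not the Shifts scoring code.
    import numpy as np

    rng = np.random.default_rng(0)
    errors = rng.gamma(2.0, 1.0, size=1000)          # per-example errors
    uncertainty = errors + rng.normal(0, 0.5, 1000)  # noisy uncertainty signal

    order = np.argsort(-uncertainty)   # most uncertain first
    sorted_err = errors[order]

    # Mean error over the retained (least uncertain) fraction of examples:
    # a useful uncertainty signal makes this drop as we reject more examples.
    for frac in (1.0, 0.9, 0.5):
        k = int(len(errors) * frac)
        retained = sorted_err[len(errors) - k:]
        print(f"retain {frac:.0%}: mean error {retained.mean():.3f}")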
Submitted 11 February, 2022; v1 submitted 15 July, 2021;
originally announced July 2021.