Showing 1–50 of 69 results for author: Koehn, P

  1. arXiv:2509.20485  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens

    Authors: Ismail Rasim Ulgen, Zongyang Du, Junchen Lu, Philipp Koehn, Berrak Sisman

    Abstract: Objective evaluation of synthesized speech is critical for advancing speech generation systems, yet existing metrics for intelligibility and prosody remain limited in scope and weakly correlated with human perception. Word Error Rate (WER) provides only a coarse text-based measure of intelligibility, while F0-RMSE and related pitch-based metrics offer a narrow, reference-dependent view of prosody.…

    Submitted 24 September, 2025; originally announced September 2025.

    Comments: Under review for IEEE OJSP

  2. arXiv:2509.18550  [pdf, ps, other]

    cs.CV

    HadaSmileNet: Hadamard fusion of handcrafted and deep-learning features for enhancing facial emotion recognition of genuine smiles

    Authors: Mohammad Junayed Hasan, Nabeel Mohammed, Shafin Rahman, Philipp Koehn

    Abstract: The distinction between genuine and posed emotions represents a fundamental pattern recognition challenge with significant implications for data mining applications in social sciences, healthcare, and human-computer interaction. While recent multi-task learning frameworks have shown promise in combining deep learning architectures with handcrafted D-Marker features for smile facial emotion recogni…

    Submitted 22 September, 2025; originally announced September 2025.

    Comments: Accepted to IEEE International Conference on Data Mining (ICDM) 2025. Final version to appear in the conference proceedings
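
    A minimal sketch of the fusion idea named in the title, assuming both feature streams are first projected to a common width and then combined by an element-wise (Hadamard) product; the dimensions and layer names below are illustrative assumptions, not the paper's architecture:

        # Hedged sketch of Hadamard-product feature fusion (PyTorch).
        import torch
        import torch.nn as nn

        class HadamardFusion(nn.Module):
            def __init__(self, deep_dim=512, hand_dim=20, fused_dim=128, n_classes=2):
                super().__init__()
                # Project both feature streams into a shared space.
                self.deep_proj = nn.Linear(deep_dim, fused_dim)
                self.hand_proj = nn.Linear(hand_dim, fused_dim)
                self.classifier = nn.Linear(fused_dim, n_classes)

            def forward(self, deep_feats, hand_feats):
                # Element-wise product fuses the streams without the
                # parameter cost of concatenation plus wide layers.
                fused = torch.tanh(self.deep_proj(deep_feats)) * \
                        torch.tanh(self.hand_proj(hand_feats))
                return self.classifier(fused)

        model = HadamardFusion()
        logits = model(torch.randn(8, 512), torch.randn(8, 20))  # batch of 8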

  3. arXiv:2509.18360  [pdf, ps, other]

    cs.CL

    Speech Vecalign: an Embedding-based Method for Aligning Parallel Speech Documents

    Authors: Chutong Meng, Philipp Koehn

    Abstract: We present Speech Vecalign, a parallel speech document alignment method that monotonically aligns speech segment embeddings and does not depend on text transcriptions. Compared to the baseline method Global Mining, a variant of speech mining, Speech Vecalign produces longer speech-to-speech alignments. It also demonstrates greater robustness than Local Mining, another speech mining variant, as it…

    Submitted 22 September, 2025; originally announced September 2025.

    Comments: Accepted by EMNLP 2025 (main)
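
    A toy sketch of monotonic alignment over segment embeddings by dynamic programming, in the spirit of Vecalign; real Vecalign also scores many-to-many merges, which this simplified 1-1 version with skips omits:

        import numpy as np

        def cosine(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

        def monotonic_align(src_embs, tgt_embs, skip_penalty=0.0):
            m, n = len(src_embs), len(tgt_embs)
            score = np.full((m + 1, n + 1), -np.inf)
            score[0, 0] = 0.0
            back = {}
            for i in range(m + 1):
                for j in range(n + 1):
                    if score[i, j] == -np.inf:
                        continue
                    if i < m and j < n:  # link src[i] with tgt[j]
                        s = score[i, j] + cosine(src_embs[i], tgt_embs[j])
                        if s > score[i + 1, j + 1]:
                            score[i + 1, j + 1] = s
                            back[(i + 1, j + 1)] = (i, j, "link")
                    if i < m and score[i, j] - skip_penalty > score[i + 1, j]:
                        score[i + 1, j] = score[i, j] - skip_penalty
                        back[(i + 1, j)] = (i, j, "skip")   # skip a source segment
                    if j < n and score[i, j] - skip_penalty > score[i, j + 1]:
                        score[i, j + 1] = score[i, j] - skip_penalty
                        back[(i, j + 1)] = (i, j, "skip")   # skip a target segment
            links, ij = [], (m, n)
            while ij in back:                # trace back the best monotonic path
                pi, pj, op = back[ij]
                if op == "link":
                    links.append((pi, pj))
                ij = (pi, pj)
            return list(reversed(links))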

  4. arXiv:2509.14480  [pdf, ps, other]

    cs.CL cs.AI cs.MA

    Process-Supervised Reinforcement Learning for Interactive Multimodal Tool-Use Agents

    Authors: Weiting Tan, Xinghua Qu, Ming Tu, Meng Ge, Andy T. Liu, Philipp Koehn, Lu Lu

    Abstract: Effective interactive tool use requires agents to master Tool Integrated Reasoning (TIR): a complex process involving multi-turn planning and long-context dialogue management. To train agents for this dynamic process, particularly in multi-modal contexts, we introduce a sandbox environment for reinforcement learning (RL) that supports interleaved speech-text rollouts. Our core strategy, Turn-level…

    Submitted 17 September, 2025; originally announced September 2025.

  5. arXiv:2508.16188  [pdf, ps, other]

    cs.CL cs.CV cs.MM cs.SD eess.AS

    Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation

    Authors: Weiting Tan, Jiachen Lian, Hirofumi Inaguma, Paden Tomasello, Philipp Koehn, Xutai Ma

    Abstract: We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial…

    Submitted 27 August, 2025; v1 submitted 22 August, 2025; originally announced August 2025.

    Comments: EMNLP 2025 (Findings)

  6. arXiv:2508.14909  [pdf, ps, other]

    cs.CL

    Preliminary Ranking of WMT25 General Machine Translation Systems

    Authors: Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Konstantin Dranch, Anton Dvorkovich, Sergey Dukanov, Natalia Fedorova, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Howard Lakougna, Jessica Lundin, Kenton Murray, Masaaki Nagata, Stefano Perrella, Lorenzo Proietti, Martin Popel, Maja Popović, Parker Riley, Mariya Shmatova , et al. (3 additional authors not shown)

    Abstract: We present the preliminary rankings of machine translation (MT) systems submitted to the WMT25 General Machine Translation Shared Task, as determined by automatic evaluation metrics. Because these rankings are derived from automatic evaluation, they may exhibit a bias toward systems that employ re-ranking techniques, such as Quality Estimation or Minimum Bayes Risk decoding. The official WMT25 ran…

    Submitted 24 August, 2025; v1 submitted 11 August, 2025; originally announced August 2025.

  7. arXiv:2505.16281  [pdf, ps, other]

    cs.CL

    HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation

    Authors: Shijie Zhang, Renhao Li, Songsheng Wang, Philipp Koehn, Min Yang, Derek F. Wong

    Abstract: The advancement of Large Language Models (LLMs) enables flexible and interpretable automatic evaluations. In the field of machine translation evaluation, utilizing LLMs with translation error annotations based on Multidimensional Quality Metrics (MQM) yields more human-aligned judgments. However, current LLM-based evaluation methods still face challenges in accurately identifying error spans and a…

    Submitted 15 September, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

  8. arXiv:2502.21265  [pdf, other]

    cs.CL

    Token-level Ensembling of Models with Different Vocabularies

    Authors: Rachel Wicks, Kartik Ravisankar, Xinchen Yang, Philipp Koehn, Matt Post

    Abstract: Model ensembling is a technique to combine the predicted distributions of two or more models, often leading to improved robustness and performance. For ensembling in text generation, the next token's probability distribution is derived from a weighted sum of the distributions of each individual model. This requires the underlying models to share the same subword vocabulary, limiting the applicabil…

    Submitted 28 February, 2025; originally announced February 2025.

    Comments: Under review
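
    For reference, a sketch of the standard shared-vocabulary ensembling the abstract describes as the starting point (the paper's contribution is removing the shared-vocabulary requirement); it assumes HuggingFace-style models whose output exposes a .logits field:

        import torch

        def ensemble_next_token(models, input_ids, weights):
            # Weighted sum of each model's next-token distribution.
            probs = None
            for model, w in zip(models, weights):
                logits = model(input_ids).logits[:, -1, :]  # last position
                p = torch.softmax(logits, dim=-1)
                probs = w * p if probs is None else probs + w * p
            return probs.argmax(dim=-1)  # greedy pick from the mixture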

  9. arXiv:2412.11732  [pdf, other]

    cs.CL

    Findings of the WMT 2024 Shared Task on Discourse-Level Literary Translation

    Authors: Longyue Wang, Siyou Liu, Chenyang Lyu, Wenxiang Jiao, Xing Wang, Jiahao Xu, Zhaopeng Tu, Yan Gu, Weiyu Chen, Minghao Wu, Liting Zhou, Philipp Koehn, Andy Way, Yulin Yuan

    Abstract: Following last year, we have continued to host the WMT translation shared task this year, with the second edition of the Discourse-Level Literary Translation. We focus on three language directions: Chinese-English, Chinese-German, and Chinese-Russian, with the latter two newly added. This year, we received a total of 10 submissions from 5 academic and industry teams. We employ both automatic and huma…

    Submitted 16 December, 2024; originally announced December 2024.

    Comments: WMT2024

  10. arXiv:2410.03115  [pdf, other]

    cs.CL

    X-ALMA: Plug & Play Modules and Adaptive Rejection for Quality Translation at Scale

    Authors: Haoran Xu, Kenton Murray, Philipp Koehn, Hieu Hoang, Akiko Eriguchi, Huda Khayrallah

    Abstract: Large language models (LLMs) have achieved remarkable success across various NLP tasks with a focus on English due to English-centric pre-training and limited multilingual data. In this work, we focus on the problem of translation, and while some multilingual LLMs claim to support hundreds of languages, models often fail to provide high-quality responses for mid- and low-resource languages, le…

    Submitted 2 March, 2025; v1 submitted 3 October, 2024; originally announced October 2024.

    Comments: Published as a conference paper at ICLR 2025 (spotlight)

  11. arXiv:2407.19884  [pdf, other]

    cs.CL

    Preliminary WMT24 Ranking of General MT Systems and LLMs

    Authors: Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondrej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Benjamin Marie, Kenton Murray, Masaaki Nagata, Martin Popel, Maja Popovic, Mariya Shmatova, Steinþór Steingrímsson, Vilém Zouhar

    Abstract: This is the preliminary ranking of WMT24 General MT systems based on automatic metrics. The official ranking will be a human evaluation, which is superior to the automatic ranking and supersedes it. The purpose of this report is not to interpret any findings but only to provide preliminary results to the participants of the General MT task that may be useful during the writing of the system submissio…

    Submitted 29 July, 2024; originally announced July 2024.

  12. arXiv:2406.13748  [pdf, ps, other]

    cs.CL cs.LG

    Learn and Unlearn: Addressing Misinformation in Multilingual LLMs

    Authors: Taiming Lu, Philipp Koehn

    Abstract: This paper investigates the propagation of harmful information in multilingual large language models (LLMs) and evaluates the efficacy of various unlearning methods. We demonstrate that fake information, regardless of the language it is in, once introduced into these models through training data, can spread across different languages, compromising the integrity and reliability of the generated con…

    Submitted 3 September, 2025; v1 submitted 19 June, 2024; originally announced June 2024.

    Comments: EMNLP 2025 Main Conference

  13. arXiv:2406.03869  [pdf, other]

    cs.CL

    Recovering document annotations for sentence-level bitext

    Authors: Rachel Wicks, Matt Post, Philipp Koehn

    Abstract: Data availability limits the scope of any given task. In machine translation, historical models were incapable of handling longer contexts, so the lack of document-level datasets was less noticeable. Now, despite the emergence of long-sequence methods, we remain within a sentence-level paradigm and without data to adequately approach context-aware machine translation. Most large-scale datasets hav…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: ACL 2024 Findings

  14. arXiv:2405.20389  [pdf, other]

    astro-ph.IM cs.AI cs.HC cs.IR

    Designing an Evaluation Framework for Large Language Models in Astronomy Research

    Authors: John F. Wu, Alina Hyk, Kiera McCormick, Christine Ye, Simone Astarita, Elina Baral, Jo Ciuca, Jesse Cranney, Anjalie Field, Kartheik Iyer, Philipp Koehn, Jenn Kotler, Sandor Kruk, Michelle Ntampaka, Charles O'Neill, Joshua E. G. Peek, Sanjib Sharma, Mikaeel Yunus

    Abstract: Large Language Models (LLMs) are shifting how scientific research is done. It is imperative to understand how researchers interact with these models and how scientific sub-communities like astronomy might benefit from them. However, there is currently no standard for evaluating the use of LLMs in astronomy. Therefore, we present the experimental design for an evaluation study on how astronomy rese…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: 7 pages, 3 figures. Code available at https://github.com/jsalt2024-evaluating-llms-for-astronomy/astro-arxiv-bot

  15. arXiv:2405.13274  [pdf, other]

    cs.CL

    DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation

    Authors: Weiting Tan, Jingyu Zhang, Lingfeng Shen, Daniel Khashabi, Philipp Koehn

    Abstract: Non-autoregressive Transformers (NATs) have recently been applied in direct speech-to-speech translation systems, which convert speech across different languages without intermediate text data. Although NATs generate high-quality outputs and offer faster inference than autoregressive models, they tend to produce incoherent and repetitive results due to complex data distribution (e.g., acoustic and lingu…

    Submitted 21 October, 2024; v1 submitted 21 May, 2024; originally announced May 2024.

    Comments: Accepted at NeurIPS 2024

  16. arXiv:2403.10963  [pdf, other]

    cs.CL

    Pointer-Generator Networks for Low-Resource Machine Translation: Don't Copy That!

    Authors: Niyati Bafna, Philipp Koehn, David Yarowsky

    Abstract: While Transformer-based neural machine translation (NMT) is very effective in high-resource settings, many languages lack the necessary large parallel corpora to benefit from it. In the context of low-resource (LR) MT between two closely-related languages, a natural intuition is to seek benefits from structural "shortcuts", such as copying subwords from the source to the target, given that such la…

    Submitted 17 June, 2024; v1 submitted 16 March, 2024; originally announced March 2024.

    Comments: 5 pages, Accepted at Workshop on Insights from Negative Results in NLP (NAACL) 2024

  17. arXiv:2402.01172  [pdf, other]

    cs.CL cs.SD eess.AS

    Streaming Sequence Transduction through Dynamic Compression

    Authors: Weiting Tan, Yunmo Chen, Tongfei Chen, Guanghui Qin, Haoran Xu, Heidi C. Zhang, Benjamin Van Durme, Philipp Koehn

    Abstract: We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams. STAR dynamically segments input streams to create compressed anchor representations, achieving nearly lossless compression (12x) in Automatic Speech Recognition (ASR) and outperforming existing methods. Moreover, STAR demonstrat…

    Submitted 21 May, 2025; v1 submitted 2 February, 2024; originally announced February 2024.

    Comments: IWSLT 2025

  18. arXiv:2401.13136  [pdf, other]

    cs.CL cs.AI

    The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts

    Authors: Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, Daniel Khashabi

    Abstract: As the influence of large language models (LLMs) spans across global communities, their safety challenges in multilingual settings become paramount for alignment research. This paper examines the variations in safety challenges faced by LLMs across different languages and discusses approaches to alleviating such concerns. By comparing how state-of-the-art LLMs respond to the same set of malicious…

    Submitted 23 January, 2024; originally announced January 2024.

  19. arXiv:2311.03127  [pdf, other]

    cs.CL cs.AI

    Findings of the WMT 2023 Shared Task on Discourse-Level Literary Translation: A Fresh Orb in the Cosmos of LLMs

    Authors: Longyue Wang, Zhaopeng Tu, Yan Gu, Siyou Liu, Dian Yu, Qingsong Ma, Chenyang Lyu, Liting Zhou, Chao-Hong Liu, Yufeng Ma, Weiyu Chen, Yvette Graham, Bonnie Webber, Philipp Koehn, Andy Way, Yulin Yuan, Shuming Shi

    Abstract: Translating literary works has perennially stood as an elusive dream in machine translation (MT), a journey steeped in intricate challenges. To foster progress in this domain, we hold a new shared task at WMT 2023, the first edition of the Discourse-Level Literary Translation. First, we (Tencent AI Lab and China Literature Ltd.) release a copyrighted and document-level Chinese-English web novel co…

    Submitted 6 November, 2023; originally announced November 2023.

    Comments: WMT2023 Discourse-Level Literary Translation Shared Task Overview Paper

  20. arXiv:2311.02310  [pdf, other]

    cs.CL

    Narrowing the Gap between Zero- and Few-shot Machine Translation by Matching Styles

    Authors: Weiting Tan, Haoran Xu, Lingfeng Shen, Shuyue Stella Li, Kenton Murray, Philipp Koehn, Benjamin Van Durme, Yunmo Chen

    Abstract: Large language models trained primarily in a monolingual setting have demonstrated their ability to generalize to machine translation using zero- and few-shot examples with in-context learning. However, even though zero-shot translations are relatively good, there remains a discernible gap comparing their performance with the few-shot setting. In this paper, we investigate the factors contributing…

    Submitted 3 November, 2023; originally announced November 2023.

  21. arXiv:2310.00840  [pdf, other]

    cs.CL

    Error Norm Truncation: Robust Training in the Presence of Data Noise for Text Generation Models

    Authors: Tianjian Li, Haoran Xu, Philipp Koehn, Daniel Khashabi, Kenton Murray

    Abstract: Text generation models are notoriously vulnerable to errors in the training data. With massive amounts of web-crawled data becoming more commonplace, how can we enhance the robustness of models trained on such noisy text? In our work, we propose Error Norm Truncation (ENT), a robust enhancement method to the standard training objective tha…

    Submitted 18 March, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

    Comments: ICLR 2024
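
    The abstract is truncated before the method details; as a hedged sketch, one way to realize truncation by error norm is to drop tokens whose predicted distribution is far (in L2) from the one-hot target, so likely-noisy tokens do not dominate training. The threshold below is an illustrative assumption:

        import torch
        import torch.nn.functional as F

        def ent_loss(logits, targets, threshold=1.2):
            # logits: (n_tokens, vocab), targets: (n_tokens,)
            probs = F.softmax(logits, dim=-1)
            one_hot = F.one_hot(targets, probs.size(-1)).float()
            err_norm = (probs - one_hot).norm(p=2, dim=-1)  # in [0, sqrt(2)]
            keep = (err_norm < threshold).float()           # truncate noisy tokens
            nll = F.cross_entropy(logits, targets, reduction="none")
            return (keep * nll).sum() / keep.sum().clamp(min=1.0)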

  22. arXiv:2305.14280  [pdf, other]

    cs.CL

    Multilingual Pixel Representations for Translation and Effective Cross-lingual Transfer

    Authors: Elizabeth Salesky, Neha Verma, Philipp Koehn, Matt Post

    Abstract: We introduce and demonstrate how to effectively train multilingual machine translation models with pixel representations. We experiment with two different data settings with a variety of language and script coverage, demonstrating improved performance compared to subword embeddings. We explore various properties of pixel representations such as parameter sharing within and across scripts to better…

    Submitted 24 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023

  23. arXiv:2305.13993  [pdf, other]

    cs.CL

    Condensing Multilingual Knowledge with Lightweight Language-Specific Modules

    Authors: Haoran Xu, Weiting Tan, Shuyue Stella Li, Yunmo Chen, Benjamin Van Durme, Philipp Koehn, Kenton Murray

    Abstract: Incorporating language-specific (LS) modules is a proven method to boost performance in multilingual machine translation. This approach bears similarity to Mixture-of-Experts (MoE) because it does not inflate FLOPs. However, the scalability of this approach to hundreds of languages (experts) tends to be unmanageable due to the prohibitive number of parameters introduced by full-rank matrices in fu…

    Submitted 22 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Accepted at the main conference of EMNLP 2023
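
    The truncated abstract points at full-rank per-language matrices as the scalability bottleneck; a hedged sketch of a lightweight alternative is a per-language low-rank update to a shared projection (rank, initialization, and placement here are assumptions for illustration, not the paper's exact design):

        import torch
        import torch.nn as nn

        class LowRankLanguageSpecific(nn.Module):
            def __init__(self, dim, n_langs, rank=8):
                super().__init__()
                self.shared = nn.Linear(dim, dim)
                # Per-language factors cost dim*rank each, not dim*dim.
                self.down = nn.Parameter(torch.randn(n_langs, dim, rank) * 0.02)
                self.up = nn.Parameter(torch.zeros(n_langs, rank, dim))

            def forward(self, x, lang_id):
                # x: (batch, seq, dim); add the language's low-rank update.
                ls = (x @ self.down[lang_id]) @ self.up[lang_id]
                return self.shared(x) + ls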

  24. arXiv:2210.14378  [pdf, other]

    cs.CL cs.LG

    Bilingual Lexicon Induction for Low-Resource Languages using Graph Matching via Optimal Transport

    Authors: Kelly Marchisio, Ali Saad-Eldin, Kevin Duh, Carey Priebe, Philipp Koehn

    Abstract: Bilingual lexicons form a critical component of various natural language processing applications, including unsupervised and semi-supervised machine translation and cross-lingual information retrieval. We improve bilingual lexicon induction performance across 40 language pairs with a graph-matching method based on optimal transport. The method is especially strong with low amounts of supervision.

    Submitted 25 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022 Camera-Ready
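
    A toy sketch of lexicon induction with entropic optimal transport via Sinkhorn iterations; the paper matches graphs rather than raw embedding distances, so treat this only as an illustration of the OT machinery:

        import numpy as np

        def sinkhorn(cost, reg=0.1, iters=200):
            # Entropic OT with uniform marginals.
            K = np.exp(-cost / reg)
            r = np.ones(cost.shape[0]) / cost.shape[0]
            c = np.ones(cost.shape[1]) / cost.shape[1]
            u, v = np.ones_like(r), np.ones_like(c)
            for _ in range(iters):
                u = r / (K @ v)
                v = c / (K.T @ u)
            return np.diag(u) @ K @ np.diag(v)  # transport plan

        def induce_lexicon(src_embs, tgt_embs):
            src = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
            tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
            cost = 1.0 - src @ tgt.T            # cosine distance
            plan = sinkhorn(cost)
            return plan.argmax(axis=1)          # source word i -> target index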

  25. arXiv:2210.05098  [pdf, other]

    cs.CL cs.LG

    IsoVec: Controlling the Relative Isomorphism of Word Embedding Spaces

    Authors: Kelly Marchisio, Neha Verma, Kevin Duh, Philipp Koehn

    Abstract: The ability to extract high-quality translation dictionaries from monolingual word embedding spaces depends critically on the geometric similarity of the spaces -- their degree of "isomorphism." We address the root-cause of faulty cross-lingual mapping: that word embedding training resulted in the underlying spaces being non-isomorphic. We incorporate global measures of isomorphism directly into t…

    Submitted 4 July, 2023; v1 submitted 10 October, 2022; originally announced October 2022.

    Comments: Updated EMNLP 2022 camera-ready (citation correction; removed references to dimensionality reduction, which was not used here)

  26. arXiv:2210.05033  [pdf, other]

    cs.CL

    Multilingual Representation Distillation with Contrastive Learning

    Authors: Weiting Tan, Kevin Heffernan, Holger Schwenk, Philipp Koehn

    Abstract: Multilingual sentence representations from large models encode semantic information from two or more languages and can be used for different cross-lingual information retrieval and matching tasks. In this paper, we integrate contrastive learning into multilingual representation distillation and use it for quality estimation of parallel sentences (i.e., find semantically similar sentences that can…

    Submitted 30 April, 2023; v1 submitted 10 October, 2022; originally announced October 2022.

    Comments: EACL 2023

  27. arXiv:2208.11194  [pdf, other]

    cs.CL

    Bitext Mining for Low-Resource Languages via Contrastive Learning

    Authors: Weiting Tan, Philipp Koehn

    Abstract: Mining high-quality bitexts for low-resource languages is challenging. This paper shows that sentence representations of language models fine-tuned with multiple negatives ranking loss, a contrastive objective, help retrieve clean bitexts. Experiments show that parallel data mined from our approach substantially outperform the previous state-of-the-art method on the low-resource languages Khmer and Pa…

    Submitted 23 August, 2022; originally announced August 2022.
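
    A compact sketch of the multiple negatives ranking loss named in the abstract: within a batch of parallel pairs, each source sentence's true translation is the positive and every other target in the batch serves as a negative:

        import torch
        import torch.nn.functional as F

        def mnr_loss(src_embs, tgt_embs, scale=20.0):
            src = F.normalize(src_embs, dim=-1)
            tgt = F.normalize(tgt_embs, dim=-1)
            sim = scale * src @ tgt.T            # (batch, batch) similarities
            labels = torch.arange(sim.size(0), device=sim.device)
            return F.cross_entropy(sim, labels)  # diagonal entries are positives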

  28. arXiv:2207.04672  [pdf]

    cs.CL cs.AI

    No Language Left Behind: Scaling Human-Centered Machine Translation

    Authors: NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran , et al. (14 additional authors not shown)

    Abstract: Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200 language barrier while ensuring safe, high quality res…

    Submitted 25 August, 2022; v1 submitted 11 July, 2022; originally announced July 2022.

    Comments: 190 pages

    MSC Class: 68T50 ACM Class: I.2.7

  29. arXiv:2205.11416  [pdf, other]

    cs.CL

    The Importance of Being Parameters: An Intra-Distillation Method for Serious Gains

    Authors: Haoran Xu, Philipp Koehn, Kenton Murray

    Abstract: Recent model pruning methods have demonstrated the ability to remove redundant parameters without sacrificing model performance. Common methods remove redundant parameters according to the parameter sensitivity, a gradient-based measure reflecting the contribution of the parameters. In this paper, however, we argue that redundant parameters can be trained to make beneficial contributions. We first…

    Submitted 22 October, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

    Comments: Accepted at EMNLP 2022

  30. arXiv:2205.08533  [pdf, ps, other]

    cs.CL

    Consistent Human Evaluation of Machine Translation across Language Pairs

    Authors: Daniel Licht, Cynthia Gao, Janice Lam, Francisco Guzman, Mona Diab, Philipp Koehn

    Abstract: Obtaining meaningful quality scores for machine translation systems through human evaluation remains a challenge given the high variability between human evaluators, partly due to subjective expectations for translation quality for different language pairs. We propose a new metric called XSTS that is more focused on semantic equivalence and a cross-lingual calibration method that enables more cons…

    Submitted 17 May, 2022; originally announced May 2022.

    Comments: 10 pages
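
    The calibration method is cut off above; as a loose illustration (an assumption, not necessarily the paper's procedure), one simple scheme has every evaluator also score a shared calibration set and then shifts each evaluator's scores so their calibration-set means agree:

        def calibrate(item_scores, calib_scores_by_rater):
            # calib_scores_by_rater: {rater: scores on the shared calibration set}
            total = sum(sum(v) for v in calib_scores_by_rater.values())
            count = sum(len(v) for v in calib_scores_by_rater.values())
            global_mean = total / count
            offsets = {r: global_mean - sum(v) / len(v)
                       for r, v in calib_scores_by_rater.items()}
            # item_scores: list of (rater, raw_score) on the evaluation items
            return [raw + offsets[r] for r, raw in item_scores]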

  31. Learn To Remember: Transformer with Recurrent Memory for Document-Level Machine Translation

    Authors: Yukun Feng, Feng Li, Ziang Song, Boyuan Zheng, Philipp Koehn

    Abstract: The Transformer architecture has led to significant gains in machine translation. However, most studies focus on only sentence-level translation without considering the context dependency within documents, leading to the inadequacy of document-level coherence. Some recent research tried to mitigate this issue by introducing an additional context encoder or translating with multiple sentences or ev…

    Submitted 3 May, 2022; originally announced May 2022.

    Comments: Accepted by NAACL-2022 Findings

    Journal ref: Findings of the Association for Computational Linguistics: NAACL 2022, 1409--1420

  32. arXiv:2203.13867  [pdf, other]

    cs.CL cs.LG

    Data Selection Curriculum for Neural Machine Translation

    Authors: Tasnim Mohiuddin, Philipp Koehn, Vishrav Chaudhary, James Cross, Shruti Bhosale, Shafiq Joty

    Abstract: Neural Machine Translation (NMT) models are typically trained on heterogeneous data that are concatenated and randomly shuffled. However, not all of the training data are equally useful to the model. Curriculum training aims to present the data to the NMT models in a meaningful order. In this work, we introduce a two-stage curriculum training framework for NMT where we fine-tune a base NMT model o…

    Submitted 25 March, 2022; originally announced March 2022.

  33. arXiv:2110.08250  [pdf, other]

    cs.CL cs.SD eess.AS

    Direct Simultaneous Speech-to-Speech Translation with Variational Monotonic Multihead Attention

    Authors: Xutai Ma, Hongyu Gong, Danni Liu, Ann Lee, Yun Tang, Peng-Jen Chen, Wei-Ning Hsu, Philipp Koehn, Juan Pino

    Abstract: We present a direct simultaneous speech-to-speech translation (Simul-S2ST) model in which the generation of translations is independent of intermediate text representations. Our approach leverages recent progress on direct speech-to-speech translation with discrete units, in which a sequence of discrete representations, instead of continuous spectrogram features, learned in an unsupervised m…

    Submitted 12 January, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

  34. arXiv:2110.07804  [pdf, other]

    cs.CL

    Alternative Input Signals Ease Transfer in Multilingual Machine Translation

    Authors: Simeng Sun, Angela Fan, James Cross, Vishrav Chaudhary, Chau Tran, Philipp Koehn, Francisco Guzman

    Abstract: Recent work in multilingual machine translation (MMT) has focused on the potential of positive transfer between languages, particularly cases where higher-resourced languages can benefit lower-resourced ones. While training an MMT model, the supervision signals learned from one language pair can be transferred to the other via the tokens shared by multiple source languages. However, the transfer i…

    Submitted 14 October, 2021; originally announced October 2021.

  35. arXiv:2110.05691  [pdf, other]

    cs.CL

    Doubly-Trained Adversarial Data Augmentation for Neural Machine Translation

    Authors: Weiting Tan, Shuoyang Ding, Huda Khayrallah, Philipp Koehn

    Abstract: Neural Machine Translation (NMT) models are known to suffer from noisy inputs. To make models robust, we generate adversarial augmentation samples that attack the model and preserve the source-side semantic meaning at the same time. To generate such samples, we propose a doubly-trained architecture that pairs two NMT models of opposite translation directions with a joint loss function, which combi…

    Submitted 11 October, 2021; originally announced October 2021.

  36. arXiv:2109.12640  [pdf, other]

    cs.CL

    An Analysis of Euclidean vs. Graph-Based Framing for Bilingual Lexicon Induction from Word Embedding Spaces

    Authors: Kelly Marchisio, Youngser Park, Ali Saad-Eldin, Anton Alyakin, Kevin Duh, Carey Priebe, Philipp Koehn

    Abstract: Much recent work in bilingual lexicon induction (BLI) views word embeddings as vectors in Euclidean space. As such, BLI is typically solved by finding a linear transformation that maps embeddings to a common space. Alternatively, word embeddings may be understood as nodes in a weighted graph. This framing allows us to examine a node's graph neighborhood without assuming a linear transform, and exp…

    Submitted 26 September, 2021; originally announced September 2021.

    Comments: EMNLP Findings 2021 Camera-Ready

  37. arXiv:2109.08724  [pdf, other]

    cs.CL

    The JHU-Microsoft Submission for WMT21 Quality Estimation Shared Task

    Authors: Shuoyang Ding, Marcin Junczys-Dowmunt, Matt Post, Christian Federmann, Philipp Koehn

    Abstract: This paper presents the JHU-Microsoft joint submission for the WMT 2021 quality estimation shared task. We only participate in Task 2 (post-editing effort estimation) of the shared task, focusing on target-side word-level quality estimation. The techniques we experimented with include Levenshtein Transformer training and data augmentation with a combination of forward, backward, round-trip transla…

    Submitted 17 September, 2021; originally announced September 2021.

    Comments: 7 pages, accepted to WMT21 (system description)

  38. arXiv:2109.05611  [pdf, other]

    cs.CL

    Levenshtein Training for Word-level Quality Estimation

    Authors: Shuoyang Ding, Marcin Junczys-Dowmunt, Matt Post, Philipp Koehn

    Abstract: We propose a novel scheme to use the Levenshtein Transformer to perform the task of word-level quality estimation. A Levenshtein Transformer is a natural fit for this task: trained to perform decoding in an iterative manner, a Levenshtein Transformer can learn to post-edit without explicit supervision. To further minimize the mismatch between the translation task and the word-level QE task, we pro…

    Submitted 15 September, 2021; v1 submitted 12 September, 2021; originally announced September 2021.

    Comments: 10 pages, 1 figure, Accepted to EMNLP 2021. Fixed a minor typo in Table 2 (en-zh WMT20 best result)
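
    A sketch of the task framing: word-level QE tags can be derived by Levenshtein-aligning MT output to its post-edited version, tagging kept words OK and substituted or deleted words BAD. This illustrates label construction only, not the Levenshtein Transformer model itself:

        def qe_tags(mt_words, pe_words):
            m, n = len(mt_words), len(pe_words)
            # dp[i][j] = edit distance between mt[:i] and pe[:j]
            dp = [[0] * (n + 1) for _ in range(m + 1)]
            for i in range(m + 1):
                for j in range(n + 1):
                    if i == 0 or j == 0:
                        dp[i][j] = i + j
                    elif mt_words[i - 1] == pe_words[j - 1]:
                        dp[i][j] = dp[i - 1][j - 1]
                    else:
                        dp[i][j] = 1 + min(dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1])
            tags, i, j = [], m, n
            while i > 0:  # trace back, tagging each MT word
                if j > 0 and mt_words[i - 1] == pe_words[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
                    tags.append("OK"); i -= 1; j -= 1
                elif j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
                    tags.append("BAD"); i -= 1; j -= 1   # substitution
                elif dp[i][j] == dp[i - 1][j] + 1:
                    tags.append("BAD"); i -= 1           # word deleted by post-editor
                else:
                    j -= 1                               # word inserted by post-editor
            return list(reversed(tags))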

  39. arXiv:2108.03265  [pdf, other]

    cs.CL

    Facebook AI WMT21 News Translation Task Submission

    Authors: Chau Tran, Shruti Bhosale, James Cross, Philipp Koehn, Sergey Edunov, Angela Fan

    Abstract: We describe Facebook's multilingual model submission to the WMT2021 shared task on news translation. We participate in 14 language directions: English to and from Czech, German, Hausa, Icelandic, Japanese, Russian, and Chinese. To develop systems covering all these directions, we focus on multilingual models. We utilize data from all available sources: WMT, large-scale data mining, and in-domai…

    Submitted 6 August, 2021; originally announced August 2021.

  40. arXiv:2107.09186  [pdf, other]

    cs.CL

    Cross-Lingual BERT Contextual Embedding Space Mapping with Isotropic and Isometric Conditions

    Authors: Haoran Xu, Philipp Koehn

    Abstract: Typically, a linear orthogonal transformation mapping is learned by aligning static type-level embeddings to build a shared semantic space. In view of the analysis that contextual embeddings contain richer semantic features, we investigate a context-aware and dictionary-free mapping approach by leveraging parallel corpora. We illustrate that our contextual embedding space mapping significantly o…

    Submitted 19 July, 2021; originally announced July 2021.
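
    The "linear orthogonal transformation mapping" the abstract starts from is classically obtained with the Procrustes solution over a seed dictionary; a minimal sketch of that baseline (not the paper's contextual method):

        import numpy as np

        def procrustes(X, Y):
            # X, Y: (n_pairs, dim) seed-dictionary embeddings; returns the
            # orthogonal W minimizing ||X @ W - Y||_F.
            u, _, vt = np.linalg.svd(X.T @ Y)
            return u @ vt

        def nearest_target(src_vec, W, tgt_matrix):
            mapped = src_vec @ W
            sims = (tgt_matrix @ mapped) / (
                np.linalg.norm(tgt_matrix, axis=1) * np.linalg.norm(mapped) + 1e-8)
            return int(sims.argmax())  # index of the nearest target word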

  41. arXiv:2106.11891  [pdf, other]

    cs.CL

    On the Evaluation of Machine Translation for Terminology Consistency

    Authors: Md Mahfuz ibn Alam, Antonios Anastasopoulos, Laurent Besacier, James Cross, Matthias Gallé, Philipp Koehn, Vassilina Nikoulina

    Abstract: As neural machine translation (NMT) systems become an important part of professional translator pipelines, a growing body of work focuses on combining NMT with terminologies. In many scenarios and particularly in cases of domain adaptation, one expects the MT output to adhere to the constraints provided by a terminology. In this work, we propose metrics to measure the consistency of MT output with…

    Submitted 24 June, 2021; v1 submitted 22 June, 2021; originally announced June 2021.

    Comments: preprint

  42. arXiv:2105.15071  [pdf, other]

    cs.CL

    Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data

    Authors: Wei-Jen Ko, Ahmed El-Kishky, Adithya Renduchintala, Vishrav Chaudhary, Naman Goyal, Francisco Guzmán, Pascale Fung, Philipp Koehn, Mona Diab

    Abstract: The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages. Fortunately, some low-resource languages are linguistically related or similar to high-resource languages; these related languages may share many lexical or syntactic structures. In this work, we exploit this linguistic overlap to facilitate translating to and from a…

    Submitted 1 June, 2021; v1 submitted 31 May, 2021; originally announced May 2021.

    Comments: ACL 2021

  43. arXiv:2104.08721  [pdf, other]

    cs.CL

    Embedding-Enhanced Giza++: Improving Alignment in Low- and High-Resource Scenarios Using Embedding Space Geometry

    Authors: Kelly Marchisio, Conghao Xiong, Philipp Koehn

    Abstract: A popular natural language processing task decades ago, word alignment has been dominated until recently by GIZA++, a statistical method based on the 30-year-old IBM models. New methods that outperform GIZA++ primarily rely on large machine translation models, massively multilingual language models, or supervision from GIZA++ alignments themselves. We introduce Embedding-Enhanced GIZA++, and outperfor…

    Submitted 10 October, 2022; v1 submitted 18 April, 2021; originally announced April 2021.

    Comments: AMTA2022 Camera Ready

  44. arXiv:2104.08597  [pdf, other]

    cs.CL

    XLEnt: Mining a Large Cross-lingual Entity Dataset with Lexical-Semantic-Phonetic Word Alignment

    Authors: Ahmed El-Kishky, Adithya Renduchintala, James Cross, Francisco Guzmán, Philipp Koehn

    Abstract: Cross-lingual named-entity lexica are an important resource for multilingual NLP tasks such as machine translation and cross-lingual wikification. While knowledge bases contain a large number of entities in high-resource languages such as English and French, corresponding entities for lower-resource languages are often missing. To address this, we propose Lexical-Semantic-Phonetic Align (LSP-Align)…

    Submitted 10 September, 2021; v1 submitted 17 April, 2021; originally announced April 2021.

  45. arXiv:2104.05824  [pdf, other]

    cs.CL

    Evaluating Saliency Methods for Neural Language Models

    Authors: Shuoyang Ding, Philipp Koehn

    Abstract: Saliency methods are widely used to interpret neural network predictions, but different variants of saliency methods often disagree even on the interpretations of the same prediction made by the same model. In these cases, how do we identify when these interpretations are trustworthy enough to be used in analyses? To address this question, we conduct a comprehensive and quantitative evaluation of…

    Submitted 12 April, 2021; originally announced April 2021.

    Comments: 19 pages, 2 figures, Accepted for NAACL 2021
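
    One common variant of the kind of method the paper evaluates is gradient-times-input saliency; a hedged sketch, assuming a HuggingFace-style PyTorch model that accepts inputs_embeds:

        import torch

        def grad_x_input_saliency(model, embed_layer, input_ids, pos, token_id):
            # Attribute the logit of token_id at position pos to input tokens.
            embeds = embed_layer(input_ids).detach().requires_grad_(True)
            logits = model(inputs_embeds=embeds).logits
            logits[0, pos, token_id].backward()
            # Per-token saliency: |gradient . embedding|, summed over dims.
            return (embeds.grad * embeds).sum(-1).abs().squeeze(0)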

  46. arXiv:2103.06968  [pdf, other]

    cs.CL

    Learning Feature Weights using Reward Modeling for Denoising Parallel Corpora

    Authors: Gaurav Kumar, Philipp Koehn, Sanjeev Khudanpur

    Abstract: Large web-crawled corpora represent an excellent resource for improving the performance of Neural Machine Translation (NMT) systems across several language pairs. However, since these corpora are typically extremely noisy, their use is fairly limited. Current approaches to dealing with this problem mainly focus on filtering using heuristics or single features such as language model scores or bi-li…

    Submitted 11 March, 2021; originally announced March 2021.

    Comments: 10 pages, 2 figures

  47. arXiv:2103.06964  [pdf, other]

    cs.CL

    Learning Policies for Multilingual Training of Neural Machine Translation Systems

    Authors: Gaurav Kumar, Philipp Koehn, Sanjeev Khudanpur

    Abstract: Low-resource Multilingual Neural Machine Translation (MNMT) is typically tasked with improving the translation performance on one or more language pairs with the aid of high-resource language pairs. In this paper, we propose two simple search-based curricula -- orderings of the multilingual training data -- which help improve translation performance in conjunction with existing techniques such as…

    Submitted 11 March, 2021; originally announced March 2021.

    Comments: 7 pages, 2 figures

  48. arXiv:2103.02212  [pdf, other]

    cs.CL

    Zero-Shot Cross-Lingual Dependency Parsing through Contextual Embedding Transformation

    Authors: Haoran Xu, Philipp Koehn

    Abstract: Linear embedding transformation has been shown to be effective for zero-shot cross-lingual transfer tasks, achieving surprisingly promising results. However, cross-lingual embedding space mapping is usually studied for static word-level embeddings, where a space transformation is derived by aligning representations of translation pairs drawn from dictionaries. We move further from this…

    Submitted 3 March, 2021; originally announced March 2021.

    Journal ref: Adapt-NLP EACL 2021

  49. arXiv:2011.02048  [pdf, other]

    cs.CL

    SimulMT to SimulST: Adapting Simultaneous Text Translation to End-to-End Simultaneous Speech Translation

    Authors: Xutai Ma, Juan Pino, Philipp Koehn

    Abstract: Simultaneous text translation and end-to-end speech translation have recently made great progress, but little work has combined these tasks. We investigate how to adapt simultaneous text translation methods such as wait-k and monotonic multihead attention to end-to-end simultaneous speech translation by introducing a pre-decision module. A detailed analysis is provided on the latency-quali…

    Submitted 3 November, 2020; originally announced November 2020.
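
    A sketch of the wait-k policy mentioned in the abstract: read k source tokens, then alternate one target write per additional source read, and finish the target once the source ends. decode_step is a hypothetical stand-in for the underlying MT model:

        def wait_k_translate(k, source_tokens, decode_step, max_len=200):
            src, tgt = [], []
            for tok in source_tokens:                     # READ
                src.append(tok)
                if len(src) >= k and len(tgt) < max_len:
                    tgt.append(decode_step(src, tgt))     # WRITE one token
            while len(tgt) < max_len and (not tgt or tgt[-1] != "</s>"):
                tgt.append(decode_step(src, tgt))         # drain after source ends
            return tgt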

  50. arXiv:2011.00033  [pdf, other]

    cs.CL

    Streaming Simultaneous Speech Translation with Augmented Memory Transformer

    Authors: Xutai Ma, Yongqiang Wang, Mohammad Javad Dousti, Philipp Koehn, Juan Pino

    Abstract: Transformer-based models have achieved state-of-the-art performance on speech translation tasks. However, the model architecture is not efficient enough for streaming scenarios since self-attention is computed over an entire input sequence and the computational cost grows quadratically with the length of the input sequence. Nevertheless, most of the previous work on simultaneous speech translation…

    Submitted 30 October, 2020; originally announced November 2020.