Showing 1–12 of 12 results for author: Moghe, N

Searching in archive cs.
  1. arXiv:2503.10267

    cs.CL

    An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)

    Authors: Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Mateusz Klimaszewski, Ville Komulainen, Andrey Kutuzov, Joona Kytöniemi, Veronika Laippala, Petter Mæhlum, Bhavitvya Malik, Farrokh Mehryary, Vladislav Mikhailov, Nikita Moghe, Amanda Myntti, Dayyán O'Brien, Stephan Oepen, et al. (10 additional authors not shown)

    Abstract: Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 langu…

    Submitted 4 June, 2025; v1 submitted 13 March, 2025; originally announced March 2025.

    Comments: ACL'2025 Main Proceedings

  2. arXiv:2408.15366

    cs.CL

    Pitfalls and Outlooks in Using COMET

    Authors: Vilém Zouhar, Pinzhen Chen, Tsz Kin Lam, Nikita Moghe, Barry Haddow

    Abstract: The COMET metric has blazed a trail in the machine translation community, given its strong correlation with human judgements of translation quality. Its success stems from being a modified pre-trained multilingual model finetuned for quality assessment. However, it being a machine learning model also gives rise to a new set of pitfalls that may not be widely known. We investigate these unexpected…

    Submitted 30 September, 2024; v1 submitted 27 August, 2024; originally announced August 2024.

  3. arXiv:2401.16313

    cs.CL

    Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets

    Authors: Nikita Moghe, Arnisa Fazla, Chantal Amrhein, Tom Kocmi, Mark Steedman, Alexandra Birch, Rico Sennrich, Liane Guillou

    Abstract: Recent machine translation (MT) metrics calibrate their effectiveness by correlating with human judgement but without any insights about their behaviour across different error types. Challenge sets are used to probe specific dimensions of metric behaviour but there are very few such datasets and they either focus on a limited number of phenomena or a limited number of language pairs. We introduce…

    Submitted 29 January, 2024; originally announced January 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2210.15615

  4. arXiv:2311.09796

    cs.CL cs.AI

    Interpreting User Requests in the Context of Natural Language Standing Instructions

    Authors: Nikita Moghe, Patrick Xia, Jacob Andreas, Jason Eisner, Benjamin Van Durme, Harsh Jhamtani

    Abstract: Users of natural language interfaces, generally powered by Large Language Models (LLMs), often must repeat their preferences each time they make a similar request. We describe an approach to LLM-based dialogue modeling in which persistent user constraints and preferences -- collectively termed standing instructions -- are provided as additional context for such interfaces. For example, when a user states "I'm h…

    Submitted 7 March, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

    Comments: Updated with results from LLaMA-2

  5. arXiv:2311.01153

    cs.CL

    ACES: Translation Accuracy Challenge Sets at WMT 2023

    Authors: Chantal Amrhein, Nikita Moghe, Liane Guillou

    Abstract: We benchmark the performance of segment-level metrics submitted to WMT 2023 using the ACES Challenge Set (Amrhein et al., 2022). The challenge set consists of 36K examples representing challenges from 68 phenomena and covering 146 language pairs. The phenomena range from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge. For each met…

    Submitted 2 November, 2023; originally announced November 2023.

    Comments: Camera Ready WMT 2023. arXiv admin note: text overlap with arXiv:2210.15615

  6. arXiv:2212.10455

    cs.CL

    MULTI3NLU++: A Multilingual, Multi-Intent, Multi-Domain Dataset for Natural Language Understanding in Task-Oriented Dialogue

    Authors: Nikita Moghe, Evgeniia Razumovskaia, Liane Guillou, Ivan Vulić, Anna Korhonen, Alexandra Birch

    Abstract: Task-oriented dialogue (TOD) systems have been widely deployed in many industries as they deliver more efficient customer support. These systems are typically constructed for a single domain or language and do not generalise well beyond this. To support work on Natural Language Understanding (NLU) in TOD across multiple languages and domains simultaneously, we constructed MULTI3NLU++, a multilingu…

    Submitted 19 June, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: ACL 2023 (Findings) Camera Ready

  7. arXiv:2212.10297

    cs.CL cs.AI

    Extrinsic Evaluation of Machine Translation Metrics

    Authors: Nikita Moghe, Tom Sherborne, Mark Steedman, Alexandra Birch

    Abstract: Automatic machine translation (MT) metrics are widely used to distinguish the translation qualities of machine translation systems across relatively large test sets (system-level evaluation). However, it is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level (segment-level evaluation). In this paper, we investigate how useful MT…

    Submitted 18 June, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: ACL 2023 Camera Ready

  8. arXiv:2210.15615

    cs.CL

    ACES: Translation Accuracy Challenge Sets for Evaluating Machine Translation Metrics

    Authors: Chantal Amrhein, Nikita Moghe, Liane Guillou

    Abstract: As machine translation (MT) metrics improve their correlation with human judgement every year, it is crucial to understand the limitations of such metrics at the segment level. Specifically, it is important to investigate metric behaviour when facing accuracy errors in MT because these can have dangerous consequences in certain contexts (e.g., legal, medical). We curate ACES, a translation accurac…

    Submitted 6 December, 2022; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: preprint for WMT 2022 with updated tables

    ACM Class: I.2.7

  9. arXiv:2109.13620

    cs.CL

    Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking

    Authors: Nikita Moghe, Mark Steedman, Alexandra Birch

    Abstract: Recent progress in task-oriented neural dialogue systems is largely focused on a handful of languages, as annotation of training data is tedious and expensive. Machine translation has been used to make systems multilingual, but this can introduce a pipeline of errors. Another promising solution is using cross-lingual transfer learning through pretrained multilingual models. Existing methods train…

    Submitted 28 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021 Camera Ready

  10. arXiv:2005.14315

    cs.CL

    On Incorporating Structural Information to improve Dialogue Response Generation

    Authors: Nikita Moghe, Priyesh Vijayan, Balaraman Ravindran, Mitesh M. Khapra

    Abstract: We consider the task of generating dialogue responses from background knowledge comprising domain-specific resources. Specifically, given a conversation around a movie, the task is to generate the next response based on background knowledge about the movie such as the plot, review, Reddit comments, etc. This requires capturing structural, sequential and semantic information from the conversation…

    Submitted 28 May, 2020; originally announced May 2020.

  11. arXiv:1809.08205

    cs.CL

    Towards Exploiting Background Knowledge for Building Conversation Systems

    Authors: Nikita Moghe, Siddhartha Arora, Suman Banerjee, Mitesh M. Khapra

    Abstract: Existing dialog datasets contain a sequence of utterances and responses without any explicit background knowledge associated with them. This has resulted in the development of models which treat conversation as a sequence-to-sequence generation task (i.e., given a sequence of utterances, generate the response sequence). This is not only an overly simplistic view of conversation but it is also emphati…

    Submitted 21 September, 2018; originally announced September 2018.

    Comments: Camera Ready EMNLP 2018

  12. arXiv:1806.05997

    cs.CL

    A Dataset for Building Code-Mixed Goal Oriented Conversation Systems

    Authors: Suman Banerjee, Nikita Moghe, Siddhartha Arora, Mitesh M. Khapra

    Abstract: There is an increasing demand for goal-oriented conversation systems which can assist users in various day-to-day activities such as booking tickets, restaurant reservations, shopping, etc. Most of the existing datasets for building such conversation systems focus on monolingual conversations and there is hardly any work on multilingual and/or code-mixed conversations. Such datasets and systems th…

    Submitted 15 June, 2018; originally announced June 2018.

    Comments: 15 pages, 2 figures, 10 tables, Accepted in COLING - 2018