
Showing 1–24 of 24 results for author: Mohiuddin, T

  1. arXiv:2603.16397  [pdf, ps, other]

    cs.CL cs.AI

    Fanar 2.0: Arabic Generative AI Stack

    Authors: Fanar Team, Ummar Abbas, Mohammad Shahmeer Ahmad, Minhaj Ahmad, Abdulaziz Al-Homaid, Anas Al-Nuaimi, Enes Altinisik, Ehsaneddin Asgari, Sanjay Chawla, Shammur Chowdhury, Fahim Dalvi, Kareem Darwish, Nadir Durrani, Mohamed Elfeky, Ahmed Elmagarmid, Mohamed Eltabakh, Asim Ersoy, Masoomali Fatehkia, Mohammed Qusay Hashim, Majd Hawasly, Mohamed Hefeeda, Mus'ab Husaini, Keivin Isufaj, Soon-Gyo Jung, Houssam Lachemat , et al. (12 additional authors not shown)

    Abstract: We present Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, with Arabic having on…

    Submitted 17 March, 2026; originally announced March 2026.

  2. arXiv:2603.06687  [pdf, ps, other]

    cs.CV cs.CL cs.ET cs.MM cs.RO

    TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings

    Authors: Azmine Toushik Wasi, Shahriyar Zaman Ridoy, Koushik Ahamed Tonmoy, Kinga Tshering, S. M. Muhtasimul Hasan, Wahid Faisal, Tasnim Mohiuddin, Md Rizwan Parvez

    Abstract: Geo-temporal understanding, the ability to infer location, time, and contextual properties from visual input alone, underpins applications such as disaster management, traffic planning, embodied navigation, world modeling, and geography education. Although recent vision-language models (VLMs) have advanced image geo-localization using cues like landmarks and road signs, their ability to reason abo…

    Submitted 4 March, 2026; originally announced March 2026.

    Comments: 66 pages. In review

  3. arXiv:2601.13260  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Stop Taking Tokenizers for Granted: They Are Core Design Decisions in Large Language Models

    Authors: Sawsan Alqahtani, Mir Tafseer Nayeem, Md Tahmid Rahman Laskar, Tasnim Mohiuddin, M Saiful Bari

    Abstract: Tokenization underlies every large language model, yet it remains an under-theorized and inconsistently designed component. Common subword approaches such as Byte Pair Encoding (BPE) offer scalability but often misalign with linguistic structure, amplify bias, and waste capacity across languages and domains. This paper reframes tokenization as a core modeling decision rather than a preprocessing s…

    Submitted 23 January, 2026; v1 submitted 19 January, 2026; originally announced January 2026.

    Comments: Accepted to EACL 2026 (long, main). The first two authors contributed equally

  4. arXiv:2510.14305  [pdf, ps, other]

    cs.CL

    MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning

    Authors: Mahbub E Sobhani, Md. Faiyaz Abdullah Sayeedi, Tasnim Mohiuddin, Md Mofijul Islam, Swakkhar Shatabda

    Abstract: Mathematical reasoning remains one of the most challenging domains for large language models (LLMs), requiring not only linguistic understanding but also structured logical deduction and numerical precision. While recent LLMs demonstrate strong general-purpose reasoning abilities, their mathematical competence across diverse languages remains underexplored. Existing benchmarks primarily focus on E…

    Submitted 24 January, 2026; v1 submitted 16 October, 2025; originally announced October 2025.

    Comments: Accepted for publication in Findings of EACL 2026

  5. arXiv:2510.09947  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation

    Authors: Mir Tafseer Nayeem, Sawsan Alqahtani, Md Tahmid Rahman Laskar, Tasnim Mohiuddin, M Saiful Bari

    Abstract: Tokenization is a crucial but under-evaluated step in large language models (LLMs). The standard metric, fertility (the average number of tokens per word), captures compression efficiency but obscures how vocabularies are allocated across languages and domains. We analyze six widely used tokenizers across seven languages and two domains, finding stable fertility for English, high fertility for Chi…

    Submitted 25 October, 2025; v1 submitted 10 October, 2025; originally announced October 2025.

    Comments: NeurIPS 2025 Workshop

  6. arXiv:2510.07877  [pdf, ps, other]

    cs.CL

    Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains

    Authors: Md. Faiyaz Abdullah Sayeedi, Md. Mahbub Alam, Subhey Sadi Rahman, Md. Adnanul Islam, Jannatul Ferdous Deepti, Tasnim Mohiuddin, Md Mofijul Islam, Swakkhar Shatabda

    Abstract: The rise of Large Language Models (LLMs) has redefined Machine Translation (MT), enabling context-aware and fluent translations across hundreds of languages and textual domains. Despite their remarkable capabilities, LLMs often exhibit uneven performance across language families and specialized domains. Moreover, recent evidence reveals that these models can encode and amplify different biases pre…

    Submitted 31 October, 2025; v1 submitted 9 October, 2025; originally announced October 2025.

  7. arXiv:2509.11425  [pdf, ps, other]

    cs.SD cs.AI cs.CL eess.AS

    FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs

    Authors: Md Mubtasim Ahasan, Rafat Hasan Khan, Tasnim Mohiuddin, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Amin Ahsan Ali, Md Mofijul Islam, A K M Mahbubur Rahman

    Abstract: Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language model…

    Submitted 29 September, 2025; v1 submitted 14 September, 2025; originally announced September 2025.

  8. arXiv:2501.13944  [pdf, other]

    cs.CL cs.AI

    Fanar: An Arabic-Centric Multimodal Generative AI Platform

    Authors: Fanar Team, Ummar Abbas, Mohammad Shahmeer Ahmad, Firoj Alam, Enes Altinisik, Ehsannedin Asgari, Yazan Boshmaf, Sabri Boughorbel, Sanjay Chawla, Shammur Chowdhury, Fahim Dalvi, Kareem Darwish, Nadir Durrani, Mohamed Elfeky, Ahmed Elmagarmid, Mohamed Eltabakh, Masoomali Fatehkia, Anastasios Fragkopoulos, Maram Hasanain, Majd Hawasly, Mus'ab Husaini, Soon-Gyo Jung, Ji Kim Lucas, Walid Magdy, Safa Messaoud , et al. (17 additional authors not shown)

    Abstract: We present Fanar, a platform for Arabic-centric multimodal generative AI systems, that supports language, speech and image generation tasks. At the heart of Fanar are Fanar Star and Fanar Prime, two highly capable Arabic Large Language Models (LLMs) that are best in the class on well established benchmarks for similar sized models. Fanar Star is a 7B (billion) parameter model that was trained from…

    Submitted 18 January, 2025; originally announced January 2025.

    ACM Class: I.2.0; D.2.0

  9. arXiv:2412.18274  [pdf, other]

    cs.CL cs.AI

    GenAI Content Detection Task 2: AI vs. Human -- Academic Essay Authenticity Challenge

    Authors: Shammur Absar Chowdhury, Hind Almerekhi, Mucahid Kutlu, Kaan Efe Keles, Fatema Ahmad, Tasnim Mohiuddin, George Mikros, Firoj Alam

    Abstract: This paper presents a comprehensive overview of the first edition of the Academic Essay Authenticity Challenge, organized as part of the GenAI Content Detection shared tasks collocated with COLING 2025. This challenge focuses on detecting machine-generated vs. human-authored essays for academic purposes. The task is defined as follows: "Given an essay, identify whether it is generated by a machine…

    Submitted 24 December, 2024; originally announced December 2024.

    Comments: AI Generated Content, Academic Essay, LLMs, Arabic, English

    MSC Class: 68T50 ACM Class: F.2.2; I.2.7

  10. arXiv:2410.15017  [pdf, ps, other]

    cs.CL cs.AI cs.SD eess.AS

    DM-Codec: Distilling Multimodal Representations for Speech Tokenization

    Authors: Md Mubtasim Ahasan, Md Fahim, Tasnim Mohiuddin, A K M Mahbubur Rahman, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Md Mofijul Islam, Amin Ahsan Ali

    Abstract: Recent advancements in speech-language models have yielded significant improvements in speech tokenization and synthesis. However, effectively mapping the complex, multidimensional attributes of speech into discrete tokens remains challenging. This process demands acoustic, semantic, and contextual information for precise speech representations. Existing speech representations generally fall into…

    Submitted 29 September, 2025; v1 submitted 19 October, 2024; originally announced October 2024.

    Comments: Accepted at EMNLP 2025

  11. arXiv:2203.13867  [pdf, other]

    cs.CL cs.LG

    Data Selection Curriculum for Neural Machine Translation

    Authors: Tasnim Mohiuddin, Philipp Koehn, Vishrav Chaudhary, James Cross, Shruti Bhosale, Shafiq Joty

    Abstract: Neural Machine Translation (NMT) models are typically trained on heterogeneous data that are concatenated and randomly shuffled. However, not all of the training data are equally useful to the model. Curriculum training aims to present the data to the NMT models in a meaningful order. In this work, we introduce a two-stage curriculum training framework for NMT where we fine-tune a base NMT model o…

    Submitted 25 March, 2022; originally announced March 2022.

  12. arXiv:2106.05141  [pdf, other]

    cs.CL

    AUGVIC: Exploiting BiText Vicinity for Low-Resource NMT

    Authors: Tasnim Mohiuddin, M Saiful Bari, Shafiq Joty

    Abstract: The success of Neural Machine Translation (NMT) largely depends on the availability of large bitext training corpora. Due to the lack of such large corpora in low-resource language pairs, NMT systems often exhibit poor performance. Extra relevant monolingual data often helps, but acquiring it could be quite expensive, especially for low-resource languages. Moreover, domain mismatch between bitext…

    Submitted 9 June, 2021; originally announced June 2021.

    Comments: ACL-2021 accepted paper

  13. arXiv:2004.14626  [pdf, other]

    cs.CL

    Rethinking Coherence Modeling: Synthetic vs. Downstream Tasks

    Authors: Tasnim Mohiuddin, Prathyusha Jwalapuram, Xiang Lin, Shafiq Joty

    Abstract: Although coherence modeling has come a long way in developing novel models, their evaluation on downstream applications for which they are purportedly developed has largely been neglected. With the advancements made by neural approaches in applications such as machine translation (MT), summarization and dialog systems, the need for coherence evaluation of these tasks is now more crucial than ever.…

    Submitted 13 February, 2021; v1 submitted 30 April, 2020; originally announced April 2020.

    Comments: Accepted paper at EACL-21

  14. arXiv:2004.13889  [pdf, other]

    cs.CL cs.LG

    LNMap: Departures from Isomorphic Assumption in Bilingual Lexicon Induction Through Non-Linear Mapping in Latent Space

    Authors: Tasnim Mohiuddin, M Saiful Bari, Shafiq Joty

    Abstract: Most of the successful and predominant methods for bilingual lexicon induction (BLI) are mapping-based, where a linear mapping function is learned with the assumption that the word embedding spaces of different languages exhibit similar geometric structures (i.e., approximately isomorphic). However, several recent studies have criticized this simplified assumption showing that it does not hold in…

    Submitted 21 October, 2020; v1 submitted 28 April, 2020; originally announced April 2020.

    Comments: EMNLP 2020 accepted paper

  15. arXiv:2004.13240  [pdf, other]

    cs.CL cs.LG

    UXLA: A Robust Unsupervised Data Augmentation Framework for Zero-Resource Cross-Lingual NLP

    Authors: M Saiful Bari, Tasnim Mohiuddin, Shafiq Joty

    Abstract: Transfer learning has yielded state-of-the-art (SoTA) results in many supervised NLP tasks. However, annotated data for every target task in every target language is rare, especially for low-resource languages. We propose UXLA, a novel unsupervised data augmentation framework for zero-resource transfer learning scenarios. In particular, UXLA aims to solve cross-lingual adaptation problems from a s…

    Submitted 26 June, 2021; v1 submitted 27 April, 2020; originally announced April 2020.

    Comments: ACL-2021 accepted paper

  16. arXiv:1909.00349  [pdf, other]

    cs.CL cs.LG stat.ML

    A Unified Neural Coherence Model

    Authors: Han Cheol Moon, Tasnim Mohiuddin, Shafiq Joty, Xu Chi

    Abstract: Recently, neural approaches to coherence modeling have achieved state-of-the-art results in several evaluation tasks. However, we show that most of these models often fail on harder tasks with more realistic application scenarios. In particular, the existing models underperform on tasks that require the model to be sensitive to local contexts such as candidate ranking in conversational dialogue an…

    Submitted 1 September, 2019; originally announced September 2019.

    Comments: To appear at EMNLP-IJCNLP 2019

  17. arXiv:1904.04116  [pdf, other]

    cs.CL cs.LG stat.ML

    Revisiting Adversarial Autoencoder for Unsupervised Word Translation with Cycle Consistency and Improved Training

    Authors: Tasnim Mohiuddin, Shafiq Joty

    Abstract: Adversarial training has shown impressive success in learning bilingual dictionary without any parallel data by mapping monolingual embeddings to a shared space. However, recent work has shown superior performance for non-adversarial methods in more challenging language pairs. In this work, we revisit adversarial autoencoder for unsupervised word translation and propose two novel extensions to it…

    Submitted 4 April, 2019; originally announced April 2019.

    Comments: Published in NAACL-HLT 2019

  18. arXiv:1904.04021  [pdf, other]

    cs.CL cs.LG stat.ML

    Adaptation of Hierarchical Structured Models for Speech Act Recognition in Asynchronous Conversation

    Authors: Tasnim Mohiuddin, Thanh-Tung Nguyen, Shafiq Joty

    Abstract: We address the problem of speech act recognition (SAR) in asynchronous conversations (forums, emails). Unlike synchronous conversations (e.g., meetings, phone), asynchronous domains lack large labeled datasets to train an effective SAR model. In this paper, we propose methods to effectively leverage abundant unlabeled conversational data and the available labeled data from synchronous domains. We…

    Submitted 1 April, 2019; originally announced April 2019.

    Comments: To appear in NAACL 2019

  19. arXiv:1805.02275  [pdf, other]

    cs.CL

    Coherence Modeling of Asynchronous Conversations: A Neural Entity Grid Approach

    Authors: Tasnim Mohiuddin, Shafiq Joty, Dat Tien Nguyen

    Abstract: We propose a novel coherence model for written asynchronous conversations (e.g., forums, emails), and show its applications in coherence assessment and thread reconstruction tasks. We conduct our research in two steps. First, we propose improvements to the recently proposed neural entity grid model by lexicalizing its entity transitions. Then, we extend the model to asynchronous conversations by i…

    Submitted 6 May, 2018; originally announced May 2018.

  20. arXiv:1008.4425  [pdf]

    cond-mat.mes-hall

    Charge carrier mobility degradation in graphene sheet under induced strain

    Authors: Raheel Shah, Tariq M. Mohiuddin

    Abstract: Impact of induced strain on charge carrier mobility is investigated for a monolayer graphene sheet. Mobility is computed within Born approximation by including impurity scattering, surface roughness effects and interaction with lattice phonons. Unlike its sSi counterpart, strained graphene shows a drop in mobility with increasing strain. Main reason for this effect is decrease in Fermi velocity du…

    Submitted 12 October, 2010; v1 submitted 25 August, 2010; originally announced August 2010.

    Comments: 9 pages, 5 figures. Version 2: corrected typos

  21. arXiv:0812.1538  [pdf, ps, other]

    cond-mat.mtrl-sci

    Uniaxial Strain in Graphene by Raman Spectroscopy: G peak splitting, Gruneisen Parameters and Sample Orientation

    Authors: T. M. G. Mohiuddin, A. Lombardo, R. R. Nair, A. Bonetti, G. Savini, R. Jalil, N. Bonini, D. M. Basko, C. Galiotis, N. Marzari, K. S. Novoselov, A. K. Geim, A. C. Ferrari

    Abstract: Graphene is the two-dimensional building block for carbon allotropes of every other dimensionality. Since its experimental discovery, graphene continues to attract enormous interest, in particular as a new kind of matter, in which electron transport is governed by a Dirac-like wave equation, and as a model system for studying electronic and phonon properties of other, more complex, graphitic mat…

    Submitted 8 December, 2008; originally announced December 2008.

    Journal ref: Phys. Rev. B, 79, 205433 (2009)

  22. arXiv:0810.4706  [pdf]

    cond-mat.mes-hall cond-mat.mtrl-sci

    Control of graphene's properties by reversible hydrogenation

    Authors: D. C. Elias, R. R. Nair, T. M. G. Mohiuddin, S. V. Morozov, P. Blake, M. P. Halsall, A. C. Ferrari, D. W. Boukhvalov, M. I. Katsnelson, A. K. Geim, K. S. Novoselov

    Abstract: Graphene - a monolayer of carbon atoms densely packed into a hexagonal lattice - has one of the strongest possible atomic bonds and can be viewed as a robust atomic-scale scaffold, to which other chemical species can be attached without destroying it. This notion of graphene as a giant flat molecule that can be altered chemically is supported by the observation of so-called graphene oxide, that…

    Submitted 26 October, 2008; originally announced October 2008.

    Journal ref: Science 323, 610-613 (2009)

  23. arXiv:0809.1162  [pdf]

    cond-mat.mes-hall cond-mat.mtrl-sci

    Effect of high-k environment on charge carrier mobility in graphene

    Authors: L. A. Ponomarenko, R. Yang, T. M. Mohiuddin, S. M. Morozov, A. A. Zhukov, F. Schedin, E. W. Hill, K. S. Novoselov, M. I. Katsnelson, A. K. Geim

    Abstract: It is widely assumed that the dominant source of scattering in graphene is charged impurities in a substrate. We have tested this conjecture by studying graphene placed on various substrates and in high-k media. Unexpectedly, we have found no significant changes in carrier mobility either for different substrates or by using glycerol, ethanol and water as a top dielectric layer. This suggests th…

    Submitted 8 May, 2009; v1 submitted 6 September, 2008; originally announced September 2008.

    Comments: further experiments proving the point are reported in the final version

    Journal ref: Phys. Rev. Lett. 102, 206603 (2009)

  24. Quantum-Hall activation gaps in graphene

    Authors: A. J. M. Giesbers, U. Zeitler, M. I. Katsnelson, L. A. Ponomarenko, T. M. G. Mohiuddin, J. C. Maan

    Abstract: We have measured the quantum-Hall activation gaps in graphene at filling factors $\nu=2$ and $\nu=6$ for magnetic fields up to 32 T and temperatures from 4 K to 300 K. The $\nu=6$ gap can be described by thermal excitation to broadened Landau levels with a width of 400 K. In contrast, the gap measured at $\nu=2$ is strongly temperature and field dependent and approaches the expected value for sharp Land…

    Submitted 12 October, 2007; v1 submitted 19 June, 2007; originally announced June 2007.

    Comments: 4 pages, 4 figures, updated version after review, accepted for PRL

    Journal ref: Phys. Rev. Lett. 99, 206803 (2007)