-
Fanar 2.0: Arabic Generative AI Stack
Authors:
Fanar Team,
Ummar Abbas,
Mohammad Shahmeer Ahmad,
Minhaj Ahmad,
Abdulaziz Al-Homaid,
Anas Al-Nuaimi,
Enes Altinisik,
Ehsaneddin Asgari,
Sanjay Chawla,
Shammur Chowdhury,
Fahim Dalvi,
Kareem Darwish,
Nadir Durrani,
Mohamed Elfeky,
Ahmed Elmagarmid,
Mohamed Eltabakh,
Asim Ersoy,
Masoomali Fatehkia,
Mohammed Qusay Hashim,
Majd Hawasly,
Mohamed Hefeeda,
Mus'ab Husaini,
Keivin Isufaj,
Soon-Gyo Jung,
Houssam Lachemat,
et al. (12 additional authors not shown)
Abstract:
We present Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, with Arabic having only ~0.5% of web data despite 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre-training, and model merging to achieve substantial gains within these constraints. At the core is Fanar-27B, continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens across three data recipes. Despite using 8x fewer pre-training tokens than Fanar 1.0, it delivers substantial benchmark improvements: Arabic knowledge (+9.1 pts), language (+7.3 pts), dialects (+3.5 pts), and English capability (+7.6 pts). Beyond the core LLM, Fanar 2.0 introduces a rich stack of new capabilities. FanarGuard is a state-of-the-art 4B bilingual moderation filter for Arabic safety and cultural alignment. The speech family Aura gains a long-form ASR model for hours-long audio. Oryx vision family adds Arabic-aware image and video understanding alongside culturally grounded image generation. An agentic tool-calling framework enables multi-step workflows. Fanar-Sadiq utilizes a multi-agent architecture for Islamic content. Fanar-Diwan provides classical Arabic poetry generation. FanarShaheen delivers LLM-powered bilingual translation. A redesigned multi-layer orchestrator coordinates all components through intent-aware routing and defense-in-depth safety validation. Taken together, Fanar 2.0 demonstrates that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.
Submitted 17 March, 2026;
originally announced March 2026.
-
TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings
Authors:
Azmine Toushik Wasi,
Shahriyar Zaman Ridoy,
Koushik Ahamed Tonmoy,
Kinga Tshering,
S. M. Muhtasimul Hasan,
Wahid Faisal,
Tasnim Mohiuddin,
Md Rizwan Parvez
Abstract:
Geo-temporal understanding, the ability to infer location, time, and contextual properties from visual input alone, underpins applications such as disaster management, traffic planning, embodied navigation, world modeling, and geography education. Although recent vision-language models (VLMs) have advanced image geo-localization using cues like landmarks and road signs, their ability to reason about temporal signals and physically grounded spatial cues remains limited. To address this gap, we introduce TimeSpot, a benchmark for evaluating real-world geo-temporal reasoning in VLMs. TimeSpot comprises 1,455 ground-level images from 80 countries and requires structured prediction of temporal attributes (season, month, time of day, daylight phase) and geographic attributes (continent, country, climate zone, environment type, latitude-longitude) directly from visual evidence. It also includes spatial-temporal reasoning tasks that test physical plausibility under real-world uncertainty. Evaluations of state-of-the-art open- and closed-source VLMs show low performance, particularly for temporal inference. While supervised fine-tuning yields improvements, results remain insufficient, highlighting the need for new methods to achieve robust, physically grounded geo-temporal understanding. TimeSpot is available at: https://TimeSpot-GT.github.io.
Submitted 4 March, 2026;
originally announced March 2026.
-
Stop Taking Tokenizers for Granted: They Are Core Design Decisions in Large Language Models
Authors:
Sawsan Alqahtani,
Mir Tafseer Nayeem,
Md Tahmid Rahman Laskar,
Tasnim Mohiuddin,
M Saiful Bari
Abstract:
Tokenization underlies every large language model, yet it remains an under-theorized and inconsistently designed component. Common subword approaches such as Byte Pair Encoding (BPE) offer scalability but often misalign with linguistic structure, amplify bias, and waste capacity across languages and domains. This paper reframes tokenization as a core modeling decision rather than a preprocessing step. We argue for a context-aware framework that integrates tokenizer and model co-design, guided by linguistic, domain, and deployment considerations. Standardized evaluation and transparent reporting are essential to make tokenization choices accountable and comparable. Treating tokenization as a core design problem, not a technical afterthought, can yield language technologies that are fairer, more efficient, and more adaptable.
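The greedy merge loop of BPE, which the abstract critiques, can be sketched in a few lines. The following is a toy illustration of standard BPE training on a tiny word-frequency table, not any specific tokenizer's implementation:

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words as tuples of characters with an end-of-word marker.
corpus = Counter(["low", "low", "lower", "newest", "newest", "newest", "widest"])
vocab = {tuple(word) + ("</w>",): freq for word, freq in corpus.items()}

merges = []
for _ in range(10):  # learn 10 merge rules
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # greedy: most frequent pair wins
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges[:3])
```

The purely frequency-driven merge order is exactly what the paper argues against treating as a given: it optimizes compression on the training distribution, with no awareness of morphology, language balance, or downstream use.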
Submitted 23 January, 2026; v1 submitted 19 January, 2026;
originally announced January 2026.
-
MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning
Authors:
Mahbub E Sobhani,
Md. Faiyaz Abdullah Sayeedi,
Tasnim Mohiuddin,
Md Mofijul Islam,
Swakkhar Shatabda
Abstract:
Mathematical reasoning remains one of the most challenging domains for large language models (LLMs), requiring not only linguistic understanding but also structured logical deduction and numerical precision. While recent LLMs demonstrate strong general-purpose reasoning abilities, their mathematical competence across diverse languages remains underexplored. Existing benchmarks primarily focus on English or a narrow subset of high-resource languages, leaving significant gaps in assessing multilingual and cross-lingual mathematical reasoning. To address this, we introduce MATHMIST, a parallel multilingual benchmark for mathematical problem solving and reasoning. MATHMIST encompasses 2,890 parallel Bangla-English gold standard artifacts, totaling approximately 30K aligned question-answer pairs across thirteen languages, representing extensive coverage of high-, medium-, and low-resource linguistic settings. The dataset captures linguistic variety, multiple types of problem settings, and solution synthesizing capabilities. We systematically evaluate a diverse suite of models, including open-source small and medium LLMs, proprietary systems, and multilingual-reasoning-focused models under zero-shot, chain-of-thought (CoT), perturbed reasoning, and code-switched reasoning paradigms. Our results reveal persistent deficiencies in LLMs' ability to perform consistent and interpretable mathematical reasoning across languages, with pronounced degradation in low-resource settings. All code and data are available on GitHub: https://github.com/mahbubhimel/MathMist
Submitted 24 January, 2026; v1 submitted 16 October, 2025;
originally announced October 2025.
-
Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation
Authors:
Mir Tafseer Nayeem,
Sawsan Alqahtani,
Md Tahmid Rahman Laskar,
Tasnim Mohiuddin,
M Saiful Bari
Abstract:
Tokenization is a crucial but under-evaluated step in large language models (LLMs). The standard metric, fertility (the average number of tokens per word), captures compression efficiency but obscures how vocabularies are allocated across languages and domains. We analyze six widely used tokenizers across seven languages and two domains, finding stable fertility for English, high fertility for Chinese, and little domain sensitivity. To address fertility's blind spots, we propose the Single Token Retention Rate (STRR), which measures the proportion of words preserved as single tokens. STRR reveals systematic prioritization of English, strong support for Chinese, and fragmentation in Hindi, offering an interpretable view of cross-lingual fairness. Our results show that STRR complements fertility and provides practical guidance for designing more equitable multilingual tokenizers.
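Both metrics are straightforward to compute given a tokenizer. A minimal sketch using a hypothetical toy tokenizer (the paper's analysis would plug in the six real tokenizers it studies):

```python
def fertility(words, tokenize):
    """Fertility: average number of tokens per word."""
    return sum(len(tokenize(w)) for w in words) / len(words)

def strr(words, tokenize):
    """Single Token Retention Rate: fraction of words kept as one token."""
    return sum(len(tokenize(w)) == 1 for w in words) / len(words)

# Hypothetical toy tokenizer: keeps in-vocabulary words whole,
# falls back to characters for everything else.
VOCAB = {"the", "cat", "sat"}
def toy_tokenize(word):
    return [word] if word in VOCAB else list(word)

words = ["the", "cat", "sat", "on", "mat"]
print(fertility(words, toy_tokenize))  # (1+1+1+2+3)/5 = 1.6
print(strr(words, toy_tokenize))       # 3/5 = 0.6
```

The example makes the complementarity visible: a tokenizer could match this fertility while retaining different words whole, which only STRR would expose.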
Submitted 25 October, 2025; v1 submitted 10 October, 2025;
originally announced October 2025.
-
Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains
Authors:
Md. Faiyaz Abdullah Sayeedi,
Md. Mahbub Alam,
Subhey Sadi Rahman,
Md. Adnanul Islam,
Jannatul Ferdous Deepti,
Tasnim Mohiuddin,
Md Mofijul Islam,
Swakkhar Shatabda
Abstract:
The rise of Large Language Models (LLMs) has redefined Machine Translation (MT), enabling context-aware and fluent translations across hundreds of languages and textual domains. Despite their remarkable capabilities, LLMs often exhibit uneven performance across language families and specialized domains. Moreover, recent evidence reveals that these models can encode and amplify different biases present in their training data, posing serious concerns for fairness, especially in low-resource languages. To address these gaps, we introduce Translation Tangles, a unified framework and dataset for evaluating the translation quality and fairness of open-source LLMs. Our approach benchmarks 24 bidirectional language pairs across multiple domains using different metrics. We further propose a hybrid bias detection pipeline that integrates rule-based heuristics, semantic similarity filtering, and LLM-based validation. We also introduce a high-quality, bias-annotated dataset based on human evaluations of 1,439 translation-reference pairs. The code and dataset are accessible on GitHub: https://github.com/faiyazabdullah/TranslationTangles
Submitted 31 October, 2025; v1 submitted 9 October, 2025;
originally announced October 2025.
-
FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs
Authors:
Md Mubtasim Ahasan,
Rafat Hasan Khan,
Tasnim Mohiuddin,
Aman Chadha,
Tariq Iqbal,
M Ashraful Amin,
Amin Ahsan Ali,
Md Mofijul Islam,
A K M Mahbubur Rahman
Abstract:
Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcasted representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology's applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance in LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. Results highlight the effectiveness of contextually and semantically guided tokenization for speech tokenization and downstream tasks. Code and pretrained models are available at https://github.com/mubtasimahasan/FuseCodec.
Submitted 29 September, 2025; v1 submitted 14 September, 2025;
originally announced September 2025.
-
Fanar: An Arabic-Centric Multimodal Generative AI Platform
Authors:
Fanar Team,
Ummar Abbas,
Mohammad Shahmeer Ahmad,
Firoj Alam,
Enes Altinisik,
Ehsaneddin Asgari,
Yazan Boshmaf,
Sabri Boughorbel,
Sanjay Chawla,
Shammur Chowdhury,
Fahim Dalvi,
Kareem Darwish,
Nadir Durrani,
Mohamed Elfeky,
Ahmed Elmagarmid,
Mohamed Eltabakh,
Masoomali Fatehkia,
Anastasios Fragkopoulos,
Maram Hasanain,
Majd Hawasly,
Mus'ab Husaini,
Soon-Gyo Jung,
Ji Kim Lucas,
Walid Magdy,
Safa Messaoud,
et al. (17 additional authors not shown)
Abstract:
We present Fanar, a platform for Arabic-centric multimodal generative AI systems that supports language, speech, and image generation tasks. At the heart of Fanar are Fanar Star and Fanar Prime, two highly capable Arabic Large Language Models (LLMs) that are best in class on well-established benchmarks for similarly sized models. Fanar Star is a 7B (billion) parameter model trained from scratch on nearly 1 trillion clean and deduplicated Arabic, English, and code tokens. Fanar Prime is a 9B parameter model continually trained from the Gemma-2 9B base model on the same 1-trillion-token set. Both models are concurrently deployed and designed to address different types of prompts, routed transparently through a custom-built orchestrator. The Fanar platform provides many other capabilities, including a customized Islamic Retrieval Augmented Generation (RAG) system for handling religious prompts and a Recency RAG for summarizing information about current or recent events that occurred after the pre-training data cut-off date. The platform provides additional cognitive capabilities, including in-house bilingual speech recognition that supports multiple Arabic dialects, and voice and image generation fine-tuned to better reflect regional characteristics. Finally, Fanar provides an attribution service that can be used to verify the authenticity of fact-based generated content.
The design, development, and implementation of Fanar was entirely undertaken at Hamad Bin Khalifa University's Qatar Computing Research Institute (QCRI) and was sponsored by Qatar's Ministry of Communications and Information Technology to enable sovereign AI technology development.
Submitted 18 January, 2025;
originally announced January 2025.
-
GenAI Content Detection Task 2: AI vs. Human -- Academic Essay Authenticity Challenge
Authors:
Shammur Absar Chowdhury,
Hind Almerekhi,
Mucahid Kutlu,
Kaan Efe Keles,
Fatema Ahmad,
Tasnim Mohiuddin,
George Mikros,
Firoj Alam
Abstract:
This paper presents a comprehensive overview of the first edition of the Academic Essay Authenticity Challenge, organized as part of the GenAI Content Detection shared tasks collocated with COLING 2025. This challenge focuses on detecting machine-generated vs. human-authored essays for academic purposes. The task is defined as follows: "Given an essay, identify whether it is generated by a machine or authored by a human.'' The challenge involves two languages: English and Arabic. During the evaluation phase, 25 teams submitted systems for English and 21 teams for Arabic, reflecting substantial interest in the task. Finally, seven teams submitted system description papers. The majority of submissions utilized fine-tuned transformer-based models, with one team employing Large Language Models (LLMs) such as Llama 2 and Llama 3. This paper outlines the task formulation, details the dataset construction process, and explains the evaluation framework. Additionally, we present a summary of the approaches adopted by participating teams. Nearly all submitted systems outperformed the n-gram-based baseline, with the top-performing systems achieving F1 scores exceeding 0.98 for both languages, indicating significant progress in the detection of machine-generated text.
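An n-gram baseline of the kind the systems were measured against could look roughly like this: a naive-Bayes classifier over character n-grams. This is a hypothetical stand-in for illustration, not the organizers' actual baseline code:

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NgramClassifier:
    """Naive Bayes over character n-grams with add-one smoothing
    (a minimal illustrative baseline, not the shared task's code)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}   # label -> Counter of n-grams
        self.totals = {}

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            c = self.counts.setdefault(label, Counter())
            c.update(char_ngrams(text, self.n))
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        self.vocab = set().union(*self.counts.values())

    def score(self, text, label):
        c, tot, V = self.counts[label], self.totals[label], len(self.vocab)
        # Log-likelihood under the label's smoothed n-gram distribution.
        return sum(math.log((c[g] + 1) / (tot + V))
                   for g in char_ngrams(text, self.n))

    def predict(self, text):
        return max(self.counts, key=lambda lab: self.score(text, lab))

# Tiny made-up training pair, purely for demonstration.
clf = NgramClassifier()
clf.fit(["the student wrote this essay by hand",
         "as an ai language model i generated this essay"],
        ["human", "machine"])
print(clf.predict("as an ai language model"))
```

That top systems reached F1 above 0.98 against this class of baseline is what the overview cites as evidence of progress.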
Submitted 24 December, 2024;
originally announced December 2024.
-
DM-Codec: Distilling Multimodal Representations for Speech Tokenization
Authors:
Md Mubtasim Ahasan,
Md Fahim,
Tasnim Mohiuddin,
A K M Mahbubur Rahman,
Aman Chadha,
Tariq Iqbal,
M Ashraful Amin,
Md Mofijul Islam,
Amin Ahsan Ali
Abstract:
Recent advancements in speech-language models have yielded significant improvements in speech tokenization and synthesis. However, effectively mapping the complex, multidimensional attributes of speech into discrete tokens remains challenging. This process demands acoustic, semantic, and contextual information for precise speech representations. Existing speech representations generally fall into two categories: acoustic tokens from audio codecs and semantic tokens from speech self-supervised learning models. Although recent efforts have unified acoustic and semantic tokens for improved performance, they overlook the crucial role of contextual representation in comprehensive speech modeling. Our empirical investigations reveal that the absence of contextual representations results in elevated Word Error Rate (WER) and Word Information Lost (WIL) scores in speech transcriptions. To address these limitations, we propose two novel distillation approaches: (1) a language model (LM)-guided distillation method that incorporates contextual information, and (2) a combined LM and self-supervised speech model (SM)-guided distillation technique that effectively distills multimodal representations (acoustic, semantic, and contextual) into a comprehensive speech tokenizer, termed DM-Codec. The DM-Codec architecture adopts a streamlined encoder-decoder framework with a Residual Vector Quantizer (RVQ) and incorporates the LM and SM during the training process. Experiments show DM-Codec significantly outperforms state-of-the-art speech tokenization models, reducing WER by up to 13.46%, WIL by 9.82%, and improving speech quality by 5.84% and intelligibility by 1.85% on the LibriSpeech benchmark dataset. Code, samples, and checkpoints are available at https://github.com/mubtasimahasan/DM-Codec.
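The WER and WIL scores reported above both derive from a word-level Levenshtein alignment: WER counts substitutions, deletions, and insertions against the reference length, while WIL is 1 minus the word information preserved, (H/N_ref)·(H/N_hyp), where H is the number of hits. A stdlib-only sketch (a toy example, not the paper's evaluation code):

```python
def edit_ops(ref, hyp):
    """Levenshtein alignment counts: (substitutions, deletions, insertions)."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # match/substitution
    # Backtrack one optimal path to count S, D, I.
    i, j, S, D, I = m, n, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            S += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            D += 1
            i -= 1
        else:
            I += 1
            j -= 1
    return S, D, I

def wer_wil(ref, hyp):
    """Word Error Rate and Word Information Lost (ref assumed non-empty)."""
    ref, hyp = ref.split(), hyp.split()
    S, D, I = edit_ops(ref, hyp)
    H = len(ref) - S - D  # hits
    wer = (S + D + I) / len(ref)
    wil = 1 - (H / len(ref)) * (H / len(hyp)) if hyp else 1.0
    return wer, wil

w, l = wer_wil("the cat sat on the mat", "the cat sat on mat")
print(w, l)  # one deletion out of six words: both metrics = 1/6
```

Lower values on both metrics correspond to the transcription improvements the abstract reports.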
Submitted 29 September, 2025; v1 submitted 19 October, 2024;
originally announced October 2024.
-
Data Selection Curriculum for Neural Machine Translation
Authors:
Tasnim Mohiuddin,
Philipp Koehn,
Vishrav Chaudhary,
James Cross,
Shruti Bhosale,
Shafiq Joty
Abstract:
Neural Machine Translation (NMT) models are typically trained on heterogeneous data that are concatenated and randomly shuffled. However, not all of the training data are equally useful to the model. Curriculum training aims to present the data to the NMT models in a meaningful order. In this work, we introduce a two-stage curriculum training framework for NMT where we fine-tune a base NMT model on subsets of data, selected by both deterministic scoring using pre-trained methods and online scoring that considers prediction scores of the emerging NMT model. Through comprehensive experiments on six language pairs comprising low- and high-resource languages from WMT'21, we have shown that our curriculum strategies consistently demonstrate better quality (up to +2.2 BLEU improvement) and faster convergence (approximately 50% fewer updates).
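The two-stage selection described above can be caricatured as: first keep a subset by a fixed offline score, then re-rank the survivors by the emerging model's own prediction scores. A schematic sketch with made-up scores standing in for the pretrained scorer and the NMT model (all names here are hypothetical, not the paper's implementation):

```python
import random

random.seed(0)
data = [f"sent-{i}" for i in range(100)]
# Stand-ins for the paper's two signals:
pretrained_score = {s: random.random() for s in data}  # offline: higher = cleaner pair
model_nll = {s: random.random() for s in data}         # online: higher = harder for model

def phase1_select(data, keep=0.5):
    """Deterministic stage: keep the top fraction by the offline score."""
    ranked = sorted(data, key=lambda s: pretrained_score[s], reverse=True)
    return ranked[: int(len(ranked) * keep)]

def phase2_select(subset, keep=0.5):
    """Online stage: re-rank survivors by the current model's score."""
    ranked = sorted(subset, key=lambda s: model_nll[s])  # easiest first
    return ranked[: int(len(ranked) * keep)]

stage1 = phase1_select(data)
stage2 = phase2_select(stage1)
print(len(stage1), len(stage2))  # 50 25
```

In the paper the online score is recomputed as the model trains, so the stage-two subset evolves; the sketch freezes it for clarity.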
Submitted 25 March, 2022;
originally announced March 2022.
-
AUGVIC: Exploiting BiText Vicinity for Low-Resource NMT
Authors:
Tasnim Mohiuddin,
M Saiful Bari,
Shafiq Joty
Abstract:
The success of Neural Machine Translation (NMT) largely depends on the availability of large bitext training corpora. Due to the lack of such large corpora in low-resource language pairs, NMT systems often exhibit poor performance. Extra relevant monolingual data often helps, but acquiring it could be quite expensive, especially for low-resource languages. Moreover, domain mismatch between bitext (train/test) and monolingual data might degrade the performance. To alleviate such issues, we propose AUGVIC, a novel data augmentation framework for low-resource NMT which exploits the vicinal samples of the given bitext without using any extra monolingual data explicitly. It can diversify the in-domain bitext data with finer level control. Through extensive experiments on four low-resource language pairs comprising data from different domains, we have shown that our method is comparable to the traditional back-translation that uses extra in-domain monolingual data. When we combine the synthetic parallel data generated from AUGVIC with the ones from the extra monolingual data, we achieve further improvements. We show that AUGVIC helps to attenuate the discrepancies between relevant and distant-domain monolingual data in traditional back-translation. To understand the contributions of different components of AUGVIC, we perform an in-depth framework analysis.
Submitted 9 June, 2021;
originally announced June 2021.
-
Rethinking Coherence Modeling: Synthetic vs. Downstream Tasks
Authors:
Tasnim Mohiuddin,
Prathyusha Jwalapuram,
Xiang Lin,
Shafiq Joty
Abstract:
Although coherence modeling has come a long way in developing novel models, their evaluation on downstream applications for which they are purportedly developed has largely been neglected. With the advancements made by neural approaches in applications such as machine translation (MT), summarization and dialog systems, the need for coherence evaluation of these tasks is now more crucial than ever. However, coherence models are typically evaluated only on synthetic tasks, which may not be representative of their performance in downstream applications. To investigate how representative the synthetic tasks are of downstream use cases, we conduct experiments on benchmarking well-known traditional and neural coherence models on synthetic sentence ordering tasks, and contrast this with their performance on three downstream applications: coherence evaluation for MT and summarization, and next utterance prediction in retrieval-based dialog. Our results demonstrate a weak correlation between the model performances in the synthetic tasks and the downstream applications, motivating alternate training and evaluation methods for coherence models.
Submitted 13 February, 2021; v1 submitted 30 April, 2020;
originally announced April 2020.
-
LNMap: Departures from Isomorphic Assumption in Bilingual Lexicon Induction Through Non-Linear Mapping in Latent Space
Authors:
Tasnim Mohiuddin,
M Saiful Bari,
Shafiq Joty
Abstract:
Most of the successful and predominant methods for bilingual lexicon induction (BLI) are mapping-based, where a linear mapping function is learned with the assumption that the word embedding spaces of different languages exhibit similar geometric structures (i.e., approximately isomorphic). However, several recent studies have criticized this simplified assumption showing that it does not hold in general even for closely related languages. In this work, we propose a novel semi-supervised method to learn cross-lingual word embeddings for BLI. Our model is independent of the isomorphic assumption and uses nonlinear mapping in the latent space of two independently trained auto-encoders. Through extensive experiments on fifteen (15) different language pairs (in both directions) comprising resource-rich and low-resource languages from two different datasets, we demonstrate that our method outperforms existing models by a good margin. Ablation studies show the importance of different model components and the necessity of non-linear mapping.
Submitted 21 October, 2020; v1 submitted 28 April, 2020;
originally announced April 2020.
-
UXLA: A Robust Unsupervised Data Augmentation Framework for Zero-Resource Cross-Lingual NLP
Authors:
M Saiful Bari,
Tasnim Mohiuddin,
Shafiq Joty
Abstract:
Transfer learning has yielded state-of-the-art (SoTA) results in many supervised NLP tasks. However, annotated data for every target task in every target language is rare, especially for low-resource languages. We propose UXLA, a novel unsupervised data augmentation framework for zero-resource transfer learning scenarios. In particular, UXLA aims to solve cross-lingual adaptation problems from a source language task distribution to an unknown target language task distribution, assuming no training label in the target language. At its core, UXLA performs simultaneous self-training with data augmentation and unsupervised sample selection. To show its effectiveness, we conduct extensive experiments on three diverse zero-resource cross-lingual transfer tasks. UXLA achieves SoTA results in all the tasks, outperforming the baselines by a good margin. With an in-depth framework dissection, we demonstrate the cumulative contributions of different components to its success.
Submitted 26 June, 2021; v1 submitted 27 April, 2020;
originally announced April 2020.
-
A Unified Neural Coherence Model
Authors:
Han Cheol Moon,
Tasnim Mohiuddin,
Shafiq Joty,
Xu Chi
Abstract:
Recently, neural approaches to coherence modeling have achieved state-of-the-art results in several evaluation tasks. However, we show that most of these models often fail on harder tasks with more realistic application scenarios. In particular, the existing models underperform on tasks that require the model to be sensitive to local contexts such as candidate ranking in conversational dialogue and in machine translation. In this paper, we propose a unified coherence model that incorporates sentence grammar, inter-sentence coherence relations, and global coherence patterns into a common neural framework. With extensive experiments on local and global discrimination tasks, we demonstrate that our proposed model outperforms existing models by a good margin, and establish a new state-of-the-art.
Submitted 1 September, 2019;
originally announced September 2019.
-
Revisiting Adversarial Autoencoder for Unsupervised Word Translation with Cycle Consistency and Improved Training
Authors:
Tasnim Mohiuddin,
Shafiq Joty
Abstract:
Adversarial training has shown impressive success in learning a bilingual dictionary without any parallel data by mapping monolingual embeddings into a shared space. However, recent work has shown superior performance for non-adversarial methods on more challenging language pairs. In this work, we revisit the adversarial autoencoder for unsupervised word translation and propose two novel extensions to it that yield more stable training and improved results. Our method includes regularization terms to enforce cycle consistency and input reconstruction, and pits the target encoders as adversaries against the corresponding discriminators. Extensive experiments with European, non-European, and low-resource languages show that our method is more robust and achieves better performance than recently proposed adversarial and non-adversarial approaches.
Submitted 4 April, 2019;
originally announced April 2019.
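The cycle-consistency idea in the abstract can be illustrated numerically: mapping source embeddings into the target space and back should reconstruct the input, and the reconstruction error is what the regularizer penalizes. The random embeddings and linear maps below are toy assumptions, not the paper's autoencoder.

```python
import numpy as np

# Illustrative sketch of a cycle-consistency term: round-tripping source
# embeddings through a forward map G (source -> target) and a backward map
# F (target -> source) should recover the input. The data and maps are toy
# assumptions; the real method learns G and F adversarially.

rng = np.random.default_rng(0)
d = 4
X = rng.normal(size=(10, d))   # toy source-language word embeddings

G = rng.normal(size=(d, d))    # forward map, source -> target
F = np.linalg.inv(G)           # a perfect back-map gives zero cycle loss

def cycle_loss(X, G, F):
    # Mean squared reconstruction error of the round trip X -> G -> F.
    X_cycle = X @ G @ F
    return float(np.mean((X_cycle - X) ** 2))

loss_perfect = cycle_loss(X, G, F)        # near zero: F inverts G exactly
loss_broken = cycle_loss(X, G, F + 0.1)   # positive: imperfect back-mapping
```

In training, this scalar would be added to the adversarial objective so that gradient updates push the two mappings toward being mutual inverses.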
-
Adaptation of Hierarchical Structured Models for Speech Act Recognition in Asynchronous Conversation
Authors:
Tasnim Mohiuddin,
Thanh-Tung Nguyen,
Shafiq Joty
Abstract:
We address the problem of speech act recognition (SAR) in asynchronous conversations (forums, emails). Unlike synchronous conversations (e.g., meetings, phone calls), asynchronous domains lack large labeled datasets for training an effective SAR model. In this paper, we propose methods to effectively leverage abundant unlabeled conversational data and the available labeled data from synchronous domains. We carry out our research in three main steps. First, we introduce a neural architecture based on hierarchical LSTMs and conditional random fields (CRFs) for SAR, and show that our method outperforms existing methods when trained on in-domain data only. Second, we improve our initial SAR models through semi-supervised learning, in the form of pretrained word embeddings learned from a large unlabeled conversational corpus. Finally, we employ adversarial training to improve the results further by leveraging the labeled data from synchronous domains and by explicitly modeling the distributional shift between the two domains.
Submitted 1 April, 2019;
originally announced April 2019.
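The CRF layer in an architecture like this decodes a whole sequence of speech-act tags jointly, trading off per-sentence scores against tag-transition scores via Viterbi. The sketch below shows only that decoding step; the tag set, emission scores, and transition scores are toy assumptions standing in for the hierarchical LSTM's outputs and the learned CRF parameters.

```python
# Minimal Viterbi decoder over speech-act tags, sketching the CRF layer
# that sits on top of a sentence encoder. Tags, emissions, and transitions
# below are illustrative assumptions, not the paper's learned parameters.

TAGS = ["Statement", "Question", "Response"]

def viterbi(emissions, transition):
    # emissions: list of {tag: score} per sentence.
    # transition[a][b]: score of tag a followed by tag b.
    # Returns the highest-scoring tag sequence.
    best = [{t: emissions[0][t] for t in TAGS}]
    back = []
    for i in range(1, len(emissions)):
        scores, ptr = {}, {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + transition[p][t])
            ptr[t] = prev
            scores[t] = best[-1][prev] + transition[prev][t] + emissions[i][t]
        best.append(scores)
        back.append(ptr)
    last = max(TAGS, key=lambda t: best[-1][t])
    path = [last]
    for ptr in reversed(back):   # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    return list(reversed(path))

emissions = [
    {"Statement": 0.1, "Question": 1.0, "Response": 0.0},
    {"Statement": 0.4, "Question": 0.1, "Response": 0.5},
]
transition = {a: {b: 0.0 for b in TAGS} for a in TAGS}
transition["Question"]["Response"] = 1.0   # questions tend to get responses

path = viterbi(emissions, transition)
```

Note how the transition bonus flips the second sentence from "Statement" (its highest emission would otherwise be close) to "Response": joint decoding is what lets the CRF capture such dialogue regularities.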
-
Coherence Modeling of Asynchronous Conversations: A Neural Entity Grid Approach
Authors:
Tasnim Mohiuddin,
Shafiq Joty,
Dat Tien Nguyen
Abstract:
We propose a novel coherence model for written asynchronous conversations (e.g., forums, emails), and show its applications in coherence assessment and thread reconstruction tasks. We conduct our research in two steps. First, we propose improvements to the recently proposed neural entity grid model by lexicalizing its entity transitions. Then, we extend the model to asynchronous conversations by incorporating the underlying conversational structure into the entity grid representation and feature computation. Our model achieves state-of-the-art results on standard coherence assessment tasks in monologues and conversations, outperforming existing models. We also demonstrate its effectiveness in reconstructing thread structures.
Submitted 6 May, 2018;
originally announced May 2018.
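An entity grid represents a text as a sentence-by-entity matrix of grammatical roles, and coherence features are read off from the vertical role transitions. The tiny example below is an assumption: the hand-annotated roles stand in for a parser's output, and real models score the transition distributions rather than just counting them.

```python
from collections import Counter

# Toy entity-grid sketch: rows are sentences, columns are entities, cells
# record the entity's grammatical role (S=subject, O=object, X=other,
# '-'=absent). The hand-annotated "parse" is an illustrative assumption.

sentences = [
    {"judge": "S", "ruling": "O"},
    {"ruling": "S"},
    {"judge": "X", "appeal": "O"},
]

entities = sorted({e for sent in sentences for e in sent})
grid = [[sent.get(e, "-") for e in entities] for sent in sentences]

def transitions(grid, length=2):
    # Count vertical role transitions of the given length per entity column;
    # coherent texts tend to reuse entities in prominent roles, so patterns
    # like ('S', 'S') or ('S', 'O') become informative features.
    counts = Counter()
    for col in zip(*grid):
        for i in range(len(col) - length + 1):
            counts[col[i:i + length]] += 1
    return counts

probs = transitions(grid)
```

The paper's two extensions act on exactly these ingredients: lexicalization replaces the bare S/O/X symbols with entity-specific versions, and the conversational extension changes which sentences count as vertically adjacent, following the thread structure instead of linear order.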
-
Charge carrier mobility degradation in graphene sheet under induced strain
Authors:
Raheel Shah,
Tariq M. Mohiuddin
Abstract:
The impact of induced strain on charge carrier mobility is investigated for a monolayer graphene sheet. Mobility is computed within the Born approximation by including impurity scattering, surface roughness effects, and interaction with lattice phonons. Unlike its strained-silicon (sSi) counterpart, strained graphene shows a drop in mobility with increasing strain. The main reason for this effect is a decrease in the Fermi velocity due to induced distortions of the graphene honeycomb lattice.
Submitted 12 October, 2010; v1 submitted 25 August, 2010;
originally announced August 2010.
-
Uniaxial Strain in Graphene by Raman Spectroscopy: G peak splitting, Gruneisen Parameters and Sample Orientation
Authors:
T. M. G. Mohiuddin,
A. Lombardo,
R. R. Nair,
A. Bonetti,
G. Savini,
R. Jalil,
N. Bonini,
D. M. Basko,
C. Galiotis,
N. Marzari,
K. S. Novoselov,
A. K. Geim,
A. C. Ferrari
Abstract:
Graphene is the two-dimensional building block for carbon allotropes of every other dimensionality. Since its experimental discovery, graphene continues to attract enormous interest, in particular as a new kind of matter, in which electron transport is governed by a Dirac-like wave equation, and as a model system for studying electronic and phonon properties of other, more complex, graphitic materials [1-4]. Here, we uncover the constitutive relation of graphene and probe new physics of its optical phonons, by studying its Raman spectrum as a function of uniaxial strain. We find that the doubly degenerate E2g optical mode splits in two components, one polarized along the strain and the other perpendicular to it. This leads to the splitting of the G peak into two bands, which we call G+ and G-, by analogy with the effect of curvature on the nanotube G peak [5-7]. Both peaks red shift with increasing strain, and their splitting increases, in excellent agreement with first-principles calculations. Their relative intensities are found to depend on light polarization, which provides a useful tool to probe the graphene crystallographic orientation with respect to the strain. The singly degenerate 2D and 2D' bands also red shift, but do not split for small strains. We study the Gruneisen parameters for the phonons responsible for the G, D and D' peaks. These can be used to measure the amount of uniaxial or biaxial strain, providing a fundamental tool for nanoelectronics, where strain monitoring is of paramount importance [8, 9].
Submitted 8 December, 2008;
originally announced December 2008.
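The strain dependence of the split G peak described above is commonly written in a compact secular-equation form. As a hedged sketch (the symbols follow common usage in the strained-graphene literature and should be checked against the paper itself):

```latex
% G-peak shift and splitting under uniaxial strain (hedged sketch).
% \varepsilon_{ll}, \varepsilon_{tt}: longitudinal and transverse strain
% components, \gamma: Gruneisen parameter, \beta: shear deformation
% potential, \omega_G^0: unstrained G-peak position, \nu: Poisson ratio.
\Delta\omega_{G}^{\pm}
  = -\gamma\,\omega_{G}^{0}\,(\varepsilon_{ll} + \varepsilon_{tt})
    \pm \tfrac{1}{2}\,\beta\,\omega_{G}^{0}\,(\varepsilon_{ll} - \varepsilon_{tt}),
\qquad \varepsilon_{tt} = -\nu\,\varepsilon_{ll}
```

The hydrostatic (Gruneisen) term shifts both components together, while the shear term splits them, which is why both G+ and G- red shift as the splitting grows.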
-
Control of graphene's properties by reversible hydrogenation
Authors:
D. C. Elias,
R. R. Nair,
T. M. G. Mohiuddin,
S. V. Morozov,
P. Blake,
M. P. Halsall,
A. C. Ferrari,
D. W. Boukhvalov,
M. I. Katsnelson,
A. K. Geim,
K. S. Novoselov
Abstract:
Graphene - a monolayer of carbon atoms densely packed into a hexagonal lattice - has one of the strongest possible atomic bonds and can be viewed as a robust atomic-scale scaffold, to which other chemical species can be attached without destroying it. This notion of graphene as a giant flat molecule that can be altered chemically is supported by the observation of so-called graphene oxide, that is, graphene densely covered with hydroxyl and other groups. Unfortunately, graphene oxide is strongly disordered, poorly conductive, and difficult to reduce to the original state. Nevertheless, one can imagine atoms or molecules being attached to the atomic scaffold in a strictly periodic manner, which should result in a different electronic structure and, essentially, a different crystalline material. A hypothetical example of this is graphane, a wide-gap semiconductor in which hydrogen is bonded to each carbon site of graphene. Here we show that by exposing graphene to atomic hydrogen, it is possible to transform this highly conductive semimetal into an insulator. Transmission electron microscopy reveals that the material retains the hexagonal lattice, but its period becomes markedly shorter than that of graphene, providing direct evidence for a new graphene-based derivative. The reaction with hydrogen is found to be reversible, so that the original metallic state and lattice spacing are restored by annealing, and even the quantum Hall effect recovers. Our work proves the concept of chemical modification of graphene, which promises a whole range of new two-dimensional crystals with designed electronic and other properties.
Submitted 26 October, 2008;
originally announced October 2008.
-
Effect of high-k environment on charge carrier mobility in graphene
Authors:
L. A. Ponomarenko,
R. Yang,
T. M. Mohiuddin,
S. M. Morozov,
A. A. Zhukov,
F. Schedin,
E. W. Hill,
K. S. Novoselov,
M. I. Katsnelson,
A. K. Geim
Abstract:
It is widely assumed that the dominant source of scattering in graphene is charged impurities in a substrate. We have tested this conjecture by studying graphene placed on various substrates and in high-k media. Unexpectedly, we have found no significant changes in carrier mobility either for different substrates or by using glycerol, ethanol and water as a top dielectric layer. This suggests that Coulomb impurities are not the scattering mechanism that limits the mean free path currently attainable for graphene on a substrate.
Submitted 8 May, 2009; v1 submitted 6 September, 2008;
originally announced September 2008.
-
Quantum-Hall activation gaps in graphene
Authors:
A. J. M. Giesbers,
U. Zeitler,
M. I. Katsnelson,
L. A. Ponomarenko,
T. M. G. Mohiuddin,
J. C. Maan
Abstract:
We have measured the quantum-Hall activation gaps in graphene at filling factors $\nu=2$ and $\nu=6$ for magnetic fields up to 32 T and temperatures from 4 K to 300 K. The $\nu=6$ gap can be described by thermal excitation to broadened Landau levels with a width of 400 K. In contrast, the gap measured at $\nu=2$ is strongly temperature and field dependent and approaches the expected value for sharp Landau levels for fields $B > 20$ T and temperatures $T > 100$ K. We explain this surprising behavior by a narrowing of the lowest Landau level.
Submitted 12 October, 2007; v1 submitted 19 June, 2007;
originally announced June 2007.
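The analysis behind such measurements rests on two standard relations, included here as a hedged sketch rather than the paper's exact expressions: the activation gap is extracted from the thermally activated longitudinal resistivity, and the "expected value for sharp Landau levels" follows from graphene's relativistic Landau spectrum.

```latex
% Activation-gap analysis (standard textbook relations, hedged sketch).
% Thermally activated longitudinal resistivity with activation energy E_a:
\rho_{xx} \propto \exp\!\left(-\frac{E_a}{2 k_B T}\right)
% Graphene Landau levels (v_F: Fermi velocity, B: magnetic field):
\quad E_N = \operatorname{sgn}(N)\, v_F \sqrt{2 e \hbar B \,|N|},
% so for ideal, sharp levels the \nu = 2 gap is set by the N = 0 to N = 1
% spacing:
\qquad \Delta_{\nu=2} = v_F \sqrt{2 e \hbar B}
```

Level broadening reduces the measured gap below this ideal value, which is why the observed approach to $\Delta_{\nu=2}$ at high $B$ and $T$ signals a narrowing of the lowest Landau level.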