Showing 1–11 of 11 results for author: Sheikh, Z

Searching in archive cs.
  1. arXiv:2510.24702

    cs.CL cs.AI

    Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

    Authors: Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yiheng Xu, Danyang Zhang, Apurva Gandhi, Fan Yang, Joseph Liu, Tianyue Ou, Zhihao Yuan, Frank Xu, Shuyan Zhou, Xingyao Wang, Xiang Yue, Tao Yu, Huan Sun, Yu Su, Graham Neubig

    Abstract: Public research results on large-scale supervised finetuning of AI agents remain relatively rare, since the collection of agent training data presents unique challenges. In this work, we argue that the bottleneck is not a lack of underlying data sources, but that a large variety of data is fragmented across heterogeneous formats, tools, and interfaces. To this end, we introduce the agent data prot…

    Submitted 3 March, 2026; v1 submitted 28 October, 2025; originally announced October 2025.

  2. arXiv:2510.14307

    cs.CL cs.AI

    MERLIN: A Testbed for Multilingual Multimodal Entity Recognition and Linking

    Authors: Sathyanarayanan Ramamoorthy, Vishwa Shah, Simran Khanuja, Zaid Sheikh, Shan Jie, Ann Chia, Shearman Chua, Graham Neubig

    Abstract: This paper introduces MERLIN, a novel testbed system for the task of Multilingual Multimodal Entity Linking. The created dataset includes BBC news article titles, paired with corresponding images, in five languages: Hindi, Japanese, Indonesian, Vietnamese, and Tamil, featuring over 7,000 named entity mentions linked to 2,500 unique Wikidata entities. We also include several benchmarks using multil…

    Submitted 16 October, 2025; originally announced October 2025.

  3. arXiv:2404.02408

    cs.CL

    CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models

    Authors: Zaid Sheikh, Antonios Anastasopoulos, Shruti Rijhwani, Lindia Tjuatja, Robbie Jimerson, Graham Neubig

    Abstract: Effectively using Natural Language Processing (NLP) tools in under-resourced languages requires a thorough understanding of the language itself, familiarity with the latest models and training methodologies, and technical expertise to deploy these models. This could present a significant obstacle for language community members and linguists to use NLP tools. This paper introduces the CMU Linguisti…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: Live demo at https://cmulab.dev

  4. arXiv:2305.15097

    cs.CV cs.AI

    Computer Vision for Construction Progress Monitoring: A Real-Time Object Detection Approach

    Authors: Jiesheng Yang, Andreas Wilde, Karsten Menzel, Md Zubair Sheikh, Boris Kuznetsov

    Abstract: Construction progress monitoring (CPM) is essential for effective project management, ensuring on-time and on-budget delivery. Traditional CPM methods often rely on manual inspection and reporting, which are time-consuming and prone to errors. This paper proposes a novel approach for automated CPM using state-of-the-art object detection algorithms. The proposed method leverages e.g. YOLOv8's real-…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: 15 Pages

  5. A deep-learning search for technosignatures of 820 nearby stars

    Authors: Peter Xiangyuan Ma, Cherry Ng, Leandro Rizk, Steve Croft, Andrew P. V. Siemion, Bryan Brzycki, Daniel Czech, Jamie Drew, Vishal Gajjar, John Hoang, Howard Isaacson, Matt Lebofsky, David MacMahon, Imke de Pater, Danny C. Price, Sofia Z. Sheikh, S. Pete Worden

    Abstract: The goal of the Search for Extraterrestrial Intelligence (SETI) is to quantify the prevalence of technological life beyond Earth via their "technosignatures". One theorized technosignature is narrowband Doppler drifting radio signals. The principal challenge in conducting SETI in the radio domain is developing a generalized technique to reject human radio frequency interference (RFI). Here, we pre…

    Submitted 30 January, 2023; originally announced January 2023.

    Comments: 10 pages of main paper followed by 16 pages of methods; 17 figures total and 7 tables; published in Nature Astronomy

  6. arXiv:2203.13901

    cs.CL

    AUTOLEX: An Automatic Framework for Linguistic Exploration

    Authors: Aditi Chaudhary, Zaid Sheikh, David R Mortensen, Antonios Anastasopoulos, Graham Neubig

    Abstract: Each language has its own complex systems of word, phrase, and sentence construction, the guiding principles of which are often summarized in grammar descriptions for the consumption of linguists or language learners. However, manual creation of such descriptions is a fraught process, as creating descriptions which describe the language in "its own terms" without bias or error requires both a deep…

    Submitted 25 March, 2022; originally announced March 2022.

    Comments: 9 pages

  7. arXiv:2203.10899

    astro-ph.EP astro-ph.IM cs.CR physics.pop-ph

    The Case for Technosignatures: Why They May Be Abundant, Long-lived, Highly Detectable, and Unambiguous

    Authors: Jason T. Wright, Jacob Haqq-Misra, Adam Frank, Ravi Kopparapu, Manasvi Lingam, Sofia Z. Sheikh

    Abstract: The intuition suggested by the Drake equation implies that technology should be less prevalent than biology in the galaxy. However, it has been appreciated for decades in the SETI community that technosignatures could be more abundant, longer-lived, more detectable, and less ambiguous than biosignatures. We collect the arguments for and against technosignatures' ubiquity and discuss the implicatio…

    Submitted 21 March, 2022; originally announced March 2022.

    Comments: Published in ApJ Letters

    Journal ref: 2022 ApJL 927 L30

  8. arXiv:2011.00767

    cs.CL

    Reducing Confusion in Active Learning for Part-Of-Speech Tagging

    Authors: Aditi Chaudhary, Antonios Anastasopoulos, Zaid Sheikh, Graham Neubig

    Abstract: Active learning (AL) uses a data selection algorithm to select useful training samples to minimize annotation cost. This is now an essential tool for building low-resource syntactic analyzers such as part-of-speech (POS) taggers. Existing AL heuristics are generally designed on the principle of selecting uncertain yet representative training instances, where annotating these instances may reduce a…

    Submitted 20 November, 2020; v1 submitted 2 November, 2020; originally announced November 2020.

    Comments: To appear in TACL 2020. This is a pre-MIT Press publication version

  9. arXiv:2010.01160

    cs.CL

    Automatic Extraction of Rules Governing Morphological Agreement

    Authors: Aditi Chaudhary, Antonios Anastasopoulos, Adithya Pratapa, David R. Mortensen, Zaid Sheikh, Yulia Tsvetkov, Graham Neubig

    Abstract: Creating a descriptive grammar of a language is an indispensable step for language documentation and preservation. However, at the same time it is a tedious, time-consuming task. In this paper, we take steps towards automating this process by devising an automated framework for extracting a first-pass grammatical specification from raw text in a concise, human- and machine-readable format. We focu…

    Submitted 5 October, 2020; v1 submitted 2 October, 2020; originally announced October 2020.

    Comments: Accepted at EMNLP 2020

  10. arXiv:1908.08983

    cs.CL

    A Little Annotation does a Lot of Good: A Study in Bootstrapping Low-resource Named Entity Recognizers

    Authors: Aditi Chaudhary, Jiateng Xie, Zaid Sheikh, Graham Neubig, Jaime G. Carbonell

    Abstract: Most state-of-the-art models for named entity recognition (NER) rely on the availability of large amounts of labeled data, making them challenging to extend to new, lower-resourced languages. However, there are now several proposed approaches involving either cross-lingual transfer learning, which learns from other highly resourced languages, or active learning, which efficiently selects effective…

    Submitted 23 August, 2019; originally announced August 2019.

    Comments: Accepted at EMNLP 2019

  11. arXiv:1902.08899

    cs.CL

    The ARIEL-CMU Systems for LoReHLT18

    Authors: Aditi Chaudhary, Siddharth Dalmia, Junjie Hu, Xinjian Li, Austin Matthews, Aldrian Obaja Muis, Naoki Otani, Shruti Rijhwani, Zaid Sheikh, Nidhi Vyas, Xinyi Wang, Jiateng Xie, Ruochen Xu, Chunting Zhou, Peter J. Jansen, Yiming Yang, Lori Levin, Florian Metze, Teruko Mitamura, David R. Mortensen, Graham Neubig, Eduard Hovy, Alan W Black, Jaime Carbonell, Graham V. Horwood, et al. (5 additional authors not shown)

    Abstract: This paper describes the ARIEL-CMU submissions to the Low Resource Human Language Technologies (LoReHLT) 2018 evaluations for the tasks Machine Translation (MT), Entity Discovery and Linking (EDL), and detection of Situation Frames in Text and Speech (SF Text and Speech).

    Submitted 24 February, 2019; originally announced February 2019.