-
Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents
Authors:
Yueqi Song,
Ketan Ramaneti,
Zaid Sheikh,
Ziru Chen,
Boyu Gou,
Tianbao Xie,
Yiheng Xu,
Danyang Zhang,
Apurva Gandhi,
Fan Yang,
Joseph Liu,
Tianyue Ou,
Zhihao Yuan,
Frank Xu,
Shuyan Zhou,
Xingyao Wang,
Xiang Yue,
Tao Yu,
Huan Sun,
Yu Su,
Graham Neubig
Abstract:
Public research results on large-scale supervised finetuning of AI agents remain relatively rare, since the collection of agent training data presents unique challenges. In this work, we argue that the bottleneck is not a lack of underlying data sources, but that a large variety of data is fragmented across heterogeneous formats, tools, and interfaces. To this end, we introduce the agent data protocol (ADP), a lightweight representation language that serves as an "interlingua" between agent datasets in diverse formats and unified agent training pipelines downstream. The design of ADP is expressive enough to capture a large variety of tasks, including API/tool use, browsing, coding, software engineering, and general agentic workflows, while remaining simple to parse and train on without engineering at a per-dataset level. In experiments, we unified a broad collection of 13 existing agent training datasets into ADP format, and converted the standardized ADP data into training-ready formats for multiple agent frameworks. We performed SFT on these data, demonstrating an average performance gain of ~20% over the corresponding base models and state-of-the-art or near-SOTA performance on standard coding, browsing, tool use, and research benchmarks, without domain-specific tuning. All code and data are released publicly, in the hope that ADP can help lower the barrier to standardized, scalable, and reproducible agent training.
Submitted 3 March, 2026; v1 submitted 28 October, 2025;
originally announced October 2025.
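The ADP abstract above describes an "interlingua" that normalizes heterogeneous agent trajectories into one representation that downstream training pipelines can consume. As a purely hypothetical sketch of that idea (the `Step` type, its field names, and the `to_chat_messages` helper are illustrative assumptions, not the published ADP schema), such a normalization might flatten each trajectory into typed steps and then render them into a generic chat-message training format:

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical step record; the actual ADP schema may differ.
@dataclass
class Step:
    role: Literal["user", "assistant", "tool"]
    kind: Literal["message", "action", "observation"]
    content: str

def to_chat_messages(trajectory: list[Step]) -> list[dict]:
    """Render a normalized trajectory into generic chat-format
    training messages, independent of the source dataset."""
    return [{"role": s.role, "content": s.content} for s in trajectory]

# A toy trajectory mixing a user request, an agent action,
# a tool observation, and a final agent message.
traj = [
    Step("user", "message", "Find the README in this repo."),
    Step("assistant", "action", "ls"),
    Step("tool", "observation", "README.md  src/"),
    Step("assistant", "message", "The README is at ./README.md."),
]
messages = to_chat_messages(traj)
```

The point of such an intermediate layer is that each source dataset needs only one converter into the shared representation, rather than one converter per (dataset, framework) pair.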
-
MERLIN: A Testbed for Multilingual Multimodal Entity Recognition and Linking
Authors:
Sathyanarayanan Ramamoorthy,
Vishwa Shah,
Simran Khanuja,
Zaid Sheikh,
Shan Jie,
Ann Chia,
Shearman Chua,
Graham Neubig
Abstract:
This paper introduces MERLIN, a novel testbed system for the task of Multilingual Multimodal Entity Linking. The created dataset includes BBC news article titles, paired with corresponding images, in five languages: Hindi, Japanese, Indonesian, Vietnamese, and Tamil, featuring over 7,000 named entity mentions linked to 2,500 unique Wikidata entities. We also include several benchmarks using multilingual and multimodal entity linking methods, exploring language models such as LLaMa-2 and Aya-23. Our findings indicate that incorporating visual data improves the accuracy of entity linking, especially for entities where the textual context is ambiguous or insufficient, and particularly for models that do not have strong multilingual abilities. The dataset and methods are available at https://github.com/rsathya4802/merlin
Submitted 16 October, 2025;
originally announced October 2025.
-
CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models
Authors:
Zaid Sheikh,
Antonios Anastasopoulos,
Shruti Rijhwani,
Lindia Tjuatja,
Robbie Jimerson,
Graham Neubig
Abstract:
Effectively using Natural Language Processing (NLP) tools in under-resourced languages requires a thorough understanding of the language itself, familiarity with the latest models and training methodologies, and technical expertise to deploy these models. This can present a significant obstacle for language community members and linguists who wish to use NLP tools. This paper introduces the CMU Linguistic Annotation Backend, an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models. CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages, even with limited training data. We describe various tools and APIs that are currently available and how developers can easily add new models/functionality to the framework. Code is available at https://github.com/neulab/cmulab along with a live demo at https://cmulab.dev
Submitted 2 April, 2024;
originally announced April 2024.
-
Computer Vision for Construction Progress Monitoring: A Real-Time Object Detection Approach
Authors:
Jiesheng Yang,
Andreas Wilde,
Karsten Menzel,
Md Zubair Sheikh,
Boris Kuznetsov
Abstract:
Construction progress monitoring (CPM) is essential for effective project management, ensuring on-time and on-budget delivery. Traditional CPM methods often rely on manual inspection and reporting, which are time-consuming and prone to errors. This paper proposes a novel approach for automated CPM using state-of-the-art object detection algorithms. The proposed method leverages detectors such as YOLOv8, exploiting their real-time capabilities and high accuracy to identify and track construction elements within site images and videos. A dataset was created, consisting of various building elements and annotated with relevant objects for training and validation. The performance of the proposed approach was evaluated using standard metrics, such as precision, recall, and F1-score, demonstrating significant improvement over existing methods. The integration of computer vision into CPM provides stakeholders with reliable, efficient, and cost-effective means to monitor project progress, facilitating timely decision-making and ultimately contributing to the successful completion of construction projects.
Submitted 24 May, 2023;
originally announced May 2023.
-
A deep-learning search for technosignatures of 820 nearby stars
Authors:
Peter Xiangyuan Ma,
Cherry Ng,
Leandro Rizk,
Steve Croft,
Andrew P. V. Siemion,
Bryan Brzycki,
Daniel Czech,
Jamie Drew,
Vishal Gajjar,
John Hoang,
Howard Isaacson,
Matt Lebofsky,
David MacMahon,
Imke de Pater,
Danny C. Price,
Sofia Z. Sheikh,
S. Pete Worden
Abstract:
The goal of the Search for Extraterrestrial Intelligence (SETI) is to quantify the prevalence of technological life beyond Earth via their "technosignatures". One theorized technosignature is narrowband Doppler drifting radio signals. The principal challenge in conducting SETI in the radio domain is developing a generalized technique to reject human radio frequency interference (RFI). Here, we present the most comprehensive deep-learning-based technosignature search to date, returning 8 promising ETI signals of interest for re-observation as part of the Breakthrough Listen initiative. The search comprises 820 unique targets observed with the Robert C. Byrd Green Bank Telescope, totaling over 480 hours of on-sky data. We implement a novel beta-Convolutional Variational Autoencoder to identify technosignature candidates in a semi-unsupervised manner while keeping the false positive rate manageably low. This new approach presents itself as a leading solution in accelerating SETI and other transient research into the age of data-driven astronomy.
Submitted 30 January, 2023;
originally announced January 2023.
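The abstract above relies on a beta-Convolutional Variational Autoencoder, a VAE variant whose KL term is reweighted by a factor beta. As a generic sketch of that training objective only (this is the textbook beta-VAE loss for a diagonal-Gaussian encoder, not the paper's architecture or code, and `beta=4.0` is an arbitrary illustrative value), the per-example loss combines reconstruction error with a beta-weighted KL divergence against the standard-normal prior:

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Generic beta-VAE objective: reconstruction MSE plus a
    beta-weighted KL divergence between the encoder's diagonal
    Gaussian N(mu, diag(sigma^2)) and the standard normal prior."""
    recon = np.mean((x - x_recon) ** 2)
    # Closed form: KL(N(mu, sigma^2) || N(0, I))
    #            = 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1)
    kl = 0.5 * np.sum(mu ** 2 + np.exp(logvar) - logvar - 1.0)
    return recon + beta * kl

# A perfectly reconstructed input whose posterior already matches
# the prior (mu = 0, logvar = 0, i.e. sigma = 1) incurs zero loss.
x = np.zeros(8)
loss = beta_vae_loss(x, x, np.zeros(2), np.zeros(2))
```

In an anomaly-detection setting like this search, inputs the trained model reconstructs poorly (high loss) are the candidates flagged for follow-up.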
-
AUTOLEX: An Automatic Framework for Linguistic Exploration
Authors:
Aditi Chaudhary,
Zaid Sheikh,
David R Mortensen,
Antonios Anastasopoulos,
Graham Neubig
Abstract:
Each language has its own complex systems of word, phrase, and sentence construction, the guiding principles of which are often summarized in grammar descriptions for the consumption of linguists or language learners. However, manual creation of such descriptions is a fraught process, as creating descriptions which describe the language in "its own terms" without bias or error requires a deep understanding both of the language at hand and of linguistics as a whole. We propose AutoLEX, an automatic framework that aims to ease linguists' discovery and extraction of concise descriptions of linguistic phenomena. Specifically, we apply this framework to extract descriptions for three phenomena: morphological agreement, case marking, and word order, across several languages. We evaluate the descriptions with the help of language experts and propose a method for automated evaluation when human evaluation is infeasible.
Submitted 25 March, 2022;
originally announced March 2022.
-
The Case for Technosignatures: Why They May Be Abundant, Long-lived, Highly Detectable, and Unambiguous
Authors:
Jason T. Wright,
Jacob Haqq-Misra,
Adam Frank,
Ravi Kopparapu,
Manasvi Lingam,
Sofia Z. Sheikh
Abstract:
The intuition suggested by the Drake equation implies that technology should be less prevalent than biology in the galaxy. However, it has been appreciated for decades in the SETI community that technosignatures could be more abundant, longer-lived, more detectable, and less ambiguous than biosignatures. We collect the arguments for and against technosignatures' ubiquity and discuss the implications of some properties of technological life that fundamentally differ from nontechnological life in the context of modern astrobiology: It can spread among the stars to many sites, it can be more easily detected at large distances, and it can produce signs that are unambiguously technological. As an illustration in terms of the Drake equation, we consider two Drake-like equations, for technosignatures (calculating N(tech)) and biosignatures (calculating N(bio)). We argue that Earth and humanity may be poor guides to the longevity term L and that its maximum value could be very large, in that technology can outlive its creators and even its host star. We conclude that while the Drake equation implies that N(bio)>>N(tech), it is also plausible that N(tech)>>N(bio). As a consequence, as we seek possible indicators of extraterrestrial life, for instance, via characterization of the atmospheres of habitable exoplanets, we should search for both biosignatures and technosignatures. This exercise also illustrates ways in which biosignature and technosignature searches can complement and supplement each other and how methods of technosignature search, including old ideas from SETI, can inform the search for biosignatures and life generally.
Submitted 21 March, 2022;
originally announced March 2022.
-
Reducing Confusion in Active Learning for Part-Of-Speech Tagging
Authors:
Aditi Chaudhary,
Antonios Anastasopoulos,
Zaid Sheikh,
Graham Neubig
Abstract:
Active learning (AL) uses a data selection algorithm to select useful training samples to minimize annotation cost. This is now an essential tool for building low-resource syntactic analyzers such as part-of-speech (POS) taggers. Existing AL heuristics are generally designed on the principle of selecting uncertain yet representative training instances, where annotating these instances may reduce a large number of errors. However, in an empirical study across six typologically diverse languages (German, Swedish, Galician, North Sami, Persian, and Ukrainian), we found the surprising result that even in an oracle scenario where we know the true uncertainty of predictions, these current heuristics are far from optimal. Based on this analysis, we pose the problem of AL as selecting instances which maximally reduce the confusion between particular pairs of output tags. Extensive experimentation on the aforementioned languages shows that our proposed AL strategy outperforms other AL strategies by a significant margin. We also present auxiliary results demonstrating the importance of proper calibration of models, which we ensure through cross-view training, and analysis demonstrating how our proposed strategy selects examples that more closely follow the oracle data distribution.
Submitted 20 November, 2020; v1 submitted 2 November, 2020;
originally announced November 2020.
-
Automatic Extraction of Rules Governing Morphological Agreement
Authors:
Aditi Chaudhary,
Antonios Anastasopoulos,
Adithya Pratapa,
David R. Mortensen,
Zaid Sheikh,
Yulia Tsvetkov,
Graham Neubig
Abstract:
Creating a descriptive grammar of a language is an indispensable step for language documentation and preservation. However, at the same time it is a tedious, time-consuming task. In this paper, we take steps towards automating this process by devising an automated framework for extracting a first-pass grammatical specification from raw text in a concise, human- and machine-readable format. We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages. We apply our framework to all languages included in the Universal Dependencies project, with promising results. Using cross-lingual transfer, even with no expert annotations in the language of interest, our framework extracts a grammatical specification which is nearly equivalent to those created with large amounts of gold-standard annotated data. We confirm this finding with human expert evaluations of the rules that our framework produces, which have an average accuracy of 78%. We release an interface demonstrating the extracted rules at https://neulab.github.io/lase/.
Submitted 5 October, 2020; v1 submitted 2 October, 2020;
originally announced October 2020.
-
A Little Annotation does a Lot of Good: A Study in Bootstrapping Low-resource Named Entity Recognizers
Authors:
Aditi Chaudhary,
Jiateng Xie,
Zaid Sheikh,
Graham Neubig,
Jaime G. Carbonell
Abstract:
Most state-of-the-art models for named entity recognition (NER) rely on the availability of large amounts of labeled data, making them challenging to extend to new, lower-resourced languages. However, there are now several proposed approaches involving either cross-lingual transfer learning, which learns from other highly resourced languages, or active learning, which efficiently selects effective training data based on model predictions. This paper poses the question: given this recent progress, and limited human annotation, what is the most effective method for efficiently creating high-quality entity recognizers in under-resourced languages? Based on extensive experimentation using both simulated and real human annotation, we find that a dual-strategy approach works best: starting with a cross-lingual transferred model, then performing targeted annotation of only uncertain entity spans in the target language, minimizing annotator effort. Results demonstrate that cross-lingual transfer is a powerful tool when very little data can be annotated, but an entity-targeted annotation strategy can achieve competitive accuracy quickly, with just one-tenth of the training data.
Submitted 23 August, 2019;
originally announced August 2019.
-
The ARIEL-CMU Systems for LoReHLT18
Authors:
Aditi Chaudhary,
Siddharth Dalmia,
Junjie Hu,
Xinjian Li,
Austin Matthews,
Aldrian Obaja Muis,
Naoki Otani,
Shruti Rijhwani,
Zaid Sheikh,
Nidhi Vyas,
Xinyi Wang,
Jiateng Xie,
Ruochen Xu,
Chunting Zhou,
Peter J. Jansen,
Yiming Yang,
Lori Levin,
Florian Metze,
Teruko Mitamura,
David R. Mortensen,
Graham Neubig,
Eduard Hovy,
Alan W Black,
Jaime Carbonell,
Graham V. Horwood
, et al. (5 additional authors not shown)
Abstract:
This paper describes the ARIEL-CMU submissions to the Low Resource Human Language Technologies (LoReHLT) 2018 evaluations for the tasks Machine Translation (MT), Entity Discovery and Linking (EDL), and detection of Situation Frames in Text and Speech (SF Text and Speech).
Submitted 24 February, 2019;
originally announced February 2019.