-
Proprioceptive-only State Estimation for Legged Robots with Set-Coverage Measurements of Learned Dynamics
Authors:
Abhijeet M. Kulkarni,
Ioannis Poulakakis,
Guoquan Huang
Abstract:
Proprioceptive-only state estimation is attractive for legged robots since it is computationally cheaper and is unaffected by perceptually degraded conditions. The history of joint-level measurements contains rich information that can be used to infer the dynamics of the system and subsequently produce navigational measurements. Recent approaches produce these estimates with learned measurement mo…
▽ More
Proprioceptive-only state estimation is attractive for legged robots since it is computationally cheaper and is unaffected by perceptually degraded conditions. The history of joint-level measurements contains rich information that can be used to infer the dynamics of the system and subsequently produce navigational measurements. Recent approaches produce these estimates with learned measurement models and fuse with IMU data, under a Gaussian noise assumption. However, this assumption can easily break down with limited training data and render the estimates inconsistent and potentially divergent. In this work, we propose a proprioceptive-only state estimation framework for legged robots that characterizes the measurement noise using set-coverage statements that do not assume any distribution. We develop a practical and computationally inexpensive method to use these set-coverage measurements with a Gaussian filter in a systematic way. We validate the approach in both simulation and two real-world quadrupedal datasets. Comparison with the Gaussian baselines shows that our proposed method remains consistent and is not prone to drift under real noise scenarios.
△ Less
Submitted 18 March, 2026;
originally announced March 2026.
-
MoXaRt: Audio-Visual Object-Guided Sound Interaction for XR
Authors:
Tianyu Xu,
Sieun Kim,
Qianhui Zheng,
Ruoyu Xu,
Tejasvi Ravi,
Anuva Kulkarni,
Katrina Passarella-Ward,
Junyi Zhu,
Adarsh Kowdle
Abstract:
In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt's core is a cascaded architecture that performs coarse, audio-only separation in paral…
▽ More
In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt's core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g., faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixes of up to 5 concurrent sources (e.g., 2 voices + 3 instruments) with ~2 second processing latency. We validate MoXaRt through a technical evaluation on a new dataset of 30 one-minute recordings featuring concurrent speech and music, and a 22-participant user study. Empirical results indicate that our system significantly enhances speech intelligibility, yielding a 36.2% (p < 0.01) increase in listening comprehension within adversarial acoustic environments while substantially reducing cognitive load (p < 0.001), thereby paving the way for more perceptive and socially adept XR experiences.
△ Less
Submitted 11 March, 2026;
originally announced March 2026.
-
Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR
Authors:
Ajinkya Kulkarni,
Sandipana Dowerah,
Atharva Kulkarni,
Tanel Alumäe,
Mathew Magimai Doss
Abstract:
Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact under studied. We present RAPTOR, Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition a controlled study of compact SSL backbones from the HuBERT and WavLM within a unified pairwise-gated fusion detector, evaluated…
▽ More
Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact under studied. We present RAPTOR, Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition a controlled study of compact SSL backbones from the HuBERT and WavLM within a unified pairwise-gated fusion detector, evaluated across 14 cross-domain benchmarks. We show that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, enabling 100M models to match larger and commercial systems. Beyond EER, we introduce a test-time augmentation protocol with perturbation-based aleatoric uncertainty to expose calibration differences invisible to standard metrics: WavLM variants exhibit overconfident miscalibration under perturbation, whereas iterative mHuBERT remains stable. These findings indicate that SSL pre-training trajectory, not model scale, drives reliable audio deepfake detection.
△ Less
Submitted 6 March, 2026;
originally announced March 2026.
-
PRISM: Exploring Heterogeneous Pretrained EEG Foundation Model Transfer to Clinical Differential Diagnosis
Authors:
Jeet Bandhu Lahiri,
Parshva Runwal,
Arvasu Kulkarni,
Mahir Jain,
Aditya Ray Mishra,
Siddharth Panwar,
Sandeep Singh
Abstract:
EEG foundation models are typically pretrained on narrow-source clinical archives and evaluated on benchmarks from the same ecosystem, leaving unclear whether representations encode neural physiology or recording-distribution artifacts. We introduce PRISM (Population Representative Invariant Signal Model), a masked autoencoder ablated along two axes -- pretraining population and downstream adaptat…
▽ More
EEG foundation models are typically pretrained on narrow-source clinical archives and evaluated on benchmarks from the same ecosystem, leaving unclear whether representations encode neural physiology or recording-distribution artifacts. We introduce PRISM (Population Representative Invariant Signal Model), a masked autoencoder ablated along two axes -- pretraining population and downstream adaptation -- with architecture and preprocessing fixed. We compare a narrow-source EU/US corpus (TUH + PhysioNet) against a geographically diverse pool augmented with multi-center South Asian clinical recordings across multiple EEG systems. Three findings emerge. First, narrow-source pretraining yields stronger linear probes on distribution-matched benchmarks, while diverse pretraining produces more adaptable representations under fine-tuning -- a trade-off invisible under single-protocol evaluation. Trained on three source corpora, PRISM matches or outperforms REVE (92 datasets, 60,000+ hours) on the majority of tasks, demonstrating that targeted diversity can substitute for indiscriminate scale and that dataset count is a confounding variable in model comparison. Second, on a clinically challenging and previously untested task -- distinguishing epilepsy from diagnostic mimickers via interictal EEG -- the diverse checkpoint outperforms the narrow-source checkpoint by +12.3 pp balanced accuracy, the largest gap across all evaluations. Third, systematic inconsistencies between EEG-Bench and EEG-FM-Bench reverse model rankings on identical datasets by up to 24 pp; we identify six concrete sources including split construction, checkpoint selection, segment length, and normalization, showing these factors compound non-additively.
△ Less
Submitted 28 February, 2026;
originally announced March 2026.
-
Designing AI Tutors for Interest-Based Learning: Insights from Human Instructors
Authors:
Abhishek Kulkarni,
Sharon Lynn Chu
Abstract:
Interest-based learning (IBL) is a paradigm of instruction in which educational content is contextualized using learners' interests to enhance content relevance. IBL has been shown to result in improved learning outcomes. Unfortunately, high effort is needed for instructors to design and deliver IBL content for individual students. LLMs in the form of AI tutors may allow for IBL to scale across ma…
▽ More
Interest-based learning (IBL) is a paradigm of instruction in which educational content is contextualized using learners' interests to enhance content relevance. IBL has been shown to result in improved learning outcomes. Unfortunately, high effort is needed for instructors to design and deliver IBL content for individual students. LLMs in the form of AI tutors may allow for IBL to scale across many students. Designing an AI tutor for IBL, however, first requires an understanding of how IBL is implemented in teaching scenarios. This paper presents a study that seeks to derive this understanding from an analysis of how human instructors design and deliver IBL content. We studied 14 one-to-one online tutoring sessions (28 participants) in which tutors designed and delivered a lesson tailored to a student's self-identified interest. Using lesson artifacts, tutoring transcripts, interviews, and questionnaires, findings include themes on how tutors integrate interests during instruction and why. Finally, actionable design implications are presented for LLM-powered AI tutors that aim to deliver IBL at scale.
△ Less
Submitted 27 February, 2026;
originally announced February 2026.
-
E3VA: Enhancing Emotional Expressiveness in Virtual Conversational Agents
Authors:
Abhishek Kulkarni,
Alexander Barquero,
Pavitra Lahari,
Aryaan Shaikh,
Sarah Brown
Abstract:
With the advent of generative AI and large language models, embodied conversational agents are becoming synonymous with online interactions. These agents possess vast amounts of knowledge but suffer from exhibiting limited emotional expressiveness. Without adequate expressions, agents might fail to adapt to users' emotions, which may result in a sub-optimal user experience and engagement. Most cur…
▽ More
With the advent of generative AI and large language models, embodied conversational agents are becoming synonymous with online interactions. These agents possess vast amounts of knowledge but suffer from exhibiting limited emotional expressiveness. Without adequate expressions, agents might fail to adapt to users' emotions, which may result in a sub-optimal user experience and engagement. Most current systems prioritize content based responses, neglecting the emotional context of conversations. Research in this space is currently limited to specific contexts, like mental health. To bridge this gap, our project proposes the implementation of expressive features in a virtual conversational agent which will utilize sentiment analysis and natural language processing to inform the generation of empathetic, expressive responses. The project delivers a functional conversational agent capable of assessing and responding to user emotions accordingly. We posit this will enhance usability, engagement, and the overall quality of conversations and present results from an exploratory pilot study investigating the same.
△ Less
Submitted 25 February, 2026;
originally announced February 2026.
-
Disentangling Geometry, Performance, and Training in Language Models
Authors:
Atharva Kulkarni,
Jacob Mitchell Springer,
Arjun Subramonian,
Swabha Swayamdipta
Abstract:
Geometric properties of Transformer weights, particularly the unembedding matrix, have been widely useful in language model interpretability research. Yet, their utility for estimating downstream performance remains unclear. In this work, we systematically investigate the relationship between model performance and the unembedding matrix geometry, particularly its effective rank. Our experiments, i…
▽ More
Geometric properties of Transformer weights, particularly the unembedding matrix, have been widely useful in language model interpretability research. Yet, their utility for estimating downstream performance remains unclear. In this work, we systematically investigate the relationship between model performance and the unembedding matrix geometry, particularly its effective rank. Our experiments, involving a suite of 108 OLMo-style language models trained under controlled variation, reveal several key findings. While the best-performing models often exhibit a high effective rank, this trend is not universal across tasks and training setups. Contrary to prior work, we find that low effective rank does not cause late-stage performance degradation in small models, but instead co-occurs with it; we find adversarial cases where low-rank models do not exhibit saturation. Moreover, we show that effective rank is strongly influenced by pre-training hyperparameters, such as batch size and weight decay, which in-turn affect the model's performance. Lastly, extending our analysis to other geometric metrics and final-layer representation, we find that these metrics are largely aligned, but none can reliably predict downstream performance. Overall, our findings suggest that the model's geometry, as captured by existing metrics, primarily reflects training choices rather than performance.
△ Less
Submitted 23 February, 2026;
originally announced February 2026.
-
Designing Production-Scale OCR for India: Multilingual and Domain-Specific Systems
Authors:
Ali Faraz,
Raja Kolla,
Ashish Kulkarni,
Shubham Agarwal
Abstract:
Designing Optical Character Recognition (OCR) systems for India requires balancing linguistic diversity, document heterogeneity, and deployment constraints. In this paper, we study two training strategies for building multilingual OCR systems with Vision-Language Models through the Chitrapathak series. We first follow a popular multimodal approach, pairing a generic vision encoder with a strong mu…
▽ More
Designing Optical Character Recognition (OCR) systems for India requires balancing linguistic diversity, document heterogeneity, and deployment constraints. In this paper, we study two training strategies for building multilingual OCR systems with Vision-Language Models through the Chitrapathak series. We first follow a popular multimodal approach, pairing a generic vision encoder with a strong multilingual language model and training the system end-to-end for OCR. Alternatively, we explore fine-tuning an existing OCR model, despite not being trained for the target languages. Through extensive evaluation on multilingual Indic OCR benchmarks and deployment-oriented metrics, we find that the second strategy consistently achieves better accuracy-latency trade-offs. Chitrapathak-2 achieves 3-6x speedup over its predecessor with being state-of-the-art (SOTA) in Telugu (6.69 char ANLS) and second best in the rest. In addition, we present Parichay, an independent OCR model series designed specifically for 9 Indian government documents to extract structured key fields, achieving 89.8% Exact Match score with a faster inference. Together, these systems achieve SOTA performance and provide practical guidance for building production-scale OCR pipelines in the Indian context.
△ Less
Submitted 18 February, 2026;
originally announced February 2026.
-
Long-Context Long-Form Question Answering for Legal Domain
Authors:
Anagha Kulkarni,
Parin Rajesh Jhaveri,
Prasha Shrestha,
Yu Tong Han,
Reza Amini,
Behrouz Madahian
Abstract:
Legal documents have complex document layouts involving multiple nested sections, lengthy footnotes and further use specialized linguistic devices like intricate syntax and domain-specific vocabulary to ensure precision and authority. These inherent characteristics of legal documents make question answering challenging, and particularly so when the answer to the question spans several pages (i.e.…
▽ More
Legal documents have complex document layouts involving multiple nested sections, lengthy footnotes and further use specialized linguistic devices like intricate syntax and domain-specific vocabulary to ensure precision and authority. These inherent characteristics of legal documents make question answering challenging, and particularly so when the answer to the question spans several pages (i.e. requires long-context) and is required to be comprehensive (i.e. a long-form answer). In this paper, we address the challenges of long-context question answering in context of long-form answers given the idiosyncrasies of legal documents. We propose a question answering system that can (a) deconstruct domain-specific vocabulary for better retrieval from source documents, (b) parse complex document layouts while isolating sections and footnotes and linking them appropriately, (c) generate comprehensive answers using precise domain-specific vocabulary. We also introduce a coverage metric that classifies the performance into recall-based coverage categories allowing human users to evaluate the recall with ease. We curate a QA dataset by leveraging the expertise of professionals from fields such as law and corporate tax. Through comprehensive experiments and ablation studies, we demonstrate the usability and merit of the proposed system.
△ Less
Submitted 6 February, 2026;
originally announced February 2026.
-
Evaluating Large Language Models on Solved and Unsolved Problems in Graph Theory: Implications for Computing Education
Authors:
Adithya Kulkarni,
Mohna Chakraborty,
Jay Bagga
Abstract:
Large Language Models are increasingly used by students to explore advanced material in computer science, including graph theory. As these tools become integrated into undergraduate and graduate coursework, it is important to understand how reliably they support mathematically rigorous thinking. This study examines the performance of a LLM on two related graph theoretic problems: a solved problem…
▽ More
Large Language Models are increasingly used by students to explore advanced material in computer science, including graph theory. As these tools become integrated into undergraduate and graduate coursework, it is important to understand how reliably they support mathematically rigorous thinking. This study examines the performance of a LLM on two related graph theoretic problems: a solved problem concerning the gracefulness of line graphs and an open problem for which no solution is currently known. We use an eight stage evaluation protocol that reflects authentic mathematical inquiry, including interpretation, exploration, strategy formation, and proof construction.
The model performed strongly on the solved problem, producing correct definitions, identifying relevant structures, recalling appropriate results without hallucination, and constructing a valid proof confirmed by a graph theory expert. For the open problem, the model generated coherent interpretations and plausible exploratory strategies but did not advance toward a solution. It did not fabricate results and instead acknowledged uncertainty, which is consistent with the explicit prompting instructions that directed the model to avoid inventing theorems or unsupported claims.
These findings indicate that LLMs can support exploration of established material but remain limited in tasks requiring novel mathematical insight or critical structural reasoning. For computing education, this distinction highlights the importance of guiding students to use LLMs for conceptual exploration while relying on independent verification and rigorous argumentation for formal problem solving.
△ Less
Submitted 4 February, 2026;
originally announced February 2026.
-
An Agentic Software Framework for Data Governance under DPDP
Authors:
Apurva Kulkarni,
Chandrashekar Ramanathan
Abstract:
Despite the rise of data-driven software systems in the modern digital landscape, data governance under a legal framework remains a critical challenge. In India, the Digital Personal Data Protection (DPDP) Act mandates rigorous data privacy and compliance requirements, necessitating software frameworks that are both ethical and regulation-aware. From a software development perspective, traditional…
▽ More
Despite the rise of data-driven software systems in the modern digital landscape, data governance under a legal framework remains a critical challenge. In India, the Digital Personal Data Protection (DPDP) Act mandates rigorous data privacy and compliance requirements, necessitating software frameworks that are both ethical and regulation-aware. From a software development perspective, traditional compliance tools often rely on hard-coded rules and static configurations, making them inflexible to dynamic policy updates or evolving legal contexts. Additionally, their monolithic architectures obscure decision-making processes, creating black-box behavior in critical governance workflows. Developing responsible AI software demands transparency, traceability, and adaptive enforcement mechanisms that make ethical decisions explainable. To address this challenge, a novel agentic framework is introduced to embed compliance logic directly into software agents that govern and adapt data policies. In this paper, the implementation focuses on the DPDP Act. The framework integrates KYU Agent and Compliance Agent for this purpose. KYU (Know-YourUser) Agent supports semantic understanding, user trustworthiness modelling and Compliance Agent uses data sensitivity reasoning within a goal-driven, agentic pipeline. The proposed framework, built using an open-sourced agentic framework and has been evaluated across ten diverse domains, including healthcare, education, and e-commerce. Its effectiveness under DPDP, measured via an Anonymization Score, demonstrates scalable, compliant data governance through masking, pseudonymization, and generalization strategies tailored to domain-specific needs. The proposed framework delivers scalable, transparent, and compliant data governance through collaborative agents, dynamic policy enforcement, and domain-aware anonymization.
△ Less
Submitted 3 January, 2026;
originally announced January 2026.
-
Multilingual Hidden Prompt Injection Attacks on LLM-Based Academic Reviewing
Authors:
Panagiotis Theocharopoulos,
Ajinkya Kulkarni,
Mathew Magimai. -Doss
Abstract:
Large language models (LLMs) are increasingly considered for use in high-impact workflows, including academic peer review. However, LLMs are vulnerable to document-level hidden prompt injection attacks. In this work, we construct a dataset of approximately 500 real academic papers accepted to ICML and evaluate the effect of embedding hidden adversarial prompts within these documents. Each paper is…
▽ More
Large language models (LLMs) are increasingly considered for use in high-impact workflows, including academic peer review. However, LLMs are vulnerable to document-level hidden prompt injection attacks. In this work, we construct a dataset of approximately 500 real academic papers accepted to ICML and evaluate the effect of embedding hidden adversarial prompts within these documents. Each paper is injected with semantically equivalent instructions in four different languages and reviewed using an LLM. We find that prompt injection induces substantial changes in review scores and accept/reject decisions for English, Japanese, and Chinese injections, while Arabic injections produce little to no effect. These results highlight the susceptibility of LLM-based reviewing systems to document-level prompt injection and reveal notable differences in vulnerability across languages.
△ Less
Submitted 29 December, 2025;
originally announced December 2025.
-
Interpretable and Steerable Concept Bottleneck Sparse Autoencoders
Authors:
Akshay Kulkarni,
Tsui-Wei Weng,
Vivek Narayanaswamy,
Shusen Liu,
Wesam A. Sakla,
Kowshik Thopalli
Abstract:
Sparse autoencoders (SAEs) promise a unified approach for mechanistic interpretability, concept discovery, and model steering in LLMs and LVLMs. However, realizing this potential requires learned features to be both interpretable and steerable. To that end, we introduce two new computationally inexpensive interpretability and steerability metrics for a systematic analysis of LVLM SAEs. This uncove…
▽ More
Sparse autoencoders (SAEs) promise a unified approach for mechanistic interpretability, concept discovery, and model steering in LLMs and LVLMs. However, realizing this potential requires learned features to be both interpretable and steerable. To that end, we introduce two new computationally inexpensive interpretability and steerability metrics for a systematic analysis of LVLM SAEs. This uncovers two observations; (i) a majority of SAE neurons exhibit either low interpretability or low steerability or both, rendering them ineffective for downstream use; and (ii) user-desired concepts are often absent in the SAE, thus limiting their practical utility. To address these limitations, we propose Concept Bottleneck Sparse Autoencoders (CB-SAE) - a novel post-hoc framework that prunes low-utility neurons and augments the latent space with a lightweight concept bottleneck aligned to a user-defined concept set. The resulting CB-SAE improves interpretability by +32.1% and steerability by +14.5% across LVLMs and image generation tasks.
△ Less
Submitted 30 March, 2026; v1 submitted 11 December, 2025;
originally announced December 2025.
-
Efficient and Compliant Control Framework for Versatile Human-Humanoid Collaborative Transportation
Authors:
Shubham S. Kumbhar,
Abhijeet M. Kulkarni,
Panagiotis Artemiadis
Abstract:
We present a control framework that enables humanoid robots to perform collaborative transportation tasks with a human partner. The framework supports both translational and rotational motions, which are fundamental to co-transport scenarios. It comprises three components: a high-level planner, a low-level controller, and a stiffness modulation mechanism. At the planning level, we introduce the In…
▽ More
We present a control framework that enables humanoid robots to perform collaborative transportation tasks with a human partner. The framework supports both translational and rotational motions, which are fundamental to co-transport scenarios. It comprises three components: a high-level planner, a low-level controller, and a stiffness modulation mechanism. At the planning level, we introduce the Interaction Linear Inverted Pendulum (I-LIP), which, combined with an admittance model and an MPC formulation, generates dynamically feasible footstep plans. These are executed by a QP-based whole-body controller that accounts for the coupled humanoid-object dynamics. Stiffness modulation regulates robot-object interaction, ensuring convergence to the desired relative configuration defined by the distance between the object and the robot's center of mass. We validate the effectiveness of the framework through real-world experiments conducted on the Digit humanoid platform. To quantify collaboration quality, we propose an efficiency metric that captures both task performance and inter-agent coordination. We show that this metric highlights the role of compliance in collaborative tasks and offers insights into desirable trajectory characteristics across both high- and low-level control layers. Finally, we showcase experimental results on collaborative behaviors, including translation, turning, and combined motions such as semi circular trajectories, representative of naturally occurring co-transportation tasks.
△ Less
Submitted 8 December, 2025;
originally announced December 2025.
-
Economies of Open Intelligence: Tracing Power & Participation in the Model Ecosystem
Authors:
Shayne Longpre,
Christopher Akiki,
Campbell Lund,
Atharva Kulkarni,
Emily Chen,
Irene Solaiman,
Avijit Ghosh,
Yacine Jernite,
Lucie-Aimée Kaffee
Abstract:
Since 2019, the Hugging Face Model Hub has been the primary global platform for sharing open weight AI models. By releasing a dataset of the complete history of weekly model downloads (June 2020-August 2025) alongside model metadata, we provide the most rigorous examination to-date of concentration dynamics and evolving characteristics in the open model economy. Our analysis spans 851,000 models,…
▽ More
Since 2019, the Hugging Face Model Hub has been the primary global platform for sharing open weight AI models. By releasing a dataset of the complete history of weekly model downloads (June 2020-August 2025) alongside model metadata, we provide the most rigorous examination to-date of concentration dynamics and evolving characteristics in the open model economy. Our analysis spans 851,000 models, over 200 aggregated attributes per model, and 2.2B downloads. We document a fundamental rebalancing of economic power: US open-weight industry dominance by Google, Meta, and OpenAI has declined sharply in favor of unaffiliated developers, community organizations, and, as of 2025, Chinese industry, with DeepSeek and Qwen models potentially heralding a new consolidation of market power. We identify statistically significant shifts in model properties, a 17X increase in average model size, rapid growth in multimodal generation (3.4X), quantization (5X), and mixture-of-experts architectures (7X), alongside concerning declines in data transparency, with open weights models surpassing truly open source models for the first time in 2025. We expose a new layer of developer intermediaries that has emerged, focused on quantizing and adapting base models for both efficiency and artistic expression. To enable continued research and oversight, we release the complete dataset with an interactive dashboard for real-time monitoring of concentration dynamics and evolving properties in the open model economy.
△ Less
Submitted 27 November, 2025;
originally announced December 2025.
-
Accent Placement Models for Rigvedic Sanskrit Text
Authors:
Akhil Rajeev P,
Annarao Kulkarni
Abstract:
The Rigveda, among the oldest Indian texts in Vedic Sanskrit, employs a distinctive pitch-accent system : udātta, anudātta, svarita whose marks encode melodic and interpretive cues but are often absent from modern e-texts. This work develops a parallel corpus of accented-unaccented ślokas and conducts a controlled comparison of three strategies for automatic accent placement in Rigvedic verse: (i)…
▽ More
The Rigveda, among the oldest Indian texts in Vedic Sanskrit, employs a distinctive pitch-accent system : udātta, anudātta, svarita whose marks encode melodic and interpretive cues but are often absent from modern e-texts. This work develops a parallel corpus of accented-unaccented ślokas and conducts a controlled comparison of three strategies for automatic accent placement in Rigvedic verse: (i) full fine-tuning of ByT5, a byte-level Transformer that operates directly on Unicode combining marks, (ii) a from-scratch BiLSTM-CRF sequence-labeling baseline, and (iii) LoRA-based parameter-efficient fine-tuning atop ByT5.
Evaluation uses Word Error Rate (WER) and Character Error Rate (CER) for orthographic fidelity, plus a task-specific Diacritic Error Rate (DER) that isolates accent edits. Full ByT5 fine-tuning attains the lowest error across all metrics; LoRA offers strong efficiency-accuracy trade-offs, and BiLSTM-CRF serves as a transparent baseline. The study underscores practical requirements for accent restoration - Unicode-safe preprocessing, mark-aware tokenization, and evaluation that separates grapheme from accent errors - and positions heritage-language technology as an emerging NLP area connecting computational modeling with philological and pedagogical aims. Results establish reproducible baselines for Rigvedic accent restoration and provide guidance for downstream tasks such as accent-aware OCR, ASR/chant synthesis, and digital scholarship.
△ Less
Submitted 28 November, 2025;
originally announced November 2025.
-
BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages
Authors:
Guduru Manoj,
Neel Prabhanjan Rachamalla,
Ashish Kulkarni,
Gautam Rajeev,
Jay Piplodiya,
Arul Menezes,
Shaharukh Khan,
Souvik Rana,
Manya Sah,
Chandra Khatri,
Shubham Agarwal
Abstract:
In the context of pretraining of Large Language Models (LLMs), synthetic data has emerged as an alternative for generating high-quality pretraining data at scale. This is particularly beneficial in low-resource language settings where the benefits of recent LLMs have been unevenly distributed across languages. In this work, we present a systematic study on the generation and evaluation of syntheti…
▽ More
In the context of pretraining of Large Language Models (LLMs), synthetic data has emerged as an alternative for generating high-quality pretraining data at scale. This is particularly beneficial in low-resource language settings where the benefits of recent LLMs have been unevenly distributed across languages. In this work, we present a systematic study on the generation and evaluation of synthetic multilingual pretraining data for Indic languages, where we construct a large-scale synthetic dataset BhashaKritika, comprising 540B tokens using 5 different techniques for 10 languages. We explore the impact of grounding generation in documents, personas, and topics. We analyze how language choice, both in the prompt instructions and document grounding, affects data quality, and we compare translations of English content with native generation in Indic languages. To support scalable and language-sensitive evaluation, we introduce a modular quality evaluation pipeline that integrates script and language detection, metadata consistency checks, n-gram repetition analysis, and perplexity-based filtering using KenLM models. Our framework enables robust quality control across diverse scripts and linguistic contexts. Empirical results through model runs reveal key trade-offs in generation strategies and highlight best practices for constructing effective multilingual corpora.
△ Less
Submitted 16 November, 2025; v1 submitted 13 November, 2025;
originally announced November 2025.
-
Robust Object Detection with Pseudo Labels from VLMs using Per-Object Co-teaching
Authors:
Uday Bhaskar,
Rishabh Bhattacharya,
Avinash Patel,
Sarthak Khoche,
Praveen Anil Kulkarni,
Naresh Manwani
Abstract:
Foundation models, especially vision-language models (VLMs), offer compelling zero-shot object detection for applications like autonomous driving, a domain where manual labelling is prohibitively expensive. However, their detection latency and tendency to hallucinate predictions render them unsuitable for direct deployment. This work introduces a novel pipeline that addresses this challenge by lev…
▽ More
Foundation models, especially vision-language models (VLMs), offer compelling zero-shot object detection for applications like autonomous driving, a domain where manual labelling is prohibitively expensive. However, their detection latency and tendency to hallucinate predictions render them unsuitable for direct deployment. This work introduces a novel pipeline that addresses this challenge by leveraging VLMs to automatically generate pseudo-labels for training efficient, real-time object detectors. Our key innovation is a per-object co-teaching-based training strategy that mitigates the inherent noise in VLM-generated labels. The proposed per-object coteaching approach filters noisy bounding boxes from training instead of filtering the entire image. Specifically, two YOLO models learn collaboratively, filtering out unreliable boxes from each mini-batch based on their peers' per-object loss values. Overall, our pipeline provides an efficient, robust, and scalable approach to train high-performance object detectors for autonomous driving, significantly reducing reliance on costly human annotation. Experimental results on the KITTI dataset demonstrate that our method outperforms a baseline YOLOv5m model, achieving a significant mAP@0.5 boost ($31.12\%$ to $46.61\%$) while maintaining real-time detection latency. Furthermore, we show that supplementing our pseudo-labelled data with a small fraction of ground truth labels ($10\%$) leads to further performance gains, reaching $57.97\%$ mAP@0.5 on the KITTI dataset. We observe similar performance improvements for the ACDC and BDD100k datasets.
△ Less
Submitted 12 November, 2025;
originally announced November 2025.
-
DiagnoLLM: A Hybrid Bayesian Neural Language Framework for Interpretable Disease Diagnosis
Authors:
Bowen Xu,
Xinyue Zeng,
Jiazhen Hu,
Tuo Wang,
Adithya Kulkarni
Abstract:
Building trustworthy clinical AI systems requires not only accurate predictions but also transparent, biologically grounded explanations. We present \texttt{DiagnoLLM}, a hybrid framework that integrates Bayesian deconvolution, eQTL-guided deep learning, and LLM-based narrative generation for interpretable disease diagnosis. DiagnoLLM begins with GP-unmix, a Gaussian Process-based hierarchical mod…
▽ More
Building trustworthy clinical AI systems requires not only accurate predictions but also transparent, biologically grounded explanations. We present \texttt{DiagnoLLM}, a hybrid framework that integrates Bayesian deconvolution, eQTL-guided deep learning, and LLM-based narrative generation for interpretable disease diagnosis. DiagnoLLM begins with GP-unmix, a Gaussian Process-based hierarchical model that infers cell-type-specific gene expression profiles from bulk and single-cell RNA-seq data while modeling biological uncertainty. These features, combined with regulatory priors from eQTL analysis, power a neural classifier that achieves high predictive performance in Alzheimer's Disease (AD) detection (88.0\% accuracy). To support human understanding and trust, we introduce an LLM-based reasoning module that translates model outputs into audience-specific diagnostic reports, grounded in clinical features, attribution signals, and domain knowledge. Human evaluations confirm that these reports are accurate, actionable, and appropriately tailored for both physicians and patients. Our findings show that LLMs, when deployed as post-hoc reasoners rather than end-to-end predictors, can serve as effective communicators within hybrid diagnostic pipelines.
△ Less
Submitted 16 November, 2025; v1 submitted 7 November, 2025;
originally announced November 2025.
-
MUTANT: A Recipe for Multilingual Tokenizer Design
Authors:
Souvik Rana,
Arul Menezes,
Ashish Kulkarni,
Chandra Khatri,
Shubham Agarwal
Abstract:
Tokenizers play a crucial role in determining the performance, training efficiency, and the inference cost of Large Language Models (LLMs). Designing effective tokenizers for multilingual LLMs is particularly challenging due to diverse scripts and rich morphological variation. While subword methods like Byte Pair Encoding (BPE) are widely adopted, their effectiveness in multilingual settings remai…
▽ More
Tokenizers play a crucial role in determining the performance, training efficiency, and the inference cost of Large Language Models (LLMs). Designing effective tokenizers for multilingual LLMs is particularly challenging due to diverse scripts and rich morphological variation. While subword methods like Byte Pair Encoding (BPE) are widely adopted, their effectiveness in multilingual settings remains underexplored. We present MUTANT, a recipe for building multilingual tokenizers, with careful vocabulary and training data design, language-aware pre-tokenization, and subword and multiword aware training. We also introduce MUTANT-Indic, a tokenizer for India-specific multilingual LLMs, that produces linguistically coherent tokens and achieves state-of-the-art performance. Evaluated across English, 22 Indian languages and code data, our tokenizer improves the average fertility score by 39.5%$ over LLaMA4 and by 18% over Sutra (the current best). This translates to 44% improvement in inference throughput over LLaMA4 while maintaining comparable performance on English and Indic benchmarks. We present detailed ablations across tokenizer training data size, vocabulary size, merging techniques, and pre-tokenization strategies, demonstrating the robustness of our design choices.
△ Less
Submitted 22 March, 2026; v1 submitted 5 November, 2025;
originally announced November 2025.
-
Learning Neural Observer-Predictor Models for Limb-level Sampling-based Locomotion Planning
Authors:
Abhijeet M. Kulkarni,
Ioannis Poulakakis,
Guoquan Huang
Abstract:
Accurate full-body motion prediction is essential for the safe, autonomous navigation of legged robots, enabling critical capabilities like limb-level collision checking in cluttered environments. Simplified kinematic models often fail to capture the complex, closed-loop dynamics of the robot and its low-level controller, limiting their predictions to simple planar motion. To address this, we pres…
▽ More
Accurate full-body motion prediction is essential for the safe, autonomous navigation of legged robots, enabling critical capabilities like limb-level collision checking in cluttered environments. Simplified kinematic models often fail to capture the complex, closed-loop dynamics of the robot and its low-level controller, limiting their predictions to simple planar motion. To address this, we present a learning-based observer-predictor framework that accurately predicts this motion. Our method features a neural observer with provable UUB guarantees that provides a reliable latent state estimate from a history of proprioceptive measurements. This stable estimate initializes a computationally efficient predictor, designed for the rapid, parallel evaluation of thousands of potential trajectories required by modern sampling-based planners. We validated the system by integrating our neural predictor into an MPPI-based planner on a Vision 60 quadruped. Hardware experiments successfully demonstrated effective, limb-aware motion planning in a challenging, narrow passage and over small objects, highlighting our system's ability to provide a robust foundation for high-performance, collision-aware planning on dynamic robotic platforms.
△ Less
Submitted 26 October, 2025;
originally announced October 2025.
-
NTIRE 2025 Challenge on Low Light Image Enhancement: Methods and Results
Authors:
Xiaoning Liu,
Zongwei Wu,
Florin-Alexandru Vasluianu,
Hailong Yan,
Bin Ren,
Yulun Zhang,
Shuhang Gu,
Le Zhang,
Ce Zhu,
Radu Timofte,
Kangbiao Shi,
Yixu Feng,
Tao Hu,
Yu Cao,
Peng Wu,
Yijin Liang,
Yanning Zhang,
Qingsen Yan,
Han Zhou,
Wei Dong,
Yan Min,
Mohab Kishawy,
Jun Chen,
Pengpeng Yu,
Anjin Park
, et al. (80 additional authors not shown)
Abstract:
This paper presents a comprehensive review of the NTIRE 2025 Low-Light Image Enhancement (LLIE) Challenge, highlighting the proposed solutions and final outcomes. The objective of the challenge is to identify effective networks capable of producing brighter, clearer, and visually compelling images under diverse and challenging conditions. A remarkable total of 762 participants registered for the c…
▽ More
This paper presents a comprehensive review of the NTIRE 2025 Low-Light Image Enhancement (LLIE) Challenge, highlighting the proposed solutions and final outcomes. The objective of the challenge is to identify effective networks capable of producing brighter, clearer, and visually compelling images under diverse and challenging conditions. A remarkable total of 762 participants registered for the competition, with 28 teams ultimately submitting valid entries. This paper thoroughly evaluates the state-of-the-art advancements in LLIE, showcasing the significant progress.
△ Less
Submitted 15 October, 2025;
originally announced October 2025.
-
Non-Linear Precoding via Dirty Paper Coding for Near-Field Downlink MISO Communications
Authors:
Akash Kulkarni,
Rajshekhar V Bhat
Abstract:
In 6G systems, extremely large-scale antenna arrays operating at terahertz frequencies extend the near-field region to typical user distances from the base station, enabling near-field communication (NFC) with fine spatial resolution through beamfocusing. Existing multiuser NFC systems predominantly employ linear precoding techniques such as zero-forcing (ZF), which suffer from performance degrada…
▽ More
In 6G systems, extremely large-scale antenna arrays operating at terahertz frequencies extend the near-field region to typical user distances from the base station, enabling near-field communication (NFC) with fine spatial resolution through beamfocusing. Existing multiuser NFC systems predominantly employ linear precoding techniques such as zero-forcing (ZF), which suffer from performance degradation due to the high transmit power required to suppress interference. This paper proposes a nonlinear precoding framework based on Dirty Paper Coding (DPC), which pre-cancels known interference to maximize the sum-rate performance. We formulate and solve the corresponding sum-rate maximization problems, deriving optimal power allocation strategies for both DPC and ZF schemes. Extensive simulations demonstrate that DPC achieves substantial sum-rate gains over ZF across various near-field configurations, with the most pronounced improvements observed for closely spaced users.
△ Less
Submitted 20 November, 2025; v1 submitted 15 October, 2025;
originally announced October 2025.
-
ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability
Authors:
Chung-En Sun,
Ge Yan,
Akshay Kulkarni,
Tsui-Wei Weng
Abstract:
Recent advances in long chain-of-thought (CoT) reasoning have largely prioritized answer accuracy and token efficiency, while overlooking aspects critical to trustworthiness. We argue that usable reasoning systems must be trustworthy, characterized by three properties: interpretability, faithfulness, and reliability. To this end, we propose ReFIne, a new training framework that integrates supervis…
▽ More
Recent advances in long chain-of-thought (CoT) reasoning have largely prioritized answer accuracy and token efficiency, while overlooking aspects critical to trustworthiness. We argue that usable reasoning systems must be trustworthy, characterized by three properties: interpretability, faithfulness, and reliability. To this end, we propose ReFIne, a new training framework that integrates supervised fine-tuning with GRPO to encourage models to: (i) improve interpretability by producing structured, tag-based traces with high-level planning that are easier for humans to follow; (ii) enhance faithfulness by explicitly disclosing the decisive information guiding each solution, with consistent cross-section references; and (iii) promote reliability by providing self-assessments of both the derivation's soundness and the confidence of the final answer. We apply ReFIne to the Qwen3 models at multiple scales (1.7B/4B/8B) and evaluate across mathematical benchmarks of varying difficulty. Our experimental results show that ReFIne models generate clearer and better-structured reasoning traces (interpretability +44.0%), more faithfully expose their underlying decision process (faithfulness +18.8%), and offer informative confidence estimates (reliability +42.4%). These findings highlight an overlooked but important direction: reasoning models should be optimized not only for accuracy, but also for broader dimensions of trustworthiness. Our code is available at: https://github.com/Trustworthy-ML-Lab/Training_Trustworthy_LRM_with_Refine
△ Less
Submitted 10 October, 2025;
originally announced October 2025.
-
Scalable Offline Metrics for Autonomous Driving
Authors:
Animikh Aich,
Adwait Kulkarni,
Eshed Ohn-Bar
Abstract:
Real-world evaluation of perception-based planning models for robotic systems, such as autonomous vehicles, can be safely and inexpensively conducted offline, i.e. by computing model prediction error over a pre-collected validation dataset with ground-truth annotations. However, extrapolating from offline model performance to online settings remains a challenge. In these settings, seemingly minor…
▽ More
Real-world evaluation of perception-based planning models for robotic systems, such as autonomous vehicles, can be safely and inexpensively conducted offline, i.e. by computing model prediction error over a pre-collected validation dataset with ground-truth annotations. However, extrapolating from offline model performance to online settings remains a challenge. In these settings, seemingly minor errors can compound and result in test-time infractions or collisions. This relationship is understudied, particularly across diverse closed-loop metrics and complex urban maneuvers. In this work, we revisit this undervalued question in policy evaluation through an extensive set of experiments across diverse conditions and metrics. Based on analysis in simulation, we find an even worse correlation between offline and online settings than reported by prior studies, casting doubts on the validity of current evaluation practices and metrics for driving policies. Next, we bridge the gap between offline and online evaluation. We investigate an offline metric based on epistemic uncertainty, which aims to capture events that are likely to cause errors in closed-loop settings. The resulting metric achieves over 13% improvement in correlation compared to previous offline metrics. We further validate the generalization of our findings beyond the simulation environment in real-world settings, where even greater gains are observed.
△ Less
Submitted 9 November, 2025; v1 submitted 9 October, 2025;
originally announced October 2025.
-
VoiceAgentBench: Are Voice Assistants ready for agentic tasks?
Authors:
Dhruv Jain,
Harshit Shukla,
Gautam Rajeev,
Ashish Kulkarni,
Chandra Khatri,
Shubham Agarwal
Abstract:
Large scale Speech Language Models have enabled voice assistants capable of understanding natural spoken queries and performing complex tasks. However, existing speech benchmarks largely focus on isolated capabilities such as transcription or question answering and do not systematically evaluate agentic behavior or adversarial robustness. To address this, we introduce VoiceAgentBench, a comprehens…
▽ More
Large scale Speech Language Models have enabled voice assistants capable of understanding natural spoken queries and performing complex tasks. However, existing speech benchmarks largely focus on isolated capabilities such as transcription or question answering and do not systematically evaluate agentic behavior or adversarial robustness. To address this, we introduce VoiceAgentBench, a comprehensive benchmark for evaluating SpeechLMs in realistic spoken agentic settings, comprising 6,000+ synthetic spoken queries spanning single-tool invocations, multi-tool workflows, multi-turn dialogue, and safety evaluations across English and six Indic languages. To ensure speaker diversity, we further simulate speaker variability using a novel sampling strategy that selects audios for TTS voice conversion based on speaker embeddings to maximize acoustic diversity. Our evaluation measures tool selection accuracy, structural consistency, and the correctness of tool invocations, including adversarial robustness. Across agentic tasks, ASR-LLM pipelines outperform end-to-end SpeechLMs, achieving up to 60.6% average parameter-filling accuracy on English, while SpeechLMs exhibit lower performance and sharper degradation on Indic languages. All models struggle in sequential workflows and safety evaluations, highlighting persistent limitations in tool orchestration, multilingual generalization, and safety robustness. VoiceAgentBench is publicly available on Hugging Face at https://huggingface.co/datasets/krutrim-ai-labs/VoiceAgentBench, and the codebase is released at https://github.com/ola-krutrim/VoiceAgentBench.
△ Less
Submitted 13 February, 2026; v1 submitted 9 October, 2025;
originally announced October 2025.
-
Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages
Authors:
Neel Prabhanjan Rachamalla,
Aravind Konakalla,
Gautam Rajeev,
Ashish Kulkarni,
Chandra Khatri,
Shubham Agarwal
Abstract:
The effectiveness of Large Language Models (LLMs) depends heavily on the availability of high-quality post-training data, particularly instruction-tuning and preference-based examples. Existing open-source datasets, however, often lack multilingual coverage, cultural grounding, and suffer from task diversity gaps that are especially pronounced for Indian languages. We introduce a human-in-the-loop…
▽ More
The effectiveness of Large Language Models (LLMs) depends heavily on the availability of high-quality post-training data, particularly instruction-tuning and preference-based examples. Existing open-source datasets, however, often lack multilingual coverage, cultural grounding, and suffer from task diversity gaps that are especially pronounced for Indian languages. We introduce a human-in-the-loop pipeline that combines translations with synthetic expansion to produce reliable and diverse Indic post-training data. Using this pipeline, we curate two datasets: Pragyaan-IT (22.5K) and Pragyaan-Align (100K) across 10 Indian languages covering 13 broad and 56 sub-categories, leveraging 57 diverse datasets. Our dataset protocol incorporates several often-overlooked dimensions and emphasize task diversity, multi-turn dialogue, instruction fidelity, safety alignment, and preservation of cultural nuance, providing a foundation for more inclusive and effective multilingual LLMs.
△ Less
Submitted 8 October, 2025;
originally announced October 2025.
-
AWARE, Beyond Sentence Boundaries: A Contextual Transformer Framework for Identifying Cultural Capital in STEM Narratives
Authors:
Khalid Mehtab Khan,
Anagha Kulkarni
Abstract:
Identifying cultural capital (CC) themes in student reflections can offer valuable insights that help foster equitable learning environments in classrooms. However, themes such as aspirational goals or family support are often woven into narratives, rather than appearing as direct keywords. This makes them difficult to detect for standard NLP models that process sentences in isolation. The core ch…
▽ More
Identifying cultural capital (CC) themes in student reflections can offer valuable insights that help foster equitable learning environments in classrooms. However, themes such as aspirational goals or family support are often woven into narratives, rather than appearing as direct keywords. This makes them difficult to detect for standard NLP models that process sentences in isolation. The core challenge stems from a lack of awareness, as standard models are pre-trained on general corpora, leaving them blind to the domain-specific language and narrative context inherent to the data. To address this, we introduce AWARE, a framework that systematically attempts to improve a transformer model's awareness for this nuanced task. AWARE has three core components: 1) Domain Awareness, adapting the model's vocabulary to the linguistic style of student reflections; 2) Context Awareness, generating sentence embeddings that are aware of the full essay context; and 3) Class Overlap Awareness, employing a multi-label strategy to recognize the coexistence of themes in a single sentence. Our results show that by making the model explicitly aware of the properties of the input, AWARE outperforms a strong baseline by 2.1 percentage points in Macro-F1 and shows considerable improvements across all themes. This work provides a robust and generalizable methodology for any text classification task in which meaning depends on the context of the narrative.
△ Less
Submitted 3 November, 2025; v1 submitted 6 October, 2025;
originally announced October 2025.
-
DBF-MA: A Differential Bayesian Filtering Planner for Multi-Agent Autonomous Racing Overtakes
Authors:
Trent Weiss,
Amar Kulkarni,
Madhur Behl
Abstract:
A significant challenge in autonomous racing is to generate overtaking maneuvers. Racing agents must execute these maneuvers on complex racetracks with little room for error. Optimization techniques and graph-based methods have been proposed, but these methods often rely on oversimplified assumptions for collision-avoidance and dynamic constraints. In this work, we present an approach to trajector…
▽ More
A significant challenge in autonomous racing is to generate overtaking maneuvers. Racing agents must execute these maneuvers on complex racetracks with little room for error. Optimization techniques and graph-based methods have been proposed, but these methods often rely on oversimplified assumptions for collision-avoidance and dynamic constraints. In this work, we present an approach to trajectory synthesis based on an extension of the Differential Bayesian Filtering framework. Our approach for collision-free trajectory synthesis frames the problem as one of Bayesian Inference over the space of Composite Bezier Curves. Our method is derivative-free, does not require a spherical approximation of the vehicle footprint, linearization of constraints, or simplifying upper bounds on collision avoidance. We conduct a closed-loop analysis of DBF-MA and find it successfully overtakes an opponent in 87% of tested scenarios, outperforming existing methods in autonomous overtaking.
△ Less
Submitted 1 October, 2025; v1 submitted 26 September, 2025;
originally announced September 2025.
-
CorIL: Towards Enriching Indian Language to Indian Language Parallel Corpora and Machine Translation Systems
Authors:
Soham Bhattacharjee,
Mukund K Roy,
Yathish Poojary,
Bhargav Dave,
Mihir Raj,
Vandan Mujadia,
Baban Gain,
Pruthwik Mishra,
Arafat Ahsan,
Parameswari Krishnamurthy,
Ashwath Rao,
Gurpreet Singh Josan,
Preeti Dubey,
Aadil Amin Kak,
Anna Rao Kulkarni,
Narendra VG,
Sunita Arora,
Rakesh Balbantray,
Prasenjit Majumdar,
Karunesh K Arora,
Asif Ekbal,
Dipti Mishra Sharma
Abstract:
India's linguistic landscape is one of the most diverse in the world, comprising over 120 major languages and approximately 1,600 additional languages, with 22 officially recognized as scheduled languages in the Indian Constitution. Despite recent progress in multilingual neural machine translation (NMT), high-quality parallel corpora for Indian languages remain scarce, especially across varied do…
▽ More
India's linguistic landscape is one of the most diverse in the world, comprising over 120 major languages and approximately 1,600 additional languages, with 22 officially recognized as scheduled languages in the Indian Constitution. Despite recent progress in multilingual neural machine translation (NMT), high-quality parallel corpora for Indian languages remain scarce, especially across varied domains. In this paper, we introduce a large-scale, high-quality annotated parallel corpus covering 11 of these languages : English, Telugu, Hindi, Punjabi, Odia, Kashmiri, Sindhi, Dogri, Kannada, Urdu, and Gujarati comprising a total of 772,000 bi-text sentence pairs. The dataset is carefully curated and systematically categorized into three key domains: Government, Health, and General, to enable domain-aware machine translation research and facilitate effective domain adaptation. To demonstrate the utility of CorIL and establish strong benchmarks for future research, we fine-tune and evaluate several state-of-the-art NMT models, including IndicTrans2, NLLB, and BhashaVerse. Our analysis reveals important performance trends and highlights the corpus's value in probing model capabilities. For instance, the results show distinct performance patterns based on language script, with massively multilingual models showing an advantage on Perso-Arabic scripts (Urdu, Sindhi) while other models excel on Indic scripts. This paper provides a detailed domain-wise performance analysis, offering insights into domain sensitivity and cross-script transfer learning. By publicly releasing CorIL, we aim to significantly improve the availability of high-quality training data for Indian languages and provide a valuable resource for the machine translation research community.
△ Less
Submitted 24 September, 2025;
originally announced September 2025.
-
FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs
Authors:
Debarpan Bhattacharya,
Apoorva Kulkarni,
Sriram Ganapathy
Abstract:
The accurate trust assessment of multimodal large language models (MLLMs) generated predictions, which can enable selective prediction and improve user confidence, is challenging due to the diverse multi-modal input paradigms. We propose Functionally Equivalent Sampling for Trust Assessment (FESTA), a multimodal input sampling technique for MLLMs, that generates an uncertainty measure based on the…
▽ More
The accurate trust assessment of multimodal large language models (MLLMs) generated predictions, which can enable selective prediction and improve user confidence, is challenging due to the diverse multi-modal input paradigms. We propose Functionally Equivalent Sampling for Trust Assessment (FESTA), a multimodal input sampling technique for MLLMs, that generates an uncertainty measure based on the equivalent and complementary input samplings. The proposed task-preserving sampling approach for uncertainty quantification expands the input space to probe the consistency (through equivalent samples) and sensitivity (through complementary samples) of the model. FESTA uses only input-output access of the model (black-box), and does not require ground truth (unsupervised). The experiments are conducted with various off-the-shelf multi-modal LLMs, on both visual and audio reasoning tasks. The proposed FESTA uncertainty estimate achieves significant improvement (33.3% relative improvement for vision-LLMs and 29.6% relative improvement for audio-LLMs) in selective prediction performance, based on area-under-receiver-operating-characteristic curve (AUROC) metric in detecting mispredictions. The code implementation is open-sourced.
△ Less
Submitted 30 January, 2026; v1 submitted 20 September, 2025;
originally announced September 2025.
-
Snail Homing and Mating Search Algorithm for Weight Optimization of Stepped-Transmission Shaft
Authors:
Kaustav Saha,
Ishaan R Kale,
Vivek Patel,
Anand J Kulkarni,
Puskaraj D Sonawwanay
Abstract:
In this paper, the steeped-transmission shaft design problem is proposed for weight optimization. The bio-inspired search-based Snail Homing and Mating Search (SHMS) algorithm is utilized to solve the problem. It is inspired by the social behaviour of snails and their inherent nature of finding better homes, and mate. The proposed steeped-transmission shaft design problem is modelled considering t…
▽ More
In this paper, the steeped-transmission shaft design problem is proposed for weight optimization. The bio-inspired search-based Snail Homing and Mating Search (SHMS) algorithm is utilized to solve the problem. It is inspired by the social behaviour of snails and their inherent nature of finding better homes, and mate. The proposed steeped-transmission shaft design problem is modelled considering the fatigue loading, combined bending, torsion loads, and the principle of Modified Goodman criteria. The forces diagram and the bending moment diagrams are obtained using the MDSOLIDS software. The forces and bending moment are then used to mathematical model the objective function and constraints. The SHMS algorithm has yielded the desired solution with reasonable computational cost. The constraints are handled using a static penalty function approach. The statistical results obtained using SHMS algorithm are further used for generating CAD model. The analysis is carried out in ANSYS Workbench. Further, the deflection obtained from SHMS algorithm and ANSYS Workbench are compared and results are discussed in details.
△ Less
Submitted 17 September, 2025;
originally announced September 2025.
-
ODoQ: Oblivious DNS-over-QUIC
Authors:
Aditya Kulkarni,
Tamal Das,
Vivek Balachandran
Abstract:
The Domain Name System (DNS), which converts domain names to their respective IP addresses, has advanced enhancements aimed at safeguarding DNS data and users' identity from attackers. The recent privacy-focused advancements have enabled the IETF to standardize several protocols. Nevertheless, these protocols tend to focus on either strengthening user privacy (like Oblivious DNS and Oblivious DNS-…
▽ More
The Domain Name System (DNS), which converts domain names to their respective IP addresses, has advanced enhancements aimed at safeguarding DNS data and users' identity from attackers. The recent privacy-focused advancements have enabled the IETF to standardize several protocols. Nevertheless, these protocols tend to focus on either strengthening user privacy (like Oblivious DNS and Oblivious DNS-over-HTTPS) or reducing resolution latency (as demonstrated by DNS-over-QUIC). Achieving both within a single protocol remains a key challenge, which we address in this paper. Our proposed protocol -- 'Oblivious DNS-over-QUIC' (ODoQ) -- leverages the benefits of the QUIC protocol and incorporates an intermediary proxy server to protect the client's identity from exposure to the recursive resolver.
△ Less
Submitted 8 December, 2025; v1 submitted 14 September, 2025;
originally announced September 2025.
-
Bridging the Gap in Phishing Detection: A Comprehensive Phishing Dataset Collector
Authors:
Aditya Kulkarni,
Shahil Manishbhai Patel,
Shivam Pradip Tirmare,
Vivek Balachandran,
Tamal Das
Abstract:
To combat phishing attacks -- aimed at luring web users to divulge their sensitive information -- various phishing detection approaches have been proposed. As attackers focus on devising new tactics to bypass existing detection solutions, researchers have adapted by integrating machine learning and deep learning into phishing detection. Phishing dataset collection is vital to developing effective…
▽ More
To combat phishing attacks -- aimed at luring web users to divulge their sensitive information -- various phishing detection approaches have been proposed. As attackers focus on devising new tactics to bypass existing detection solutions, researchers have adapted by integrating machine learning and deep learning into phishing detection. Phishing dataset collection is vital to developing effective phishing detection approaches, which highly depend on the diversity of the gathered datasets. The lack of diversity in the dataset results in a biased model. Since phishing websites are often short-lived, collecting them is also a challenge. Consequently, very few phishing webpage dataset repositories exist to date. No single repository comprehensively consolidates all phishing elements corresponding to a phishing webpage, namely, URL, webpage source code, screenshot, and related webpage resources. This paper introduces a resource collection tool designed to gather various resources associated with a URL, such as CSS, Javascript, favicons, webpage images, and screenshots. Our tool leverages PhishTank as the primary source for obtaining active phishing URLs. Our tool fetches several additional webpage resources compared to PyWebCopy Python library, which provides webpage content for a given URL. Additionally, we share a sample dataset generated using our tool comprising 4,056 legitimate and 5,666 phishing URLs along with their associated resources. We also remark on the top correlated phishing features with their associated class label found in our dataset. Our tool offers a comprehensive resource set that can aid researchers in developing effective phishing detection approaches.
△ Less
Submitted 11 September, 2025;
originally announced September 2025.
-
Phishing Webpage Detection: Unveiling the Threat Landscape and Investigating Detection Techniques
Authors:
Aditya Kulkarni,
Vivek Balachandran,
Tamal Das
Abstract:
In the realm of cybersecurity, phishing stands as a prevalent cyber attack, where attackers employ various tactics to deceive users into gathering their sensitive information, potentially leading to identity theft or financial gain. Researchers have been actively working on advancing phishing webpage detection approaches to detect new phishing URLs, bolstering user protection. Nonetheless, the eve…
▽ More
In the realm of cybersecurity, phishing stands as a prevalent cyber attack, where attackers employ various tactics to deceive users into gathering their sensitive information, potentially leading to identity theft or financial gain. Researchers have been actively working on advancing phishing webpage detection approaches to detect new phishing URLs, bolstering user protection. Nonetheless, the ever-evolving strategies employed by attackers, aimed at circumventing existing detection approaches and tools, present an ongoing challenge to the research community. This survey presents a systematic categorization of diverse phishing webpage detection approaches, encompassing URL-based, webpage content-based, and visual techniques. Through a comprehensive review of these approaches and an in-depth analysis of existing literature, our study underscores current research gaps in phishing webpage detection. Furthermore, we suggest potential solutions to address some of these gaps, contributing valuable insights to the ongoing efforts to combat phishing attacks.
△ Less
Submitted 10 September, 2025;
originally announced September 2025.
-
Phish-Blitz: Advancing Phishing Detection with Comprehensive Webpage Resource Collection and Visual Integrity Preservation
Authors:
Duddu Hriday,
Aditya Kulkarni,
Vivek Balachandran,
Tamal Das
Abstract:
Phishing attacks are increasingly prevalent, with adversaries creating deceptive webpages to steal sensitive information. Despite advancements in machine learning and deep learning for phishing detection, attackers constantly develop new tactics to bypass detection models. As a result, phishing webpages continue to reach users, particularly those unable to recognize phishing indicators. To improve…
▽ More
Phishing attacks are increasingly prevalent, with adversaries creating deceptive webpages to steal sensitive information. Despite advancements in machine learning and deep learning for phishing detection, attackers constantly develop new tactics to bypass detection models. As a result, phishing webpages continue to reach users, particularly those unable to recognize phishing indicators. To improve detection accuracy, models must be trained on large datasets containing both phishing and legitimate webpages, including URLs, webpage content, screenshots, and logos. However, existing tools struggle to collect the required resources, especially given the short lifespan of phishing webpages, limiting dataset comprehensiveness. In response, we introduce Phish-Blitz, a tool that downloads phishing and legitimate webpages along with their associated resources, such as screenshots. Unlike existing tools, Phish-Blitz captures live webpage screenshots and updates resource file paths to maintain the original visual integrity of the webpage. We provide a dataset containing 8,809 legitimate and 5,000 phishing webpages, including all associated resources. Our dataset and tool are publicly available on GitHub, contributing to the research community by offering a more complete dataset for phishing detection.
△ Less
Submitted 10 September, 2025;
originally announced September 2025.
-
Overcoming DNSSEC Islands of Security: A TLS and IP-Based Certificate Solution
Authors:
Aduma Rishith,
Aditya Kulkarni,
Tamal Das,
Vivek Balachandran
Abstract:
The Domain Name System (DNS) serves as the backbone of the Internet, primarily translating domain names to IP addresses. Over time, various enhancements have been introduced to strengthen the integrity of DNS. Among these, DNSSEC stands out as a leading cryptographic solution. It protects against attacks (such as DNS spoofing) by establishing a chain of trust throughout the DNS nameserver hierarch…
▽ More
The Domain Name System (DNS) serves as the backbone of the Internet, primarily translating domain names to IP addresses. Over time, various enhancements have been introduced to strengthen the integrity of DNS. Among these, DNSSEC stands out as a leading cryptographic solution. It protects against attacks (such as DNS spoofing) by establishing a chain of trust throughout the DNS nameserver hierarchy. However, DNSSEC's effectiveness is compromised when there is a break in this chain, resulting in "Islands of Security", where domains can authenticate locally but not across hierarchical levels, leading to a loss of trust and validation between them. Leading approaches to addressing these issues were centralized, with a single authority maintaining some kind of bulletin board. This approach requires significantly more infrastructure and places excessive trust in the entity responsible for managing it properly. In this paper, we propose a decentralized approach to addressing gaps in DNSSEC's chain of trust, commonly referred to as "Islands of Security". We leverage TLS and IP-based certificates to enable end-to-end authentication between hierarchical levels, eliminating the need for uniform DNSSEC deployment across every level of the DNS hierarchy. This approach enhances the overall integrity of DNSSEC, while reducing dependence on registrars for maintaining signature records to verify the child nameserver's authenticity. By offering a more flexible and efficient solution, our method strengthens DNS security and streamlines deployment across diverse environments.
△ Less
Submitted 8 December, 2025; v1 submitted 10 September, 2025;
originally announced September 2025.
-
GENUINE: Graph Enhanced Multi-level Uncertainty Estimation for Large Language Models
Authors:
Tuo Wang,
Adithya Kulkarni,
Tyler Cody,
Peter A. Beling,
Yujun Yan,
Dawei Zhou
Abstract:
Uncertainty estimation is essential for enhancing the reliability of Large Language Models (LLMs), particularly in high-stakes applications. Existing methods often overlook semantic dependencies, relying on token-level probability measures that fail to capture structural relationships within the generated text. We propose GENUINE: Graph ENhanced mUlti-level uncertaINty Estimation for Large Languag…
▽ More
Uncertainty estimation is essential for enhancing the reliability of Large Language Models (LLMs), particularly in high-stakes applications. Existing methods often overlook semantic dependencies, relying on token-level probability measures that fail to capture structural relationships within the generated text. We propose GENUINE: Graph ENhanced mUlti-level uncertaINty Estimation for Large Language Models, a structure-aware framework that leverages dependency parse trees and hierarchical graph pooling to refine uncertainty quantification. By incorporating supervised learning, GENUINE effectively models semantic and structural relationships, improving confidence assessments. Extensive experiments across NLP tasks show that GENUINE achieves up to 29% higher AUROC than semantic entropy-based approaches and reduces calibration errors by over 15%, demonstrating the effectiveness of graph-based uncertainty modeling. The code is available at https://github.com/ODYSSEYWT/GUQ.
△ Less
Submitted 9 September, 2025;
originally announced September 2025.
-
Speech DF Arena: A Leaderboard for Speech DeepFake Detection Models
Authors:
Sandipana Dowerah,
Atharva Kulkarni,
Ajinkya Kulkarni,
Hoan My Tran,
Joonas Kalda,
Artem Fedorchenko,
Benoit Fauve,
Damien Lolive,
Tanel Alumäe,
Matthew Magimai Doss
Abstract:
Parallel to the development of advanced deepfake audio generation, audio deepfake detection has also seen significant progress. However, a standardized and comprehensive benchmark is still missing. To address this, we introduce Speech DeepFake (DF) Arena, the first comprehensive benchmark for audio deepfake detection. Speech DF Arena provides a toolkit to uniformly evaluate detection systems, curr…
▽ More
Parallel to the development of advanced deepfake audio generation, audio deepfake detection has also seen significant progress. However, a standardized and comprehensive benchmark is still missing. To address this, we introduce Speech DeepFake (DF) Arena, the first comprehensive benchmark for audio deepfake detection. Speech DF Arena provides a toolkit to uniformly evaluate detection systems, currently across 14 diverse datasets and attack scenarios, standardized evaluation metrics and protocols for reproducibility and transparency. It also includes a leaderboard to compare and rank the systems to help researchers and developers enhance their reliability and robustness. We include 14 evaluation sets, 12 state-of-the-art open-source and 3 proprietary detection systems. Our study presents many systems exhibiting high EER in out-of-domain scenarios, highlighting the need for extensive cross-domain evaluation. The leaderboard is hosted on Huggingface1 and a toolkit for reproducing results across the listed datasets is available on GitHub.
△ Less
Submitted 2 September, 2025;
originally announced September 2025.
-
Enhancing Semantic Document Retrieval- Employing Group Steiner Tree Algorithm with Domain Knowledge Enrichment
Authors:
Apurva Kulkarni,
Chandrashekar Ramanathan,
Vinu E Venugopal
Abstract:
Retrieving pertinent documents from various data sources with diverse characteristics poses a significant challenge for Document Retrieval Systems. The complexity of this challenge is further compounded when accounting for the semantic relationship between data and domain knowledge. While existing retrieval systems using semantics (usually represented as Knowledge Graphs created from open-access r…
▽ More
Retrieving pertinent documents from various data sources with diverse characteristics poses a significant challenge for Document Retrieval Systems. The complexity of this challenge is further compounded when accounting for the semantic relationship between data and domain knowledge. While existing retrieval systems using semantics (usually represented as Knowledge Graphs created from open-access resources and generic domain knowledge) hold promise in delivering relevant outcomes, their precision may be compromised due to the absence of domain-specific information and reliance on outdated knowledge sources. In this research, the primary focus is on two key contributions- a) the development of a versatile algorithm- 'Semantic-based Concept Retrieval using Group Steiner Tree' that incorporates domain information to enhance semantic-aware knowledge representation and data access, and b) the practical implementation of the proposed algorithm within a document retrieval system using real-world data. To assess the effectiveness of the SemDR system, research work conducts performance evaluations using a benchmark consisting of 170 real-world search queries. Rigorous evaluation and verification by domain experts are conducted to ensure the validity and accuracy of the results. The experimental findings demonstrate substantial advancements when compared to the baseline systems, with precision and accuracy achieving levels of 90% and 82% respectively, signifying promising improvements.
△ Less
Submitted 28 August, 2025;
originally announced August 2025.
-
GRACE: Generative Recommendation via Journey-Aware Sparse Attention on Chain-of-Thought Tokenization
Authors:
Luyi Ma,
Wanjia Zhang,
Kai Zhao,
Abhishek Kulkarni,
Lalitesh Morishetti,
Anjana Ganesh,
Ashish Ranjan,
Aashika Padmanabhan,
Jianpeng Xu,
Jason Cho,
Praveen Kanumala,
Kaushiki Nag,
Sumit Dutta,
Kamiya Motwani,
Malay Patel,
Evren Korpeoglu,
Sushant Kumar,
Kannan Achan
Abstract:
Generative models have recently demonstrated strong potential in multi-behavior recommendation systems, leveraging the expressive power of transformers and tokenization to generate personalized item sequences. However, their adoption is hindered by (1) the lack of explicit information for token reasoning, (2) high computational costs due to quadratic attention complexity and dense sequence represe…
▽ More
Generative models have recently demonstrated strong potential in multi-behavior recommendation systems, leveraging the expressive power of transformers and tokenization to generate personalized item sequences. However, their adoption is hindered by (1) the lack of explicit information for token reasoning, (2) high computational costs due to quadratic attention complexity and dense sequence representations after tokenization, and (3) limited multi-scale modeling over user history. In this work, we propose GRACE (Generative Recommendation via journey-aware sparse Attention on Chain-of-thought tokEnization), a novel generative framework for multi-behavior sequential recommendation. GRACE introduces a hybrid Chain-of-Thought (CoT) tokenization method that encodes user-item interactions with explicit attributes from product knowledge graphs (e.g., category, brand, price) over semantic tokenization, enabling interpretable and behavior-aligned generation. To address the inefficiency of standard attention, we design a Journey-Aware Sparse Attention (JSA) mechanism, which selectively attends to compressed, intra-, inter-, and current-context segments in the tokenized sequence. Experiments on two real-world datasets show that GRACE significantly outperforms state-of-the-art baselines, achieving up to +106.9% HR@10 and +106.7% NDCG@10 improvement over the state-of-the-art baseline on the Home domain, and +22.1% HR@10 on the Electronics domain. GRACE also reduces attention computation by up to 48% with long sequences.
△ Less
Submitted 19 July, 2025;
originally announced July 2025.
-
Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review
Authors:
Maha Tufail Agro,
Atharva Kulkarni,
Karima Kadaoui,
Zeerak Talat,
Hanan Aldarmaki
Abstract:
Motivated by a growing research interest into automatic speech recognition (ASR), and the growing body of work for languages in which code-switching (CS) often occurs, we present a systematic literature review of code-switching in end-to-end ASR models. We collect and manually annotate papers published in peer reviewed venues. We document the languages considered, datasets, metrics, model choices,…
▽ More
Motivated by a growing research interest into automatic speech recognition (ASR), and the growing body of work for languages in which code-switching (CS) often occurs, we present a systematic literature review of code-switching in end-to-end ASR models. We collect and manually annotate papers published in peer reviewed venues. We document the languages considered, datasets, metrics, model choices, and performance, and present a discussion of challenges in end-to-end ASR for code-switching. Our analysis thus provides insights on current research efforts and available resources as well as opportunities and gaps to guide future research.
△ Less
Submitted 10 July, 2025;
originally announced July 2025.
-
DISPROTBENCH: Uncovering the Functional Limits of Protein Structure Prediction Models in Intrinsically Disordered Regions
Authors:
Xinyue Zeng,
Tuo Wang,
Adithya Kulkarni,
Alexander Lu,
Alexandra Ni,
Phoebe Xing,
Junhan Zhao,
Siwei Chen,
Dawei Zhou
Abstract:
Intrinsically disordered regions (IDRs) play central roles in cellular function, yet remain poorly evaluated by existing protein structure prediction benchmarks. Current evaluations largely focus on well-folded domains, overlooking three fundamental challenges in realistic biological settings: the structural complexity of proteins, the resulting low availability of reliable ground truth, and predi…
▽ More
Intrinsically disordered regions (IDRs) play central roles in cellular function, yet remain poorly evaluated by existing protein structure prediction benchmarks. Current evaluations largely focus on well-folded domains, overlooking three fundamental challenges in realistic biological settings: the structural complexity of proteins, the resulting low availability of reliable ground truth, and prediction uncertainty that can propagate into high-risk downstream failures, such as in drug discovery, protein-protein interaction modeling, and functional annotation. We present DisProtBench, an IDR-centric benchmark that explicitly incorporates prediction uncertainty into the evaluation of protein structure prediction models (PSPMs). To address structural complexity and ground-truth scarcity, we curate and unify a large-scale, multi-modal dataset spanning disease-relevant IDRs, GPCR-ligand interactions, and multimeric protein complexes. To assess predictive uncertainty, we introduce Functional Uncertainty Sensitivity (FUS), a novel prediction uncertainty-stratified metric that quantifies downstream task performance under prediction uncertainty. Using this benchmark, we conduct a systematic evaluation of state-of-the-art PSPMs and reveal clear, task-dependent failure modes. Protein-protein interaction prediction degrades sharply in IDRs, while structure-based drug discovery remains comparatively robust. These effects are largely invisible to standard global accuracy metrics, which overestimate functional reliability under prediction uncertainty. We have open-sourced our benchmark and the codebase at https://github.com/Susan571/DisProtBench.
△ Less
Submitted 10 February, 2026; v1 submitted 18 June, 2025;
originally announced July 2025.
-
Non-exchangeable Conformal Prediction for Temporal Graph Neural Networks
Authors:
Tuo Wang,
Jian Kang,
Yujun Yan,
Adithya Kulkarni,
Dawei Zhou
Abstract:
Conformal prediction for graph neural networks (GNNs) offers a promising framework for quantifying uncertainty, enhancing GNN reliability in high-stakes applications. However, existing methods predominantly focus on static graphs, neglecting the evolving nature of real-world graphs. Temporal dependencies in graph structure, node attributes, and ground truth labels violate the fundamental exchangea…
▽ More
Conformal prediction for graph neural networks (GNNs) offers a promising framework for quantifying uncertainty, enhancing GNN reliability in high-stakes applications. However, existing methods predominantly focus on static graphs, neglecting the evolving nature of real-world graphs. Temporal dependencies in graph structure, node attributes, and ground truth labels violate the fundamental exchangeability assumption of standard conformal prediction methods, limiting their applicability. To address these challenges, in this paper, we introduce NCPNET, a novel end-to-end conformal prediction framework tailored for temporal graphs. Our approach extends conformal prediction to dynamic settings, mitigating statistical coverage violations induced by temporal dependencies. To achieve this, we propose a diffusion-based non-conformity score that captures both topological and temporal uncertainties within evolving networks. Additionally, we develop an efficiency-aware optimization algorithm that improves the conformal prediction process, enhancing computational efficiency and reducing coverage violations. Extensive experiments on diverse real-world temporal graphs, including WIKI, REDDIT, DBLP, and IBM Anti-Money Laundering dataset, demonstrate NCPNET's capability to ensure guaranteed coverage in temporal graphs, achieving up to a 31% reduction in prediction set size on the WIKI dataset, significantly improving efficiency compared to state-of-the-art methods. Our data and code are available at https://github.com/ODYSSEYWT/NCPNET.
△ Less
Submitted 2 July, 2025;
originally announced July 2025.
-
Modeling Data Diversity for Joint Instance and Verbalizer Selection in Cold-Start Scenarios
Authors:
Mohna Chakraborty,
Adithya Kulkarni,
Qi Li
Abstract:
Prompt-based methods leverage the knowledge of pre-trained language models (PLMs) trained with a masked language modeling (MLM) objective; however, these methods are sensitive to template, verbalizer, and few-shot instance selection, particularly in cold-start settings with no labeled data. Existing studies overlook the dependency between instances and verbalizers, where instance-label probabiliti…
▽ More
Prompt-based methods leverage the knowledge of pre-trained language models (PLMs) trained with a masked language modeling (MLM) objective; however, these methods are sensitive to template, verbalizer, and few-shot instance selection, particularly in cold-start settings with no labeled data. Existing studies overlook the dependency between instances and verbalizers, where instance-label probabilities depend on verbalizer token proximity in the embedding space. To address this, we propose COLDSELECT, a joint verbalizer and instance selection approach that models data diversity. COLDSELECT maps PLM vocabulary and $h_{[MASK]}$ embeddings into a shared space, applying dimensionality reduction and clustering to ensure efficient and diverse selection. By optimizing for minimal uncertainty and maximal diversity, COLDSELECT captures data relationships effectively. Experiments on eight benchmarks demonstrate COLDSELECT's superiority in reducing uncertainty and enhancing generalization, outperforming baselines in verbalizer and few-shot instance selection for cold-start scenarios.
△ Less
Submitted 30 June, 2025;
originally announced July 2025.
-
Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability
Authors:
Tuomas Oikarinen,
Ge Yan,
Akshay Kulkarni,
Tsui-Wei Weng
Abstract:
Interpreting individual neurons or directions in activation space is an important topic in mechanistic interpretability. Numerous automated interpretability methods have been proposed to generate such explanations, but it remains unclear how reliable these explanations are, and which methods produce the most accurate descriptions. While crowd-sourced evaluations are commonly used, existing pipelin…
▽ More
Interpreting individual neurons or directions in activation space is an important topic in mechanistic interpretability. Numerous automated interpretability methods have been proposed to generate such explanations, but it remains unclear how reliable these explanations are, and which methods produce the most accurate descriptions. While crowd-sourced evaluations are commonly used, existing pipelines are noisy, costly, and typically assess only the highest-activating inputs, leading to unreliable results. In this paper, we introduce two techniques to enable cost-effective and accurate crowdsourced evaluation of automated interpretability methods beyond top activating inputs. First, we propose Model-Guided Importance Sampling (MG-IS) to select the most informative inputs to show human raters. In our experiments, we show this reduces the number of inputs needed to reach the same evaluation accuracy by ~13x. Second, we address label noise in crowd-sourced ratings through Bayesian Rating Aggregation (BRAgg), which allows us to reduce the number of ratings per input required to overcome noise by ~3x. Together, these techniques reduce the evaluation cost by ~40x, making large-scale evaluation feasible. Finally, we use our methods to conduct a large scale crowd-sourced study comparing recent automated interpretability methods for vision networks.
△ Less
Submitted 2 December, 2025; v1 submitted 9 June, 2025;
originally announced June 2025.
-
Reinforcing Code Generation: Improving Text-to-SQL with Execution-Based Learning
Authors:
Atharv Kulkarni,
Vivek Srikumar
Abstract:
In this work, we study the problem of code generation with a large language model (LLM), with a focus on generating SQL queries from natural language questions. We ask: Instead of using supervised fine tuning with text-code pairs, can we tune a model by having it interact with a database engine? We frame this problem as a reinforcement learning problem where the model receives execution-based feed…
▽ More
In this work, we study the problem of code generation with a large language model (LLM), with a focus on generating SQL queries from natural language questions. We ask: Instead of using supervised fine tuning with text-code pairs, can we tune a model by having it interact with a database engine? We frame this problem as a reinforcement learning problem where the model receives execution-based feedback from the environment in the form of scalar rewards. These rewards penalize execution failures and assign positive values when a query returns a correct answer. We use the rewards within the Group Relative Policy Optimization (GRPO) framework. We use a tabular reasoning benchmark to test and evaluate our findings. We find that with only weak supervision in the form of question-answer pairs, RL-tuning improves the accuracy of model generated SQL code from 31.49 to 49.83 while reducing error percentage from 25.43% to 14.71%. This improvement allowed the model nearly match the performance performance to the larger SQLCoder-70B model. Our work demonstrates the potential of using execution-based feedback to improve symbolic reasoning capabilities of LLMs.
△ Less
Submitted 6 June, 2025;
originally announced June 2025.
-
LLM-Symbolic Integration for Robust Temporal Tabular Reasoning
Authors:
Atharv Kulkarni,
Kushagra Dixit,
Vivek Srikumar,
Dan Roth,
Vivek Gupta
Abstract:
Temporal tabular question answering presents a significant challenge for Large Language Models (LLMs), requiring robust reasoning over structured data, which is a task where traditional prompting methods often fall short. These methods face challenges such as memorization, sensitivity to table size, and reduced performance on complex queries. To overcome these limitations, we introduce TempTabQA-C…
▽ More
Temporal tabular question answering presents a significant challenge for Large Language Models (LLMs), requiring robust reasoning over structured data, which is a task where traditional prompting methods often fall short. These methods face challenges such as memorization, sensitivity to table size, and reduced performance on complex queries. To overcome these limitations, we introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations, alongside a symbolic intermediate representation that transforms tables into database schemas. This structured approach allows LLMs to generate and execute SQL queries, enhancing generalization and mitigating biases. By incorporating adaptive few-shot prompting with contextually tailored examples, our method achieves superior robustness, scalability, and performance. Experimental results consistently highlight improvements across key challenges, setting a new benchmark for robust temporal reasoning with LLMs.
△ Less
Submitted 6 June, 2025;
originally announced June 2025.
-
Unveiling Audio Deepfake Origins: A Deep Metric learning And Conformer Network Approach With Ensemble Fusion
Authors:
Ajinkya Kulkarni,
Sandipana Dowerah,
Tanel Alumae,
Mathew Magimai. -Doss
Abstract:
Audio deepfakes are acquiring an unprecedented level of realism with advanced AI. While current research focuses on discerning real speech from spoofed speech, tracing the source system is equally crucial. This work proposes a novel audio source tracing system combining deep metric multi-class N-pair loss with Real Emphasis and Fake Dispersion framework, a Conformer classification network, and ens…
▽ More
Audio deepfakes are acquiring an unprecedented level of realism with advanced AI. While current research focuses on discerning real speech from spoofed speech, tracing the source system is equally crucial. This work proposes a novel audio source tracing system combining deep metric multi-class N-pair loss with Real Emphasis and Fake Dispersion framework, a Conformer classification network, and ensemble score-embedding fusion. The N-pair loss improves discriminative ability, while Real Emphasis and Fake Dispersion enhance robustness by focusing on differentiating real and fake speech patterns. The Conformer network captures both global and local dependencies in the audio signal, crucial for source tracing. The proposed ensemble score-embedding fusion shows an optimal trade-off between in-domain and out-of-domain source tracing scenarios. We evaluate our method using Frechet Distance and standard metrics, demonstrating superior performance in source tracing over the baseline system.
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
Chandomitra: Towards Generating Structured Sanskrit Poetry from Natural Language Inputs
Authors:
Manoj Balaji Jagadeeshan,
Samarth Bhatia,
Pretam Ray,
Harshul Raj Surana,
Akhil Rajeev P,
Priya Mishra,
Annarao Kulkarni,
Ganesh Ramakrishnan,
Prathosh AP,
Pawan Goyal
Abstract:
Text Generation has achieved remarkable performance using large language models. It has also been recently well-studied that these large language models are capable of creative generation tasks but prominently for high-resource languages. This prompts a fundamental question: Is there a way to utilize these (large) language models for structured poetry generation in a low-resource language, such as…
▽ More
Text Generation has achieved remarkable performance using large language models. It has also been recently well-studied that these large language models are capable of creative generation tasks but prominently for high-resource languages. This prompts a fundamental question: Is there a way to utilize these (large) language models for structured poetry generation in a low-resource language, such as Sanskrit? We present Chandomitra, an English input to structured Sanskrit Poetry translation dataset, specifically adhering to the Anushtubh meter. We benchmark various open and closed models, and scrutinize specialized techniques such as constrained decoding and instruction fine-tuning, for the proposed task. Our constrained decoding methodology achieves 99.86% syntactic accuracy in generating metrically valid Sanskrit poetry, outperforming GPT-4o (1-shot: 31.24%). Our best-performing instruction-tuned model, on the other hand, performs better in semantic coherence with the English input, at the expense of slightly lower syntactic accuracy. Human evaluation further reveals that instruction fine-tuned model is better able to capture the poetic aspects. Data and Code are available.
△ Less
Submitted 16 January, 2026; v1 submitted 31 May, 2025;
originally announced June 2025.