arXiv:2604.12374 [pdf, ps, other]

Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Authors: NVIDIA, :, Aakshita Chandiramani, Aaron Blakeman, Abdullahi Olaoye, Abhibha Gupta, Abhilash Somasamudramath, Abhinav Khattar, Adeola Adesoba, Adi Renduchintala, Adil Asif, Aditya Agrawal, Aditya Vavre, Ahmad Kiswani, Aishwarya Padmakumar, Ajay Hotchandani, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Aleksandr Shaposhnikov, Alex Gronskiy, Alex Kondratenko, Alex Neefus, Alex Steiner, Alex Yang , et al. (522 additional authors not shown)

Abstract: We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter, a… ▽ More We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter, and 3) include MTP layers for inference acceleration through native speculative decoding. We pre-trained Nemotron 3 Super on 25 trillion tokens followed by post-training using supervised fine tuning (SFT) and reinforcement learning (RL). The final model supports up to 1M context length and achieves comparable accuracy on common benchmarks, while also achieving up to 2.2x and 7.5x higher inference throughput compared to GPT-OSS-120B and Qwen3.5-122B, respectively. Nemotron 3 Super datasets, along with the base, post-trained, and quantized checkpoints, are open-sourced on HuggingFace. △ Less

Submitted 14 April, 2026; originally announced April 2026.

arXiv:2604.06170 [pdf, ps, other]

Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

Authors: Komal Kumar, Aman Chadha, Salman Khan, Fahad Shahbaz Khan, Hisham Cholakkal

Abstract: The rapid growth of scientific literature has made it increasingly difficult for researchers to efficiently discover, evaluate, and synthesize relevant work. Recent advances in multi-agent large language models (LLMs) have demonstrated strong potential for understanding user intent and are being trained to utilize various tools. In this paper, we introduce Paper Circle, a multi-agent research disc… ▽ More The rapid growth of scientific literature has made it increasingly difficult for researchers to efficiently discover, evaluate, and synthesize relevant work. Recent advances in multi-agent large language models (LLMs) have demonstrated strong potential for understanding user intent and are being trained to utilize various tools. In this paper, we introduce Paper Circle, a multi-agent research discovery and analysis system designed to reduce the effort required to find, assess, organize, and understand academic literature. The system comprises two complementary pipelines: (1) a Discovery Pipeline that integrates offline and online retrieval from multiple sources, multi-criteria scoring, diversity-aware ranking, and structured outputs; and (2) an Analysis Pipeline that transforms individual papers into structured knowledge graphs with typed nodes such as concepts, methods, experiments, and figures, enabling graph-aware question answering and coverage verification. Both pipelines are implemented within a coder LLM-based multi-agent orchestration framework and produce fully reproducible, synchronized outputs including JSON, CSV, BibTeX, Markdown, and HTML at each agent step. This paper describes the system architecture, agent roles, retrieval and scoring methods, knowledge graph schema, and evaluation interfaces that together form the Paper Circle research workflow. We benchmark Paper Circle on both paper retrieval and paper review generation, reporting hit rate, MRR, and Recall at K. Results show consistent improvements with stronger agent models. We have publicly released the website at https://papercircle.vercel.app/ and the code at https://github.com/MAXNORM8650/papercircle. △ Less

Submitted 7 April, 2026; originally announced April 2026.

Comments: 19 pages, 7 figures, 8 tables, ACL main (Oral)

arXiv:2604.03449 [pdf, ps, other]

Neural Operators for Multi-Task Control and Adaptation

Authors: David Sewell, Xingjian Li, Stepan Tretiakov, Krishna Kumar, David Fridovich-Keil

Abstract: Neural operator methods have emerged as powerful tools for learning mappings between infinite-dimensional function spaces, yet their potential in optimal control remains largely unexplored. We focus on multi-task control problems, whose solution is a mapping from task description (e.g., cost or dynamics functions) to optimal control law (e.g., feedback policy). We approximate these solution operat… ▽ More Neural operator methods have emerged as powerful tools for learning mappings between infinite-dimensional function spaces, yet their potential in optimal control remains largely unexplored. We focus on multi-task control problems, whose solution is a mapping from task description (e.g., cost or dynamics functions) to optimal control law (e.g., feedback policy). We approximate these solution operators using a permutation-invariant neural operator architecture. Across a range of parametric optimal control environments and a locomotion benchmark, a single operator trained via behavioral cloning accurately approximates the solution operator and generalizes to unseen tasks, out-of-distribution settings, and varying amounts of task observations. We further show that the branch-trunk structure of our neural operator architecture enables efficient and flexible adaptation to new tasks. We develop structured adaptation strategies ranging from lightweight updates to full-network fine-tuning, achieving strong performance across different data and compute settings. Finally, we introduce meta-trained operator variants that optimize the initialization for few-shot adaptation. These methods enable rapid task adaptation with limited data and consistently outperform a popular meta-learning baseline. Together, our results demonstrate that neural operators provide a unified and efficient framework for multi-task control and adaptation. △ Less

Submitted 3 April, 2026; originally announced April 2026.

Comments: 25 pages, 10 figures, 2 tables

arXiv:2604.03231 [pdf, ps, other]

CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

Authors: Ankan Deria, Komal Kumar, Xilin He, Imran Razzak, Hisham Cholakkal, Fahad Shahbaz Khan, Salman Khan

Abstract: Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we i… ▽ More Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL: Complementary Multi-Encoder Vision-Language, a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion by (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention to align heterogeneous token grids and produce compact fused visual tokens. The fused tokens can be injected into a decoder-only LLM with minimal changes to standard VLM pipelines. Extensive experiments across diverse vision-language benchmarks demonstrate that CoME-VL consistently outperforms single-encoder baselines. In particular, we observe an average improvement of 4.9% on visual understanding tasks and 5.4% on grounding tasks. Our method achieves state-of-the-art performance on RefCOCO for detection while improving over the baseline by a large margin. Finally, we conduct ablation studies on layer merging, non-redundant feature mixing, and fusion capacity to evaluate how complementary contrastive and self-supervised signals affect VLM performance. △ Less

Submitted 3 April, 2026; originally announced April 2026.

Comments: 16 pages, 10 figures, 5 tables

arXiv:2604.02276 [pdf, ps, other]

De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules

Authors: Keerat Guliani, Deepkamal Gill, David Landsman, Nima Eshraghi, Krishna Kumar, Lovedeep Gondara

Abstract: Regulatory documents encode legally binding obligations that LLM-based systems must respect. Yet converting dense, hierarchically structured legal text into machine-readable rules remains a costly, expert-intensive process. We present De Jure, a fully automated, domain-agnostic pipeline for extracting structured regulatory rules from raw documents, requiring no human annotation, domain-specific pr… ▽ More Regulatory documents encode legally binding obligations that LLM-based systems must respect. Yet converting dense, hierarchically structured legal text into machine-readable rules remains a costly, expert-intensive process. We present De Jure, a fully automated, domain-agnostic pipeline for extracting structured regulatory rules from raw documents, requiring no human annotation, domain-specific prompting, or annotated gold data. De Jure operates through four sequential stages: normalization of source documents into structured Markdown; LLM-driven semantic decomposition into structured rule units; multi-criteria LLM-as-a-judge evaluation across 19 dimensions spanning metadata, definitions, and rule semantics; and iterative repair of low-scoring extractions within a bounded regeneration budget, where upstream components are repaired before rule units are evaluated. We evaluate De Jure across four models on three regulatory corpora spanning finance, healthcare, and AI governance. On the finance domain, De Jure yields consistent and monotonic improvement in extraction quality, reaching peak performance within three judge-guided iterations. De Jure generalizes effectively to healthcare and AI governance, maintaining high performance across both open- and closed-source models. In a downstream compliance question-answering evaluation via RAG, responses grounded in De Jure extracted rules are preferred over prior work in 73.8% of cases at single-rule retrieval depth, rising to 84.0% under broader retrieval, confirming that extraction fidelity translates directly into downstream utility. These results demonstrate that explicit, interpretable evaluation criteria can substitute for human annotation in complex regulatory domains, offering a scalable and auditable path toward regulation-grounded LLM alignment. △ Less

Submitted 2 April, 2026; originally announced April 2026.

arXiv:2603.17175 [pdf, ps, other]

Domain-informed explainable boosting machines for trustworthy lateral spread predictions

Authors: Cheng-Hsi Hsiao, Krishna Kumar, Ellen M. Rathje

Abstract: Explainable Boosting Machines (EBMs) provide transparent predictions through additive shape functions, enabling direct inspection of feature contributions. However, EBMs can learn non-physical relationships that reduce their reliability in natural hazard applications. This study presents a domain-informed framework to improve the physical consistency of EBMs for lateral spreading prediction. Our a… ▽ More Explainable Boosting Machines (EBMs) provide transparent predictions through additive shape functions, enabling direct inspection of feature contributions. However, EBMs can learn non-physical relationships that reduce their reliability in natural hazard applications. This study presents a domain-informed framework to improve the physical consistency of EBMs for lateral spreading prediction. Our approach modifies learned shape functions based on domain knowledge. These modifications correct non-physical behavior while maintaining data-driven patterns. We apply the method to the 2011 Christchurch earthquake dataset and correct non-physical trends observed in the original EBM. The resulting model produces more physically consistent global and local explanations, with an acceptable tradeoff in accuracy (4--5\%). △ Less

Submitted 17 March, 2026; originally announced March 2026.

Comments: 33 pages, 16 figures

arXiv:2603.16983 [pdf, ps, other]

Formal verification of tree-based machine learning models for lateral spreading

Authors: Krishna Kumar

Abstract: Machine learning models for geotechnical hazard prediction can achieve high accuracy while learning physically inconsistent relationships from sparse or biased training data. Current remedies (post-hoc explainability, such as SHAP and LIME, and training-time constraints) either diagnose individual predictions approximately or restrict model capacity without providing exhaustive guarantees. This pa… ▽ More Machine learning models for geotechnical hazard prediction can achieve high accuracy while learning physically inconsistent relationships from sparse or biased training data. Current remedies (post-hoc explainability, such as SHAP and LIME, and training-time constraints) either diagnose individual predictions approximately or restrict model capacity without providing exhaustive guarantees. This paper encodes trained tree ensembles as logical formulas in a Satisfiability Modulo Theories (SMT) solver and checks physical specifications across the entire input domain, not just sampled points. Four geotechnical specifications (water table depth, PGA monotonicity, distance safety, and flat-ground safety) are formalized as decidable logical formulas and verified via SMT against both XGBoost ensembles and Explainable Boosting Machines (EBMs) trained on the 2011 Christchurch earthquake lateral spreading dataset (7,291 sites, four features). The SMT solver either produces a concrete counterexample where a specification fails or proves that no violation exists. The unconstrained EBM (80.1% accuracy) violates all four specifications. A fully constrained EBM (67.2%) satisfies three of four specifications, demonstrating that iterative constraint application guided by verification can progressively improve physical consistency. A Pareto analysis of 33 model variants reveals a persistent trade-off, as none of the variants studied achieve both greater than 80% accuracy and full compliance with the specified set. SHAP analysis of specification counterexamples shows that the offending feature can rank last, demonstrating that post-hoc explanations do not substitute for formal verification. These results establish a verify-fix-verify engineering loop and a formal certification for deploying physically consistent ML models in safety-critical geotechnical applications. △ Less

Submitted 17 March, 2026; originally announced March 2026.

ACM Class: I.2.6; D.2.4; J.2

arXiv:2603.16839 [pdf, ps, other]

Learning to Present: Inverse Specification Rewards for Agentic Slide Generation

Authors: Karthik Ragunath Ananda Kumar, Subrahmanyam Arunachalam

Abstract: Automated presentation generation remains a challenging task requiring coherent content creation, visual design, and audience-aware communication. This work proposes an OpenEnv-compatible reinforcement learning environment where LLM agents learn to research topics, plan content, and generate professional HTML slide presentations through tool use. We introduce a multi-component reward system combin… ▽ More Automated presentation generation remains a challenging task requiring coherent content creation, visual design, and audience-aware communication. This work proposes an OpenEnv-compatible reinforcement learning environment where LLM agents learn to research topics, plan content, and generate professional HTML slide presentations through tool use. We introduce a multi-component reward system combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics, and an inverse specification reward that measures how faithfully generated slides convey their intended purpose. The inverse specification reward, an "inverse task" where an LLM attempts to recover the original specification from generated slides, provides a holistic quality signal. Our approach fine-tunes Qwen2.5-Coder-7B via GRPO, training only 0.5% of parameters on prompts derived from expert demonstrations collected using Claude Opus 4.6. Experiments on 48 diverse business briefs across six models demonstrate that our fine-tuned 7B model achieves 91.2% of Claude Opus 4.6's quality while improving 33.1% over the base model. The six-model comparison reveals that instruction adherence and tool-use compliance, rather than raw parameter count, determine agentic task performance. We contribute SlideRL, an open-source dataset of 288 multi-turn rollout trajectories across all six models: https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts Code: https://github.com/pushing-the-frontier/slide-forge-llm △ Less

Submitted 17 March, 2026; originally announced March 2026.

Comments: 12 pages, 11 figures, 13 tables, 26 references. Code: https://github.com/pushing-the-frontier/slide-forge-llm Dataset: https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts

arXiv:2603.08468 [pdf, ps, other]

Integrating Lagrangian Neural Networks into the Dyna Framework for Reinforcement Learning

Authors: Shreya Das, Kundan Kumar, Muhammad Iqbal, Outi Savolainen, Dominik Baumann, Laura Ruotsalainen, Simo Särkkä

Abstract: Model-based reinforcement learning (MBRL) is sample-efficient but depends on the accuracy of the learned dynamics, which are often modeled using black-box methods that do not adhere to physical laws. Those methods tend to produce inaccurate predictions when presented with data that differ from the original training set. In this work, we employ Lagrangian neural networks (LNNs), which enforce an un… ▽ More Model-based reinforcement learning (MBRL) is sample-efficient but depends on the accuracy of the learned dynamics, which are often modeled using black-box methods that do not adhere to physical laws. Those methods tend to produce inaccurate predictions when presented with data that differ from the original training set. In this work, we employ Lagrangian neural networks (LNNs), which enforce an underlying Lagrangian structure to train the model within a Dyna-based MBRL framework. Furthermore, we train the LNN using stochastic gradient-based and state-estimation-based optimizers to learn the network's weights. The state-estimation-based method converges faster than the stochastic gradient-based method during neural network training. Simulation results are provided to illustrate the effectiveness of the proposed LNN-based Dyna framework for MBRL. △ Less

Submitted 9 March, 2026; originally announced March 2026.

Comments: 5 pages, 3 figures

arXiv:2602.19446 [pdf, ps, other]

doi 10.1145/3744916.3787766

"Write in English, Nobody Understands Your Language Here": A Study of Non-English Trends in Open-Source Repositories

Authors: Masudul Hasan Masud Bhuiyan, Manish Kumar Bala Kumar, Cristian-Alexandru Staicu

Abstract: The open-source software (OSS) community has historically been dominated by English as the primary language for code, documentation, and developer interactions. However, with growing global participation and better support for non-Latin scripts through standards like Unicode, OSS is gradually becoming more multilingual. This study investigates the extent to which OSS is becoming more multilingual,… ▽ More The open-source software (OSS) community has historically been dominated by English as the primary language for code, documentation, and developer interactions. However, with growing global participation and better support for non-Latin scripts through standards like Unicode, OSS is gradually becoming more multilingual. This study investigates the extent to which OSS is becoming more multilingual, analyzing 9.14 billion GitHub issues, pull requests, and discussions, and 62,500 repositories across five programming languages and 30 natural languages, covering the period from 2015 to 2025. We examine six research questions to track changes in language use across communication, code, and documentation. We find that multilingual participation has steadily increased, especially in Korean, Chinese, and Russian. This growth appears not only in issues and discussions but also in code comments, string literals, and documentation files. While this shift reflects greater inclusivity and language diversity in OSS, it also creates language tension. The ability to express oneself in a native language can clash with shared norms around English use, especially in collaborative settings. Non-English or multilingual projects tend to receive less visibility and participation, suggesting that language remains both a resource and a barrier, shaping who gets heard, who contributes, and how open collaboration unfolds. △ Less

Submitted 22 February, 2026; originally announced February 2026.

arXiv:2602.16935 [pdf, ps, other]

DeepContext: Stateful Real-Time Detection of Multi-Turn Adversarial Intent Drift in LLMs

Authors: Justin Albrethsen, Yash Datta, Kunal Kumar, Sharath Rajasekar

Abstract: While Large Language Model (LLM) capabilities have scaled, safety guardrails remain largely stateless, treating multi-turn dialogues as a series of disconnected events. This lack of temporal awareness facilitates a "Safety Gap" where adversarial tactics, like Crescendo and ActorAttack, slowly bleed malicious intent across turn boundaries to bypass stateless filters. We introduce DeepContext, a sta… ▽ More While Large Language Model (LLM) capabilities have scaled, safety guardrails remain largely stateless, treating multi-turn dialogues as a series of disconnected events. This lack of temporal awareness facilitates a "Safety Gap" where adversarial tactics, like Crescendo and ActorAttack, slowly bleed malicious intent across turn boundaries to bypass stateless filters. We introduce DeepContext, a stateful monitoring framework designed to map the temporal trajectory of user intent. DeepContext discards the isolated evaluation model in favor of a Recurrent Neural Network (RNN) architecture that ingests a sequence of fine-tuned turn-level embeddings. By propagating a hidden state across the conversation, DeepContext captures the incremental accumulation of risk that stateless models overlook. Our evaluation demonstrates that DeepContext significantly outperforms existing baselines in multi-turn jailbreak detection, achieving a state-of-the-art F1 score of 0.84, which represents a substantial improvement over both hyperscaler cloud-provider guardrails and leading open-weight models such as Llama-Prompt-Guard-2 (0.67) and Granite-Guardian (0.67). Furthermore, DeepContext maintains a sub-20ms inference overhead on a T4 GPU, ensuring viability for real-time applications. These results suggest that modeling the sequential evolution of intent is a more effective and computationally efficient alternative to deploying massive, stateless models. △ Less

Submitted 18 February, 2026; originally announced February 2026.

Comments: 18 Pages, 7 Tables, 1 Figure

ACM Class: F.2.2; I.2.7

arXiv:2602.15866 [pdf, ps, other]

NLP Privacy Risk Identification in Social Media (NLP-PRISM): A Survey

Authors: Dhiman Goswami, Jai Kruthunz Naveen Kumar, Sanchari Das

Abstract: Natural Language Processing (NLP) is integral to social media analytics but often processes content containing Personally Identifiable Information (PII), behavioral cues, and metadata raising privacy risks such as surveillance, profiling, and targeted advertising. To systematically assess these risks, we review 203 peer-reviewed papers and propose the NLP Privacy Risk Identification in Social Medi… ▽ More Natural Language Processing (NLP) is integral to social media analytics but often processes content containing Personally Identifiable Information (PII), behavioral cues, and metadata raising privacy risks such as surveillance, profiling, and targeted advertising. To systematically assess these risks, we review 203 peer-reviewed papers and propose the NLP Privacy Risk Identification in Social Media (NLP-PRISM) framework, which evaluates vulnerabilities across six dimensions: data collection, preprocessing, visibility, fairness, computational risk, and regulatory compliance. Our analysis shows that transformer models achieve F1-scores ranging from 0.58-0.84, but incur a 1% - 23% drop under privacy-preserving fine-tuning. Using NLP-PRISM, we examine privacy coverage in six NLP tasks: sentiment analysis (16), emotion detection (14), offensive language identification (19), code-mixed processing (39), native language identification (29), and dialect detection (24) revealing substantial gaps in privacy research. We further found a (reduced by 2% - 9%) trade-off in model utility, MIA AUC (membership inference attacks) 0.81, AIA accuracy 0.75 (attribute inference attacks). Finally, we advocate for stronger anonymization, privacy-aware learning, and fairness-driven training to enable ethical NLP in social media contexts. △ Less

Submitted 26 January, 2026; originally announced February 2026.

Journal ref: In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL) 2026

arXiv:2602.11232 [pdf, ps, other]

Yaksha-Prashna: Understanding eBPF Bytecode Network Function Behavior

Authors: Animesh Singh, K Shiv Kumar, S. VenkataKeerthy, Pragna Mamidipaka, R V B R N Aaseesh, Sayandeep Sen, Palanivel Kodeswaran, Theophilus A. Benson, Ramakrishna Upadrasta, Praveen Tammana

Abstract: Many cloud infrastructure organizations increasingly rely on third-party eBPF-based network functions for use cases like security, observability, and load balancing, so that not everyone requires a team of highly skilled eBPF experts. However, the network functions from third parties (e.g., F5, Palo Alto) are available in bytecode format to cloud operators, giving little or no understanding of the… ▽ More Many cloud infrastructure organizations increasingly rely on third-party eBPF-based network functions for use cases like security, observability, and load balancing, so that not everyone requires a team of highly skilled eBPF experts. However, the network functions from third parties (e.g., F5, Palo Alto) are available in bytecode format to cloud operators, giving little or no understanding of their functional correctness and interaction with other network functions in a chain. Also, eBPF developers want to provide proof of functional correctness for their developed network functions without disclosing the source code to the operators. We design Yaksha-Prashna, a system that allows operators/developers to assert and query bytecode's conformance to its specification and dependencies on other bytecodes. Our work builds domain-specific models that enable us to employ scalable program analysis to extract and model eBPF programs. Using Yaksha-Prashna language, we express 24 properties on standard and non-standard eBPF-based network functions with 200-1000x speedup over the state-of-the-art work. △ Less

Submitted 11 February, 2026; originally announced February 2026.

arXiv:2602.06965 [pdf, ps, other]

MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images

Authors: Ankan Deria, Komal Kumar, Adinath Madhavrao Dukre, Eran Segal, Salman Khan, Imran Razzak

Abstract: Multimodal large language models have advanced rapidly, but their adoption in medicine is constrained by limited domain coverage, imperfect modality alignment, and insufficient grounded reasoning. We introduce MedMO, a medical multimodal foundation model built on a general MLLM architecture and trained exclusively on large-scale domain-specific data. MedMO uses a multi-stage training recipe that i… ▽ More Multimodal large language models have advanced rapidly, but their adoption in medicine is constrained by limited domain coverage, imperfect modality alignment, and insufficient grounded reasoning. We introduce MedMO, a medical multimodal foundation model built on a general MLLM architecture and trained exclusively on large-scale domain-specific data. MedMO uses a multi-stage training recipe that includes cross-modal pretraining to align heterogeneous visual encoders with a medical language backbone, instruction tuning with multi-task supervision spanning captioning, VQA, report generation, retrieval, and bounding-box disease localization, and reinforcement learning with verifiable rewards that combine factuality checks with a box-level GIoU signal to improve spatial grounding and step-by-step reasoning in challenging clinical settings. Across modalities and tasks, MedMO surpasses strong open-source medical baselines. MedMO-8B-Next achieves consistent gains on VQA benchmarks, improving by 6.6% on average over Fleming-VL-8B, including gains of 6.0% on MMMU-Med, 9.8% on PMC-VQA, and 21.3% on MedXpertQA. On text-based QA, it improves by 14.4% over Fleming-VL-8B, driven by gains of 8.4% on MMLU-Med and 30.1% on MedQA. For medical report generation, it improves by 6.7% on MIMIC-CXR. MedMO-8B-Next also demonstrates strong grounding performance, reaching 56.1 IoU on Bacteria, which is a 47.8 IoU gain over Fleming-VL-8B. At smaller scale, MedMO-4B-Next remains competitive and exceeds Fleming-VL-8B across VQA, QA, and report generation. Evaluations spanning radiology, ophthalmology, and pathology microscopy further confirm broad cross-modality generalization. Project is available at https://genmilab.github.io/MedMO-Page △ Less

Submitted 11 March, 2026; v1 submitted 6 February, 2026; originally announced February 2026.

Comments: 21 pages, 6 figures and 4 tables

arXiv:2601.15118 [pdf, ps, other]

WavLink: Compact Audio-Text Embeddings with a Global Whisper Token

Authors: Gokul Karthik Kumar, Ludovick Lepauloux, Hakim Hacid

Abstract: Whisper has become the de-facto encoder for extracting general-purpose audio features in large audio-language models, where a 30-second clip is typically represented by 1500 frame features projected into an LLM. In contrast, audio-text embedding models like CLAP-based models have largely relied on alternative audio encoders (e.g., HTS-AT, PaSST), and have not leveraged Whisper effectively. We pres… ▽ More Whisper has become the de-facto encoder for extracting general-purpose audio features in large audio-language models, where a 30-second clip is typically represented by 1500 frame features projected into an LLM. In contrast, audio-text embedding models like CLAP-based models have largely relied on alternative audio encoders (e.g., HTS-AT, PaSST), and have not leveraged Whisper effectively. We present WavLink, a compact audio-text embedding model that augments Whisper encoder with a learnable global token, trained jointly with a text encoder. Through a systematic study of design choices, including pretrained text encoders, loss functions, training modes, and data mixtures, we identify configurations that yield state-of-the-art retrieval performance. Our two-stage training recipe across three model sizes, combined with Matryoshka-style supervision, improves scalability, enabling 8x smaller embeddings with minimal performance drop. WavLink also demonstrates competitive performance on AIR-Bench with MCQs and zero-shot classification. △ Less

Submitted 22 January, 2026; v1 submitted 21 January, 2026; originally announced January 2026.

Comments: Accepted at ICASSP 2026

arXiv:2601.11801 [pdf, ps, other]

RobotDesignGPT: Automated Robot Design Synthesis using Vision Language Models

Authors: Nitish Sontakke, K. Niranjan Kumar, Sehoon Ha

Abstract: Robot design is a nontrivial process that involves careful consideration of multiple criteria, including user specifications, kinematic structures, and visual appearance. Therefore, the design process often relies heavily on domain expertise and significant human effort. The majority of current methods are rule-based, requiring the specification of a grammar or a set of primitive components and mo… ▽ More Robot design is a nontrivial process that involves careful consideration of multiple criteria, including user specifications, kinematic structures, and visual appearance. Therefore, the design process often relies heavily on domain expertise and significant human effort. The majority of current methods are rule-based, requiring the specification of a grammar or a set of primitive components and modules that can be composed to create a design. We propose a novel automated robot design framework, RobotDesignGPT, that leverages the general knowledge and reasoning capabilities of large pre-trained vision-language models to automate the robot design synthesis process. Our framework synthesizes an initial robot design from a simple user prompt and a reference image. Our novel visual feedback approach allows us to greatly improve the design quality and reduce unnecessary manual feedback. We demonstrate that our framework can design visually appealing and kinematically valid robots inspired by nature, ranging from legged animals to flying creatures. We justify the proposed framework by conducting an ablation study and a user study. △ Less

Submitted 16 January, 2026; originally announced January 2026.

arXiv:2512.24365 [pdf, ps, other]

Deep Learning in Geotechnical Engineering: A Critical Assessment of PINNs and Operator Learning

Authors: Krishna Kumar

Abstract: Deep learning methods -- physics-informed neural networks (PINNs), deep operator networks (DeepONet), and graph network simulators (GNS) -- are increasingly proposed for geotechnical problems. This paper tests these methods against traditional solvers on canonical problems: wave propagation and beam-foundation interaction. PINNs run 90,000 times slower than finite difference with larger errors. De… ▽ More Deep learning methods -- physics-informed neural networks (PINNs), deep operator networks (DeepONet), and graph network simulators (GNS) -- are increasingly proposed for geotechnical problems. This paper tests these methods against traditional solvers on canonical problems: wave propagation and beam-foundation interaction. PINNs run 90,000 times slower than finite difference with larger errors. DeepONet requires thousands of training simulations and breaks even only after millions of evaluations. Multi-layer perceptrons fail catastrophically when extrapolating beyond training data -- the common case in geotechnical prediction. GNS shows promise for geometry-agnostic simulation but faces scaling limits and cannot capture path-dependent soil behavior. For inverse problems, automatic differentiation through traditional solvers recovers material parameters with sub-percent accuracy in seconds. We recommend: use automatic differentiation for inverse problems; apply site-based cross-validation to account for spatial autocorrelation; reserve neural networks for problems where traditional solvers are genuinely expensive and predictions remain within the training envelope. When a method is four orders of magnitude slower with less accuracy, it is not a viable replacement for proven solvers. △ Less

Submitted 30 December, 2025; originally announced December 2025.

arXiv:2512.18120 [pdf, ps, other]

Learning Generalizable Neural Operators for Inverse Problems

Authors: Adam J. Thorpe, Stepan Tretiakov, Dibakar Roy Sarkar, Krishna Kumar, Ufuk Topcu

Abstract: Inverse problems challenge existing neural operator architectures because ill-posed inverse maps violate continuity, uniqueness, and stability assumptions. We introduce B2B${}^{-1}$, an inverse basis-to-basis neural operator framework that addresses this limitation. Our key innovation is to decouple function representation from the inverse map. We learn neural basis functions for the input and out… ▽ More Inverse problems challenge existing neural operator architectures because ill-posed inverse maps violate continuity, uniqueness, and stability assumptions. We introduce B2B${}^{-1}$, an inverse basis-to-basis neural operator framework that addresses this limitation. Our key innovation is to decouple function representation from the inverse map. We learn neural basis functions for the input and output spaces, then train inverse models that operate on the resulting coefficient space. This structure allows us to learn deterministic, invertible, and probabilistic models within a single framework, and to choose models based on the degree of ill-posedness. We evaluate our approach on six inverse PDE benchmarks, including two novel datasets, and compare against existing invertible neural operator baselines. We learn probabilistic models that capture uncertainty and input variability, and remain robust to measurement noise due to implicit denoising in the coefficient calculation. Our results show consistent re-simulation performance across varying levels of ill-posedness. By separating representation from inversion, our framework enables scalable surrogate models for inverse problems that generalize across instances, domains, and degrees of ill-posedness. △ Less

Submitted 19 December, 2025; originally announced December 2025.

arXiv:2512.15945 [pdf, ps, other]

Privacy Discourse and Emotional Dynamics in Mental Health Information Interaction on Reddit

Authors: Jai Kruthunz Naveen Kumar, Aishwarya Umeshkumar Surani, Harkirat Singh, Sanchari Das

Abstract: Reddit is a major venue for mental-health information interaction and peer support, where privacy concerns increasingly surface in user discourse. Thus, we analyze privacy-related discussions across 14 mental-health and regulatory subreddits, comprising 10,119 posts and 65,385 comments collected with a custom web scraper. Using lexicon-based sentiment analysis, we quantify emotional alignment betw… ▽ More Reddit is a major venue for mental-health information interaction and peer support, where privacy concerns increasingly surface in user discourse. Thus, we analyze privacy-related discussions across 14 mental-health and regulatory subreddits, comprising 10,119 posts and 65,385 comments collected with a custom web scraper. Using lexicon-based sentiment analysis, we quantify emotional alignment between communities via cosine similarity of sentiment distributions, observing high similarity for Bipolar and ADHD (0.877), Anxiety and Depression (0.849), and MentalHealthSupport and MentalIllness (0.989) subreddits. We also construct keyword dictionaries to tag privacy-related themes (e.g., HIPAA, GDPR) and perform temporal analysis from 2020 to 2025, finding a 50% increase in privacy discourse with intermittent regulatory spikes. A chi-square test of independence across subreddit domains indicates significant distributional differences. The results characterize how privacy-oriented discussion co-varies with user sentiment in online mental-health communities. △ Less

Submitted 17 December, 2025; originally announced December 2025.

arXiv:2512.10078 [pdf, ps, other]

Empirical Hardness in Multi-Agent Pathfinding: Research Challenges and Opportunities

Authors: Jingyao Ren, Eric Ewing, T. K. Satish Kumar, Sven Koenig, Nora Ayanian

Abstract: Multi-agent pathfinding (MAPF) is the problem of finding collision-free paths for a team of agents on a map. Although MAPF is NP-hard, the hardness of solving individual instances varies significantly, revealing a gap between theoretical complexity and actual hardness. This paper outlines three key research challenges in MAPF empirical hardness to understand such phenomena. The first challenge, kn… ▽ More Multi-agent pathfinding (MAPF) is the problem of finding collision-free paths for a team of agents on a map. Although MAPF is NP-hard, the hardness of solving individual instances varies significantly, revealing a gap between theoretical complexity and actual hardness. This paper outlines three key research challenges in MAPF empirical hardness to understand such phenomena. The first challenge, known as algorithm selection, is determining the best-performing algorithms for a given instance. The second challenge is understanding the key instance features that affect MAPF empirical hardness, such as structural properties like phase transition and backbone/backdoor. The third challenge is how to leverage our knowledge of MAPF empirical hardness to effectively generate hard MAPF instances or diverse benchmark datasets. This work establishes a foundation for future empirical hardness research and encourages deeper investigation into these promising and underexplored areas. △ Less

Submitted 10 December, 2025; originally announced December 2025.

Comments: Published in AAMAS-25

Journal ref: Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems, 2025, Pages 2885-2889

arXiv:2512.08545 [pdf]

Curriculum Guided Massive Multi Agent System Solving For Robust Long Horizon Tasks

Authors: Indrajit Kar, Kalathur Chenchu Kishore Kumar

Abstract: Large Language Models and multi-agent systems have shown promise in decomposing complex tasks, yet they struggle with long-horizon reasoning tasks and escalating computation cost. This work introduces a hierarchical multi-agent architecture that distributes reasoning across a 64*64 grid of lightweight agents, supported by a selective oracle. A spatial curriculum progressively expands the operation… ▽ More Large Language Models and multi-agent systems have shown promise in decomposing complex tasks, yet they struggle with long-horizon reasoning tasks and escalating computation cost. This work introduces a hierarchical multi-agent architecture that distributes reasoning across a 64*64 grid of lightweight agents, supported by a selective oracle. A spatial curriculum progressively expands the operational region of the grid, ensuring that agents master easier central tasks before tackling harder peripheral ones. To improve reliability, the system integrates Negative Log-Likelihood as a measure of confidence, allowing the curriculum to prioritize regions where agents are both accurate and well calibrated. A Thompson Sampling curriculum manager adaptively chooses training zones based on competence and NLL-driven reward signals. We evaluate the approach on a spatially grounded Tower of Hanoi benchmark, which mirrors the long-horizon structure of many robotic manipulation and planning tasks. Results demonstrate improved stability, reduced oracle usage, and stronger long-range reasoning from distributed agent cooperation. △ Less

Submitted 9 December, 2025; originally announced December 2025.

Comments: 22 pages, 2 tables, 9 figures

arXiv:2512.01059 [pdf, ps, other]

Parameter Reduction Improves Vision Transformers: A Comparative Study of Sharing and Width Reduction

Authors: Anantha Padmanaban Krishna Kumar

Abstract: Although scaling laws and many empirical results suggest that increasing the size of Vision Transformers often improves performance, model accuracy and training behavior are not always monotonically increasing with scale. Focusing on ViT-B/16 trained on ImageNet-1K, we study two simple parameter-reduction strategies applied to the MLP blocks, each removing 32.7\% of the baseline parameters. Our \e… ▽ More Although scaling laws and many empirical results suggest that increasing the size of Vision Transformers often improves performance, model accuracy and training behavior are not always monotonically increasing with scale. Focusing on ViT-B/16 trained on ImageNet-1K, we study two simple parameter-reduction strategies applied to the MLP blocks, each removing 32.7\% of the baseline parameters. Our \emph{GroupedMLP} variant shares MLP weights between adjacent transformer blocks and achieves 81.47\% top-1 accuracy while maintaining the baseline computational cost. Our \emph{ShallowMLP} variant halves the MLP hidden dimension and reaches 81.25\% top-1 accuracy with a 38\% increase in inference throughput. Both models outperform the 86.6M-parameter baseline (81.05\%) and exhibit substantially improved training stability, reducing peak-to-final accuracy degradation from 0.47\% to the range 0.03\% to 0.06\%. These results suggest that, for ViT-B/16 on ImageNet-1K with a standard training recipe, the model operates in an overparameterized regime in which MLP capacity can be reduced without harming performance and can even slightly improve it. More broadly, our findings suggest that architectural constraints such as parameter sharing and reduced width may act as useful inductive biases, and highlight the importance of how parameters are allocated when designing Vision Transformers. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/parameter-efficient-vit-mlps. △ Less

Submitted 30 November, 2025; originally announced December 2025.

Comments: 7 pages total (6 pages main text, 1 page references), 1 figures, 2 tables. Code available at https://github.com/AnanthaPadmanaban-KrishnaKumar/parameter-efficient-vit-mlps

arXiv:2511.21635 [pdf, ps, other]

Mechanisms of Non-Monotonic Scaling in Vision Transformers

Authors: Anantha Padmanaban Krishna Kumar

Abstract: Deeper Vision Transformers often perform worse than shallower ones, which challenges common scaling assumptions. Through a systematic empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet, we identify a consistent three-phase Cliff-Plateau-Climb pattern that governs how representations evolve with depth. We observe that better performance is associated with progressive marginalization of the [… ▽ More Deeper Vision Transformers often perform worse than shallower ones, which challenges common scaling assumptions. Through a systematic empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet, we identify a consistent three-phase Cliff-Plateau-Climb pattern that governs how representations evolve with depth. We observe that better performance is associated with progressive marginalization of the [CLS] token, originally designed as a global aggregation hub, in favor of distributed consensus among patch tokens. We quantify patterns of information mixing with an Information Scrambling Index, and show that in ViT-L the information-task tradeoff emerges roughly 10 layers later than in ViT-B, and that these additional layers correlate with increased information diffusion rather than improved task performance. Taken together, these results suggest that transformer architectures in this regime may benefit more from carefully calibrated depth that executes clean phase transitions than from simply increasing parameter count. The Information Scrambling Index provides a useful diagnostic for existing models and suggests a potential design target for future architectures. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/Cliff-Plateau-Climb. △ Less

Submitted 26 November, 2025; originally announced November 2025.

Comments: 16 pages total (11 pages main text, 1 pages references, 4 pages appendix), 5 figures, 11 tables. Code available at https://github.com/AnanthaPadmanaban-KrishnaKumar/Cliff-Plateau-Climb

arXiv:2511.21038 [pdf, ps, other]

Semantic Anchors in In-Context Learning: Why Small LLMs Cannot Flip Their Labels

Authors: Anantha Padmanaban Krishna Kumar

Abstract: Can in-context learning (ICL) override pre-trained label semantics, or does it merely refine an existing semantic backbone? We address this question by treating LLMs as prompt-induced classifiers and contrasting their behavior under \emph{natural} demonstrations (with correct labels) and \emph{inverted} demonstrations (systematically flipping label meanings). We decompose ICL behavior into three a… ▽ More Can in-context learning (ICL) override pre-trained label semantics, or does it merely refine an existing semantic backbone? We address this question by treating LLMs as prompt-induced classifiers and contrasting their behavior under \emph{natural} demonstrations (with correct labels) and \emph{inverted} demonstrations (systematically flipping label meanings). We decompose ICL behavior into three alignment metrics (truth, prior, and prompt alignment) and introduce a semantic override rate, defined as correctness under flipped semantics. Across eight classification tasks and eight open-source LLMs (1--12B parameters), we find consistent evidence for a semantic anchor view. With natural demonstrations, ICL improves accuracy while maintaining strong prior alignment; most correct predictions coincide with zero-shot behavior, even when the prior is weak. With inverted demonstrations, models cannot learn coherent anti-semantic classifiers: prompt alignment increases only by sacrificing accuracy, and semantic override rates remain exactly zero in our few-shot 1--12B setting. Rather than flexibly remapping label meanings, ICL primarily adjusts how inputs project onto stable semantic directions learned during pre-training, clarifying fundamental limits of few-shot prompting and suggesting that overriding label semantics at these scales requires interventions beyond ICL. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/semantic-anchors-icl. △ Less

Submitted 25 November, 2025; originally announced November 2025.

Comments: 13 pages total (7 pages main text, 3 pages references, 3 pages appendix), 2 figures, 14 tables. Code available at https://github.com/AnanthaPadmanaban-KrishnaKumar/semantic-anchors-icl

arXiv:2511.06143 [pdf, ps, other]

Enhancing Robustness of Graph Neural Networks through p-Laplacian

Authors: Anuj Kumar Sirohi, Subhanu Halder, Kabir Kumar, Sandeep Kumar

Abstract: With the increase of data in day-to-day life, businesses and different stakeholders need to analyze the data for better predictions. Traditionally, relational data has been a source of various insights, but with the increase in computational power and the need to understand deeper relationships between entities, the need to design new techniques has arisen. For this graph data analysis has become… ▽ More With the increase of data in day-to-day life, businesses and different stakeholders need to analyze the data for better predictions. Traditionally, relational data has been a source of various insights, but with the increase in computational power and the need to understand deeper relationships between entities, the need to design new techniques has arisen. For this graph data analysis has become an extraordinary tool for understanding the data, which reveals more realistic and flexible modelling of complex relationships. Recently, Graph Neural Networks (GNNs) have shown great promise in various applications, such as social network analysis, recommendation systems, drug discovery, and more. However, many adversarial attacks can happen over the data, whether during training (poisoning attack) or during testing (evasion attack), which can adversely manipulate the desired outcome from the GNN model. Therefore, it is crucial to make the GNNs robust to such attacks. The existing robustness methods are computationally demanding and perform poorly when the intensity of attack increases. This paper presents a computationally efficient framework, namely, pLAPGNN, based on weighted p-Laplacian for making GNNs robust. Empirical evaluation on real datasets establishes the efficacy and efficiency of the proposed method. △ Less

Submitted 8 November, 2025; originally announced November 2025.

Comments: Accepted at 5th Workshop on Graphs and more Complex Structures For Learning and Reasoning (GCLR), The 40th AAAI Conference on Artificial Intelligence (AAAI-26)

arXiv:2511.05757 [pdf, ps, other]

Zero-Shot Function Encoder-Based Differentiable Predictive Control

Authors: Hassan Iqbal, Xingjian Li, Tyler Ingebrand, Adam Thorpe, Krishna Kumar, Ufuk Topcu, Ján Drgoňa

Abstract: We introduce a differentiable framework for zero-shot adaptive control over parametric families of nonlinear dynamical systems. Our approach integrates a function encoder-based neural ODE (FE-NODE) for modeling system dynamics with a differentiable predictive control (DPC) for offline self-supervised learning of explicit control policies. The FE-NODE captures nonlinear behaviors in state transitio… ▽ More We introduce a differentiable framework for zero-shot adaptive control over parametric families of nonlinear dynamical systems. Our approach integrates a function encoder-based neural ODE (FE-NODE) for modeling system dynamics with a differentiable predictive control (DPC) for offline self-supervised learning of explicit control policies. The FE-NODE captures nonlinear behaviors in state transitions and enables zero-shot adaptation to new systems without retraining, while the DPC efficiently learns control policies across system parameterizations, thus eliminating costly online optimization common in classical model predictive control. We demonstrate the efficiency, accuracy, and online adaptability of the proposed method across a range of nonlinear systems with varying parametric scenarios, highlighting its potential as a general-purpose tool for fast zero-shot adaptive control. △ Less

Submitted 14 April, 2026; v1 submitted 7 November, 2025; originally announced November 2025.

arXiv:2511.05456 [pdf, ps, other]

Parameter-Efficient Conditioning for Material Generalization in Graph-Based Simulators

Authors: Naveen Raj Manoharan, Hassan Iqbal, Krishna Kumar

Abstract: Graph network-based simulators (GNS) have demonstrated strong potential for learning particle-based physics (such as fluids, deformable solids, and granular flows) while generalizing to unseen geometries due to their inherent inductive biases. However, existing models are typically trained for a single material type and fail to generalize across distinct constitutive behaviors, limiting their appl… ▽ More Graph network-based simulators (GNS) have demonstrated strong potential for learning particle-based physics (such as fluids, deformable solids, and granular flows) while generalizing to unseen geometries due to their inherent inductive biases. However, existing models are typically trained for a single material type and fail to generalize across distinct constitutive behaviors, limiting their applicability in real-world engineering settings. Using granular flows as a running example, we propose a parameter-efficient conditioning mechanism that makes the GNS model adaptive to material parameters. We identify that sensitivity to material properties is concentrated in the early message-passing (MP) layers, a finding we link to the local nature of constitutive models (e.g., Mohr-Coulomb) and their effects on information propagation. We empirically validate this by showing that fine-tuning only the first few (1-5) of 10 MP layers of a pretrained model achieves comparable test performance as compared to fine-tuning the entire network. Building on this insight, we propose a parameter-efficient Feature-wise Linear Modulation (FiLM) conditioning mechanism designed to specifically target these early layers. This approach produces accurate long-term rollouts on unseen, interpolated, or moderately extrapolated values (e.g., up to 2.5 degrees for friction angle and 0.25 kPa for cohesion) when trained exclusively on as few as 12 short simulation trajectories from new materials, representing a 5-fold data reduction compared to a baseline multi-task learning method. Finally, we validate the model's utility by applying it to an inverse problem, successfully identifying unknown cohesion parameters from trajectory data. This approach enables the use of GNS in inverse design and closed-loop control tasks where material properties are treated as design variables. △ Less

Submitted 7 November, 2025; originally announced November 2025.

arXiv:2510.24081 [pdf, ps, other]

Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures

Authors: Tyler A. Chang, Catherine Arnett, Abdelrahman Eldesokey, Abdelrahman Sadallah, Abeer Kashar, Abolade Daud, Abosede Grace Olanihun, Adamu Labaran Mohammed, Adeyemi Praise, Adhikarinayum Meerajita Sharma, Aditi Gupta, Afitab Iyigun, Afonso Simplício, Ahmed Essouaied, Aicha Chorana, Akhil Eppa, Akintunde Oladipo, Akshay Ramesh, Aleksei Dorkin, Alfred Malengo Kondoro, Alham Fikri Aji, Ali Eren Çetintaş, Allan Hanbury, Alou Dembele, Alp Niksarli , et al. (313 additional authors not shown)

Abstract: To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by 335 researchers from 65 countries around the world. The 116 language varieties in Global PIQA cover five co… ▽ More To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by 335 researchers from 65 countries around the world. The 116 language varieties in Global PIQA cover five continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements. We find that state-of-the-art LLMs perform well on Global PIQA in aggregate, but they exhibit weaker performance in lower-resource languages (up to a 37% accuracy gap, despite random chance at 50%). Open models generally perform worse than proprietary models. Global PIQA highlights that in many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge. Beyond its uses for LLM evaluation, we hope that Global PIQA provides a glimpse into the wide diversity of cultures in which human language is embedded. △ Less

Submitted 28 October, 2025; originally announced October 2025.

Comments: Preprint

arXiv:2510.09904 [pdf, ps, other]

Stability of Transformers under Layer Normalization

Authors: Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Krishna Kumar, Markos A. Katsoulakis

Abstract: Despite their widespread use, training deep Transformers can be unstable. Layer normalization, a standard component, improves training stability, but its placement has often been ad-hoc. In this paper, we conduct a principled study on the forward (hidden states) and backward (gradient) stability of Transformers under different layer normalization placements. Our theory provides key insights into t… ▽ More Despite their widespread use, training deep Transformers can be unstable. Layer normalization, a standard component, improves training stability, but its placement has often been ad-hoc. In this paper, we conduct a principled study on the forward (hidden states) and backward (gradient) stability of Transformers under different layer normalization placements. Our theory provides key insights into the training dynamics: whether training drives Transformers toward regular solutions or pathological behaviors. For forward stability, we derive explicit bounds on the growth of hidden states in trained Transformers. For backward stability, we analyze how layer normalization affects the backpropagation of gradients, thereby explaining the training dynamics of each layer normalization placement. Our analysis also guides the scaling of residual steps in Transformer blocks, where appropriate choices can further improve stability and performance. Our numerical results corroborate our theoretical findings. Beyond these results, our framework provides a principled way to sanity-check the stability of Transformers under new architectural modifications, offering guidance for future designs. △ Less

Submitted 10 October, 2025; originally announced October 2025.

arXiv:2509.22793 [pdf, ps, other]

DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models

Authors: Komal Kumar, Rao Muhammad Anwer, Fahad Shahbaz Khan, Salman Khan, Ivan Laptev, Hisham Cholakkal

Abstract: Efficient fine-tuning of pre-trained Text-to-Image (T2I) models involves adjusting the model to suit a particular task or dataset while minimizing computational resources and limiting the number of trainable parameters. However, it often faces challenges in striking a trade-off between aligning with the target distribution: learning a novel concept from a limited image for personalization and reta… ▽ More Efficient fine-tuning of pre-trained Text-to-Image (T2I) models involves adjusting the model to suit a particular task or dataset while minimizing computational resources and limiting the number of trainable parameters. However, it often faces challenges in striking a trade-off between aligning with the target distribution: learning a novel concept from a limited image for personalization and retaining the instruction ability needed for unifying multiple tasks, all while maintaining editability (aligning with a variety of prompts or in-context generation). In this work, we introduce DEFT, Decompositional Efficient Fine-Tuning, an efficient fine-tuning framework that adapts a pre-trained weight matrix by decomposing its update into two components with two trainable matrices: (1) a projection onto the complement of a low-rank subspace spanned by a low-rank matrix, and (2) a low-rank update. The single trainable low-rank matrix defines the subspace, while the other trainable low-rank matrix enables flexible parameter adaptation within that subspace. We conducted extensive experiments on the Dreambooth and Dreambench Plus datasets for personalization, the InsDet dataset for object and scene adaptation, and the VisualCloze dataset for a universal image generation framework through visual in-context learning with both Stable Diffusion and a unified model. Our results demonstrated state-of-the-art performance, highlighting the emergent properties of efficient fine-tuning. Our code is available on \href{https://github.com/MAXNORM8650/DEFT}{DEFTBase}. △ Less

Submitted 26 September, 2025; originally announced September 2025.

Comments: 13 Figures, 21 pages, accepted in NeurIPS 2025

arXiv:2509.21464 [pdf, ps, other]

Residual Vector Quantization For Communication-Efficient Multi-Agent Perception

Authors: Dereje Shenkut, B. V. K Vijaya Kumar

Abstract: Multi-agent collaborative perception (CP) improves scene understanding by sharing information across connected agents such as autonomous vehicles, unmanned aerial vehicles, and robots. Communication bandwidth, however, constrains scalability. We present ReVQom, a learned feature codec that preserves spatial identity while compressing intermediate features. ReVQom is an end-to-end method that compr… ▽ More Multi-agent collaborative perception (CP) improves scene understanding by sharing information across connected agents such as autonomous vehicles, unmanned aerial vehicles, and robots. Communication bandwidth, however, constrains scalability. We present ReVQom, a learned feature codec that preserves spatial identity while compressing intermediate features. ReVQom is an end-to-end method that compresses feature dimensions via a simple bottleneck network followed by multi-stage residual vector quantization (RVQ). This allows only per-pixel code indices to be transmitted, reducing payloads from 8192 bits per pixel (bpp) of uncompressed 32-bit float features to 6-30 bpp per agent with minimal accuracy loss. On DAIR-V2X real-world CP dataset, ReVQom achieves 273x compression at 30 bpp to 1365x compression at 6 bpp. At 18 bpp (455x), ReVQom matches or outperforms raw-feature CP, and at 6-12 bpp it enables ultra-low-bandwidth operation with graceful degradation. ReVQom allows efficient and accurate multi-agent collaborative perception with a step toward practical V2X deployment. △ Less

Submitted 7 February, 2026; v1 submitted 25 September, 2025; originally announced September 2025.

Comments: Accepted at ICASSP 2026. 5 pages

arXiv:2509.18404 [pdf, ps, other]

Zero-Shot Transferable Solution Method for Parametric Optimal Control Problems

Authors: Xingjian Li, Kelvin Kan, Deepanshu Verma, Krishna Kumar, Stanley Osher, Ján Drgoňa

Abstract: This paper presents a transferable solution method for optimal control problems with varying objectives using function encoder (FE) policies. Traditional optimization-based approaches must be re-solved whenever objectives change, resulting in prohibitive computational costs for applications requiring frequent evaluation and adaptation. The proposed method learns a reusable set of neural basis func… ▽ More This paper presents a transferable solution method for optimal control problems with varying objectives using function encoder (FE) policies. Traditional optimization-based approaches must be re-solved whenever objectives change, resulting in prohibitive computational costs for applications requiring frequent evaluation and adaptation. The proposed method learns a reusable set of neural basis functions that spans the control policy space, enabling efficient zero-shot adaptation to new tasks through either projection from data or direct mapping from problem specifications. The key idea is an offline-online decomposition: basis functions are learned once during offline imitation learning, while online adaptation requires only lightweight coefficient estimation. Numerical experiments across diverse dynamics, dimensions, and cost structures show our method delivers near-optimal performance with minimal overhead when generalizing across tasks, enabling semi-global feedback policies suitable for real-time deployment. △ Less

Submitted 11 March, 2026; v1 submitted 22 September, 2025; originally announced September 2025.

Comments: 11 pages, 6 figures, 3 tables

arXiv:2509.10452 [pdf, ps, other]

WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers

Authors: Akshat Pandey, Karun Kumar, Raphael Tang

Abstract: Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variationa… ▽ More Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios. △ Less

Submitted 12 September, 2025; originally announced September 2025.

Comments: 5 pages, 2 figures

arXiv:2509.07526 [pdf, ps, other]

Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data

Authors: Gokul Karthik Kumar, Rishabh Saraf, Ludovick Lepauloux, Abdul Muneer, Billel Mokeddem, Hakim Hacid

Abstract: Large language models (LLMs) have transformed NLP, yet their integration with audio remains underexplored despite audio's centrality to human communication. We introduce Falcon3-Audio, a family of Audio-Language Models (ALMs) built on instruction-tuned LLMs and Whisper encoders. Using a remarkably small amount of public audio data, less than 30K hours (5K unique), Falcon3-Audio-7B matches the best… ▽ More Large language models (LLMs) have transformed NLP, yet their integration with audio remains underexplored despite audio's centrality to human communication. We introduce Falcon3-Audio, a family of Audio-Language Models (ALMs) built on instruction-tuned LLMs and Whisper encoders. Using a remarkably small amount of public audio data, less than 30K hours (5K unique), Falcon3-Audio-7B matches the best reported performance among open-weight models on the MMAU benchmark, with a score of 64.14, matching R1-AQA, while distinguishing itself through superior data and parameter efficiency, single-stage training, and transparency. Notably, our smallest 1B model remains competitive with larger open models ranging from 2B to 13B parameters. Through extensive ablations, we find that common complexities such as curriculum learning, multiple audio encoders, and intricate cross-attention connectors are not required for strong performance, even compared to models trained on over 500K hours of data. △ Less

Submitted 22 January, 2026; v1 submitted 9 September, 2025; originally announced September 2025.

Comments: Accepted at ASRU 2025

arXiv:2509.02923 [pdf, ps, other]

A Narrative Review of Clinical Decision Support Systems in Offloading Footwear for Diabetes-Related Foot Ulcers

Authors: Kunal Kumar, Muhammad Ashad Kabir, Luke Donnan, Sayed Ahmed

Abstract: Offloading footwear helps prevent and treat diabetic foot ulcers (DFUs) by lowering plantar pressure (PP), yet prescription decisions remain fragmented: feature selection varies, personalization is limited, and evaluation practices differ. We performed a narrative review of 45 studies (12 guidelines/protocols, 25 knowledge-based systems, 8 machine-learning applications) published to Aug 2025. We t… ▽ More Offloading footwear helps prevent and treat diabetic foot ulcers (DFUs) by lowering plantar pressure (PP), yet prescription decisions remain fragmented: feature selection varies, personalization is limited, and evaluation practices differ. We performed a narrative review of 45 studies (12 guidelines/protocols, 25 knowledge-based systems, 8 machine-learning applications) published to Aug 2025. We thematically analyzed knowledge type, decision logic, evaluation methods, and enabling technologies. Guidelines emphasize PP thresholds (<=200 kPa or >=25--30\% reduction) but rarely yield actionable, feature-level outputs. Knowledge-based systems use rule- and sensor-driven logic, integrating PP monitoring, adherence tracking, and usability testing. ML work introduces predictive, optimization, and generative models with high computational accuracy but limited explainability and clinical validation. Evaluation remains fragmented: protocols prioritize biomechanical tests; knowledge-based systems assess usability/adherence; ML studies focus on technical accuracy with weak linkage to long-term outcomes. From this synthesis we propose a five-part CDSS framework: (1) a minimum viable dataset; (2) a hybrid architecture combining rules, optimization, and explainable ML; (3) structured feature-level outputs; (4) continuous validation and evaluation; and (5) integration with clinical and telehealth workflows. This framework aims to enable scalable, patient-centered CDSSs for DFU care; prioritizing interoperable datasets, explainable models, and outcome-focused evaluation will be key to clinical adoption. △ Less

Submitted 2 September, 2025; originally announced September 2025.

Comments: 44 pages, 2 figures, and 3 tables

arXiv:2507.09005 [pdf, ps, other]

From images to properties: a NeRF-driven framework for granular material parameter inversion

Authors: Cheng-Hsi Hsiao, Krishna Kumar

Abstract: We introduce a novel framework that integrates Neural Radiance Fields (NeRF) with Material Point Method (MPM) simulation to infer granular material properties from visual observations. Our approach begins by generating synthetic experimental data, simulating an plow interacting with sand. The experiment is rendered into realistic images as the photographic observations. These observations include… ▽ More We introduce a novel framework that integrates Neural Radiance Fields (NeRF) with Material Point Method (MPM) simulation to infer granular material properties from visual observations. Our approach begins by generating synthetic experimental data, simulating an plow interacting with sand. The experiment is rendered into realistic images as the photographic observations. These observations include multi-view images of the experiment's initial state and time-sequenced images from two fixed cameras. Using NeRF, we reconstruct the 3D geometry from the initial multi-view images, leveraging its capability to synthesize novel viewpoints and capture intricate surface details. The reconstructed geometry is then used to initialize material point positions for the MPM simulation, where the friction angle remains unknown. We render images of the simulation under the same camera setup and compare them to the observed images. By employing Bayesian optimization, we minimize the image loss to estimate the best-fitting friction angle. Our results demonstrate that friction angle can be estimated with an error within 2 degrees, highlighting the effectiveness of inverse analysis through purely visual observations. This approach offers a promising solution for characterizing granular materials in real-world scenarios where direct measurement is impractical or impossible. △ Less

Submitted 11 July, 2025; originally announced July 2025.

arXiv:2507.07247 [pdf, ps, other]

Attentions Under the Microscope: A Comparative Study of Resource Utilization for Variants of Self-Attention

Authors: Zhengyu Tian, Anantha Padmanaban Krishna Kumar, Hemant Krishnakumar, Reza Rawassizadeh

Abstract: As large language models (LLMs) and visual language models (VLMs) grow in scale and application, attention mechanisms have become a central computational bottleneck due to their high memory and time complexity. While many efficient attention variants have been proposed, there remains a lack of rigorous evaluation on their actual energy usage and hardware resource demands during training. In this w… ▽ More As large language models (LLMs) and visual language models (VLMs) grow in scale and application, attention mechanisms have become a central computational bottleneck due to their high memory and time complexity. While many efficient attention variants have been proposed, there remains a lack of rigorous evaluation on their actual energy usage and hardware resource demands during training. In this work, we benchmark eight attention mechanisms in training GPT-2 architecture, measuring key metrics including training time, GPU memory usage, FLOPS, CPU usage, and power consumption. Our results reveal that attention mechanisms with optimized kernel implementations, including Flash Attention, Locality-Sensitive Hashing (LSH) Attention, and Multi-Head Latent Attention (MLA), achieve the best energy efficiency. We further show that lower GPU power alone does not guarantee reduced energy use, as training time plays an equally important role. Our study highlights the importance of energy-aware benchmarking in attention design and provides a practical insight for selecting resource-efficient mechanisms. All our codes are available at GitHub. △ Less

Submitted 9 July, 2025; originally announced July 2025.

Comments: 6 pages, 8 figures

arXiv:2507.04137 [pdf]

Detecting Token-Level Hallucinations Using Variance Signals: A Reference-Free Approach

Authors: Keshav Kumar

Abstract: Large Language Models (LLMs) have demonstrated impressive generative capabilities across diverse tasks but remain susceptible to hallucinations, confidently generated yet factually incorrect outputs. We introduce a reference-free, token-level hallucination detection framework that leverages the variance in token log-probabilities across multiple stochastic generations. Unlike prior methods that re… ▽ More Large Language Models (LLMs) have demonstrated impressive generative capabilities across diverse tasks but remain susceptible to hallucinations, confidently generated yet factually incorrect outputs. We introduce a reference-free, token-level hallucination detection framework that leverages the variance in token log-probabilities across multiple stochastic generations. Unlike prior methods that require ground-truth references or sentence-level verification, our approach is model-agnostic, interpretable, and suited for real-time or post-hoc analysis. We evaluate our method on unanswerable question prompts from the SQuAD v2 dataset and benchmark across three autoregressive models of varying scales: GPT-Neo 125M, Falcon 1B, and Mistral 7B. Through both quantitative metrics and visual diagnostics, we show that token-level variance reliably highlights instability in model outputs and correlates with hallucination patterns. Our framework is lightweight, reproducible, and adaptable to multiple domains, offering a valuable diagnostic tool for analyzing generative reliability in LLMs. △ Less

Submitted 16 October, 2025; v1 submitted 5 July, 2025; originally announced July 2025.

arXiv:2506.08885 [pdf, ps, other]

AdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement- Introducing Adversarial Vulnerability Quality Index (AVQI)

Authors: Danush Khanna, Gurucharan Marthi Krishna Kumar, Basab Ghosh, Yaswanth Narsupalli, Vinija Jain, Vasu Sharma, Aman Chadha, Amitava Das

Abstract: Adversarial threats against LLMs are escalating faster than current defenses can adapt. We expose a critical geometric blind spot in alignment: adversarial prompts exploit latent camouflage, embedding perilously close to the safe representation manifold while encoding unsafe intent thereby evading surface level defenses like Direct Preference Optimization (DPO), which remain blind to the latent ge… ▽ More Adversarial threats against LLMs are escalating faster than current defenses can adapt. We expose a critical geometric blind spot in alignment: adversarial prompts exploit latent camouflage, embedding perilously close to the safe representation manifold while encoding unsafe intent thereby evading surface level defenses like Direct Preference Optimization (DPO), which remain blind to the latent geometry. We introduce ALKALI, the first rigorously curated adversarial benchmark and the most comprehensive to date spanning 9,000 prompts across three macro categories, six subtypes, and fifteen attack families. Evaluation of 21 leading LLMs reveals alarmingly high Attack Success Rates (ASRs) across both open and closed source models, exposing an underlying vulnerability we term latent camouflage, a structural blind spot where adversarial completions mimic the latent geometry of safe ones. To mitigate this vulnerability, we introduce GRACE - Geometric Representation Aware Contrastive Enhancement, an alignment framework coupling preference learning with latent space regularization. GRACE enforces two constraints: latent separation between safe and adversarial completions, and adversarial cohesion among unsafe and jailbreak behaviors. These operate over layerwise pooled embeddings guided by a learned attention profile, reshaping internal geometry without modifying the base model, and achieve up to 39% ASR reduction. Moreover, we introduce AVQI, a geometry aware metric that quantifies latent alignment failure via cluster separation and compactness. AVQI reveals when unsafe completions mimic the geometry of safe ones, offering a principled lens into how models internally encode safety. We make the code publicly available at https://anonymous.4open.science/r/alkali-B416/README.md. △ Less

Submitted 28 September, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

arXiv:2505.13550 [pdf, ps, other]

JIR-Arena: The First Benchmark Dataset for Just-in-time Information Recommendation

Authors: Ke Yang, Kevin Ros, Shankar Kumar Senthil Kumar, ChengXiang Zhai

Abstract: Just-in-time Information Recommendation (JIR) is a service designed to deliver the most relevant information precisely when users need it, , addressing their knowledge gaps with minimal effort and boosting decision-making and efficiency in daily life. Advances in device-efficient deployment of foundation models and the growing use of intelligent wearable devices have made always-on JIR assistants… ▽ More Just-in-time Information Recommendation (JIR) is a service designed to deliver the most relevant information precisely when users need it, , addressing their knowledge gaps with minimal effort and boosting decision-making and efficiency in daily life. Advances in device-efficient deployment of foundation models and the growing use of intelligent wearable devices have made always-on JIR assistants feasible. However, there has been no systematic effort to formally define JIR tasks or establish evaluation frameworks. To bridge this gap, we present the first mathematical definition of JIR tasks and associated evaluation metrics. Additionally, we introduce JIR-Arena, a multimodal benchmark dataset featuring diverse, information-request-intensive scenarios to evaluate JIR systems across critical dimensions: i) accurately inferring user information needs, ii) delivering timely and relevant recommendations, and iii) avoiding irrelevant content that may distract users. Developing a JIR benchmark dataset poses challenges due to subjectivity in estimating user information needs and uncontrollable system variables affecting reproducibility. To address these, JIR-Arena: i) combines input from multiple humans and large AI models to approximate information need distributions; ii) assesses JIR quality through information retrieval outcomes using static knowledge base snapshots; and iii) employs a multi-turn, multi-entity validation framework to improve objectivity and generality. Furthermore, we implement a baseline JIR system capable of processing real-time information streams aligned with user inputs. Our evaluation of this baseline system on JIR-Arena indicates that while foundation model-based JIR systems simulate user needs with reasonable precision, they face challenges in recall and effective content retrieval. To support future research in this new area, we fully release our code and data. △ Less

Submitted 19 May, 2025; originally announced May 2025.

arXiv:2505.04738 [pdf, ps, other]

SetONet: A Set-Based Operator Network for Solving PDEs with Variable-Input Sampling

Authors: Stepan Tretiakov, Xingjian Li, Krishna Kumar

Abstract: Most neural-operator surrogates for PDEs inherit from DeepONet-style formulations the requirement that the input function be sampled at a fixed, ordered set of sensors. This assumption limits applicability to problems with variable sensor layouts, missing data, point sources, and sample-based representations of densities. We propose SetONet, which addresses this gap by recasting the operator input… ▽ More Most neural-operator surrogates for PDEs inherit from DeepONet-style formulations the requirement that the input function be sampled at a fixed, ordered set of sensors. This assumption limits applicability to problems with variable sensor layouts, missing data, point sources, and sample-based representations of densities. We propose SetONet, which addresses this gap by recasting the operator input as an unordered set of coordinate-value observations and encoding it with permutation-invariant aggregation inside a standard branch-trunk operator network while preserving the DeepONet synthesis mechanism and lightweight end-to-end training. A structured variant, SetONet-Key, aggregates sensor information through learnable query tokens and a position-only key pathway, thereby decoupling sampling geometry from sensor values. The method is assessed on four classical operator-learning benchmarks under fixed layouts, variable layouts, and evaluation-time sensor drop-off, and on four problems with inherently unstructured point-cloud inputs, including heat conduction with multiple point sources, advection-diffusion, phase-screen diffraction, and optimal transport problems. In parameter-matched studies, SetONet-Key achieves lower error than the DeepONet baseline on fixed-sensor benchmarks and remains reliable when layouts vary or sensors are dropped at evaluation. Comparisons across pooling rules show that attention-based aggregation is typically more robust than mean or sum pooling. On the point-cloud problems, SetONet operates directly on the native input representation, without rasterization or multi-stage preprocessing, and outperforms the larger VIDON baseline. △ Less

Submitted 1 April, 2026; v1 submitted 7 May, 2025; originally announced May 2025.

ACM Class: I.2; G.1.8

arXiv:2505.04572 [pdf, other]

Stow: Robotic Packing of Items into Fabric Pods

Authors: Nicolas Hudson, Josh Hooks, Rahul Warrier, Curt Salisbury, Ross Hartley, Kislay Kumar, Bhavana Chandrashekhar, Paul Birkmeyer, Bosch Tang, Matt Frost, Shantanu Thakar, Tony Piaskowy, Petter Nilsson, Josh Petersen, Neel Doshi, Alan Slatter, Ankit Bhatia, Cassie Meeker, Yuechuan Xue, Dylan Cox, Alex Kyriazis, Bai Lou, Nadeem Hasan, Asif Rana, Nikhil Chacko , et al. (12 additional authors not shown)

Abstract: This paper presents a compliant manipulation system capable of placing items onto densely packed shelves. The wide diversity of items and strict business requirements for high producing rates and low defect generation have prohibited warehouse robotics from performing this task. Our innovations in hardware, perception, decision-making, motion planning, and control have enabled this system to perfo… ▽ More This paper presents a compliant manipulation system capable of placing items onto densely packed shelves. The wide diversity of items and strict business requirements for high producing rates and low defect generation have prohibited warehouse robotics from performing this task. Our innovations in hardware, perception, decision-making, motion planning, and control have enabled this system to perform over 500,000 stows in a large e-commerce fulfillment center. The system achieves human levels of packing density and speed while prioritizing work on overhead shelves to enhance the safety of humans working alongside the robots. △ Less

Submitted 7 May, 2025; originally announced May 2025.

arXiv:2504.11397 [pdf, other]

MLPs and KANs for data-driven learning in physical problems: A performance comparison

Authors: Raghav Pant, Sikan Li, Xingjian Li, Hassan Iqbal, Krishna Kumar

Abstract: There is increasing interest in solving partial differential equations (PDEs) by casting them as machine learning problems. Recently, there has been a spike in exploring Kolmogorov-Arnold Networks (KANs) as an alternative to traditional neural networks represented by Multi-Layer Perceptrons (MLPs). While showing promise, their performance advantages in physics-based problems remain largely unexplo… ▽ More There is increasing interest in solving partial differential equations (PDEs) by casting them as machine learning problems. Recently, there has been a spike in exploring Kolmogorov-Arnold Networks (KANs) as an alternative to traditional neural networks represented by Multi-Layer Perceptrons (MLPs). While showing promise, their performance advantages in physics-based problems remain largely unexplored. Several critical questions persist: Can KANs capture complex physical dynamics and under what conditions might they outperform traditional architectures? In this work, we present a comparative study of KANs and MLPs for learning physical systems governed by PDEs. We assess their performance when applied in deep operator networks (DeepONet) and graph network-based simulators (GNS), and test them on physical problems that vary significantly in scale and complexity. Drawing inspiration from the Kolmogorov Representation Theorem, we examine the behavior of KANs and MLPs across shallow and deep network architectures. Our results reveal that although KANs do not consistently outperform MLPs when configured as deep neural networks, they demonstrate superior expressiveness in shallow network settings, significantly outpacing MLPs in accuracy over our test cases. This suggests that KANs are a promising choice, offering a balance of efficiency and accuracy in applications involving physical systems. △ Less

Submitted 15 April, 2025; originally announced April 2025.

Comments: 30 pages, 18 figures, 8 tables

arXiv:2504.08766 [pdf, other]

Towards scientific machine learning for granular material simulations -- challenges and opportunities

Authors: Marc Fransen, Andreas Fürst, Deepak Tunuguntla, Daniel N. Wilke, Benedikt Alkin, Daniel Barreto, Johannes Brandstetter, Miguel Angel Cabrera, Xinyan Fan, Mengwu Guo, Bram Kieskamp, Krishna Kumar, John Morrissey, Jonathan Nuttall, Jin Ooi, Luisa Orozco, Stefanos-Aldo Papanicolopulos, Tongming Qu, Dingena Schott, Takayuki Shuku, WaiChing Sun, Thomas Weinhart, Dongwei Ye, Hongyang Cheng

Abstract: Micro-scale mechanisms, such as inter-particle and particle-fluid interactions, govern the behaviour of granular systems. While particle-scale simulations provide detailed insights into these interactions, their computational cost is often prohibitive. Attended by researchers from both the granular materials (GM) and machine learning (ML) communities, a recent Lorentz Center Workshop on "Machine L… ▽ More Micro-scale mechanisms, such as inter-particle and particle-fluid interactions, govern the behaviour of granular systems. While particle-scale simulations provide detailed insights into these interactions, their computational cost is often prohibitive. Attended by researchers from both the granular materials (GM) and machine learning (ML) communities, a recent Lorentz Center Workshop on "Machine Learning for Discrete Granular Media" brought the ML community up to date with GM challenges. This position paper emerged from the workshop discussions. We define granular materials and identify seven key challenges that characterise their distinctive behaviour across various scales and regimes, ranging from gas-like to fluid-like and solid-like. Addressing these challenges is essential for developing robust and efficient digital twins for granular systems in various industrial applications. To showcase the potential of ML to the GM community, we present classical and emerging machine/deep learning techniques that have been, or could be, applied to granular materials. We reviewed sequence-based learning models for path-dependent constitutive behaviour, followed by encoder-decoder type models for representing high-dimensional data. We then explore graph neural networks and recent advances in neural operator learning. Lastly, we discuss model-order reduction and probabilistic learning techniques for high-dimensional parameterised systems, which are crucial for quantifying uncertainties arising from physics-based and data-driven models. We present a workflow aimed at unifying data structures and modelling pipelines and guiding readers through the selection, training, and deployment of ML surrogates for granular material simulations. Finally, we illustrate the workflow's practical use with two representative examples, focusing on granular materials in solid-like and fluid-like regimes. △ Less

Submitted 1 April, 2025; originally announced April 2025.

Comments: 35 pages, 17 figures

arXiv:2503.22678 [pdf, ps, other]

Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions

Authors: Mohammad Almansoori, Komal Kumar, Hisham Cholakkal

Abstract: In this work, we introduce MedAgentSim, an open-source simulated clinical environment with doctor, patient, and measurement agents designed to evaluate and enhance LLM performance in dynamic diagnostic settings. Unlike prior approaches, our framework requires doctor agents to actively engage with patients through multi-turn conversations, requesting relevant medical examinations (e.g., temperature… ▽ More In this work, we introduce MedAgentSim, an open-source simulated clinical environment with doctor, patient, and measurement agents designed to evaluate and enhance LLM performance in dynamic diagnostic settings. Unlike prior approaches, our framework requires doctor agents to actively engage with patients through multi-turn conversations, requesting relevant medical examinations (e.g., temperature, blood pressure, ECG) and imaging results (e.g., MRI, X-ray) from a measurement agent to mimic the real-world diagnostic process. Additionally, we incorporate self improvement mechanisms that allow models to iteratively refine their diagnostic strategies. We enhance LLM performance in our simulated setting by integrating multi-agent discussions, chain-of-thought reasoning, and experience-based knowledge retrieval, facilitating progressive learning as doctor agents interact with more patients. We also introduce an evaluation benchmark for assessing the LLM's ability to engage in dynamic, context-aware diagnostic interactions. While MedAgentSim is fully automated, it also supports a user-controlled mode, enabling human interaction with either the doctor or patient agent. Comprehensive evaluations in various simulated diagnostic scenarios demonstrate the effectiveness of our approach. Our code, simulation tool, and benchmark are available at \href{https://medagentsim.netlify.app/}. △ Less

Submitted 1 October, 2025; v1 submitted 28 March, 2025; originally announced March 2025.

Comments: 14 page, 4 figures, 61 references, presented in MICCAI (Oral)

arXiv:2503.13389 [pdf]

Investigating the effect of CPT in lateral spreading prediction using Explainable AI

Authors: Cheng-Hsi Hsiao, Ellen Rathje, Krishna Kumar

Abstract: This study proposes an autoencoder approach to extract latent features from cone penetration test profiles to evaluate the potential of incorporating CPT data in an AI model. We employ autoencoders to compress 200 CPT profiles of soil behavior type index (Ic) and normalized cone resistance (qc1Ncs) into ten latent features while preserving critical information. We then utilize the extracted latent… ▽ More This study proposes an autoencoder approach to extract latent features from cone penetration test profiles to evaluate the potential of incorporating CPT data in an AI model. We employ autoencoders to compress 200 CPT profiles of soil behavior type index (Ic) and normalized cone resistance (qc1Ncs) into ten latent features while preserving critical information. We then utilize the extracted latent features with site parameters to train XGBoost models for predicting lateral spreading occurrences in the 2011 Christchurch earthquake. Models using the latent CPT features outperformed models with conventional CPT metrics or no CPT data, achieving over 83% accuracy. Explainable AI revealed the most crucial latent feature corresponding to soil behavior between 1-3 meter depths, highlighting this depth range's criticality for liquefaction evaluation. The autoencoder approach provides an automated technique for condensing CPT profiles into informative latent features for machine-learning liquefaction models. △ Less

Submitted 17 March, 2025; originally announced March 2025.

arXiv:2503.11698 [pdf, ps, other]

A Comparison of the Cerebras Wafer-Scale Integration Technology with Nvidia GPU-based Systems for Artificial Intelligence

Authors: Yudhishthira Kundu, Manroop Kaur, Tripty Wig, Kriti Kumar, Pushpanjali Kumari, Vivek Puri, Manish Arora

Abstract: Cerebras' wafer-scale engine (WSE) technology merges multiple dies on a single wafer. It addresses the challenges of memory bandwidth, latency, and scalability, making it suitable for artificial intelligence. This work evaluates the WSE-3 architecture and compares it with leading GPU-based AI accelerators, notably Nvidia's H100 and B200. The work highlights the advantages of WSE-3 in performance p… ▽ More Cerebras' wafer-scale engine (WSE) technology merges multiple dies on a single wafer. It addresses the challenges of memory bandwidth, latency, and scalability, making it suitable for artificial intelligence. This work evaluates the WSE-3 architecture and compares it with leading GPU-based AI accelerators, notably Nvidia's H100 and B200. The work highlights the advantages of WSE-3 in performance per watt and memory scalability and provides insights into the challenges in manufacturing, thermal management, and reliability. The results suggest that wafer-scale integration can surpass conventional architectures in several metrics, though work is required to address cost-effectiveness and long-term viability. △ Less

Submitted 11 March, 2025; originally announced March 2025.

Comments: 11 pages

arXiv:2502.21321 [pdf, other]

LLM Post-Training: A Deep Dive into Reasoning Large Language Models

Authors: Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H. S. Torr, Fahad Shahbaz Khan, Salman Khan

Abstract: Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Pretraining on vast web-scale data has laid the foundation for these models, yet the research community is now increasingly shifting focus toward post-training techniques to achieve further breakthroughs. While pretraining provides a broad linguistic foundation, post-tr… ▽ More Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Pretraining on vast web-scale data has laid the foundation for these models, yet the research community is now increasingly shifting focus toward post-training techniques to achieve further breakthroughs. While pretraining provides a broad linguistic foundation, post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations. Fine-tuning, reinforcement learning, and test-time scaling have emerged as critical strategies for optimizing LLMs performance, ensuring robustness, and improving adaptability across various real-world tasks. This survey provides a systematic exploration of post-training methodologies, analyzing their role in refining LLMs beyond pretraining, addressing key challenges such as catastrophic forgetting, reward hacking, and inference-time trade-offs. We highlight emerging directions in model alignment, scalable adaptation, and inference-time reasoning, and outline future research directions. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/mbzuai-oryx/Awesome-LLM-Post-training. △ Less

Submitted 24 March, 2025; v1 submitted 28 February, 2025; originally announced February 2025.

Comments: 32 pages, 7 figures, 3 tables, 377 references. Github Repo: https://github.com/mbzuai-oryx/Awesome-LLM-Post-training

arXiv:2502.16095 [pdf, ps, other]

Good Representation, Better Explanation: Role of Convolutional Neural Networks in Transformer-Based Remote Sensing Image Captioning

Authors: Swadhin Das, Saarthak Gupta, Kamal Kumar, Raksha Sharma

Abstract: Remote Sensing Image Captioning (RSIC) is the process of generating meaningful descriptions from remote sensing images. Recently, it has gained significant attention, with encoder-decoder models serving as the backbone for generating meaningful captions. The encoder extracts essential visual features from the input image, transforming them into a compact representation, while the decoder utilizes… ▽ More Remote Sensing Image Captioning (RSIC) is the process of generating meaningful descriptions from remote sensing images. Recently, it has gained significant attention, with encoder-decoder models serving as the backbone for generating meaningful captions. The encoder extracts essential visual features from the input image, transforming them into a compact representation, while the decoder utilizes this representation to generate coherent textual descriptions. Recently, transformer-based models have gained significant popularity due to their ability to capture long-range dependencies and contextual information. The decoder has been well explored for text generation, whereas the encoder remains relatively unexplored. However, optimizing the encoder is crucial as it directly influences the richness of extracted features, which in turn affects the quality of generated captions. To address this gap, we systematically evaluate twelve different convolutional neural network (CNN) architectures within a transformer-based encoder framework to assess their effectiveness in RSIC. The evaluation consists of two stages: first, a numerical analysis categorizes CNNs into different clusters, based on their performance. The best performing CNNs are then subjected to human evaluation from a human-centric perspective by a human annotator. Additionally, we analyze the impact of different search strategies, namely greedy search and beam search, to ensure the best caption. The results highlight the critical role of encoder selection in improving captioning performance, demonstrating that specific CNN architectures significantly enhance the quality of generated descriptions for remote sensing images. By providing a detailed comparison of multiple encoders, this study offers valuable insights to guide advances in transformer-based image captioning models. △ Less

Submitted 3 July, 2025; v1 submitted 22 February, 2025; originally announced February 2025.

arXiv:2502.13982 [pdf, other]

Benchmarking Automatic Speech Recognition coupled LLM Modules for Medical Diagnostics

Authors: Kabir Kumar

Abstract: Natural Language Processing (NLP) and Voice Recognition agents are rapidly evolving healthcare by enabling efficient, accessible, and professional patient support while automating grunt work. This report serves as my self project wherein models finetuned on medical call recordings are analysed through a two-stage system: Automatic Speech Recognition (ASR) for speech transcription and a Large Langu… ▽ More Natural Language Processing (NLP) and Voice Recognition agents are rapidly evolving healthcare by enabling efficient, accessible, and professional patient support while automating grunt work. This report serves as my self project wherein models finetuned on medical call recordings are analysed through a two-stage system: Automatic Speech Recognition (ASR) for speech transcription and a Large Language Model (LLM) for context-aware, professional responses. ASR, finetuned on phone call recordings provides generalised transcription of diverse patient speech over call, while the LLM matches transcribed text to medical diagnosis. A novel audio preprocessing strategy, is deployed to provide invariance to incoming recording/call data, laden with sufficient augmentation with noise/clipping to make the pipeline robust to the type of microphone and ambient conditions the patient might have while calling/recording. △ Less

Submitted 18 February, 2025; originally announced February 2025.

Showing 1–50 of 292 results for author: Kumar, K