In-Context Environments Induce Evaluation-Awareness in Language Models

Abstract: Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task; we hypothesize that language models exhibit environment-dependent \textit{evaluation awareness}. This raises concerns that models could strategically underperform, or \textit{sandbag}, to avoid triggering capability-limiting interventions such as unlearning or shutdown. Prior work demonstrates sa… ▽ More Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task; we hypothesize that language models exhibit environment-dependent \textit{evaluation awareness}. This raises concerns that models could strategically underperform, or \textit{sandbag}, to avoid triggering capability-limiting interventions such as unlearning or shutdown. Prior work demonstrates sandbagging under hand-crafted prompts, but this underestimates the true vulnerability ceiling. We introduce a black-box adversarial optimization framework treating the in-context prompt as an optimizable environment, and develop two approaches to characterize sandbagging: (1) measuring whether models expressing intent to underperform can actually execute it across different task structures, and (2) causally isolating whether underperformance is driven by genuine evaluation-aware reasoning or shallow prompt-following. Evaluating Claude-3.5-Haiku, GPT-4o-mini, and Llama-3.3-70B across four benchmarks (Arithmetic, GSM8K, MMLU, and HumanEval), optimized prompts induce up to 94 percentage point (pp) degradation on arithmetic (GPT-4o-mini: 97.8\%$\rightarrow$4.0\%), far exceeding hand-crafted baselines which produce near-zero behavioral change. Code generation exhibits model-dependent resistance: Claude degrades only 0.6pp, while Llama's accuracy drops to 0\%. The intent -- execution gap reveals a monotonic resistance ordering: Arithmetic $<$ GSM8K $<$ MMLU, demonstrating that vulnerability is governed by task structure rather than prompt strength. CoT causal intervention confirms that 99.3\% of sandbagging is causally driven by verbalized eval-aware reasoning, ruling out shallow instruction-following. These findings demonstrate that adversarially optimized prompts pose a substantially greater threat to evaluation reliability than previously understood. △ Less

Submitted 4 March, 2026; originally announced March 2026.

arXiv:2602.18782 [pdf, ps, other]

MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs

Authors: Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu, Ruizhe Li, Maheep Chaudhary

Abstract: Defending LLMs against adversarial jailbreak attacks remains an open challenge. Existing defenses rely on binary classifiers that fail when adversarial input falls outside the learned decision boundary, and repeated fine-tuning is computationally expensive while potentially degrading model capabilities. We propose MANATEE, an inference-time defense that uses density estimation over a benign repres… ▽ More Defending LLMs against adversarial jailbreak attacks remains an open challenge. Existing defenses rely on binary classifiers that fail when adversarial input falls outside the learned decision boundary, and repeated fine-tuning is computationally expensive while potentially degrading model capabilities. We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold. MANATEE learns the score function of benign hidden states and uses diffusion to project anomalous representations toward safe regions--requiring no harmful training data and no architectural modifications. Experiments across Mistral-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-it demonstrate that MANATEE reduce Attack Success Rate by up to 100\% on certain datasets, while preserving model utility on benign inputs. △ Less

Submitted 21 February, 2026; originally announced February 2026.

arXiv:2602.15195 [pdf, ps, other]

Weight space Detection of Backdoors in LoRA Adapters

Authors: David Puertolas Merenciano, Ekaterina Vasyagina, Kevin Zhu, Javier Ferrando, Maheep Chaudhary

Abstract: LoRA adapters let users fine-tune large language models (LLMs) efficiently. However, LoRA adapters are shared through open repositories like Hugging Face Hub \citep{huggingface_hub_docs}, making them vulnerable to backdoor attacks. Current detection methods require running the model with test input data -- making them impractical for screening thousands of adapters where the trigger for backdoor b… ▽ More LoRA adapters let users fine-tune large language models (LLMs) efficiently. However, LoRA adapters are shared through open repositories like Hugging Face Hub \citep{huggingface_hub_docs}, making them vulnerable to backdoor attacks. Current detection methods require running the model with test input data -- making them impractical for screening thousands of adapters where the trigger for backdoor behavior is unknown. We detect poisoned adapters by analyzing their weight matrices directly, without running the model -- making our method trigger-agnostic. For each attention projection (Q, K, V, O), our method extracts five spectral statistics from the low-rank update $ΔW$, yielding a 20-dimensional signature for each adapter. A logistic regression detector trained on this representation separates benign and poisoned adapters across three model families -- Llama-3.2-3B~\citep{llama3}, Qwen2.5-3B~\citep{qwen25}, and Gemma-2-2B~\citep{gemma2} -- on unseen test adapters drawn from instruction-following, reasoning, question-answering, code, and classification tasks. Across all three architectures, the detector achieves 100\% accuracy. △ Less

Submitted 7 April, 2026; v1 submitted 16 February, 2026; originally announced February 2026.

arXiv:2602.14444 [pdf, ps, other]

Broken Chains: The Cost of Incomplete Reasoning in LLMs

Authors: Ian Su, Gaurav Purushothaman, Jey Narayan, Ruhika Goel, Kevin Zhu, Sunishchal Dev, Yash More, Maheep Chaudhary

Abstract: Reasoning-specialized models like OpenAI's 5.1 and DeepSeek-V3.2 allocate substantial inference compute to extended chain-of-thought (CoT) traces, yet reasoning tokens incur significant costs. How do different reasoning modalities of code, natural language, hybrid, or none do perform under token constraints? We introduce a framework that constrains models to reason exclusively through code, commen… ▽ More Reasoning-specialized models like OpenAI's 5.1 and DeepSeek-V3.2 allocate substantial inference compute to extended chain-of-thought (CoT) traces, yet reasoning tokens incur significant costs. How do different reasoning modalities of code, natural language, hybrid, or none do perform under token constraints? We introduce a framework that constrains models to reason exclusively through code, comments, both, or neither, then systematically ablates token budgets to 10\%, 30\%, 50\%, and 70\% of optimal. We evaluate four frontier models (GPT-5.1, Gemini 3 Flash, DeepSeek-V3.2, Grok 4.1) across mathematical benchmarks (AIME, GSM8K, HMMT). Our findings reveal: (1) \textbf{truncated reasoning can hurt} as DeepSeek-V3.2 achieves 53\% with no reasoning but only 17\% with truncated CoT at 50\% budget; (2) \textbf{code degrades gracefully} as Gemini's comments collapse to 0\% while code maintains 43-47\%; (3) \textbf{hybrid reasoning underperforms} single modalities; (4) \textbf{robustness is model-dependent} as Grok maintains 80-90\% at 30\% budget where OpenAI and DeepSeek collapse to 7-27\%. These results suggest incomplete reasoning chains actively mislead models, with implications for deploying reasoning-specialized systems under resource constraints. △ Less

Submitted 15 February, 2026; originally announced February 2026.

arXiv:2511.23131 [pdf, ps, other]

Towards Generalized Position-Based Dynamics

Authors: Manas Chaudhary, Chandradeep Pokhariya, Rahul Narain

Abstract: The position-based dynamics (PBD) algorithm is a popular and versatile technique for real-time simulation of deformable bodies, but is only applicable to forces that can be expressed as linearly compliant constraints. In this work, we explore a generalization of PBD that is applicable to arbitrary nonlinear force models. We do this by reformulating the implicit time integration equations in terms… ▽ More The position-based dynamics (PBD) algorithm is a popular and versatile technique for real-time simulation of deformable bodies, but is only applicable to forces that can be expressed as linearly compliant constraints. In this work, we explore a generalization of PBD that is applicable to arbitrary nonlinear force models. We do this by reformulating the implicit time integration equations in terms of the individual forces in the system, to which applying Gauss-Seidel iterations naturally leads to a PBD-type algorithm. As we demonstrate, our method allows simulation of data-driven cloth models [Sperl et al. 2020] that cannot be represented by existing variations of position-based dynamics, enabling performance improvements over the baseline Newton-based solver for high mesh resolutions. We also show our method's applicability to volumetric neo-Hookean elasticity with an inversion barrier. △ Less

Submitted 28 November, 2025; originally announced November 2025.

arXiv:2511.07772 [pdf, ps, other]

SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought

Authors: Shourya Batra, Pierce Tillman, Samarth Gaggar, Shashank Kesineni, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Vasu Sharma, Maheep Chaudhary

Abstract: As Large Language Models (LLMs) evolve into personal assistants with access to sensitive user data, they face a critical privacy challenge: while prior work has addressed output-level privacy, recent findings reveal that LLMs often leak private information through their internal reasoning processes, violating contextual privacy expectations. These leaky thoughts occur when models inadvertently exp… ▽ More As Large Language Models (LLMs) evolve into personal assistants with access to sensitive user data, they face a critical privacy challenge: while prior work has addressed output-level privacy, recent findings reveal that LLMs often leak private information through their internal reasoning processes, violating contextual privacy expectations. These leaky thoughts occur when models inadvertently expose sensitive details in their reasoning traces, even when final outputs appear safe. The challenge lies in preventing such leakage without compromising the model's reasoning capabilities, requiring a delicate balance between privacy and utility. We introduce Steering Activations towards Leakage-free Thinking (SALT), a lightweight test-time intervention that mitigates privacy leakage in model's Chain of Thought (CoT) by injecting targeted steering vectors into hidden state. We identify the high-leakage layers responsible for this behavior. Through experiments across multiple LLMs, we demonstrate that SALT achieves reductions including $18.2\%$ reduction in CPL on QwQ-32B, $17.9\%$ reduction in CPL on Llama-3.1-8B, and $31.2\%$ reduction in CPL on Deepseek in contextual privacy leakage dataset AirGapAgent-R while maintaining comparable task performance and utility. Our work establishes SALT as a practical approach for test-time privacy protection in reasoning-capable language models, offering a path toward safer deployment of LLM-based personal agents. △ Less

Submitted 20 November, 2025; v1 submitted 10 November, 2025; originally announced November 2025.

arXiv:2511.07482 [pdf, ps, other]

Alignment-Constrained Dynamic Pruning for LLMs: Identifying and Preserving Alignment-Critical Circuits

Authors: Dev Patel, Gabrielle Gervacio, Diekola Raimi, Kevin Zhu, Ryan Lagasse, Gabriel Grand, Ashwinee Panda, Maheep Chaudhary

Abstract: Large Language Models require substantial computational resources for inference, posing deployment challenges. While dynamic pruning offers superior efficiency over static methods through adaptive circuit selection, it exacerbates alignment degradation by retaining only input-dependent safety-critical circuit preservation across diverse inputs. As a result, addressing these heightened alignment vu… ▽ More Large Language Models require substantial computational resources for inference, posing deployment challenges. While dynamic pruning offers superior efficiency over static methods through adaptive circuit selection, it exacerbates alignment degradation by retaining only input-dependent safety-critical circuit preservation across diverse inputs. As a result, addressing these heightened alignment vulnerabilities remains critical. We introduce Alignment-Aware Probe Pruning (AAPP), a dynamic structured pruning method that adaptively preserves alignment-relevant circuits during inference, building upon Probe Pruning. Experiments on LLaMA 2-7B, Qwen2.5-14B-Instruct, and Gemma-3-12B-IT show AAPP improves refusal rates by 50\% at matched compute, enabling efficient yet safety-preserving LLM deployment. △ Less

Submitted 9 November, 2025; originally announced November 2025.

arXiv:2511.06437 [pdf, ps, other]

Optimizing Chain-of-Thought Confidence via Topological and Dirichlet Risk Analysis

Authors: Abhishek More, Anthony Zhang, Nicole Bonilla, Ashvik Vivekan, Kevin Zhu, Parham Sharafoleslami, Maheep Chaudhary

Abstract: Chain-of-thought (CoT) prompting enables Large Language Models to solve complex problems, but deploying these models safely requires reliable confidence estimates, a capability where existing methods suffer from poor calibration and severe overconfidence on incorrect predictions. We propose Enhanced Dirichlet and Topology Risk (EDTR), a novel decoding strategy that combines topological analysis wi… ▽ More Chain-of-thought (CoT) prompting enables Large Language Models to solve complex problems, but deploying these models safely requires reliable confidence estimates, a capability where existing methods suffer from poor calibration and severe overconfidence on incorrect predictions. We propose Enhanced Dirichlet and Topology Risk (EDTR), a novel decoding strategy that combines topological analysis with Dirichlet-based uncertainty quantification to measure LLM confidence across multiple reasoning paths. EDTR treats each CoT as a vector in high-dimensional space and extracts eight topological risk features capturing the geometric structure of reasoning distributions: tighter, more coherent clusters indicate higher confidence while dispersed, inconsistent paths signal uncertainty. We evaluate EDTR against three state-of-the-art calibration methods across four diverse reasoning benchmarks spanning olympiad-level mathematics (AIME), grade school math (GSM8K), commonsense reasoning, and stock price prediction \cite{zhang2025aime, cobbe2021training, talmor-etal-2019-commonsenseqa, yahoo_finance}. EDTR achieves 41\% better calibration than competing methods with an average ECE of 0.287 and the best overall composite score of 0.672, while notably achieving perfect accuracy on AIME and exceptional calibration on GSM8K with an ECE of 0.107, domains where baselines exhibit severe overconfidence. Our work provides a geometric framework for understanding and quantifying uncertainty in multi-step LLM reasoning, enabling more reliable deployment where calibrated confidence estimates are essential. △ Less

Submitted 9 November, 2025; originally announced November 2025.

arXiv:2509.25238 [pdf, ps, other]

PALADIN: Self-Correcting Language Model Agents to Cure Tool-Failure Cases

Authors: Sri Vatsa Vuddanti, Aarav Shah, Satwik Kumar Chittiprolu, Tony Song, Sunishchal Dev, Kevin Zhu, Maheep Chaudhary

Abstract: Tool-augmented language agents frequently fail in real-world deployment due to tool malfunctions--timeouts, API exceptions, or inconsistent outputs--triggering cascading reasoning errors and task abandonment. Existing agent training pipelines optimize only for success trajectories, failing to expose models to the tool failures that dominate real-world usage. We propose \textbf{PALADIN}, a generali… ▽ More Tool-augmented language agents frequently fail in real-world deployment due to tool malfunctions--timeouts, API exceptions, or inconsistent outputs--triggering cascading reasoning errors and task abandonment. Existing agent training pipelines optimize only for success trajectories, failing to expose models to the tool failures that dominate real-world usage. We propose \textbf{PALADIN}, a generalizable framework for equipping language agents with robust failure recovery capabilities. PALADIN trains on 50,000+ recovery-annotated trajectories constructed via systematic failure injection and expert demonstrations on an enhanced ToolBench dataset. Training uses LoRA-based fine-tuning to retain base capabilities while injecting recovery competence. At inference, PALADIN detects execution-time errors and retrieves the most similar case from a curated bank of 55+ failure exemplars aligned with ToolScan's taxonomy, then executes the corresponding recovery action. This approach generalizes to novel failures beyond the training distribution, retaining 95.2\% recovery performance on unseen tool APIs. Evaluation across PaladinEval and ToolReflectEval demonstrates consistent improvements in Recovery Rate (RR), Task Success Rate (TSR), Catastrophic Success Rate (CSR), and Efficiency Score (ES). PALADIN improves RR from 32.76% to 89.68% (+57% relative) over ToolBench and outperforms the strongest baseline CRITIC (76.34%) by +13.3%. Against vanilla agents, PALADIN achieves 89.86\% RR (+66% relative improvement from 23.75%). These results establish PALADIN as an effective method for building fault-tolerant agents capable of robust recovery in real-world tool environments. △ Less

Submitted 25 September, 2025; originally announced September 2025.

arXiv:2509.18116 [pdf, ps, other]

Amortized Latent Steering: Low-Cost Alternative to Test-Time Optimization

Authors: Nathan Egbuna, Saatvik Gaur, Sunishchal Dev, Ashwinee Panda, Maheep Chaudhary

Abstract: Test-time optimization remains impractical at scale due to prohibitive inference costs--techniques like iterative refinement and multi-step verification can require $10-100\times$ more compute per query than standard decoding. Latent space test-time optimization methods like LatentSeek offer a more direct approach by steering hidden representations, but still demand expensive per-query optimizatio… ▽ More Test-time optimization remains impractical at scale due to prohibitive inference costs--techniques like iterative refinement and multi-step verification can require $10-100\times$ more compute per query than standard decoding. Latent space test-time optimization methods like LatentSeek offer a more direct approach by steering hidden representations, but still demand expensive per-query optimization loops with multiple backward passes. We propose Amortized Latent Steering (ALS), which collapses this iterative optimization into a single offline-computed vector applied at constant cost during inference. ALS computes the mean difference between hidden states from successful versus unsuccessful generations, then uses this direction to calibrate the model's hidden representations: when decoding drifts away from the success manifold, ALS nudges activations back toward it. Across GSM8K and MATH-500 benchmarks, ALS achieves $2-5\times$ speedup over iterative methods while matching or surpassing greedy Chain-of-Thought (CoT) and Self-Consistency baselines, yielding up to 101% improvement in efficiency--accuracy trade-off. These results show that much of latent optimization's benefit can be captured offline, making sophisticated reasoning techniques viable for production deployment. Code is available at https://github.com/negbuna/ALS. △ Less

Submitted 7 November, 2025; v1 submitted 10 September, 2025; originally announced September 2025.

arXiv:2509.16254 [pdf, ps, other]

Imaging Modalities-Based Classification for Lung Cancer Detection

Authors: Sajim Ahmed, Muhammad Zain Chaudhary, Muhammad Zohaib Chaudhary, Mahmoud Abbass, Ahmed Sherif, Mohammad Mahbubur Rahman Khan Mamun

Abstract: Lung cancer continues to be the predominant cause of cancer-related mortality globally. This review analyzes various approaches, including advanced image processing methods, focusing on their efficacy in interpreting CT scans, chest radiographs, and biological markers. Notably, we identify critical gaps in the previous surveys, including the need for robust models that can generalize across divers… ▽ More Lung cancer continues to be the predominant cause of cancer-related mortality globally. This review analyzes various approaches, including advanced image processing methods, focusing on their efficacy in interpreting CT scans, chest radiographs, and biological markers. Notably, we identify critical gaps in the previous surveys, including the need for robust models that can generalize across diverse populations and imaging modalities. This comprehensive synthesis aims to serve as a foundational resource for researchers and clinicians, guiding future efforts toward more accurate and efficient lung cancer detection. Key findings reveal that 3D CNN architectures integrated with CT scans achieve the most superior performances, yet challenges such as high false positives, dataset variability, and computational complexity persist across modalities. △ Less

Submitted 17 September, 2025; originally announced September 2025.

Comments: Accepted at ICMI 2025

MSC Class: 68T07; 92C55

arXiv:2509.13334 [pdf, ps, other]

FRIT: Using Causal Importance to Improve Chain-of-Thought Faithfulness

Authors: Anand Swaroop, Akshat Nallani, Saksham Uboweja, Adiliia Uzdenova, Michael Nguyen, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Vasu Sharma, Maheep Chaudhary

Abstract: Chain-of-thought (CoT) reasoning has emerged as a powerful tool for improving large language model performance on complex tasks, but recent work shows that reasoning steps often fail to causally influence the final answer, creating brittle and untrustworthy outputs. Prior approaches focus primarily on measuring faithfulness, while methods for systematically improving it remain limited. We introduc… ▽ More Chain-of-thought (CoT) reasoning has emerged as a powerful tool for improving large language model performance on complex tasks, but recent work shows that reasoning steps often fail to causally influence the final answer, creating brittle and untrustworthy outputs. Prior approaches focus primarily on measuring faithfulness, while methods for systematically improving it remain limited. We introduce Faithful Reasoning via Intervention Training (FRIT), a scalable alignment method that trains models to produce causally consistent reasoning by learning from systematically corrupted examples. FRIT generates synthetic training data by intervening on individual reasoning steps in model-generated CoTs, creating faithful/unfaithful pairs that highlight when reasoning breaks down. We then apply Direct Preference Optimization to teach models to prefer causally consistent reasoning paths. Evaluating on Qwen3-8B and Mistral-7B-v0.1 across factual and symbolic reasoning tasks, FRIT increases faithful reasoning by $3.4$ percentage points for Mistral on GSM8K while improving accuracy by $7.6$ percentage points. Our approach provides the first scalable, supervision-free method for training language models to produce more reliable and interpretable reasoning, addressing a critical gap between reasoning performance and trustworthiness. We release our code at \href{https://github.com/Anut-py/frit}. △ Less

Submitted 10 September, 2025; originally announced September 2025.

arXiv:2509.13333 [pdf, ps, other]

Evaluation Awareness Scales Predictably in Open-Weights Large Language Models

Authors: Maheep Chaudhary, Ian Su, Nikhil Hooda, Nishith Shankar, Julia Tan, Kevin Zhu, Ryan Lagasse, Vasu Sharma, Ashwinee Panda

Abstract: Large language models (LLMs) can internally distinguish between evaluation and deployment contexts, a behaviour known as \emph{evaluation awareness}. This undermines AI safety evaluations, as models may conceal dangerous capabilities during testing. Prior work demonstrated this in a single $70$B model, but the scaling relationship across model sizes remains unknown. We investigate evaluation aware… ▽ More Large language models (LLMs) can internally distinguish between evaluation and deployment contexts, a behaviour known as \emph{evaluation awareness}. This undermines AI safety evaluations, as models may conceal dangerous capabilities during testing. Prior work demonstrated this in a single $70$B model, but the scaling relationship across model sizes remains unknown. We investigate evaluation awareness across $15$ models scaling from $0.27$B to $70$B parameters from four families using linear probing on steering vector activations. Our results reveal a clear power-law scaling: evaluation awareness increases predictably with model size. This scaling law enables forecasting deceptive behavior in future larger models and guides the design of scale-aware evaluation strategies for AI safety. A link to the implementation of this paper can be found at https://anonymous.4open.science/r/evaluation-awareness-scaling-laws/README.md. △ Less

Submitted 9 November, 2025; v1 submitted 10 September, 2025; originally announced September 2025.

arXiv:2508.15099 [pdf, ps, other]

Hydra: A Modular Architecture for Efficient Long-Context Reasoning

Authors: Siddharth Chaudhary, Dev Patel, Maheep Chaudhary, Bennett Browning

Abstract: The quadratic complexity of transformers fundamentally limits reasoning system deployment in resource-constrained and long-context settings. We introduce Hydra, a modular architecture based upon a state-space backbone which adaptively routes between complementary efficiency mechanisms: sparse global attention, mixture-of-experts, and dual memories comprising a reasoning workspace and product key m… ▽ More The quadratic complexity of transformers fundamentally limits reasoning system deployment in resource-constrained and long-context settings. We introduce Hydra, a modular architecture based upon a state-space backbone which adaptively routes between complementary efficiency mechanisms: sparse global attention, mixture-of-experts, and dual memories comprising a reasoning workspace and product key memory. We evaluate a 29M parameter model measuring logical chaining accuracy and throughput on synthetic sequences, plus throughput on WikiText. Ablation studies use component-specific synthetic datasets to isolate individual mechanisms. Hydra achieves $3.01\times$ and $3.0\times$ throughput gains at 8K tokens for synthetic and WikiText datasets, respectively, and $10\times$ accuracy improvements on multi-step logical composition compared to equal-sized transformers. Ablations confirm each component's contribution: sparse attention captures long-range dependencies, experts specialize to input domains, and product key memory enables selective retrieval. △ Less

Submitted 16 October, 2025; v1 submitted 20 August, 2025; originally announced August 2025.

Comments: Updated with the new paper accepted to NeurIPS workshop

arXiv:2508.14067 [pdf, ps, other]

Punctuation and Predicates in Language Models

Authors: Sonakshi Chauhan, Maheep Chaudhary, Koby Choy, Samuel Nellessen, Nandi Schoots

Abstract: In this paper we explore where information is collected and how it is propagated throughout layers in large language models (LLMs). We begin by examining the surprising computational importance of punctuation tokens which previous work has identified as attention sinks and memory aids. Using intervention-based techniques, we evaluate the necessity and sufficiency (for preserving model performance)… ▽ More In this paper we explore where information is collected and how it is propagated throughout layers in large language models (LLMs). We begin by examining the surprising computational importance of punctuation tokens which previous work has identified as attention sinks and memory aids. Using intervention-based techniques, we evaluate the necessity and sufficiency (for preserving model performance) of punctuation tokens across layers in GPT-2, DeepSeek, and Gemma. Our results show stark model-specific differences: for GPT-2, punctuation is both necessary and sufficient in multiple layers, while this holds far less in DeepSeek and not at all in Gemma. Extending beyond punctuation, we ask whether LLMs process different components of input (e.g., subjects, adjectives, punctuation, full sentences) by forming early static summaries reused across the network, or if the model remains sensitive to changes in these components across layers. Extending beyond punctuation, we investigate whether different reasoning rules are processed differently by LLMs. In particular, through interchange intervention and layer-swapping experiments, we find that conditional statements (if, then), and universal quantification (for all) are processed very differently. Our findings offer new insight into the internal mechanisms of punctuation usage and reasoning in LLMs and have implications for interpretability. △ Less

Submitted 11 August, 2025; originally announced August 2025.

arXiv:2507.21532 [pdf, ps, other]

Automatic Classification of User Requirements from Online Feedback -- A Replication Study

Authors: Meet Bhatt, Nic Boilard, Muhammad Rehan Chaudhary, Cole Thompson, Jacob Idoko, Aakash Sorathiya, Gouri Ginde

Abstract: Natural language processing (NLP) techniques have been widely applied in the requirements engineering (RE) field to support tasks such as classification and ambiguity detection. Although RE research is rooted in empirical investigation, it has paid limited attention to replicating NLP for RE (NLP4RE) studies. The rapidly advancing realm of NLP is creating new opportunities for efficient, machine-a… ▽ More Natural language processing (NLP) techniques have been widely applied in the requirements engineering (RE) field to support tasks such as classification and ambiguity detection. Although RE research is rooted in empirical investigation, it has paid limited attention to replicating NLP for RE (NLP4RE) studies. The rapidly advancing realm of NLP is creating new opportunities for efficient, machine-assisted workflows, which can bring new perspectives and results to the forefront. Thus, we replicate and extend a previous NLP4RE study (baseline), "Classifying User Requirements from Online Feedback in Small Dataset Environments using Deep Learning", which evaluated different deep learning models for requirement classification from user reviews. We reproduced the original results using publicly released source code, thereby helping to strengthen the external validity of the baseline study. We then extended the setup by evaluating model performance on an external dataset and comparing results to a GPT-4o zero-shot classifier. Furthermore, we prepared the replication study ID-card for the baseline study, important for evaluating replication readiness. Results showed diverse reproducibility levels across different models, with Naive Bayes demonstrating perfect reproducibility. In contrast, BERT and other models showed mixed results. Our findings revealed that baseline deep learning models, BERT and ELMo, exhibited good generalization capabilities on an external dataset, and GPT-4o showed performance comparable to traditional baseline machine learning models. Additionally, our assessment confirmed the baseline study's replication readiness; however missing environment setup files would have further enhanced readiness. We include this missing information in our replication package and provide the replication study ID-card for our study to further encourage and support the replication of our study. △ Less

Submitted 29 July, 2025; originally announced July 2025.

Comments: 10 pages, 3 figures, Replication package available at https://zenodo.org/records/15626782, Accepted at AIRE 2025 (12th International Workshop on Artificial Intelligence and Requirements Engineering)

arXiv:2507.12367 [pdf, ps, other]

GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities

Authors: Diganta Misra, Nizar Islah, Victor May, Brice Rauby, Zihan Wang, Justine Gehring, Antonio Orvieto, Muawiz Chaudhary, Eilif B. Muller, Irina Rish, Samira Ebrahimi Kahou, Massimo Caccia

Abstract: The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent version updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introdu… ▽ More The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent version updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon 2.0, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon 2.0 rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems encounter significant challenges with this task; enterprise models achieving baseline success rates in the 48-51% range, underscoring the intricacy of the problem. By offering an execution-based benchmark emphasizing the dynamic nature of code libraries, GitChameleon 2.0 enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods. We make the dataset and evaluation code publicly available at https://github.com/mrcabbage972/GitChameleonBenchmark. △ Less

Submitted 21 July, 2025; v1 submitted 16 July, 2025; originally announced July 2025.

Comments: Version 2 of the dataset from: arXiv:2411.05830

arXiv:2505.16014 [pdf, ps, other]

Ranking Free RAG: Replacing Re-ranking with Selection in RAG for Sensitive Domains

Authors: Yash Saxena, Ankur Padia, Mandar S Chaudhary, Kalpa Gunaratna, Srinivasan Parthasarathy, Manas Gaur

Abstract: In sensitive domains, Retrieval-Augmented Generation (RAG) must be interpretable and robust because errors do not just mislead, they invite lawsuits, undermine scholarly credibility, and breach compliance. Stakeholders require traceable evidence, clear rationales for why specific evidence is selected, and safeguards against poisoned or misleading content. Yet current RAG pipelines rely on similari… ▽ More In sensitive domains, Retrieval-Augmented Generation (RAG) must be interpretable and robust because errors do not just mislead, they invite lawsuits, undermine scholarly credibility, and breach compliance. Stakeholders require traceable evidence, clear rationales for why specific evidence is selected, and safeguards against poisoned or misleading content. Yet current RAG pipelines rely on similarity-based retrieval with arbitrary top-k cutoffs, provide no explanation for selections, and remain vulnerable to poisoning attacks. We propose METEORA, which replaces these drawbacks with rationale-driven selection, using explicit reasoning to guide evidence choice, explain decisions, and improve robustness to RAG poisoning. METEORA operates in three stages: (1) a general-purpose LLM is preference-tuned to generate query-conditioned rationales using direct preference optimization; (2) these rationales drive an Evidence Chunk Selection Engine that pairs rationales with retrieved evidence for query-specific relevance and applies elbow detection to choose an adaptive cutoff (optionally expanding context with neighboring chunks); and (3) a Verifier LLM uses the rationales to detect and filter poisoned or misleading evidence before generation. Across six datasets, METEORA achieves 13.41% higher recall and, without expansion, 21.05% higher precision than the strongest baseline. It reduces the evidence needed for comparable recall by 80%, improving downstream answer accuracy by 33.34%, and strengthens adversarial defense by increasing F1 from 0.10 to 0.44. Code is available at: https://anonymous.4open.science/r/METEORA-DC46/README.md △ Less

Submitted 18 January, 2026; v1 submitted 21 May, 2025; originally announced May 2025.

arXiv:2505.14376 [pdf, ps, other]

Graph-Guided Passage Retrieval for Author-Centric Structured Feedback

Authors: Maitreya Prafulla Chitale, Ketaki Mangesh Shetye, Harshit Gupta, Manav Chaudhary, Manish Shrivastava, Vasudeva Varma

Abstract: Obtaining high-quality, pre-submission feedback is a critical bottleneck in the academic publication lifecycle for researchers. We introduce AutoRev, an automated author-centric feedback system that generates structured, actionable guidance prior to formal peer review. AutoRev employs a graph-based retrieval-augmented generation framework that models each paper as a hierarchical document graph, in… ▽ More Obtaining high-quality, pre-submission feedback is a critical bottleneck in the academic publication lifecycle for researchers. We introduce AutoRev, an automated author-centric feedback system that generates structured, actionable guidance prior to formal peer review. AutoRev employs a graph-based retrieval-augmented generation framework that models each paper as a hierarchical document graph, integrating textual and structural representations to retrieve salient content efficiently. By leveraging graph-based passage retrieval, AutoRev substantially reduces LLM input context length, leading to higher-quality feedback generation. Experimental results demonstrate that AutoRev significantly outperforms baselines across multiple automatic evaluation metrics, while achieving strong performance in human evaluations. Code will be released upon acceptance. △ Less

Submitted 9 January, 2026; v1 submitted 20 May, 2025; originally announced May 2025.

arXiv:2505.14300 [pdf, other]

SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors

Authors: Maheep Chaudhary, Fazl Barez

Abstract: High-risk industries like nuclear and aviation use real-time monitoring to detect dangerous system conditions. Similarly, Large Language Models (LLMs) need monitoring safeguards. We propose a real-time framework to predict harmful AI outputs before they occur by using an unsupervised approach that treats normal behavior as the baseline and harmful outputs as outliers. Our study focuses specificall… ▽ More High-risk industries like nuclear and aviation use real-time monitoring to detect dangerous system conditions. Similarly, Large Language Models (LLMs) need monitoring safeguards. We propose a real-time framework to predict harmful AI outputs before they occur by using an unsupervised approach that treats normal behavior as the baseline and harmful outputs as outliers. Our study focuses specifically on backdoor-triggered responses -- where specific input phrases activate hidden vulnerabilities causing the model to generate unsafe content like violence, pornography, or hate speech. We address two key challenges: (1) identifying true causal indicators rather than surface correlations, and (2) preventing advanced models from deception -- deliberately evading monitoring systems. Hence, we approach this problem from an unsupervised lens by drawing parallels to human deception: just as humans exhibit physical indicators while lying, we investigate whether LLMs display distinct internal behavioral signatures when generating harmful content. Our study addresses two critical challenges: 1) designing monitoring systems that capture true causal indicators rather than superficial correlations; and 2)preventing intentional evasion by increasingly capable "Future models''. Our findings show that models can produce harmful content through causal mechanisms and can become deceptive by: (a) alternating between linear and non-linear representations, and (b) modifying feature relationships. To counter this, we developed Safety-Net -- a multi-detector framework that monitors different representation dimensions, successfully detecting harmful behavior even when information is shifted across representational spaces to evade individual monitors. Our evaluation shows 96% accuracy in detecting harmful cases using our unsupervised ensemble approach. △ Less

Submitted 20 May, 2025; originally announced May 2025.

arXiv:2505.04759 [pdf, ps, other]

Exploring Zero-Shot App Review Classification with ChatGPT: Challenges and Potential

Authors: Mohit Chaudhary, Chirag Jain, Preethu Rose Anish

Abstract: App reviews are a critical source of user feedback, offering valuable insights into an app's performance, features, usability, and overall user experience. Effectively analyzing these reviews is essential for guiding app development, prioritizing feature updates, and enhancing user satisfaction. Classifying reviews into functional and non-functional requirements play a pivotal role in distinguishi… ▽ More App reviews are a critical source of user feedback, offering valuable insights into an app's performance, features, usability, and overall user experience. Effectively analyzing these reviews is essential for guiding app development, prioritizing feature updates, and enhancing user satisfaction. Classifying reviews into functional and non-functional requirements play a pivotal role in distinguishing feedback related to specific app features (functional requirements) from feedback concerning broader quality attributes, such as performance, usability, and reliability (non-functional requirements). Both categories are integral to informed development decisions. Traditional approaches to classifying app reviews are hindered by the need for large, domain-specific datasets, which are often costly and time-consuming to curate. This study explores the potential of zero-shot learning with ChatGPT for classifying app reviews into four categories: functional requirement, non-functional requirement, both, or neither. We evaluate ChatGPT's performance on a benchmark dataset of 1,880 manually annotated reviews from ten diverse apps spanning multiple domains. Our findings demonstrate that ChatGPT achieves a robust F1 score of 0.842 in review classification, despite certain challenges and limitations. Additionally, we examine how factors such as review readability and length impact classification accuracy and conduct a manual analysis to identify review categories more prone to misclassification. △ Less

Submitted 7 May, 2025; originally announced May 2025.

arXiv:2504.16331 [pdf, ps, other]

Can Code Language Models Learn Clarification-Seeking Behaviors?

Authors: Jie JW Wu, Manav Chaudhary, Davit Abrahamyan, Arhaan Khaku, Anjiang Wei, Fatemeh H. Fard

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, a gap remains between their output and the problem-solving strategies of human developers. Unlike humans, who spend substantial time disambiguating requirements through iterative dialogue, LLMs often generate code despite ambiguities in natural language requirements, leading to unreliable solu… ▽ More Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, a gap remains between their output and the problem-solving strategies of human developers. Unlike humans, who spend substantial time disambiguating requirements through iterative dialogue, LLMs often generate code despite ambiguities in natural language requirements, leading to unreliable solutions. Different from prior work, we study whether a Code LLM can be fine-tuned to learn clarification-seeking behavior. While recent work has focused on LLM-based agents for iterative code generation, we argue that the ability to recognize and query ambiguous requirements should be intrinsic to the models themselves, especially in agentic AI where models and humans collaborate. We present ClarifyCoder, a framework with synthetic data generation and instruction-tuning that fine-tunes an LLM to identify ambiguities and request clarification before code generation. Our approach has two components: (1) a data synthesis technique that augments programming datasets with scenarios requiring clarification to generate clarification-aware training data, and (2) a fine-tuning strategy that teaches models to prioritize seeking clarification over immediate code generation when faced with incomplete or ambiguous requirements. We also provide an empirical analysis of integrating ClarifyCoder with standard fine-tuning for joint optimization of clarification-awareness and coding ability. Experimental results show that ClarifyCoder achieves a 63% communication rate (40% absolute increase) and a 52% good question rate (30% absolute increase) on ambiguous tasks, significantly improving LLMs' communication capabilities while maintaining code generation performance. △ Less

Submitted 26 September, 2025; v1 submitted 22 April, 2025; originally announced April 2025.

arXiv:2502.02470 [pdf, ps, other]

Studying Cross-cluster Modularity in Neural Networks

Authors: Satvik Golechha, Maheep Chaudhary, Joan Velja, Alessandro Abate, Nandi Schoots

Abstract: An approach to improve neural network interpretability is via clusterability, i.e., splitting a model into disjoint clusters that can be studied independently. We define a measure for clusterability and show that pre-trained models form highly enmeshed clusters via spectral graph clustering. We thus train models to be more modular using a "clusterability loss" function that encourages the formatio… ▽ More An approach to improve neural network interpretability is via clusterability, i.e., splitting a model into disjoint clusters that can be studied independently. We define a measure for clusterability and show that pre-trained models form highly enmeshed clusters via spectral graph clustering. We thus train models to be more modular using a "clusterability loss" function that encourages the formation of non-interacting clusters. We then investigate the emerging properties of these highly clustered models. We find our trained clustered models do not exhibit more task specialization, but do form smaller circuits. We investigate CNNs trained on MNIST and CIFAR, small transformers trained on modular addition, and GPT-2 and Pythia on the Wiki dataset, and Gemma on a Chemistry dataset. This investigation shows what to expect from clustered models. △ Less

Submitted 25 July, 2025; v1 submitted 4 February, 2025; originally announced February 2025.

Comments: 8 pages, under review. arXiv admin note: text overlap with arXiv:2409.15747 (author note: this is an extension of that paper but has different authors)

arXiv:2412.09269 [pdf, other]

Towards Understanding the Robustness of LLM-based Evaluations under Perturbations

Authors: Manav Chaudhary, Harshit Gupta, Savita Bhat, Vasudeva Varma

Abstract: Traditional evaluation metrics like BLEU and ROUGE fall short when capturing the nuanced qualities of generated text, particularly when there is no single ground truth. In this paper, we explore the potential of Large Language Models (LLMs), specifically Google Gemini 1, to serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments ac… ▽ More Traditional evaluation metrics like BLEU and ROUGE fall short when capturing the nuanced qualities of generated text, particularly when there is no single ground truth. In this paper, we explore the potential of Large Language Models (LLMs), specifically Google Gemini 1, to serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments on the SummEval and USR datasets, asking the model to generate both a score as well as a justification for the score. Furthermore, we explore the robustness of the LLM evaluator by using perturbed inputs. Our findings suggest that while LLMs show promise, their alignment with human evaluators is limited, they are not robust against perturbations and significant improvements are required for their standalone use as reliable evaluators for subjective metrics. △ Less

Submitted 12 December, 2024; originally announced December 2024.

Comments: Accepted at ICON 2024

arXiv:2411.07567 [pdf, other]

Uncertainty-Aware Test-Time Adaptation for Inverse Consistent Diffeomorphic Lung Image Registration

Authors: Muhammad F. A. Chaudhary, Stephanie M. Aguilera, Arie Nakhmani, Joseph M. Reinhardt, Surya P. Bhatt, Sandeep Bodduluri

Abstract: Diffeomorphic deformable image registration ensures smooth invertible transformations across inspiratory and expiratory chest CT scans. Yet, in practice, deep learning-based diffeomorphic methods struggle to capture large deformations between inspiratory and expiratory volumes, and therefore lack inverse consistency. Existing methods also fail to account for model uncertainty, which can be useful… ▽ More Diffeomorphic deformable image registration ensures smooth invertible transformations across inspiratory and expiratory chest CT scans. Yet, in practice, deep learning-based diffeomorphic methods struggle to capture large deformations between inspiratory and expiratory volumes, and therefore lack inverse consistency. Existing methods also fail to account for model uncertainty, which can be useful for improving performance. We propose an uncertainty-aware test-time adaptation framework for inverse consistent diffeomorphic lung registration. Our method uses Monte Carlo (MC) dropout to estimate spatial uncertainty that is used to improve model performance. We train and evaluate our method for inspiratory-to-expiratory CT registration on a large cohort of 675 subjects from the COPDGene study, achieving a higher Dice similarity coefficient (DSC) between the lung boundaries (0.966) compared to both VoxelMorph (0.953) and TransMorph (0.953). Our method demonstrates consistent improvements in the inverse registration direction as well with an overall DSC of 0.966, higher than VoxelMorph (0.958) and TransMorph (0.956). Paired t-tests indicate statistically significant improvements. △ Less

Submitted 12 November, 2024; originally announced November 2024.

Comments: 5 pages, 4 figures

arXiv:2409.14703 [pdf, other]

MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification

Authors: Siddhant Bikram Shah, Shuvam Shiwakoti, Maheep Chaudhary, Haohan Wang

Abstract: The complexity of text-embedded images presents a formidable challenge in machine learning given the need for multimodal understanding of multiple aspects of expression conveyed by them. While previous research in multimodal analysis has primarily focused on singular aspects such as hate speech and its subclasses, this study expands this focus to encompass multiple aspects of linguistics: hate, ta… ▽ More The complexity of text-embedded images presents a formidable challenge in machine learning given the need for multimodal understanding of multiple aspects of expression conveyed by them. While previous research in multimodal analysis has primarily focused on singular aspects such as hate speech and its subclasses, this study expands this focus to encompass multiple aspects of linguistics: hate, targets of hate, stance, and humor. We introduce a novel dataset PrideMM comprising 5,063 text-embedded images associated with the LGBTQ+ Pride movement, thereby addressing a serious gap in existing resources. We conduct extensive experimentation on PrideMM by using unimodal and multimodal baseline methods to establish benchmarks for each task. Additionally, we propose a novel framework MemeCLIP for efficient downstream learning while preserving the knowledge of the pre-trained CLIP model. The results of our experiments show that MemeCLIP achieves superior performance compared to previously proposed frameworks on two real-world datasets. We further compare the performance of MemeCLIP and zero-shot GPT-4 on the hate classification task. Finally, we discuss the shortcomings of our model by qualitatively analyzing misclassified samples. Our code and dataset are publicly available at: https://github.com/SiddhantBikram/MemeCLIP. △ Less

Submitted 27 October, 2024; v1 submitted 23 September, 2024; originally announced September 2024.

Comments: Accepted to EMNLP 2024 (Main)

arXiv:2409.04478 [pdf, other]

Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small

Authors: Maheep Chaudhary, Atticus Geiger

Abstract: A popular new method in mechanistic interpretability is to train high-dimensional sparse autoencoders (SAEs) on neuron activations and use SAE features as the atomic units of analysis. However, the body of evidence on whether SAE feature spaces are useful for causal analysis is underdeveloped. In this work, we use the RAVEL benchmark to evaluate whether SAEs trained on hidden representations of GP… ▽ More A popular new method in mechanistic interpretability is to train high-dimensional sparse autoencoders (SAEs) on neuron activations and use SAE features as the atomic units of analysis. However, the body of evidence on whether SAE feature spaces are useful for causal analysis is underdeveloped. In this work, we use the RAVEL benchmark to evaluate whether SAEs trained on hidden representations of GPT-2 small have sets of features that separately mediate knowledge of which country a city is in and which continent it is in. We evaluate four open-source SAEs for GPT-2 small against each other, with neurons serving as a baseline, and linear features learned via distributed alignment search (DAS) serving as a skyline. For each, we learn a binary mask to select features that will be patched to change the country of a city without changing the continent, or vice versa. Our results show that SAEs struggle to reach the neuron baseline, and none come close to the DAS skyline. We release code here: https://github.com/MaheepChaudhary/SAE-Ravel △ Less

Submitted 5 September, 2024; originally announced September 2024.

arXiv:2405.16129 [pdf, other]

iREL at SemEval-2024 Task 9: Improving Conventional Prompting Methods for Brain Teasers

Authors: Harshit Gupta, Manav Chaudhary, Tathagata Raha, Shivansh Subramanian, Vasudeva Varma

Abstract: This paper describes our approach for SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense. The BRAINTEASER task comprises multiple-choice Question Answering designed to evaluate the models' lateral thinking capabilities. It consists of Sentence Puzzle and Word Puzzle subtasks that require models to defy default common-sense associations and exhibit unconventional thinking. We propo… ▽ More This paper describes our approach for SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense. The BRAINTEASER task comprises multiple-choice Question Answering designed to evaluate the models' lateral thinking capabilities. It consists of Sentence Puzzle and Word Puzzle subtasks that require models to defy default common-sense associations and exhibit unconventional thinking. We propose a unique strategy to improve the performance of pre-trained language models, notably the Gemini 1.0 Pro Model, in both subtasks. We employ static and dynamic few-shot prompting techniques and introduce a model-generated reasoning strategy that utilizes the LLM's reasoning capabilities to improve performance. Our approach demonstrated significant improvements, showing that it performed better than the baseline models by a considerable margin but fell short of performing as well as the human annotators, thus highlighting the efficacy of the proposed strategies. △ Less

Submitted 25 May, 2024; originally announced May 2024.

arXiv:2405.11192 [pdf, other]

BrainStorm @ iREL at #SMM4H 2024: Leveraging Translation and Topical Embeddings for Annotation Detection in Tweets

Authors: Manav Chaudhary, Harshit Gupta, Vasudeva Varma

Abstract: The proliferation of LLMs in various NLP tasks has sparked debates regarding their reliability, particularly in annotation tasks where biases and hallucinations may arise. In this shared task, we address the challenge of distinguishing annotations made by LLMs from those made by human domain experts in the context of COVID-19 symptom detection from tweets in Latin American Spanish. This paper pres… ▽ More The proliferation of LLMs in various NLP tasks has sparked debates regarding their reliability, particularly in annotation tasks where biases and hallucinations may arise. In this shared task, we address the challenge of distinguishing annotations made by LLMs from those made by human domain experts in the context of COVID-19 symptom detection from tweets in Latin American Spanish. This paper presents BrainStorm @ iRELs approach to the SMM4H 2024 Shared Task, leveraging the inherent topical information in tweets, we propose a novel approach to identify and classify annotations, aiming to enhance the trustworthiness of annotated data. △ Less

Submitted 20 July, 2024; v1 submitted 18 May, 2024; originally announced May 2024.

Comments: Accepted at SMM4H, colocated at ACL 2024

arXiv:2402.04958 [pdf, other]

Channel-Selective Normalization for Label-Shift Robust Test-Time Adaptation

Authors: Pedro Vianna, Muawiz Chaudhary, Paria Mehrbod, An Tang, Guy Cloutier, Guy Wolf, Michael Eickenberg, Eugene Belilovsky

Abstract: Deep neural networks have useful applications in many different tasks, however their performance can be severely affected by changes in the data distribution. For example, in the biomedical field, their performance can be affected by changes in the data (different machines, populations) between training and test datasets. To ensure robustness and generalization to real-world scenarios, test-time a… ▽ More Deep neural networks have useful applications in many different tasks, however their performance can be severely affected by changes in the data distribution. For example, in the biomedical field, their performance can be affected by changes in the data (different machines, populations) between training and test datasets. To ensure robustness and generalization to real-world scenarios, test-time adaptation has been recently studied as an approach to adjust models to a new data distribution during inference. Test-time batch normalization is a simple and popular method that achieved compelling performance on domain shift benchmarks. It is implemented by recalculating batch normalization statistics on test batches. Prior work has focused on analysis with test data that has the same label distribution as the training data. However, in many practical applications this technique is vulnerable to label distribution shifts, sometimes producing catastrophic failure. This presents a risk in applying test time adaptation methods in deployment. We propose to tackle this challenge by only selectively adapting channels in a deep network, minimizing drastic adaptation that is sensitive to label shifts. Our selection scheme is based on two principles that we empirically motivate: (1) later layers of networks are more sensitive to label shift (2) individual features can be sensitive to specific classes. We apply the proposed technique to three classification tasks, including CIFAR10-C, Imagenet-C, and diagnosis of fatty liver, where we explore both covariate and label distribution shifts. We find that our method allows to bring the benefits of TTA while significantly reducing the risk of failure common in other methods, while being robust to choice in hyperparameters. △ Less

Submitted 29 May, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

Comments: Accepted at the Conference on Lifelong Learning Agents (CoLLAs) 2024

arXiv:2308.14969 [pdf, other]

Uncovering the Hidden Cost of Model Compression

Authors: Diganta Misra, Muawiz Chaudhary, Agam Goyal, Bharat Runwal, Pin Yu Chen

Abstract: In an age dominated by resource-intensive foundation models, the ability to efficiently adapt to downstream tasks is crucial. Visual Prompting (VP), drawing inspiration from the prompting techniques employed in Large Language Models (LLMs), has emerged as a pivotal method for transfer learning in the realm of computer vision. As the importance of efficiency continues to rise, research into model c… ▽ More In an age dominated by resource-intensive foundation models, the ability to efficiently adapt to downstream tasks is crucial. Visual Prompting (VP), drawing inspiration from the prompting techniques employed in Large Language Models (LLMs), has emerged as a pivotal method for transfer learning in the realm of computer vision. As the importance of efficiency continues to rise, research into model compression has become indispensable in alleviating the computational burdens associated with training and deploying over-parameterized neural networks. A primary objective in model compression is to develop sparse and/or quantized models capable of matching or even surpassing the performance of their over-parameterized, full-precision counterparts. Although previous studies have explored the effects of model compression on transfer learning, its impact on visual prompting-based transfer remains unclear. This study aims to bridge this gap, shedding light on the fact that model compression detrimentally impacts the performance of visual prompting-based transfer, particularly evident in scenarios with low data volume. Furthermore, our findings underscore the adverse influence of sparsity on the calibration of downstream visual-prompted models. However, intriguingly, we also illustrate that such negative effects on calibration are not present when models are compressed via quantization. This empirical investigation underscores the need for a nuanced understanding beyond mere accuracy in sparse and quantized settings, thereby paving the way for further exploration in Visual Prompting techniques tailored for sparse and quantized models. △ Less

Submitted 15 March, 2024; v1 submitted 28 August, 2023; originally announced August 2023.

Comments: Preprint

arXiv:2307.16851 [pdf, other]

Towards Trustworthy and Aligned Machine Learning: A Data-centric Survey with Causality Perspectives

Authors: Haoyang Liu, Maheep Chaudhary, Haohan Wang

Abstract: The trustworthiness of machine learning has emerged as a critical topic in the field, encompassing various applications and research areas such as robustness, security, interpretability, and fairness. The last decade saw the development of numerous methods addressing these challenges. In this survey, we systematically review these advancements from a data-centric perspective, highlighting the shor… ▽ More The trustworthiness of machine learning has emerged as a critical topic in the field, encompassing various applications and research areas such as robustness, security, interpretability, and fairness. The last decade saw the development of numerous methods addressing these challenges. In this survey, we systematically review these advancements from a data-centric perspective, highlighting the shortcomings of traditional empirical risk minimization (ERM) training in handling challenges posed by the data. Interestingly, we observe a convergence of these methods, despite being developed independently across trustworthy machine learning subfields. Pearl's hierarchy of causality offers a unifying framework for these techniques. Accordingly, this survey presents the background of trustworthy machine learning development using a unified set of concepts, connects this language to Pearl's causal hierarchy, and finally discusses methods explicitly inspired by causality literature. We provide a unified language with mathematical vocabulary to link these methods across robustness, adversarial robustness, interpretability, and fairness, fostering a more cohesive understanding of the field. Further, we explore the trustworthiness of large pretrained models. After summarizing dominant techniques like fine-tuning, parameter-efficient fine-tuning, prompting, and reinforcement learning with human feedback, we draw connections between them and the standard ERM. This connection allows us to build upon the principled understanding of trustworthy methods, extending it to these new techniques in large pretrained models, paving the way for future methods. Existing methods under this perspective are also reviewed. Lastly, we offer a brief summary of the applications of these methods and discuss potential future aspects related to our survey. For more information, please visit http://trustai.one. △ Less

Submitted 31 July, 2023; originally announced July 2023.

Comments: 47 pages, 13 figures

arXiv:2307.00132 [pdf, other]

iMETRE: Incorporating Markers of Entity Types for Relation Extraction

Authors: N Harsha Vardhan, Manav Chaudhary

Abstract: Sentence-level relation extraction (RE) aims to identify the relationship between 2 entities given a contextual sentence. While there have been many attempts to solve this problem, the current solutions have a lot of room to improve. In this paper, we approach the task of relationship extraction in the financial dataset REFinD. Our approach incorporates typed entity markers representations and var… ▽ More Sentence-level relation extraction (RE) aims to identify the relationship between 2 entities given a contextual sentence. While there have been many attempts to solve this problem, the current solutions have a lot of room to improve. In this paper, we approach the task of relationship extraction in the financial dataset REFinD. Our approach incorporates typed entity markers representations and various models finetuned on the dataset, which has allowed us to achieve an F1 score of 69.65% on the validation set. Through this paper, we discuss various approaches and possible limitations. △ Less

Submitted 30 June, 2023; originally announced July 2023.

arXiv:2305.14815 [pdf, other]

doi 10.18653/v1/2023.findings-emnlp.564

Machine Reading Comprehension using Case-based Reasoning

Authors: Dung Thai, Dhruv Agarwal, Mudit Chaudhary, Wenlong Zhao, Rajarshi Das, Manzil Zaheer, Jay-Yoon Lee, Hannaneh Hajishirzi, Andrew McCallum

Abstract: We present an accurate and interpretable method for answer extraction in machine reading comprehension that is reminiscent of case-based reasoning (CBR) from classical AI. Our method (CBR-MRC) builds upon the hypothesis that contextualized answers to similar questions share semantic similarities with each other. Given a test question, CBR-MRC first retrieves a set of similar cases from a nonparame… ▽ More We present an accurate and interpretable method for answer extraction in machine reading comprehension that is reminiscent of case-based reasoning (CBR) from classical AI. Our method (CBR-MRC) builds upon the hypothesis that contextualized answers to similar questions share semantic similarities with each other. Given a test question, CBR-MRC first retrieves a set of similar cases from a nonparametric memory and then predicts an answer by selecting the span in the test context that is most similar to the contextualized representations of answers in the retrieved cases. The semi-parametric nature of our approach allows it to attribute a prediction to the specific set of evidence cases, making it a desirable choice for building reliable and debuggable QA systems. We show that CBR-MRC provides high accuracy comparable with large reader models and outperforms baselines by 11.5 and 8.4 EM on NaturalQuestions and NewsQA, respectively. Further, we demonstrate the ability of CBR-MRC in identifying not just the correct answer tokens but also the span with the most relevant supporting evidence. Lastly, we observe that contexts for certain question types show higher lexical diversity than others and find that CBR-MRC is robust to these variations while performance using fully-parametric methods drops. △ Less

Submitted 5 December, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: 9 pages, 2 figures

arXiv:2304.04858 [pdf, other]

Simulated Annealing in Early Layers Leads to Better Generalization

Authors: Amirmohammad Sarfi, Zahra Karimpour, Muawiz Chaudhary, Nasir M. Khalid, Mirco Ravanelli, Sudhir Mudur, Eugene Belilovsky

Abstract: Recently, a number of iterative learning methods have been introduced to improve generalization. These typically rely on training for longer periods of time in exchange for improved generalization. LLF (later-layer-forgetting) is a state-of-the-art method in this category. It strengthens learning in early layers by periodically re-initializing the last few layers of the network. Our principal inno… ▽ More Recently, a number of iterative learning methods have been introduced to improve generalization. These typically rely on training for longer periods of time in exchange for improved generalization. LLF (later-layer-forgetting) is a state-of-the-art method in this category. It strengthens learning in early layers by periodically re-initializing the last few layers of the network. Our principal innovation in this work is to use Simulated annealing in EArly Layers (SEAL) of the network in place of re-initialization of later layers. Essentially, later layers go through the normal gradient descent process, while the early layers go through short stints of gradient ascent followed by gradient descent. Extensive experiments on the popular Tiny-ImageNet dataset benchmark and a series of transfer learning and few-shot learning tasks show that we outperform LLF by a significant margin. We further show that, compared to normal training, LLF features, although improving on the target task, degrade the transfer learning performance across all datasets we explored. In comparison, our method outperforms LLF across the same target datasets by a large margin. We also show that the prediction depth of our method is significantly lower than that of LLF and normal training, indicating on average better prediction performance. △ Less

Submitted 10 April, 2023; originally announced April 2023.

arXiv:2301.04709 [pdf, other]

Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability

Authors: Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, Thomas Icard

Abstract: Causal abstraction provides a theoretical foundation for mechanistic interpretability, the field concerned with providing intelligible algorithms that are faithful simplifications of the known, but opaque low-level details of black box AI models. Our contributions are (1) generalizing the theory of causal abstraction from mechanism replacement (i.e., hard and soft interventions) to arbitrary mecha… ▽ More Causal abstraction provides a theoretical foundation for mechanistic interpretability, the field concerned with providing intelligible algorithms that are faithful simplifications of the known, but opaque low-level details of black box AI models. Our contributions are (1) generalizing the theory of causal abstraction from mechanism replacement (i.e., hard and soft interventions) to arbitrary mechanism transformation (i.e., functionals from old mechanisms to new mechanisms), (2) providing a flexible, yet precise formalization for the core concepts of polysemantic neurons, the linear representation hypothesis, modular features, and graded faithfulness, and (3) unifying a variety of mechanistic interpretability methods in the common language of causal abstraction, namely, activation and path patching, causal mediation analysis, causal scrubbing, causal tracing, circuit analysis, concept erasure, sparse autoencoders, differential binary masking, distributed alignment search, and steering. △ Less

Submitted 8 May, 2025; v1 submitted 11 January, 2023; originally announced January 2023.

arXiv:2209.06067 [pdf, other]

SeRP: Self-Supervised Representation Learning Using Perturbed Point Clouds

Authors: Siddhant Garg, Mudit Chaudhary

Abstract: We present SeRP, a framework for Self-Supervised Learning of 3D point clouds. SeRP consists of encoder-decoder architecture that takes perturbed or corrupted point clouds as inputs and aims to reconstruct the original point cloud without corruption. The encoder learns the high-level latent representations of the points clouds in a low-dimensional subspace and recovers the original structure. In th… ▽ More We present SeRP, a framework for Self-Supervised Learning of 3D point clouds. SeRP consists of encoder-decoder architecture that takes perturbed or corrupted point clouds as inputs and aims to reconstruct the original point cloud without corruption. The encoder learns the high-level latent representations of the points clouds in a low-dimensional subspace and recovers the original structure. In this work, we have used Transformers and PointNet-based Autoencoders. The proposed framework also addresses some of the limitations of Transformers-based Masked Autoencoders which are prone to leakage of location information and uneven information density. We trained our models on the complete ShapeNet dataset and evaluated them on ModelNet40 as a downstream classification task. We have shown that the pretrained models achieved 0.5-1% higher classification accuracies than the networks trained from scratch. Furthermore, we also proposed VASP: Vector-Quantized Autoencoder for Self-supervised Representation Learning for Point Clouds that employs Vector-Quantization for discrete representation learning for Transformer-based autoencoders. △ Less

Submitted 13 September, 2022; originally announced September 2022.

Comments: 6 pages

arXiv:2205.01907 [pdf, other]

Cross-lingual Word Embeddings in Hyperbolic Space

Authors: Chandni Saxena, Mudit Chaudhary, Helen Meng

Abstract: Cross-lingual word embeddings can be applied to several natural language processing applications across multiple languages. Unlike prior works that use word embeddings based on the Euclidean space, this short paper presents a simple and effective cross-lingual Word2Vec model that adapts to the Poincaré ball model of hyperbolic space to learn unsupervised cross-lingual word representations from a G… ▽ More Cross-lingual word embeddings can be applied to several natural language processing applications across multiple languages. Unlike prior works that use word embeddings based on the Euclidean space, this short paper presents a simple and effective cross-lingual Word2Vec model that adapts to the Poincaré ball model of hyperbolic space to learn unsupervised cross-lingual word representations from a German-English parallel corpus. It has been shown that hyperbolic embeddings can capture and preserve hierarchical relationships. We evaluate the model on both hypernymy and analogy tasks. The proposed model achieves comparable performance with the vanilla Word2Vec model on the cross-lingual analogy task, the hypernymy task shows that the cross-lingual Poincaré Word2Vec model can capture latent hierarchical structure from free text across languages, which are absent from the Euclidean-based Word2Vec representations. Our results show that by preserving the latent hierarchical information, hyperbolic spaces can offer better representations for cross-lingual embeddings. △ Less

Submitted 25 June, 2022; v1 submitted 4 May, 2022; originally announced May 2022.

Comments: 6 pages

arXiv:2204.08554 [pdf, other]

CBR-iKB: A Case-Based Reasoning Approach for Question Answering over Incomplete Knowledge Bases

Authors: Dung Thai, Srinivas Ravishankar, Ibrahim Abdelaziz, Mudit Chaudhary, Nandana Mihindukulasooriya, Tahira Naseem, Rajarshi Das, Pavan Kapanipathi, Achille Fokoue, Andrew McCallum

Abstract: Knowledge bases (KBs) are often incomplete and constantly changing in practice. Yet, in many question answering applications coupled with knowledge bases, the sparse nature of KBs is often overlooked. To this end, we propose a case-based reasoning approach, CBR-iKB, for knowledge base question answering (KBQA) with incomplete-KB as our main focus. Our method ensembles decisions from multiple reaso… ▽ More Knowledge bases (KBs) are often incomplete and constantly changing in practice. Yet, in many question answering applications coupled with knowledge bases, the sparse nature of KBs is often overlooked. To this end, we propose a case-based reasoning approach, CBR-iKB, for knowledge base question answering (KBQA) with incomplete-KB as our main focus. Our method ensembles decisions from multiple reasoning chains with a novel nonparametric reasoning algorithm. By design, CBR-iKB can seamlessly adapt to changes in KBs without any task-specific training or fine-tuning. Our method achieves 100% accuracy on MetaQA and establishes new state-of-the-art on multiple benchmarks. For instance, CBR-iKB achieves an accuracy of 70% on WebQSP under the incomplete-KB setting, outperforming the existing state-of-the-art method by 22.3%. △ Less

Submitted 18 April, 2022; originally announced April 2022.

Comments: 8 pages, 3 figurs, 4 tables

arXiv:2110.07878 [pdf, other]

Single volume lung biomechanics from chest computed tomography using a mode preserving generative adversarial network

Authors: Muhammad F. A. Chaudhary, Sarah E. Gerard, Di Wang, Gary E. Christensen, Christopher B. Cooper, Joyce D. Schroeder, Eric A. Hoffman, Joseph M. Reinhardt

Abstract: Local tissue expansion of the lungs is typically derived by registering computed tomography (CT) scans acquired at multiple lung volumes. However, acquiring multiple scans incurs increased radiation dose, time, and cost, and may not be possible in many cases, thus restricting the applicability of registration-based biomechanics. We propose a generative adversarial learning approach for estimating… ▽ More Local tissue expansion of the lungs is typically derived by registering computed tomography (CT) scans acquired at multiple lung volumes. However, acquiring multiple scans incurs increased radiation dose, time, and cost, and may not be possible in many cases, thus restricting the applicability of registration-based biomechanics. We propose a generative adversarial learning approach for estimating local tissue expansion directly from a single CT scan. The proposed framework was trained and evaluated on 2500 subjects from the SPIROMICS cohort. Once trained, the framework can be used as a registration-free method for predicting local tissue expansion. We evaluated model performance across varying degrees of disease severity and compared its performance with two image-to-image translation frameworks - UNet and Pix2Pix. Our model achieved an overall PSNR of 18.95 decibels, SSIM of 0.840, and Spearman's correlation of 0.61 at a high spatial resolution of 1 mm3. △ Less

Submitted 15 October, 2021; originally announced October 2021.

Comments: 5 pages, 5 figures

arXiv:2109.02941 [pdf, other]

Countering Online Hate Speech: An NLP Perspective

Authors: Mudit Chaudhary, Chandni Saxena, Helen Meng

Abstract: Online hate speech has caught everyone's attention from the news related to the COVID-19 pandemic, US elections, and worldwide protests. Online toxicity - an umbrella term for online hateful behavior, manifests itself in forms such as online hate speech. Hate speech is a deliberate attack directed towards an individual or a group motivated by the targeted entity's identity or opinions. The rising… ▽ More Online hate speech has caught everyone's attention from the news related to the COVID-19 pandemic, US elections, and worldwide protests. Online toxicity - an umbrella term for online hateful behavior, manifests itself in forms such as online hate speech. Hate speech is a deliberate attack directed towards an individual or a group motivated by the targeted entity's identity or opinions. The rising mass communication through social media further exacerbates the harmful consequences of online hate speech. While there has been significant research on hate-speech identification using Natural Language Processing (NLP), the work on utilizing NLP for prevention and intervention of online hate speech lacks relatively. This paper presents a holistic conceptual framework on hate-speech NLP countering methods along with a thorough survey on the current progress of NLP for countering online hate speech. It classifies the countering techniques based on their time of action, and identifies potential future research areas on this topic. △ Less

Submitted 7 September, 2021; originally announced September 2021.

Comments: 12 pages

arXiv:2108.06206 [pdf, other]

An Intelligent Recommendation-cum-Reminder System

Authors: Rohan Saxena, Maheep Chaudhary, Chandresh Kumar Maurya, Shitala Prasad

Abstract: Intelligent recommendation and reminder systems are the need of the fast-pacing life. Current intelligent systems such as Siri, Google Assistant, Microsoft Cortona, etc., have limited capability. For example, if you want to wake up at 6 am because you have an upcoming trip, you have to set the alarm manually. Besides, these systems do not recommend or remind what else to carry, such as carrying an… ▽ More Intelligent recommendation and reminder systems are the need of the fast-pacing life. Current intelligent systems such as Siri, Google Assistant, Microsoft Cortona, etc., have limited capability. For example, if you want to wake up at 6 am because you have an upcoming trip, you have to set the alarm manually. Besides, these systems do not recommend or remind what else to carry, such as carrying an umbrella during a likely rain. The present work proposes a system that takes an email as input and returns a recommendation-cumreminder list. As a first step, we parse the emails, recognize the entities using named entity recognition (NER). In the second step, information retrieval over the web is done to identify nearby places, climatic conditions, etc. Imperative sentences from the reviews of all places are extracted and passed to the object extraction module. The main challenge lies in extracting the objects (items) of interest from the review. To solve it, a modified Machine Reading Comprehension-NER (MRC-NER) model is trained to tag objects of interest by formulating annotation rules as a query. The objects so found are recommended to the user one day in advance. The final reminder list of objects is pruned by our proposed model for tracking objects kept during the "packing activity." Eventually, when the user leaves for the event/trip, an alert is sent containing the reminding list items. Our approach achieves superior performance compared to several baselines by as much as 30% on recall and 10% on precision. △ Less

Submitted 9 August, 2021; originally announced August 2021.

Comments: 9

arXiv:2107.09539 [pdf, other]

Parametric Scattering Networks

Authors: Shanel Gauthier, Benjamin Thérien, Laurent Alsène-Racicot, Muawiz Chaudhary, Irina Rish, Eugene Belilovsky, Michael Eickenberg, Guy Wolf

Abstract: The wavelet scattering transform creates geometric invariants and deformation stability. In multiple signal domains, it has been shown to yield more discriminative representations compared to other non-learned representations and to outperform learned representations in certain tasks, particularly on limited labeled data and highly structured signals. The wavelet filters used in the scattering tra… ▽ More The wavelet scattering transform creates geometric invariants and deformation stability. In multiple signal domains, it has been shown to yield more discriminative representations compared to other non-learned representations and to outperform learned representations in certain tasks, particularly on limited labeled data and highly structured signals. The wavelet filters used in the scattering transform are typically selected to create a tight frame via a parameterized mother wavelet. In this work, we investigate whether this standard wavelet filterbank construction is optimal. Focusing on Morlet wavelets, we propose to learn the scales, orientations, and aspect ratios of the filters to produce problem-specific parameterizations of the scattering transform. We show that our learned versions of the scattering transform yield significant performance gains in small-sample classification settings over the standard scattering transform. Moreover, our empirical results suggest that traditional filterbank constructions may not always be necessary for scattering transforms to extract effective representations. △ Less

Submitted 15 August, 2022; v1 submitted 20 July, 2021; originally announced July 2021.

ACM Class: F.2.2; I.2.7

arXiv:2102.09024 [pdf, other]

Deep Learning Approaches for Forecasting Strawberry Yields and Prices Using Satellite Images and Station-Based Soil Parameters

Authors: Mohita Chaudhary, Mohamed Sadok Gastli, Lobna Nassar, Fakhri Karray

Abstract: Computational tools for forecasting yields and prices for fresh produce have been based on traditional machine learning approaches or time series modelling. We propose here an alternate approach based on deep learning algorithms for forecasting strawberry yields and prices in Santa Barbara county, California. Building the proposed forecasting model comprises three stages: first, the station-based… ▽ More Computational tools for forecasting yields and prices for fresh produce have been based on traditional machine learning approaches or time series modelling. We propose here an alternate approach based on deep learning algorithms for forecasting strawberry yields and prices in Santa Barbara county, California. Building the proposed forecasting model comprises three stages: first, the station-based ensemble model (ATT-CNN-LSTM-SeriesNet_Ens) with its compound deep learning components, SeriesNet with Gated Recurrent Unit (GRU) and Convolutional Neural Network LSTM with Attention layer (Att-CNN-LSTM), are trained and tested using the station-based soil temperature and moisture data of SantaBarbara as input and the corresponding strawberry yields or prices as output. Secondly, the remote sensing ensemble model (SIM_CNN-LSTM_Ens), which is an ensemble model of Convolutional NeuralNetwork LSTM (CNN-LSTM) models, is trained and tested using satellite images of the same county as input mapped to the same yields and prices as output. These two ensembles forecast strawberry yields and prices with minimal forecasting errors and highest model correlation for five weeks ahead forecasts.Finally, the forecasts of these two models are ensembled to have a final forecasted value for yields and prices by introducing a voting ensemble. Based on an aggregated performance measure (AGM), it is found that this voting ensemble not only enhances the forecasting performance by 5% compared to its best performing component model but also outperforms the Deep Learning (DL) ensemble model found in literature by 33% for forecasting yields and 21% for forecasting prices △ Less

Submitted 17 February, 2021; originally announced February 2021.

Comments: Paper Accepted in Association for the Advancement of Artificial Intelligence (AAAI) Spring Symposium on 21st Jan, 2021

Journal ref: AAAI 2021 Spring Symposium on Combining Machine Learning and Knowledge Engineering (AAAI-MAKE 2021)

arXiv:2101.06066 [pdf, other]

Unstructured Knowledge Access in Task-oriented Dialog Modeling using Language Inference, Knowledge Retrieval and Knowledge-Integrative Response Generation

Authors: Mudit Chaudhary, Borislav Dzodzo, Sida Huang, Chun Hei Lo, Mingzhi Lyu, Lun Yiu Nie, Jinbo Xing, Tianhua Zhang, Xiaoying Zhang, Jingyan Zhou, Hong Cheng, Wai Lam, Helen Meng

Abstract: Dialog systems enriched with external knowledge can handle user queries that are outside the scope of the supporting databases/APIs. In this paper, we follow the baseline provided in DSTC9 Track 1 and propose three subsystems, KDEAK, KnowleDgEFactor, and Ens-GPT, which form the pipeline for a task-oriented dialog system capable of accessing unstructured knowledge. Specifically, KDEAK performs know… ▽ More Dialog systems enriched with external knowledge can handle user queries that are outside the scope of the supporting databases/APIs. In this paper, we follow the baseline provided in DSTC9 Track 1 and propose three subsystems, KDEAK, KnowleDgEFactor, and Ens-GPT, which form the pipeline for a task-oriented dialog system capable of accessing unstructured knowledge. Specifically, KDEAK performs knowledge-seeking turn detection by formulating the problem as natural language inference using knowledge from dialogs, databases and FAQs. KnowleDgEFactor accomplishes the knowledge selection task by formulating a factorized knowledge/document retrieval problem with three modules performing domain, entity and knowledge level analyses. Ens-GPT generates a response by first processing multiple knowledge snippets, followed by an ensemble algorithm that decides if the response should be solely derived from a GPT2-XL model, or regenerated in combination with the top-ranking knowledge snippet. Experimental results demonstrate that the proposed pipeline system outperforms the baseline and generates high-quality responses, achieving at least 58.77% improvement on BLEU-4 score. △ Less

Submitted 15 January, 2021; originally announced January 2021.

arXiv:2008.11643 [pdf, other]

HydaLearn: Highly Dynamic Task Weighting for Multi-task Learning with Auxiliary Tasks

Authors: Sam Verboven, Muhammad Hafeez Chaudhary, Jeroen Berrevoets, Wouter Verbeke

Abstract: Multi-task learning (MTL) can improve performance on a task by sharing representations with one or more related auxiliary-tasks. Usually, MTL-networks are trained on a composite loss function formed by a constant weighted combination of the separate task losses. In practice, constant loss weights lead to poor results for two reasons: (i) the relevance of the auxiliary tasks can gradually drift thr… ▽ More Multi-task learning (MTL) can improve performance on a task by sharing representations with one or more related auxiliary-tasks. Usually, MTL-networks are trained on a composite loss function formed by a constant weighted combination of the separate task losses. In practice, constant loss weights lead to poor results for two reasons: (i) the relevance of the auxiliary tasks can gradually drift throughout the learning process; (ii) for mini-batch based optimisation, the optimal task weights vary significantly from one update to the next depending on mini-batch sample composition. We introduce HydaLearn, an intelligent weighting algorithm that connects main-task gain to the individual task gradients, in order to inform dynamic loss weighting at the mini-batch level, addressing i and ii. Using HydaLearn, we report performance increases on synthetic data, as well as on two supervised learning domains. △ Less

Submitted 26 August, 2020; originally announced August 2020.

arXiv:2008.07092 [pdf, other]

Understanding Brain Dynamics for Color Perception using Wearable EEG headband

Authors: Mahima Chaudhary, Sumona Mukhopadhyay, Marin Litoiu, Lauren E Sergio, Meaghan S Adams

Abstract: The perception of color is an important cognitive feature of the human brain. The variety of colors that impinge upon the human eye can trigger changes in brain activity which can be captured using electroencephalography (EEG). In this work, we have designed a multiclass classification model to detect the primary colors from the features of raw EEG signals. In contrast to previous research, our me… ▽ More The perception of color is an important cognitive feature of the human brain. The variety of colors that impinge upon the human eye can trigger changes in brain activity which can be captured using electroencephalography (EEG). In this work, we have designed a multiclass classification model to detect the primary colors from the features of raw EEG signals. In contrast to previous research, our method employs spectral power features, statistical features as well as correlation features from the signal band power obtained from continuous Morlet wavelet transform instead of raw EEG, for the classification task. We have applied dimensionality reduction techniques such as Forward Feature Selection and Stacked Autoencoders to reduce the dimension of data eventually increasing the model's efficiency. Our proposed methodology using Forward Selection and Random Forest Classifier gave the best overall accuracy of 80.6\% for intra-subject classification. Our approach shows promise in developing techniques for cognitive tasks using color cues such as controlling Internet of Thing (IoT) devices by looking at primary colors for individuals with restricted motor abilities. △ Less

Submitted 17 August, 2020; originally announced August 2020.

Comments: 10 pages,10 figures, Conference- EVOKE CASCON 2020

Journal ref: Proceedings of 30th Annual International Conference on Computer Science and Software Engineering 2020

arXiv:2007.09181 [pdf, other]

Network Learning Approaches to study World Happiness

Authors: Siddharth Dixit, Meghna Chaudhary, Niteesh Sahni

Abstract: The United Nations in its 2011 resolution declared the pursuit of happiness a fundamental human goal and proposed public and economic policies centered around happiness. In this paper we used 2 types of computational strategies viz. Predictive Modelling and Bayesian Networks (BNs) to model the processed historical happiness index data of 156 nations published by UN since 2012. We attacked the prob… ▽ More The United Nations in its 2011 resolution declared the pursuit of happiness a fundamental human goal and proposed public and economic policies centered around happiness. In this paper we used 2 types of computational strategies viz. Predictive Modelling and Bayesian Networks (BNs) to model the processed historical happiness index data of 156 nations published by UN since 2012. We attacked the problem of prediction using General Regression Neural Networks (GRNNs) and show that it out performs other state of the art predictive models. To understand causal links amongst key features that have been proven to have a significant impact on world happiness, we first used a manual discretization scheme to discretize continuous variables into 3 levels viz. Low, Medium and High. A consensus World Happiness BN structure was then fixed after amalgamating information by learning 10000 different BNs using bootstrapping. Lastly, exact inference through conditional probability queries was used on this BN to unravel interesting relationships among the important features affecting happiness which would be useful in policy making. △ Less

Submitted 17 July, 2020; originally announced July 2020.

Comments: 13 Pages, 8 figures

arXiv:1812.11214 [pdf, ps, other]

Kymatio: Scattering Transforms in Python

Authors: Mathieu Andreux, Tomás Angles, Georgios Exarchakis, Roberto Leonarduzzi, Gaspar Rochette, Louis Thiry, John Zarka, Stéphane Mallat, Joakim andén, Eugene Belilovsky, Joan Bruna, Vincent Lostanlen, Muawiz Chaudhary, Matthew J. Hirn, Edouard Oyallon, Sixin Zhang, Carmine Cella, Michael Eickenberg

Abstract: The wavelet scattering transform is an invariant signal representation suitable for many signal processing and machine learning applications. We present the Kymatio software package, an easy-to-use, high-performance Python implementation of the scattering transform in 1D, 2D, and 3D that is compatible with modern deep learning frameworks. All transforms may be executed on a GPU (in addition to CPU… ▽ More The wavelet scattering transform is an invariant signal representation suitable for many signal processing and machine learning applications. We present the Kymatio software package, an easy-to-use, high-performance Python implementation of the scattering transform in 1D, 2D, and 3D that is compatible with modern deep learning frameworks. All transforms may be executed on a GPU (in addition to CPU), offering a considerable speed up over CPU implementations. The package also has a small memory footprint, resulting inefficient memory usage. The source code, documentation, and examples are available undera BSD license at https://www.kymat.io/ △ Less

Submitted 31 May, 2022; v1 submitted 28 December, 2018; originally announced December 2018.

arXiv:1805.02679 [pdf]

Multichannel Distributed Local Pattern for Content Based Indexing and Retrieval

Authors: Sonakshi Mathur, Mallika Chaudhary, Hemant Verma, Murari Mandal, S. K. Vipparthi, Subrahmanyam Murala

Abstract: A novel color feature descriptor, Multichannel Distributed Local Pattern (MDLP) is proposed in this manuscript. The MDLP combines the salient features of both local binary and local mesh patterns in the neighborhood. The multi-distance information computed by the MDLP aids in robust extraction of the texture arrangement. Further, MDLP features are extracted for each color channel of an image. The… ▽ More A novel color feature descriptor, Multichannel Distributed Local Pattern (MDLP) is proposed in this manuscript. The MDLP combines the salient features of both local binary and local mesh patterns in the neighborhood. The multi-distance information computed by the MDLP aids in robust extraction of the texture arrangement. Further, MDLP features are extracted for each color channel of an image. The retrieval performance of the MDLP is evaluated on the three benchmark datasets for CBIR, namely Corel-5000, Corel-10000 and MIT-Color Vistex respectively. The proposed technique attains substantial improvement as compared to other state-of- the-art feature descriptors in terms of various evaluation parameters such as ARP and ARR on the respective databases. △ Less

Submitted 7 May, 2018; originally announced May 2018.

Comments: Accepted in INDICON-2017

Showing 1–50 of 51 results for author: Chaudhary, M