REaR: Retrieve, Expand and Refine for Effective Multitable Retrieval

Authors: Rishita Agarwal, Himanshu Singhal, Peter Baile Chen, Manan Roy Choudhury, Dan Roth, Vivek Gupta

Abstract: Answering natural language queries over relational data often requires retrieving and reasoning over multiple tables, yet most retrievers optimize only for query-table relevance and ignore table-table compatibility. We introduce REAR (Retrieve, Expand and Refine), a three-stage, LLM-free framework that separates semantic relevance from structural joinability for efficient, high-fidelity multi-table retrieval. REAR (i) retrieves query-aligned tables, (ii) expands these with structurally joinable tables via fast, precomputed column-embedding comparisons, and (iii) refines them by pruning noisy or weakly related candidates. Empirically, REAR is retriever-agnostic and consistently improves dense/sparse retrievers on complex table QA datasets (BIRD, MMQA, and Spider) by improving both multi-table retrieval quality and downstream SQL execution. Despite being LLM-free, it delivers performance competitive with state-of-the-art LLM-augmented retrieval systems (e.g., ARM) while achieving much lower latency and cost. Ablations confirm complementary gains from expansion and refinement, underscoring REAR as a practical, scalable building block for table-based downstream tasks (e.g., Text-to-SQL).

Submitted 2 November, 2025; originally announced November 2025.

Comments: 13 pages, 2 figures, 8 tables
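
The three stages named in the REAR abstract above (retrieve, expand, refine) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the cosine-similarity scoring, the thresholds, and the hand-made vectors are all assumptions chosen only to show how query relevance and column-level joinability might be kept separate.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, table_vecs, k):
    """Stage 1 (assumed form): top-k tables by query-table similarity."""
    ranked = sorted(table_vecs, key=lambda t: cosine(query_vec, table_vecs[t]), reverse=True)
    return ranked[:k]

def expand(seeds, column_vecs, join_threshold):
    """Stage 2 (assumed form): add tables that have a column whose precomputed
    embedding is close to some column of a seed table."""
    expanded = list(seeds)
    for cand, cols in column_vecs.items():
        if cand in expanded:
            continue
        joinable = any(
            cosine(c, s) >= join_threshold
            for seed in seeds
            for s in column_vecs[seed]
            for c in cols
        )
        if joinable:
            expanded.append(cand)
    return expanded

def refine(query_vec, candidates, table_vecs, seeds, min_relevance):
    """Stage 3 (assumed form): keep seeds, drop expanded tables weakly related to the query."""
    return [
        t for t in candidates
        if t in seeds or cosine(query_vec, table_vecs[t]) >= min_relevance
    ]

# Hypothetical toy embeddings; real systems would use learned encoders.
query_vec = [1.0, 0.0, 0.0]
table_vecs = {
    "orders": [0.9, 0.1, 0.0],
    "customers": [0.3, 0.9, 0.0],
    "weather": [0.0, 0.0, 1.0],
    "logs": [0.0, 1.0, 0.0],
}
column_vecs = {
    "orders": [[1.0, 0.0], [0.0, 1.0]],     # e.g. order_id, customer_id
    "customers": [[0.0, 1.0], [0.5, 0.5]],  # shares a customer_id-like column
    "weather": [[1.0, 1.0]],                # no close column match
    "logs": [[0.05, 1.0]],                  # joinable, but irrelevant to the query
}

seeds = retrieve(query_vec, table_vecs, k=1)
expanded = expand(seeds, column_vecs, join_threshold=0.95)
final = refine(query_vec, expanded, table_vecs, seeds, min_relevance=0.2)
print(seeds, expanded, final)
```

In this toy run, expansion pulls in both `customers` and `logs` because each has a column close to one in `orders`, and refinement then drops `logs` because it has no similarity to the query. Column embeddings are computed once per table, which is the property the abstract relies on for low-latency expansion.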

arXiv:2510.07557 [pdf, ps, other]

Investigating Thematic Patterns and User Preferences in LLM Interactions using BERTopic

Authors: Abhay Bhandarkar, Gaurav Mishra, Khushi Juchani, Harsh Singhal

Abstract: This study applies BERTopic, a transformer-based topic modeling technique, to the lmsys-chat-1m dataset, a multilingual conversational corpus built from head-to-head evaluations of large language models (LLMs). Each user prompt is paired with two anonymized LLM responses and a human preference label, used to assess user evaluation of competing model outputs. The main objective is uncovering thematic patterns in these conversations and examining their relation to user preferences, particularly if certain LLMs are consistently preferred within specific topics. A robust preprocessing pipeline was designed for multilingual variation, balancing dialogue turns, and cleaning noisy or redacted data. BERTopic extracted over 29 coherent topics including artificial intelligence, programming, ethics, and cloud infrastructure. We analysed relationships between topics and model preferences to identify trends in model-topic alignment. Visualization techniques included inter-topic distance maps, topic probability distributions, and model-versus-topic matrices. Our findings inform domain-specific fine-tuning and optimization strategies for improving real-world LLM performance and user satisfaction.

Submitted 8 October, 2025; originally announced October 2025.

arXiv:2312.17300 [pdf, ps, other]

Improving Intrusion Detection with Domain-Invariant Representation Learning in Latent Space

Authors: Padmaksha Roy, Tyler Cody, Himanshu Singhal, Kevin Choi, Ming Jin

Abstract: Zero-day anomaly detection is critical in industrial applications where novel, unforeseen threats can compromise system integrity and safety. Traditional detection systems often fail to identify these unseen anomalies due to their reliance on in-distribution data. Domain generalization addresses this gap by leveraging knowledge from multiple known domains to detect out-of-distribution events. In this work, we introduce a multi-task representation learning technique that fuses information across related domains into a unified latent space. By jointly optimizing classification, reconstruction, and mutual information regularization losses, our method learns a minimal (bottleneck), domain-invariant representation that discards spurious correlations. This latent space decorrelation enhances generalization, enabling the detection of anomalies in unseen domains. Our experimental results demonstrate significant improvements in zero-day or novel anomaly detection across diverse anomaly detection datasets.

Submitted 16 October, 2025; v1 submitted 28 December, 2023; originally announced December 2023.

Journal ref: European Conference of Machine Learning 2025

arXiv:2105.06558 [pdf]

Bias, Fairness, and Accountability with AI and ML Algorithms

Authors: Nengfeng Zhou, Zach Zhang, Vijayan N. Nair, Harsh Singhal, Jie Chen, Agus Sudjianto

Abstract: The advent of AI and ML algorithms has led to opportunities as well as challenges. In this paper, we provide an overview of bias and fairness issues that arise with the use of ML algorithms. We describe the types and sources of data bias, and discuss the nature of algorithmic unfairness. This is followed by a review of fairness metrics in the literature, discussion of their limitations, and a description of de-biasing (or mitigation) techniques in the model life cycle.

Submitted 13 May, 2021; originally announced May 2021.

Comments: 18 pages, 5 figures

MSC Class: 00-02

Showing 1–4 of 4 results for author: Singhal, H