
Showing 1–50 of 146 results for author: Kankanhalli, M

Searching in archive cs.
  1. arXiv:2603.14005  [pdf, ps, other]

    cs.CV

    Towards Generalizable Deepfake Detection via Real Distribution Bias Correction

    Authors: Ming-Hui Liu, Harry Cheng, Xin Luo, Xin-Shun Xu, Mohan S. Kankanhalli

    Abstract: To generalize deepfake detectors to future unseen forgeries, most existing methods attempt to simulate the dynamically evolving forgery types using available source domain data. However, predicting an unbounded set of future manipulations from limited prior examples is infeasible. To overcome this limitation, we propose to exploit the invariance of real data from two complementary perspec…

    Submitted 14 March, 2026; originally announced March 2026.

    Comments: First Version

  2. arXiv:2603.06220  [pdf, ps, other]

    cs.CV

    Word-Anchored Temporal Forgery Localization

    Authors: Tianyi Wang, Xi Shao, Harry Cheng, Yinglong Wang, Mohan Kankanhalli

    Abstract: Current temporal forgery localization (TFL) approaches typically rely on temporal boundary regression or continuous frame-level anomaly detection paradigms to derive candidate forgery proposals. However, they suffer not only from feature granularity misalignment but also from costly computation. To address these issues, we propose word-anchored temporal forgery localization (WAFL), a novel paradig…

    Submitted 6 March, 2026; originally announced March 2026.

    Comments: Submitted for review

  3. arXiv:2602.13033  [pdf, ps, other]

    cs.CY cs.AI cs.CE cs.CL cs.SI

    Buy versus Build an LLM: A Decision Framework for Governments

    Authors: Jiahao Lu, Ziwei Xu, William Tjhi, Junnan Li, Antoine Bosselut, Pang Wei Koh, Mohan Kankanhalli

    Abstract: Large Language Models (LLMs) represent a new frontier of digital infrastructure that can support a wide range of public-sector applications, from general-purpose citizen services to specialized and sensitive state functions. When expanding AI access, governments face a set of strategic choices over whether to buy existing services, build domestic capabilities, or adopt hybrid approaches across dif…

    Submitted 23 February, 2026; v1 submitted 13 February, 2026; originally announced February 2026.

    Comments: The short version of this document is published as an ACM TechBrief at https://dl.acm.org/doi/epdf/10.1145/3797946, and this document is published as an ACM Technology Policy Council white paper at https://www.acm.org/binaries/content/assets/public-policy/buildvsbuyai.pdf

    ACM Class: K.4.1; K.1; K.4.2; K.4.3; K.5.2; K.6.1; J.1

  4. arXiv:2602.06623  [pdf, ps, other]

    cs.CL cs.CR

    Do Prompts Guarantee Safety? Mitigating Toxicity from LLM Generations through Subspace Intervention

    Authors: Himanshu Singh, Ziwei Xu, A. V. Subramanyam, Mohan Kankanhalli

    Abstract: Large Language Models (LLMs) are powerful text generators, yet they can produce toxic or harmful content even when given seemingly harmless prompts. This presents a serious safety challenge and can cause real-world harm. Toxicity is often subtle and context-dependent, making it difficult to detect at the token level or through coarse sentence-level signals. Moreover, efforts to mitigate toxicity o…

    Submitted 6 February, 2026; originally announced February 2026.

  5. arXiv:2601.19231  [pdf, ps, other]

    cs.CR

    LLMs Can Unlearn Refusal with Only 1,000 Benign Samples

    Authors: Yangyang Guo, Ziwei Xu, Si Liu, Zhiming Zheng, Mohan Kankanhalli

    Abstract: This study reveals a previously unexplored vulnerability in the safety alignment of Large Language Models (LLMs). Existing aligned LLMs predominantly respond to unsafe queries with refusals, which often begin with a fixed set of prefixes ("I'm sorry"). We demonstrate that this rigid refusal pattern is a vulnerability and introduce a novel refusal unlearning technique that exploits it. Speci…

    Submitted 27 January, 2026; originally announced January 2026.

  6. arXiv:2601.08790  [pdf, ps, other]

    cs.CV

    Aggregating Diverse Cue Experts for AI-Generated Image Detection

    Authors: Lei Tan, Shuwei Li, Mohan Kankanhalli, Robby T. Tan

    Abstract: The rapid emergence of image synthesis models poses challenges to the generalization of AI-generated image detectors. However, existing methods often rely on model-specific features, leading to overfitting and poor generalization. In this paper, we introduce the Multi-Cue Aggregation Network (MCAN), a novel framework that integrates different yet complementary cues in a unified network. MCAN emplo…

    Submitted 13 January, 2026; originally announced January 2026.

    Comments: Accepted by AAAI 2026

  7. arXiv:2512.22933  [pdf, ps, other]

    cs.AI cs.CL

    Multimodal Fact-Checking: An Agent-based Approach

    Authors: Danni Xu, Shaojing Fan, Harry Cheng, Mohan Kankanhalli

    Abstract: The rapid spread of multimodal misinformation poses a growing challenge for automated fact-checking systems. Existing approaches, including large vision language models (LVLMs) and deep multimodal fusion methods, often fall short due to limited reasoning and shallow evidence utilization. A key bottleneck is the lack of dedicated datasets that provide complete real-world multimodal misinformation i…

    Submitted 4 January, 2026; v1 submitted 28 December, 2025; originally announced December 2025.

    Comments: Code and dataset will be released at https://github.com/xudanni0927/AgentFact

  8. arXiv:2512.18448  [pdf, ps, other]

    cs.CV

    Object-Centric Framework for Video Moment Retrieval

    Authors: Zongyao Li, Yongkang Wong, Satoshi Yamazaki, Jianquan Liu, Mohan Kankanhalli

    Abstract: Most existing video moment retrieval methods rely on temporal sequences of frame- or clip-level features that primarily encode global visual and semantic information. However, such representations often fail to capture fine-grained object semantics and appearance, which are crucial for localizing moments described by object-oriented queries involving specific entities and their interactions. In pa…

    Submitted 20 December, 2025; originally announced December 2025.

    Comments: AAAI2026

  9. arXiv:2511.18370  [pdf, ps, other]

    cs.CV cs.GR

    MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer

    Authors: Zenghao Chai, Chen Tang, Yongkang Wong, Xulei Yang, Mohan Kankanhalli

    Abstract: 3D pose transfer aims to transfer the pose-style of a source mesh to a target character while preserving both the target's geometry and the source's pose characteristic. Existing methods are largely restricted to characters with similar structures and fail to generalize to category-free settings (e.g., transferring a humanoid's pose to a quadruped). The key challenge lies in the structural and tra…

    Submitted 25 March, 2026; v1 submitted 23 November, 2025; originally announced November 2025.

    Comments: Accepted to CVPR 2026. Project page: https://mimicat3d.github.io/

  10. arXiv:2510.10060  [pdf, ps, other]

    cs.LG cs.AI cs.CL cs.CV

    Translution: Unifying Self-attention and Convolution for Adaptive and Relative Modeling

    Authors: Hehe Fan, Yi Yang, Mohan Kankanhalli, Fei Wu

    Abstract: When modeling a given type of data, we consider it to involve two key aspects: 1) identifying relevant elements (e.g., image pixels or textual words) to a central element, as in a convolutional receptive field, or to a query element, as in self-attention, and 2) encoding these tokens effectively. Self-attention can adaptively identify these elements but relies on absolute positional embedding for…

    Submitted 11 October, 2025; originally announced October 2025.

    Comments: technical report

  11. arXiv:2509.08008  [pdf, ps, other]

    cs.SI cs.AI cs.MM

    A New Dataset and Benchmark for Grounding Multimodal Misinformation

    Authors: Bingjian Yang, Danni Xu, Kaipeng Niu, Wenxuan Liu, Zheng Wang, Mohan Kankanhalli

    Abstract: The proliferation of online misinformation videos poses serious societal risks. Current datasets and detection methods primarily target binary classification or single-modality localization based on post-processed data, lacking the interpretability needed to counter persuasive misinformation. In this paper, we introduce the task of Grounding Multimodal Misinformation (GroundMM), which verifies mul…

    Submitted 8 September, 2025; originally announced September 2025.

    Comments: 6 pages, 5 figures, ACM Multimedia 2025 Dataset Track

  12. Nearest Neighbor Projection Removal Adversarial Training

    Authors: Himanshu Singh, A. V. Subramanyam, Shivank Rajput, Mohan Kankanhalli

    Abstract: Deep neural networks have exhibited impressive performance in image classification tasks but remain vulnerable to adversarial examples. Standard adversarial training enhances robustness but typically fails to explicitly address inter-class feature overlap, a significant contributor to adversarial susceptibility. In this work, we introduce a novel adversarial training framework that actively mitiga…

    Submitted 8 April, 2026; v1 submitted 9 September, 2025; originally announced September 2025.

    MSC Class: 68T45 (Primary); 68T10 (Secondary)

    ACM Class: I.5.4

  13. arXiv:2508.13246  [pdf, ps, other]

    cs.CR cs.AI

    Involuntary Jailbreak: On Self-Prompting Attacks

    Authors: Yangyang Guo, Yangyan Li, Mohan Kankanhalli

    Abstract: In this study, we disclose a worrying new vulnerability in Large Language Models (LLMs), which we term involuntary jailbreak. Unlike existing jailbreak attacks, this weakness is distinct in that it does not involve a specific attack objective, such as generating instructions for building a bomb. Prior attack methods predominantly target localized components of the LLM guardrail.…

    Submitted 27 December, 2025; v1 submitted 18 August, 2025; originally announced August 2025.

  14. arXiv:2508.10769  [pdf, ps, other]

    cs.AI cs.MM

    Modeling Human Responses to Multimodal AI Content

    Authors: Zhiqi Shen, Shaojing Fan, Danni Xu, Terence Sim, Mohan Kankanhalli

    Abstract: As AI-generated content becomes widespread, so does the risk of misinformation. While prior research has primarily focused on identifying whether content is authentic, much less is known about how such content influences human perception and behavior. In domains like trading or the stock market, predicting how people react (e.g., whether a news post will go viral) can be more critical than verify…

    Submitted 14 August, 2025; originally announced August 2025.

  15. arXiv:2507.02645  [pdf, ps, other]

    cs.LG cs.CV

    Fair Deepfake Detectors Can Generalize

    Authors: Harry Cheng, Ming-Hui Liu, Yangyang Guo, Tianyi Wang, Liqiang Nie, Mohan Kankanhalli

    Abstract: Deepfake detection models face two critical challenges: generalization to unseen manipulations and demographic fairness among population groups. However, existing approaches often demonstrate that these two objectives are inherently conflicting, revealing a trade-off between them. In this paper, we, for the first time, uncover and formally define a causal relationship between fairness and generali…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 14 pages, version 1

  16. arXiv:2506.20702  [pdf]

    cs.AI cs.CY

    The Singapore Consensus on Global AI Safety Research Priorities

    Authors: Yoshua Bengio, Tegan Maharaj, Luke Ong, Stuart Russell, Dawn Song, Max Tegmark, Lan Xue, Ya-Qin Zhang, Stephen Casper, Wan Sie Lee, Sören Mindermann, Vanessa Wilfred, Vidhisha Balachandran, Fazl Barez, Michael Belinsky, Imane Bello, Malo Bourgon, Mark Brakel, Siméon Campos, Duncan Cass-Beggs, Jiahao Chen, Rumman Chowdhury, Kuan Chua Seah, Jeff Clune, Juntao Dai , et al. (63 additional authors not shown)

    Abstract: Rapidly improving AI capabilities and autonomy hold significant promise of transformation, but are also driving vigorous debate on how to ensure that AI is safe, i.e., trustworthy, reliable, and secure. Building a trusted ecosystem is therefore essential -- it helps people embrace AI with confidence and gives maximal space for innovation while avoiding backlash. The "2025 Singapore Conference on…

    Submitted 30 June, 2025; v1 submitted 25 June, 2025; originally announced June 2025.

    Comments: Final report from the "2025 Singapore Conference on AI (SCAI)" held April 26: https://www.scai.gov.sg/2025/scai2025-report

  17. arXiv:2506.17279  [pdf, ps, other]

    cs.CR cs.AI

    Step-by-Step Reasoning Attack: Revealing 'Erased' Knowledge in Large Language Models

    Authors: Yash Sinha, Manit Baser, Murari Mandal, Dinil Mon Divakaran, Mohan Kankanhalli

    Abstract: Knowledge erasure in large language models (LLMs) is important for ensuring compliance with data and AI regulations, safeguarding user privacy, and mitigating bias and misinformation. Existing unlearning methods aim to make the process of knowledge erasure more efficient and effective by removing specific knowledge while preserving overall model performance, especially for retained information. Howev…

    Submitted 14 June, 2025; originally announced June 2025.

  18. arXiv:2505.23788  [pdf, ps, other]

    cs.CL cs.AI

    Nine Ways to Break Copyright Law and Why Our LLM Won't: A Fair Use Aligned Generation Framework

    Authors: Aakash Sen Sharma, Debdeep Sanyal, Priyansh Srivastava, Sundar Atreya H., Shirish Karande, Mohan Kankanhalli, Murari Mandal

    Abstract: Large language models (LLMs) commonly risk copyright infringement by reproducing protected content verbatim or with insufficient transformative modifications, posing significant ethical, legal, and practical concerns. Current inference-time safeguards predominantly rely on restrictive refusal-based filters, often compromising the practical utility of these models. To address this, we collaborated…

    Submitted 25 May, 2025; originally announced May 2025.

    Comments: 30 pages

  19. arXiv:2505.20296  [pdf, ps, other]

    cs.CL cs.AI cs.LG cs.MM

    Reasoning LLMs are Wandering Solution Explorers

    Authors: Jiahao Lu, Ziwei Xu, Mohan Kankanhalli

    Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning abilities through test-time computation (TTC) techniques such as chain-of-thought prompting and tree-based reasoning. However, we argue that current reasoning LLMs (RLLMs) lack the ability to systematically explore the solution space. This paper formalizes what constitutes systematic problem solving and identifies common failure m…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: 71 pages, 14 figures, 2 tables

  20. arXiv:2505.19165  [pdf, ps, other]

    cs.AI

    OrgAccess: A Benchmark for Role Based Access Control in Organization Scale LLMs

    Authors: Debdeep Sanyal, Umakanta Maharana, Yash Sinha, Hong Ming Tan, Shirish Karande, Mohan Kankanhalli, Murari Mandal

    Abstract: Role-based access control (RBAC) and hierarchical structures are foundational to how information flows and decisions are made within virtually all organizations. As the potential of Large Language Models (LLMs) to serve as unified knowledge repositories and intelligent assistants in enterprise settings becomes increasingly apparent, a critical, yet underexplored, challenge emerges: can th…

    Submitted 17 June, 2025; v1 submitted 25 May, 2025; originally announced May 2025.

    Comments: 56 pages

  21. arXiv:2505.12692  [pdf, ps, other]

    cs.AI cs.CL

    Bullying the Machine: How Personas Increase LLM Vulnerability

    Authors: Ziwei Xu, Udit Sanghi, Mohan Kankanhalli

    Abstract: Large Language Models (LLMs) are increasingly deployed in interactions where they are prompted to adopt personas. This paper investigates whether such persona conditioning affects model safety under bullying, an adversarial manipulation that applies psychological pressures in order to force the victim to comply with the attacker. We introduce a simulation framework in which an attacker LLM engages a…

    Submitted 19 May, 2025; originally announced May 2025.

  22. arXiv:2505.02828  [pdf, ps, other]

    cs.AI cs.CR

    Privacy Risks and Preservation Methods in Explainable Artificial Intelligence: A Scoping Review

    Authors: Sonal Allana, Mohan Kankanhalli, Rozita Dara

    Abstract: Explainable Artificial Intelligence (XAI) has emerged as a pillar of Trustworthy AI and aims to bring transparency to complex models that are opaque by nature. Despite the benefits of incorporating explanations in models, there is an urgent need to address the privacy concerns of providing this additional information to end users. In this article, we conduct a scoping review of existing literat…

    Submitted 2 December, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

    Comments: Published in Transactions on Machine Learning Research: https://openreview.net/forum?id=q9nykJfzku

    Journal ref: Transactions on Machine Learning Research, 10/2025, ISSN=2835-8856

  23. FractalForensics: Proactive Deepfake Detection and Localization via Fractal Watermarks

    Authors: Tianyi Wang, Harry Cheng, Ming-Hui Liu, Mohan Kankanhalli

    Abstract: Proactive Deepfake detection via robust watermarks has seen interest ever since passive Deepfake detectors encountered challenges in identifying high-quality synthetic images. However, while demonstrating reasonable detection performance, they lack localization functionality and explainability in detection results. Additionally, the unstable robustness of watermarks can significantly affect the de…

    Submitted 3 November, 2025; v1 submitted 13 April, 2025; originally announced April 2025.

    Comments: ACM Multimedia 2025 Oral

  24. arXiv:2503.06608  [pdf, ps, other]

    cs.CV cs.LG cs.MM

    GroMo: Plant Growth Modeling with Multiview Images

    Authors: Ruchi Bhatt, Shreya Bansal, Amanpreet Chander, Rupinder Kaur, Malya Singh, Mohan Kankanhalli, Abdulmotaleb El Saddik, Mukesh Kumar Saini

    Abstract: Understanding plant growth dynamics is essential for applications in agriculture and plant phenotyping. We present the Growth Modelling (GroMo) challenge, which is designed for two primary tasks: (1) plant age prediction and (2) leaf count estimation, both essential for crop monitoring and precision agriculture. For this challenge, we introduce GroMo25, a dataset with images of four crops: radish,…

    Submitted 6 June, 2025; v1 submitted 9 March, 2025; originally announced March 2025.

    Comments: 7 pages, 5 figures, 3 tables

  25. arXiv:2412.15614  [pdf, other]

    cs.CR cs.CV

    Technical Report for ICML 2024 TiFA Workshop MLLM Attack Challenge: Suffix Injection and Projected Gradient Descent Can Easily Fool An MLLM

    Authors: Yangyang Guo, Ziwei Xu, Xilie Xu, YongKang Wong, Liqiang Nie, Mohan Kankanhalli

    Abstract: This technical report introduces our top-ranked solution that employs two approaches, i.e., suffix injection and projected gradient descent (PGD), to address the TiFA workshop MLLM attack challenge. Specifically, we first append the text from an incorrectly labeled option (pseudo-labeled) to the original query as a suffix. Using this modified query, our second approach applies the PGD method to add…

    Submitted 20 December, 2024; originally announced December 2024.

    Comments: ICML TiFA Challenge Technical Report

  26. arXiv:2411.16771  [pdf, other]

    cs.CV

    VidHal: Benchmarking Temporal Hallucinations in Vision LLMs

    Authors: Wey Yeh Choong, Yangyang Guo, Mohan Kankanhalli

    Abstract: Vision Large Language Models (VLLMs) are widely acknowledged to be prone to hallucinations. Existing research addressing this problem has primarily been confined to image inputs, with limited exploration of video-based hallucinations. Furthermore, current evaluation methods fail to capture nuanced errors in generated responses, which are often exacerbated by the rich spatiotemporal dynamics of vid…

    Submitted 7 March, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

    Comments: 9 pages, 10 figures. Code available at https://github.com/Lookuz/VidHal

  27. arXiv:2411.13281  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

    Authors: Ziyang Luo, Haoning Wu, Dongxu Li, Jing Ma, Mohan Kankanhalli, Junnan Li

    Abstract: Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods like multiple-choice questions in benchmarks such as VideoMME and LongVideoBench, which often lack the depth needed to capture the complex demands of real-world users. To address this limitation, and due to the prohibitiv…

    Submitted 23 March, 2025; v1 submitted 20 November, 2024; originally announced November 2024.

    Comments: CVPR 2025, Project Page: https://videoautoarena.github.io/

  28. arXiv:2411.12785  [pdf, other]

    cs.CV

    Joint Vision-Language Social Bias Removal for CLIP

    Authors: Haoyu Zhang, Yangyang Guo, Mohan Kankanhalli

    Abstract: Vision-Language (V-L) pre-trained models such as CLIP show prominent capabilities in various downstream tasks. Despite this promise, V-L models are notoriously limited by their inherent social biases. A typical demonstration is that V-L models often produce biased predictions against specific groups of people, significantly undermining their real-world applicability. Existing approaches endeavor t…

    Submitted 19 November, 2024; originally announced November 2024.

  29. arXiv:2411.09126  [pdf, other]

    cs.CV

    SCAN: Bootstrapping Contrastive Pre-training for Data Efficiency

    Authors: Yangyang Guo, Mohan Kankanhalli

    Abstract: While contrastive pre-training is widely employed, its data efficiency problem has remained relatively under-explored thus far. Existing methods often rely on static coreset selection algorithms to pre-identify important data for training. However, this static nature renders them unable to dynamically track the data usefulness throughout pre-training, leading to subpar pre-trained models. To addre…

    Submitted 13 November, 2024; originally announced November 2024.

  30. arXiv:2411.08410  [pdf, other]

    cs.CR cs.CV

    The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense

    Authors: Yangyang Guo, Fangkai Jiao, Liqiang Nie, Mohan Kankanhalli

    Abstract: The vulnerability of Vision Large Language Models (VLLMs) to jailbreak attacks comes as no surprise. However, recent defense mechanisms against these attacks have reached near-saturation performance on benchmark evaluations, often with minimal effort. This dual high performance in both attack and defense raises a fundamental and perplexing paradox. To gain a deep understanding of this iss…

    Submitted 5 March, 2025; v1 submitted 13 November, 2024; originally announced November 2024.

    Comments: Logic smoothing and language polishing

  31. arXiv:2410.17050  [pdf, other]

    cs.LG cs.AI cs.CL

    UnStar: Unlearning with Self-Taught Anti-Sample Reasoning for LLMs

    Authors: Yash Sinha, Murari Mandal, Mohan Kankanhalli

    Abstract: The key components of machine learning are data samples for training, a model for learning patterns, and a loss function for optimizing accuracy. Analogously, unlearning can potentially be achieved through anti-data samples (or anti-samples), an unlearning method, and a reversed loss function. While prior research has explored unlearning methods and reversed loss functions, the potential of anti-samples re…

    Submitted 22 October, 2024; originally announced October 2024.

  32. arXiv:2410.02451  [pdf, other]

    cs.AI

    Strong Preferences Affect the Robustness of Preference Models and Value Alignment

    Authors: Ziwei Xu, Mohan Kankanhalli

    Abstract: Value alignment, which aims to ensure that large language models (LLMs) and other AI agents behave in accordance with human values, is critical for ensuring safety and trustworthiness of these systems. A key component of value alignment is the modeling of human preferences as a representation of human values. In this paper, we investigate the robustness of value alignment by examining the sensitiv…

    Submitted 7 March, 2025; v1 submitted 3 October, 2024; originally announced October 2024.

    Comments: 21 pages. Accepted by ICLR 2025

  33. arXiv:2406.04629  [pdf, other]

    cs.CV cs.GR cs.MM

    STAR: Skeleton-aware Text-based 4D Avatar Generation with In-Network Motion Retargeting

    Authors: Zenghao Chai, Chen Tang, Yongkang Wong, Mohan Kankanhalli

    Abstract: The creation of 4D avatars (i.e., animated 3D avatars) from text description typically uses text-to-image (T2I) diffusion models to synthesize 3D avatars in the canonical space and subsequently applies animation with target motions. However, such an optimization-by-animation paradigm has several drawbacks. (1) For pose-agnostic optimization, the rendered images in canonical pose for naive Score Di…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: Tech report

  34. arXiv:2405.16934  [pdf, other]

    cs.CV

    Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR

    Authors: Zhenyang Li, Yangyang Guo, Kejie Wang, Xiaolin Chen, Liqiang Nie, Mohan Kankanhalli

    Abstract: Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question answering over visual scenes. To achieve this goal, a model is required to provide an acceptable rationale as the reason for the predicted answers. Progress on the benchmark dataset stems largely from the recent advancement of Vision-Language Transformers (VL Transformers). These models are first pre-trained on some…

    Submitted 27 May, 2024; originally announced May 2024.

  35. Multi-Modal Recommendation Unlearning for Legal, Licensing, and Modality Constraints

    Authors: Yash Sinha, Murari Mandal, Mohan Kankanhalli

    Abstract: User data spread across multiple modalities has popularized multi-modal recommender systems (MMRS). They recommend diverse content such as products, social media posts, TikTok reels, etc., based on a user-item interaction graph. With rising data privacy demands, recent methods propose unlearning private user data from uni-modal recommender systems (RS). However, methods for unlearning item data re…

    Submitted 29 June, 2025; v1 submitted 24 May, 2024; originally announced May 2024.

    Comments: Extended Version, Accepted at AAAI 2025. 17 pages, 4 figures and 9 tables

    ACM Class: H.3.3; H.5.1; I.2.6; K.4.1

    Journal ref: In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 12, pp. 12541-12549. 2025

  36. arXiv:2405.13911  [pdf, other]

    cs.CV cs.AI cs.CL

    TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment

    Authors: Wei Li, Hehe Fan, Yongkang Wong, Mohan Kankanhalli, Yi Yang

    Abstract: Recent advancements in image understanding have benefited from the extensive use of web image-text pairs. However, video understanding remains a challenge despite the availability of substantial web video-text data. This difficulty primarily arises from the inherent complexity of videos and the inefficient language supervision in recent web-collected video-text datasets. In this paper, we introduc…

    Submitted 3 November, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

    Comments: NeurIPS 2024 (Spotlight)

  37. arXiv:2405.12538  [pdf, other]

    cs.CV

    Bridging the Intent Gap: Knowledge-Enhanced Visual Generation

    Authors: Yi Cheng, Ziwei Xu, Dongyun Lin, Harry Cheng, Yongkang Wong, Ying Sun, Joo Hwee Lim, Mohan Kankanhalli

    Abstract: For visual content generation, discrepancies between user intentions and the generated content have been a longstanding problem. This discrepancy arises from two main factors. First, user intentions are inherently complex, with subtle details not fully captured by input prompts. The absence of such details makes it challenging for generative models to accurately reflect the intended meaning, leadi…

    Submitted 21 May, 2024; originally announced May 2024.

  38. arXiv:2404.14106  [pdf]

    cs.CR

    DPTraj-PM: Differentially Private Trajectory Synthesis Using Prefix Tree and Markov Process

    Authors: Nana Wang, Mohan Kankanhalli

    Abstract: The increasing use of GPS-enabled devices has generated a large amount of trajectory data. These data offer us vital insights to understand the movements of individuals and populations, benefiting a broad range of applications from transportation planning to epidemic modeling. However, improper release of trajectory data raises increasing concerns about individual privacy. Previous attempts either lack s…

    Submitted 22 April, 2024; originally announced April 2024.

  39. MCM: Multi-condition Motion Synthesis Framework

    Authors: Zeyu Ling, Bo Han, Yongkang Wong, Han Lin, Mohan Kankanhalli, Weidong Geng

    Abstract: Conditional human motion synthesis (HMS) aims to generate human motion sequences that conform to specific conditions. Text and audio represent the two predominant modalities employed as HMS control conditions. While existing research has primarily focused on single conditions, multi-condition human motion synthesis remains underexplored. In this study, we propose a multi-condition HMS framewor…

    Submitted 19 April, 2024; originally announced April 2024.

    Report number: https://doi.org/10.24963/ijcai.2024/120

    Journal ref: International Joint Conference on Artificial Intelligence 2024

  40. arXiv:2404.10321  [pdf, other]

    cs.IR

    Cluster-based Graph Collaborative Filtering

    Authors: Fan Liu, Shuai Zhao, Zhiyong Cheng, Liqiang Nie, Mohan Kankanhalli

    Abstract: Graph Convolution Networks (GCNs) have significantly succeeded in learning user and item representations for recommendation systems. The core of their efficacy is the ability to explicitly exploit the collaborative signals from both the first- and high-order neighboring nodes. However, most existing GCN-based methods overlook the multiple interests of users while performing high-order graph convol…

    Submitted 8 November, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

    Comments: Accepted by ACM TOIS

    ACM Class: H.3.3

  41. arXiv:2404.08111  [pdf, other]

    cs.CV cs.AI cs.CL

    S3Editor: A Sparse Semantic-Disentangled Self-Training Framework for Face Video Editing

    Authors: Guangzhi Wang, Tianyi Chen, Kamran Ghasedi, HsiangTao Wu, Tianyu Ding, Chris Nuesmeyer, Ilya Zharkov, Mohan Kankanhalli, Luming Liang

    Abstract: Face attribute editing plays a pivotal role in various applications. However, existing methods encounter challenges in achieving high-quality results while preserving identity, editing faithfulness, and temporal consistency. These challenges are rooted in issues related to the training pipeline, including limited supervision, architecture design, and optimization strategy. In this work, we introdu…

    Submitted 11 April, 2024; originally announced April 2024.

  42. arXiv:2403.06520  [pdf, other]

    cs.CL cs.AI

    How to Understand Named Entities: Using Common Sense for News Captioning

    Authors: Ning Xu, Yanhui Wang, Tingting Zhang, Hongshuo Tian, Mohan Kankanhalli, An-An Liu

    Abstract: News captioning aims to describe an image with its news article body as input. It greatly relies on a set of detected named entities, including real-world people, organizations, and places. This paper exploits commonsense knowledge to understand named entities for news captioning. By ``understand'', we mean correlating the news content with common sense in the wild, which helps an agent to 1) dist…

    Submitted 11 March, 2024; originally announced March 2024.

  43. arXiv:2402.09288  [pdf, other]

    cs.LG

    EcoVal: An Efficient Data Valuation Framework for Machine Learning

    Authors: Ayush K Tarun, Vikram S Chundawat, Murari Mandal, Hong Ming Tan, Bowei Chen, Mohan Kankanhalli

    Abstract: Quantifying the value of data within a machine learning workflow can play a pivotal role in making more strategic decisions in machine learning initiatives. The existing Shapley-value-based frameworks for data valuation in machine learning are computationally expensive, as they require a considerable amount of repeated training of the model to obtain the Shapley value. In this paper, we introduce an…

    Submitted 9 July, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

    Comments: KDD-2024

  44. arXiv:2401.15859  [pdf, other]

    cs.CV cs.AI

    Diffusion Facial Forgery Detection

    Authors: Harry Cheng, Yangyang Guo, Tianyi Wang, Liqiang Nie, Mohan Kankanhalli

    Abstract: Detecting diffusion-generated images has recently grown into an emerging research area. Existing diffusion-based datasets predominantly focus on general image generation. However, facial forgeries, which pose a more severe social risk, have remained less explored thus far. To address this gap, this paper introduces DiFF, a comprehensive dataset dedicated to face-focused diffusion-generated images.…

    Submitted 28 January, 2024; originally announced January 2024.

    Comments: The dataset will be released at https://github.com/xaCheng1996/DiFF

  45. arXiv:2401.11817  [pdf, other]

    cs.CL cs.AI cs.LG

    Hallucination is Inevitable: An Innate Limitation of Large Language Models

    Authors: Ziwei Xu, Sanjay Jain, Mohan Kankanhalli

    Abstract: Hallucination has been widely recognized to be a significant drawback for large language models (LLMs). There have been many works that attempt to reduce the extent of hallucination. These efforts have mostly been empirical so far, which cannot answer the fundamental question of whether it can be completely eliminated. In this paper, we formalize the problem and show that it is impossible to eliminat…

    Submitted 13 February, 2025; v1 submitted 22 January, 2024; originally announced January 2024.

  46. arXiv:2312.16275  [pdf, other]

    cs.IR cs.MM

    Understanding Before Recommendation: Semantic Aspect-Aware Review Exploitation via Large Language Models

    Authors: Fan Liu, Yaqi Liu, Huilin Chen, Zhiyong Cheng, Liqiang Nie, Mohan Kankanhalli

    Abstract: Recommendation systems harness user-item interactions like clicks and reviews to learn their representations. Previous studies improve recommendation accuracy and interpretability by modeling user preferences across various aspects and intents. However, the aspects and intents are inferred directly from user reviews or behavior patterns, suffering from data noise and data sparsity problems.…

    Submitted 16 November, 2024; v1 submitted 26 December, 2023; originally announced December 2023.

    Comments: Accepted by ACM TOIS

    ACM Class: H.3.3

  47. Attribute-driven Disentangled Representation Learning for Multimodal Recommendation

    Authors: Zhenyang Li, Fan Liu, Yinwei Wei, Zhiyong Cheng, Liqiang Nie, Mohan Kankanhalli

    Abstract: Recommendation algorithms forecast user preferences by correlating user and item representations derived from historical interaction patterns. In pursuit of enhanced performance, many methods focus on learning robust and independent representations by disentangling the intricate factors within interaction data across various modalities in an unsupervised manner. However, such an approach obfuscate…

    Submitted 31 July, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: Accepted by ACM Multimedia 2024

    Journal ref: In Proceedings of the 32nd ACM International Conference on Multimedia (MM '24), 2024

  48. arXiv:2311.16475  [pdf, other]

    cs.CV

    Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models

    Authors: Yu-Wei Zhan, Fan Liu, Xin Luo, Xin-Shun Xu, Liqiang Nie, Mohan Kankanhalli

    Abstract: Human-Object Interaction (HOI) detection aims at detecting human-object pairs and predicting their interactions. However, conventional HOI detection methods often struggle to fully capture the contextual information needed to accurately identify these interactions. While large Vision-Language Models (VLMs) show promise in tasks involving human interactions, they are not tailored for HOI detection.…

    Submitted 8 October, 2024; v1 submitted 26 November, 2023; originally announced November 2023.

  49. arXiv:2311.07604  [pdf, other]

    cs.LG cs.AI cs.CV cs.CY

    Finetuning Text-to-Image Diffusion Models for Fairness

    Authors: Xudong Shen, Chao Du, Tianyu Pang, Min Lin, Yongkang Wong, Mohan Kankanhalli

    Abstract: The rapid adoption of text-to-image diffusion models in society underscores an urgent need to address their biases. Without interventions, these biases could propagate a skewed worldview and restrict opportunities for minority groups. In this work, we frame fairness as a distributional alignment problem. Our solution consists of two main technical contributions: (1) a distributional alignment loss…

    Submitted 15 March, 2024; v1 submitted 11 November, 2023; originally announced November 2023.

    Comments: ICLR 2024 oral presentation

  50. arXiv:2311.04811  [pdf, other]

    cs.CV

    Image-Based Virtual Try-On: A Survey

    Authors: Dan Song, Xuanpu Zhang, Juan Zhou, Weizhi Nie, Ruofeng Tong, Mohan Kankanhalli, An-An Liu

    Abstract: Image-based virtual try-on aims to synthesize a naturally dressed person image with a clothing image, which revolutionizes online shopping and inspires related topics within image generation, showing both research significance and commercial potential. However, there is a gap between current research progress and commercial applications, and an absence of a comprehensive overview of this field to acc…

    Submitted 2 September, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

    Comments: 30 pages, 20 figures