-
ImageSentinel: Protecting Visual Datasets from Unauthorized Retrieval-Augmented Image Generation
Authors:
Ziyuan Luo,
Yangyi Zhao,
Ka Chun Cheung,
Simon See,
Renjie Wan
Abstract:
The widespread adoption of Retrieval-Augmented Image Generation (RAIG) has raised significant concerns about the unauthorized use of private image datasets. While these systems have shown remarkable capabilities in enhancing generation quality through reference images, protecting visual datasets from unauthorized use in such systems remains a challenging problem. Traditional digital watermarking approaches face limitations in RAIG systems, as the complex feature extraction and recombination processes fail to preserve watermark signals during generation. To address these challenges, we propose ImageSentinel, a novel framework for protecting visual datasets in RAIG. Our framework synthesizes sentinel images that maintain visual consistency with the original dataset. These sentinels enable protection verification through randomly generated character sequences that serve as retrieval keys. To ensure seamless integration, we leverage vision-language models to generate the sentinel images. Experimental results demonstrate that ImageSentinel effectively detects unauthorized dataset usage while preserving generation quality for authorized applications. Code is available at https://github.com/luo-ziyuan/ImageSentinel.
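The verification mechanism the abstract describes (sentinel images keyed by randomly generated character sequences) can be illustrated with a toy retrieval index. All names below are hypothetical stand-ins, not the paper's implementation:

```python
import random
import string

def make_sentinel_key(length=12, seed=None):
    """Random character sequence used as the sentinel's retrieval key."""
    rng = random.Random(seed)
    return "".join(rng.choices(string.ascii_lowercase, k=length))

class ToyRetriever:
    """Stand-in for a RAIG retrieval index mapping caption text to image ids."""
    def __init__(self):
        self.index = {}

    def add(self, caption, image_id):
        self.index[caption] = image_id

    def query(self, text):
        return self.index.get(text)

# Protect a dataset by planting a sentinel image keyed by a secret sequence.
key = make_sentinel_key(seed=42)
retriever = ToyRetriever()
retriever.add("a photo of a cat", "img_001")
retriever.add(key, "sentinel_777")  # the planted sentinel

# Verification: only a system that ingested the protected dataset can
# return the sentinel when queried with the secret key.
assert retriever.query(key) == "sentinel_777"
assert retriever.query("an unrelated query") is None
```

Because the key is a random string with no natural-language meaning, a positive retrieval is strong evidence of unauthorized ingestion rather than coincidence.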
Submitted 13 October, 2025;
originally announced October 2025.
-
NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents
Authors:
Tianshi Zheng,
Kelvin Kiu-Wai Tam,
Newt Hue-Nam K. Nguyen,
Baixuan Xu,
Zhaowei Wang,
Jiayang Cheng,
Hong Ting Tsang,
Weiqi Wang,
Jiaxin Bai,
Tianqing Fang,
Yangqiu Song,
Ginny Y. Wong,
Simon See
Abstract:
Large language models are emerging as powerful tools for scientific law discovery, a foundational challenge in AI-driven science. However, existing benchmarks for this task suffer from a fundamental methodological trilemma, forcing a trade-off between scientific relevance, scalability, and resistance to memorization. Furthermore, they oversimplify discovery as static function fitting, failing to capture the authentic scientific process of uncovering embedded laws through the interactive exploration of complex model systems. To address these critical gaps, we introduce NewtonBench, a benchmark comprising 324 scientific law discovery tasks across 12 physics domains. Our design mitigates the evaluation trilemma by using counterfactual law shifts (systematic alterations of canonical laws) to generate a vast suite of problems that are scalable, scientifically relevant, and memorization-resistant. Moreover, we elevate the evaluation from static function fitting to interactive model discovery, requiring agents to experimentally probe simulated complex systems to uncover hidden principles. Our extensive experiments reveal a clear but fragile capability for discovery in frontier LLMs: this ability degrades precipitously with increasing system complexity and exhibits extreme sensitivity to observational noise. Notably, we uncover a paradoxical effect of tool assistance: providing a code interpreter can hinder more capable models by inducing a premature shift from exploration to exploitation, causing them to satisfice on suboptimal solutions. These results demonstrate that robust, generalizable discovery in complex, interactive environments remains the core challenge. By providing a scalable, robust, and scientifically authentic testbed, NewtonBench offers a crucial tool for measuring true progress and guiding the development of next-generation AI agents capable of genuine scientific discovery.
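The "counterfactual law shift" plus interactive-probing setup can be sketched minimally as follows; the shifted exponent, the black-box interface, and the recovery procedure are illustrative assumptions, not NewtonBench internals:

```python
import math

G = 6.674e-11  # gravitational constant

def make_shifted_gravity(exponent=2.0):
    """Force law F = G*m1*m2 / r**exponent; any exponent != 2 is a
    counterfactual shift of the canonical inverse-square law."""
    def force(m1, m2, r):
        return G * m1 * m2 / r ** exponent
    return force

# A benchmark-style setup hands the agent only a black-box experiment,
# never the law itself:
hidden_law = make_shifted_gravity(exponent=2.5)  # hidden counterfactual

def run_experiment(m1, m2, r):
    return hidden_law(m1, m2, r)

# An agent can recover the hidden exponent from two probes at r=1 and r=2,
# since F(r=1)/F(r=2) = 2**exponent:
f1 = run_experiment(1.0, 1.0, 1.0)
f2 = run_experiment(1.0, 1.0, 2.0)
recovered_exponent = math.log(f1 / f2) / math.log(2.0)  # ~2.5
```

Because the shift is applied systematically rather than drawn from textbooks, a memorized formula gives the wrong answer and only genuine experimentation succeeds.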
Submitted 9 December, 2025; v1 submitted 8 October, 2025;
originally announced October 2025.
-
LOBE-GS: Load-Balanced and Efficient 3D Gaussian Splatting for Large-Scale Scene Reconstruction
Authors:
Sheng-Hsiang Hung,
Ting-Yu Yen,
Wei-Fang Sun,
Simon See,
Shih-Hsuan Hung,
Hung-Kuo Chu
Abstract:
3D Gaussian Splatting (3DGS) has established itself as an efficient representation for real-time, high-fidelity 3D scene reconstruction. However, scaling 3DGS to large and unbounded scenes such as city blocks remains difficult. Existing divide-and-conquer methods alleviate memory pressure by partitioning the scene into blocks, but introduce new bottlenecks: (i) partitions suffer from severe load imbalance since uniform or heuristic splits do not reflect actual computational demands, and (ii) coarse-to-fine pipelines fail to exploit the coarse stage efficiently, often reloading the entire model and incurring high overhead. In this work, we introduce LoBE-GS, a novel Load-Balanced and Efficient 3D Gaussian Splatting framework that re-engineers the large-scale 3DGS pipeline. LoBE-GS introduces a depth-aware partitioning method that reduces preprocessing from hours to minutes, an optimization-based strategy that balances visible Gaussians (a strong proxy for computational load) across blocks, and two lightweight techniques, visibility cropping and selective densification, to further reduce training cost. Evaluations on large-scale urban and outdoor datasets show that LoBE-GS consistently achieves up to $2\times$ faster end-to-end training time than state-of-the-art baselines, while maintaining reconstruction quality and enabling scalability to scenes infeasible with vanilla 3DGS.
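The visible-Gaussian load balancing described above can be approximated by a standard greedy longest-processing-time heuristic; this is a hedged sketch of the balancing objective, not the paper's optimization-based algorithm:

```python
import heapq

def balance_blocks(cell_loads, num_blocks):
    """Greedy LPT assignment: place each scene cell (heaviest first) on the
    currently lightest block, so per-block visible-Gaussian counts, used
    here as a proxy for compute load, stay roughly even."""
    heap = [(0, b, []) for b in range(num_blocks)]  # (total_load, block_id, cells)
    heapq.heapify(heap)
    for cell, load in sorted(cell_loads.items(), key=lambda kv: -kv[1]):
        total, b, cells = heapq.heappop(heap)  # lightest block so far
        cells.append(cell)
        heapq.heappush(heap, (total + load, b, cells))
    return {b: (total, cells) for total, b, cells in heap}

# Illustrative per-cell visible-Gaussian counts split across 2 blocks:
plan = balance_blocks({"c0": 90, "c1": 50, "c2": 40, "c3": 30, "c4": 10}, 2)
```

With these toy counts the two blocks end up at 120 and 100 units of load, far closer than a naive uniform split of the same cells would give.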
Submitted 2 October, 2025;
originally announced October 2025.
-
Etching-free dual-lift-off for direct patterning of epitaxial oxide thin films
Authors:
Jiayi Qin,
Josephine Si Yu See,
Yanran Liu,
Xueyan Wang,
Wenhai Zhao,
Yang He,
Jianbo Ding,
Yilin Wu,
Shanhu Wang,
Huiping Han,
Afzal Khan,
Shuya Liu,
Sheng'an Yang,
Hui Zhang,
Jiangnan Li,
Qingming Chen,
Jiyang Xie,
Ji Ma,
Wanbiao Hu,
Jianhong Yi,
Liang Wu,
X. Renshaw Wang
Abstract:
Although monocrystalline oxide films offer broad functional capabilities, their practical use is hampered by challenges in patterning. Traditional patterning relies on etching, which can be costly and prone to issues like film or substrate damage, under-etching, over-etching, and lateral etching. In this study, we introduce a dual-lift-off method for direct patterning of oxide films, circumventing the etching process and associated issues. Our method involves an initial lift-off of amorphous Sr$_3$Al$_2$O$_6$ or Sr$_4$Al$_2$O$_7$ ($a$SAO) through stripping the photoresist, followed by a subsequent lift-off of the functional oxide thin films by dissolving the $a$SAO layer. $a$SAO functions as a "high-temperature photoresist", making it compatible with the high-temperature growth of monocrystalline oxides. Using this method, patterned ferromagnetic La$_{0.67}$Sr$_{0.33}$MnO$_{3}$ and ferroelectric BiFeO$_3$ were fabricated, accurately mirroring the shape of the photoresist. Our study presents a straightforward, flexible, precise, environmentally friendly, and cost-effective method for patterning high-quality oxide thin films.
Submitted 31 August, 2025;
originally announced September 2025.
-
Biological Pathway Informed Models with Graph Attention Networks (GATs)
Authors:
Gavin Wong,
Ping Shu Ho,
Ivan Au Yeung,
Ka Chun Cheung,
Simon See
Abstract:
Biological pathways map gene-gene interactions that govern all human processes. Despite their importance, most ML models treat genes as unstructured tokens, discarding known pathway structure. The latest pathway-informed models capture pathway-pathway interactions, but still treat each pathway as a "bag of genes" via MLPs, discarding its topology and gene-gene interactions. We propose a Graph Attention Network (GAT) framework that models pathways at the gene level. We show that GATs generalize much better than MLPs, achieving an 81% reduction in MSE when predicting pathway dynamics under unseen treatment conditions. We further validate the correctness of our biological prior by encoding drug mechanisms via edge interventions, boosting model robustness. Finally, we show that our GAT model is able to correctly rediscover all five gene-gene interactions in the canonical TP53-MDM2-MDM4 feedback loop from raw time-series mRNA data, demonstrating potential to generate novel biological hypotheses directly from experimental data.
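The "edge intervention" idea, encoding a drug mechanism by severing one gene-gene interaction before message passing, can be sketched on a toy TP53-MDM2-MDM4 graph; the weights and linear dynamics below are invented for illustration and are not the paper's trained GAT:

```python
import numpy as np

# Toy pathway graph over genes (indices): TP53=0, MDM2=1, MDM4=2.
# A[i, j] = signed influence of gene j on gene i.
A = np.array([[0.0, -0.8, -0.5],   # MDM2 and MDM4 repress TP53
              [0.9,  0.0,  0.0],   # TP53 activates MDM2
              [0.4,  0.0,  0.0]])  # TP53 activates MDM4

def step(x, adj):
    """One message-passing step of toy dynamics on the pathway graph."""
    return np.tanh(adj @ x)

def apply_drug(adj, src, dst):
    """Edge intervention: delete the src -> dst interaction, e.g. an
    MDM2 inhibitor severing the MDM2 -| TP53 repressive edge."""
    out = adj.copy()
    out[dst, src] = 0.0
    return out

x0 = np.array([1.0, 0.2, 0.2])          # initial gene activities
untreated = step(x0, A)
treated = step(x0, apply_drug(A, src=1, dst=0))  # inhibit MDM2 -> TP53
# With the repressive edge removed, predicted TP53 activity rises.
```

Keeping the intervention on the graph's edges, rather than on input features, is what lets a gene-level model like a GAT express the mechanism directly.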
Submitted 30 August, 2025;
originally announced September 2025.
-
SegReConcat: A Data Augmentation Method for Voice Anonymization Attack
Authors:
Ridwan Arefeen,
Xiaoxiao Miao,
Rong Tong,
Aik Beng Ng,
Simon See
Abstract:
Anonymization of voice seeks to conceal the identity of the speaker while maintaining the utility of speech data. However, residual speaker cues often persist, which pose privacy risks. We propose SegReConcat, a data augmentation method for attacker-side enhancement of automatic speaker verification systems. SegReConcat segments anonymized speech at the word level, rearranges segments using random or similarity-based strategies to disrupt long-term contextual cues, and concatenates them with the original utterance, allowing an attacker to learn source speaker traits from multiple perspectives. The proposed method has been evaluated in the VoicePrivacy Attacker Challenge 2024 framework across seven anonymization systems; SegReConcat improves de-anonymization on five of the seven systems.
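The segment-rearrange-concatenate procedure can be sketched with word strings standing in for word-aligned audio segments; the "sorted" strategy below is a toy stand-in for the paper's similarity-based ordering:

```python
import random

def segreconcat(words, strategy="random", seed=0):
    """SegReConcat-style augmentation sketch: rearrange the anonymized
    utterance's word-level segments to break long-term contextual cues,
    then concatenate the rearranged copy onto the original utterance."""
    segments = list(words)
    if strategy == "random":
        random.Random(seed).shuffle(segments)
    elif strategy == "sorted":  # toy stand-in for similarity-based ordering
        segments.sort()
    return list(words) + segments  # original + rearranged copy

aug = segreconcat(["the", "cat", "sat", "down"])
```

The augmented utterance is twice the original length: the first half keeps normal context, while the second half presents the same speaker traits with context scrambled, giving the attacker's verifier both views.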
Submitted 26 August, 2025;
originally announced August 2025.
-
Polarization-Aware DoA Detection Relying on a Single Rydberg Atomic Receiver
Authors:
Yuanbin Chen,
Chau Yuen,
Darmindra Arumugam,
Chong Meng Samson See,
Mérouane Debbah,
Lajos Hanzo
Abstract:
A polarization-aware direction-of-arrival (DoA) detection scheme is conceived that leverages the intrinsic vector sensitivity of a single Rydberg atomic vapor cell to achieve quantum-enhanced angle resolution. Our core idea lies in the fact that the vector nature of an electromagnetic wave is uniquely determined by its orthogonal electric and magnetic field components, both of which can be retrieved by a single Rydberg atomic receiver via electromagnetically induced transparency (EIT)-based spectroscopy. To be specific, in the presence of a static magnetic bias field that defines a stable quantization axis, a pair of sequential EIT measurements is carried out in the same vapor cell. Firstly, the electric-field polarization angle is extracted from the Zeeman-resolved EIT spectrum associated with an electric-dipole transition driven by the radio frequency (RF) field. Within the same experimental cycle, the RF field is then retuned to a magnetic-dipole resonance, producing Zeeman-resolved EIT peaks for decoding the RF magnetic-field orientation. This scheme exhibits a dual yet independent sensitivity on both angles, allowing for precise DoA reconstruction without the need for spatial diversity or phase referencing. Building on this foundation, we derive the quantum Fisher-information matrix (QFIM) and obtain a closed-form quantum Cramér-Rao bound (QCRB) for the joint estimation of polarization and orientation angles. Finally, simulation results spanning various quantum parameters validate the proposed approach and identify optimal operating regimes. With appropriately chosen polarization and magnetic-field geometries, a single vapor cell is expected to achieve sub-0.1$^\circ$ angle resolution at moderate RF-field driving strengths.
Submitted 23 August, 2025;
originally announced August 2025.
-
Align 3D Representation and Text Embedding for 3D Content Personalization
Authors:
Qi Song,
Ziyuan Luo,
Ka Chun Cheung,
Simon See,
Renjie Wan
Abstract:
Recent advances in NeRF and 3DGS have significantly enhanced the efficiency and quality of 3D content synthesis. However, efficient personalization of generated 3D content remains a critical challenge. Current 3D personalization approaches predominantly rely on knowledge distillation-based methods, which require computationally expensive retraining procedures. To address this challenge, we propose \textbf{Invert3D}, a novel framework for convenient 3D content personalization. Nowadays, vision-language models such as CLIP enable direct image personalization through aligned vision-text embedding spaces. However, the inherent structural differences between 3D content and 2D images preclude direct application of these techniques to 3D personalization. Our approach bridges this gap by establishing alignment between 3D representations and text embedding spaces. Specifically, we develop a camera-conditioned 3D-to-text inverse mechanism that projects 3D contents into a 3D embedding aligned with text embeddings. This alignment enables efficient manipulation and personalization of 3D content through natural language prompts, eliminating the need for computationally expensive retraining procedures. Extensive experiments demonstrate that Invert3D achieves effective personalization of 3D content. Our work is available at: https://github.com/qsong2001/Invert3D.
Submitted 23 August, 2025;
originally announced August 2025.
-
Stereo-GS: Multi-View Stereo Vision Model for Generalizable 3D Gaussian Splatting Reconstruction
Authors:
Xiufeng Huang,
Ka Chun Cheung,
Runmin Cong,
Simon See,
Renjie Wan
Abstract:
Generalizable 3D Gaussian Splatting reconstruction showcases advanced Image-to-3D content creation but requires substantial computational resources and large datasets, posing challenges to training models from scratch. Current methods usually entangle the prediction of 3D Gaussian geometry and appearance, which rely heavily on data-driven priors and result in slow regression speeds. To address this, we propose Stereo-GS, a disentangled framework for efficient 3D Gaussian prediction. Our method extracts features from local image pairs using a stereo vision backbone and fuses them via global attention blocks. Dedicated point and Gaussian prediction heads generate multi-view point-maps for geometry and Gaussian features for appearance, combined as GS-maps to represent the 3DGS object. A refinement network enhances these GS-maps for high-quality reconstruction. Unlike existing methods that depend on camera parameters, our approach achieves pose-free 3D reconstruction, improving robustness and practicality. By reducing resource demands while maintaining high-quality outputs, Stereo-GS provides an efficient, scalable solution for real-world 3D content generation.
Submitted 20 July, 2025;
originally announced July 2025.
-
SpaRTAN: Spatial Reinforcement Token-based Aggregation Network for Visual Recognition
Authors:
Quan Bi Pay,
Vishnu Monn Baskaran,
Junn Yong Loo,
KokSheik Wong,
Simon See
Abstract:
The resurgence of convolutional neural networks (CNNs) in visual recognition tasks, exemplified by ConvNeXt, has demonstrated their capability to rival transformer-based architectures through advanced training methodologies and ViT-inspired design principles. However, both CNNs and transformers exhibit a simplicity bias, favoring straightforward features over complex structural representations. Furthermore, modern CNNs often integrate MLP-like blocks akin to those in transformers, but these blocks suffer from significant information redundancies, necessitating high expansion ratios to sustain competitive performance. To address these limitations, we propose SpaRTAN, a lightweight architectural design that enhances spatial and channel-wise information processing. SpaRTAN employs kernels with varying receptive fields, controlled by kernel size and dilation factor, to capture discriminative multi-order spatial features effectively. A wave-based channel aggregation module further modulates and reinforces pixel interactions, mitigating channel-wise redundancies. Combining the two modules, the proposed network can efficiently gather and dynamically contextualize discriminative features. Experimental results on ImageNet and COCO demonstrate that SpaRTAN achieves remarkable parameter efficiency while maintaining competitive performance. In particular, on the ImageNet-1k benchmark, SpaRTAN achieves 77.7% accuracy with only 3.8M parameters and approximately 1.0 GFLOPs, demonstrating its ability to deliver strong performance through an efficient design. On the COCO benchmark, it achieves 50.0% AP, surpassing the previous benchmark by 1.2% with only 21.5M parameters. The code is publicly available at [https://github.com/henry-pay/SpaRTAN].
Submitted 15 July, 2025;
originally announced July 2025.
-
Conceptualizing Multi-scale Wavelet Attention and Ray-based Encoding for Human-Object Interaction Detection
Authors:
Quan Bi Pay,
Vishnu Monn Baskaran,
Junn Yong Loo,
KokSheik Wong,
Simon See
Abstract:
Human-object interaction (HOI) detection is essential for accurately localizing and characterizing interactions between humans and objects, providing a comprehensive understanding of complex visual scenes across various domains. However, existing HOI detectors often struggle to deliver reliable predictions efficiently, relying on resource-intensive training methods and inefficient architectures. To address these challenges, we conceptualize a wavelet attention-like backbone and a novel ray-based encoder architecture tailored for HOI detection. Our wavelet backbone addresses the limitations of expressing middle-order interactions by aggregating discriminative features from the low- and high-order interactions extracted from diverse convolutional filters. Concurrently, the ray-based encoder facilitates multi-scale attention by optimizing the focus of the decoder on relevant regions of interest and mitigating computational overhead. As a result of harnessing the attenuated intensity of learnable ray origins, our decoder aligns query embeddings with emphasized regions of interest for accurate predictions. Experimental results on benchmark datasets, including ImageNet and HICO-DET, showcase the potential of our proposed architecture. The code is publicly available at [https://github.com/henry-pay/RayEncoder].
Submitted 15 July, 2025;
originally announced July 2025.
-
XToM: Exploring the Multilingual Theory of Mind for Large Language Models
Authors:
Chunkit Chan,
Yauwai Yim,
Hongchuan Zeng,
Zhiying Zou,
Xinyuan Cheng,
Zhifan Sun,
Zheye Deng,
Kawai Chung,
Yuzhuo Ao,
Yixiang Fan,
Cheng Jiayang,
Ercong Nie,
Ginny Y. Wong,
Helmut Schmid,
Hinrich Schütze,
Simon See,
Yangqiu Song
Abstract:
Theory of Mind (ToM), the ability to infer mental states in others, is pivotal for human social cognition. Existing evaluations of ToM in LLMs are largely limited to English, neglecting the linguistic diversity that shapes human cognition. This limitation raises a critical question: can LLMs exhibit Multilingual Theory of Mind, which is the capacity to reason about mental states across diverse linguistic contexts? To address this gap, we present XToM, a rigorously validated multilingual benchmark that evaluates ToM across five languages and incorporates diverse, contextually rich task scenarios. Using XToM, we systematically evaluate LLMs (e.g., DeepSeek R1), revealing a pronounced dissonance: while models excel in multilingual language understanding, their ToM performance varies across languages. Our findings expose limitations in LLMs' ability to replicate human-like mentalizing across linguistic contexts.
Submitted 3 June, 2025;
originally announced June 2025.
-
Rydberg Atomic Quantum MIMO Receivers for The Multi-User Uplink
Authors:
Tierui Gong,
Chau Yuen,
Chong Meng Samson See,
Mérouane Debbah,
Lajos Hanzo
Abstract:
Rydberg atomic quantum receivers (RAQRs) have emerged as a promising solution for evolving wireless receivers from the classical to the quantum domain. To further unleash their great potential in wireless communications, we propose a flexible architecture for Rydberg atomic quantum multiple-input multiple-output (RAQ-MIMO) receivers in the multi-user uplink. Then the corresponding signal model of the RAQ-MIMO system is constructed by paving the way from quantum physics to classical wireless communications. Explicitly, we outline the associated operating principles and transmission flow. We also validate the linearity of our model and its feasible region. Based on our model, we derive closed-form asymptotic formulas for the ergodic achievable rate (EAR) of both the maximum-ratio combining (MRC) and zero-forcing (ZF) receivers operating in uncorrelated fading channels (UFC) and the correlated fading channels (CFC), respectively. Furthermore, we theoretically characterize the EAR difference both between the UFC and CFC scenarios, as well as MRC and ZF schemes. More particularly, we quantify the superiority of RAQ-MIMO receivers over the classical massive MIMO (M-MIMO) receivers, specifying an increase of $\log_{2} \Pi$ of the EAR per user, $\Pi$-fold reduction of the users' transmit power, and $\sqrt[\nu]{\Pi}$-fold increase of the transmission distance, respectively, where $\Pi = \text{ReceiverGainRatio} / \text{ReceiverNoisePowerRatio}$ of the single-sensor receivers and $\nu$ is the path-loss exponent. Our simulation results reveal that, compared to classical M-MIMO receivers, our RAQ-MIMO scheme can either realize $\sim 12$ bits/s/Hz/user ($\sim 8$ bits/s/Hz/user) higher EAR, or $\sim 10000$-fold ($\sim 500$-fold) lower transmit power, or alternatively, $\sim 100$-fold ($\sim 21$-fold) longer distance in free-space transmissions, in the standard quantum limit (photon shot limit).
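The quoted gains reduce to simple arithmetic in $\Pi$ and $\nu$; the sketch below only evaluates those closed-form ratios, and the input numbers are illustrative rather than the paper's measured values:

```python
import math

def quantum_gains(gain_ratio, noise_power_ratio, path_loss_exp):
    """Evaluate the quoted RAQ-MIMO vs classical M-MIMO gains:
    Pi = ReceiverGainRatio / ReceiverNoisePowerRatio, giving
    +log2(Pi) bits/s/Hz/user of EAR, a Pi-fold transmit-power
    reduction, and a Pi**(1/nu)-fold range extension."""
    Pi = gain_ratio / noise_power_ratio
    return {
        "extra_rate_bits": math.log2(Pi),
        "power_reduction": Pi,
        "distance_gain": Pi ** (1.0 / path_loss_exp),
    }

# Illustrative numbers only: Pi = 4096 and free-space path loss (nu = 2)
g = quantum_gains(gain_ratio=4096.0, noise_power_ratio=1.0, path_loss_exp=2.0)
# -> 12 extra bits/s/Hz/user, 4096-fold power saving, 64-fold distance
```

With $\Pi = 4096$ and $\nu = 2$ the rate gain is exactly $\log_2 4096 = 12$ bits/s/Hz/user, matching the order of the $\sim 12$ bits/s/Hz/user figure reported in the abstract.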
Submitted 2 June, 2025;
originally announced June 2025.
-
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
Authors:
Zhaowei Wang,
Wenhao Yu,
Xiyu Ren,
Jipeng Zhang,
Yu Zhao,
Rohit Saxena,
Liang Cheng,
Ginny Wong,
Simon See,
Pasquale Minervini,
Yangqiu Song,
Mark Steedman
Abstract:
The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models' vision-language long-context ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.
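The shared vision-patch/text-token budget behind the standardized input lengths can be sketched as simple counting; the 576-patches-per-image figure is an illustrative assumption (a 24x24 ViT patch grid), not MMLongBench's actual tokenization:

```python
PATCHES_PER_IMAGE = 576  # illustrative: a 24x24 ViT patch grid per image

def context_length(num_text_tokens, num_images, patches=PATCHES_PER_IMAGE):
    """Cross-modal length: vision patches and text tokens share one budget."""
    return num_text_tokens + num_images * patches

def max_images(budget, num_text_tokens, patches=PATCHES_PER_IMAGE):
    """How many interleaved images fit at a standardized input length."""
    return max(0, (budget - num_text_tokens) // patches)

# At the 8K setting with 2,000 text tokens, ten 576-patch images fit:
n = max_images(8192, 2000)  # 10
```

Counting images by their patch cost is what makes the five length settings (8K-128K) comparable across examples that mix text and images in different proportions.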
Submitted 6 October, 2025; v1 submitted 15 May, 2025;
originally announced May 2025.
-
Continual Pre-Training is (not) What You Need in Domain Adaption
Authors:
Pin-Er Chen,
Da-Chen Lian,
Shu-Kai Hsieh,
Sieh-Chuen Huang,
Hsuan-Lei Shao,
Jun-Wei Chiu,
Yang-Hsien Lin,
Zih-Ching Chen,
Cheng-Kuang,
Eddie TC Huang,
Simon See
Abstract:
The recent advances in Legal Large Language Models (LLMs) have transformed the landscape of legal research and practice by automating tasks, enhancing research precision, and supporting complex decision-making processes. However, effectively adapting LLMs to the legal domain remains challenging due to the complexity of legal reasoning, the need for precise interpretation of specialized language, and the potential for hallucinations. This paper examines the efficacy of Domain-Adaptive Continual Pre-Training (DACP) in improving the legal reasoning capabilities of LLMs. Through a series of experiments on legal reasoning tasks within the Taiwanese legal framework, we demonstrate that while DACP enhances domain-specific knowledge, it does not uniformly improve performance across all legal tasks. We discuss the trade-offs involved in DACP, particularly its impact on model generalization and performance in prompt-based tasks, and propose directions for future research to optimize domain adaptation strategies in legal AI.
Submitted 18 April, 2025;
originally announced April 2025.
-
The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning
Authors:
Tianshi Zheng,
Yixiang Chen,
Chengxi Li,
Chunyang Li,
Qing Zong,
Haochen Shi,
Baixuan Xu,
Yangqiu Song,
Ginny Y. Wong,
Simon See
Abstract:
Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models (LLMs). However, our study reveals a surprising contradiction to this prevailing perspective within the fundamental domain of pattern-based in-context learning (ICL). Through extensive experiments involving 16 state-of-the-art LLMs and nine diverse pattern-based ICL datasets, we demonstrate that CoT and its reasoning variants consistently underperform direct answering across varying model scales and benchmark complexities. To systematically investigate this unexpected phenomenon, we designed extensive experiments to validate several hypothetical explanations. Our analysis uncovers a fundamental hybrid mechanism of explicit-implicit reasoning driving CoT's performance in pattern-based ICL: while explicit reasoning falters due to LLMs' struggles to infer underlying patterns from demonstrations, implicit reasoning, disrupted by the increased contextual distance of CoT rationales, often compensates, delivering correct answers despite flawed rationales. This hybrid mechanism explains CoT's relative underperformance, as noise from weak explicit inference undermines the process, even as implicit mechanisms partially salvage outcomes. Notably, even long-CoT reasoning models, which excel in abstract and symbolic reasoning, fail to fully overcome these limitations despite higher computational costs. Our findings challenge existing assumptions regarding the universal efficacy of CoT, yielding novel insights into its limitations and guiding future research toward more nuanced and effective reasoning methodologies for LLMs.
Submitted 1 November, 2025; v1 submitted 7 April, 2025;
originally announced April 2025.
-
Transforming Future Data Center Operations and Management via Physical AI
Authors:
Zhiwei Cao,
Minghao Li,
Feng Lin,
Jimin Jia,
Yonggang Wen,
Jianxiong Yin,
Simon See
Abstract:
Data centers (DCs) as mission-critical infrastructures are pivotal in powering the growth of artificial intelligence (AI) and the digital economy. The evolution from Internet DC to AI DC has introduced new challenges in operating and managing data centers for improved business resilience and reduced total cost of ownership. As a result, new paradigms, beyond the traditional approaches based on best practices, must be developed for future data centers. In this research, we propose and develop a novel Physical AI (PhyAI) framework for advancing DC operations and management. Our system leverages the emerging capabilities of state-of-the-art industrial products and our in-house research and development. Specifically, it presents three core modules, namely: 1) an industry-grade in-house simulation engine to simulate DC operations in a highly accurate manner, 2) an AI engine built upon NVIDIA PhysicsNemo for the training and evaluation of physics-informed machine learning (PIML) models, and 3) a digital twin platform built upon NVIDIA Omniverse for our proposed 5-tier digital twin framework. This system presents a scalable and adaptable solution to digitalize, optimize, and automate future data center operations and management, by enabling real-time digital twins for future data centers. To illustrate its effectiveness, we present a compelling case study on building a surrogate model for predicting the thermal and airflow profiles of a large-scale DC in a real-time manner. Our results demonstrate its superior performance over traditional time-consuming Computational Fluid Dynamics/Heat Transfer (CFD/HT) simulation, with a median absolute temperature prediction error of 0.18 °C. This emerging approach would open doors to several potential research directions for advancing Physical AI in future DC operations.
Submitted 15 April, 2025; v1 submitted 7 April, 2025;
originally announced April 2025.
-
M3Net: Multimodal Multi-task Learning for 3D Detection, Segmentation, and Occupancy Prediction in Autonomous Driving
Authors:
Xuesong Chen,
Shaoshuai Shi,
Tao Ma,
Jingqiu Zhou,
Simon See,
Ka Chun Cheung,
Hongsheng Li
Abstract:
The perception system for autonomous driving generally needs to handle multiple diverse sub-tasks. However, current algorithms typically tackle individual sub-tasks separately, which leads to low efficiency when aiming at obtaining full-perception results. Some multi-task learning methods try to unify multiple tasks with one model, but do not solve the conflicts in multi-task learning. In this paper, we introduce M3Net, a novel multimodal and multi-task network that simultaneously tackles detection, segmentation, and 3D occupancy prediction for autonomous driving and achieves superior performance over single-task models. M3Net takes multimodal data as input and handles multiple tasks via query-token interactions. To enhance the integration of multi-modal features for multi-task learning, we first propose the Modality-Adaptive Feature Integration (MAFI) module, which enables single-modality features to predict channel-wise attention weights for their high-performing tasks, respectively. Based on the integrated features, we then develop task-specific query initialization strategies to accommodate the needs of detection/segmentation and 3D occupancy prediction. Leveraging the properly initialized queries, a shared decoder transforms queries and BEV features layer-wise, facilitating multi-task learning. Furthermore, we propose a Task-oriented Channel Scaling (TCS) module in the decoder to mitigate conflicts between optimizing for different tasks. Additionally, our proposed multi-task querying and TCS module support both Transformer-based and Mamba-based decoders, demonstrating flexibility across different architectures. M3Net achieves state-of-the-art multi-task learning performance on the nuScenes benchmarks.
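The channel-wise attention weights that MAFI predicts can be pictured with a standard SE-style gate: globally pool a modality's feature map, pass the descriptor through a small bottleneck, and rescale channels with sigmoid weights. A minimal NumPy sketch; shapes, the reduction ratio, and the weight matrices are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_gate(feat, W1, W2):
    """SE-style channel attention: predict per-channel weights from a
    globally pooled descriptor and rescale the feature map.
    feat: (C, H, W); W1: (C//r, C) bottleneck; W2: (C, C//r) expansion."""
    pooled = feat.mean(axis=(1, 2))            # (C,) global average pool
    h = np.maximum(W1 @ pooled, 0.0)           # bottleneck + ReLU
    w = 1.0 / (1.0 + np.exp(-(W2 @ h)))        # sigmoid channel weights in (0, 1)
    return feat * w[:, None, None], w

C, H, W, r = 16, 8, 8, 4
feat = rng.standard_normal((C, H, W)).astype(np.float32)
W1 = rng.standard_normal((C // r, C)).astype(np.float32) * 0.1
W2 = rng.standard_normal((C, C // r)).astype(np.float32) * 0.1
gated, weights = channel_gate(feat, W1, W2)
```

In M3Net the analogous weights are predicted per modality and per task; the sketch shows only the gating mechanics.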
Submitted 23 March, 2025;
originally announced March 2025.
-
Unified Locomotion Transformer with Simultaneous Sim-to-Real Transfer for Quadrupeds
Authors:
Dikai Liu,
Tianwei Zhang,
Jianxiong Yin,
Simon See
Abstract:
Quadrupeds have advanced rapidly in their capability to traverse complex terrains. The adoption of deep Reinforcement Learning (RL), transformers, and various knowledge transfer techniques can greatly reduce the sim-to-real gap. However, the classical teacher-student framework commonly used in existing locomotion policies requires a pre-trained teacher and leverages privileged information to guide the student policy. With the implementation of large-scale models in robotics controllers, especially transformer-based ones, this knowledge distillation technique starts to show its weakness in efficiency, due to the requirement of multiple supervised stages. In this paper, we propose the Unified Locomotion Transformer (ULT), a new transformer-based framework that unifies the processes of knowledge transfer and policy optimization in a single network while still taking advantage of privileged information. The policies are optimized with reinforcement learning, next state-action prediction, and action imitation, all in just one training stage, to achieve zero-shot deployment. Evaluation results demonstrate that with ULT, optimal teacher and student policies can be obtained at the same time, greatly easing the difficulty of knowledge transfer, even with complex transformer-based models.
Submitted 3 August, 2025; v1 submitted 11 March, 2025;
originally announced March 2025.
-
XAI4Extremes: An interpretable machine learning framework for understanding extreme-weather precursors under climate change
Authors:
Jiawen Wei,
Aniruddha Bora,
Vivek Oommen,
Chenyu Dong,
Juntao Yang,
Jeff Adie,
Chen Chen,
Simon See,
George Karniadakis,
Gianmarco Mengaldo
Abstract:
Extreme weather events are increasing in frequency and intensity due to climate change. This, in turn, is exacting a significant toll on communities worldwide. While prediction skills are increasing with advances in numerical weather prediction and artificial intelligence tools, extreme weather still presents challenges. More specifically, identifying the precursors of such extreme weather events, and how these precursors may evolve under climate change, remains unclear. In this paper, we propose to use post-hoc interpretability methods to construct relevance weather maps that show the key extreme-weather precursors identified by deep learning models. We then compare this machine view with existing domain knowledge to understand whether deep learning models identified patterns in data that may enrich our understanding of extreme-weather precursors. We finally bin these relevance maps into different multi-year time periods to understand the role that climate change is having on these precursors. The experiments are carried out on Indochina heatwaves, but the methodology can be readily extended to other extreme weather events worldwide.
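A post-hoc relevance map of this kind can be approximated, in the simplest case, by perturbing each input cell and measuring the change in the model's scalar output. A toy finite-difference sketch; the model `f` and the 2D field are hypothetical stand-ins for the paper's deep learning models and gridded weather fields:

```python
import numpy as np

def relevance_map(f, x, eps=1e-3):
    """Finite-difference relevance: sensitivity of the scalar prediction
    f(x) to each input cell. A generic post-hoc stand-in for the
    gradient-based relevance maps used in interpretability work."""
    base = f(x)
    rel = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        xp = x.copy()
        xp[idx] += eps                      # perturb one cell
        rel[idx] = (f(xp) - base) / eps     # response per unit perturbation
    return rel

# Toy "model": responds only to a hot patch in a 2D anomaly field
x = np.zeros((6, 6))
x[2:4, 2:4] = 1.0
f = lambda z: float(z[2:4, 2:4].sum())
rel = relevance_map(f, x)   # nonzero exactly on the patch the model uses
```

Real applications would replace the finite differences with backpropagated gradients or layer-wise relevance propagation, but the interpretation of the resulting map is the same.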
Submitted 11 March, 2025;
originally announced March 2025.
-
CondensNet: Enabling stable long-term climate simulations via hybrid deep learning models with adaptive physical constraints
Authors:
Xin Wang,
Juntao Yang,
Jeff Adie,
Simon See,
Kalli Furtado,
Chen Chen,
Troy Arcomano,
Romit Maulik,
Gianmarco Mengaldo
Abstract:
Accurate and efficient climate simulations are crucial for understanding Earth's evolving climate. However, current general circulation models (GCMs) face challenges in capturing unresolved physical processes, such as clouds and convection. A common solution is to adopt cloud-resolving models, which provide more accurate results than the standard subgrid parametrisation schemes typically used in GCMs. However, cloud-resolving models, also referred to as super-parameterizations, remain computationally prohibitive. Hybrid modeling, which integrates deep learning with equation-based GCMs, offers a promising alternative but often struggles with long-term stability and accuracy issues. In this work, we find that water vapor oversaturation during condensation is a key factor compromising the stability of hybrid models. To address this, we introduce CondensNet, a novel neural network architecture that embeds a self-adaptive physical constraint to correct unphysical condensation processes. CondensNet effectively mitigates water vapor oversaturation, enhancing simulation stability while maintaining accuracy and improving computational efficiency compared to super-parameterization schemes.
We integrate CondensNet into a GCM to form PCNN-GCM (Physics-Constrained Neural Network GCM), a hybrid deep learning framework designed for long-term stable climate simulations in real-world conditions, including ocean and land. PCNN-GCM represents a significant milestone in hybrid climate modeling, as it shows a novel way to incorporate physical constraints adaptively, paving the way for accurate, lightweight, and stable long-term climate simulations.
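The oversaturation problem can be illustrated by the simplest possible fix: clamp a predicted specific humidity to its saturation value, here computed with the textbook Magnus/Tetens approximation. This is an illustrative hard constraint, not CondensNet's learned self-adaptive one:

```python
import numpy as np

def q_sat(T, p):
    """Saturation specific humidity (kg/kg) via the Magnus/Tetens
    approximation. T in kelvin, p in Pa."""
    Tc = T - 273.15
    e_s = 610.94 * np.exp(17.625 * Tc / (Tc + 243.04))   # saturation vapour pressure, Pa
    return 0.622 * e_s / (p - 0.378 * e_s)

def constrain_humidity(q_pred, T, p):
    """Clamp a model's predicted specific humidity so the state can
    never exceed saturation (i.e. forbid unphysical oversaturation)."""
    return np.minimum(q_pred, q_sat(T, p))

T = np.array([300.0, 280.0])        # K
p = np.array([101325.0, 85000.0])   # Pa
q_pred = np.array([0.03, 0.001])    # first value is oversaturated
corrected = constrain_humidity(q_pred, T, p)
```

A hybrid model would apply such a correction to the neural network's output before feeding the state back into the GCM; CondensNet's contribution is making the constraint adaptive rather than a fixed clamp.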
Submitted 18 February, 2025;
originally announced February 2025.
-
LogiDynamics: Unraveling the Dynamics of Inductive, Abductive and Deductive Logical Inferences in LLM Reasoning
Authors:
Tianshi Zheng,
Jiayang Cheng,
Chunyang Li,
Haochen Shi,
Zihao Wang,
Jiaxin Bai,
Yangqiu Song,
Ginny Y. Wong,
Simon See
Abstract:
Modern large language models (LLMs) employ diverse logical inference mechanisms for reasoning, making the strategic optimization of these approaches critical for advancing their capabilities. This paper systematically investigates the comparative dynamics of inductive (System 1) versus abductive/deductive (System 2) inference in LLMs. We utilize a controlled analogical reasoning environment, varying modality (textual, visual, symbolic), difficulty, and task format (MCQ / free-text). Our analysis reveals that System 2 pipelines generally excel, particularly in visual/symbolic modalities and harder tasks, while System 1 is competitive for textual and easier problems. Crucially, task format significantly influences their relative advantage, with System 1 sometimes outperforming System 2 in free-text rule-execution. These core findings generalize to broader in-context learning. Furthermore, we demonstrate that advanced System 2 strategies like hypothesis selection and iterative refinement can substantially scale LLM reasoning. This study offers foundational insights and actionable guidelines for strategically deploying logical inference to enhance LLM reasoning. Resources are available at https://github.com/HKUST-KnowComp/LogiDynamics.
Submitted 17 September, 2025; v1 submitted 16 February, 2025;
originally announced February 2025.
-
Feature-based Graph Attention Networks Improve Online Continual Learning
Authors:
Adjovi Sim,
Zhengkui Wang,
Aik Beng Ng,
Shalini De Mello,
Simon See,
Wonmin Byeon
Abstract:
Online continual learning for image classification is crucial for models to adapt to new data while retaining knowledge of previously learned tasks. This capability is essential to address real-world challenges involving dynamic environments and evolving data distributions. Traditional approaches predominantly employ Convolutional Neural Networks, which are limited to processing images as grids and primarily capture local patterns rather than relational information. Although the emergence of transformer architectures has improved the ability to capture relationships, these models often require significantly larger resources. In this paper, we present a novel online continual learning framework based on Graph Attention Networks (GATs), which effectively capture contextual relationships and dynamically update the task-specific representation via learned attention weights. Our approach utilizes a pre-trained feature extractor to convert images into graphs using hierarchical feature maps, representing information at varying levels of granularity. These graphs are then processed by a GAT, and an enhanced global pooling strategy is incorporated to improve classification performance for continual learning. In addition, we propose a rehearsal memory duplication technique that improves the representation of previous tasks while maintaining the memory budget. Comprehensive evaluations on benchmark datasets, including SVHN, CIFAR10, CIFAR100, and MiniImageNet, demonstrate the superiority of our method compared to state-of-the-art methods.
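The image-to-graph step can be sketched as follows: treat each spatial location of an extracted feature map as a node and connect each node to its k nearest neighbours by cosine similarity. The choice of k and the similarity measure are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

def feature_map_to_graph(fmap, k=4):
    """Turn a (H, W, C) feature map into node features plus a symmetric
    k-nearest-neighbour adjacency (cosine similarity), the kind of
    graph a GAT could then attend over."""
    H, W, C = fmap.shape
    nodes = fmap.reshape(H * W, C)
    normed = nodes / (np.linalg.norm(nodes, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T                 # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)          # exclude self-loops from top-k
    nbrs = np.argsort(-sim, axis=1)[:, :k]  # k most similar nodes per row
    adj = np.zeros((H * W, H * W), dtype=bool)
    rows = np.repeat(np.arange(H * W), k)
    adj[rows, nbrs.ravel()] = True
    adj |= adj.T                            # symmetrise
    return nodes, adj

fmap = np.random.default_rng(0).standard_normal((8, 8, 32)).astype(np.float32)
nodes, adj = feature_map_to_graph(fmap)
```

In the hierarchical setting of the paper, graphs like this would be built at several feature-map resolutions and passed to the GAT.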
Submitted 13 February, 2025;
originally announced February 2025.
-
Rydberg Atomic Quantum Receivers for the Multi-User MIMO Uplink
Authors:
Tierui Gong,
Chau Yuen,
Chong Meng Samson See,
Mérouane Debbah,
Lajos Hanzo
Abstract:
Rydberg atomic quantum receivers exhibit great potential in assisting classical wireless communications due to their outstanding advantages in detecting radio frequency signals. To realize this potential, we integrate a Rydberg atomic quantum receiver into a classical multi-user multiple-input multiple-output (MIMO) scheme to form a multi-user Rydberg atomic quantum MIMO (RAQ-MIMO) system for the uplink. To study this system, we first construct an equivalent baseband signal model, which facilitates convenient system design, signal processing and optimizations. We then study the ergodic achievable rates under both the maximum ratio combining (MRC) and zero-forcing (ZF) schemes by deriving their tight lower bounds. We next compare the ergodic achievable rates of the RAQ-MIMO and the conventional massive MIMO schemes by offering a closed-form expression for the difference of their ergodic achievable rates, which allows us to directly compare the two systems. Our results show that RAQ-MIMO allows the average transmit power of users to be $> 25$ dBm lower than that of the conventional massive MIMO. Viewed from a different perspective, an extra $\sim 8.8$ bits/s/Hz/user rate becomes achievable by ZF RAQ-MIMO.
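The MRC and ZF combiners analysed here follow the standard uplink formulas: MRC combines with the channel itself, while ZF inverts the Gram matrix to cancel inter-user interference. A small NumPy sketch of the per-user SINR and sum rate under a classical Rayleigh channel; the atomic-receiver signal model itself is not reproduced:

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 32, 4                  # receive elements, single-antenna uplink users
p = 1.0                       # per-user transmit SNR (noise power normalised to 1)

# i.i.d. Rayleigh uplink channel, unit average gain per entry
H = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)

def sum_rate(Wc, H, p):
    """Uplink sum rate (bits/s/Hz) for combining matrix Wc (M x K);
    column k of Wc is user k's combiner."""
    G = Wc.conj().T @ H                       # (K, K) effective channel
    sig = p * np.abs(np.diag(G)) ** 2         # desired-signal power
    interf = p * (np.abs(G) ** 2).sum(axis=1) - sig
    noise = (np.abs(Wc) ** 2).sum(axis=0)     # combined noise power per user
    return float(np.log2(1 + sig / (interf + noise)).sum())

W_mrc = H                                     # maximum ratio combining
W_zf = H @ np.linalg.inv(H.conj().T @ H)      # zero-forcing combining
r_mrc = sum_rate(W_mrc, H, p)
r_zf = sum_rate(W_zf, H, p)
```

The paper's contribution is the lower transmit power the atomic receiver permits for the same rates; the combining algebra on top is the classical one above.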
Submitted 28 February, 2025; v1 submitted 30 January, 2025;
originally announced January 2025.
-
Harnessing Rydberg Atomic Receivers: From Quantum Physics to Wireless Communications
Authors:
Yuanbin Chen,
Xufeng Guo,
Chau Yuen,
Yufei Zhao,
Yong Liang Guan,
Chong Meng Samson See,
Mérouane Debbah,
Lajos Hanzo
Abstract:
The intrinsic integration of Rydberg atomic receivers into wireless communication systems is proposed, by harnessing the principles of quantum physics in wireless communications. More particularly, we conceive a pair of Rydberg atomic receivers: one incorporates a local oscillator (LO), referred to as an LO-dressed receiver, while the other operates without an LO and is termed an LO-free receiver. The appropriate wireless model is developed for each configuration, elaborating on the receiver's responses to the radio frequency (RF) signal, on the potential noise sources, and on the signal-to-noise ratio (SNR) performance. The developed wireless model conforms to the classical RF framework, facilitating compatibility with established signal processing methodologies. Next, we investigate the associated distortion effects that might occur, specifically identifying the conditions under which distortion arises and demonstrating the boundaries of linear dynamic ranges. This provides critical insights into its practical implementation in wireless systems. Finally, extensive simulation results are provided for characterizing the performance of wireless systems harnessing this pair of Rydberg atomic receivers. Our results demonstrate that LO-dressed systems achieve a significant SNR gain of approximately 40 to 50 dB over conventional RF receivers in the standard quantum limit regime. This SNR headroom translates into reduced symbol error rates, enabling efficient and reliable transmission with higher-order constellations.
Submitted 30 July, 2025; v1 submitted 20 January, 2025;
originally announced January 2025.
-
Rydberg Atomic Quantum Receivers for Multi-Target DOA Estimation
Authors:
Tierui Gong,
Chau Yuen,
Chong Meng Samson See,
Mérouane Debbah,
Lajos Hanzo
Abstract:
Quantum sensing technologies have experienced rapid progress since entering the `second quantum revolution'. Among various candidates, schemes relying on Rydberg atoms exhibit compelling advantages for detecting radio frequency signals. Based on this, Rydberg atomic quantum receivers (RAQRs) have emerged as a promising solution for classical wireless communication and sensing. To harness the advantages and exploit the potential of RAQRs in wireless sensing, we investigate the realization of direction of arrival (DOA) estimation by RAQRs. Specifically, we first conceive a Rydberg atomic quantum uniform linear array (RAQ-ULA) aided wireless receiver for multi-target DOA detection and propose the corresponding signal model of this sensing system. Our model reveals that the presence of the radio-frequency local oscillator in the RAQ-ULA creates sensor gain mismatches, which degrade the DOA estimation significantly when employing the classical Estimation of Signal Parameters via Rotational Invariance Techniques (ESPRIT). To solve this sensor gain mismatch problem, we propose the Rydberg atomic quantum ESPRIT (RAQ-ESPRIT) relying on our model. Lastly, we characterize our scheme through numerical simulations, where the results show that it is capable of reducing the estimation error relative to its classical counterpart by more than $400$-fold and $9000$-fold in the photon shot limit (PSL) and standard quantum limit (SQL) regimes, respectively.
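RAQ-ESPRIT builds on classical ESPRIT, which exploits the rotational invariance between the two overlapping subarrays of a uniform linear array. A minimal sketch of the classical baseline on simulated data; this is the textbook algorithm, not the paper's gain-mismatch-corrected variant:

```python
import numpy as np

rng = np.random.default_rng(0)

def esprit_ula_doa(X, n_sources, d_over_lambda=0.5):
    """Classical ESPRIT DOA estimation for a uniform linear array.
    X: (n_sensors, n_snapshots) complex snapshots. Returns sorted DOAs (deg)."""
    M, N = X.shape
    R = X @ X.conj().T / N                      # sample covariance
    _, eigvecs = np.linalg.eigh(R)              # eigenvalues ascending
    Es = eigvecs[:, -n_sources:]                # signal subspace
    # Rotational invariance between the two overlapping subarrays
    Psi = np.linalg.pinv(Es[:-1]) @ Es[1:]
    phases = np.angle(np.linalg.eigvals(Psi))   # 2*pi*(d/lambda)*sin(theta)
    theta = np.degrees(np.arcsin(phases / (2 * np.pi * d_over_lambda)))
    return np.sort(theta)

# Two narrowband sources impinging on an 8-element half-wavelength ULA
M, N, d = 8, 400, 0.5
true_doas = np.array([-20.0, 30.0])
m = np.arange(M)[:, None]
A = np.exp(2j * np.pi * d * m * np.sin(np.radians(true_doas)))  # steering matrix
S = (rng.standard_normal((2, N)) + 1j * rng.standard_normal((2, N))) / np.sqrt(2)
noise = 0.05 * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))
X = A @ S + noise
est = esprit_ula_doa(X, n_sources=2)   # close to [-20, 30]
```

The sensor gain mismatches described in the abstract would multiply each row of `X` by an unknown complex gain, which breaks the shift-invariance that `Psi` relies on; that is the failure mode RAQ-ESPRIT corrects.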
Submitted 11 December, 2025; v1 submitted 6 January, 2025;
originally announced January 2025.
-
Meme Trojan: Backdoor Attacks Against Hateful Meme Detection via Cross-Modal Triggers
Authors:
Ruofei Wang,
Hongzhan Lin,
Ziyuan Luo,
Ka Chun Cheung,
Simon See,
Jing Ma,
Renjie Wan
Abstract:
Hateful meme detection aims to prevent the proliferation of hateful memes on various social media platforms. Considering its impact on social environments, this paper introduces a previously ignored but significant threat to hateful meme detection: backdoor attacks. By injecting specific triggers into meme samples, backdoor attackers can manipulate the detector to output their desired outcomes. To explore this, we propose the Meme Trojan framework to initiate backdoor attacks on hateful meme detection. Meme Trojan involves creating a novel Cross-Modal Trigger (CMT) and a learnable trigger augmentor to enhance the trigger pattern according to each input sample. Due to the cross-modal property, the proposed CMT can effectively initiate backdoor attacks on hateful meme detectors under an automatic application scenario. Additionally, the injection position and size of our triggers are adaptive to the texts contained in the meme, which ensures that the trigger is seamlessly integrated with the meme content. Our approach outperforms the state-of-the-art backdoor attack methods, showing significant improvements in effectiveness and stealthiness. We believe that this paper will draw more attention to the potential threat posed by backdoor attacks on hateful meme detection.
Submitted 19 December, 2024;
originally announced December 2024.
-
Enhancing Modality Representation and Alignment for Multimodal Cold-start Active Learning
Authors:
Meng Shen,
Yake Wei,
Jianxiong Yin,
Deepu Rajan,
Di Hu,
Simon See
Abstract:
Training multimodal models requires a large amount of labeled data. Active learning (AL) aims to reduce labeling costs. Most AL methods employ warm-start approaches, which rely on sufficient labeled data to train a well-calibrated model that can assess the uncertainty and diversity of unlabeled data. However, when assembling a dataset, labeled data are often scarce initially, leading to a cold-start problem. Additionally, most AL methods seldom address multimodal data, highlighting a research gap in this field. Our research addresses these issues by developing a two-stage method for Multi-Modal Cold-Start Active Learning (MMCSAL).
Firstly, we observe the modality gap, a significant distance between the centroids of representations from different modalities, when only using cross-modal pairing information as self-supervision signals. This modality gap affects the data selection process, as we calculate both uni-modal and cross-modal distances. To address this, we introduce uni-modal prototypes to bridge the modality gap. Secondly, conventional AL methods often falter in multimodal scenarios where alignment between modalities is overlooked. Therefore, we propose enhancing cross-modal alignment through regularization, thereby improving the quality of selected multimodal data pairs in AL. Finally, our experiments demonstrate MMCSAL's efficacy in selecting multimodal data pairs across three multimodal datasets.
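The modality gap is directly measurable as the distance between per-modality centroids. A toy sketch with synthetic unit-sphere embeddings; the centroid subtraction at the end is shown only as a naive baseline, whereas the paper instead bridges the gap with uni-modal prototypes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings for paired image/text data on the unit sphere.
# The constant offset on the text branch mimics the systematic shift
# that cross-modal pairing alone leaves between the two encoders.
img = rng.standard_normal((512, 128))
txt = rng.standard_normal((512, 128)) + 0.5
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

def modality_gap(a, b):
    """Distance between modality centroids (the 'modality gap')."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

gap = modality_gap(img, txt)                          # large: modalities separated
gap_centered = modality_gap(img - img.mean(axis=0),
                            txt - txt.mean(axis=0))   # zero by construction
```

The point of the measurement is that naive cross-modal distances computed across this gap are biased, which is why a correction (prototypes, centering, or similar) is needed before data selection.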
Submitted 12 December, 2024;
originally announced December 2024.
-
Rydberg Atomic Quantum Receivers for Classical Wireless Communications and Sensing: Their Models and Performance
Authors:
Tierui Gong,
Jiaming Sun,
Chau Yuen,
Guangwei Hu,
Yufei Zhao,
Yong Liang Guan,
Chong Meng Samson See,
Mérouane Debbah,
Lajos Hanzo
Abstract:
The significant progress of quantum sensing technologies offers numerous radical solutions for measuring a multitude of physical quantities at an unprecedented precision. Among them, Rydberg atomic quantum receivers (RAQRs) emerge as an eminent solution for detecting the electric field of radio frequency (RF) signals, exhibiting great potential in assisting classical wireless communications and sensing. So far, most experimental studies have aimed at the proof of physical concepts to reveal its promise, while the practical signal model of RAQR-aided wireless communications and sensing has remained under-explored. Furthermore, the performance of RAQR-based wireless receivers and their advantages over classical RF receivers have not been fully characterized. To fill these gaps, we introduce the RAQR to the wireless community by presenting an end-to-end reception scheme. We then develop a corresponding equivalent baseband signal model relying on a realistic reception flow. Our scheme and model provide explicit design guidance for RAQR-aided wireless systems. We next study the performance of RAQR-aided wireless systems based on our model, and compare them to classical RF receivers. The results show that the RAQR is capable of achieving a substantial received signal-to-noise ratio (SNR) gain of over $27$ decibels (dB) and $40$ dB in the photon shot limit regime and the standard quantum limit regime, respectively.
Submitted 13 May, 2025; v1 submitted 7 December, 2024;
originally announced December 2024.
-
GaussianMarker: Uncertainty-Aware Copyright Protection of 3D Gaussian Splatting
Authors:
Xiufeng Huang,
Ruiqi Li,
Yiu-ming Cheung,
Ka Chun Cheung,
Simon See,
Renjie Wan
Abstract:
3D Gaussian Splatting (3DGS) has become a crucial method for acquiring 3D assets. To protect the copyright of these assets, digital watermarking techniques can be applied to embed ownership information discreetly within 3DGS models. However, existing watermarking methods for meshes, point clouds, and implicit radiance fields cannot be directly applied to 3DGS models, as 3DGS models use explicit 3D Gaussians with distinct structures and do not rely on neural networks. Naively embedding the watermark on a pre-trained 3DGS can cause obvious distortion in rendered images. In our work, we propose an uncertainty-based method that constrains the perturbation of model parameters to achieve invisible watermarking for 3DGS. At the message decoding stage, the copyright messages can be reliably extracted from both 3D Gaussians and 2D rendered images even under various forms of 3D and 2D distortions. We conduct extensive experiments on the Blender, LLFF and MipNeRF-360 datasets to validate the effectiveness of our proposed method, demonstrating state-of-the-art performance on both message decoding accuracy and view synthesis quality.
Submitted 31 October, 2024;
originally announced October 2024.
-
Geometry Cloak: Preventing TGS-based 3D Reconstruction from Copyrighted Images
Authors:
Qi Song,
Ziyuan Luo,
Ka Chun Cheung,
Simon See,
Renjie Wan
Abstract:
Single-view 3D reconstruction methods like Triplane Gaussian Splatting (TGS) have enabled high-quality 3D model generation from just a single image input within seconds. However, this capability raises concerns about potential misuse, where malicious users could exploit TGS to create unauthorized 3D models from copyrighted images. To prevent such infringement, we propose a novel image protection approach that embeds invisible geometry perturbations, termed "geometry cloaks", into images before supplying them to TGS. These carefully crafted perturbations encode a customized message that is revealed when TGS attempts 3D reconstructions of the cloaked image. Unlike conventional adversarial attacks that simply degrade output quality, our method forces TGS to fail the 3D reconstruction in a specific way - by generating an identifiable customized pattern that acts as a watermark. This watermark allows copyright holders to assert ownership over any attempted 3D reconstructions made from their protected images. Extensive experiments have verified the effectiveness of our geometry cloak. Our project is available at https://qsong2001.github.io/geometry_cloak.
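At heart, a geometry cloak is a bounded targeted perturbation: nudge the input within a small budget so the downstream model reproduces a chosen pattern rather than merely degrading. A generic projected-sign-descent sketch on a linear stand-in "reconstructor"; TGS itself and the paper's cloak optimization are not modeled here:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 16
W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # linear stand-in reconstructor
x = rng.standard_normal(d_in)                           # "image"
target = rng.standard_normal(d_out)                     # pattern the cloak should force

eps, alpha = 0.1, 0.01        # perturbation budget (L-infinity) and step size
delta = np.zeros(d_in)
for _ in range(200):
    r = W @ (x + delta) - target          # residual toward the target pattern
    grad = 2 * W.T @ r                    # gradient of the squared error
    delta = np.clip(delta - alpha * np.sign(grad), -eps, eps)  # bounded step

err_before = np.linalg.norm(W @ x - target)
err_after = np.linalg.norm(W @ (x + delta) - target)   # driven closer to the target
```

The paper's cloak plays the same game against a full 3D reconstruction pipeline, with the target chosen to be an identifiable watermark pattern.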
Submitted 30 October, 2024;
originally announced October 2024.
-
On-Device LLMs for SMEs: Challenges and Opportunities
Authors:
Jeremy Stephen Gabriel Yee,
Pai Chet Ng,
Zhengkui Wang,
Ian McLoughlin,
Aik Beng Ng,
Simon See
Abstract:
This paper presents a systematic review of the infrastructure requirements for deploying Large Language Models (LLMs) on-device within the context of small and medium-sized enterprises (SMEs), focusing on both hardware and software perspectives. From the hardware viewpoint, we discuss the utilization of processing units like GPUs and TPUs, efficient memory and storage solutions, and strategies for effective deployment, addressing the challenges of limited computational resources typical in SME settings. From the software perspective, we explore framework compatibility, operating system optimization, and the use of specialized libraries tailored for resource-constrained environments. The review is structured to first identify the unique challenges faced by SMEs in deploying LLMs on-device, followed by an exploration of the opportunities that both hardware innovations and software adaptations offer to overcome these obstacles. Such a structured review provides practical insights, contributing significantly to the community by enhancing the technological resilience of SMEs in integrating LLMs.
Submitted 22 October, 2024; v1 submitted 21 October, 2024;
originally announced October 2024.
-
A Multimodal Vision Foundation Model for Clinical Dermatology
Authors:
Siyuan Yan,
Zhen Yu,
Clare Primiero,
Cristina Vico-Alonso,
Zhonghua Wang,
Litao Yang,
Philipp Tschandl,
Ming Hu,
Lie Ju,
Gin Tan,
Vincent Tang,
Aik Beng Ng,
David Powell,
Paul Bonnington,
Simon See,
Elisabetta Magnaterra,
Peter Ferguson,
Jennifer Nguyen,
Pascale Guitera,
Jose Banuls,
Monika Janda,
Victoria Mar,
Harald Kittler,
H. Peter Soyer,
Zongyuan Ge
Abstract:
Diagnosing and treating skin diseases require advanced visual skills across domains and the ability to synthesize information from multiple imaging modalities. While current deep learning models excel at specific tasks like skin cancer diagnosis from dermoscopic images, they struggle to meet the complex, multimodal requirements of clinical practice. Here, we introduce PanDerm, a multimodal dermatology foundation model pretrained through self-supervised learning on over 2 million real-world skin disease images from 11 clinical institutions across 4 imaging modalities. We evaluated PanDerm on 28 diverse benchmarks, including skin cancer screening, risk stratification, differential diagnosis of common and rare skin conditions, lesion segmentation, longitudinal monitoring, and metastasis prediction and prognosis. PanDerm achieved state-of-the-art performance across all evaluated tasks, often outperforming existing models when using only 10% of labeled data. We conducted three reader studies to assess PanDerm's potential clinical utility. PanDerm outperformed clinicians by 10.2% in early-stage melanoma detection through longitudinal analysis, improved clinicians' skin cancer diagnostic accuracy by 11% on dermoscopy images, and enhanced non-dermatologist healthcare providers' differential diagnosis by 16.5% across 128 skin conditions on clinical photographs. These results demonstrate PanDerm's potential to improve patient care across diverse clinical scenarios and serve as a model for developing multimodal foundation models in other medical specialties, potentially accelerating the integration of AI support in healthcare. The code can be found at https://github.com/SiyuanYan1/PanDerm.
Submitted 13 April, 2025; v1 submitted 19 October, 2024;
originally announced October 2024.
-
Persona Knowledge-Aligned Prompt Tuning Method for Online Debate
Authors:
Chunkit Chan,
Cheng Jiayang,
Xin Liu,
Yauwai Yim,
Yuxin Jiang,
Zheye Deng,
Haoran Li,
Yangqiu Song,
Ginny Y. Wong,
Simon See
Abstract:
Debate is the process of exchanging viewpoints or convincing others on a particular issue. Recent research has provided empirical evidence that the persuasiveness of an argument is determined not only by language usage but also by communicator characteristics. Researchers have paid much attention to aspects of language, such as linguistic features and discourse structures, but combining argument persuasiveness and impact with the social personae of the audience has not been explored due to its difficulty and complexity. We have observed the impressive simulation and personification capability of ChatGPT, indicating that a giant pre-trained language model may function as an individual to provide personae and exert unique influences based on diverse background knowledge. Therefore, we propose a persona knowledge-aligned framework for argument quality assessment tasks from the audience side. This is the first work that leverages the emergence of ChatGPT and injects such audience persona knowledge into smaller language models via prompt tuning. The performance of our pipeline demonstrates significant and consistent improvement compared to competitive architectures.
Submitted 5 October, 2024;
originally announced October 2024.
-
Rydberg Atomic Quantum Receivers for Classical Wireless Communication and Sensing
Authors:
Tierui Gong,
Aveek Chandra,
Chau Yuen,
Yong Liang Guan,
Rainer Dumke,
Chong Meng Samson See,
Mérouane Debbah,
Lajos Hanzo
Abstract:
Rydberg atomic quantum receivers (RAQRs) are emerging quantum precision sensing platforms designed for receiving radio frequency (RF) signals. They rely on the creation of Rydberg atoms from normal atoms by exciting one or more electrons to a very high energy level, thereby making the atoms sensitive to RF signals. RAQRs realize RF-to-optical conversions based on light-atom interactions relying on the so-called electromagnetically induced transparency (EIT) and Autler-Townes splitting (ATS), so that the desired RF signal can be read out optically. The large dipole moments of Rydberg atoms, associated with rich choices of Rydberg states and various modulation schemes, facilitate an ultra-high sensitivity ($\sim$ nV/cm/$\sqrt{\text{Hz}}$) and an ultra-broadband tunability (direct-current to Terahertz). RAQRs also exhibit compelling scalability and lend themselves to the construction of innovative, compact receivers. Initial experimental studies have demonstrated their capabilities in classical wireless communications and sensing. To fully harness their potential in a wide variety of applications, we commence by outlining the underlying fundamentals of Rydberg atoms, followed by the principles and schemes of RAQRs. Then, we overview the state-of-the-art studies from both the physics and communication societies. Furthermore, we conceive Rydberg atomic quantum single-input single-output (RAQ-SISO) and multiple-input multiple-output (RAQ-MIMO) schemes for facilitating the integration of RAQRs with classical wireless systems. Finally, we conclude with a set of potent research directions.
Submitted 18 January, 2025; v1 submitted 22 September, 2024;
originally announced September 2024.
-
Masked Sensory-Temporal Attention for Sensor Generalization in Quadruped Locomotion
Authors:
Dikai Liu,
Tianwei Zhang,
Jianxiong Yin,
Simon See
Abstract:
With the rising focus on quadrupeds, a generalized policy capable of handling different robot models and sensor inputs becomes highly beneficial. Although several methods have been proposed to address different morphologies, it remains a challenge for learning-based policies to manage various combinations of proprioceptive information. This paper presents Masked Sensory-Temporal Attention (MSTA), a novel transformer-based mechanism with masking for quadruped locomotion. It employs direct sensor-level attention to enhance the sensory-temporal understanding and handle different combinations of sensor data, serving as a foundation for incorporating unseen information. MSTA can effectively understand its states even with a large portion of missing information, and is flexible enough to be deployed on physical systems despite the long input sequence.
Submitted 11 March, 2025; v1 submitted 5 September, 2024;
originally announced September 2024.
-
GeometrySticker: Enabling Ownership Claim of Recolorized Neural Radiance Fields
Authors:
Xiufeng Huang,
Ka Chun Cheung,
Simon See,
Renjie Wan
Abstract:
Remarkable advancements in the recolorization of Neural Radiance Fields (NeRF) have simplified the process of modifying NeRF's color attributes. Yet, with the potential of NeRF to serve as shareable digital assets, there's a concern that malicious users might alter the color of NeRF models and falsely claim the recolorized version as their own. To safeguard against such breaches of ownership, enabling original NeRF creators to establish rights over recolorized NeRF is crucial. While approaches like CopyRNeRF have been introduced to embed binary messages into NeRF models as digital signatures for copyright protection, the process of recolorization can remove these binary messages. In our paper, we present GeometrySticker, a method for seamlessly integrating binary messages into the geometry components of radiance fields, akin to applying a sticker. GeometrySticker can embed binary messages into NeRF models while preserving the effectiveness of these messages against recolorization. Our comprehensive studies demonstrate that GeometrySticker is adaptable to prevalent NeRF architectures and maintains a commendable level of robustness against various distortions. Project page: https://kevinhuangxf.github.io/GeometrySticker/.
Submitted 18 July, 2024;
originally announced July 2024.
-
TCM-FTP: Fine-Tuning Large Language Models for Herbal Prescription Prediction
Authors:
Xingzhi Zhou,
Xin Dong,
Chunhao Li,
Yuning Bai,
Yulong Xu,
Ka Chun Cheung,
Simon See,
Xinpeng Song,
Runshun Zhang,
Xuezhong Zhou,
Nevin L. Zhang
Abstract:
Traditional Chinese medicine (TCM) has relied on specific combinations of herbs in prescriptions to treat various symptoms and signs for thousands of years. Predicting TCM prescriptions poses a fascinating technical challenge with significant practical implications. However, this task faces limitations due to the scarcity of high-quality clinical datasets and the complex relationship between symptoms and herbs. To address these issues, we introduce \textit{DigestDS}, a novel dataset comprising practical medical records from experienced experts in digestive system diseases. We also propose a method, TCM-FTP (TCM Fine-Tuning Pre-trained), to leverage pre-trained large language models (LLMs) via supervised fine-tuning on \textit{DigestDS}. Additionally, we enhance computational efficiency using a low-rank adaptation technique. Moreover, TCM-FTP incorporates data augmentation by permuting herbs within prescriptions, exploiting their order-agnostic nature. Impressively, TCM-FTP achieves an F1-score of 0.8031, significantly outperforming previous methods. Furthermore, it demonstrates remarkable accuracy in dosage prediction, achieving a normalized mean square error of 0.0604. In contrast, LLMs without fine-tuning exhibit poor performance. Although LLMs have demonstrated wide-ranging capabilities, our work underscores the necessity of fine-tuning for TCM prescription prediction and presents an effective way to accomplish this.
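The order-agnostic augmentation the abstract mentions can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code; the function name and herb names are hypothetical:

```python
import random

def permute_prescription(herbs, n_augmented, seed=0):
    """Create augmented training targets by shuffling herb order.

    Herb combinations in a TCM prescription are order-agnostic, so every
    permutation encodes the same prescription and is an equally valid
    training target. (Names here are illustrative, not from the paper.)
    """
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_augmented):
        shuffled = list(herbs)
        rng.shuffle(shuffled)  # in-place shuffle of a fresh copy
        augmented.append(shuffled)
    return augmented

samples = permute_prescription(["ginseng", "licorice", "ginger", "jujube"], 3)
```

Each augmented sample contains exactly the same herbs, only reordered, which multiplies the effective number of supervised fine-tuning examples without fabricating new prescriptions.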
Submitted 12 December, 2024; v1 submitted 15 July, 2024;
originally announced July 2024.
-
Protecting NeRFs' Copyright via Plug-And-Play Watermarking Base Model
Authors:
Qi Song,
Ziyuan Luo,
Ka Chun Cheung,
Simon See,
Renjie Wan
Abstract:
Neural Radiance Fields (NeRFs) have become a key method for 3D scene representation. With the rising prominence and influence of NeRF, safeguarding its intellectual property has become increasingly important. In this paper, we propose \textbf{NeRFProtector}, which adopts a plug-and-play strategy to protect NeRF's copyright during its creation. NeRFProtector utilizes a pre-trained watermarking base model, enabling NeRF creators to embed binary messages directly while creating their NeRF. Our plug-and-play property ensures NeRF creators can flexibly choose NeRF variants without excessive modifications. Leveraging our newly designed progressive distillation, we demonstrate performance on par with several leading-edge neural rendering methods. Our project is available at: \url{https://qsong2001.github.io/NeRFProtector}.
Submitted 10 July, 2024;
originally announced July 2024.
-
Nutrition Estimation for Dietary Management: A Transformer Approach with Depth Sensing
Authors:
Zhengyi Kwan,
Wei Zhang,
Zhengkui Wang,
Aik Beng Ng,
Simon See
Abstract:
Nutrition estimation is crucial for effective dietary management and overall health and well-being. Existing methods often struggle with sub-optimal accuracy and can be time-consuming. In this paper, we propose NuNet, a transformer-based network designed for nutrition estimation that utilizes both RGB and depth information from food images. We have designed and implemented a multi-scale encoder and decoder, along with two types of feature fusion modules, specialized for estimating five nutritional factors. These modules effectively balance the efficiency and effectiveness of feature extraction with flexible usage of our customized attention mechanisms and fusion strategies. Our experimental study shows that NuNet significantly outperforms its variants and existing solutions for nutrition estimation. It achieves an error rate of 15.65%, the lowest known to us, largely due to our multi-scale architecture and fusion modules. This research holds practical value for dietary management, with huge potential for transnational research and deployment, and could inspire other applications involving multiple data types with varying degrees of importance.
Submitted 3 June, 2024;
originally announced June 2024.
-
Maximum Entropy Reinforcement Learning via Energy-Based Normalizing Flow
Authors:
Chen-Hao Chao,
Chien Feng,
Wei-Fang Sun,
Cheng-Kuang Lee,
Simon See,
Chun-Yi Lee
Abstract:
Existing Maximum-Entropy (MaxEnt) Reinforcement Learning (RL) methods for continuous action spaces are typically formulated based on actor-critic frameworks and optimized through alternating steps of policy evaluation and policy improvement. In the policy evaluation steps, the critic is updated to capture the soft Q-function. In the policy improvement steps, the actor is adjusted in accordance with the updated soft Q-function. In this paper, we introduce a new MaxEnt RL framework modeled using Energy-Based Normalizing Flows (EBFlow). This framework integrates the policy evaluation steps and the policy improvement steps, resulting in a single objective training process. Our method enables the calculation of the soft value function used in the policy evaluation target without Monte Carlo approximation. Moreover, this design supports the modeling of multi-modal action distributions while facilitating efficient action sampling. To evaluate the performance of our method, we conducted experiments on the MuJoCo benchmark suite and a number of high-dimensional robotic tasks simulated by Omniverse Isaac Gym. The evaluation results demonstrate that our method achieves superior performance compared to widely-adopted representative baselines.
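For context, the soft Q-function and soft value function referenced in the policy-evaluation target take the standard MaxEnt RL form (temperature $\alpha$, discount $\gamma$; these are textbook definitions, not EBFlow-specific details):

```latex
Q_{\mathrm{soft}}(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s'}\!\left[ V_{\mathrm{soft}}(s') \right],
\qquad
V_{\mathrm{soft}}(s) = \alpha \log \int_{\mathcal{A}} \exp\!\left( Q_{\mathrm{soft}}(s, a) / \alpha \right) \mathrm{d}a
```

The log-partition integral in $V_{\mathrm{soft}}$ is what typically forces Monte Carlo estimation over actions; the abstract's claim is that the energy-based normalizing-flow parameterization makes this quantity computable without such approximation.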
Submitted 26 October, 2024; v1 submitted 22 May, 2024;
originally announced May 2024.
-
Validating Large-Scale Quantum Machine Learning: Efficient Simulation of Quantum Support Vector Machines Using Tensor Networks
Authors:
Kuan-Cheng Chen,
Tai-Yue Li,
Yun-Yuan Wang,
Simon See,
Chun-Chieh Wang,
Robert Wille,
Nan-Yow Chen,
An-Cheng Yang,
Chun-Yu Lin
Abstract:
We present an efficient tensor-network-based approach for simulating large-scale quantum circuits, demonstrated using Quantum Support Vector Machines (QSVMs). Our method effectively reduces exponential runtime growth to near-quadratic scaling with respect to the number of qubits in practical scenarios. Traditional state-vector simulations become computationally infeasible beyond approximately 50 qubits; in contrast, our simulator successfully handles QSVMs with up to 784 qubits, completing simulations within seconds on a single high-performance GPU. Furthermore, by employing the Message Passing Interface (MPI) in multi-GPU environments, the approach shows strong linear scalability, reducing computation time as dataset size increases. We validate the framework on the MNIST and Fashion MNIST datasets, achieving successful multiclass classification and emphasizing the potential of QSVMs for high-dimensional data analysis. By integrating tensor-network techniques with high-performance computing resources, this work demonstrates both the feasibility and scalability of large-qubit quantum machine learning models, providing a valuable validation tool in the emerging Quantum-HPC ecosystem.
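For orientation, the quantity a QSVM needs is a kernel of quantum state overlaps. A brute-force state-vector sketch for a handful of qubits, with a simple, hypothetical angle-encoding feature map, looks like the following; the paper's contribution is precisely to replace this exponentially sized state vector with tensor-network contractions so that hundreds of qubits become tractable:

```python
import numpy as np

def angle_encode(x):
    """Map a feature vector to an n-qubit product state:
    |phi(x)> = tensor_i [cos(x_i/2), sin(x_i/2)] (one RY rotation per qubit).
    The state vector has length 2**n, hence the exponential cost."""
    state = np.array([1.0])
    for xi in x:
        state = np.kron(state, np.array([np.cos(xi / 2), np.sin(xi / 2)]))
    return state

def quantum_kernel_matrix(X):
    """QSVM kernel K[i, j] = |<phi(x_i)|phi(x_j)>|^2 via explicit state vectors."""
    states = np.array([angle_encode(x) for x in X])
    return (states @ states.T) ** 2  # real amplitudes, so overlap is a dot product

X = np.random.default_rng(0).random((5, 4))  # 5 samples, 4 qubits
K = quantum_kernel_matrix(X)
```

The resulting matrix can be fed to any kernel SVM solver; the state-vector step is what blows up beyond roughly 50 qubits and is the target of the tensor-network replacement.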
Submitted 6 January, 2025; v1 submitted 4 May, 2024;
originally announced May 2024.
-
AbsInstruct: Eliciting Abstraction Ability from LLMs through Explanation Tuning with Plausibility Estimation
Authors:
Zhaowei Wang,
Wei Fan,
Qing Zong,
Hongming Zhang,
Sehyun Choi,
Tianqing Fang,
Xin Liu,
Yangqiu Song,
Ginny Y. Wong,
Simon See
Abstract:
Abstraction ability is crucial in human intelligence, which can also benefit various tasks in NLP study. Existing work shows that LLMs are deficient in abstraction ability, and how to improve it remains unexplored. In this work, we design the framework AbsInstruct to enhance LLMs' abstraction ability through instruction tuning. The framework builds instructions with in-depth explanations to assist LLMs in capturing the underlying rationale of abstraction. Meanwhile, we introduce a plausibility estimator to select instructions that are more consistent with the abstraction knowledge of LLMs to be aligned. Then, our framework combines abstraction instructions with general-purpose ones to build a hybrid dataset. Extensive experiments and analyses demonstrate that our framework can considerably enhance LLMs' abstraction ability with strong generalization performance while maintaining their general instruction-following abilities.
Submitted 17 June, 2024; v1 submitted 16 February, 2024;
originally announced February 2024.
-
Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling
Authors:
Xiaoyu Shi,
Zhaoyang Huang,
Fu-Yun Wang,
Weikang Bian,
Dasong Li,
Yi Zhang,
Manyuan Zhang,
Ka Chun Cheung,
Simon See,
Hongwei Qin,
Jifeng Dai,
Hongsheng Li
Abstract:
We introduce Motion-I2V, a novel framework for consistent and controllable image-to-video generation (I2V). In contrast to previous methods that directly learn the complicated image-to-video mapping, Motion-I2V factorizes I2V into two stages with explicit motion modeling. For the first stage, we propose a diffusion-based motion field predictor, which focuses on deducing the trajectories of the reference image's pixels. For the second stage, we propose motion-augmented temporal attention to enhance the limited 1-D temporal attention in video latent diffusion models. This module can effectively propagate the reference image's features to the synthesized frames with the guidance of the predicted trajectories from the first stage. Compared with existing methods, Motion-I2V can generate more consistent videos even in the presence of large motion and viewpoint variation. By training a sparse trajectory ControlNet for the first stage, Motion-I2V allows users to precisely control motion trajectories and motion regions with sparse trajectory and region annotations. This offers more controllability over the I2V process than relying solely on textual instructions. Additionally, Motion-I2V's second stage naturally supports zero-shot video-to-video translation. Both qualitative and quantitative comparisons demonstrate the advantages of Motion-I2V over prior approaches in consistent and controllable image-to-video generation. Please see our project page at https://xiaoyushi97.github.io/Motion-I2V/.
Submitted 31 January, 2024; v1 submitted 29 January, 2024;
originally announced January 2024.
-
Resilient Practical Test-Time Adaptation: Soft Batch Normalization Alignment and Entropy-driven Memory Bank
Authors:
Xingzhi Zhou,
Zhiliang Tian,
Ka Chun Cheung,
Simon See,
Nevin L. Zhang
Abstract:
Test-time domain adaptation effectively adjusts the source domain model to accommodate unseen domain shifts in a target domain during inference. However, the model performance can be significantly impaired by continuous distribution changes in the target domain and non-independent and identically distributed (non-i.i.d.) test samples often encountered in practical scenarios. While existing memory bank methodologies use memory to store samples and mitigate non-i.i.d. effects, they do not inherently prevent potential model degradation. To address this issue, we propose a resilient practical test-time adaptation (ResiTTA) method focused on parameter resilience and data quality. Specifically, we develop a resilient batch normalization with estimation on normalization statistics and soft alignments to mitigate overfitting and model degradation. We use an entropy-driven memory bank that accounts for timeliness, the persistence of over-confident samples, and sample uncertainty for high-quality data in adaptation. Our framework periodically adapts the source domain model using a teacher-student model through a self-training loss on the memory samples, incorporating soft alignment losses on batch normalization. We empirically validate ResiTTA across various benchmark datasets, demonstrating state-of-the-art performance.
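As a much-simplified illustration of the entropy-driven admission idea: the sketch below keeps only the uncertainty criterion, whereas ResiTTA additionally weighs timeliness and the persistence of over-confident samples, and all names here are hypothetical:

```python
import math

def prediction_entropy(probs):
    """Shannon entropy of a predictive distribution (low = confident)."""
    return -sum(p * math.log(p + 1e-12) for p in probs)

def admit_to_memory(memory, sample, probs, capacity):
    """Insert a test sample, then keep only the `capacity` most confident ones."""
    memory.append((prediction_entropy(probs), sample))
    memory.sort(key=lambda entry: entry[0])  # lowest entropy first
    del memory[capacity:]                    # evict the least confident
    return memory

bank = []
for name, probs in [("x0", [0.5, 0.5]), ("x1", [0.9, 0.1]), ("x2", [0.34, 0.66])]:
    admit_to_memory(bank, name, probs, capacity=2)
```

Here the near-uniform prediction `x0` is evicted once the bank is full, leaving the two lower-entropy samples for the teacher-student adaptation step.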
Submitted 25 January, 2024;
originally announced January 2024.
-
TILFA: A Unified Framework for Text, Image, and Layout Fusion in Argument Mining
Authors:
Qing Zong,
Zhaowei Wang,
Baixuan Xu,
Tianshi Zheng,
Haochen Shi,
Weiqi Wang,
Yangqiu Song,
Ginny Y. Wong,
Simon See
Abstract:
A main goal of Argument Mining (AM) is to analyze an author's stance. Unlike previous AM datasets focusing only on text, the shared task at the 10th Workshop on Argument Mining introduces a dataset including both text and images. Importantly, these images contain both visual elements and optical characters. Our new framework, TILFA (A Unified Framework for Text, Image, and Layout Fusion in Argument Mining), is designed to handle this mixed data. It excels at not only understanding text but also detecting optical characters and recognizing layout details in images. Our model significantly outperforms existing baselines, earning our team, KnowComp, the 1st place in the leaderboard of Argumentative Stance Classification subtask in this shared task.
Submitted 8 October, 2023;
originally announced October 2023.
-
Self-Consistent Narrative Prompts on Abductive Natural Language Inference
Authors:
Chunkit Chan,
Xin Liu,
Tsz Ho Chan,
Jiayang Cheng,
Yangqiu Song,
Ginny Wong,
Simon See
Abstract:
Abduction has long been seen as crucial for narrative comprehension and reasoning about everyday situations. The abductive natural language inference ($α$NLI) task has been proposed, and this narrative text-based task aims to infer the most plausible hypothesis from the candidates given two observations. However, the inter-sentential coherence and the model consistency have not been well exploited in the previous works on this task. In this work, we propose a prompt tuning model $α$-PACE, which takes self-consistency and inter-sentential coherence into consideration. Besides, we propose a general self-consistent framework that considers various narrative sequences (e.g., linear narrative and reverse chronology) for guiding the pre-trained language model in understanding the narrative context of input. We conduct extensive experiments and thorough ablation studies to illustrate the necessity and effectiveness of $α$-PACE. The performance of our method shows significant improvement against extensive competitive baselines.
Submitted 15 September, 2023;
originally announced September 2023.
-
Learning Gabor Texture Features for Fine-Grained Recognition
Authors:
Lanyun Zhu,
Tianrun Chen,
Jianxiong Yin,
Simon See,
Jun Liu
Abstract:
Extracting and using class-discriminative features is critical for fine-grained recognition. Existing works have demonstrated the possibility of applying deep CNNs to exploit features that distinguish similar classes. However, CNNs suffer from problems including frequency bias and loss of detailed local information, which restricts the performance of recognizing fine-grained categories. To address the challenge, we propose a novel texture branch as complementary to the CNN branch for feature extraction. We innovatively utilize Gabor filters as a powerful extractor to exploit texture features, motivated by the capability of Gabor filters in effectively capturing multi-frequency features and detailed local information. We implement several designs to enhance the effectiveness of Gabor filters, including imposing constraints on parameter values and developing a learning method to determine the optimal parameters. Moreover, we introduce a statistical feature extractor to utilize informative statistical information from the signals captured by Gabor filters, and a gate selection mechanism to enable efficient computation by only considering qualified regions as input for texture extraction. Through the integration of features from the Gabor-filter-based texture branch and CNN-based semantic branch, we achieve comprehensive information extraction. We demonstrate the efficacy of our method on multiple datasets, including CUB-200-2011, NA-bird, Stanford Dogs, and GTOS-mobile. State-of-the-art performance is achieved using our approach.
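Since the Gabor filter itself is standard, a minimal NumPy sketch of a filter bank of the kind such a texture branch builds on is shown below; the parameter values and function names are illustrative, not the paper's learned configuration:

```python
import numpy as np

def gabor_kernel(size, sigma, theta, lambd, psi=0.0, gamma=0.5):
    """Real 2D Gabor filter: a sinusoidal carrier windowed by a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)   # rotate into orientation theta
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + (gamma * y_t)**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * x_t / lambd + psi)

def convolve_same(image, kernel):
    """'Same'-size FFT convolution, so responses align with the input image."""
    H, W = image.shape
    kh, kw = kernel.shape
    s = (H + kh - 1, W + kw - 1)
    full = np.real(np.fft.ifft2(np.fft.fft2(image, s) * np.fft.fft2(kernel, s)))
    return full[kh // 2:kh // 2 + H, kw // 2:kw // 2 + W]

def gabor_bank_responses(image, thetas, lambds, size=15, sigma=3.0):
    """Stack responses over all (orientation, wavelength) pairs: (n_filters, H, W)."""
    return np.stack([convolve_same(image, gabor_kernel(size, sigma, t, l))
                     for t in thetas for l in lambds])
```

Statistics pooled over each response map (means, variances, histograms) then serve as the texture features that complement the CNN branch.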
Submitted 10 August, 2023;
originally announced August 2023.
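The Gabor-filter texture extraction the abstract describes can be illustrated with a minimal sketch: build a small bank of real-valued Gabor kernels (a Gaussian envelope modulating an oriented cosine carrier), convolve them with an image, and pool simple statistics of the responses. This is a generic illustration of the standard Gabor formulation, not the paper's actual branch; all parameter names and defaults here are illustrative assumptions.

```python
import numpy as np

def gabor_kernel(size=15, theta=0.0, sigma=3.0, lambd=6.0, gamma=0.5, psi=0.0):
    """Real-valued Gabor kernel: Gaussian envelope times an oriented cosine.

    theta sets orientation, lambd the wavelength (frequency = 1/lambd),
    sigma the envelope width, gamma the spatial aspect ratio, psi the phase.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_t = x * np.cos(theta) + y * np.sin(theta)   # rotate coordinates
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + (gamma * y_t)**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * x_t / lambd + psi)
    return envelope * carrier

def convolve2d_same(image, kern):
    """'Same'-size 2D convolution via zero-padded FFT (small inputs only)."""
    H, W = image.shape
    kh, kw = kern.shape
    pad = np.zeros((H + kh - 1, W + kw - 1))
    pad[:H, :W] = image
    full = np.real(np.fft.ifft2(np.fft.fft2(pad) * np.fft.fft2(kern, pad.shape)))
    return full[kh // 2:kh // 2 + H, kw // 2:kw // 2 + W]

def filter_bank_features(image, n_orientations=4, wavelengths=(4.0, 8.0)):
    """Apply a small Gabor bank and pool mean/std statistics per filter,
    a toy stand-in for the paper's statistical feature extractor."""
    feats = []
    for lambd in wavelengths:
        for k in range(n_orientations):
            kern = gabor_kernel(theta=k * np.pi / n_orientations, lambd=lambd)
            resp = np.abs(convolve2d_same(image, kern))
            feats.extend([resp.mean(), resp.std()])
    return np.array(feats)
```

A filter oriented along a texture's dominant frequency responds far more strongly than an orthogonal one, which is the multi-frequency selectivity the abstract attributes to Gabor filters.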
-
Towards Building AI-CPS with NVIDIA Isaac Sim: An Industrial Benchmark and Case Study for Robotics Manipulation
Authors:
Zhehua Zhou,
Jiayang Song,
Xuan Xie,
Zhan Shu,
Lei Ma,
Dikai Liu,
Jianxiong Yin,
Simon See
Abstract:
As representative cyber-physical systems (CPS), robotic manipulators have been widely adopted in various academic research and industrial processes, indicating their potential to act as a universal interface between the cyber and the physical worlds. Recent studies in robotics manipulation have started employing artificial intelligence (AI) approaches as controllers to achieve better adaptability and performance. However, the inherent challenge of explaining AI components introduces uncertainty and unreliability to these AI-enabled robotics systems, necessitating a reliable development platform for system design and performance assessment. As a foundational step towards building reliable AI-enabled robotics systems, we propose a public industrial benchmark for robotics manipulation in this paper. It leverages NVIDIA Omniverse Isaac Sim as the simulation platform, encompassing eight representative manipulation tasks and multiple AI software controllers. An extensive evaluation is conducted to analyze the performance of AI controllers in solving robotics manipulation tasks, enabling a thorough understanding of their effectiveness. To further demonstrate the applicability of our benchmark, we develop a falsification framework that is compatible with physical simulators and OpenAI Gym environments. This framework bridges the gap between traditional testing methods and modern physics engine-based simulations. The effectiveness of different optimization methods in falsifying AI-enabled robotics manipulation with physical simulators is examined via a falsification test. Our work not only establishes a foundation for the design and development of AI-enabled robotics systems but also provides practical experience and guidance to practitioners in this field, promoting further research in this critical academic and industrial domain.
Submitted 31 July, 2023;
originally announced August 2023.
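Falsification, as used in the abstract above, searches for an input that makes a system violate a safety specification, usually by minimizing a quantitative robustness value until it goes negative. The paper's framework targets Isaac Sim and Gym environments; the sketch below is only a generic random-search falsifier under assumed interfaces (`run_episode`, `sample_input`, `robustness` are hypothetical callables, not the paper's API).

```python
import random

def falsify(run_episode, sample_input, robustness, budget=100, seed=0):
    """Random-search falsification loop.

    Repeatedly samples an input, simulates the system to get a trace, and
    evaluates the spec's robustness; a negative value means the spec is
    violated, so that input is returned as a counterexample. If the budget
    is exhausted, the least-robust input found is returned instead.
    """
    rng = random.Random(seed)
    best_rho, best_x = float("inf"), None
    for _ in range(budget):
        x = sample_input(rng)          # candidate initial condition / input
        trace = run_episode(x)         # roll out the system under test
        rho = robustness(trace)        # quantitative spec satisfaction
        if rho < best_rho:
            best_rho, best_x = rho, x
        if rho < 0:                    # counterexample found
            return x, rho
    return best_x, best_rho
```

In practice the random sampler would be swapped for the optimization methods the abstract compares (e.g. evolutionary or Bayesian search), with the same loop structure.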
-
CopyRNeRF: Protecting the CopyRight of Neural Radiance Fields
Authors:
Ziyuan Luo,
Qing Guo,
Ka Chun Cheung,
Simon See,
Renjie Wan
Abstract:
Neural Radiance Fields (NeRF) have the potential to be a major representation of media. Since training a NeRF has never been an easy task, the protection of its model copyright should be a priority. In this paper, by analyzing the pros and cons of possible copyright protection solutions, we propose to protect the copyright of NeRF models by replacing the original color representation in NeRF with a watermarked color representation. Then, a distortion-resistant rendering scheme is designed to guarantee robust message extraction in 2D renderings of NeRF. Our proposed method can directly protect the copyright of NeRF models while maintaining high rendering quality and bit accuracy compared with alternative solutions.
Submitted 29 July, 2023; v1 submitted 21 July, 2023;
originally announced July 2023.
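The core idea of a "watermarked color representation" is that the color head is conditioned on a secret message, so the rendered colors carry the message for later extraction. The toy sketch below only illustrates that conditioning pattern with a tiny fixed-weight MLP; it is not CopyRNeRF's architecture, and all shapes and names are illustrative assumptions.

```python
import numpy as np

def watermarked_color(feature, message_bits, W1, W2):
    """Toy watermarked color head.

    Concatenates a per-point scene feature with secret message bits and
    maps the result through a 2-layer MLP to an RGB color in (0, 1),
    mimicking how a watermark-conditioned color field replaces a plain one.
    """
    x = np.concatenate([feature, message_bits])  # feature + message input
    h = np.tanh(W1 @ x)                          # hidden representation
    rgb = 1.0 / (1.0 + np.exp(-(W2 @ h)))        # sigmoid -> valid color
    return rgb
```

In the full method, a separate decoder trained jointly with this head would recover `message_bits` from rendered 2D patches, which is what the distortion-resistant rendering scheme is designed to keep reliable.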