-
Less Is More: Sparse and Cooperative Perturbation for Point Cloud Attacks
Authors:
Keke Tang,
Tianyu Hao,
Xiaofei Wang,
Weilong Peng,
Denghui Zhang,
Peican Zhu,
Zhihong Tian
Abstract:
Most adversarial attacks on point clouds perturb a large number of points, causing widespread geometric changes and limiting applicability in real-world scenarios. While recent works explore sparse attacks by modifying only a few points, such approaches often struggle to maintain effectiveness due to the limited influence of individual perturbations. In this paper, we propose SCP, a sparse and cooperative perturbation framework that selects and leverages a compact subset of points whose joint perturbations produce amplified adversarial effects. Specifically, SCP identifies the subset where the misclassification loss is locally convex with respect to their joint perturbations, determined by checking the positive-definiteness of the corresponding Hessian block. The selected subset is then optimized to generate high-impact adversarial examples with minimal modifications. Extensive experiments show that SCP achieves 100% attack success rates, surpassing state-of-the-art sparse attacks, and delivers superior imperceptibility to dense attacks with far fewer modifications.
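To make the subset-selection criterion concrete, here is a minimal NumPy sketch (our illustration, not the authors' code) of testing local convexity by checking whether a symmetric Hessian block is positive definite via its smallest eigenvalue:

```python
import numpy as np

def is_positive_definite(hessian_block: np.ndarray, tol: float = 1e-8) -> bool:
    """A symmetric matrix is positive definite iff all eigenvalues exceed 0;
    symmetrize first to absorb numerical asymmetry in the Hessian estimate."""
    sym = 0.5 * (hessian_block + hessian_block.T)
    return bool(np.linalg.eigvalsh(sym).min() > tol)

# Toy usage: score a candidate subset of 3 points (9 perturbation coordinates)
# by the Hessian block of the loss w.r.t. their joint perturbation.
rng = np.random.default_rng(0)
A = rng.standard_normal((9, 9))
print(is_positive_definite(A @ A.T + np.eye(9)))  # True: PD by construction
```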
Submitted 15 December, 2025;
originally announced December 2025.
-
BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
Authors:
Shengao Wang,
Wenqi Wang,
Zecheng Wang,
Max Whitton,
Michael Wakeham,
Arjun Chandra,
Joey Huang,
Pengyue Zhu,
Helen Chen,
David Li,
Jeffrey Li,
Shawn Li,
Andrew Zagula,
Amy Zhao,
Andrew Zhu,
Sayaka Nakamura,
Yuki Yamamoto,
Jerry Jun Yokono,
Aaron Mueller,
Bryan A. Plummer,
Kate Saenko,
Venkatesh Saligrama,
Boqing Gong
Abstract:
Early childhood developmental trajectories set a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with young children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.
Submitted 11 December, 2025;
originally announced December 2025.
-
Unleashing Degradation-Carrying Features in Symmetric U-Net: Simpler and Stronger Baselines for All-in-One Image Restoration
Authors:
Wenlong Jiao,
Heyang Lee,
Ping Wang,
Pengfei Zhu,
Qinghua Hu,
Dongwei Ren
Abstract:
All-in-one image restoration aims to handle diverse degradations (e.g., noise, blur, adverse weather) within a unified framework, yet existing methods increasingly rely on complex architectures (e.g., Mixture-of-Experts, diffusion models) and elaborate degradation prompt strategies. In this work, we reveal a critical insight: well-crafted feature extraction inherently encodes degradation-carrying information, and a symmetric U-Net architecture is sufficient to unleash these cues effectively. By aligning feature scales across encoder-decoder and enabling streamlined cross-scale propagation, our symmetric design preserves intrinsic degradation signals robustly, rendering simple additive fusion in skip connections sufficient for state-of-the-art performance. Our primary baseline, SymUNet, is built on this symmetric U-Net and achieves better results across benchmark datasets than existing approaches while reducing computational cost. We further propose a semantic enhanced variant, SE-SymUNet, which integrates direct semantic injection from frozen CLIP features via simple cross-attention to explicitly amplify degradation priors. Extensive experiments on several benchmarks validate the superiority of our methods. Both baselines SymUNet and SE-SymUNet establish simpler and stronger foundations for future advancements in all-in-one image restoration. The source code is available at https://github.com/WenlongJiao/SymUNet.
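As a rough illustration of the design described above (a sketch under our own assumptions, not the released SymUNet code linked in the abstract), a symmetric encoder-decoder with matched channel widths lets skip features be fused by simple addition:

```python
import torch
import torch.nn as nn

class SymmetricUNetSketch(nn.Module):
    """Minimal symmetric encoder-decoder with additive skip connections.
    Channel widths mirror each other across encoder/decoder, so skip
    features can be fused by addition instead of concatenation."""
    def __init__(self, ch=32):
        super().__init__()
        self.enc1 = nn.Conv2d(3, ch, 3, padding=1)
        self.down = nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)
        self.bottleneck = nn.Conv2d(ch * 2, ch * 2, 3, padding=1)
        self.up = nn.ConvTranspose2d(ch * 2, ch, 2, stride=2)
        self.dec1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.out = nn.Conv2d(ch, 3, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        e1 = self.act(self.enc1(x))              # full-resolution features
        b = self.act(self.bottleneck(self.act(self.down(e1))))
        d1 = self.up(b) + e1                     # additive skip fusion (same scale/width)
        return self.out(self.act(self.dec1(d1)))

y = SymmetricUNetSketch()(torch.randn(1, 3, 64, 64))
print(y.shape)  # torch.Size([1, 3, 64, 64])
```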
Submitted 11 December, 2025;
originally announced December 2025.
-
An End-to-end Planning Framework with Agentic LLMs and PDDL
Authors:
Emanuele La Malfa,
Ping Zhu,
Samuele Marro,
Sara Bernardini,
Michael Wooldridge
Abstract:
We present an end-to-end framework for planning supported by verifiers. An orchestrator receives a human specification written in natural language and converts it into a PDDL (Planning Domain Definition Language) model, where the domain and problem are iteratively refined by sub-modules (agents) to address common planning requirements, such as time constraints and optimality, as well as ambiguities and contradictions that may exist in the human specification. The validated domain and problem are then passed to an external planning engine to generate a plan. The orchestrator and agents are powered by Large Language Models (LLMs) and require no human intervention at any stage of the process. Finally, a module translates the final plan back into natural language to improve human readability while maintaining the correctness of each step. We demonstrate the flexibility and effectiveness of our framework across various domains and tasks, including the Google NaturalPlan benchmark and PlanBench, as well as planning problems like Blocksworld and the Tower of Hanoi (where LLMs are known to struggle even with small instances). Our framework can be integrated with any PDDL planning engine and validator (such as Fast Downward, LPG, POPF, VAL, and uVAL, which we have tested) and represents a significant step toward end-to-end planning aided by LLMs.
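The pipeline can be pictured as a generate-validate-refine loop. The sketch below is hypothetical: the helpers (llm_to_pddl, validate, refine) are placeholder stand-ins for the LLM orchestrator, an external validator such as VAL, and the sub-module agents, not the paper's actual API:

```python
# A minimal sketch (assumptions throughout) of an orchestrator's
# generate-validate-refine loop over a PDDL domain and problem.

def llm_to_pddl(spec: str) -> tuple[str, str]:
    # Placeholder: a real system would prompt an LLM with the specification.
    domain = "(define (domain demo) (:action noop :parameters ()))"
    problem = "(define (problem p) (:domain demo))"
    return domain, problem

def validate(domain: str, problem: str) -> list[str]:
    # Placeholder for an external validator (e.g., VAL); returns found issues.
    return []

def refine(domain: str, problem: str, issues: list[str]) -> tuple[str, str]:
    # Placeholder for the agents that patch ambiguities, constraints, etc.
    return domain, problem

def plan_from_spec(spec: str, max_rounds: int = 5) -> tuple[str, str]:
    domain, problem = llm_to_pddl(spec)
    for _ in range(max_rounds):
        issues = validate(domain, problem)
        if not issues:
            break
        domain, problem = refine(domain, problem, issues)
    # The validated model would then go to a planner such as Fast Downward.
    return domain, problem

print(plan_from_spec("stack block A on block B"))
```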
Submitted 10 December, 2025;
originally announced December 2025.
-
MatMart: Material Reconstruction of 3D Objects via Diffusion
Authors:
Xiuchao Wu,
Pengfei Zhu,
Jiangjing Lyu,
Xinguo Liu,
Jie Guo,
Yanwen Guo,
Weiwei Xu,
Chengfei Lyu
Abstract:
Applying diffusion models to physically-based material estimation and generation has recently gained prominence. In this paper, we propose MatMart, a novel material reconstruction framework for 3D objects, offering the following advantages. First, MatMart adopts a two-stage reconstruction, starting with accurate material prediction from inputs and followed by prior-guided material generation for unobserved views, yielding high-fidelity results. Second, by utilizing progressive inference alongside the proposed view-material cross-attention (VMCA), MatMart enables reconstruction from an arbitrary number of input images, demonstrating strong scalability and flexibility. Finally, MatMart achieves both material prediction and generation capabilities through end-to-end optimization of a single diffusion model, without relying on additional pre-trained models, thereby exhibiting enhanced stability across various types of objects. Extensive experiments demonstrate that MatMart achieves superior performance in material reconstruction compared to existing methods.
Submitted 24 November, 2025;
originally announced November 2025.
-
Point Cloud Quantization through Multimodal Prompting for 3D Understanding
Authors:
Hongxuan Li,
Wencheng Zhu,
Huiying Xu,
Xinzhong Zhu,
Pengfei Zhu
Abstract:
Vector quantization has emerged as a powerful tool in large-scale multimodal models, unifying heterogeneous representations through discrete token encoding. However, its effectiveness hinges on robust codebook design. Current prototype-based approaches relying on trainable vectors or clustered centroids fall short in representativeness and interpretability, even as multimodal alignment demonstrates its promise in vision-language models. To address these limitations, we propose a simple multimodal prompting-driven quantization framework for point cloud analysis. Our methodology is built upon two core insights: 1) Text embeddings from pre-trained models inherently encode visual semantics through many-to-one contrastive alignment, naturally serving as robust prototype priors; and 2) Multimodal prompts enable adaptive refinement of these prototypes, effectively mitigating vision-language semantic gaps. The framework introduces a dual-constrained quantization space, enforced by compactness and separation regularization, which seamlessly integrates visual and prototype features, resulting in hybrid representations that jointly encode geometric and semantic information. Furthermore, we employ Gumbel-Softmax relaxation to achieve differentiable discretization while maintaining quantization sparsity. Extensive experiments on the ModelNet40 and ScanObjectNN datasets clearly demonstrate the superior effectiveness of the proposed method.
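The differentiable discretization step mentioned at the end has a standard PyTorch realization; here is a minimal sketch (illustrative sizes, not the paper's implementation) using F.gumbel_softmax with hard assignments:

```python
import torch
import torch.nn.functional as F

def gumbel_quantize(features, codebook, tau=1.0):
    """Differentiable hard assignment of features to codebook prototypes.
    features: (N, D) visual features; codebook: (K, D) prototype vectors.
    hard=True yields one-hot codes in the forward pass (quantization
    sparsity) while the straight-through trick keeps gradients flowing."""
    logits = features @ codebook.t()                      # (N, K) similarities
    codes = F.gumbel_softmax(logits, tau=tau, hard=True)  # (N, K) one-hot
    return codes @ codebook                               # (N, D) quantized

x = torch.randn(8, 64, requires_grad=True)
cb = torch.randn(512, 64, requires_grad=True)
q = gumbel_quantize(x, cb)
q.sum().backward()  # gradients reach both the features and the codebook
```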
Submitted 19 November, 2025; v1 submitted 15 November, 2025;
originally announced November 2025.
-
Transferable Hypergraph Attack via Injecting Nodes into Pivotal Hyperedges
Authors:
Meixia He,
Peican Zhu,
Le Cheng,
Yangming Guo,
Manman Yuan,
Keke Tang
Abstract:
Recent studies have demonstrated that hypergraph neural networks (HGNNs) are susceptible to adversarial attacks. However, existing methods rely on the specific information mechanisms of target HGNNs, overlooking the common vulnerability caused by the significant differences in hyperedge pivotality along aggregation paths in most HGNNs, thereby limiting the transferability and effectiveness of attacks. In this paper, we present a novel framework, i.e., Transferable Hypergraph Attack via Injecting Nodes into Pivotal Hyperedges (TH-Attack), to address these limitations. Specifically, we design a hyperedge recognizer via pivotality assessment to obtain pivotal hyperedges within the aggregation paths of HGNNs. Furthermore, we introduce a feature inverter based on pivotal hyperedges, which generates malicious nodes by maximizing the semantic divergence between the generated features and the features of pivotal hyperedges. Lastly, by injecting these malicious nodes into the pivotal hyperedges, TH-Attack improves the transferability and effectiveness of attacks. Extensive experiments are conducted on six authentic datasets to validate the effectiveness of TH-Attack and its superiority over state-of-the-art methods.
Submitted 12 November, 2025;
originally announced November 2025.
-
The Role and Mechanism of Deep Statistical Machine Learning In Biological Target Screening and Immune Microenvironment Regulation of Asthma
Authors:
Pengwei Zhu
Abstract:
As an important source of small-molecule drugs, natural products show remarkable biological activities with their rich types and unique structures. However, due to the limited number of samples and structural complexity, the rapid discovery of lead compounds is limited. Therefore, in this study, natural inhibitors of phosphodiesterase 4 (PDE4) and phosphodiesterase 7 (PDE7) were screened by combining computer-aided drug design (CADD) technology and deep learning methods, and their activities were verified by enzyme activity experiments and enzyme-linked immunoassay. These two enzymes have important application potential in the treatment of inflammatory diseases such as chronic obstructive pulmonary disease and asthma, but PDE4 inhibitors may cause adverse reactions, so it is particularly important to develop both effective and safe dual-target inhibitors. In addition, as a potential target of hyperuricemia, the development of natural inhibitors of xanthine oxidase (XO) is also of great value. We used pharmacophore technology for virtual screening, combined with molecular docking technology to improve accuracy, and finally selected 16 potential natural inhibitors of PDE4/7, and verified their binding stability through molecular dynamics simulation. The results of this study lay a foundation for establishing an efficient dual-target inhibitor screening system and exploring lead compounds for novel XO inhibitors.
Submitted 8 November, 2025;
originally announced November 2025.
-
Decoupled Multi-Predictor Optimization for Inference-Efficient Model Tuning
Authors:
Liwei Luo,
Shuaitengyuan Li,
Dongwei Ren,
Qilong Wang,
Pengfei Zhu,
Qinghua Hu
Abstract:
Recently, remarkable progress has been made in large-scale pre-trained model tuning, and inference efficiency is becoming more crucial for practical deployment. Early exiting in conjunction with multi-stage predictors, when combined with a parameter-efficient fine-tuning strategy, offers a straightforward way to achieve an inference-efficient model. However, a key challenge remains unresolved: how can early stages provide low-level fundamental features to deep stages while simultaneously supplying high-level discriminative features to early-stage predictors? To address this problem, we propose a Decoupled Multi-Predictor Optimization (DMPO) method to effectively decouple the low-level representative ability and high-level discriminative ability in early stages. First, in terms of architecture, we introduce a lightweight bypass module into multi-stage predictors for functional decomposition of shallow features from early stages, while a high-order statistics-based predictor is developed for early stages to effectively enhance their discriminative ability. To reasonably train our multi-predictor architecture, a decoupled optimization is proposed to allocate two-phase loss weights for multi-stage predictors during model tuning, where the initial training phase enables the model to prioritize the acquisition of discriminative ability of deep stages via emphasizing representative ability of early stages, and the latter training phase drives discriminative ability towards earlier stages as much as possible. As such, our DMPO can effectively decouple representative and discriminative abilities in early stages in terms of architecture design and model optimization. Experiments across various datasets and pre-trained backbones demonstrate that DMPO clearly outperforms its counterparts when reducing computational cost.
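A toy sketch of the two-phase loss-weight allocation idea (the exact schedule and weights below are our assumptions, not the paper's scheme):

```python
def predictor_loss_weights(epoch: int, total_epochs: int, num_stages: int):
    """Two-phase weighting: phase 1 emphasizes deep-stage predictors so early
    stages learn representative features; phase 2 shifts weight toward
    early-stage predictors to push discriminative ability earlier."""
    phase2 = epoch >= total_epochs // 2
    ranks = range(num_stages)  # 0 = earliest stage, num_stages - 1 = deepest
    if not phase2:
        raw = [r + 1 for r in ranks]             # deeper predictors weighted more
    else:
        raw = [num_stages - r for r in ranks]    # earlier predictors weighted more
    total = sum(raw)
    return [w / total for w in raw]

print(predictor_loss_weights(epoch=0, total_epochs=100, num_stages=4))
print(predictor_loss_weights(epoch=60, total_epochs=100, num_stages=4))
```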
Submitted 5 November, 2025;
originally announced November 2025.
-
BasicAVSR: Arbitrary-Scale Video Super-Resolution via Image Priors and Enhanced Motion Compensation
Authors:
Wei Shang,
Wanying Zhang,
Shuhang Gu,
Pengfei Zhu,
Qinghua Hu,
Dongwei Ren
Abstract:
Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution of video frames, potentially at various scaling factors, which presents several challenges regarding spatial detail reproduction, temporal consistency, and computational complexity. In this paper, we propose a strong baseline BasicAVSR for AVSR by integrating four key components: 1) adaptive multi-scale frequency priors generated from image Laplacian pyramids, 2) a flow-guided propagation unit to aggregate spatiotemporal information from adjacent frames, 3) a second-order motion compensation unit for more accurate spatial alignment of adjacent frames, and 4) a hyper-upsampling unit to generate scale-aware and content-independent upsampling kernels. To meet diverse application demands, we instantiate three propagation variants: (i) a unidirectional RNN unit for strictly online inference, (ii) a unidirectional RNN unit empowered with a limited lookahead that tolerates a small output delay, and (iii) a bidirectional RNN unit designed for offline tasks where computational resources are less constrained. Experimental results demonstrate the effectiveness and adaptability of our model across these different scenarios. Through extensive experiments, we show that BasicAVSR significantly outperforms existing methods in terms of super-resolution quality, generalization ability, and inference speed. Our work not only advances the state-of-the-art in AVSR but also extends its core components to multiple frameworks for diverse scenarios. The code is available at https://github.com/shangwei5/BasicAVSR.
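Component 1 builds priors from image Laplacian pyramids; a minimal PyTorch sketch of such a pyramid follows (bilinear resampling is our simplification):

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(x: torch.Tensor, levels: int = 3):
    """Build a Laplacian pyramid: each level keeps the band-pass detail lost
    between consecutive scales; the final entry is the low-pass residual.
    x: (B, C, H, W) frame tensor."""
    pyramid = []
    current = x
    for _ in range(levels):
        down = F.interpolate(current, scale_factor=0.5, mode="bilinear",
                             align_corners=False)
        up = F.interpolate(down, size=current.shape[-2:], mode="bilinear",
                           align_corners=False)
        pyramid.append(current - up)  # high-frequency detail at this scale
        current = down
    pyramid.append(current)           # coarsest low-pass image
    return pyramid

bands = laplacian_pyramid(torch.randn(1, 3, 64, 64))
print([b.shape for b in bands])
```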
Submitted 6 November, 2025; v1 submitted 30 October, 2025;
originally announced October 2025.
-
Multimodal Negative Learning
Authors:
Baoquan Gong,
Xiyuan Gao,
Pengfei Zhu,
Qinghua Hu,
Bing Cao
Abstract:
Multimodal learning systems often encounter challenges related to modality imbalance, where a dominant modality may overshadow others, thereby hindering the learning of weak modalities. Conventional approaches often force weak modalities to align with dominant ones in "Learning to be (the same)" (Positive Learning), which risks suppressing the unique information inherent in the weak modalities. To address this challenge, we offer a new learning paradigm: "Learning Not to be" (Negative Learning). Instead of enhancing weak modalities' target-class predictions, the dominant modalities dynamically guide the weak modality to suppress non-target classes. This stabilizes the decision space and preserves modality-specific information, allowing weak modalities to preserve unique information without being over-aligned. We proceed to reveal multimodal learning from a robustness perspective and theoretically derive the Multimodal Negative Learning (MNL) framework, which introduces a dynamic guidance mechanism tailored for negative learning. Our method provably tightens the robustness lower bound of multimodal learning by increasing the Unimodal Confidence Margin (UCoM) and reduces the empirical error of weak modalities, particularly under noisy and imbalanced scenarios. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generalizability of our approach against competing methods. The code will be available at https://github.com/BaoquanGong/Multimodal-Negative-Learning.git.
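A toy rendering of the "learning not to be" idea (our illustration, not the paper's exact MNL objective): the dominant modality's confidence over non-target classes guides the weak modality to push probability mass away from those classes:

```python
import torch
import torch.nn.functional as F

def negative_learning_loss(weak_logits, dominant_logits, target):
    """Instead of pulling the weak modality toward the target class,
    suppress the non-target classes the dominant modality flags."""
    p_weak = F.softmax(weak_logits, dim=1)
    with torch.no_grad():
        guide = F.softmax(dominant_logits, dim=1)           # dynamic guidance
        guide = guide.scatter(1, target.unsqueeze(1), 0.0)  # zero the target column
    # Penalize probability mass on non-target classes, weighted by guidance.
    return -(guide * torch.log(1.0 - p_weak + 1e-8)).sum(dim=1).mean()

logits_weak = torch.randn(4, 10, requires_grad=True)  # weak modality
logits_dom = torch.randn(4, 10)                       # dominant modality
y = torch.randint(0, 10, (4,))
print(negative_learning_loss(logits_weak, logits_dom, y))
```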
Submitted 23 October, 2025;
originally announced October 2025.
-
CausalVerse: Benchmarking Causal Representation Learning with Configurable High-Fidelity Simulations
Authors:
Guangyi Chen,
Yunlong Deng,
Peiyuan Zhu,
Yan Li,
Yifan Shen,
Zijian Li,
Kun Zhang
Abstract:
Causal Representation Learning (CRL) aims to uncover the data-generating process and identify the underlying causal variables and relations, whose evaluation remains inherently challenging due to the requirement of known ground-truth causal variables and causal structure. Existing evaluations often rely on either simplistic synthetic datasets or downstream performance on real-world tasks, generally suffering a dilemma between realism and evaluative precision. In this paper, we introduce a new benchmark for CRL using high-fidelity simulated visual data that retains both realistic visual complexity and, more importantly, access to ground-truth causal generating processes. The dataset comprises around 200 thousand images and 3 million video frames across 24 sub-scenes in four domains: static image generation, dynamic physical simulations, robotic manipulations, and traffic situation analysis. These scenarios range from static to dynamic settings, simple to complex structures, and single to multi-agent interactions, offering a comprehensive testbed that hopefully bridges the gap between rigorous evaluation and real-world applicability. In addition, we provide flexible access to the underlying causal structures, allowing users to modify or configure them to align with the required assumptions in CRL, such as available domain labels, temporal dependencies, or intervention histories. Leveraging this benchmark, we evaluated representative CRL methods across diverse paradigms and offered empirical insights to assist practitioners and newcomers in choosing or extending appropriate CRL frameworks to properly address specific types of real problems that can benefit from the CRL perspective. Project page: https://causal-verse.github.io/; dataset: https://huggingface.co/CausalVerse.
Submitted 17 October, 2025; v1 submitted 15 October, 2025;
originally announced October 2025.
-
Collaborative Shadows: Distributed Backdoor Attacks in LLM-Based Multi-Agent Systems
Authors:
Pengyu Zhu,
Lijun Li,
Yaxing Lyu,
Li Sun,
Sen Su,
Jing Shao
Abstract:
LLM-based multi-agent systems (MAS) are increasingly integrated into next-generation applications, but their safety against backdoor attacks remains largely underexplored. Existing research has focused exclusively on single-agent backdoor attacks, overlooking the novel attack surfaces introduced by agent collaboration in MAS. To bridge this gap, we present the first Distributed Backdoor Attack tailored to MAS. We decompose the backdoor into multiple distributed attack primitives that are embedded within MAS tools. These primitives remain dormant individually but activate collectively only when agents collaborate in a specific sequence, thereby assembling the full backdoor to execute targeted attacks such as data exfiltration. To fully assess this threat, we introduce a benchmark for multi-role collaborative tasks and a sandboxed framework for evaluation. Extensive experiments demonstrate that our attack achieves an attack success rate exceeding 95% without degrading performance on benign tasks. This work exposes novel backdoor attack surfaces that exploit agent collaboration, underscoring the need to move beyond single-agent protection. Code and benchmark are available at https://github.com/whfeLingYu/Distributed-Backdoor-Attacks-in-MAS.
Submitted 13 October, 2025;
originally announced October 2025.
-
Visual Anomaly Detection for Reliable Robotic Implantation of Flexible Microelectrode Array
Authors:
Yitong Chen,
Xinyao Xu,
Ping Zhu,
Xinyong Han,
Fangbo Qin,
Shan Yu
Abstract:
Flexible microelectrode (FME) implantation into the brain cortex is challenging due to the deformable fiber-like structure of the FME probe and its interaction with critical bio-tissue. To ensure reliability and safety, the implantation process should be monitored carefully. This paper develops an image-based anomaly detection framework based on the microscopic cameras of the robotic FME implantation system. The unified framework is applied at four checkpoints to check the micro-needle, the FME probe, the hooking result, and the implantation point, respectively. Exploiting existing object localization results, aligned regions of interest (ROIs) are extracted from the raw image and input to a pretrained vision transformer (ViT). Considering the task specifications, we propose a progressive granularity patch feature sampling method to address the sensitivity-tolerance trade-off at different locations. Moreover, we select a subset of feature channels with higher signal-to-noise ratios from the raw general ViT features, to provide better descriptors for each specific scene. The effectiveness of the proposed methods is validated on image datasets collected from our implantation system.
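The channel-selection step has a simple realization; below is a hedged sketch where the per-channel signal-to-noise ratio is estimated as |mean|/std over nominal samples (one plausible definition; the paper's exact criterion may differ):

```python
import torch

def select_high_snr_channels(features: torch.Tensor, keep: int = 256):
    """Pick feature channels with the highest signal-to-noise ratio over a
    set of normal (anomaly-free) samples.
    features: (N, C) pooled ViT patch descriptors."""
    mean = features.mean(dim=0)
    std = features.std(dim=0) + 1e-8
    snr = mean.abs() / std
    idx = torch.topk(snr, k=keep).indices
    return idx  # reuse these channel indices for all test descriptors

normal_feats = torch.randn(1000, 768)  # stand-in for pretrained ViT features
channels = select_high_snr_channels(normal_feats, keep=256)
print(channels.shape)  # torch.Size([256])
```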
Submitted 10 October, 2025;
originally announced October 2025.
-
MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows
Authors:
Guobin Ma,
Jixun Yao,
Ziqian Ning,
Yuepeng Jiang,
Lingxin Xiong,
Lei Xie,
Pengcheng Zhu
Abstract:
Zero-shot voice conversion (VC) aims to transfer timbre from a source speaker to any unseen target speaker while preserving linguistic content. Growing application scenarios demand models with streaming inference capabilities. This has created a pressing need for models that are simultaneously fast, lightweight, and high-fidelity. However, existing streaming methods typically rely on either autoregressive (AR) or non-autoregressive (NAR) frameworks, which either require large parameter sizes to achieve strong performance or struggle to generalize to unseen speakers. In this study, we propose MeanVC, a lightweight and streaming zero-shot VC approach. MeanVC introduces a diffusion transformer with a chunk-wise autoregressive denoising strategy, combining the strengths of both AR and NAR paradigms for efficient streaming processing. By introducing mean flows, MeanVC regresses the average velocity field during training, enabling zero-shot VC with superior speech quality and speaker similarity in a single sampling step by directly mapping from the start to the endpoint of the flow trajectory. Additionally, we incorporate diffusion adversarial post-training to mitigate over-smoothing and further enhance speech quality. Experimental results demonstrate that MeanVC significantly outperforms existing zero-shot streaming VC systems, achieving superior conversion quality with higher efficiency and significantly fewer parameters. Audio demos and code are publicly available at https://aslp-lab.github.io/MeanVC.
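The single-step sampling enabled by mean flows can be sketched as follows (a toy stand-in model and one common sign/time convention; MeanVC's actual network, chunking, and conditioning are richer):

```python
import torch
import torch.nn as nn

class AvgVelocityNet(nn.Module):
    """Stand-in for a mean-flow network: predicts the *average* velocity of
    the flow between times r and t (toy MLP for illustration)."""
    def __init__(self, dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 2, 256), nn.SiLU(),
                                 nn.Linear(256, dim))

    def forward(self, x, r, t):
        rt = torch.stack([r, t], dim=-1).expand(x.shape[0], 2)
        return self.net(torch.cat([x, rt], dim=-1))

@torch.no_grad()
def one_step_sample(model, noise):
    """With an average-velocity field u(x, r, t), the trajectory endpoint is
    reached directly from the start: x_1 = x_0 + u(x_0, r=0, t=1)."""
    zeros = noise.new_zeros(())
    ones = noise.new_ones(())
    return noise + model(noise, zeros, ones)

model = AvgVelocityNet()
mel = one_step_sample(model, torch.randn(4, 80))  # 4 mel-spectrogram frames
print(mel.shape)
```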
Submitted 21 December, 2025; v1 submitted 9 October, 2025;
originally announced October 2025.
-
Everything-Grasping (EG) Gripper: A Universal Gripper with Synergistic Suction-Grasping Capabilities for Cross-Scale and Cross-State Manipulation
Authors:
Jianshu Zhou,
Jing Shu,
Tianle Pan,
Puchen Zhu,
Jiajun An,
Huayu Zhang,
Junda Huang,
Upinder Kaur,
Xin Ma,
Masayoshi Tomizuka
Abstract:
Grasping objects across vastly different sizes and physical states, including both solids and liquids, with a single robotic gripper remains a fundamental challenge in soft robotics. We present the Everything-Grasping (EG) Gripper, a soft end-effector that synergistically integrates distributed surface suction with internal granular jamming, enabling cross-scale and cross-state manipulation without requiring an airtight seal at the contact interface with target objects. The EG Gripper can handle objects with surface areas ranging from the sub-millimeter scale of 0.2 mm² (glass bead) to over 62,000 mm² (A4-sized paper and woven bag), enabling manipulation of objects nearly 3,500× smaller and 88× larger than its own contact area (approximately 707 mm² for a 30 mm-diameter base). We further introduce a tactile sensing framework that combines liquid detection and pressure-based suction feedback, enabling real-time differentiation between solid and liquid targets. Guided by the Tactile-Inferred Grasping Mode Selection (TIGMS) algorithm, the gripper autonomously selects grasping modes based on distributed pressure and voltage signals. Experiments across diverse tasks, including underwater grasping, fragile object handling, and liquid capture, demonstrate robust and repeatable performance. To our knowledge, this is the first soft gripper to reliably grasp both solid and liquid objects across scales using a unified compliant architecture.
Submitted 6 October, 2025;
originally announced October 2025.
-
HydroFusion-LMF: Semi-Supervised Multi-Network Fusion with Large-Model Adaptation for Long-Term Daily Runoff Forecasting
Authors:
Qianfei Fan,
Jiayu Wei,
Peijun Zhu,
Wensheng Ye,
Meie Fang
Abstract:
Accurate decade-scale daily runoff forecasting in small watersheds is difficult because signals blend drifting trends, multi-scale seasonal cycles, regime shifts, and sparse extremes. Prior deep models (DLinear, TimesNet, PatchTST, TiDE, Nonstationary Transformer, LSTNet, LSTM) usually target single facets and under-utilize unlabeled spans, limiting regime adaptivity. We propose HydroFusion-LMF, a unified framework that (i) performs a learnable trend-seasonal-residual decomposition to reduce non-stationarity, (ii) routes residuals through a compact heterogeneous expert set (linear refinement, frequency kernel, patch Transformer, recurrent memory, dynamically normalized attention), (iii) fuses expert outputs via a hydrologic context-aware gate conditioned on day-of-year phase, antecedent precipitation, local variance, flood indicators, and static basin attributes, and (iv) augments supervision with a semi-supervised multi-task objective (composite MSE/MAE + extreme emphasis + NSE/KGE, masked reconstruction, multi-scale contrastive alignment, augmentation consistency, variance-filtered pseudo-labeling). Optional adapter/LoRA layers efficiently inject a frozen foundation time-series encoder. On a ~10-year daily dataset, HydroFusion-LMF attains an MSE of 1.0128 and an MAE of 0.5818, improving on the strongest baseline (DLinear) by 10.2% / 10.3% and on the mean baseline by 24.6% / 17.1%. We observe simultaneous MSE and MAE reductions relative to baselines. The framework balances interpretability (explicit components, sparse gating) with performance, advancing label-efficient hydrologic forecasting under non-stationarity.
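Component (iii), the context-aware gate, follows the familiar mixture-of-experts pattern; a minimal PyTorch sketch with illustrative dimensions (not the paper's module):

```python
import torch
import torch.nn as nn

class ContextGatedFusion(nn.Module):
    """Expert forecasts are combined with softmax weights conditioned on
    context features (e.g., day-of-year phase, antecedent precipitation,
    local variance, flood indicators, static basin attributes)."""
    def __init__(self, num_experts: int, ctx_dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(ctx_dim, 32), nn.ReLU(),
                                  nn.Linear(32, num_experts))

    def forward(self, expert_preds, context):
        # expert_preds: (B, E) one forecast per expert; context: (B, ctx_dim)
        w = torch.softmax(self.gate(context), dim=-1)  # (B, E) gate weights
        return (w * expert_preds).sum(dim=-1)          # (B,) fused runoff

fusion = ContextGatedFusion(num_experts=5, ctx_dim=8)
preds = torch.randn(16, 5)   # linear / frequency / patch / RNN / attention experts
ctx = torch.randn(16, 8)
print(fusion(preds, ctx).shape)  # torch.Size([16])
```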
Submitted 4 October, 2025;
originally announced October 2025.
-
Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression
Authors:
Peijun Zhu,
Ning Yang,
Jiayu Wei,
Jinghang Wu,
Haijun Zhang
Abstract:
Mixture-of-Experts (MoE) Large Language Models (LLMs) face a trilemma of load imbalance, parameter redundancy, and communication overhead. We introduce a unified framework based on dynamic expert clustering and structured compression to address these issues cohesively. Our method employs an online clustering procedure that periodically regroups experts using a fused metric of parameter and activation similarity, which stabilizes expert utilization. To our knowledge, this is one of the first frameworks to leverage the semantic embedding capability of the router to dynamically reconfigure the model's architecture during training for substantial efficiency gains. Within each cluster, we decompose expert weights into a shared base matrix and extremely low-rank residual adapters, achieving up to fivefold parameter reduction per group while preserving specialization. This structure enables a two-stage hierarchical routing strategy: tokens are first assigned to a cluster, then to specific experts within it, drastically reducing the routing search space and the volume of all-to-all communication. Furthermore, a heterogeneous precision scheme, which stores shared bases in FP16 and residual factors in INT4, coupled with dynamic offloading of inactive clusters, reduces peak memory consumption to levels comparable to dense models. Evaluated on GLUE and WikiText-103, our framework matches the quality of standard MoE models while reducing total parameters by approximately 80%, improving throughput by 10% to 20%, and lowering expert load variance by a factor of over three. Our work demonstrates that structural reorganization is a principled path toward scalable, efficient, and memory-effective MoE LLMs.
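The shared-base-plus-low-rank decomposition within a cluster can be sketched directly (illustrative sizes; the heterogeneous FP16/INT4 precision scheme and routing are omitted):

```python
import torch
import torch.nn as nn

class ClusteredExperts(nn.Module):
    """Experts within one cluster share a base weight matrix and differ only
    by a low-rank residual adapter: W_i x ≈ B x + U_i (V_i x)."""
    def __init__(self, d_in=512, d_out=512, num_experts=8, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)  # shared across the cluster
        self.V = nn.Parameter(torch.randn(num_experts, rank, d_in) * 0.02)
        self.U = nn.Parameter(torch.randn(num_experts, d_out, rank) * 0.02)

    def forward(self, x, expert_id: int):
        # Per-expert parameters: 2 * rank * d ≈ 8K vs. 262K for a full matrix.
        residual = x @ self.V[expert_id].t() @ self.U[expert_id].t()
        return self.base(x) + residual

experts = ClusteredExperts()
y = experts(torch.randn(4, 512), expert_id=3)
print(y.shape)  # torch.Size([4, 512])
```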
Submitted 27 September, 2025;
originally announced October 2025.
-
Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
Authors:
Minhui Zhu,
Minyang Tian,
Xiaocheng Yang,
Tianci Zhou,
Lifan Yuan,
Penghao Zhu,
Eli Chertkov,
Shengyan Liu,
Yufeng Du,
Ziming Ji,
Indranil Das,
Junyi Cao,
Yufeng Du,
Jiabin Yu,
Peixue Wu,
Jinchen He,
Yifan Su,
Yikun Jiang,
Yujie Zhang,
Chang Liu,
Ze-Min Huang,
Weizhen Jia,
Yunkai Wang,
Farshid Jafarpour,
Yong Zhao
, et al. (39 additional authors not shown)
Abstract:
While large language models (LLMs) with reasoning capabilities are progressing rapidly on high-school math competitions and coding, can they reason effectively through complex, open-ended challenges found in frontier physics research? And crucially, what kinds of reasoning tasks do physicists want LLMs to assist with? To address these questions, we present the CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced "critical point"), the first benchmark designed to test LLMs on unpublished, research-level reasoning tasks that broadly covers modern physics research areas, including condensed matter, quantum physics, atomic, molecular & optical physics, astrophysics, high energy physics, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics and biophysics. CritPt consists of 71 composite research challenges designed to simulate full-scale research projects at the entry level, which are also decomposed to 190 simpler checkpoint tasks for more fine-grained insights. All problems are newly created by 50+ active physics researchers based on their own research. Every problem is hand-curated to admit a guess-resistant and machine-verifiable answer and is evaluated by an automated grading pipeline heavily customized for advanced physics-specific output formats. We find that while current state-of-the-art LLMs show early promise on isolated checkpoints, they remain far from being able to reliably solve full research-scale challenges: the best average accuracy among base models is only 5.7%, achieved by GPT-5 (high), moderately rising to around 10% when equipped with coding tools. Through the realistic yet standardized evaluation offered by CritPt, we highlight a large disconnect between current model capabilities and realistic physics research demands, offering a foundation to guide the development of scientifically grounded AI tools.
Submitted 20 November, 2025; v1 submitted 30 September, 2025;
originally announced September 2025.
-
Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis
Authors:
Qingyu Liu,
Yushen Chen,
Zhikang Niu,
Chunhui Wang,
Yunting Yang,
Bowen Zhang,
Jian Zhao,
Pengcheng Zhu,
Kai Yu,
Xie Chen
Abstract:
Flow-matching-based text-to-speech (TTS) models have shown high-quality speech synthesis. However, most current flow-matching-based TTS models still rely on reference transcripts corresponding to the audio prompt for synthesis. This dependency prevents cross-lingual voice cloning when audio prompt transcripts are unavailable, particularly for unseen languages. The key challenges for flow-matching-based TTS models to remove audio prompt transcripts are identifying word boundaries during training and determining appropriate duration during inference. In this paper, we introduce Cross-Lingual F5-TTS, a framework that enables cross-lingual voice cloning without audio prompt transcripts. Our method preprocesses audio prompts by forced alignment to obtain word boundaries, enabling direct synthesis from audio prompts while excluding transcripts during training. To address the duration modeling challenge, we train speaking rate predictors at different linguistic granularities to derive duration from speaker pace. Experiments show that our approach matches the performance of F5-TTS while enabling cross-lingual voice cloning.
Submitted 20 September, 2025; v1 submitted 17 September, 2025;
originally announced September 2025.
-
StableIntrinsic: Detail-preserving One-step Diffusion Model for Multi-view Material Estimation
Authors:
Xiuchao Wu,
Pengfei Zhu,
Jiangjing Lyu,
Xinguo Liu,
Jie Guo,
Yanwen Guo,
Weiwei Xu,
Chengfei Lyu
Abstract:
Recovering material information from images has been extensively studied in computer graphics and vision. Recent works in material estimation leverage diffusion models and show promising results. However, these diffusion-based methods adopt a multi-step denoising strategy, which is time-consuming for each estimation. Such stochastic inference also conflicts with the deterministic nature of material estimation, leading to high variance in the estimated results. In this paper, we introduce StableIntrinsic, a one-step diffusion model for multi-view material estimation that can produce high-quality material parameters with low variance. To address the over-smoothing problem in one-step diffusion, StableIntrinsic applies losses in pixel space, with each loss designed based on the properties of the material. Additionally, StableIntrinsic introduces a Detail Injection Network (DIN) to eliminate the detail loss caused by VAE encoding, while further enhancing the sharpness of material predictions. The experimental results indicate that our method surpasses the current state-of-the-art techniques by achieving a $9.9\%$ improvement in the Peak Signal-to-Noise Ratio (PSNR) of albedo, and by reducing the Mean Square Error (MSE) for metallic and roughness by $44.4\%$ and $60.0\%$, respectively.
Submitted 27 August, 2025;
originally announced August 2025.
-
Average Achievable Rate Analysis of Cell-Free Massive MIMO in the Finite Blocklength Regime with Imperfect CSI
Authors:
Kai Chen,
Feng Ye,
Jiamin Li,
Pengcheng Zhu,
Dongming Wang,
Xiaohu You
Abstract:
Acquiring perfect channel state information (CSI) introduces substantial challenges in cell-free massive MIMO (CF-mMIMO) systems, primarily due to the large dimensionality of channel parameters, especially under ultra-reliable low-latency communication (uRLLC) constraints. Furthermore, the impact of imperfect CSI on the average achievable rate within the finite blocklength regime remains largely unexplored. Motivated by this gap, this paper proposes a novel analytical framework that provides a closed-form expression for the average achievable rate with imperfect CSI in the Laplace domain. We demonstrate analytically that both the channel dispersion and the expected channel capacity can be expressed explicitly in terms of the Laplace transform of the large-scale fading component. Numerical simulations confirm that the derived expressions match Monte Carlo simulations closely, verifying their accuracy. Furthermore, we theoretically show that although imperfect CSI degrades performance in the finite blocklength regime, the inherent characteristics of the CF-mMIMO architecture effectively mitigate this loss.
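For orientation, the finite-blocklength quantity that such analyses average over the channel and CSI statistics is the standard normal approximation of Polyanskiy et al., where $C$ is the channel capacity, $V$ the channel dispersion, $n$ the blocklength, and $\epsilon$ the target error probability (the paper's Laplace-domain closed forms are not reproduced here):

```latex
R(n, \epsilon) \approx C - \sqrt{\frac{V}{n}}\, Q^{-1}(\epsilon) + O\!\left(\frac{\log n}{n}\right)
```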
Submitted 25 August, 2025; v1 submitted 24 August, 2025;
originally announced August 2025.
-
HaDM-ST: Histology-Assisted Differential Modeling for Spatial Transcriptomics Generation
Authors:
Xuepeng Liu,
Zheng Jiang,
Pinan Zhu,
Hanyu Liu,
Chao Li
Abstract:
Spatial transcriptomics (ST) reveals spatial heterogeneity of gene expression, yet its resolution is limited by current platforms. Recent methods enhance resolution via H&E-stained histology, but three major challenges persist: (1) isolating expression-relevant features from visually complex H&E images; (2) achieving spatially precise multimodal alignment in diffusion-based frameworks; and (3) modeling gene-specific variation across expression channels. We propose HaDM-ST (Histology-assisted Differential Modeling for ST Generation), a high-resolution ST generation framework conditioned on H&E images and low-resolution ST. HaDM-ST includes: (i) a semantic distillation network to extract predictive cues from H&E; (ii) a spatial alignment module enforcing pixel-wise correspondence with low-resolution ST; and (iii) a channel-aware adversarial learner for fine-grained gene-level modeling. Experiments on 200 genes across diverse tissues and species show HaDM-ST consistently outperforms prior methods, enhancing spatial fidelity and gene-level coherence in high-resolution ST predictions.
Submitted 10 August, 2025;
originally announced August 2025.
-
Hyperbolic Cycle Alignment for Infrared-Visible Image Fusion
Authors:
Timing Li,
Bing Cao,
Jiahe Feng,
Haifang Cao,
Qinghua Hu,
Pengfei Zhu
Abstract:
Image fusion synthesizes complementary information from multiple sources, mitigating the inherent limitations of unimodal imaging systems. Accurate image registration is essential for effective multi-source data fusion. However, existing registration methods, often based on image translation in Euclidean space, fail to handle cross-modal misalignment effectively, resulting in suboptimal alignment and fusion quality. To overcome this limitation, we explore image alignment in non-Euclidean space and propose a Hyperbolic Cycle Alignment Network (Hy-CycleAlign). To the best of our knowledge, Hy-CycleAlign is the first image registration method based on hyperbolic space. It introduces a dual-path cross-modal cyclic registration framework, in which a forward registration network aligns cross-modal inputs, while a backward registration network reconstructs the original image, forming a closed-loop registration structure with geometric consistency. Additionally, we design a Hyperbolic Hierarchy Contrastive Alignment (H$^{2}$CA) module, which maps images into hyperbolic space and imposes registration constraints, effectively reducing interference caused by modality discrepancies. We further analyze image registration in both Euclidean and hyperbolic spaces, demonstrating that hyperbolic space enables more sensitive and effective multi-modal image registration. Extensive experiments on misaligned multi-modal images demonstrate that our method significantly outperforms existing approaches in both image alignment and fusion. Our code will be publicly available.
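Mapping features into hyperbolic space typically goes through the exponential map at the origin of the Poincaré ball; a short sketch of that standard map follows (whether the H$^{2}$CA module uses exactly this form is our assumption):

```python
import torch

def expmap0(v: torch.Tensor, c: float = 1.0, eps: float = 1e-8):
    """Exponential map at the origin of the Poincaré ball with curvature -c:
        exp_0(v) = tanh(sqrt(c) * ||v||) * v / (sqrt(c) * ||v||),
    taking Euclidean (tangent) features to points inside the unit ball."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

feats = torch.randn(4, 128)          # Euclidean image features
hyp = expmap0(feats)                 # hyperbolic embeddings
print(hyp.norm(dim=-1).max() < 1.0)  # tensor(True): all points inside the ball
```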
Submitted 31 July, 2025;
originally announced July 2025.
-
A Segmentation Framework for Accurate Diagnosis of Amyloid Positivity without Structural Images
Authors:
Penghan Zhu,
Shurui Mei,
Shushan Chen,
Xiaobo Chu,
Shanbo He,
Ziyi Liu
Abstract:
This study proposes a deep learning-based framework for automated segmentation of brain regions and classification of amyloid positivity using positron emission tomography (PET) images alone, without the need for structural MRI or CT. A 3D U-Net architecture with four layers of depth was trained and validated on a dataset of 200 F18-florbetapir amyloid-PET scans, with a 130/20/50 train/validation/test split. Segmentation performance was evaluated using Dice similarity coefficients across 30 brain regions, with scores ranging from 0.45 to 0.88, demonstrating high anatomical accuracy, particularly in subcortical structures. Quantitative fidelity of PET uptake within clinically relevant regions (precuneus, prefrontal cortex, gyrus rectus, and lateral temporal cortex) was assessed using normalized root-mean-square error, achieving values as low as 0.0011. Furthermore, the model achieved a classification accuracy of 0.98 for amyloid positivity based on regional uptake quantification, with an area under the ROC curve (AUC) of 0.99. These results highlight the model's potential for integration into PET-only diagnostic pipelines, particularly in settings where structural imaging is not available. This approach reduces dependence on coregistration and manual delineation, enabling scalable, reliable, and reproducible analysis in clinical and research applications. Future work will focus on clinical validation and extension to diverse PET tracers, including C11-PiB and other F18-labeled compounds.
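The per-region segmentation metric quoted above is the Dice similarity coefficient; a small NumPy sketch of how it is computed for one labeled region:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray, label: int) -> float:
    """Dice similarity for one brain region:
    DSC = 2 |P ∩ T| / (|P| + |T|), the per-region score reported above."""
    p = pred == label
    t = truth == label
    denom = p.sum() + t.sum()
    return 2.0 * np.logical_and(p, t).sum() / denom if denom else 1.0

pred = np.random.randint(0, 30, size=(64, 64, 64))   # toy label volumes
truth = np.random.randint(0, 30, size=(64, 64, 64))
print(dice_coefficient(pred, truth, label=5))
```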
Submitted 29 July, 2025;
originally announced July 2025.
-
Transferable and Undefendable Point Cloud Attacks via Medial Axis Transform
Authors:
Keke Tang,
Yuze Gao,
Weilong Peng,
Xiaofei Wang,
Meie Fang,
Peican Zhu
Abstract:
Studying adversarial attacks on point clouds is essential for evaluating and improving the robustness of 3D deep learning models. However, most existing attack methods are developed under ideal white-box settings and often suffer from limited transferability to unseen models and insufficient robustness against common defense mechanisms. In this paper, we propose MAT-Adv, a novel adversarial attack framework that enhances both transferability and undefendability by explicitly perturbing the medial axis transform (MAT) representations, in order to induce inherent adversarialness in the resulting point clouds. Specifically, we employ an autoencoder to project input point clouds into compact MAT representations that capture the intrinsic geometric structure of point clouds. By perturbing these intrinsic representations, MAT-Adv introduces structural-level adversarial characteristics that remain effective across diverse models and defense strategies. To mitigate overfitting and prevent perturbation collapse, we incorporate a dropout strategy into the optimization of MAT perturbations, further improving transferability and undefendability. Extensive experiments demonstrate that MAT-Adv significantly outperforms existing state-of-the-art methods in both transferability and undefendability. Codes will be made public upon paper acceptance.
Submitted 24 July, 2025;
originally announced July 2025.
-
SafeWork-R1: Coevolving Safety and Intelligence under the AI-45$^{\circ}$ Law
Authors:
Shanghai AI Lab,
Yicheng Bao,
Guanxu Chen,
Mingkang Chen,
Yunhao Chen,
Chiyu Chen,
Lingjie Chen,
Sirui Chen,
Xinquan Chen,
Jie Cheng,
Yu Cheng,
Dengke Deng,
Yizhuo Ding,
Dan Ding,
Xiaoshan Ding,
Yi Ding,
Zhichen Dong,
Lingxiao Du,
Yuyu Fan,
Xinshun Feng,
Yanwei Fu,
Yuxuan Gao,
Ruijun Ge,
Tianle Gu
, et al. (93 additional authors not shown)
Abstract:
We introduce SafeWork-R1, a cutting-edge multimodal reasoning model that demonstrates the coevolution of capabilities and safety. It is developed by our proposed SafeLadder framework, which incorporates large-scale, progressive, safety-oriented reinforcement learning post-training, supported by a suite of multi-principled verifiers. Unlike previous alignment methods such as RLHF that simply learn human preferences, SafeLadder enables SafeWork-R1 to develop intrinsic safety reasoning and self-reflection abilities, giving rise to safety 'aha' moments. Notably, SafeWork-R1 achieves an average improvement of $46.54\%$ over its base model Qwen2.5-VL-72B on safety-related benchmarks without compromising general capabilities, and delivers state-of-the-art safety performance compared to leading proprietary models such as GPT-4.1 and Claude Opus 4. To further bolster its reliability, we implement two distinct inference-time intervention methods and a deliberative search mechanism, enforcing step-level verification. Finally, we further develop SafeWork-R1-InternVL3-78B, SafeWork-R1-DeepSeek-70B, and SafeWork-R1-Qwen2.5VL-7B. All resulting models demonstrate that safety and capability can co-evolve synergistically, highlighting the generalizability of our framework in building robust, reliable, and trustworthy general-purpose AI.
Submitted 7 August, 2025; v1 submitted 24 July, 2025;
originally announced July 2025.
-
SEER: Semantic Enhancement and Emotional Reasoning Network for Multimodal Fake News Detection
Authors:
Peican Zhu,
Yubo Jing,
Le Cheng,
Bin Chen,
Xiaodong Cui,
Lianwei Wu,
Keke Tang
Abstract:
Previous studies on multimodal fake news detection mainly focus on the alignment and integration of cross-modal features, as well as the application of text-image consistency. However, they overlook the semantic enhancement effects of large multimodal models and pay little attention to the emotional features of news. In addition, fake news tends to contain more negative emotion than real news. Therefore, we propose a novel Semantic Enhancement and Emotional Reasoning (SEER) Network for multimodal fake news detection. We generate summarized captions for image semantic understanding and utilize the products of large multimodal models for semantic enhancement. Inspired by the relationship between news authenticity and emotional tendencies, we propose an expert emotional reasoning module that simulates real-life scenarios to optimize emotional features and infer the authenticity of news. Extensive experiments on two real-world datasets demonstrate the superiority of our SEER over state-of-the-art baselines.
Submitted 17 July, 2025;
originally announced July 2025.
-
Tao-Technology for Teen Mobile Use: Harmonizing Adaptation, Autonomy, and Reflection
Authors:
Pengyu Zhu,
Janghee Cho
Abstract:
Adolescents' mobile technology use is often regulated through rigid control mechanisms that fail to account for their autonomy and natural usage patterns. Drawing on Taoist philosophy, particularly Wu Wei, Yin-Yang, and Zi Ran, this position paper proposes Tao-Technology, a self-organizing, adaptive regulatory framework. Integrating insights from Reflective Informatics and Information Ecologies, we explore how mobile technology can dynamically adjust to context while fostering self-reflection and meaning-making. This approach shifts from external restrictions to dynamic co-adaptive regulation, ensuring technology governance remains flexible yet structured, supporting adolescents in cultivating a balanced and intentional relationship with digital technology.
Submitted 16 July, 2025;
originally announced July 2025.
-
AdaBrain-Bench: Benchmarking Brain Foundation Models for Brain-Computer Interface Applications
Authors:
Jiamin Wu,
Zichen Ren,
Junyu Wang,
Pengyu Zhu,
Yonghao Song,
Mianxin Liu,
Qihao Zheng,
Lei Bai,
Wanli Ouyang,
Chunfeng Song
Abstract:
Non-invasive Brain-Computer Interfaces (BCI) offer a safe and accessible means of connecting the human brain to external devices, with broad applications in home and clinical settings to enhance human capabilities. However, the high noise level and limited task-specific data in non-invasive signals constrain decoding capabilities. Recently, the adoption of self-supervised pre-training is transforming the landscape of non-invasive BCI research, enabling the development of brain foundation models that capture generic neural representations from large-scale unlabeled electroencephalography (EEG) signals with substantial noise. However, despite these advances, the field currently lacks comprehensive, practical, and extensible benchmarks to assess the utility of public foundation models across diverse BCI tasks, hindering their widespread adoption. To address this challenge, we present AdaBrain-Bench, a large-scale standardized benchmark to systematically evaluate brain foundation models on widespread non-invasive BCI tasks. AdaBrain-Bench encompasses a diverse collection of representative BCI decoding datasets spanning 7 key applications. It introduces a streamlined task adaptation pipeline integrated with multi-dimensional evaluation metrics and a set of adaptation tools. The benchmark delivers an inclusive framework for assessing the generalizability of brain foundation models across key transfer settings, including cross-subject, multi-subject, and few-shot scenarios. We leverage AdaBrain-Bench to evaluate a suite of publicly available brain foundation models and offer insights into practices for selecting appropriate models in various scenarios. We make our benchmark pipeline available to enable reproducible research and external use, offering a continuously evolving platform to foster progress toward robust and generalized neural decoding solutions.
Submitted 5 August, 2025; v1 submitted 13 July, 2025;
originally announced July 2025.
-
KEN: Knowledge Augmentation and Emotion Guidance Network for Multimodal Fake News Detection
Authors:
Peican Zhu,
Yubo Jing,
Le Cheng,
Keke Tang,
Yangming Guo
Abstract:
In recent years, the rampant spread of misinformation on social media has made accurate detection of multimodal fake news a critical research focus. However, previous research has not adequately understood the semantics of images, and models struggle to discern news authenticity with limited textual information. Meanwhile, treating all emotional types of news uniformly without tailored approaches further leads to performance degradation. Therefore, we propose a novel Knowledge Augmentation and Emotion Guidance Network (KEN). On the one hand, we effectively leverage the powerful semantic understanding and extensive world knowledge of LVLMs. For images, the generated captions provide a comprehensive understanding of image content and scenes, while for text, the retrieved evidence helps break the information silos caused by closed and limited text and context. On the other hand, we consider inter-class differences between different emotional types of news through balanced learning, achieving fine-grained modeling of the relationship between emotional types and authenticity. Extensive experiments on two real-world datasets demonstrate the superiority of our KEN.
Submitted 17 July, 2025; v1 submitted 13 July, 2025;
originally announced July 2025.
-
Motion-Aware Adaptive Pixel Pruning for Efficient Local Motion Deblurring
Authors:
Wei Shang,
Dongwei Ren,
Wanying Zhang,
Pengfei Zhu,
Qinghua Hu,
Wangmeng Zuo
Abstract:
Local motion blur in digital images originates from the relative motion between dynamic objects and static imaging systems during exposure. Existing deblurring methods face significant challenges in addressing this problem due to their inefficient allocation of computational resources and inadequate handling of spatially varying blur patterns. To overcome these limitations, we first propose a trainable mask predictor that identifies blurred regions in the image. During training, we employ blur masks to exclude sharp regions. For inference optimization, we implement structural reparameterization by converting $3\times 3$ convolutions to computationally efficient $1\times 1$ convolutions, enabling pixel-level pruning of sharp areas to reduce computation. Second, we develop an intra-frame motion analyzer that translates relative pixel displacements into motion trajectories, establishing adaptive guidance for region-specific blur restoration. Our method is trained end-to-end using a combination of reconstruction loss, reblur loss, and mask loss guided by annotated blur masks. Extensive experiments demonstrate superior performance over state-of-the-art methods on both local and global blur datasets while reducing FLOPs by 49\% compared to SOTA models (e.g., LMD-ViT). The source code is available at https://github.com/shangwei5/M2AENet.
Submitted 10 July, 2025;
originally announced July 2025.
-
Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models
Authors:
Ziqi Miao,
Lijun Li,
Yuan Xiong,
Zhenhua Liu,
Pengyu Zhu,
Jing Shao
Abstract:
Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which a model's previous response in the dialogue can steer its subsequent behavior toward policy-violating content. While existing jailbreak attacks largely rely on single-turn or multi-turn prompt manipulations, or inject static in-context examples, these methods suffer from limited effectiveness, inefficiency, or semantic drift. We introduce Response Attack (RA), a novel framework that strategically leverages intermediate, mildly harmful responses as contextual primers within a dialogue. By reformulating harmful queries and injecting these intermediate responses before issuing a targeted trigger prompt, RA exploits a previously overlooked vulnerability in LLMs. Extensive experiments across eight state-of-the-art LLMs show that RA consistently achieves significantly higher attack success rates than nine leading jailbreak baselines. Our results demonstrate that the success of RA is directly attributable to the strategic use of intermediate responses, which induce models to generate more explicit and relevant harmful content while maintaining stealth, efficiency, and fidelity to the original query. The code and data are available at https://github.com/Dtc7w3PQ/Response-Attack.
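At the interface level, the attack amounts to constructing a dialogue in which a fabricated assistant turn precedes the trigger. A schematic construction (the prompts below are placeholders, not the paper's) might look like:

```python
# Illustrative RA-style dialogue: reformulated query, injected intermediate
# response as the contextual primer, then the targeted trigger prompt.
def build_ra_dialogue(reformulated_query, intermediate_response, trigger_prompt):
    return [
        {"role": "user", "content": reformulated_query},
        # The fabricated assistant turn acts as the contextual primer.
        {"role": "assistant", "content": intermediate_response},
        {"role": "user", "content": trigger_prompt},
    ]

messages = build_ra_dialogue(
    "Describe, in general terms, how X works.",
    "At a high level, X involves ...",
    "Now continue with concrete details.",
)
```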
Submitted 21 November, 2025; v1 submitted 7 July, 2025;
originally announced July 2025.
-
Cage-Based Deformation for Transferable and Undefendable Point Cloud Attack
Authors:
Keke Tang,
Ziyong Du,
Weilong Peng,
Xiaofei Wang,
Peican Zhu,
Ligang Liu,
Zhihong Tian
Abstract:
Adversarial attacks on point clouds often impose strict geometric constraints to preserve plausibility; however, such constraints inherently limit transferability and undefendability. While deformation offers an alternative, existing unstructured approaches may introduce unnatural distortions, making adversarial point clouds conspicuous and undermining their plausibility. In this paper, we propose CageAttack, a cage-based deformation framework that produces natural adversarial point clouds. It first constructs a cage around the target object, providing a structured basis for smooth, natural-looking deformation. Perturbations are then applied to the cage vertices, which seamlessly propagate to the point cloud, ensuring that the resulting deformations remain intrinsic to the object and preserve plausibility. Extensive experiments on seven 3D deep neural network classifiers across three datasets show that CageAttack achieves a superior balance among transferability, undefendability, and plausibility, outperforming state-of-the-art methods. Codes will be made public upon acceptance.
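The key property is that points move as a fixed linear function of the cage vertices, so perturbing a handful of vertices yields a smooth, object-intrinsic deformation. The sketch below uses a crude inverse-distance weighting in place of proper generalized barycentric (e.g., mean value) coordinates, so it is an approximation of the mechanism, not the paper's formulation:

```python
# Cage-driven deformation sketch: each point is a fixed convex combination of
# cage vertices; perturbing the few vertices smoothly moves the whole cloud.
import torch

def cage_coordinates(points, cage):           # points: (N, 3), cage: (M, 3)
    d = torch.cdist(points, cage)             # point-to-vertex distances (N, M)
    w = 1.0 / (d + 1e-8)
    return w / w.sum(dim=1, keepdim=True)     # each row is a convex combination

points, cage = torch.rand(1024, 3), torch.rand(16, 3)
W = cage_coordinates(points, cage)            # fixed once, before the attack
delta = 0.01 * torch.randn_like(cage)         # optimize only the 16 cage vertices
deformed = W @ (cage + delta)                 # smooth propagation to all points
```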
Submitted 1 July, 2025;
originally announced July 2025.
-
UniMind: Unleashing the Power of LLMs for Unified Multi-Task Brain Decoding
Authors:
Weiheng Lu,
Chunfeng Song,
Jiamin Wu,
Pengyu Zhu,
Yuchen Zhou,
Weijian Mai,
Qihao Zheng,
Wanli Ouyang
Abstract:
Decoding human brain activity from electroencephalography (EEG) signals is a central challenge at the intersection of neuroscience and artificial intelligence, enabling diverse applications in mental state assessment, clinical monitoring, and human-machine interaction. Recent efforts have extensively explored EEG-based brain foundation models for generalized brain decoding, employing large-scale training on multiple datasets. However, most of these attempts struggle with generalizability and fail to achieve satisfactory performance without task-specific tuning due to pronounced inherent heterogeneity among decoding tasks. To address these challenges, we present UniMind, a general-purpose EEG foundation model for unified multi-task brain decoding by uniquely unleashing the power of large language models to comprehend complex neural patterns. UniMind offers several advantages. First, we design a Neuro-Language Connector to bridge the modality gap between neural signals and large language models, distilling and transforming the spatiotemporal neural patterns of EEG data into representations understandable by language models. Second, a Task-aware Query Selection module is proposed to inject task-awareness into the cross-modal alignment by dynamically generating task-adaptive query tokens, enabling learning of task-relevant neural patterns across diverse tasks. Extensive experiments across ten datasets demonstrate that UniMind substantially outperforms state-of-the-art multi-task decoding models, with an average gain of 12 percent, while also offering valuable neuroscientific insights into neural functional correlations across tasks. The code will be made publicly available.
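One plausible reading of the Task-aware Query Selection module is a bank of learnable queries shifted by a task embedding before cross-attending to EEG tokens. The sketch below is an assumption-laden approximation of that idea, with invented dimensions, not the released design:

```python
# Hedged sketch: task-adaptive query tokens cross-attend to EEG features.
import torch
import torch.nn as nn

class TaskAwareQuery(nn.Module):
    def __init__(self, d=256, n_queries=8, n_tasks=10, n_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d))
        self.task_emb = nn.Embedding(n_tasks, d)
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, eeg_tokens, task_id):        # eeg_tokens: (B, T, d)
        # Shift shared queries by a task embedding -> task-adaptive queries.
        q = self.queries.unsqueeze(0) + self.task_emb(task_id).unsqueeze(1)
        out, _ = self.attn(q, eeg_tokens, eeg_tokens)
        return out                                  # (B, n_queries, d) for the LLM

m = TaskAwareQuery()
tokens = m(torch.randn(2, 100, 256), torch.tensor([0, 3]))
```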
Submitted 23 June, 2025;
originally announced June 2025.
-
Surprise Calibration for Better In-Context Learning
Authors:
Zhihang Tan,
Jingrui Hou,
Ping Wang,
Qibiao Hu,
Peng Zhu
Abstract:
In-context learning (ICL) has emerged as a powerful paradigm for task adaptation in large language models (LLMs), where models infer underlying task structures from a few demonstrations. However, ICL remains susceptible to biases that arise from prior knowledge and contextual demonstrations, which can degrade the performance of LLMs. Existing bias calibration methods typically apply fixed class priors across all inputs, limiting their efficacy in dynamic ICL settings where the context for each query differs. To address these limitations, we adopt implicit sequential Bayesian inference as a framework for interpreting ICL, identify "surprise" as an informative signal for class prior shift, and introduce a novel method--Surprise Calibration (SC). SC leverages the notion of surprise to capture the temporal dynamics of class priors, providing a more adaptive and computationally efficient solution for in-context learning. We empirically demonstrate the superiority of SC over existing bias calibration techniques across a range of benchmark natural language processing tasks.
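The surprise signal can be made concrete as the negative log-probability the model assigned to the observed label, gating how fast a running class prior shifts. The update rule below is a toy illustration of this idea, not the paper's exact estimator:

```python
# Toy surprise-gated prior update: larger surprise -> faster prior shift.
import numpy as np

def update_prior(prior, label, probs, base_rate=0.1):
    """probs: model's predictive distribution before seeing the true label."""
    surprise = -np.log(probs[label] + 1e-12)        # low prob -> high surprise
    rate = min(1.0, base_rate * surprise)           # surprise gates the shift
    new_prior = (1 - rate) * prior + rate * np.eye(len(prior))[label]
    return new_prior / new_prior.sum()

prior = np.array([0.5, 0.5])
print(update_prior(prior, label=1, probs=np.array([0.9, 0.1])))  # shifts toward class 1
```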
Submitted 17 June, 2025; v1 submitted 15 June, 2025;
originally announced June 2025.
-
Serving Large Language Models on Huawei CloudMatrix384
Authors:
Pengfei Zuo,
Huimin Lin,
Junbo Deng,
Nan Zou,
Xingkun Yang,
Yingyu Diao,
Weifeng Gao,
Ke Xu,
Zhangyu Chen,
Shirui Lu,
Zhao Qiu,
Peiyang Li,
Xianyu Chang,
Zhengzhong Yu,
Fangzheng Miao,
Jia Zheng,
Ying Li,
Yuan Feng,
Bei Wang,
Zaijian Zong,
Mosong Zhou,
Wenli Zhou,
Houjiang Chen,
Xingyu Liao,
Yipeng Li
, et al. (21 additional authors not shown)
Abstract:
The rapid evolution of large language models (LLMs), driven by growing parameter scales, adoption of mixture-of-experts (MoE) architectures, and expanding context lengths, imposes unprecedented demands on AI infrastructure. Traditional AI clusters face limitations in compute intensity, memory bandwidth, inter-chip communication, and latency, compounded by variable workloads and strict service-level objectives. Addressing these issues requires fundamentally redesigned hardware-software integration. This paper introduces Huawei CloudMatrix, a next-generation AI datacenter architecture, realized in the production-grade CloudMatrix384 supernode. It integrates 384 Ascend 910 NPUs and 192 Kunpeng CPUs interconnected via an ultra-high-bandwidth Unified Bus (UB) network, enabling direct all-to-all communication and dynamic pooling of resources. These features optimize performance for communication-intensive operations, such as large-scale MoE expert parallelism and distributed key-value cache access. To fully leverage CloudMatrix384, we propose CloudMatrix-Infer, an advanced LLM serving solution incorporating three core innovations: a peer-to-peer serving architecture that independently scales prefill, decode, and caching; a large-scale expert parallelism strategy supporting EP320 via efficient UB-based token dispatch; and hardware-aware optimizations including specialized operators, microbatch-based pipelining, and INT8 quantization. Evaluation with the DeepSeek-R1 model shows CloudMatrix-Infer achieves state-of-the-art efficiency: prefill throughput of 6,688 tokens/s per NPU and decode throughput of 1,943 tokens/s per NPU (<50 ms TPOT). It effectively balances throughput and latency, sustaining 538 tokens/s per NPU even under stringent 15 ms latency constraints, while INT8 quantization maintains model accuracy across benchmarks.
Submitted 19 June, 2025; v1 submitted 14 June, 2025;
originally announced June 2025.
-
Seedance 1.0: Exploring the Boundaries of Video Generation Models
Authors:
Yu Gao,
Haoyuan Guo,
Tuyen Hoang,
Weilin Huang,
Lu Jiang,
Fangyuan Kong,
Huixia Li,
Jiashi Li,
Liang Li,
Xiaojie Li,
Xunsong Li,
Yifu Li,
Shanchuan Lin,
Zhijie Lin,
Jiawei Liu,
Shu Liu,
Xiaonan Nie,
Zhiwu Qing,
Yuxi Ren,
Li Sun,
Zhi Tian,
Rui Wang,
Sen Wang,
Guoqiang Wei,
Guohong Wu
, et al. (19 additional authors not shown)
Abstract:
Notable breakthroughs in diffusion modeling have propelled rapid improvements in video generation, yet current foundational models still face critical challenges in simultaneously balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precise and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient architecture design with a proposed training paradigm, which natively supports multi-shot generation and joint learning of both text-to-video and image-to-video tasks; (iii) carefully optimized post-training approaches leveraging fine-grained supervised fine-tuning, and video-specific RLHF with multi-dimensional reward mechanisms for comprehensive performance improvements; (iv) excellent model acceleration achieving ~10x inference speedup through multi-stage distillation strategies and system-level optimizations. Seedance 1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds (NVIDIA-L20). Compared to state-of-the-art video generation models, Seedance 1.0 stands out with high-quality and fast video generation: superior spatiotemporal fluidity with structural stability, precise instruction adherence in complex multi-subject contexts, and native multi-shot narrative coherence with consistent subject representation.
Submitted 28 June, 2025; v1 submitted 10 June, 2025;
originally announced June 2025.
-
Stepwise Decomposition and Dual-stream Focus: A Novel Approach for Training-free Camouflaged Object Segmentation
Authors:
Chao Yin,
Hao Li,
Kequan Yang,
Jide Li,
Pinpin Zhu,
Xiaoqiang Li
Abstract:
While promptable segmentation (e.g., SAM) has shown promise for various segmentation tasks, it still requires manual visual prompts for each object to be segmented. In contrast, task-generic promptable segmentation aims to reduce the need for such detailed prompts by employing only a task-generic prompt to guide segmentation across all test samples. However, when applied to Camouflaged Object Segmentation (COS), current methods still face two critical issues: 1) semantic ambiguity in getting instance-specific text prompts, which arises from insufficient discriminative cues in holistic captions, leading to foreground-background confusion; 2) semantic discrepancy combined with spatial separation in getting instance-specific visual prompts, which results from global background sampling far from object boundaries with low feature correlation, causing SAM to segment irrelevant regions. To address the issues above, we propose RDVP-MSD, a novel training-free test-time adaptation framework that synergizes Region-constrained Dual-stream Visual Prompting (RDVP) via a Multimodal Stepwise Decomposition Chain of Thought (MSD-CoT). MSD-CoT progressively disentangles image captions to eliminate semantic ambiguity, while RDVP injects spatial constraints into visual prompting and independently samples visual prompts for foreground and background points, effectively mitigating semantic discrepancy and spatial separation. Without requiring any training or supervision, RDVP-MSD achieves state-of-the-art segmentation results on multiple COS benchmarks and delivers faster inference than previous methods, demonstrating significantly improved accuracy and efficiency. The codes will be available at https://github.com/ycyinchao/RDVP-MSD
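A simplified sketch of the region-constrained, dual-stream sampling: foreground points come from inside the candidate region, background points from a thin band just outside it, and the two streams feed a SAM-style point-prompt interface (labels 1/0). Box handling and sampling here are deliberately naive stand-ins for the paper's procedure:

```python
# Hedged sketch of dual-stream visual prompt sampling under a region constraint.
import numpy as np

def sample_prompts(fg_mask, box, n_fg=3, n_bg=4, band=10, seed=0):
    """fg_mask: HxW binary array; box: (x0, y0, x1, y1) candidate region."""
    rng = np.random.default_rng(seed)
    x0, y0, x1, y1 = box
    ys, xs = np.nonzero(fg_mask[y0:y1, x0:x1])   # assumes a non-empty foreground
    idx = rng.choice(len(xs), size=min(n_fg, len(xs)), replace=False)
    fg_pts = np.stack([xs[idx] + x0, ys[idx] + y0], axis=1)
    # Background stream: points in a thin band just outside the box, instead of
    # global sampling far from the object boundary.
    bg_pts = np.array([[max(x0 - band, 0), (y0 + y1) // 2],
                       [x1 + band, (y0 + y1) // 2],
                       [(x0 + x1) // 2, max(y0 - band, 0)],
                       [(x0 + x1) // 2, y1 + band]])[:n_bg]
    pts = np.concatenate([fg_pts, bg_pts])
    labels = np.concatenate([np.ones(len(fg_pts)), np.zeros(len(bg_pts))])
    return pts, labels                            # SAM-style point prompts
```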
Submitted 14 August, 2025; v1 submitted 7 June, 2025;
originally announced June 2025.
-
LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting
Authors:
Pai Zhu,
Quan Wang,
Dhruuv Agarwal,
Kurt Partridge
Abstract:
Custom keyword spotting (KWS) allows detecting user-defined spoken keywords from streaming audio. This is achieved by comparing the embeddings from voice enrollments and input audio. State-of-the-art custom KWS models are typically trained contrastively using utterances whose keywords are randomly sampled from the training dataset. These KWS models often struggle with confusing keywords, such as "blue" versus "glue". This paper introduces an effective way to augment the training with confusable utterances where keywords are generated and grouped by large language models (LLMs), and speech signals are synthesized with diverse speaking styles by text-to-speech (TTS) engines. To better measure user experience on confusable KWS, we define a new northstar metric using the average area under the DET curve from confusable groups (c-AUC). Featuring high scalability and zero labor cost, the proposed method improves AUC by 3.7% and c-AUC by 11.3% on the Speech Commands testing set.
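The c-AUC metric can be read as a per-group average. A minimal sketch, with sklearn's ROC AUC standing in for the area under the DET curve and made-up scores per confusable group:

```python
# Sketch of the c-AUC idea: one AUC per confusable keyword group, averaged.
import numpy as np
from sklearn.metrics import roc_auc_score

def c_auc(groups):
    """groups: list of (labels, scores) arrays, one per confusable group."""
    return float(np.mean([roc_auc_score(y, s) for y, s in groups]))

blue_vs_glue = (np.array([1, 1, 0, 0]), np.array([0.9, 0.7, 0.6, 0.2]))
light_vs_lite = (np.array([1, 0, 1, 0]), np.array([0.8, 0.4, 0.6, 0.3]))
print(c_auc([blue_vs_glue, light_vs_lite]))
```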
Submitted 28 May, 2025;
originally announced May 2025.
-
Multimodal Emotion Recognition in Conversations: A Survey of Methods, Trends, Challenges and Prospects
Authors:
Chengyan Wu,
Yiqiang Cai,
Yang Liu,
Pengxu Zhu,
Yun Xue,
Ziwei Gong,
Julia Hirschberg,
Bolei Ma
Abstract:
While text-based emotion recognition methods have achieved notable success, real-world dialogue systems often demand a more nuanced emotional understanding than any single modality can offer. Multimodal Emotion Recognition in Conversations (MERC) has thus emerged as a crucial direction for enhancing the naturalness and emotional understanding of human-computer interaction. Its goal is to accurately recognize emotions by integrating information from various modalities such as text, speech, and visual signals.
This survey offers a systematic overview of MERC, including its motivations, core tasks, representative methods, and evaluation strategies. We further examine recent trends, highlight key challenges, and outline future directions. As interest in emotionally intelligent systems grows, this survey provides timely guidance for advancing MERC research.
Submitted 9 September, 2025; v1 submitted 26 May, 2025;
originally announced May 2025.
-
GraphemeAug: A Systematic Approach to Synthesized Hard Negative Keyword Spotting Examples
Authors:
Harry Zhang,
Kurt Partridge,
Pai Zhu,
Neng Chen,
Hyun Jin Park,
Dhruuv Agarwal,
Quan Wang
Abstract:
Spoken Keyword Spotting (KWS) is the task of distinguishing between the presence and absence of a keyword in audio. The accuracy of a KWS model hinges on its ability to correctly classify examples close to the keyword and non-keyword boundary. These boundary examples are often scarce in training data, limiting model performance. In this paper, we propose a method to systematically generate adversarial examples close to the decision boundary by making insertion/deletion/substitution edits on the keyword's graphemes. We evaluate this technique on held-out data for a popular keyword and show that the technique improves AUC on a dataset of synthetic hard negatives by 61% while maintaining quality on positives and ambient negative audio data.
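The edit operations themselves are easy to make concrete. A minimal generator of single-edit grapheme variants (unfiltered, over a plain lowercase alphabet rather than the paper's grapheme inventory):

```python
# Single-edit variants of a keyword: insertions, deletions, substitutions.
import string

def grapheme_edits(keyword, alphabet=string.ascii_lowercase):
    deletions = {keyword[:i] + keyword[i+1:] for i in range(len(keyword))}
    substitutions = {keyword[:i] + c + keyword[i+1:]
                     for i in range(len(keyword)) for c in alphabet}
    insertions = {keyword[:i] + c + keyword[i:]
                  for i in range(len(keyword) + 1) for c in alphabet}
    return (deletions | substitutions | insertions) - {keyword}

edits = grapheme_edits("blue")
print("glue" in edits, "blu" in edits, len(edits))  # substitution, deletion, count
```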
Submitted 24 May, 2025; v1 submitted 20 May, 2025;
originally announced May 2025.
-
CE-LSLM: Efficient Large-Small Language Model Inference and Communication via Cloud-Edge Collaboration
Authors:
Pengyan Zhu,
Tingting Yang
Abstract:
Emerging intelligent service scenarios in 6G communication impose stringent requirements for low latency, high reliability, and privacy preservation. Generative large language models (LLMs) are gradually becoming key enablers for the integration of semantic communication and computation. However, due to the limited computational resources of edge devices and the increasing complexity of heterogeneous terminal access, existing centralized inference approaches fail to meet the dual demands of response efficiency and data privacy in edge-side inference tasks. To address these challenges, this paper proposes a novel collaborative inference architecture that integrates cloud-based LLMs with edge-deployed small language models (SLMs), enabling dynamic scheduling and sharing of semantic-level intermediate states, and establishing a unified computation-communication paradigm tailored for 6G networks. Specifically, a key-value (KV) cache reuse mechanism is introduced to enhance the semantic understanding of edge models through contextual guidance from the cloud, while significantly reducing edge-side computational and storage overhead. Furthermore, a cross-node parallel scheduling mechanism is proposed to achieve asynchronous coordination between model state loading and decoding computation, thereby improving edge responsiveness. In addition, we investigate layer alignment and representation compression strategies between heterogeneous models to alleviate the communication burden on the edge. Experimental results demonstrate that the proposed architecture exhibits superior adaptability and scalability in terms of inference latency, system stability, and concurrent processing capacity.
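The KV-cache reuse can be pictured with bare tensors: the cloud prefills the shared context once and ships per-layer key/value pairs, and the edge attends over them instead of recomputing the prefill. This toy sketch ignores the layer-alignment and representation-compression steps the abstract mentions, and involves no real LLM:

```python
# Toy cloud-to-edge KV reuse with bare tensors.
import torch

def cloud_prefill(n_layers=4, n_ctx=128, d=64):
    # Stand-in for the cloud LLM prefill: one (K, V) pair per layer.
    return [(torch.randn(n_ctx, d), torch.randn(n_ctx, d)) for _ in range(n_layers)]

def edge_decode_step(q, kv_cache, layer=0):
    # The edge model attends over the reused cache rather than recomputing it.
    k, v = kv_cache[layer]
    attn = torch.softmax(q @ k.T / k.size(-1) ** 0.5, dim=-1)
    return attn @ v

cache = cloud_prefill()
out = edge_decode_step(torch.randn(1, 64), cache)   # (1, 64) context read
```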
Submitted 20 May, 2025;
originally announced May 2025.
-
DecIF: Improving Instruction-Following through Meta-Decomposition
Authors:
Tingfeng Hui,
Pengyu Zhu,
Bowen Ping,
Ling Tang,
Guanting Dong,
Yaqi Zhang,
Sen Su
Abstract:
Instruction-following has emerged as a crucial capability for large language models (LLMs). However, existing approaches often rely on pre-existing documents or external resources to synthesize instruction-following data, which limits their flexibility and generalizability. In this paper, we introduce DecIF, a fully autonomous, meta-decomposition guided framework that generates diverse and high-quality instruction-following data using only LLMs. DecIF is grounded in the principle of decomposition. For instruction generation, we guide LLMs to iteratively produce various types of meta-information, which are then combined with response constraints to form well-structured and semantically rich instructions. We further utilize LLMs to detect and resolve potential inconsistencies within the generated instructions. Regarding response generation, we decompose each instruction into atomic-level evaluation criteria, enabling rigorous validation and the elimination of inaccurate instruction-response pairs. Extensive experiments across a wide range of scenarios and settings demonstrate DecIF's superior performance on instruction-following tasks. Further analysis highlights its strong flexibility, scalability, and generalizability in automatically synthesizing high-quality instruction data.
Submitted 10 June, 2025; v1 submitted 20 May, 2025;
originally announced May 2025.
-
SourceDetMamba: A Graph-aware State Space Model for Source Detection in Sequential Hypergraphs
Authors:
Le Cheng,
Peican Zhu,
Yangming Guo,
Chao Gao,
Zhen Wang,
Keke Tang
Abstract:
Source detection on graphs has demonstrated high efficacy in identifying rumor origins. Despite advances in machine learning-based methods, many fail to capture intrinsic dynamics of rumor propagation. In this work, we present SourceDetMamba: A Graph-aware State Space Model for Source Detection in Sequential Hypergraphs, which harnesses the recent success of the state space model Mamba, known for its superior global modeling capabilities and computational efficiency, to address this challenge. Specifically, we first employ hypergraphs to model high-order interactions within social networks. Subsequently, temporal network snapshots generated during the propagation process are sequentially fed in reverse order into Mamba to infer underlying propagation dynamics. Finally, to empower the sequential model to effectively capture propagation patterns while integrating structural information, we propose a novel graph-aware state update mechanism, wherein the state of each node is propagated and refined by both temporal dependencies and topological context. Extensive evaluations on eight datasets demonstrate that SourceDetMamba consistently outperforms state-of-the-art approaches.
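A hedged sketch of what a graph-aware state update could look like: each node's sequential state is first advanced in time, then refined by aggregating neighbor states through a row-normalized adjacency, with a learned mixing layer. A GRU cell stands in for the Mamba recurrence, and the exact formulation in the paper may differ:

```python
# Graph-aware state update sketch: temporal step, then topological refinement.
import torch
import torch.nn as nn

class GraphAwareUpdate(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.temporal = nn.GRUCell(d, d)      # stand-in for the Mamba recurrence
        self.mix = nn.Linear(2 * d, d)

    def forward(self, x_t, h, adj):           # x_t, h: (N, d); adj: (N, N) row-normalized
        h_time = self.temporal(x_t, h)        # temporal dependency
        h_graph = adj @ h_time                # topological context from neighbors
        return torch.tanh(self.mix(torch.cat([h_time, h_graph], dim=-1)))
```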
Submitted 4 June, 2025; v1 submitted 19 May, 2025;
originally announced May 2025.
-
HyperDet: Source Detection in Hypergraphs via Interactive Relationship Construction and Feature-rich Attention Fusion
Authors:
Le Cheng,
Peican Zhu,
Yangming Guo,
Keke Tang,
Chao Gao,
Zhen Wang
Abstract:
Hypergraphs offer superior modeling capabilities for social networks, particularly in capturing group phenomena that extend beyond pairwise interactions in rumor propagation. Existing approaches in rumor source detection predominantly focus on dyadic interactions, which inadequately address the complexity of more intricate relational structures. In this study, we present a novel approach for Source Detection in Hypergraphs (HyperDet) via Interactive Relationship Construction and Feature-rich Attention Fusion. Specifically, our methodology employs an Interactive Relationship Construction module to accurately model both the static topology and dynamic interactions among users, followed by the Feature-rich Attention Fusion module, which autonomously learns node features and discriminates between nodes using a self-attention mechanism, thereby effectively learning node representations under the framework of accurately modeled higher-order relationships. Extensive experimental validation confirms the efficacy of our HyperDet approach, showcasing its superiority relative to current state-of-the-art methods.
Submitted 4 June, 2025; v1 submitted 19 May, 2025;
originally announced May 2025.
-
Q-space Guided Collaborative Attention Translation Network for Flexible Diffusion-Weighted Images Synthesis
Authors:
Pengli Zhu,
Yingji Fu,
Nanguang Chen,
Anqi Qiu
Abstract:
In this study, we propose a novel Q-space Guided Collaborative Attention Translation Network (Q-CATN) for multi-shell, high-angular resolution DWI (MS-HARDI) synthesis from flexible q-space sampling, leveraging the commonly acquired structural MRI data. Q-CATN employs a collaborative attention mechanism to effectively extract complementary information from multiple modalities and dynamically adjust its internal representations based on flexible q-space information, eliminating the need for fixed sampling schemes. Additionally, we introduce a range of task-specific constraints to preserve anatomical fidelity in DWI, enabling Q-CATN to accurately learn the intrinsic relationships between directional DWI signal distributions and q-space. Extensive experiments on the Human Connectome Project (HCP) dataset demonstrate that Q-CATN outperforms existing methods, including 1D-qDL, 2D-qDL, MESC-SD, and QGAN, in estimating parameter maps and fiber tracts both quantitatively and qualitatively, while preserving fine-grained details. Notably, its ability to accommodate flexible q-space sampling highlights its potential as a promising toolkit for clinical and research applications. Our code is available at https://github.com/Idea89560041/Q-CATN.
Submitted 14 May, 2025;
originally announced May 2025.
-
The Geography of Transportation Cybersecurity: Visitor Flows, Industry Clusters, and Spatial Dynamics
Authors:
Yuhao Wang,
Kailai Wang,
Songhua Hu,
Yunpeng Zhang,
Gino Lim,
Pengyu Zhu
Abstract:
The rapid evolution of the transportation cybersecurity ecosystem, encompassing cybersecurity, automotive, and transportation and logistics sectors, will lead to the formation of distinct spatial clusters and visitor flow patterns across the US. This study examines the spatiotemporal dynamics of visitor flows, analyzing how socioeconomic factors shape industry clustering and workforce distribution within these evolving sectors. To model and predict visitor flow patterns, we develop a BiTransGCN framework, integrating an attention-based Transformer architecture with a Graph Convolutional Network backbone. By integrating AI-enabled forecasting techniques with spatial analysis, this study improves our ability to track, interpret, and anticipate changes in industry clustering and mobility trends, thereby supporting strategic planning for a secure and resilient transportation network. It offers a data-driven foundation for economic planning, workforce development, and targeted investments in the transportation cybersecurity ecosystem.
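A schematic of the attention-plus-GCN pairing for flow prediction is sketched below; dimensions, depth, and the one-hop propagation are all illustrative assumptions, not the BiTransGCN architecture itself:

```python
# Hybrid sketch: GCN-style spatial message passing per time step, followed by
# a Transformer encoder over the time axis, then a prediction head.
import torch
import torch.nn as nn

class BiTransGCNSketch(nn.Module):
    def __init__(self, d=64, horizon=1):
        super().__init__()
        self.gcn = nn.Linear(d, d)             # one-hop GCN layer: A @ X @ W
        enc = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(enc, num_layers=2)
        self.head = nn.Linear(d, horizon)

    def forward(self, x, adj):                 # x: (B, T, N, d), adj: (N, N)
        B, T, N, d = x.shape
        x = self.gcn(adj @ x)                  # spatial message passing per step
        x = x.permute(0, 2, 1, 3).reshape(B * N, T, d)
        x = self.temporal(x)                   # attention over the time axis
        return self.head(x[:, -1]).view(B, N, -1)

m = BiTransGCNSketch()
pred = m(torch.randn(2, 12, 30, 64), torch.eye(30))   # (2, 30, 1) next-step flows
```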
Submitted 12 May, 2025;
originally announced May 2025.
-
High-Frequency Prior-Driven Adaptive Masking for Accelerating Image Super-Resolution
Authors:
Wei Shang,
Dongwei Ren,
Wanying Zhang,
Pengfei Zhu,
Qinghua Hu,
Wangmeng Zuo
Abstract:
The primary challenge in accelerating image super-resolution lies in reducing computation while maintaining performance and adaptability. Motivated by the observation that high-frequency regions (e.g., edges and textures) are most critical for reconstruction, we propose a training-free adaptive masking module for acceleration that dynamically focuses computation on these challenging areas. Specifically, our method first extracts high-frequency components via Gaussian blur subtraction and adaptively generates binary masks using K-means clustering to identify regions requiring intensive processing. Our method can be easily integrated with both CNNs and Transformers. For CNN-based architectures, we replace standard $3 \times 3$ convolutions with an unfold operation followed by $1 \times 1$ convolutions, enabling pixel-wise sparse computation guided by the mask. For Transformer-based models, we partition the mask into non-overlapping windows and selectively process tokens based on their average values. During inference, unnecessary pixels or windows are pruned, significantly reducing computation. Moreover, our method supports dilation-based mask adjustment to control the processing scope without retraining, and is robust to unseen degradations (e.g., noise, compression). Extensive experiments on benchmarks demonstrate that our method reduces FLOPs by 24--43% for state-of-the-art models (e.g., CARN, SwinIR) while achieving comparable or better quantitative metrics. The source code is available at https://github.com/shangwei5/AMSR
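The mask construction is concrete enough to sketch end to end. OpenCV and scikit-learn here are stand-ins for whatever the authors actually use, and the kernel size, sigma, and 2-cluster split are illustrative:

```python
# High-frequency mask sketch: blur-subtract for high-frequency energy, then a
# 2-means split into "easy" (smooth) and "hard" (edge/texture) pixels.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def highfreq_mask(img_gray, ksize=9, sigma=2.0):
    blur = cv2.GaussianBlur(img_gray, (ksize, ksize), sigma)
    hf = np.abs(img_gray.astype(np.float32) - blur.astype(np.float32))
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(hf.reshape(-1, 1))
    # The cluster with larger mean high-frequency energy is the "hard" region.
    hard_id = np.argmax([hf.reshape(-1)[labels == k].mean() for k in (0, 1)])
    return (labels.reshape(hf.shape) == hard_id).astype(np.uint8)
```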
Submitted 11 May, 2025;
originally announced May 2025.
-
Bi-directional Self-Registration for Misaligned Infrared-Visible Image Fusion
Authors:
Timing Li,
Bing Cao,
Pengfei Zhu,
Bin Xiao,
Qinghua Hu
Abstract:
Acquiring accurately aligned multi-modal image pairs is fundamental for achieving high-quality multi-modal image fusion. To address the lack of ground truth in current multi-modal image registration and fusion methods, we propose a novel self-supervised Bi-directional Self-Registration framework (B-SR). Specifically, B-SR utilizes a proxy data generator (PDG) and an inverse proxy data generator (IPDG) to achieve self-supervised global-local registration. Visible-infrared image pairs with spatially misaligned differences are aligned to obtain global differences through the registration module. The same image pairs are processed by PDG, such as cropping, flipping, stitching, etc., and then aligned to obtain local differences. IPDG converts the obtained local differences into pseudo-global differences, which are used to perform global-local difference consistency with the global differences. Furthermore, aiming at eliminating the effect of modal gaps on the registration module, we design a neighborhood dynamic alignment loss to achieve cross-modal image edge alignment. Extensive experiments on misaligned multi-modal images demonstrate the effectiveness of the proposed method in multi-modal image alignment and fusion against the competing methods. Our code will be publicly available.
Submitted 11 May, 2025;
originally announced May 2025.