
Showing 1–50 of 214 results for author: Guan, W

  1. arXiv:2604.07922  [pdf, ps, other]

    cs.AI cs.CL

    SAT: Balancing Reasoning Accuracy and Efficiency with Stepwise Adaptive Thinking

    Authors: Weiyang Huang, Xuefeng Bai, Kehai Chen, Xinyang Chen, Yibin Chen, Weili Guan, Min Zhang

    Abstract: Large Reasoning Models (LRMs) have revolutionized complex problem-solving, yet they exhibit a pervasive "overthinking", generating unnecessarily long reasoning chains. While current solutions improve token efficiency, they often sacrifice fine-grained control or risk disrupting the logical integrity of the reasoning process. To address this, we introduce Stepwise Adaptive Thinking (SAT), a framewo…

    Submitted 9 April, 2026; originally announced April 2026.

    Comments: accepted to ACL2026 main conference

  2. arXiv:2604.00513  [pdf, ps, other]

    cs.LG cs.AI cs.CV cs.IR

    MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding

    Authors: Junxian Wu, Chenghan Fu, Zhanheng Nie, Daoze Zhang, Bowen Wan, Wanxian Guan, Chuan Yu, Jian Xu, Bo Zheng

    Abstract: With the rapid growth of e-commerce, exploring general representations rather than task-specific ones has attracted increasing attention. Although recent multimodal large language models (MLLMs) have driven significant progress in product understanding, they are typically employed as feature extractors that implicitly encode product information into global embeddings, thereby limiting their abilit…

    Submitted 2 April, 2026; v1 submitted 1 April, 2026; originally announced April 2026.

    Comments: 10 pages, 6 figures

  3. arXiv:2604.00404  [pdf, ps, other]

    cs.CV

    The 1st Winner for 5th PVUW MeViS-Text Challenge: Strong MLLMs Meet SAM3 for Referring Video Object Segmentation

    Authors: Xusheng He, Canyang Wu, Jinrong Zhang, Weili Guan, Jianlong Wu, Liqiang Nie

    Abstract: This report presents our winning solution to the 5th PVUW MeViS-Text Challenge. The track studies referring video object segmentation under motion-centric language expressions, where the model must jointly understand appearance, temporal behavior, and object interactions. To address this problem, we build a fully training-free pipeline that combines strong multimodal large language models with SAM…

    Submitted 31 March, 2026; originally announced April 2026.

    Comments: 1st Place Solution for the 5th PVUW MeViS-Text Challenge (CVPR 2026 Workshop)

  4. arXiv:2604.00395  [pdf, ps, other]

    cs.CV

    Advancing Complex Video Object Segmentation via Tracking-Enhanced Prompt: The 1st Winner for 5th PVUW MOSE Challenge

    Authors: Jinrong Zhang, Canyang Wu, Xusheng He, Weili Guan, Jianlong Wu, Liqiang Nie

    Abstract: In the Complex Video Object Segmentation task, researchers are required to track and segment specific targets within cluttered environments, which rigorously tests a method's capability for target comprehension and environmental adaptability. Although SAM3, the current state-of-the-art solution, exhibits unparalleled segmentation performance and robustness on conventional targets, it underperforms…

    Submitted 31 March, 2026; originally announced April 2026.

    Comments: 1st Place Solution for the 5th PVUW MOSE Challenge (CVPR 2026 Workshop)

  5. arXiv:2603.30014  [pdf, ps, other]

    cs.DC cs.AI

    Scalable AI-assisted Workflow Management for Detector Design Optimization Using Distributed Computing

    Authors: Derek Anderson, Amit Bashyal, Markus Diefenthaler, Cristiano Fanelli, Wen Guan, Tanja Horn, Alex Jentsch, Meifeng Lin, Tadashi Maeno, Kei Nagai, Hemalata Nayak, Connor Pecar, Karthik Suresh, Fang-Ying Tsai, Anselm Vossen, Tianle Wang, Torre Wenaus

    Abstract: The Production and Distributed Analysis (PanDA) system, originally developed for the ATLAS experiment at the CERN Large Hadron Collider (LHC), has evolved into a robust platform for orchestrating large-scale workflows across distributed computing resources. Coupled with its intelligent Distributed Dispatch and Scheduling (iDDS) component, PanDA supports AI/ML-driven workflows through a scalable an…

    Submitted 31 March, 2026; originally announced March 2026.

  6. arXiv:2603.29842  [pdf, ps, other]

    cs.CV cs.LG

    Toward Generalizable Whole Brain Representations with High-Resolution Light-Sheet Data

    Authors: Minyoung E. Kim, Dae Hee Yun, Aditi V. Patel, Madeline Hon, Webster Guan, Taegeon Lee, Brian Nguyen

    Abstract: Unprecedented visual details of biological structures are being revealed by subcellular-resolution whole-brain 3D microscopy data, enabled by recent advances in intact tissue processing and light-sheet fluorescence microscopy (LSFM). These volumetric data offer rich morphological and spatial cellular information, however, the lack of scalable data processing and analysis methods tailored to these…

    Submitted 31 March, 2026; originally announced March 2026.

    Comments: 21 pages, 12 figures. Accepted at CVPR 2026

  7. The Vera C. Rubin Observatory Data Preview 1

    Authors: Vera C Rubin Observatory Team, Tatiana Acero Cuellar, Emily Acosta, Christina L Adair, Prakruth Adari, Jennifer K Adelman McCarthy, Anastasia Alexov, Russ Allbery, Robyn Allsman, Yusra AlSayyad, Jhonatan Amado, Nathan Amouroux, Pierre Antilogus, Alexis Aracena Alcayaga, Gonzalo Aravena Rojas, Claudio H Araya Cortes, Eric Aubourg, Tim S Axelrod, John Banovetz, Carlos Barria, Amanda E Bauer, Brian J Bauman, Ellen Bechtol, Keith Bechtol, Andrew C Becker, et al. (303 additional authors not shown)

    Abstract: We present Rubin Data Preview 1 DP1, the first data from the NSF DOE Vera C Rubin Observatory, comprising raw and calibrated single epoch images, coadds, difference images, detection catalogs, and ancillary data products. DP1 is based on 1792 optical near infrared exposures acquired over 48 distinct nights by the Rubin Commissioning Camera LSSTComCam on the Simonyi Survey Telescope at the Summit F…

    Submitted 24 March, 2026; originally announced March 2026.

    Comments: 59 pages, 41 figures

    Report number: RTN-095.lsst.io

  8. arXiv:2603.20236  [pdf, ps, other]

    cs.RO

    EnergyAction: Unimanual to Bimanual Composition with Energy-Based Models

    Authors: Mingchen Song, Xiang Deng, Jie Wei, Dongmei Jiang, Liqiang Nie, Weili Guan

    Abstract: Recent advances in unimanual manipulation policies have achieved remarkable success across diverse robotic tasks through abundant training data and well-established model architectures. However, extending these capabilities to bimanual manipulation remains challenging due to the lack of bimanual demonstration data and the complexity of coordinating dual-arm actions. Existing approaches either rely…

    Submitted 9 March, 2026; originally announced March 2026.

  9. arXiv:2603.14251  [pdf, ps, other]

    cs.CL cs.AI

    Mitigating Overthinking in Large Reasoning Language Models via Reasoning Path Deviation Monitoring

    Authors: Weixin Guan, Liang Li, Jiapeng Liu, Bing Li, Peng Fu, Chengyang Fang, Xiaoshuai Hao, Can Ma, Weiping Wang

    Abstract: Large Reasoning Language Models (LRLMs) demonstrate impressive capabilities on complex tasks by utilizing long Chain-of-Thought reasoning. However, they are prone to overthinking, which generates redundant reasoning steps that degrade both performance and efficiency. Recently, early-exit strategies are proposed to mitigate overthinking by dynamically and adaptively terminating redundant reasoning.…

    Submitted 15 March, 2026; originally announced March 2026.

  10. arXiv:2603.12606  [pdf, ps, other]

    cs.CV cs.AI

    Mastering Negation: Boosting Grounding Models via Grouped Opposition-Based Learning

    Authors: Zesheng Yang, Xi Jiang, Bingzhang Hu, Weili Guan, Runmin Cong, Guo-Jun Qi, Feng Zheng

    Abstract: Current vision-language detection and grounding models predominantly focus on prompts with positive semantics and often struggle to accurately interpret and ground complex expressions containing negative semantics. A key reason for this limitation is the lack of high-quality training data that explicitly captures discriminative negative samples and negation-aware language descriptions. To addres…

    Submitted 12 March, 2026; originally announced March 2026.

    Comments: 12 pages, 6 figures

  11. arXiv:2603.12138  [pdf, ps, other]

    cs.CV

    HATS: Hardness-Aware Trajectory Synthesis for GUI Agents

    Authors: Rui Shao, Ruize Gao, Bin Xie, Yixing Li, Kaiwen Zhou, Shuai Wang, Weili Guan, Gongwei Chen

    Abstract: Graphical user interface (GUI) agents powered by large vision-language models (VLMs) have shown remarkable potential in automating digital tasks, highlighting the need for high-quality trajectory data to support effective agent training. Yet existing trajectory synthesis pipelines often yield agents that fail to generalize beyond simple interactions. We identify this limitation as stemming from th…

    Submitted 12 March, 2026; originally announced March 2026.

    Comments: Accepted by CVPR 2026

  12. arXiv:2603.07536  [pdf, ps, other]

    physics.flu-dyn

    Stabilization of premixed NH3/H2/air flames via bluff-body flame holders

    Authors: Lukas Gaipl, Wei Guan, Ganesh Guggilla, Alexey Kropman, Frank Beyrau, Dominique Thévenin

    Abstract: The stabilization mechanisms of fully premixed NH3/H2/air flames anchored behind a bluff body are investigated using combined experiments and direct numerical simulations. Particular attention is given to the interplay between preferential diffusion, heat release, flow recirculation, and turbulence-flame interaction. Comparison between non-reactive and reactive cases shows that thermal expansion s…

    Submitted 8 March, 2026; originally announced March 2026.

  13. arXiv:2602.13978  [pdf, ps, other]

    math.AP

    Constrained variational problems on perturbed lattice graphs

    Authors: Weiqi Guan

    Abstract: In this paper, we solve some constrained variational problems on perturbed lattice graphs $G$. The first problem addresses the existence of ground state normalized solutions to Schrödinger equations \begin{equation*} \left\{ \begin{aligned} &-\Delta_{G} u+\lambda u=\vert u\vert^{p-2}u,\quad x\in G \\ &\Vert u\Vert_{l^2(G)}^2=a. \end{aligned} \right. \end{equation*} We prove that if the graph is obtained by…

    Submitted 14 February, 2026; originally announced February 2026.

    Comments: 17 pages

  14. arXiv:2602.01167  [pdf, ps, other]

    cs.AI

    Do All Individual Layers Help? An Empirical Study of Task-Interfering Layers in Vision-Language Models

    Authors: Zhiming Liu, Yujie Wei, Lei Feng, Xiu Su, Xiaobo Xia, Weili Guan, Zeke Xie, Shuo Yang

    Abstract: Current VLMs have demonstrated capabilities across a wide range of multimodal tasks. Typically, in a pretrained VLM, all layers are engaged by default to make predictions on downstream tasks. We find that intervening on a single layer, such as by zeroing its parameters, can improve the performance on certain tasks, indicating that some layers hinder rather than help downstream tasks. We systematic…

    Submitted 1 February, 2026; originally announced February 2026.

  15. arXiv:2602.00557  [pdf, ps, other]

    cs.RO

    ConLA: Contrastive Latent Action Learning from Human Videos for Robotic Manipulation

    Authors: Weisheng Dai, Kai Lan, Jianyi Zhou, Bo Zhao, Xiu Su, Junwen Tong, Weili Guan, Shuo Yang

    Abstract: Vision-Language-Action (VLA) models achieve preliminary generalization through pretraining on large scale robot teleoperation datasets. However, acquiring datasets that comprehensively cover diverse tasks and environments is extremely costly and difficult to scale. In contrast, human demonstration videos offer a rich and scalable source of diverse scenes and manipulation behaviors, yet their lack…

    Submitted 31 January, 2026; originally announced February 2026.

  16. arXiv:2601.22803  [pdf, ps, other]

    cs.AI cs.SE

    CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning

    Authors: Ji Shi, Peiming Guo, Meishan Zhang, Miao Zhang, Xuebo Liu, Min Zhang, Weili Guan

    Abstract: Code verifiers play a critical role in post-verification for LLM-based code generation, yet existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. While reinforcement learning (RL) offers a promising alternative by optimizing models through execution-driven rewards without labeled supervision, our preliminary results show that naive RL…

    Submitted 30 January, 2026; originally announced January 2026.

    Comments: 17 pages, 3 figures

  17. arXiv:2601.20597  [pdf, ps, other]

    cs.CV

    StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval

    Authors: Shaokun Wang, Weili Guan, Jizhou Han, Jianlong Wu, Yupeng Hu, Liqiang Nie

    Abstract: Continual Text-to-Video Retrieval (CTVR) is a challenging multimodal continual learning setting, where models must incrementally learn new semantic categories while maintaining accurate text-video alignment for previously learned ones, thus making it particularly prone to catastrophic forgetting. A key challenge in CTVR is feature drift, which manifests in two forms: intra-modal feature drift caus…

    Submitted 28 January, 2026; originally announced January 2026.

  18. arXiv:2601.20526  [pdf, ps, other]

    cs.CV

    IOTA: Corrective Knowledge-Guided Prompt Learning via Black-White Box Framework

    Authors: Shaokun Wang, Yifan Yu, Yuhang He, Weili Guan, Yihong Gong

    Abstract: Recently, adapting pre-trained models to downstream tasks has attracted increasing interest. Previous Parameter-Efficient-Tuning (PET) methods regard the pre-trained model as an opaque Black Box model, relying purely on data-driven optimization and underutilizing their inherent prior knowledge. This oversight limits the models' potential for effective downstream task adaptation. To address these i…

    Submitted 28 January, 2026; originally announced January 2026.

  19. arXiv:2601.09636  [pdf, ps, other]

    cs.AI cs.CV cs.HC cs.LG

    PersonalAlign: Hierarchical Implicit Intent Alignment for Personalized GUI Agent with Long-Term User-Centric Records

    Authors: Yibo Lyu, Gongwei Chen, Rui Shao, Weili Guan, Liqiang Nie

    Abstract: While GUI agents have shown strong performance under explicit and completion instructions, real-world deployment requires aligning with users' more complex implicit intents. In this work, we highlight Hierarchical Implicit Intent Alignment for Personalized GUI Agent (PersonalAlign), a new agent task that requires agents to leverage long-term user records as persistent context to resolve omitted pr…

    Submitted 14 January, 2026; originally announced January 2026.

  20. arXiv:2601.03632  [pdf, ps, other]

    eess.AS cs.AI cs.SD

    ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis

    Authors: Haitao Li, Chunxiang Jin, Chenglin Li, Wenhao Guan, Zhengxing Huang, Xie Chen

    Abstract: Zero-shot text-to-speech models can clone a speaker's timbre from a short reference audio, but they also strongly inherit the speaking style present in the reference. As a result, synthesizing speech with a desired style often requires carefully selecting reference audio, which is impractical when only limited or mismatched references are available. While recent controllable TTS methods attempt to…

    Submitted 7 January, 2026; originally announced January 2026.

  21. arXiv:2512.15751  [pdf, ps, other]

    cs.LG cs.AI cs.MA

    GLOW: Graph-Language Co-Reasoning for Agentic Workflow Performance Prediction

    Authors: Wei Guan, Jian Cao, Jinyu Cai, Qiqi Cai, Jianqi Gao, See-Kiong Ng

    Abstract: Agentic Workflows (AWs) have emerged as a promising paradigm for solving complex tasks. However, the scalability of automating their generation is severely constrained by the high cost and latency of execution-based evaluation. Existing AW performance prediction methods act as surrogates but fail to simultaneously capture the intricate topological dependencies and the deep semantic logic embedded…

    Submitted 11 December, 2025; originally announced December 2025.

  22. arXiv:2512.15049  [pdf, ps, other]

    cs.IT

    On the Stochastic Analysis of Random Linear Streaming Codes in Multi-Hop Relay Networks

    Authors: Kai Huang, Xinyu Xie, Chunpeng Chen, Wenjie Guan, Xiaoran Wang, Jinbei Zhang

    Abstract: In this paper, we aim to explore the stochastic performance limit of large-field-size Random Linear Streaming Codes (RLSCs) in multi-hop relay networks. In our model, a source transmits a sequence of streaming messages to a destination through multiple relays subject to a delay constraint. Most previous research focused on deterministic adversarial channel which introduces only restricted types of…

    Submitted 16 December, 2025; originally announced December 2025.

  23. arXiv:2512.10416  [pdf, ps, other]

    cs.CV cs.AI

    Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction

    Authors: Wenfei Guan, Jilin Mei, Tong Shen, Xumin Wu, Shuo Wang, Chen Min, Yu Hu

    Abstract: Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant domain gap causes advanced models to fail in wild terrains due to two key issues: lack of large-scale vectorized datasets and structural weakness in prevailing methods. Models such as SAM-Road employ a node-centric paradigm that reasons at sparse en…

    Submitted 8 March, 2026; v1 submitted 11 December, 2025; originally announced December 2025.

    Comments: This revision improves clarity and consistency throughout the paper. We refine terminology to more precisely describe the vertex extraction optimization, add motivational context to the edge feature encoding section, and clarify the overall inference pipeline. We also add an Acknowledgments section

  24. arXiv:2512.10394  [pdf, ps, other]

    cs.RO cs.LG

    RoboNeuron: A Middle-Layer Infrastructure for Agent-Driven Orchestration in Embodied AI

    Authors: Weifan Guan, Qinghao Hu, Huasen Xi, Chenxiao Zhang, Aosheng Li, Jian Cheng

    Abstract: Vision-language-action (VLA) models and LLM agents have advanced rapidly, yet reliable deployment on physical robots is often hindered by an interface mismatch between agent tool APIs and robot middleware. Current implementations typically rely on ad-hoc wrappers that are difficult to reuse, and changes to the VLA backend or serving stack often necessitate extensive re-integration. We introduce Ro…

    Submitted 1 April, 2026; v1 submitted 11 December, 2025; originally announced December 2025.

  25. arXiv:2512.08475  [pdf, ps, other]

    cs.LG

    Solving Oversmoothing in GNNs via Nonlocal Message Passing: Algebraic Smoothing and Depth Scalability

    Authors: Weiqi Guan, Junlin He

    Abstract: The relationship between Layer Normalization (LN) placement and the oversmoothing phenomenon remains underexplored. We identify a critical dilemma: Pre-LN architectures avoid oversmoothing but suffer from the curse of depth, while Post-LN architectures bypass the curse of depth but experience oversmoothing. To resolve this, we propose a new method based on Post-LN that induces algebraic smoothin…

    Submitted 10 December, 2025; v1 submitted 9 December, 2025; originally announced December 2025.

    Comments: 18 pages, 4 figures

  26. arXiv:2512.06782  [pdf, ps, other]

    cs.LG

    Measuring Over-smoothing beyond Dirichlet energy

    Authors: Weiqi Guan, Zihao Shi

    Abstract: While Dirichlet energy serves as a prevalent metric for quantifying over-smoothing, it is inherently restricted to capturing first-order feature derivatives. To address this limitation, we propose a generalized family of node similarity measures based on the energy of higher-order feature derivatives. Through a rigorous theoretical analysis of the relationships among these measures, we establish t…

    Submitted 7 December, 2025; originally announced December 2025.

    Comments: 17 pages, 1 figure

  27. arXiv:2512.05126  [pdf, ps, other]

    eess.AS cs.AI cs.CL cs.CV cs.MM cs.SD

    SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model

    Authors: Kaidi Wang, Yi He, Wenhao Guan, Weijie Wu, Hongwu Ding, Xiong Zhang, Di Wu, Meng Meng, Jian Luan, Lin Li, Qingyang Hong

    Abstract: Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still suffer from limitations in speech naturalness and audio-visual synchronization, and are limited to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) mod…

    Submitted 23 November, 2025; originally announced December 2025.

  28. arXiv:2512.02792  [pdf, ps, other]

    cs.CV cs.MM

    HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval

    Authors: Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Haokun Wen, Weili Guan

    Abstract: Composed Video Retrieval (CVR) is a challenging video retrieval task that utilizes multi-modal queries, consisting of a reference video and modification text, to retrieve the desired target video. The core of this task lies in understanding the multi-modal composed query and achieving accurate composed feature learning. Within multi-modal queries, the video modality typically carries richer semant…

    Submitted 14 December, 2025; v1 submitted 2 December, 2025; originally announced December 2025.

    Comments: Accepted by ACM MM 2025

  29. arXiv:2511.19432  [pdf, ps, other]

    physics.app-ph cond-mat.mes-hall quant-ph

    Robotic chip-scale nanofabrication for superior consistency

    Authors: Felix M. Mayor, Wenyan Guan, Erik Szakiel, Amir H. Safavi-Naeini, Samuel Gyger

    Abstract: Unlike the rigid, high-volume automation found in industry, academic research requires process flexibility that has historically relied on variable manual operations. This hinders the fabrication of advanced, complex devices. We propose to address this gap by automating these low-volume, high-stakes tasks using a robotic arm to improve process control and consistency. As a proof of concept, we dep…

    Submitted 24 November, 2025; originally announced November 2025.

    Comments: 5 pages, 3 figures

  30. arXiv:2511.12449  [pdf, ps, other]

    cs.CV cs.AI cs.IR cs.LG

    MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding

    Authors: Zhanheng Nie, Chenghan Fu, Daoze Zhang, Junxian Wu, Wanxian Guan, Pengjie Wang, Jian Xu, Bo Zheng

    Abstract: Recent Multimodal Large Language Models (MLLMs) have significantly advanced e-commerce product understanding. However, they still face three challenges: (i) the modality imbalance induced by modality mixed training; (ii) underutilization of the intrinsic alignment relationships among visual and textual information within a product; and (iii) limited handling of noise in e-commerce multimodal data.…

    Submitted 23 March, 2026; v1 submitted 15 November, 2025; originally announced November 2025.

    Comments: 11 pages, 7 figures

  31. arXiv:2511.11305  [pdf, ps, other]

    cs.IR cs.AI cs.CV cs.LG

    MOON Embedding: Multimodal Representation Learning for E-commerce Search Advertising

    Authors: Chenghan Fu, Daoze Zhang, Yukang Lin, Zhanheng Nie, Xiang Zhang, Jianyu Liu, Yueran Liu, Wanxian Guan, Pengjie Wang, Jian Xu, Bo Zheng

    Abstract: We introduce MOON, our comprehensive set of sustainable iterative practices for multimodal representation learning for e-commerce applications. MOON has already been fully deployed across all stages of Taobao search advertising system, including retrieval, relevance, ranking, and so on. The performance gains are particularly significant on click-through rate (CTR) prediction task, which achieves a…

    Submitted 18 November, 2025; v1 submitted 14 November, 2025; originally announced November 2025.

    Comments: 31 pages, 12 figures

  32. arXiv:2511.05769  [pdf]

    cs.HC cs.AI

    Lived Experience in Dialogue: Co-designing Personalization in Large Language Models to Support Youth Mental Well-being

    Authors: Kathleen W. Guan, Sarthak Giri, Mohammed Amara, Bernard J. Jansen, Enrico Liscio, Milena Esherick, Mohammed Al Owayyed, Ausrine Ratkute, Gayane Sedrakyan, Mark de Reuver, Joao Fernando Ferreira Goncalves, Caroline A. Figueroa

    Abstract: Youth increasingly turn to large language models (LLMs) for mental well-being support, yet current personalization in LLMs can overlook the heterogeneous lived experiences shaping their needs. We conducted a participatory study with youth, parents, and youth care workers (N=38), using co-created youth personas as scaffolds, to elicit community perspectives on how LLMs can facilitate more meaningfu…

    Submitted 7 November, 2025; originally announced November 2025.

  33. arXiv:2510.24762  [pdf, ps, other]

    cs.CL cs.AI

    Falcon: A Comprehensive Chinese Text-to-SQL Benchmark for Enterprise-Grade Evaluation

    Authors: Wenzhen Luo, Wei Guan, Yifan Yao, Yimin Pan, Feng Wang, Zhipeng Yu, Zhe Wen, Liang Chen, Yihong Zhuang

    Abstract: We introduce Falcon, a cross-domain Chinese text-to-SQL benchmark grounded in an enterprise-compatible dialect (MaxCompute/Hive). It contains 600 Chinese questions over 28 databases; 77% require multi-table reasoning and over half touch more than four tables. Each example is annotated along SQL-computation features and Chinese semantics. For evaluation, we release a robust execution comparator and…

    Submitted 22 October, 2025; originally announced October 2025.

  34. arXiv:2510.22622  [pdf, ps, other]

    cs.CR cs.CV cs.MM

    DeepfakeBench-MM: A Comprehensive Benchmark for Multimodal Deepfake Detection

    Authors: Kangran Zhao, Yupeng Chen, Xiaoyu Zhang, Yize Chen, Weinan Guan, Baicheng Chen, Chengzhe Sun, Soumyya Kanti Datta, Qingshan Liu, Siwei Lyu, Baoyuan Wu

    Abstract: The misuse of advanced generative AI models has resulted in the widespread proliferation of falsified data, particularly forged human-centric audiovisual content, which poses substantial societal risks (e.g., financial fraud and social instability). In response to this growing threat, several works have preliminarily explored countermeasures. However, the lack of sufficient and diverse training da…

    Submitted 26 October, 2025; originally announced October 2025.

    Comments: Preprint

  35. arXiv:2510.18032  [pdf, ps, other]

    cs.AI cs.MA

    OPTAGENT: Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning

    Authors: Zhenyu Bi, Meng Lu, Yang Li, Swastik Roy, Weijie Guan, Morteza Ziyadi, Xuan Wang

    Abstract: Large Language Models (LLMs) have shown remarkable reasoning capabilities in mathematical and scientific tasks. To enhance complex reasoning, multi-agent systems have been proposed to harness the collective intelligence of LLM agents. However, existing collaboration structures are either predefined or rely on majority voting or round-table debates, which can suppress correct but less dominant agen…

    Submitted 20 October, 2025; originally announced October 2025.

    Comments: 8 pages for main content

  36. arXiv:2510.17111  [pdf, ps, other]

    cs.RO cs.AI cs.LG

    Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey

    Authors: Weifan Guan, Qinghao Hu, Aosheng Li, Jian Cheng

    Abstract: Vision-Language-Action (VLA) models extend vision-language models to embodied control by mapping natural-language instructions and visual observations to robot actions. Despite their capabilities, VLA systems face significant challenges due to their massive computational and memory demands, which conflict with the constraints of edge platforms such as on-board mobile manipulators that require real…

    Submitted 23 October, 2025; v1 submitted 19 October, 2025; originally announced October 2025.

  37. arXiv:2510.16833  [pdf, ps, other]

    cs.CV cs.GR

    From Mannequin to Human: A Pose-Aware and Identity-Preserving Video Generation Framework for Lifelike Clothing Display

    Authors: Xiangyu Mu, Dongliang Zhou, Jie Hou, Haijun Zhang, Weili Guan

    Abstract: Mannequin-based clothing displays offer a cost-effective alternative to real-model showcases for online fashion presentation, but lack realism and expressive detail. To overcome this limitation, we introduce a new task called mannequin-to-human (M2H) video generation, which aims to synthesize identity-controllable, photorealistic human videos from footage of mannequins. We propose M2HVideo, a pose…

    Submitted 19 October, 2025; originally announced October 2025.

  38. arXiv:2510.04593  [pdf, ps, other]

    eess.AS cs.SD

    UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models

    Authors: Wenhao Guan, Zhikang Niu, Ziyue Jiang, Kaidi Wang, Peijie Chen, Qingyang Hong, Lin Li, Xie Chen

    Abstract: Large language models (LLMs) have demonstrated promising performance in both automatic speech recognition (ASR) and text-to-speech (TTS) systems, gradually becoming the mainstream approach. However, most current approaches address these tasks separately rather than through a unified framework. This work aims to integrate these two tasks into one unified model. Although discrete speech tokenization…

    Submitted 19 November, 2025; v1 submitted 6 October, 2025; originally announced October 2025.

  39. arXiv:2510.02930  [pdf, ps, other]

    cs.DC

    iDDS: Intelligent Distributed Dispatch and Scheduling for Workflow Orchestration

    Authors: Wen Guan, Tadashi Maeno, Aleksandr Alekseev, Fernando Harald Barreiro Megino, Kaushik De, Edward Karavakis, Alexei Klimentov, Tatiana Korchuganova, FaHui Lin, Paul Nilsson, Torre Wenaus, Zhaoyu Yang, Xin Zhao

    Abstract: The intelligent Distributed Dispatch and Scheduling (iDDS) service is a versatile workflow orchestration system designed for large-scale, distributed scientific computing. iDDS extends traditional workload and data management by integrating data-aware execution, conditional logic, and programmable workflows, enabling automation of complex and dynamic processing pipelines. Originally developed for…

    Submitted 19 December, 2025; v1 submitted 3 October, 2025; originally announced October 2025.

  40. arXiv:2509.21984  [pdf, ps, other]

    cs.CV cs.CL

    Beyond the Vision Encoder: Identifying and Mitigating Spatial Bias in Large Vision-Language Models

    Authors: Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Youcheng Pan, Yongshuai Hou, Weili Guan, Jun Yu, Min Zhang

    Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable success across a wide range of multimodal tasks, yet their robustness to spatial variations remains insufficiently understood. In this work, we conduct a systematic study of the spatial bias of LVLMs, examining how models respond when identical key visual information is placed at different locations within an image. Through controlled p…

    Submitted 3 February, 2026; v1 submitted 26 September, 2025; originally announced September 2025.

  41. arXiv:2509.21363  [pdf, ps, other]

    cs.CV cs.AI

    A Mutual Learning Method for Salient Object Detection with intertwined Multi-Supervision--Revised

    Authors: Runmin Wu, Mengyang Feng, Wenlong Guan, Dong Wang, Huchuan Lu, Errui Ding

    Abstract: Though deep learning techniques have made great progress in salient object detection recently, the predicted saliency maps still suffer from incomplete predictions due to the internal complexity of objects and inaccurate boundaries caused by strides in convolution and pooling operations. To alleviate these issues, we propose to train saliency detection networks by exploiting the supervision from n…

    Submitted 21 September, 2025; originally announced September 2025.

    Comments: 11 pages

    Journal ref: CVPR.2019.00834

  42. arXiv:2509.20410  [pdf, ps, other]

    eess.AS cs.SD

    Phoenix-VAD: Streaming Semantic Endpoint Detection for Full-Duplex Speech Interaction

    Authors: Weijie Wu, Wenhao Guan, Kaidi Wang, Peijie Chen, Zhuanling Zha, Junbo Li, Jun Fang, Lin Li, Qingyang Hong

    Abstract: Spoken dialogue models have significantly advanced intelligent human-computer interaction, yet they lack a plug-and-play full-duplex prediction module for semantic endpoint detection, hindering seamless audio interactions. In this paper, we introduce Phoenix-VAD, an LLM-based model that enables streaming semantic endpoint detection. Specifically, Phoenix-VAD leverages the semantic comprehension ca…

    Submitted 4 November, 2025; v1 submitted 24 September, 2025; originally announced September 2025.

    Comments: It requires internal PR approval

  43. arXiv:2509.18102  [pdf, ps, other]

    cs.SD eess.AS

    XMUspeech Systems for the ASVspoof 5 Challenge

    Authors: Wangjie Li, Xingjia Xie, Yishuang Li, Wenhao Guan, Kaidi Wang, Pengyu Ren, Lin Li, Qingyang Hong

    Abstract: In this paper, we present the XMUspeech systems we submitted to the speech deepfake detection track of the ASVspoof 5 Challenge. Compared to previous challenges, the audio duration in the ASVspoof 5 database has significantly increased. We observed that merely adjusting the input audio length can substantially improve system performance. To capture artifacts at multiple levels, we explored the perfor…

    Submitted 5 September, 2025; originally announced September 2025.

  44. arXiv:2509.07817  [pdf, ps, other]

    cs.CL cs.MM

    Dual Knowledge-Enhanced Two-Stage Reasoner for Multimodal Dialog Systems

    Authors: Xiaolin Chen, Xuemeng Song, Haokun Wen, Weili Guan, Xiangyu Zhao, Liqiang Nie

    Abstract: Textual response generation is pivotal for multimodal task-oriented dialog systems, which aims to generate proper textual responses based on the multimodal context. While existing efforts have demonstrated remarkable progress, there still exist the following limitations: 1) neglect of unstructured review knowledge and 2) underutilization of large language models (LLMs). In…

    Submitted 9 September, 2025; originally announced September 2025.

  45. arXiv:2509.03890  [pdf, ps, other]

    cs.AI

    FaMA: LLM-Empowered Agentic Assistant for Consumer-to-Consumer Marketplace

    Authors: Yineng Yan, Xidong Wang, Jin Seng Cheng, Ran Hu, Wentao Guan, Nahid Farahmand, Hengte Lin, Yue Li

    Abstract: The emergence of agentic AI, powered by Large Language Models (LLMs), marks a paradigm shift from reactive generative systems to proactive, goal-oriented autonomous agents capable of sophisticated planning, memory, and tool use. This evolution presents a novel opportunity to address long-standing challenges in complex digital environments. Core tasks on Consumer-to-Consumer (C2C) e-commerce platfo…

    Submitted 4 September, 2025; originally announced September 2025.

  46. arXiv:2509.01894  [pdf, ps, other]

    cs.IT

    On the Analysis of Random Linear Streaming Codes in Stochastic Channels

    Authors: Kai Huang, Wenjie Guan, Xiaoran Wang, Jinbei Zhang, Kechao Cai

    Abstract: Random Linear Streaming Codes (RLSCs) can dramatically reduce the queuing delay of block codes in real-time services. In this paper, we aim to explore the fundamental limit of large-field-size RLSCs in stochastic symbol erasure channels (SEC). Non-systematic RLSCs (NRLSCs) in i.i.d. SEC have been analyzed in [Pinwen Su et al. 2022]. In this work, we first derive the closed-form expression on th…

    Submitted 1 September, 2025; originally announced September 2025.

  47. arXiv:2508.15904  [pdf, ps, other]

    cs.CV

    Boosting Pathology Foundation Models via Few-shot Prompt-tuning for Rare Cancer Subtyping

    Authors: Dexuan He, Xiao Zhou, Wenbin Guan, Liyuan Zhang, Xiaoman Zhang, Sinuo Xu, Ge Wang, Lifeng Wang, Xiaojun Yuan, Xin Sun, Yanfeng Wang, Kun Sun, Ya Zhang, Weidi Xie

    Abstract: Rare cancers comprise 20-25% of all malignancies but face major diagnostic challenges due to limited expert availability, especially in pediatric oncology, where they represent over 70% of cases. While pathology vision-language (VL) foundation models show promising zero-shot capabilities for common cancer subtyping, their clinical performance for rare cancers remains limited. Existing multi-instanc…

    Submitted 21 August, 2025; originally announced August 2025.

  48. arXiv:2508.12628  [pdf, ps, other]

    cs.CV

    Creative4U: MLLMs-based Advertising Creative Image Selector with Comparative Reasoning

    Authors: Yukang Lin, Xiang Zhang, Shichang Jia, Bowen Wan, Chenghan Fu, Xudong Ren, Yueran Liu, Wanxian Guan, Pengji Wang, Jian Xu, Bo Zheng, Baolin Liu

    Abstract: Creative images in advertising are the heart and soul of e-commerce platforms. An eye-catching creative image can enhance the shopping experience for users, boosting income for advertisers and advertising revenue for platforms. With the advent of AIGC technology, advertisers can produce large quantities of creative images at minimal cost. However, they struggle to assess the creative quality to selec…

    Submitted 18 August, 2025; originally announced August 2025.

  49. arXiv:2508.11999  [pdf, ps, other]

    cs.CV cs.AI cs.IR cs.LG

    MOON: Generative MLLM-based Multimodal Representation Learning for E-commerce Product Understanding

    Authors: Daoze Zhang, Chenghan Fu, Zhanheng Nie, Jianyu Liu, Wanxian Guan, Yuan Gao, Jun Song, Pengjie Wang, Jian Xu, Bo Zheng

    Abstract: With the rapid advancement of e-commerce, exploring general representations rather than task-specific ones has attracted increasing research attention. For product understanding, although existing discriminative dual-flow architectures drive progress in this field, they inherently struggle to model the many-to-one alignment between multiple images and texts of products. Therefore, we argue that ge…

    Submitted 28 February, 2026; v1 submitted 16 August, 2025; originally announced August 2025.

    Comments: Accepted by WSDM 2026 (oral). 11 pages, 9 figures

  50. arXiv:2508.11247  [pdf, ps, other]

    cs.CL cs.AI

    Cross-Granularity Hypergraph Retrieval-Augmented Generation for Multi-hop Question Answering

    Authors: Changjian Wang, Weihong Deng, Weili Guan, Quan Lu, Ning Jiang

    Abstract: Multi-hop question answering (MHQA) requires integrating knowledge scattered across multiple passages to derive the correct answer. Traditional retrieval-augmented generation (RAG) methods primarily focus on coarse-grained textual semantic similarity and ignore structural associations among dispersed knowledge, which limits their effectiveness in MHQA tasks. GraphRAG methods address this by levera…

    Submitted 15 August, 2025; originally announced August 2025.