Showing 1–50 of 679 results for author: Shi, X

Searching in archive cs.
  1. arXiv:2512.20619  [pdf, ps, other]

    cs.CV

    SemanticGen: Video Generation in Semantic Space

    Authors: Jianhong Bai, Xiaoshi Wu, Xintao Wang, Fu Xiao, Yuanxing Zhang, Qinghe Wang, Xiaoyu Shi, Menghan Xia, Zuozhu Liu, Haoji Hu, Pengfei Wan, Kun Gai

    Abstract: State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution to address these limitations by generatin…

    Submitted 23 December, 2025; originally announced December 2025.

    Comments: Project page: https://jianhongbai.github.io/SemanticGen/

  2. arXiv:2512.18256  [pdf, ps, other]

    cs.AI cs.LO

    MSC-180: A Benchmark for Automated Formal Theorem Proving from Mathematical Subject Classification

    Authors: Sirui Li, Wangyue Lu, Xiaorui Shi, Ke Weng, Haozhe Sun, Minghe Yu, Tiancheng Zhang, Ge Yu, Hengyu Liu, Lun Du

    Abstract: Automated Theorem Proving (ATP) represents a core research direction in artificial intelligence for achieving formal reasoning and verification, playing a significant role in advancing machine intelligence. However, current large language model (LLM)-based theorem provers suffer from limitations such as restricted domain coverage and weak generalization in mathematical reasoning. To address these…

    Submitted 20 December, 2025; originally announced December 2025.

  3. arXiv:2512.16776  [pdf, ps, other]

    cs.CV

    Kling-Omni Technical Report

    Authors: Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, Xiao Hu, Xiaohua Hu, Boyuan Jiang, Fangyuan Kong, Hang Li, Jie Li, Qingyu Li, Shen Li, Xiaohan Li, Yan Li, Jiajun Liang, Borui Liao, Yiqiao Liao, Weihong Lin, Quande Liu, et al. (43 additional authors not shown)

    Abstract: We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supp…

    Submitted 18 December, 2025; originally announced December 2025.

    Comments: Kling-Omni Technical Report

  4. arXiv:2512.14157  [pdf, ps, other]

    cs.AI cs.CV

    Incentivizing Tool-augmented Thinking with Images for Medical Image Analysis

    Authors: Yankai Jiang, Yujie Zhang, Peng Zhang, Yichen Li, Jintai Chen, Xiaoming Shi, Shihui Zhen

    Abstract: Recent reasoning-based medical MLLMs have made progress in generating step-by-step textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on fine-grained visual regions to achieve precise grounding and diagnosis. We introduce Ophiuchus, a versatile, tool-augmented framework that equips an MLLM to (i) decide when additional visual e…

    Submitted 16 December, 2025; originally announced December 2025.

  5. arXiv:2512.13313  [pdf, ps, other]

    cs.CV

    KlingAvatar 2.0 Technical Report

    Authors: Kling Team, Jialu Chen, Yikang Ding, Zhixue Fang, Kun Gai, Yuan Gao, Kang He, Jingyun Hua, Boyuan Jiang, Mingming Lao, Xiaohan Li, Hui Liu, Jiwen Liu, Xiaoqiang Liu, Yuan Liu, Shun Lu, Yongsen Mao, Yingchao Shao, Huafeng Shi, Xiaoyu Shi, Peiqin Sun, Songlin Tang, Pengfei Wan, Chao Wang, Xuebo Wang, et al. (3 additional authors not shown)

    Abstract: Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs ups…

    Submitted 15 December, 2025; originally announced December 2025.

    Comments: 14 pages, 7 figures

  6. arXiv:2512.11143  [pdf, ps, other]

    cs.CR

    Automated Penetration Testing with LLM Agents and Classical Planning

    Authors: Lingzhi Wang, Xinyi Shi, Ziyu Li, Yi Jiang, Shiyu Tan, Yuhao Jiang, Junjie Cheng, Wenyuan Chen, Xiangmin Shen, Zhenyuan LI, Yan Chen

    Abstract: While penetration testing plays a vital role in cybersecurity, achieving fully automated, hands-off-the-keyboard execution remains a significant research challenge. In this paper, we introduce the "Planner-Executor-Perceptor (PEP)" design paradigm and use it to systematically review existing work and identify the key challenges in this area. We also evaluate existing penetration testing systems, w…

    Submitted 11 December, 2025; originally announced December 2025.

  7. arXiv:2512.07612  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    PCMind-2.1-Kaiyuan-2B Technical Report

    Authors: Kairong Luo, Zhenbo Sun, Xinyu Shi, Shengqi Chen, Bowen Yu, Yunyi Chen, Chenyi Dang, Hengtao Tao, Hui Wang, Fangming Liu, Kaifeng Lyu, Wenguang Chen

    Abstract: The rapid advancement of Large Language Models (LLMs) has resulted in a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness un…

    Submitted 8 December, 2025; originally announced December 2025.

  8. arXiv:2512.05597  [pdf, ps, other]

    cs.CV

    Fast SceneScript: Accurate and Efficient Structured Language Model via Multi-Token Prediction

    Authors: Ruihong Yin, Xuepeng Shi, Oleksandr Bailo, Marco Manfredi, Theo Gevers

    Abstract: Recent perception-generalist approaches based on language models have achieved state-of-the-art results across diverse tasks, including 3D scene layout estimation, via unified architecture and interface. However, these approaches rely on autoregressive next-token prediction, which is inherently slow. In this work, we introduce Fast SceneScript, a novel structured language model for accurate and ef…

    Submitted 5 December, 2025; originally announced December 2025.

    Comments: 10 pages, 8 figures

  9. arXiv:2512.03715  [pdf, ps, other]

    cs.CV

    DINO-RotateMatch: A Rotation-Aware Deep Framework for Robust Image Matching in Large-Scale 3D Reconstruction

    Authors: Kaichen Zhang, Tianxiang Sheng, Xuanming Shi

    Abstract: This paper presents DINO-RotateMatch, a deep-learning framework designed to address the challenges of image matching in large-scale 3D reconstruction from unstructured Internet images. The method integrates a dataset-adaptive image pairing strategy with rotation-aware keypoint extraction and matching. DINO is employed to retrieve semantically relevant image pairs in large collections, while…

    Submitted 3 December, 2025; originally announced December 2025.

    Comments: 9 pages, 5 figures, 1 table

  10. arXiv:2512.03041  [pdf, ps, other]

    cs.CV

    MultiShotMaster: A Controllable Multi-Shot Video Generation Framework

    Authors: Qinghe Wang, Xiaoyu Shi, Baolu Li, Weikang Bian, Quande Liu, Huchuan Lu, Xintao Wang, Pengfei Wan, Kun Gai, Xu Jia

    Abstract: Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two n…

    Submitted 2 December, 2025; originally announced December 2025.

    Comments: Project Page: https://qinghew.github.io/MultiShotMaster

  11. arXiv:2512.02554  [pdf, ps, other]

    cs.CV

    OmniPerson: Unified Identity-Preserving Pedestrian Generation

    Authors: Changxiao Ma, Chao Yuan, Xincheng Shi, Yuzhuo Ma, Yongfei Zhang, Longkun Zhou, Yujia Zhang, Shangze Li, Yifan Xu

    Abstract: Person re-identification (ReID) suffers from a lack of large-scale high-quality training data due to challenges in data privacy and annotation costs. While previous approaches have explored pedestrian generation for data augmentation, they often fail to ensure identity consistency and suffer from insufficient controllability, thereby limiting their effectiveness in dataset augmentation. To address…

    Submitted 2 December, 2025; originally announced December 2025.

  12. arXiv:2512.01556  [pdf, ps, other]

    cs.AI cs.CL cs.LG

    LEC: Linear Expectation Constraints for False-Discovery Control in Selective Prediction and Routing Systems

    Authors: Zhiyuan Wang, Aniri, Tianlong Chen, Yue Zhang, Heng Tao Shen, Xiaoshuang Shi, Kaidi Xu

    Abstract: Large language models (LLMs) often generate unreliable answers, while heuristic uncertainty methods fail to fully distinguish correct from incorrect predictions, causing users to accept erroneous answers without statistical guarantees. We address this issue through the lens of false discovery rate (FDR) control, ensuring that among all accepted predictions, the proportion of errors does not exceed…

    Submitted 1 December, 2025; originally announced December 2025.

  13. arXiv:2511.22902  [pdf, ps, other]

    cs.IT

    Leveraging Channel Knowledge Map for Multi-User Hierarchical Beam Training Under Position Uncertainty

    Authors: Xu Shi, Haohan Wang, Yashuai Cao, Hengyu Zhang, Sufang Yang, Jintao Wang

    Abstract: Channel knowledge map (CKM) emerges as a promising framework to acquire location-specific channel information without consuming wireless resources, creating new horizons for advanced wireless network design and optimization. Despite its potential, the practical application of CKM in beam training faces several challenges. On one hand, the user's precise location is typically unavailable prior to b…

    Submitted 28 November, 2025; originally announced November 2025.

  14. arXiv:2511.22275  [pdf, ps, other]

    cs.AI

    RecToM: A Benchmark for Evaluating Machine Theory of Mind in LLM-based Conversational Recommender Systems

    Authors: Mengfan Li, Xuanhua Shi, Yang Deng

    Abstract: Large language models are revolutionizing conversational recommender systems through their impressive capabilities in instruction comprehension, reasoning, and human interaction. A core factor underlying effective recommendation dialogue is the ability to infer and reason about users' mental states (such as desire, intention, and belief), a cognitive capacity commonly referred to as Theory of…

    Submitted 27 November, 2025; originally announced November 2025.

    Comments: Accepted by AAAI 2026

  15. arXiv:2511.20106  [pdf, ps, other]

    cs.CL

    EM2LDL: A Multilingual Speech Corpus for Mixed Emotion Recognition through Label Distribution Learning

    Authors: Xingfeng Li, Xiaohan Shi, Junjie Li, Yongwei Li, Masashi Unoki, Tomoki Toda, Masato Akagi

    Abstract: This study introduces EM2LDL, a novel multilingual speech corpus designed to advance mixed emotion recognition through label distribution learning. Addressing the limitations of predominantly monolingual and single-label emotion corpora that restrict linguistic diversity, are unable to model mixed emotions, and lack ecological validity, EM2LDL comprises expressive utterances in…

    Submitted 25 November, 2025; originally announced November 2025.

    Comments: Submitted to IEEE Transactions on Affective computing

  16. arXiv:2511.20044  [pdf, ps, other]

    cs.LG

    RED-F: Reconstruction-Elimination based Dual-stream Contrastive Forecasting for Multivariate Time Series Anomaly Prediction

    Authors: PengYu Chen, Xiaohou Shi, Yuan Chang, Yan Sun, Sajal K. Das

    Abstract: The proactive prediction of anomalies (AP) in multivariate time series (MTS) is a critical challenge to ensure system dependability. The difficulty lies in identifying subtle anomaly precursors concealed within normal signals. However, existing unsupervised methods, trained exclusively on normal data, demonstrate a fundamental propensity to reconstruct normal patterns. Consequently, when confronte…

    Submitted 25 November, 2025; originally announced November 2025.

    Comments: 13 pages, 12 figures

  17. arXiv:2511.18903  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

    Authors: Kairong Luo, Zhenbo Sun, Haodong Wen, Xinyu Shi, Jiarui Cui, Chenyi Dang, Kaifeng Lyu, Wenguang Chen

    Abstract: Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies hav…

    Submitted 24 November, 2025; originally announced November 2025.

  18. arXiv:2511.18333  [pdf, ps, other]

    cs.CV

    ConsistCompose: Unified Multimodal Layout Control for Image Composition

    Authors: Xuanke Shi, Boxuan Li, Xiaoyang Han, Zhongang Cai, Lei Yang, Dahua Lin, Quan Wang

    Abstract: Unified multimodal models that couple visual understanding with image generation have advanced rapidly, yet most systems still focus on visual grounding (aligning language with image regions), while their generative counterpart, linguistic-embedded layout-grounded generation (LELG) for layout-controllable multi-instance generation, remains underexplored and limits precise compositional control. We pr…

    Submitted 11 December, 2025; v1 submitted 23 November, 2025; originally announced November 2025.

    Comments: 22 pages, 17 figures

  19. arXiv:2511.17550  [pdf, ps, other]

    cs.NE cs.AI cs.LG

    Gate-level boolean evolutionary geometric attention neural networks

    Authors: Xianshuai Shi, Jianfeng Zhu, Leibo Liu

    Abstract: This paper presents a gate-level Boolean evolutionary geometric attention neural network that models images as Boolean fields governed by logic gates. Each pixel is a Boolean variable (0 or 1) embedded on a two-dimensional geometric manifold (for example, a discrete toroidal lattice), which defines adjacency and information propagation among pixels. The network updates image states through a Boole…

    Submitted 11 November, 2025; originally announced November 2025.

  20. arXiv:2511.16651  [pdf, ps, other]

    cs.RO

    InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy

    Authors: Yang Tian, Yuyin Yang, Yiman Xie, Zetao Cai, Xu Shi, Ning Gao, Hangxu Liu, Xuekun Jiang, Zherui Qiu, Feng Yuan, Yaping Li, Ping Wang, Junhao Cai, Jia Zeng, Hao Dong, Jiangmiao Pang

    Abstract: Recent works explore how real and synthetic data contribute to Vision-Language-Action (VLA) models' generalization. While current VLA models have shown the strong effectiveness of large-scale real-robot pre-training, synthetic data has not previously demonstrated comparable capability at scale. This paper provides the first evidence that synthetic data alone can match the performance of the strong…

    Submitted 20 November, 2025; originally announced November 2025.

  21. arXiv:2511.16248  [pdf, ps, other]

    cs.AI

    Revisiting Fairness-aware Interactive Recommendation: Item Lifecycle as a Control Knob

    Authors: Yun Lu, Xiaoyu Shi, Hong Xie, Chongjun Xia, Zhenhui Gong, Mingsheng Shang

    Abstract: This paper revisits fairness-aware interactive recommendation (e.g., TikTok, KuaiShou) by introducing a novel control knob, i.e., the lifecycle of items. We make threefold contributions. First, we conduct a comprehensive empirical analysis and uncover that item lifecycles in short-video platforms follow a compressed three-phase pattern, i.e., rapid growth, transient stability, and sharp decay, whi…

    Submitted 20 November, 2025; originally announced November 2025.

    Comments: 8 pages, 5 figures, conference

  22. arXiv:2511.15393  [pdf, ps, other]

    cs.LG

    EVA-Net: Interpretable Anomaly Detection for Brain Health via Learning Continuous Aging Prototypes from One-Class EEG Cohorts

    Authors: Kunyu Zhang, Mingxuan Wang, Xiangjie Shi, Haoxing Xu, Chao Zhang

    Abstract: The brain age is a key indicator of brain health. While electroencephalography (EEG) is a practical tool for this task, existing models struggle with the common challenge of imperfect medical data, such as learning a "normal" baseline from weakly supervised, healthy-only cohorts. This is a critical anomaly detection task for identifying disease, but standard models are often black boxes lacking…

    Submitted 23 November, 2025; v1 submitted 19 November, 2025; originally announced November 2025.

  23. arXiv:2511.14759  [pdf, ps, other]

    cs.LG cs.RO

    $\pi^{*}_{0.6}$: a VLA That Learns From Experience

    Authors: Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine Glossop, Thomas Godden, Ivan Goryachev, Lachy Groom, Hunter Hancock, Karol Hausman, Gashon Hussein, Brian Ichter, Szymon Jakubczak, Rowan Jen, et al. (31 additional authors not shown)

    Abstract: We study how vision-language-action (VLA) models can improve through real-world deployments via reinforcement learning (RL). We present a general-purpose method, RL with Experience and Corrections via Advantage-conditioned Policies (RECAP), that provides for RL training of VLAs via advantage conditioning. Our method incorporates heterogeneous data into the self-improvement process, including demon…

    Submitted 18 November, 2025; v1 submitted 18 November, 2025; originally announced November 2025.

  24. arXiv:2511.13719  [pdf, ps, other]

    cs.CV cs.AI cs.LG cs.MM cs.RO

    Scaling Spatial Intelligence with Multimodal Foundation Models

    Authors: Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, et al. (4 additional authors not shown)

    Abstract: Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and gen…

    Submitted 27 November, 2025; v1 submitted 17 November, 2025; originally announced November 2025.

    Comments: Codebase: https://github.com/OpenSenseNova/SenseNova-SI; Models: https://huggingface.co/collections/sensenova/sensenova-si

  25. arXiv:2511.12020  [pdf, ps, other]

    cs.CV

    LIHE: Linguistic Instance-Split Hyperbolic-Euclidean Framework for Generalized Weakly-Supervised Referring Expression Comprehension

    Authors: Xianglong Shi, Silin Cheng, Sirui Zhao, Yunhan Jiang, Enhong Chen, Yang Liu, Sebastien Ourselin

    Abstract: Existing Weakly-Supervised Referring Expression Comprehension (WREC) methods, while effective, are fundamentally limited by a one-to-one mapping assumption, hindering their ability to handle expressions corresponding to zero or multiple targets in realistic scenarios. To bridge this gap, we introduce the Weakly-Supervised Generalized Referring Expression Comprehension task (WGREC), a more practica…

    Submitted 14 November, 2025; originally announced November 2025.

  26. arXiv:2511.11660  [pdf, ps, other]

    cs.DC

    HeteroSTA: A CPU-GPU Heterogeneous Static Timing Analysis Engine with Holistic Industrial Design Support

    Authors: Zizheng Guo, Haichuan Liu, Xizhe Shi, Shenglu Hua, Zuodong Zhang, Chunyuan Zhao, Runsheng Wang, Yibo Lin

    Abstract: In this paper, we introduce HeteroSTA, the first CPU-GPU heterogeneous timing analysis engine that efficiently supports: (1) a set of delay calculation models providing versatile accuracy-speed choices without relying on an external golden tool, (2) robust support for industry formats, including especially the .sdc constraints containing all common timing exceptions, clock domains, and case analys…

    Submitted 11 November, 2025; originally announced November 2025.

    Comments: 7 pages, 3 figures, to be published in ASP-DAC 2026

  27. arXiv:2511.10554  [pdf, ps, other]

    cs.CR

    GraphFaaS: Serverless GNN Inference for Burst-Resilient, Real-Time Intrusion Detection

    Authors: Lingzhi Wang, Vinod Yegneswaran, Xinyi Shi, Ziyu Li, Ashish Gehani, Yan Chen

    Abstract: Provenance-based intrusion detection is an increasingly popular application of graphical machine learning in cybersecurity, where system activities are modeled as provenance graphs to capture causality and correlations among potentially malicious actions. Graph Neural Networks (GNNs) have demonstrated strong performance in this setting. However, traditional statically-provisioned GNN inference arc…

    Submitted 13 November, 2025; originally announced November 2025.

    Comments: Accepted by ML For Systems workshop at Neural Information Processing Systems (NeurIPS 2025)

  28. arXiv:2511.09064  [pdf, ps, other]

    cs.CV

    Diversifying Counterattacks: Orthogonal Exploration for Robust CLIP Inference

    Authors: Chengze Jiang, Minjing Dong, Xinli Shi, Jie Gui

    Abstract: Vision-language pre-training models (VLPs) demonstrate strong multimodal understanding and zero-shot generalization, yet remain vulnerable to adversarial examples, raising concerns about their reliability. Recent work, Test-Time Counterattack (TTC), improves robustness by generating perturbations that maximize the embedding deviation of adversarial inputs using PGD, pushing them away from their ad…

    Submitted 12 November, 2025; originally announced November 2025.

    Comments: Accepted to AAAI-2026 Oral

  29. arXiv:2511.08243   

    cs.LG

    A Unified Geometric Field Theory Framework for Transformers: From Manifold Embeddings to Kernel Modulation

    Authors: Xianshuai Shi, Jianfeng Zhu, Leibo Liu

    Abstract: The Transformer architecture has achieved tremendous success in natural language processing, computer vision, and scientific computing through its self-attention mechanism. However, its core components (positional encoding and attention mechanisms) have lacked a unified physical or mathematical interpretation. This paper proposes a structural theoretical framework that integrates positional encoding…

    Submitted 12 November, 2025; v1 submitted 11 November, 2025; originally announced November 2025.

    Comments: Withdrawing to allow a significant revision and resubmission with improvements

  30. arXiv:2511.07985  [pdf, ps, other]

    cs.AR

    PIMfused: Near-Bank DRAM-PIM with Fused-layer Dataflow for CNN Data Transfer Optimization

    Authors: Simei Yang, Xinyu Shi, Lu Zhao, Yunyu Ling, Quanjun Wang, Francky Catthoor

    Abstract: Near-bank Processing-in-Memory (PIM) architectures integrate processing cores (PIMcores) close to DRAM banks to mitigate the high cost of off-chip memory accesses. When accelerating a convolutional neural network (CNN) on DRAM-PIM, performance is often constrained by cross-bank (or cross-PIMcore) data transfers, which are induced by the conventional layer-by-layer dataflow that enforces inter-bank (…

    Submitted 11 November, 2025; originally announced November 2025.

    Comments: 6 pages

  31. arXiv:2511.06271  [pdf, ps, other]

    cs.CV

    RelightMaster: Precise Video Relighting with Multi-plane Light Images

    Authors: Weikang Bian, Xiaoyu Shi, Zhaoyang Huang, Jianhong Bai, Qinghe Wang, Xintao Wang, Pengfei Wan, Kun Gai, Hongsheng Li

    Abstract: Recent advances in diffusion models enable high-quality video generation and editing, but precise relighting with consistent video contents, which is critical for shaping scene atmosphere and viewer attention, remains unexplored. Mainstream text-to-video (T2V) models lack fine-grained lighting control due to text's inherent limitation in describing lighting details and insufficient pre-training on…

    Submitted 9 November, 2025; originally announced November 2025.

    Comments: Project Page: https://wkbian.github.io/Projects/RelightMaster/

  32. arXiv:2511.00933  [pdf, ps, other]

    cs.RO cs.CV

    Fast-SmartWay: Panoramic-Free End-to-End Zero-Shot Vision-and-Language Navigation

    Authors: Xiangyu Shi, Zerui Li, Yanyuan Qiao, Qi Wu

    Abstract: Recent advances in Vision-and-Language Navigation in Continuous Environments (VLN-CE) have leveraged multimodal large language models (MLLMs) to achieve zero-shot navigation. However, existing methods often rely on panoramic observations and two-stage pipelines involving waypoint predictors, which introduce significant latency and limit real-world applicability. In this work, we propose Fast-Smart…

    Submitted 2 November, 2025; originally announced November 2025.

  33. arXiv:2510.27237  [pdf, ps, other]

    cs.CV

    Fusion of Multi-scale Heterogeneous Pathology Foundation Models for Whole Slide Image Analysis

    Authors: Zhidong Yang, Xiuhui Shi, Wei Ba, Zhigang Song, Haijing Luan, Taiyuan Hu, Senlin Lin, Jiguang Wang, Shaohua Kevin Zhou, Rui Yan

    Abstract: Whole slide image (WSI) analysis has emerged as an increasingly essential technique in computational pathology. Recent advances in the pathology foundation models (FMs) have demonstrated significant advantages in deriving meaningful patch-level or slide-level multi-scale features from WSIs. However, current pathology FMs have exhibited substantial heterogeneity caused by diverse private training d…

    Submitted 20 November, 2025; v1 submitted 31 October, 2025; originally announced October 2025.

    Comments: 22 pages, 9 figures

  34. arXiv:2510.26474  [pdf, ps, other]

    cs.CV cs.AI cs.CL cs.LG

    Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing

    Authors: Xin Guo, Zhiheng Xi, Yiwen Ding, Yitao Zhai, Xiaowei Shi, Xunliang Cai, Tao Gui, Qi Zhang, Xuanjing Huang

    Abstract: Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models (LVLMs), where models explore and learn from successful trajectories iteratively. However, we identify a critical issue during this process: the model excels at generating high-quality trajectories for simple queries (i.e., head data) but struggles with more complex ones (…

    Submitted 30 October, 2025; originally announced October 2025.

    Comments: Preprint

  35. arXiv:2510.25772  [pdf, ps, other]

    cs.CV

    VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning

    Authors: Baolu Li, Yiming Zhang, Qinghe Wang, Liqian Ma, Xiaoyu Shi, Xintao Wang, Pengfei Wan, Zhenfei Yin, Yunzhi Zhuge, Huchuan Lu, Xu Jia

    Abstract: Visual effects (VFX) are crucial to the expressive power of digital media, yet their creation remains a major challenge for generative AI. Prevailing methods often rely on the one-LoRA-per-effect paradigm, which is resource-intensive and fundamentally incapable of generalizing to unseen effects, thus limiting scalability and creation. To address this challenge, we introduce VFXMaster, the first un…

    Submitted 29 October, 2025; originally announced October 2025.

    Comments: Project Page URL: https://libaolu312.github.io/VFXMaster/

  36. arXiv:2510.25210  [pdf, ps, other]

    cs.CV

    U-CAN: Unsupervised Point Cloud Denoising with Consistency-Aware Noise2Noise Matching

    Authors: Junsheng Zhou, Xingyu Shi, Haichuan Song, Yi Fang, Yu-Shen Liu, Zhizhong Han

    Abstract: Point clouds captured by scanning sensors are often perturbed by noise, which has a highly negative impact on downstream tasks (e.g. surface reconstruction and shape understanding). Previous works mostly focus on training neural networks with noisy-clean point cloud pairs for learning denoising priors, which requires extensive manual effort. In this work, we introduce U-CAN, an Unsupervised fr…

    Submitted 29 October, 2025; originally announced October 2025.

    Comments: Accepted by NeurIPS 2025. Project page: https://gloriasze.github.io/U-CAN/

  37. arXiv:2510.22115  [pdf, ps, other]

    cs.CL cs.AI

    Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation

    Authors: Ling Team, Ang Li, Ben Liu, Binbin Hu, Bing Li, Bingwei Zeng, Borui Ye, Caizhi Tang, Changxin Tian, Chao Huang, Chao Zhang, Chen Qian, Chenchen Ju, Chenchen Li, Chengfu Tang, Chilin Fu, Chunshao Ren, Chunwei Wu, Cong Zhang, Cunyin Peng, Dafeng Xu, Daixin Wang, Dalong Zhang, Dingnan Jin, Dingyuan Zhu, et al. (117 additional authors not shown)

    Abstract: We introduce Ling 2.0, a series reasoning-oriented language foundation built upon the principle that every activation boosts reasoning capability. Designed to scale from tens of billions to one trillion parameters under a unified Mixture-of-Experts (MoE) paradigm, Ling 2.0 emphasizes high sparsity, cross-scale consistency, and efficiency guided by empirical scaling laws. The series includes three…

    Submitted 6 November, 2025; v1 submitted 24 October, 2025; originally announced October 2025.

    Comments: Ling 2.0 Technical Report

  38. arXiv:2510.17897  [pdf, ps, other]

    eess.IV cs.CV

    Conformal Lesion Segmentation for 3D Medical Images

    Authors: Binyu Tan, Zhiyuan Wang, Jinhao Duan, Kaidi Xu, Heng Tao Shen, Xiaoshuang Shi, Fumin Shen

    Abstract: Medical image segmentation serves as a critical component of precision medicine, enabling accurate localization and delineation of pathological regions, such as lesions. However, existing models empirically apply fixed thresholds (e.g., 0.5) to differentiate lesions from the background, offering no statistical guarantees on key metrics such as the false negative rate (FNR). This lack of principled…

    Submitted 19 October, 2025; originally announced October 2025.

  39. arXiv:2510.16290  [pdf, ps, other]

    cs.CV cs.CL

    Cerberus: Real-Time Video Anomaly Detection via Cascaded Vision-Language Models

    Authors: Yue Zheng, Xiufang Shi, Jiming Chen, Yuanchao Shu

    Abstract: Video anomaly detection (VAD) has advanced rapidly with the recent development of Vision-Language Models (VLMs). While these models offer superior zero-shot detection capabilities, their immense computational cost and unstable visual grounding performance hinder real-time deployment. To overcome these challenges, we introduce Cerberus, a two-stage cascaded system designed for efficient yet accurate real…

    Submitted 17 October, 2025; originally announced October 2025.

  40. arXiv:2510.12985  [pdf, ps, other]

    cs.AI

    SENTINEL: A Multi-Level Formal Framework for Safety Evaluation of LLM-based Embodied Agents

    Authors: Simon Sinong Zhan, Yao Liu, Philip Wang, Zinan Wang, Qineng Wang, Zhian Ruan, Xiangyu Shi, Xinyu Cao, Frank Yang, Kangrui Wang, Huajie Shao, Manling Li, Qi Zhu

    Abstract: We present Sentinel, the first framework for formally evaluating the physical safety of Large Language Model (LLM)-based embodied agents across the semantic, plan, and trajectory levels. Unlike prior methods that rely on heuristic rules or subjective LLM judgments, Sentinel grounds practical safety requirements in formal temporal logic (TL) semantics that can precisely specify state invariants, tem…

    Submitted 14 October, 2025; originally announced October 2025.

  41. arXiv:2510.11184  [pdf, ps, other]

    cs.LG cs.CL

    Can Tool-Integrated Reinforcement Learning Generalize Across Diverse Domains?

    Authors: Zhengyu Chen, Jinluan Yang, Teng Xiao, Ruochen Zhou, Luan Zhang, Xiangyu Xi, Xiaowei Shi, Wei Wang, Jinggang Wang

    Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable capabilities in reasoning and tool utilization. However, the generalization of tool-augmented reinforcement learning (RL) across diverse domains remains underexplored. In this work, we investigate the cross-domain generalization of an LLM agent equipped with a code interpreter tool, which is exclusively trained on mathema…

    Submitted 13 October, 2025; originally announced October 2025.

  42. arXiv:2510.10944  [pdf, ps, other]

    cs.IT

    Throughput Maximization for Multiuser Communications with Flexible-Sector 6DMA

    Authors: Xiaoming Shi, Yunli Li, Xiaodan Shao, Jie Xu, Rui Zhang

    Abstract: This paper presents a cost-effective and easily deployable flexible-sector six-dimensional movable antenna (6DMA) architecture for future wireless communication networks, which enables flexible antenna configurations to match users' spatial distribution for capacity enhancement. Unlike a conventional sectorized base station (BS) with fixed-position antennas (FPAs), the flexible-sector 6DMA-e…

    Submitted 12 October, 2025; originally announced October 2025.

  43. arXiv:2510.10182  [pdf, ps, other]

    cs.CL cs.AI

    A Survey of Inductive Reasoning for Large Language Models

    Authors: Kedi Chen, Dezhao Ruan, Yuhao Dan, Yaoting Wang, Siyu Yan, Xuecheng Wu, Yinqi Zhang, Qin Chen, Jie Zhou, Liang He, Biqing Qi, Linyang Li, Qipeng Guo, Xiaoming Shi, Wei Zhang

    Abstract: Reasoning is an important task for large language models (LLMs). Among all the reasoning paradigms, inductive reasoning is one of the fundamental types, which is characterized by its particular-to-general thinking process and the non-uniqueness of its answers. The inductive mode is crucial for knowledge generalization and aligns better with human cognition, so it is a fundamental mode of learning,…

    Submitted 11 October, 2025; originally announced October 2025.

  44. arXiv:2510.10125  [pdf, ps, other]

    cs.RO cs.AI

    Ctrl-World: A Controllable Generative World Model for Robot Manipulation

    Authors: Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, Chelsea Finn

    Abstract: Generalist robot policies can now perform a wide range of manipulation skills, but evaluating and improving their ability with unfamiliar objects and instructions remains a significant challenge. Rigorous evaluation requires a large number of real-world rollouts, while systematic improvement demands additional corrective data with expert labels. Both of these processes are slow, costly, and diffic…

    Submitted 14 October, 2025; v1 submitted 11 October, 2025; originally announced October 2025.

    Comments: 17 pages

  45. arXiv:2510.08081  [pdf, ps, other]

    cs.AI cs.CL

    AutoQual: An LLM Agent for Automated Discovery of Interpretable Features for Review Quality Assessment

    Authors: Xiaochong Lan, Jie Feng, Yinxing Liu, Xinlei Shi, Yong Li

    Abstract: Ranking online reviews by their intrinsic quality is a critical task for e-commerce platforms and information services, impacting user experience and business outcomes. However, quality is a domain-dependent and dynamic concept, making its assessment a formidable challenge. Traditional methods relying on hand-crafted features are unscalable across domains and fail to adapt to evolving content patt…

    Submitted 9 October, 2025; originally announced October 2025.

    Comments: EMNLP 2025

  46. arXiv:2510.07355  [pdf, ps, other]

    cs.MM cs.SD

    AV-EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Omni-modal LLMS with Audio-visual Cues

    Authors: Krish Patel, Dingkun Zhou, Ajay Kankipati, Akshaj Gupta, Zeyi Austin Li, Mohul Shukla, Vibhor Narang, Sara Kofman, Zongli Ye, Grace Wang, Xiaoyu Shi, Tingle Li, Guan-Ting Lin, Kan Jen Cheng, Huang-Cheng Chou, Jiachen Lian, Gopala Anumanchipalli

    Abstract: Emotions conveyed through voice and face shape engagement and context in human-AI interaction. Despite rapid progress in omni-modal large language models (LLMs), the holistic evaluation of emotional reasoning with audiovisual cues remains limited. To address this gap, we introduce AV-EMO-Reasoning, a benchmark designed to systematically assess emotional coherence in LLMs. The framework leverages a…

    Submitted 8 October, 2025; originally announced October 2025.

  47. arXiv:2510.07206  [pdf, ps, other]

    cs.CV

    EigenScore: OOD Detection using Covariance in Diffusion Models

    Authors: Shirin Shoushtari, Yi Wang, Xiao Shi, M. Salman Asif, Ulugbek S. Kamilov

    Abstract: Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems in safety-sensitive domains. Diffusion models have recently emerged as powerful generative models, capable of capturing complex data distributions through iterative denoising. Building on this progress, recent work has explored their potential for OOD detection. We propose EigenScore, a new OOD dete…

    Submitted 8 October, 2025; originally announced October 2025.

  48. arXiv:2510.04020  [pdf, ps, other]

    cs.LG cs.AI

    Spatiotemporal Forecasting as Planning: A Model-Based Reinforcement Learning Approach with Generative World Models

    Authors: Hao Wu, Yuan Gao, Xingjian Shi, Shuaipeng Li, Fan Xu, Fan Zhang, Zhihong Zhu, Weiyan Wang, Xiao Luo, Kun Wang, Xian Wu, Xiaomeng Huang

    Abstract: To address the dual challenges of inherent stochasticity and non-differentiable metrics in physical spatiotemporal forecasting, we propose Spatiotemporal Forecasting as Planning (SFP), a new paradigm grounded in Model-Based Reinforcement Learning. SFP constructs a novel Generative World Model to simulate diverse, high-fidelity future states, enabling an "imagination-based" environmental simulation…

    Submitted 9 October, 2025; v1 submitted 4 October, 2025; originally announced October 2025.

  49. arXiv:2510.03346  [pdf, ps, other]

    cs.LG cs.AI cs.MA

    KVComm: Enabling Efficient LLM Communication through Selective KV Sharing

    Authors: Xiangyu Shi, Marco Chiesa, Gerald Q. Maguire Jr., Dejan Kostic

    Abstract: Large Language Models (LLMs) are increasingly deployed in multi-agent systems, where effective inter-model communication is crucial. Existing communication protocols either rely on natural language, incurring high inference costs and information loss, or on hidden states, which suffer from information concentration bias and inefficiency. To address these limitations, we propose KVComm, a novel com…

    Submitted 2 October, 2025; originally announced October 2025.

  50. arXiv:2509.26086  [pdf, ps, other]

    cs.NI

    Flexible-Sector 6DMA Base Station: Modeling and Design

    Authors: Yunli Li, Xiaoming Shi, Xiaodan Shao, Jie Xu, Rui Zhang

    Abstract: Six-dimensional movable antenna (6DMA) has emerged as a promising new technology for future wireless networks, which can adaptively adjust the three-dimensional (3D) positions and 3D rotations of antennas/antenna arrays for performance enhancement. This paper proposes a novel cost-effective 6DMA-based base station (BS) architecture, termed the flexible-sector BS, which allows the deployed…

    Submitted 30 September, 2025; originally announced September 2025.