
Showing 1–50 of 381 results for author: Ren, S

Searching in archive cs.
  1. arXiv:2604.05475  [pdf, ps, other]

    cs.CV

    A Synthetic Eye Movement Dataset for Script Reading Detection: Real Trajectory Replay on a 3D Simulator

    Authors: Kidus Zewde, Yuchen Zhou, Dennis Ng, Neo Tiangratanakul, Tommy Duong, Ankit Raj, Yuxin Zhang, Xingyu Shen, Simiao Ren

    Abstract: Large vision-language models have achieved remarkable capabilities by training on massive internet-scale data, yet a fundamental asymmetry persists: while LLMs can leverage self-supervised pretraining on abundant text and image data, the same is not true for many behavioral modalities. Video-based behavioral data -- gestures, eye movements, social signals -- remains scarce, expensive to annotate,…

    Submitted 7 April, 2026; originally announced April 2026.

    Comments: Synthetic eye movement dataset generation via 3D eye simulator; iris trajectory replay; script reading detection; behavioral data augmentation

  2. arXiv:2604.03044  [pdf, ps, other]

    cs.CL cs.AI

    JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

    Authors: Aichen Cai, Anmeng Zhang, Anyu Li, Bo Zhang, Bohua Cai, Chang Li, Changjian Jiang, Changkai Lu, Chao Xue, Chaocai Liang, Cheng Zhang, Dongkai Liu, Fei Wang, Guoqiang Huang, Haijian Ke, Han Lin, Hao Wang, Ji Miao, Jiacheng Zhang, Jialong Shi, Jifeng Zhu, Jingjing Qian, Junhui Luo, Junwu Xiong, Lam So , et al. (44 additional authors not shown)

    Abstract: We introduce JoyAI-LLM Flash, an efficient Mixture-of-Experts (MoE) language model designed to redefine the trade-off between strong performance and token efficiency in the sub-50B parameter regime. JoyAI-LLM Flash is pretrained on a massive corpus of 20 trillion tokens and further optimized through a rigorous post-training pipeline, including supervised fine-tuning (SFT), Direct Preference Optimi…

    Submitted 8 April, 2026; v1 submitted 3 April, 2026; originally announced April 2026.

    Comments: Xiaodong He is the corresponding author

  3. ChatSVA: Bridging SVA Generation for Hardware Verification via Task-Specific LLMs

    Authors: Lik Tung Fu, Jie Zhou, Shaokai Ren, Mengli Zhang, Jia Xiong, Hugo Jiang, Nan Guan, Xi Wang, Jun Yang

    Abstract: Functional verification consumes over 50% of the IC development lifecycle, where SystemVerilog Assertions (SVAs) are indispensable for formal property verification and enhanced simulation-based debugging. However, manual SVA authoring is labor-intensive and error-prone. While Large Language Models (LLMs) show promise, their direct deployment is hindered by low functional accuracy and a severe scar…

    Submitted 3 April, 2026; originally announced April 2026.

    Comments: Accepted by DAC 2026

  4. arXiv:2603.27739  [pdf, ps, other]

    cs.CR

    Ordering Power is Sanctioning Power: Sanction Evasion-MEV and the Limits of On-Chain Enforcement

    Authors: Di Wu, Yuman Bai, Shoupeng Ren, Xinyu Zhang, Yiyue Cao, Xuechao Wang, Wu Wen, Jian Liu

    Abstract: Centralized stablecoins such as USDT and USDC enforce financial sanctions through contract-layer blacklist functions, yet on public blockchains a freeze is merely an ordinary transaction that must compete for execution priority. We identify a fundamental gap between contract-layer authority and consensus-layer enforcement: when a sanctioned entity's transfer and the issuer's freeze race for inclus…

    Submitted 29 March, 2026; originally announced March 2026.

  5. arXiv:2603.27538  [pdf, ps, other]

    cs.CV cs.CL

    LongCat-Next: Lexicalizing Modalities as Discrete Tokens

    Authors: Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, Haozhe Zhao, Hong Liu, Hui Su, Jiaqi Zhang, Jiawei Wang, Jing Li, Kefeng Zhang, Manyuan Zhang, Minhao Jing, Peng Pei, Quan Chen, Taofeng Xue, Tongxin Pan, Xiaotong Li, Xiaoyang Li , et al. (64 additional authors not shown)

    Abstract: The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Aut…

    Submitted 29 March, 2026; originally announced March 2026.

    Comments: LongCat-Next Technical Report

  6. arXiv:2603.26668  [pdf, ps, other]

    cs.IR cs.AI cs.CL

    Bridge-RAG: An Abstract Bridge Tree Based Retrieval Augmented Generation Algorithm With Cuckoo Filter

    Authors: Zihang Li, Wenjun Liu, Yikun Zong, Jiawen Tao, Siying Dai, Songcheng Ren, Zirui Liu, Yanbing Jiang, Tong Yang

    Abstract: As an important paradigm for enhancing the generation quality of Large Language Models (LLMs), retrieval-augmented generation (RAG) faces two challenges regarding retrieval accuracy and computational efficiency. This paper presents a novel RAG framework called Bridge-RAG. To overcome the accuracy challenge, we introduce the concept of abstract to bridge query entities and document chunks, prov…

    Submitted 11 January, 2026; originally announced March 2026.

  7. arXiv:2603.19325  [pdf, ps, other]

    cs.LG cs.AI

    Target Concept Tuning Improves Extreme Weather Forecasting

    Authors: Shijie Ren, Xinyue Gu, Ziheng Peng, Haifan Zhang, Peisong Niu, Bo Wu, Xiting Wang, Liang Sun, Jirong Wen

    Abstract: Deep learning models for meteorological forecasting often fail in rare but high-impact events such as typhoons, where relevant data is scarce. Existing fine-tuning methods typically face a trade-off between overlooking these extreme events and overfitting them at the expense of overall performance. We propose TaCT, an interpretable concept-gated fine-tuning framework that solves the aforementioned…

    Submitted 17 March, 2026; originally announced March 2026.

  8. arXiv:2603.11442  [pdf, ps, other]

    cs.AI cs.CV

    GPT4o-Receipt: A Dataset and Human Study for AI-Generated Document Forensics

    Authors: Yan Zhang, Simiao Ren, Ankit Raj, En Wei, Dennis Ng, Alex Shen, Jiayu Xue, Yuxin Zhang, Evelyn Marotta

    Abstract: Can humans detect AI-generated financial documents better than machines? We present GPT4o-Receipt, a benchmark of 1,235 receipt images pairing GPT-4o-generated receipts with authentic ones from established datasets, evaluated by five state-of-the-art multimodal LLMs and a 30-annotator crowdsourced perceptual study. Our findings reveal a striking paradox: humans are better at seeing AI artifacts, y…

    Submitted 24 March, 2026; v1 submitted 11 March, 2026; originally announced March 2026.

    Comments: 12 pages, 7 figures, 7 tables

    ACM Class: I.4.9; I.2.10

  9. arXiv:2603.02705  [pdf, ps, other]

    cs.CY

    Small Bottle, Big Pipe: Quantifying and Addressing the Impact of Data Centers on Public Water Systems

    Authors: Yuelin Han, Pengfei Li, Adam Wierman, Shaolei Ren

    Abstract: Water is a critical resource for data centers and an efficient means of cooling. However, meeting the growing water demand of data centers requires substantial peak water withdrawals, which many communities in the United States cannot supply, especially during the hottest days of the year. This largely overlooked water capacity constraint is emerging as a bottleneck for data centers and can force…

    Submitted 18 March, 2026; v1 submitted 3 March, 2026; originally announced March 2026.

    Comments: 51 pages; updates include the EPA's nationwide statistics of water treatment surplus capacity

  10. arXiv:2603.02176  [pdf, ps, other]

    cs.CL

    Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale

    Authors: Hao Li, Chunjiang Mu, Jianhao Chen, Siyue Ren, Zhiyao Cui, Yiqun Zhang, Lei Bai, Shuyue Hu

    Abstract: The rapid proliferation of Claude agent skills has raised the central question of how to effectively leverage, manage, and scale the agent skill ecosystem. In this paper, we propose AgentSkillOS, the first principled framework for skill selection, orchestration, and ecosystem-level management. AgentSkillOS comprises two stages: (i) Manage Skills, which organizes skills into a capability tree via n…

    Submitted 2 March, 2026; originally announced March 2026.

  11. arXiv:2603.01433  [pdf, ps, other]

    cs.CV

    DOCFORGE-BENCH: A Comprehensive 0-shot Benchmark for Document Forgery Detection and Analysis

    Authors: Zengqi Zhao, Weidi Xia, En Wei, Yan Zhang, Jane Mo, Tiannan Zhang, Yuanqin Dai, Zexi Chen, Yiran Tao, Simiao Ren

    Abstract: We present DOCFORGE-BENCH, the first unified zero-shot benchmark for document forgery detection, evaluating 14 methods across eight datasets spanning text tampering, receipt forgery, and identity document manipulation. Unlike fine-tuning-oriented evaluations such as ForensicHub [Du et al., 2025], DOCFORGE-BENCH applies all methods with their published pretrained weights and no domain adaptation --…

    Submitted 10 March, 2026; v1 submitted 1 March, 2026; originally announced March 2026.

  12. arXiv:2602.22642  [pdf, ps, other]

    cs.LG

    Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning

    Authors: Qin-Wen Luo, Sheng Ren, Xiang Chen, Rui Liu, Jun Fang, Naiqiang Tan, Sheng-Jun Huang

    Abstract: Chain-of-Thought (CoT) has substantially empowered Large Language Models (LLMs) to tackle complex reasoning tasks, yet the verbose nature of explicit reasoning steps incurs prohibitive inference latency and computational costs, limiting real-world deployment. While existing compression methods - ranging from self-training to Reinforcement Learning (RL) with length constraints - attempt to mitigate…

    Submitted 26 February, 2026; originally announced February 2026.

  13. arXiv:2602.20569  [pdf, ps, other]

    cs.CV

    AIForge-Doc: A Benchmark for Detecting AI-Forged Tampering in Financial and Form Documents

    Authors: Jiaqi Wu, Yuchen Zhou, Muduo Xu, Zisheng Liang, Simiao Ren, Jiayu Xue, Meige Yang, Siying Chen, Jingheng Huan

    Abstract: We present AIForge-Doc, the first dedicated benchmark targeting exclusively diffusion-model-based inpainting in financial and form documents with pixel-level annotation. Existing document forgery datasets rely on traditional digital editing tools (e.g., Adobe Photoshop, GIMP), creating a critical gap: state-of-the-art detectors are blind to the rapidly growing threat of AI-forged document fraud. A…

    Submitted 24 February, 2026; originally announced February 2026.

    Comments: 17 pages, 10 figures

  14. arXiv:2602.19539  [pdf, ps, other]

    cs.CV cs.CR cs.LG

    Can a Teenager Fool an AI? Evaluating Low-Cost Cosmetic Attacks on Age Estimation Systems

    Authors: Xingyu Shen, Tommy Duong, Xiaodong An, Zengqi Zhao, Zebang Hu, Haoyu Hu, Ziyou Wang, Finn Guo, Simiao Ren

    Abstract: Age estimation systems are increasingly deployed as gatekeepers for age-restricted online content, yet their robustness to cosmetic modifications has not been systematically evaluated. We investigate whether simple, household-accessible cosmetic changes, including beards, grey hair, makeup, and simulated wrinkles, can cause AI age estimators to classify minors as adults. To study this threat at sc…

    Submitted 23 February, 2026; originally announced February 2026.

    Comments: 13 pages, 6 figures

  15. arXiv:2602.11761  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

    Authors: MiniCPM Team, Wenhao An, Yingfa Chen, Yewei Fang, Jiayi Li, Xin Li, Yaohui Li, Yishan Li, Yuxuan Li, Biyuan Lin, Chuan Liu, Hezi Liu, Siyuan Liu, Hongya Lyu, Yinxu Pan, Shixin Ren, Xingyu Shen, Zhou Su, Haojun Sun, Yangang Sun, Zhen Leng Thai, Xin Tian, Rui Wang, Xiaorong Wang, Yudong Wang , et al. (22 additional authors not shown)

    Abstract: The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention mechanisms attempt to mitigate these issues, they typically involve a trade-off between memory efficiency and model performance. This paper introduces MiniCPM-SALA, a…

    Submitted 28 February, 2026; v1 submitted 12 February, 2026; originally announced February 2026.

    Comments: MiniCPM-SALA Technical Report

  16. arXiv:2602.10098  [pdf, ps, other]

    cs.RO cs.CV

    VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model

    Authors: Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, Zhibo Chen

    Abstract: Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps t…

    Submitted 13 February, 2026; v1 submitted 10 February, 2026; originally announced February 2026.

  17. arXiv:2602.08071  [pdf, ps, other]

    cs.CV

    ViT-5: Vision Transformers for The Mid-2020s

    Authors: Feng Wang, Sucheng Ren, Tiezheng Zhang, Predrag Neskovic, Anand Bhattad, Cihang Xie, Alan Yuille

    Abstract: This work presents a systematic investigation into modernizing Vision Transformer backbones by leveraging architectural advancements from the past five years. While preserving the canonical Attention-FFN structure, we conduct a component-wise refinement involving normalization, activation functions, positional encoding, gating mechanisms, and learnable tokens. These updates form a new generation o…

    Submitted 8 February, 2026; originally announced February 2026.

    Comments: Code is available at https://github.com/wangf3014/ViT-5

  18. arXiv:2602.07815  [pdf, ps, other]

    cs.CV

    Out of the box age estimation through facial imagery: A Comprehensive Benchmark of Vision-Language Models vs. out-of-the-box Traditional Architectures

    Authors: Simiao Ren, Xingyu Shen, Ankit Raj, Albert Dai, Caroline Zhang, Yuan Xu, Zexi Chen, Siqi Wu, Chen Gong, Yuxin Zhang

    Abstract: Facial age estimation plays a critical role in content moderation, age verification, and deepfake detection. However, no prior benchmark has systematically compared modern vision-language models (VLMs) with specialized age estimation architectures. We present the first large-scale cross-paradigm benchmark, evaluating 34 models - 22 specialized architectures with publicly available pretrained weigh…

    Submitted 11 February, 2026; v1 submitted 7 February, 2026; originally announced February 2026.

  19. arXiv:2602.07814  [pdf, ps, other]

    cs.CV cs.AI

    How well are open sourced AI-generated image detection models out-of-the-box: A comprehensive benchmark study

    Authors: Simiao Ren, Yuchen Zhou, Xingyu Shen, Kidus Zewde, Tommy Duong, George Huang, Hatsanai Tiangratanakul, Tsang Ng, En Wei, Jiayu Xue

    Abstract: As AI-generated images proliferate across digital platforms, reliable detection methods have become critical for combating misinformation and maintaining content authenticity. While numerous deepfake detection methods have been proposed, existing benchmarks predominantly evaluate fine-tuned models, leaving a critical gap in understanding out-of-the-box performance -- the most common deployment sce…

    Submitted 7 February, 2026; originally announced February 2026.

  20. arXiv:2602.07095  [pdf, ps, other]

    cs.CV cs.AI

    WorldEdit: Towards Open-World Image Editing with a Knowledge-Informed Benchmark

    Authors: Wang Lin, Feng Wang, Majun Zhang, Wentao Hu, Tao Jin, Zhou Zhao, Fei Wu, Jingyuan Chen, Alan Yuille, Sucheng Ren

    Abstract: Recent advances in image editing models have demonstrated remarkable capabilities in executing explicit instructions, such as attribute manipulation, style transfer, and pose synthesis. However, these models often face challenges when dealing with implicit editing instructions, which describe the cause of a visual change without explicitly detailing the resulting outcome. These limitations arise b…

    Submitted 6 February, 2026; originally announced February 2026.

  21. arXiv:2602.06953  [pdf, ps, other]

    cs.CL

    DAWN: Dependency-Aware Fast Inference for Diffusion LLMs

    Authors: Lizhuo Luo, Zhuoran Shi, Jiajun Luo, Zhi Wang, Shen Ren, Wenya Wang, Tianwei Zhang

    Abstract: Diffusion large language models (dLLMs) have shown advantages in text generation, particularly due to their inherent ability for parallel decoding. However, constrained by the quality--speed trade-off, existing inference solutions adopt conservative parallel strategies, leaving substantial efficiency potential underexplored. A core challenge is that parallel decoding assumes each position can be f…

    Submitted 6 February, 2026; originally announced February 2026.

  22. arXiv:2602.05258  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs

    Authors: Haoran Li, Sucheng Ren, Alan Yuille, Feng Wang

    Abstract: Rotary Positional Embedding (RoPE) is a key component of context scaling in Large Language Models (LLMs). While various methods have been proposed to adapt RoPE to longer contexts, their guiding principles generally fall into two categories: (1) out-of-distribution (OOD) mitigation, which scales RoPE frequencies to accommodate unseen positions, and (2) Semantic Modeling, which posits that the atte…

    Submitted 4 February, 2026; originally announced February 2026.

  23. arXiv:2602.04789  [pdf, ps, other]

    cs.CV

    Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention

    Authors: Chengtao Lv, Yumeng Shi, Yushi Huang, Ruihao Gong, Shen Ren, Wenya Wang

    Abstract: Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attention solutions have shown promise on bidirectional models, we identify that applying these solutions to AR models leads to considerable performance degradation for two reason…

    Submitted 4 February, 2026; originally announced February 2026.

    Comments: 14 pages, 7 figures

  24. arXiv:2602.03227  [pdf, ps, other]

    cs.CV

    Spiral RoPE: Rotate Your Rotary Positional Embeddings in the 2D Plane

    Authors: Haoyu Liu, Sucheng Ren, Tingyu Zhu, Peng Wang, Cihang Xie, Alan Yuille, Zeyu Zheng, Feng Wang

    Abstract: Rotary Position Embedding (RoPE) is the de facto positional encoding in large language models due to its ability to encode relative positions and support length extrapolation. When adapted to vision transformers, the standard axial formulation decomposes two-dimensional spatial positions into horizontal and vertical components, implicitly restricting positional encoding to axis-aligned directions.…

    Submitted 3 February, 2026; originally announced February 2026.

  25. arXiv:2602.01536  [pdf, ps, other]

    cs.RO cs.CV

    UniDWM: Towards a Unified Driving World Model via Multifaceted Representation Learning

    Authors: Shuai Liu, Siheng Ren, Xiaoyao Zhu, Quanmin Liang, Zefeng Li, Qiang Li, Xin Hu, Kai Huang

    Abstract: Achieving reliable and efficient planning in complex driving environments requires a model that can reason over the scene's geometry, appearance, and dynamics. We present UniDWM, a unified driving world model that advances autonomous driving through multifaceted representation learning. UniDWM constructs a structure- and dynamic-aware latent world representation that serves as a physically grounde…

    Submitted 1 February, 2026; originally announced February 2026.

  26. arXiv:2601.18623  [pdf, ps, other]

    cs.CV

    Adaptive Domain Shift in Diffusion Models for Cross-Modality Image Translation

    Authors: Zihao Wang, Yuzhou Chen, Shaogang Ren

    Abstract: Cross-modal image translation remains brittle and inefficient. Standard diffusion approaches often rely on a single, global linear transfer between domains. We find that this shortcut forces the sampler to traverse off-manifold, high-cost regions, inflating the correction burden and inviting semantic drift. We refer to this shared failure mode as fixed-schedule domain transfer. In this paper, we e…

    Submitted 2 February, 2026; v1 submitted 26 January, 2026; originally announced January 2026.

    Comments: Paper accepted as a conference paper at ICLR 2026

  27. arXiv:2601.15369  [pdf, ps, other]

    eess.IV cs.AI

    OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

    Authors: Letian Zhang, Sucheng Ren, Yanqing Liu, Xianhang Li, Zeyu Wang, Yuyin Zhou, Huaxiu Yao, Zeyu Zheng, Weili Nie, Guilin Liu, Zhiding Yu, Cihang Xie

    Abstract: This paper presents a family of advanced vision encoders, named OpenVision 3, that learns a single, unified visual representation that can serve both image understanding and image generation. Our core architecture is simple: we feed VAE-compressed image latents to a ViT encoder and train its output to support two complementary roles. First, the encoder output is passed to the ViT-VAE decoder to rec…

    Submitted 12 March, 2026; v1 submitted 21 January, 2026; originally announced January 2026.

  28. arXiv:2601.14304  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Guided by the Plan: Enhancing Faithful Autoregressive Text-to-Audio Generation with Guided Decoding

    Authors: Juncheng Wang, Zhe Hu, Chao Xu, Siyue Ren, Yuxiang Feng, Yang Liu, Baigui Sun, Shujun Wang

    Abstract: Autoregressive (AR) models excel at generating temporally coherent audio by producing tokens sequentially, yet they often falter in faithfully following complex textual prompts, especially those describing complex sound events. We uncover a surprising capability in AR audio generators: their early prefix tokens implicitly encode global semantic attributes of the final output, such as event count a…

    Submitted 18 January, 2026; originally announced January 2026.

    Comments: Accepted at EACL 2026

  29. arXiv:2601.12280  [pdf, ps, other]

    cs.HC

    Democratizing Music Therapy: LLM-Based Automated EEG Analysis and Progress Tracking for Low-Cost Home Devices

    Authors: Huixin Xue, Guangjun Xu, Shihong Ren, Xian Gao, Ruian Tie, Zhen Zhou, Hao Liu, Yue Gao

    Abstract: Home-based music therapy devices require accessible and cost-effective solutions for users to understand and track their therapeutic progress. Traditional physiological signal analysis, particularly EEG interpretation, relies heavily on domain experts, creating barriers to scalability and home adoption. Meanwhile, few experts are capable of interpreting physiological signal data while also making…

    Submitted 18 January, 2026; originally announced January 2026.

    Comments: 9 pages, 6 figures

  30. arXiv:2601.09112  [pdf, ps, other]

    cs.CY

    Seeking Human Security Consensus: A Unified Value Scale for Generative AI Value Safety

    Authors: Ying He, Baiyang Li, Yule Cao, Huirun Xu, Qiuxian Chen, Shu Chen, Shangsheng Ren

    Abstract: The rapid development of generative AI has brought value- and ethics-related risks to the forefront, making value safety a critical concern while a unified consensus remains lacking. In this work, we propose an internationally inclusive and resilient unified value framework, the GenAI Value Safety Scale (GVS-Scale): Grounded in a lifecycle-oriented perspective, we develop a taxonomy of GenAI value…

    Submitted 13 January, 2026; originally announced January 2026.

  31. arXiv:2601.02780  [pdf, ps, other]

    cs.CL cs.AI

    MiMo-V2-Flash Technical Report

    Authors: Xiaomi LLM-Core Team, :, Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, Gang Xie, Hailin Zhang, Hanglong Lv, Hanyu Li, Heyu Chen, Hongshen Xu, Houbin Zhang, Huaqiu Liu, Jiangshan Duo, Jianyu Wei, Jiebao Xiao, Jinhao Dong, Jun Shi , et al. (102 additional authors not shown)

    Abstract: We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, with a 128-token sliding window under a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tok…

    Submitted 8 January, 2026; v1 submitted 6 January, 2026; originally announced January 2026.

    Comments: 31 pages, technical report

  32. arXiv:2601.01200  [pdf, ps, other]

    cs.CV eess.IV

    MS-ISSM: Objective Quality Assessment of Point Clouds Using Multi-scale Implicit Structural Similarity

    Authors: Zhang Chen, Shuai Wan, Yuezhe Zhang, Siyu Ren, Fuzheng Yang, Junhui Hou

    Abstract: The unstructured and irregular nature of points poses a significant challenge for accurate point cloud quality assessment (PCQA), particularly in establishing accurate perceptual feature correspondence. To tackle this, we propose the Multi-scale Implicit Structural Similarity Measurement (MS-ISSM). Unlike traditional point-to-point matching, MS-ISSM utilizes radial basis function (RBF) to represen…

    Submitted 27 March, 2026; v1 submitted 3 January, 2026; originally announced January 2026.

  33. arXiv:2601.00626  [pdf, ps, other]

    cs.CV cs.LG

    HyperPriv-EPN: Hypergraph Learning with Privileged Knowledge for Ependymoma Prognosis

    Authors: Shuren Gabriel Yu, Sikang Ren, Yongji Tian

    Abstract: Preoperative prognosis of Ependymoma is critical for treatment planning but challenging due to the lack of semantic insights in MRI compared to post-operative surgical reports. Existing multimodal methods fail to leverage this privileged text data when it is unavailable during inference. To bridge this gap, we propose HyperPriv-EPN, a hypergraph-based Learning Using Privileged Information (LUPI) f…

    Submitted 2 January, 2026; originally announced January 2026.

    Comments: 6 pages, 2 figures, 2 tables

  34. arXiv:2512.23808  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    MiMo-Audio: Audio Language Models are Few-Shot Learners

    Authors: Xiaomi LLM-Core Team, :, Dong Zhang, Gang Wang, Jinlong Xue, Kai Fang, Liang Zhao, Rui Ma, Shuhuai Ren, Shuo Liu, Tao Guo, Weiji Zhuang, Xin Zhang, Xingchen Song, Yihan Yan, Yongzhe He, Cici, Bowen Shen, Chengxuan Zhu, Chong Ma, Chun Chen, Heyu Chen, Jiawei Li, Lei Li, Menghang Zhu , et al. (76 additional authors not shown)

    Abstract: Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the aud…

    Submitted 29 December, 2025; originally announced December 2025.

  35. arXiv:2512.17495  [pdf, ps, other]

    cs.CV

    GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

    Authors: Rang Li, Lei Li, Shuhuai Ren, Hao Tian, Shuhao Gu, Shicheng Li, Zihao Yue, Yudong Wang, Wenhan Ma, Zhe Yang, Jingyuan Ma, Zhifang Sui, Fuli Luo

    Abstract: Visual grounding, localizing objects from natural language descriptions, represents a critical bridge between language and vision understanding. While multimodal large language models (MLLMs) achieve impressive scores on existing benchmarks, a fundamental question remains: can MLLMs truly visually ground with human-like sophistication, or are they merely pattern-matching on simplified datasets? Cu…

    Submitted 23 March, 2026; v1 submitted 19 December, 2025; originally announced December 2025.

  36. arXiv:2512.11131  [pdf, ps, other]

    cs.LG cs.AI

    Fairness-Regularized Online Optimization with Switching Costs

    Authors: Pengfei Li, Yuelin Han, Adam Wierman, Shaolei Ren

    Abstract: Fairness and action smoothness are two crucial considerations in many online optimization problems, but they have yet to be addressed simultaneously. In this paper, we study a new and challenging setting of fairness-regularized smoothed online convex optimization with switching costs. First, to highlight the fundamental challenges introduced by the long-term fairness regularizer evaluated based on…

    Submitted 11 December, 2025; originally announced December 2025.

    Comments: Accepted by NeurIPS 2025

  37. arXiv:2512.01340  [pdf, ps, other]

    cs.CV

    EvalTalker: Learning to Evaluate Real-Portrait-Driven Multi-Subject Talking Humans

    Authors: Yingjie Zhou, Xilei Zhu, Siyu Ren, Ziyi Zhao, Ziwen Wang, Farong Wen, Yu Zhou, Jiezhang Cao, Xiongkuo Min, Fengjiao Chen, Xiaoyu Li, Xuezhi Cao, Guangtao Zhai, Xiaohong Liu

    Abstract: Speech-driven Talking Human (TH) generation, commonly known as "Talker," currently faces limitations in multi-subject driving capabilities. Extending this paradigm to "Multi-Talker," capable of animating multiple subjects simultaneously, introduces richer interactivity and stronger immersion in audiovisual communication. However, current Multi-Talkers still exhibit noticeable quality degradation c…

    Submitted 1 December, 2025; originally announced December 2025.

  38. arXiv:2511.22037  [pdf, ps, other]

    cs.CY

    What AI Speaks for Your Community: Polling AI Agents for Public Opinion on Data Center Projects

    Authors: Zhifeng Wu, Yuelin Han, Shaolei Ren

    Abstract: The intense computational demands of AI, especially large foundation models, are driving a global boom in data centers. These facilities bring both tangible benefits and potential environmental burdens to local communities. However, the planning processes for data centers often fail to proactively integrate local public opinion in advance, largely because traditional polling is expensive or is con…

    Submitted 4 December, 2025; v1 submitted 26 November, 2025; originally announced November 2025.

    Comments: 35 Pages. Accepted to NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models (ResponsibleFM)

  39. arXiv:2511.22031  [pdf, ps, other]

    cs.LG cs.AI

    Predicting Public Health Impacts of Electricity Usage

    Authors: Yejia Liu, Zhifeng Wu, Pengfei Li, Shaolei Ren

    Abstract: The electric power sector is a leading source of air pollutant emissions, impacting the public health of nearly every community. Although regulatory measures have reduced air pollutants, fossil fuels remain a significant component of the energy supply, highlighting the need for more advanced demand-side approaches to reduce the public health impacts. To enable health-informed demand-side managemen…

    Submitted 26 November, 2025; originally announced November 2025.

    Comments: 21 Pages. Accepted to NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models (ResponsibleFM)

  40. arXiv:2511.20293  [pdf, ps, other]

    cs.DB cs.AI cs.LG

    Forgetting by Pruning: Data Deletion in Join Cardinality Estimation

    Authors: Chaowei He, Yuanjun Liu, Qingzhi Ma, Shenyuan Ren, Xizhao Luo, Lei Zhao, An Liu

    Abstract: Machine unlearning in learned cardinality estimation (CE) systems presents unique challenges due to the complex distributional dependencies in multi-table relational data. Specifically, data deletion, a core component of machine unlearning, faces three critical challenges in learned CE models: attribute-level sensitivity, inter-table propagation and domain disappearance leading to severe overestim…

    Submitted 25 November, 2025; originally announced November 2025.

    Comments: AAAI26

  41. arXiv:2511.18382  [pdf, ps, other

    cs.CV

    ViMix-14M: A Curated Multi-Source Video-Text Dataset with Long-Form, High-Quality Captions and Crawl-Free Access

    Authors: Timing Yang, Sucheng Ren, Alan Yuille, Feng Wang

    Abstract: Text-to-video generation has surged in interest since Sora, yet open-source models still face a data bottleneck: there is no large, high-quality, easily obtainable video-text corpus. Existing public datasets typically require manual YouTube crawling, which yields low usable volume due to link rot and access limits, and raises licensing uncertainty. This work addresses this challenge by introducing…

    Submitted 23 November, 2025; originally announced November 2025.

  42. arXiv:2511.16997  [pdf, ps, other

    cs.AI

    MirrorMind: Empowering OmniScientist with the Expert Perspectives and Collective Knowledge of Human Scientists

    Authors: Qingbin Zeng, Bingbing Fan, Zhiyu Chen, Sijian Ren, Zhilun Zhou, Xuhua Zhang, Yuanyi Zhen, Fengli Xu, Yong Li, Tie-Yan Liu

    Abstract: The emergence of AI Scientists has demonstrated remarkable potential in automating scientific research. However, current approaches largely conceptualize scientific discovery as a solitary optimization or search process, overlooking that knowledge production is inherently a social and historical endeavor. Human scientific insight stems from two distinct yet interconnected sources. First is the ind…

    Submitted 21 November, 2025; originally announced November 2025.

    Comments: 26 pages, 4 figures

  43. arXiv:2511.16518  [pdf, ps, other

    cs.RO cs.CL cs.CV

    MiMo-Embodied: X-Embodied Foundation Model Technical Report

    Authors: Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zhang, Guang Li, Zheng Lu, Shuhuai Ren, Xianhui Meng, Yuchen Zhang, Jing Wu, Jinghui Lu, Chenxu Dang, Jiayi Guan, Jianhua Wu, Zhiyi Hou, Hanbing Li, Shumeng Xia, Mingliang Zhou, Yinan Zheng, Zihao Yue, Shuhao Gu, Hao Tian, Yuannan Shen , et al. (19 additional authors not shown)

    Abstract: We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate and achieve state-of-the-art performance in both Autonomous Driving and Embodied AI. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Percepti…

    Submitted 20 November, 2025; originally announced November 2025.

    Comments: Code: https://github.com/XiaomiMiMo/MiMo-Embodied Model: https://huggingface.co/XiaomiMiMo/MiMo-Embodied-7B

  44. arXiv:2511.14439  [pdf, ps, other

    cs.CL

    MedBench v4: A Robust and Scalable Benchmark for Evaluating Chinese Medical Language Models, Multimodal Models, and Intelligent Agents

    Authors: Jinru Ding, Lu Lu, Chao Ding, Mouxiao Bian, Jiayuan Chen, Wenrao Pang, Ruiyao Chen, Xinwei Peng, Renjie Lu, Sijie Ren, Guanxu Zhu, Xiaoqin Wu, Zhiqiang Liu, Rongzhao Zhang, Luyi Jiang, Bing Han, Yunqiu Wang, Jie Xu

    Abstract: Recent advances in medical large language models (LLMs), multimodal models, and agents demand evaluation frameworks that reflect real clinical workflows and safety constraints. We present MedBench v4, a nationwide, cloud-based benchmarking infrastructure comprising over 700,000 expert-curated tasks spanning 24 primary and 91 secondary specialties, with dedicated tracks for LLMs, multimodal models,…

    Submitted 18 November, 2025; v1 submitted 18 November, 2025; originally announced November 2025.

  45. arXiv:2511.04283  [pdf, ps, other

    cs.CV

    FastGS: Training 3D Gaussian Splatting in 100 Seconds

    Authors: Shiwei Ren, Tianci Wen, Yongchun Fang, Biao Lu

    Abstract: The dominant 3D Gaussian splatting (3DGS) acceleration methods fail to properly regulate the number of Gaussians during training, causing redundant computational time overhead. In this paper, we propose FastGS, a novel, simple, and general acceleration framework that fully considers the importance of each Gaussian based on multi-view consistency, efficiently solving the trade-off between training…

    Submitted 5 December, 2025; v1 submitted 6 November, 2025; originally announced November 2025.

    Comments: Project page: https://fastgs.github.io/

    MSC Class: 68T40 (Primary); 68T45, 68U99 (Secondary) ACM Class: I.4.8; I.3.7

  46. arXiv:2511.00279  [pdf, ps, other

    cs.MM cs.AI cs.CL cs.DC cs.LG cs.SD

    LongCat-Flash-Omni Technical Report

    Authors: Meituan LongCat Team, Bairui Wang, Bayan, Bin Xiao, Bo Zhang, Bolin Rong, Borun Chen, Chang Wan, Chao Zhang, Chen Huang, Chen Chen, Chen Chen, Chengxu Yang, Chengzuo Yang, Cong Han, Dandan Peng, Delian Ruan, Detai Xin, Disong Wang, Dongchao Yang, Fanfan Liu, Fengjiao Chen, Fengyu Yang, Gan Dong, Gang Huang , et al. (108 additional authors not shown)

    Abstract: We introduce LongCat-Flash-Omni, a state-of-the-art open-source omni-modal model with 560 billion parameters, excelling at real-time audio-visual interaction. By adopting a curriculum-inspired progressive training strategy that transitions from simpler to increasingly complex modality sequence modeling tasks, LongCat-Flash-Omni attains comprehensive multimodal capabilities while maintaining strong…

    Submitted 28 November, 2025; v1 submitted 31 October, 2025; originally announced November 2025.

  47. arXiv:2510.27469  [pdf, ps, other

    cs.CL

    Diffuse Thinking: Exploring Diffusion Language Models as Efficient Thought Proposers for Reasoning

    Authors: Chenyang Shao, Sijian Ren, Fengli Xu, Yong Li

    Abstract: In recent years, large language models (LLMs) have witnessed remarkable advancements, with the test-time scaling law consistently enhancing the reasoning capabilities. Through systematic evaluation and exploration of a diverse spectrum of intermediate thoughts, LLMs demonstrate the potential to generate deliberate reasoning steps, thereby substantially enhancing reasoning accuracy. However, LLMs'…

    Submitted 31 October, 2025; originally announced October 2025.

  48. arXiv:2510.27267  [pdf, ps, other

    cs.CL cs.AI

    MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models

    Authors: Kangkun Mao, Jinru Ding, Jiayuan Chen, Mouxiao Bian, Ruiyao Chen, Xinwei Peng, Sijie Ren, Linyang Li, Jie Xu

    Abstract: As large language models (LLMs) enter the medical domain, most benchmarks evaluate them on question answering or descriptive reasoning, overlooking quantitative reasoning critical to clinical decision-making. Existing datasets like MedCalc-Bench cover few calculation tasks and fail to reflect real-world computational scenarios. We introduce MedCalc-Eval, the largest benchmark for assessing LLMs'…

    Submitted 31 October, 2025; originally announced October 2025.

  49. arXiv:2510.22200  [pdf, ps, other

    cs.CV

    LongCat-Video Technical Report

    Authors: Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, Tong Zhang

    Abstract: Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across multiple video generation tasks. It particularly excels in efficient and high-quality long video generation, representing our first step tow…

    Submitted 28 October, 2025; v1 submitted 25 October, 2025; originally announced October 2025.

  50. arXiv:2510.21623  [pdf, ps, other

    cs.CL cs.AI

    The Universal Landscape of Human Reasoning

    Authors: Qiguang Chen, Jinhao Liu, Libo Qin, Yimeng Zhang, Yihao Liang, Shangxu Ren, Chengyu Luan, Dengyun Peng, Hanjing Li, Jiannan Guan, Zheng Yan, Jiaqi Wang, Mengkang Hu, Yantao Du, Zhi Chen, Xie Chen, Wanxiang Che

    Abstract: Understanding how information is dynamically accumulated and transformed in human reasoning has long challenged cognitive psychology, philosophy, and artificial intelligence. Existing accounts, from classical logic to probabilistic models, illuminate aspects of output or individual modelling, but do not offer a unified, quantitative description of general human reasoning dynamics. To solve this, w…

    Submitted 24 October, 2025; originally announced October 2025.

    Comments: Preprint