
Showing 1–50 of 365 results for author: Ye, W

Searching in archive cs.
  1. arXiv:2604.11483

    cs.LG q-bio.QM

    CAGenMol: Condition-Aware Diffusion Language Model for Goal-Directed Molecular Generation

    Authors: Yanting Li, Zhuoyang Jiang, Enyan Dai, Lei Wang, Wen-Cai Ye, Li Liu

    Abstract: Goal-directed molecular generation requires satisfying heterogeneous constraints such as protein–ligand compatibility and multi-objective drug-like properties, yet existing methods often optimize these constraints in isolation, failing to reconcile conflicting objectives (e.g., affinity vs. safety), and struggle to navigate the non-differentiable chemical space without compromising structural val…

    Submitted 13 April, 2026; originally announced April 2026.

  2. arXiv:2604.11407

    cs.CL cs.AI

    Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning

    Authors: Bo Li, Mingda Wang, Gexiang Fang, Shikun Zhang, Wei Ye

    Abstract: We revisit retrieval-augmented generation (RAG) by embedding retrieval control directly into generation. Instead of treating retrieval as an external intervention, we express retrieval decisions within token-level decoding, enabling end-to-end coordination without additional controllers or classifiers. Under the paradigm of Retrieval as Generation, we propose GRIP (Generation-gui…

    Submitted 13 April, 2026; originally announced April 2026.

    Comments: Github: https://github.com/WisdomShell/GRIP HuggingFace: https://huggingface.co/collections/WisdomShell/grip

    Journal ref: ACL2026, Main Conference

  3. arXiv:2604.11365

    cs.AI cs.CL

    Learning from Contrasts: Synthesizing Reasoning Paths from Diverse Search Trajectories

    Authors: Peiyang Liu, Zhirui Chen, Xi Wang, Di Liang, Youru Li, Zhi Cai, Wei Ye

    Abstract: Monte Carlo Tree Search (MCTS) has been widely used for automated reasoning data exploration, but current supervision extraction methods remain inefficient. Standard approaches retain only the single highest-reward trajectory, discarding the comparative signals present in the many explored paths. Here we introduce Contrastive Reasoning Path Synthesis (CRPS), a framework that transforms su…

    Submitted 13 April, 2026; originally announced April 2026.

  4. arXiv:2604.10448

    cs.CL

    Instruction Data Selection via Answer Divergence

    Authors: Bo Li, Mingda Wang, Shikun Zhang, Wei Ye

    Abstract: Instruction tuning relies on large instruction-response corpora whose quality and composition strongly affect downstream performance. We propose Answer Divergence-Guided Selection (ADG), which selects instruction data based on the geometric structure of multi-sample outputs. ADG draws several high-temperature generations per instruction, maps responses into an embedding space, and computes an outp…

    Submitted 12 April, 2026; originally announced April 2026.

    Comments: Github: https://github.com/WisdomShell/ADG Project: https://wisdomshell.github.io/ADG/

    Journal ref: ACL2026, Main Conference
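    The selection criterion sketched in this abstract can be illustrated with a minimal, hypothetical sketch. Everything below is an assumption for illustration: the function names, the mean-pairwise-cosine-distance dispersion measure, and the keep-most-divergent rule are not taken from the paper, whose exact output-geometry score is truncated in this listing.

    ```python
    import numpy as np

    def answer_divergence(embeddings: np.ndarray) -> float:
        """Dispersion of one instruction's sampled-response embeddings.

        `embeddings` has shape (n_samples, dim). We use mean pairwise
        cosine distance as a stand-in divergence measure (an assumption,
        not the paper's actual score).
        """
        normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        sims = normed @ normed.T                 # pairwise cosine similarities
        n = len(embeddings)
        off_diag = sims[~np.eye(n, dtype=bool)]  # drop self-similarities
        return float(np.mean(1.0 - off_diag))    # mean pairwise distance

    def select_instructions(scores: dict, k: int) -> list:
        """Keep the k instructions whose sampled answers diverge most
        (whether high or low divergence is preferred is not stated in the
        truncated abstract; higher-is-better is assumed here)."""
        return sorted(scores, key=scores.get, reverse=True)[:k]
    ```

    Identical response embeddings give a divergence of 0; orthogonal ones give 1, so the score grows with the spread of an instruction's sampled answers.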

  5. arXiv:2604.10098

    cs.LG

    Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

    Authors: Zunhai Su, Hengyuan Zhang, Wei Wu, Yifan Zhang, Yaxiu Liu, He Xiao, Qingyao Yang, Yuxuan Sun, Rui Yang, Chao Zhang, Keyu Fan, Weihao Ye, Jing Xiong, Hui Shen, Chaofan Tao, Taiqiang Wu, Zhongwei Wan, Yulei Qian, Yuchen Xie, Ngai Wong

    Abstract: As the foundational architecture of modern machine learning, Transformers have driven remarkable progress across diverse AI domains. Despite their transformative impact, a persistent challenge across various Transformers is Attention Sink (AS), in which a disproportionate amount of attention is focused on a small subset of specific yet uninformative tokens. AS complicates interpretability, signifi…

    Submitted 11 April, 2026; originally announced April 2026.

  6. arXiv:2604.08089

    cs.SE

    GALA: Multimodal Graph Alignment for Bug Localization in Automated Program Repair

    Authors: Zhuoyao Liu, Zhengran Zeng, Shu-Dong Huang, Yang Liu, Shikun Zhang, Wei Ye

    Abstract: Large Language Model (LLM)-based Automated Program Repair (APR) has shown strong potential on textual benchmarks, yet struggles in multimodal scenarios where bugs are reported with GUI screenshots. Existing methods typically convert images into plain text, which discards critical spatial relationships and causes a severe disconnect between visual observations and code components, leading localizat…

    Submitted 9 April, 2026; originally announced April 2026.

    Comments: Code available at: https://github.com/lzyyyyy666/GALA

    ACM Class: D.2.5; I.2.7; I.4.8

  7. arXiv:2604.07892

    cs.CL cs.AI

    Data Selection for Multi-turn Dialogue Instruction Tuning

    Authors: Bo Li, Shikun Zhang, Wei Ye

    Abstract: Instruction-tuned language models increasingly rely on large multi-turn dialogue corpora, but these datasets are often noisy and structurally inconsistent, with topic drift, repetitive chitchat, and mismatched answer formats across turns. We address this from a data selection perspective and propose MDS (Multi-turn Dialogue Selection), a dialogue-level framework that scores whole conversa…

    Submitted 11 April, 2026; v1 submitted 9 April, 2026; originally announced April 2026.

    Comments: Github: https://github.com/WisdomShell/MDS Project: https://wisdomshell.github.io/MDS/

    Journal ref: Findings of ACL 2026

  8. arXiv:2604.07769

    cs.SE

    An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models

    Authors: Chengli Xing, Zhengran Zeng, Gexiang Fang, Rui Xie, Wei Ye, Shikun Zhang

    Abstract: Recent advancements in code large language models (Code-LLMs) have demonstrated remarkable capabilities in resolving programming-related tasks. Meanwhile, researchers have recognized that the quality of pre-training data is crucial for improving LLM performance. However, most of the existing research on pre-training data filtering has focused on general datasets, and little attention for programmi…

    Submitted 8 April, 2026; originally announced April 2026.

  9. arXiv:2604.05246

    cs.PL

    A Gradual Probabilistic Lambda Calculus

    Authors: Wenjia Ye, Matías Toro, Federico Olmedo

    Abstract: Probabilistic programming languages have recently gained a lot of attention, in particular due to their applications in domains such as machine learning and differential privacy. To establish invariants of interest, many such languages include some form of static checking in the form of type systems. However, adopting such a type discipline can be cumbersome or overly conservative. Gradual typin…

    Submitted 6 April, 2026; originally announced April 2026.

  10. arXiv:2604.04351

    cs.HC

    Cognibit: From Digital Exhaustion to Real-World Connection Through Gamified Territory Control and LLM-Powered Twin Networking

    Authors: Wanghao Ye, Sihan Chen, Yiting Wang, Shwai He, Bowei Tian, Guoheng Sun, Ziyi Wang, Ziyao Wang, Yexiao He, Zheyu Shen, Meng Liu, Yuning Zhang, Meng Feng, Yifei Dong, Yanhong Qian, Yang Wang, Siyuan Peng, Yilong Dai, Zhenle Duan, Joshua Liu, Lang Xiong, Hanzhang Qin, Ang Li

    Abstract: We present an LLM-powered social discovery platform that uses digital twins to autonomously evaluate interpersonal compatibility through behavioral simulation. The platform unifies three key pillars: (1) digital twins that engage in autonomous multi-turn conversations on behalf of users to estimate compatibility, (2) gamified territory conquest mechanics that incentivize real-world exploration and…

    Submitted 5 April, 2026; originally announced April 2026.

    Comments: 9 pages main body, 155 pages total with appendices

  11. arXiv:2604.04138

    cs.RO cs.AI

    Learning Dexterous Grasping from Sparse Taxonomy Guidance

    Authors: Juhan Park, Taerim Yoon, Seungmin Kim, Joonggil Kim, Wontae Ye, Jeongeun Park, Yoonbyung Chai, Geonwoo Cho, Geunwoo Cho, Dohyeong Kim, Kyungjae Lee, Yongjae Kim, Sungjoon Choi

    Abstract: Dexterous manipulation requires planning a grasp configuration suited to the object and task, which is then executed through coordinated multi-finger control. However, specifying grasp plans with dense pose or contact targets for every object and task is impractical. Meanwhile, end-to-end reinforcement learning from task rewards alone lacks controllability, making it difficult for users to interve…

    Submitted 5 April, 2026; originally announced April 2026.

  12. Recruiting Heterogeneous Crowdsource Vehicles for Updating a High-definition Map

    Authors: Wentao Ye, Yuan Luo, Bo Liu, Jianwei Huang

    Abstract: The high-definition (HD) map is a cornerstone of autonomous driving. Unlike constructing a costly fleet of mapping vehicles, the crowdsourcing paradigm is a cost-effective way to keep an HD map up to date. Achieving practical success for crowdsourcing-based HD maps is contingent on addressing two critical issues: freshness and recruitment costs. Given that crowdsource vehicles are often heterogeneous i…

    Submitted 27 March, 2026; originally announced March 2026.

    Comments: 13 pages, 5 figures. This is the author's accepted manuscript of the paper presented at the 2023 IEEE WiOpt conference

    Journal ref: 2023 21st International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), pp. 1-8, Aug. 2023

  13. Efficient and Cost-effective Vehicle Recruitment for HD Map Crowdsourcing

    Authors: Wentao Ye, Yuan Luo, Bo Liu, Jianwei Huang

    Abstract: The high-definition (HD) map is a cornerstone of autonomous driving. The crowdsourcing paradigm is a cost-effective way to keep an HD map up-to-date. Current HD map crowdsourcing mechanisms aim to enhance HD map freshness within recruitment budgets. However, many overlook unique and critical traits of crowdsourcing vehicles, such as random arrival and heterogeneity, leading to either compromised m…

    Submitted 27 March, 2026; originally announced March 2026.

    Comments: 14 pages, 11 figures. This is the author's accepted manuscript of the article published in IEEE Transactions on Mobile Computing

    Journal ref: IEEE Transactions on Mobile Computing, vol. 24, no. 8, pp. 7505-7518, Aug. 2025

  14. arXiv:2603.12795

    cs.CL

    SteerRM: Debiasing Reward Models via Sparse Autoencoders

    Authors: Mengyuan Sun, Zhuohao Yu, Weizheng Gu, Shikun Zhang, Wei Ye

    Abstract: Reward models (RMs) are critical components of alignment pipelines, yet they exhibit biases toward superficial stylistic cues, preferring better-presented responses over semantically superior ones. Existing debiasing methods typically require retraining or architectural modifications, while direct activation suppression degrades performance due to representation entanglement. We propose SteerRM, t…

    Submitted 13 March, 2026; originally announced March 2026.

  15. arXiv:2603.02025

    cs.LG cs.AI

    Revealing Combinatorial Reasoning of GNNs via Graph Concept Bottleneck Layer

    Authors: Yue Niu, Zhaokai Sun, Jiayi Yang, Xiaofeng Cao, Rui Fan, Xin Sun, Hanli Wang, Wei Ye

    Abstract: Despite their success in various domains, the growing dependence on GNNs raises a critical concern about the nature of the combinatorial reasoning underlying their predictions, which is often hidden within their black-box architectures. Addressing this challenge requires understanding how GNNs translate topological patterns into logical rules. However, current works only uncover the hard logical r…

    Submitted 2 March, 2026; originally announced March 2026.

    Comments: 20 pages

  16. arXiv:2603.01452

    cs.AI cs.RO

    Scaling Tasks, Not Samples: Mastering Humanoid Control through Multi-Task Model-Based Reinforcement Learning

    Authors: Shaohuai Liu, Weirui Ye, Yilun Du, Le Xie

    Abstract: Developing generalist robots capable of mastering diverse skills remains a central challenge in embodied AI. While recent progress emphasizes scaling model parameters and offline datasets, such approaches are limited in robotics, where learning requires active interaction. We argue that effective online learning should scale the number of tasks, rather than the number of samples per task. T…

    Submitted 2 March, 2026; originally announced March 2026.

  17. arXiv:2602.22859

    cs.CV

    From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models

    Authors: Hongrui Jia, Chaoya Jiang, Shikun Zhang, Wei Ye

    Abstract: As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision making. Yet training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, targeted reinforcement. Motivated by findings that test-driven error exposure and feedback-based cor…

    Submitted 26 February, 2026; originally announced February 2026.

  18. arXiv:2602.22732

    cs.IR cs.LG

    Generative Recommendation for Large-Scale Advertising

    Authors: Ben Xue, Dan Liu, Lixiang Wang, Mingjie Sun, Peng Wang, Pengfei Zhang, Shaoyun Shi, Tianyu Xu, Yunhao Sha, Zhiqiang Liu, Bo Kong, Bo Wang, Hang Yang, Jieting Xue, Junhao Wang, Shengyu Wang, Shuping Hui, Wencai Ye, Xiao Lin, Yongzhi Li, Yuhang Chen, Zhihui Yin, Quan Chen, Shiyang Wen, Wenjin Wu , et al. (5 additional authors not shown)

    Abstract: Generative recommendation has recently attracted widespread attention in industry due to its potential for scaling and stronger model capacity. However, deploying real-time generative recommendation in large-scale advertising requires designs beyond large-language-model (LLM)-style training and serving recipes. We present a production-oriented generative recommender co-designed across architecture…

    Submitted 1 April, 2026; v1 submitted 26 February, 2026; originally announced February 2026.

    Comments: 13 pages, 6 figures, under review

  19. arXiv:2602.21780

    cs.CV

    XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression

    Authors: Zunhai Su, Weihao Ye, Hansen Feng, Keyu Fan, Jing Zhang, Dahai Yu, Zhengwu Liu, Ngai Wong

    Abstract: Learning-based 3D visual geometry models have significantly advanced with the advent of large-scale transformers. Among these, StreamVGGT leverages frame-wise causal attention to deliver robust and efficient streaming 3D reconstruction. However, it suffers from unbounded growth in the Key-Value (KV) cache due to the massive influx of vision tokens from multi-image and long-video inputs, leading to…

    Submitted 25 February, 2026; originally announced February 2026.

    Comments: Submission to the Journal of the Society for Information Display

  20. arXiv:2602.18534

    cs.SE cs.PL

    Validated Code Translation for Projects with External Libraries

    Authors: Hanliang Zhang, Arindam Sharma, Cristina David, Meng Wang, Brandon Paulsen, Daniel Kroening, Wenjia Ye, Taro Sekiyama

    Abstract: Large Language Models (LLMs) have shown promise for program translation, particularly for migrating systems code to memory-safe languages such as Rust. However, existing approaches struggle when source programs depend on external libraries: LLMs frequently hallucinate non-existent target APIs and fail to generate call-enabling imports; moreover, validating semantic equivalence is challenging when…

    Submitted 20 February, 2026; originally announced February 2026.

  21. arXiv:2602.13588

    cs.CV cs.AI

    Two-Stream Interactive Joint Learning of Scene Parsing and Geometric Vision Tasks

    Authors: Guanfeng Tang, Hongbo Zhao, Ziwei Long, Jiayao Li, Bohong Xiao, Wei Ye, Hanli Wang, Rui Fan

    Abstract: Inspired by the human visual system, which operates on two parallel yet interactive streams for contextual and spatial understanding, this article presents Two Interactive Streams (TwInS), a novel bio-inspired joint learning framework capable of simultaneously performing scene parsing and geometric vision tasks. TwInS adopts a unified, general-purpose architecture in which multi-level contextual f…

    Submitted 13 February, 2026; originally announced February 2026.

  22. arXiv:2602.12670

    cs.AI

    SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

    Authors: Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li , et al. (16 additional authors not shown)

    Abstract: Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-gen…

    Submitted 13 March, 2026; v1 submitted 13 February, 2026; originally announced February 2026.

  23. arXiv:2602.09555

    cs.CL

    Advancing Block Diffusion Language Models for Test-Time Scaling

    Authors: Yi Lu, Deyang Kong, Jianing Wang, Linsen Guo, Xue Wang, Qi Guo, Tao Gui, Xuanjing Huang, Wei Ye, Shikun Zhang, Wei Wang

    Abstract: Recent advances in block diffusion language models have demonstrated competitive performance and strong scalability on reasoning tasks. However, existing BDLMs remain under-explored in the test-time scaling setting and face more severe decoding challenges in long Chain-of-Thought reasoning, particularly in balancing decoding speed and effectiveness. In this work, we propose a unified fra…

    Submitted 10 February, 2026; v1 submitted 10 February, 2026; originally announced February 2026.

  24. arXiv:2602.08676

    cs.LG cs.AI

    LLaDA2.1: Speeding Up Text Diffusion via Token Editing

    Authors: Tiwei Bie, Maosong Cao, Xiang Cao, Bingsen Chen, Fuyuan Chen, Kun Chen, Lun Du, Daozhuo Feng, Haibo Feng, Mingliang Gong, Zhuocheng Gong, Yanmei Gu, Jian Guan, Kaiyuan Guan, Hongliang He, Zenan Huang, Juyong Jiang, Zhonghui Jiang, Zhenzhong Lan, Chengxi Li, Jianguo Li, Zehuan Li, Huabin Liu, Lin Liu, Guoshan Lu , et al. (25 additional authors not shown)

    Abstract: While LLaDA2.0 showcased the scaling potential of 100B-level block-diffusion models and their inherent parallelization, the delicate equilibrium between decoding speed and generation quality has remained an elusive frontier. Today, we unveil LLaDA2.1, a paradigm shift designed to transcend this trade-off. By seamlessly weaving Token-to-Token (T2T) editing into the conventional Mask-to-Token (M2T)…

    Submitted 13 February, 2026; v1 submitted 9 February, 2026; originally announced February 2026.

    Comments: 11 pages, 3 figures

  25. arXiv:2602.08344

    cs.AI

    OPE: Overcoming Information Saturation in Parallel Thinking via Outline-Guided Path Exploration

    Authors: Qi Guo, Jianing Wang, Deyang Kong, Xiangyu Xi, Jianfei Zhang, Yi Lu, Jingang Wang, Wei Wang, Shikun Zhang, Wei Ye

    Abstract: Parallel thinking has emerged as a new paradigm for large reasoning models (LRMs) in tackling complex problems. Recent methods leverage Reinforcement Learning (RL) to enhance parallel thinking, aiming to address the limitations in computational resources and effectiveness encountered with supervised fine-tuning. However, most existing studies primarily focus on optimizing the aggregation phase, wi…

    Submitted 9 February, 2026; originally announced February 2026.

  26. TraceCoder: A Trace-Driven Multi-Agent Framework for Automated Debugging of LLM-Generated Code

    Authors: Jiangping Huang, Wenguang Ye, Weisong Sun, Jian Zhang, Mingyue Zhang, Yang Liu

    Abstract: Large Language Models (LLMs) often generate code with subtle but critical bugs, especially for complex tasks. Existing automated repair methods typically rely on superficial pass/fail signals, offering limited visibility into program behavior and hindering precise error localization. In addition, without a way to learn from prior failures, repair processes often fall into repetitive and inefficien…

    Submitted 6 February, 2026; originally announced February 2026.

    MSC Class: D.2

  27. arXiv:2602.06488

    cs.CV

    Rebenchmarking Unsupervised Monocular 3D Occupancy Prediction

    Authors: Zizhan Guo, Yi Feng, Mengtan Zhang, Haoran Zhang, Wei Ye, Rui Fan

    Abstract: Inferring the 3D structure from a single image, particularly in occluded regions, remains a fundamental yet unsolved challenge in vision-centric autonomous driving. Existing unsupervised approaches typically train a neural radiance field and treat the network outputs as occupancy probabilities during evaluation, overlooking the inconsistency between training and evaluation protocols. Moreover, the…

    Submitted 6 February, 2026; originally announced February 2026.

  28. arXiv:2602.03416

    cs.IR

    AesRec: A Dataset for Aesthetics-Aligned Clothing Outfit Recommendation

    Authors: Wenxin Ye, Lin Li, Ming Li, Yang Shen, Kanghong Wang, Jimmy Xiangji Huang

    Abstract: Clothing recommendation extends beyond merely generating personalized outfits; it serves as a crucial medium for aesthetic guidance. However, existing methods predominantly rely on user-item-outfit interaction behaviors while overlooking explicit representations of clothing aesthetics. To bridge this gap, we present the AesRec benchmark dataset featuring systematic quantitative aesthetic annotatio…

    Submitted 3 February, 2026; originally announced February 2026.

  29. arXiv:2602.02276

    cs.CL cs.AI cs.LG

    Kimi K2.5: Visual Agentic Intelligence

    Authors: Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen , et al. (301 additional authors not shown)

    Abstract: We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that the two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5…

    Submitted 2 February, 2026; originally announced February 2026.

    Comments: Kimi K2.5 tech report

  30. arXiv:2602.01611

    cs.LG

    What Do Agents Learn from Trajectory-SFT: Semantics or Interfaces?

    Authors: Weizheng Gu, Chengze Li, Zhuohao Yu, Mengyuan Sun, Zhibang Yang, Wei Wang, Hongrui Jia, Shikun Zhang, Wei Ye

    Abstract: Large language models are increasingly evaluated as interactive agents, yet standard agent benchmarks conflate two qualitatively distinct sources of success: semantic tool-use and interface-specific interaction pattern memorization. Because both mechanisms can yield identical task success on the original interface, benchmark scores alone are not identifiable evidence of environment-invariant capab…

    Submitted 1 February, 2026; originally announced February 2026.

  31. arXiv:2602.01227

    cs.CL cs.AI cs.LG

    Supervised Fine-Tuning Needs to Unlock the Potential of Token Priority

    Authors: Zhanming Shen, Zeyu Qin, Jiaqi Hu, Wentao Ye, Hao Chen, Xiaomeng Hu, Haokai Xu, Gang Chen, Yi R. Fung, Haobo Wang

    Abstract: The transition from fitting empirical data to achieving true human utility is fundamentally constrained by a granularity mismatch, where fine-grained autoregressive generation is often supervised by coarse or uniform signals. This position paper advocates Token Priority as the essential bridge, formalizing Supervised Fine-Tuning (SFT) not as simple optimization but as a precise distribution reshap…

    Submitted 9 February, 2026; v1 submitted 1 February, 2026; originally announced February 2026.

  32. arXiv:2602.00037

    q-fin.ST cs.AI cs.CE cs.LG

    Bitcoin Price Prediction using Machine Learning and Combinatorial Fusion Analysis

    Authors: Yuanhong Wu, Wei Ye, Jingyan Xu, D. Frank Hsu

    Abstract: In this work, we propose to apply a new model fusion and learning paradigm, known as Combinatorial Fusion Analysis (CFA), to the field of Bitcoin price prediction. Price prediction of financial products has long been a central topic in finance, as successful prediction can yield significant profit. Every machine learning model has its own strengths and weaknesses, which hinders progress…

    Submitted 8 March, 2026; v1 submitted 18 January, 2026; originally announced February 2026.

    Comments: 8 pages, 5 figures, 3 tables; Accepted to 2025 IEEE Conference on Artificial Intelligence (IEEE CAI)

  33. arXiv:2601.23090

    cs.CE q-bio.QM

    Omni-fMRI: A Universal Atlas-Free fMRI Foundation Model

    Authors: Mo Wang, Wenhao Ye, Junfeng Xia, Junxiang Zhang, Xuanye Pan, Minghao Xu, Haotian Deng, Hongkai Wen, Quanying Liu

    Abstract: Self-supervised fMRI foundation models have shown promising transfer performance, yet most rely on predefined region-level parcellations that discard fine-grained voxel information and introduce atlas-dependent biases. We propose Omni-fMRI, an atlas-free foundation model that operates directly on voxel-level signals. To enable scalable pretraining on 49,497 fMRI sessions across nine datasets, Omni…

    Submitted 30 January, 2026; originally announced January 2026.

  34. arXiv:2601.21947

    cs.AI

    ToolWeaver: Weaving Collaborative Semantics for Scalable Tool Use in Large Language Models

    Authors: Bowen Fang, Wen Ye, Yunyue Su, Jinghao Zhang, Qiang Liu, Yesheng Liu, Xin Sun, Shu Wu, Jiabing Yang, Baole Wei, Liang Wang

    Abstract: Prevalent retrieval-based tool-use pipelines struggle with a dual semantic challenge: their retrievers often employ encoders that fail to capture complex semantics, while the Large Language Model (LLM) itself lacks intrinsic tool knowledge from its natural language pretraining. Generative methods offer a powerful alternative by unifying selection and execution, tasking the LLM to directly learn an…

    Submitted 29 January, 2026; originally announced January 2026.

    Comments: 10 pages, 12 figures, Accepted to ICLR 2026

    ACM Class: I.2.7

  35. arXiv:2601.21408

    cs.CV

    MPF-Net: Exposing High-Fidelity AI-Generated Video Forgeries via Hierarchical Manifold Deviation and Micro-Temporal Fluctuations

    Authors: Xinan He, Kaiqing Lin, Yue Zhou, Jiaming Zhong, Wei Ye, Wenhui Yi, Bing Fan, Feng Ding, Haodong Li, Bo Cao, Bin Li

    Abstract: With the rapid advancement of video generation models such as Veo and Wan, the visual quality of synthetic content has reached a level where macro-level semantic errors and temporal inconsistencies are no longer prominent. However, this does not imply that the distinction between real and cutting-edge high-fidelity fakes is untraceable. We argue that AI-generated videos are essentially products of…

    Submitted 2 February, 2026; v1 submitted 29 January, 2026; originally announced January 2026.

  36. arXiv:2601.15671

    cs.HC cs.AI

    StreetDesignAI: A Multi-Persona Evaluation System for Inclusive Infrastructure Design

    Authors: Ziyi Wang, Yilong Dai, Duanya Lyu, Mateo Nader, Sihan Chen, Wanghao Ye, Zjian Ding, Xiang Yan

    Abstract: Designing cycling infrastructure requires balancing the competing needs of diverse user groups, yet designers often struggle to anticipate how different cyclists experience the same street environment. We investigate how persona-based evaluation can support cycling infrastructure design by making experiential conflicts explicit during the design process. Informed by a formative study with 12 domai…

    Submitted 12 April, 2026; v1 submitted 22 January, 2026; originally announced January 2026.

  37. arXiv:2601.13856

    cs.IR

    QKVQA: Question-Focused Filtering for Knowledge-based VQA

    Authors: Wei Ye, Yixin Su, Yueguo Chen, Longxiang Gao, Jianjun Li, Ruixuan Li, Rui Zhang

    Abstract: Visual Question Answering (VQA) is the task of answering questions based on image content. Building upon this, Knowledge-Based VQA (KB-VQA) requires models to answer questions that depend on external knowledge beyond the visual content of an image. In such settings, effective knowledge filtering is essential for achieving high question answering accuracy. Typical filtering methods suffer from two…

    Submitted 7 April, 2026; v1 submitted 20 January, 2026; originally announced January 2026.

  38. arXiv:2601.12684

    cs.CE

    A Model Fusion Approach for Enhancing Credit Approval Decision Making

    Authors: Yuanhong Wu, Jingyan Xu, Wei Ye, Christina Schweikert, D. Frank Hsu

    Abstract: Credit default poses significant challenges to financial institutions and consumers, resulting in substantial financial losses and diminished trust. As such, credit default risk management has been a critical topic in the financial industry. In this paper, we present Combinatorial Fusion Analysis (CFA), a model fusion framework that combines multiple machine learning algorithms to detect and pred…

    Submitted 18 January, 2026; originally announced January 2026.

    Comments: 7 pages, 3 figures, 2 tables; Accepted to 2025 IEEE International Conference on AI x Business (AIxB 2025)

  39. arXiv:2601.12051

    cs.CV

    A Unified Masked Jigsaw Puzzle Framework for Vision and Language Models

    Authors: Weixin Ye, Wei Wang, Yahui Liu, Yue Song, Bin Ren, Wei Bi, Rita Cucchiara, Nicu Sebe

    Abstract: In federated learning, Transformer, as a popular architecture, faces critical challenges in defending against gradient attacks and improving model performance in both Computer Vision (CV) and Natural Language Processing (NLP) tasks. It has been revealed that the gradient of Position Embeddings (PEs) in Transformer contains sufficient information, which can be used to reconstruct the input data. To…

    Submitted 17 January, 2026; originally announced January 2026.

    Comments: 9 figures, 12 tables

  40. arXiv:2601.10348

    cs.CL cs.AI cs.LG

    Training-Trajectory-Aware Token Selection

    Authors: Zhanming Shen, Jiaqi Hu, Zeyu Qin, Hao Chen, Wentao Ye, Zenan Huang, Yihong Zhuang, Guoshan Lu, Junlin Zhou, Junbo Zhao

    Abstract: Efficient distillation is a key pathway for converting expensive reasoning capability into deployable efficiency, yet in the frontier regime where the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degradation. We observe a characteristic training phenomenon: even as loss decreases monotonically, all performance metrics can drop sharpl…

    Submitted 15 January, 2026; originally announced January 2026.

  41. arXiv:2601.10156  [pdf, ps, other

    cs.CL

    ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

    Authors: Yutao Mou, Zhangchi Xue, Lijun Li, Peiyang Liu, Shikun Zhang, Wei Ye, Jing Shao

    Abstract: While LLM-based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step-level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under-explored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invo… ▽ More

    Submitted 15 January, 2026; originally announced January 2026.

    Comments: Work in Progress. Code available: https://github.com/MurrayTom/ToolSafe

  42. arXiv:2601.09269  [pdf, ps, other

    cs.AI

    RISER: Orchestrating Latent Reasoning Skills for Adaptive Activation Steering

    Authors: Wencheng Ye, Xiaoyang Yuan, Yi Bin, Pengpeng Zeng, Hengyu Jin, Liang Peng, Heng Tao Shen

    Abstract: Recent work on domain-specific reasoning with large language models (LLMs) often relies on training-intensive approaches that require parameter updates. While activation steering has emerged as a parameter efficient alternative, existing methods apply static, manual interventions that fail to adapt to the dynamic nature of complex reasoning. To address this limitation, we propose RISER (Router-bas… ▽ More

    Submitted 19 January, 2026; v1 submitted 14 January, 2026; originally announced January 2026.
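    Activation steering, as described in the abstract above, adds a direction vector to a hidden state at inference time without any parameter updates; a router-based variant (as RISER's name suggests) would choose which skill vector to apply per input rather than using a fixed, manual intervention. The sketch below is a minimal illustration under that reading, with the router, skill vectors, and `alpha` scale all invented for the example rather than taken from the paper.

    ```python
    import numpy as np

    def route_and_steer(hidden, skill_vectors, router_weights, alpha=1.0):
        """Score each latent 'skill' against the current hidden state,
        then nudge the activation along the best-scoring skill direction."""
        scores = router_weights @ hidden          # one routing score per skill
        best = int(np.argmax(scores))             # adaptive choice per input
        return hidden + alpha * skill_vectors[best], best

    rng = np.random.default_rng(0)
    d, n_skills = 8, 3
    hidden = rng.normal(size=d)                   # stand-in LLM hidden state
    skills = rng.normal(size=(n_skills, d))       # stand-in steering vectors
    router = rng.normal(size=(n_skills, d))       # stand-in router weights
    steered, chosen = route_and_steer(hidden, skills, router, alpha=0.5)
    ```

    The key contrast with static steering is that `best` is recomputed from each hidden state, so the intervention can change as reasoning unfolds.
    
    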

  43. arXiv:2601.02358  [pdf, ps, other

    cs.CV

    VINO: A Unified Visual Generator with Interleaved OmniModal Context

    Authors: Junyi Chen, Tong He, Zhoujie Fu, Pengfei Wan, Kun Gai, Weicai Ye

    Abstract: We present VINO, a unified visual generator that performs image and video generation and editing within a single framework. Instead of relying on task-specific models or independent modules for each modality, VINO uses a shared diffusion backbone that conditions on text, images and videos, enabling a broad range of visual creation and editing tasks under one model. Specifically, VINO couples a vis… ▽ More

    Submitted 16 January, 2026; v1 submitted 5 January, 2026; originally announced January 2026.

    Comments: Project page: https://sotamak1r.github.io/VINO-web/

  44. arXiv:2601.01204  [pdf, ps, other

    cs.CV

    XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression

    Authors: Zunhai Su, Weihao Ye, Hansen Feng, Keyu Fan, Jing Zhang, Dahai Yu, Zhengwu Liu, Ngai Wong

    Abstract: Learning-based 3D visual geometry models have benefited substantially from large-scale transformers. Among these, StreamVGGT leverages frame-wise causal attention for strong streaming reconstruction, but suffers from unbounded KV cache growth, leading to escalating memory consumption and inference latency as input frames accumulate. We propose XStreamVGGT, a tuning-free approach that systematicall… ▽ More

    Submitted 3 January, 2026; originally announced January 2026.
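    The memory problem the abstract above describes, unbounded KV cache growth under frame-wise causal attention, can be illustrated with the simplest possible compression policy: cap the cache and evict the oldest entries. This is only a toy sketch of the general idea; XStreamVGGT's actual compression strategy is not specified here, and real methods may merge or quantize entries rather than drop them.

    ```python
    from collections import deque

    class BoundedKVCache:
        """Toy streaming KV cache: once the per-layer budget is reached,
        the oldest key/value pairs are evicted, so memory stays constant
        as input frames accumulate (illustrative eviction policy only)."""
        def __init__(self, budget):
            self.keys = deque(maxlen=budget)
            self.values = deque(maxlen=budget)

        def append(self, k, v):
            # deque(maxlen=...) silently drops the oldest entry when full
            self.keys.append(k)
            self.values.append(v)

        def __len__(self):
            return len(self.keys)

    cache = BoundedKVCache(budget=4)
    for frame in range(10):
        cache.append(f"k{frame}", f"v{frame}")
    # Only the 4 most recent frames' keys/values remain resident.
    ```

    An unbounded cache would hold all 10 frames here; the bounded one holds 4 regardless of stream length, which is the memory behavior the paper targets (by more sophisticated means).
    
    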

  45. arXiv:2512.21881  [pdf, ps, other

    cs.CV q-bio.NC

    SLIM-Brain: A Data- and Training-Efficient Foundation Model for fMRI Data Analysis

    Authors: Mo Wang, Junfeng Xia, Wenhao Ye, Enyu Liu, Kaining Peng, Jianfeng Feng, Quanying Liu, Hongkai Wen

    Abstract: Foundation models are emerging as a powerful paradigm for fMRI analysis, but current approaches face a dual bottleneck of data- and training-efficiency. Atlas-based methods aggregate voxel signals into fixed regions of interest, reducing data dimensionality but discarding fine-grained spatial details, and requiring extremely large cohorts to train effectively as general-purpose foundation models.… ▽ More

    Submitted 30 January, 2026; v1 submitted 26 December, 2025; originally announced December 2025.

    Comments: code released

  46. arXiv:2512.20312  [pdf, ps, other

    cs.LG cs.AI

    TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning

    Authors: Saisai Yang, Qingyi Huang, Jing Yuan, Liangyu Zha, Kai Tang, Yuhang Yang, Ning Wang, Yucheng Wei, Liyao Li, Wentao Ye, Hao Chen, Tao Zhang, Junlin Zhou, Haobo Wang, Gang Chen, Junbo Zhao

    Abstract: Tabular data serves as the backbone of modern data analysis and scientific research. While Large Language Models (LLMs) fine-tuned via Supervised Fine-Tuning (SFT) have significantly improved natural language interaction with such structured data, they often fall short in handling the complex, multi-step reasoning and robust code execution required for real-world table tasks. Reinforcement Learnin… ▽ More

    Submitted 25 December, 2025; v1 submitted 23 December, 2025; originally announced December 2025.

  47. arXiv:2512.19379  [pdf, ps, other

    cs.LG cs.AI cs.MM

    OmniMER: Auxiliary-Enhanced LLM Adaptation for Indonesian Multimodal Emotion Recognition

    Authors: Xueming Yan, Boyan Xu, Yaochu Jin, Lixian Xiao, Wenlong Ye, Runyang Cai, Zeqi Zheng, Jingfa Liu, Aimin Yang, Yongduan Song

    Abstract: Indonesian, spoken by over 200 million people, remains underserved in multimodal emotion recognition research despite its dominant presence on Southeast Asian social media platforms. We introduce IndoMER, the first multimodal emotion recognition benchmark for Indonesian, comprising 1,944 video segments from 203 speakers with temporally aligned text, audio, and visual annotations across seven emoti… ▽ More

    Submitted 10 February, 2026; v1 submitted 22 December, 2025; originally announced December 2025.

  48. arXiv:2512.18772  [pdf, ps, other

    cs.CV

    In-Context Audio Control of Video Diffusion Transformers

    Authors: Wenze Liu, Weicai Ye, Minghong Cai, Quande Liu, Xintao Wang, Xiangyu Yue

    Abstract: Recent advancements in video generation have seen a shift towards unified, transformer-based foundation models that can handle multiple conditional inputs in-context. However, these models have primarily focused on modalities like text, images, and depth maps, while strictly time-synchronous signals like audio have been underexplored. This paper introduces In-Context Audio Control of video diffusi… ▽ More

    Submitted 21 December, 2025; originally announced December 2025.

  49. arXiv:2512.16776  [pdf, ps, other

    cs.CV

    Kling-Omni Technical Report

    Authors: Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, Xiao Hu, Xiaohua Hu, Boyuan Jiang, Fangyuan Kong, Hang Li, Jie Li, Qingyu Li, Shen Li, Xiaohan Li, Yan Li, Jiajun Liang, Borui Liao, Yiqiao Liao, Weihong Lin, Quande Liu , et al. (43 additional authors not shown)

    Abstract: We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supp… ▽ More

    Submitted 18 December, 2025; originally announced December 2025.

    Comments: Kling-Omni Technical Report

  50. arXiv:2512.11609  [pdf, ps, other

    cs.RO

    UniBYD: A Unified Framework for Learning Robotic Manipulation Across Embodiments Beyond Imitation of Human Demonstrations

    Authors: Tingyu Yuan, Biaoliang Guan, Wen Ye, Ziyan Tian, Yi Yang, Weijie Zhou, Zhaowen Li, Yan Huang, Peng Wang, Chaoyang Zhao, Jinqiao Wang

    Abstract: In embodied intelligence, the embodiment gap between robotic and human hands brings significant challenges for learning from human demonstrations. Although some studies have attempted to bridge this gap using reinforcement learning, they remain confined to merely reproducing human manipulation, resulting in limited task performance. Moreover, current methods struggle to support diverse robotic han… ▽ More

    Submitted 10 March, 2026; v1 submitted 12 December, 2025; originally announced December 2025.