Showing 1–50 of 166 results for author: You, H

Searching in archive cs.
  1. arXiv:2603.19313  [pdf, ps, other]

    cs.CL cs.AI

    Memory-Driven Role-Playing: Evaluation and Enhancement of Persona Knowledge Utilization in LLMs

    Authors: Kai Wang, Haoyang You, Yang Zhang, Zhongjie Wang

    Abstract: A core challenge for faithful LLM role-playing is sustaining consistent characterization throughout long, open-ended dialogues, as models frequently fail to recall and accurately apply their designated persona knowledge without explicit cues. To tackle this, we propose the Memory-Driven Role-Playing paradigm. Inspired by Stanislavski's "emotional memory" acting theory, this paradigm frames persona…

    Submitted 14 March, 2026; originally announced March 2026.

    Comments: 34 pages

  2. arXiv:2603.05552  [pdf, ps, other]

    cs.RO

    TEGA: A Tactile-Enhanced Grasping Assistant for Assistive Robotics via Sensor Fusion and Closed-Loop Haptic Feedback

    Authors: Hengxu You, Tianyu Zhou, Fang Xu, Kaleb Smith, Eric Jing Du

    Abstract: Recent advances in teleoperation have enabled sophisticated manipulation of dexterous robotic hands, with most systems concentrating on guiding finger positions to achieve desired grasp configurations. However, while accurate finger positioning is essential, it often overlooks the equally critical task of grasp force modulation, vital for handling objects of diverse hardness, texture, and shape. T…

    Submitted 4 March, 2026; originally announced March 2026.

    Comments: Accepted to include in ICRA 2026

  3. arXiv:2603.03664  [pdf, ps, other]

    eess.SY cs.LG cs.MA math.OC

    Principled Learning-to-Communicate with Quasi-Classical Information Structures

    Authors: Xiangyu Liu, Haoyi You, Kaiqing Zhang

    Abstract: Learning-to-communicate (LTC) in partially observable environments has received increasing attention in deep multi-agent reinforcement learning, where the control and communication strategies are jointly learned. Meanwhile, the impact of communication on decision-making has been extensively studied in control theory. In this paper, we seek to formalize and better understand LTC by bridging these t…

    Submitted 3 March, 2026; originally announced March 2026.

    Comments: Preliminary version appeared at IEEE CDC 2025

  4. arXiv:2602.23629  [pdf, ps, other]

    stat.ML cs.LG math.ST stat.AP stat.ME

    Multivariate Spatio-Temporal Neural Hawkes Processes

    Authors: Christopher Chukwuemeka, Hojun You, Mikyoung Jun

    Abstract: We propose a Multivariate Spatio-Temporal Neural Hawkes Process for modeling complex multivariate event data with spatio-temporal dynamics. The proposed model extends continuous-time neural Hawkes processes by integrating spatial information into latent state evolution through learned temporal and spatial decay dynamics, enabling flexible modeling of excitation and inhibition without predefined tr…

    Submitted 1 March, 2026; v1 submitted 26 February, 2026; originally announced February 2026.

    Comments: 16 pages, 20 figures (including supplementary material)

    MSC Class: 60G55 (Primary); 62M30; 68T07 (Secondary)

    ACM Class: I.2.6; G.3

  5. arXiv:2602.01140  [pdf, ps, other]

    cs.LG

    Generalized Radius and Integrated Codebook Transforms for Differentiable Vector Quantization

    Authors: Haochen You, Heng Zhang, Hongyang He, Yuqi Li, Baojing Liu

    Abstract: Vector quantization (VQ) underpins modern generative and representation models by turning continuous latents into discrete tokens. Yet hard nearest-neighbor assignments are non-differentiable and are typically optimized with heuristic straight-through estimators, which couple the update step size to the quantization gap and train each code in isolation, leading to unstable gradients and severe cod…

    Submitted 1 February, 2026; originally announced February 2026.

    Comments: This paper has been accepted as a conference paper at CPAL 2026

  6. arXiv:2512.23745  [pdf, ps, other]

    cs.LG cs.SE

    A Comprehensive Study of Deep Learning Model Fixing Approaches

    Authors: Hanmo You, Zan Wang, Zishuo Dong, Luanqi Mo, Jianjun Zhao, Junjie Chen

    Abstract: Deep Learning (DL) has been widely adopted in diverse industrial domains, including autonomous driving, intelligent healthcare, and aided programming. Like traditional software, DL systems are also prone to faults, whose malfunctioning may expose users to significant risks. Consequently, numerous approaches have been proposed to address these issues. In this paper, we conduct a large-scale empiric…

    Submitted 26 December, 2025; originally announced December 2025.

  7. arXiv:2512.02652  [pdf, ps, other]

    cs.SD cs.AI cs.MM

    Pianist Transformer: Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training

    Authors: Hong-Jie You, Jie-Jing Shao, Xiao-Wen Yang, Lin-Han Jia, Lan-Zhe Guo, Yu-Feng Li

    Abstract: Existing methods for expressive music performance rendering rely on supervised learning over small labeled datasets, which limits scaling of both data volume and model size, despite the availability of vast unlabeled music, as in vision and language. To address this gap, we introduce Pianist Transformer, with four key contributions: 1) a unified Musical Instrument Digital Interface (MIDI) data rep…

    Submitted 2 December, 2025; originally announced December 2025.

  8. arXiv:2512.00961  [pdf, ps, other]

    cs.LG

    Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning

    Authors: Qi Wang, Mian Wu, Yuyang Zhang, Mingqi Yuan, Wenyao Zhang, Haoxiang You, Yunbo Wang, Xin Jin, Xiaokang Yang, Wenjun Zeng

    Abstract: Reinforcement Learning (RL) has achieved remarkable success in various domains, yet it often relies on carefully designed programmatic reward functions to guide agent behavior. Designing such reward functions can be challenging and may not generalize well across different tasks. To address this limitation, we leverage the rich world knowledge contained in pretrained video diffusion models to provi…

    Submitted 3 April, 2026; v1 submitted 30 November, 2025; originally announced December 2025.

    Comments: Accepted by CVPR 2026. Project page: https://qiwang067.github.io/genreward

  9. arXiv:2511.16417  [pdf, ps, other]

    cs.AI

    Pharos-ESG: A Framework for Multimodal Parsing, Contextual Narration, and Hierarchical Labeling of ESG Report

    Authors: Yan Chen, Yu Zou, Jialei Zeng, Haoran You, Xiaorui Zhou, Aixi Zhong

    Abstract: Environmental, Social, and Governance (ESG) principles are reshaping the foundations of global financial governance, transforming capital allocation architectures, regulatory frameworks, and systemic risk coordination mechanisms. However, as the core medium for assessing corporate ESG performance, the ESG reports present significant challenges for large-scale understanding, due to chaotic reading…

    Submitted 29 March, 2026; v1 submitted 20 November, 2025; originally announced November 2025.

    ACM Class: I.2.7

  10. arXiv:2511.14183  [pdf, ps, other]

    cs.CV

    UniSER: A Foundation Model for Unified Soft Effects Removal

    Authors: Jingdong Zhang, Lingzhi Zhang, Qing Liu, Mang Tik Chiu, Connelly Barnes, Yizhou Wang, Haoran You, Xiaoyang Liu, Yuqian Zhou, Zhe Lin, Eli Shechtman, Sohrab Amirghodsi, Xin Li, Wenping Wang, Xiaohang Zhan

    Abstract: Digital images are often degraded by soft effects such as lens flare, haze, shadows, and reflections, which reduce aesthetics even though the underlying pixels remain partially visible. The prevailing works address these degradations in isolation, developing highly specialized, specialist models that lack scalability and fail to exploit the shared underlying essences of these restoration problems.…

    Submitted 27 March, 2026; v1 submitted 18 November, 2025; originally announced November 2025.

  11. arXiv:2511.00911   

    cs.GR

    G2rammar: Bilingual Grammar Modeling for Enhanced Text-attributed Graph Learning

    Authors: Heng Zheng, Haochen You, Zijun Liu, Zijian Zhang, Lubin Gan, Hao Zhang, Wenjun Huang, Jin Huang

    Abstract: Text-attributed graphs require models to effectively integrate both structural topology and semantic content. Recent approaches apply large language models to graphs by linearizing structures into token sequences through random walks. These methods create concise graph vocabularies to replace verbose natural language descriptions. However, they overlook a critical component that makes language exp…

    Submitted 22 December, 2025; v1 submitted 2 November, 2025; originally announced November 2025.

    Comments: This submission has been withdrawn by the authors due to a fundamental error in the methodology that affects the validity of the main results

  12. arXiv:2511.00908   

    cs.CV cs.GR

    GraphGeo: Multi-Agent Debate Framework for Visual Geo-localization with Heterogeneous Graph Neural Networks

    Authors: Heng Zheng, Yuling Shi, Xiaodong Gu, Haochen You, Zijian Zhang, Lubin Gan, Hao Zhang, Wenjun Huang, Jin Huang

    Abstract: Visual geo-localization requires extensive geographic knowledge and sophisticated reasoning to determine image locations without GPS metadata. Traditional retrieval methods are constrained by database coverage and quality. Recent Large Vision-Language Models (LVLMs) enable direct location reasoning from image content, yet individual models struggle with diverse geographic regions and complex scene…

    Submitted 22 December, 2025; v1 submitted 2 November, 2025; originally announced November 2025.

    Comments: This submission has been withdrawn by the authors due to a fundamental error in the methodology that affects the validity of the main results

  13. arXiv:2511.00898   

    cs.GR

    Empowering LLMs with Structural Role Inference for Zero-Shot Graph Learning

    Authors: Heng Zhang, Jing Liu, Jiajun Wu, Haochen You, Lubin Gan, Yuling Shi, Xiaodong Gu, Zijian Zhang, Shuai Chen, Wenjun Huang, Jin Huang

    Abstract: Large Language Models have emerged as a promising approach for graph learning due to their powerful reasoning capabilities. However, existing methods exhibit systematic performance degradation on structurally important nodes such as bridges and hubs. We identify the root cause of these limitations. Current approaches encode graph topology into static features but lack reasoning scaffolds to transf…

    Submitted 22 December, 2025; v1 submitted 2 November, 2025; originally announced November 2025.

    Comments: This submission has been withdrawn by the authors due to a fundamental error in the methodology that affects the validity of the main results

  14. arXiv:2510.19074  [pdf, ps, other]

    cs.RO

    Sample-Based Hybrid Mode Control: Asymptotically Optimal Switching of Algorithmic and Non-Differentiable Control Modes

    Authors: Yilang Liu, Haoxiang You, Ian Abraham

    Abstract: This paper investigates a sample-based solution to the hybrid mode control problem across non-differentiable and algorithmic hybrid modes. Our approach reasons about a set of hybrid control modes as an integer-based optimization problem where we select what mode to apply, when to switch to another mode, and the duration for which we are in a given control mode. A sample-based variation is derived…

    Submitted 5 March, 2026; v1 submitted 21 October, 2025; originally announced October 2025.

  15. arXiv:2510.14622  [pdf, ps, other]

    cs.DC

    MPI-over-CXL: Enhancing Communication Efficiency in Distributed HPC Systems

    Authors: Miryeong Kwon, Donghyun Gouk, Hyein Woo, Junhee Kim, Jinwoo Baek, Kyungkuk Nam, Sangyoon Ji, Jiseon Kim, Hanyeoreum Bae, Junhyeok Jang, Hyunwoo You, Junseok Moon, Myoungsoo Jung

    Abstract: MPI implementations commonly rely on explicit memory-copy operations, incurring overhead from redundant data movement and buffer management. This overhead notably impacts HPC workloads involving intensive inter-processor communication. In response, we introduce MPI-over-CXL, a novel MPI communication paradigm leveraging CXL, which provides cache-coherent shared memory across multiple hosts. MPI-ov…

    Submitted 16 October, 2025; originally announced October 2025.

  16. arXiv:2510.12094  [pdf, ps, other]

    cs.LG cs.GR

    H4G: Unlocking Faithful Inference for Zero-Shot Graph Learning in Hyperbolic Space

    Authors: Heng Zhang, Tianyi Zhang, Zijun Liu, Yuling Shi, Yaomin Shen, Haochen You, Haichuan Hu, Lubin Gan, Jin Huang

    Abstract: Text-attributed graphs are widely used across domains, offering rich opportunities for zero-shot learning via graph-text alignment. However, existing methods struggle with tasks requiring fine-grained pattern recognition, particularly on heterophilic graphs. Through empirical and theoretical analysis, we identify an over-abstraction problem: current approaches operate at excessively large…

    Submitted 13 October, 2025; originally announced October 2025.

  17. arXiv:2510.12085   

    cs.LG cs.GR

    GraphShaper: Geometry-aware Alignment for Improving Transfer Learning in Text-Attributed Graphs

    Authors: Heng Zhang, Tianyi Zhang, Yuling Shi, Xiaodong Gu, Yaomin Shen, Haochen You, Zijian Zhang, Yilei Yuan, Jin Huang

    Abstract: Graph foundation models represent a transformative paradigm for learning transferable representations across diverse graph domains. Recent methods leverage large language models to unify graph and text modalities into a shared representation space using contrastive learning. However, systematic evaluations reveal significant performance degradation at structural boundaries where distinct topologic…

    Submitted 22 December, 2025; v1 submitted 13 October, 2025; originally announced October 2025.

    Comments: This submission has been withdrawn by the authors due to a fundamental error in the methodology that affects the validity of the main results

  18. arXiv:2510.10611  [pdf, ps, other]

    cs.MA cs.GR

    HyperAgent: Leveraging Hypergraphs for Topology Optimization in Multi-Agent Communication

    Authors: Heng Zhang, Yuling Shi, Xiaodong Gu, Zijian Zhang, Haochen You, Lubin Gan, Yilei Yuan, Jin Huang

    Abstract: Recent advances in large language model-powered multi-agent systems have demonstrated remarkable collective intelligence through effective communication. However, existing approaches face two primary challenges: (i) Ineffective group collaboration modeling, as they rely on pairwise edge representations in graph structures, limiting their ability to capture relationships among multiple age…

    Submitted 25 February, 2026; v1 submitted 12 October, 2025; originally announced October 2025.

    Comments: This submission has been withdrawn by the authors due to a fundamental error in the methodology that affects the validity of the main results

  19. arXiv:2510.10585  [pdf, ps, other]

    cs.GR

    D3MAS: Decompose, Deduce, and Distribute for Enhanced Knowledge Sharing in Multi-Agent Systems

    Authors: Heng Zhang, Yuling Shi, Xiaodong Gu, Haochen You, Zijian Zhang, Lubin Gan, Yilei Yuan, Jin Huang

    Abstract: Multi-agent systems powered by large language models exhibit strong capabilities in collaborative problem-solving. However, these systems suffer from substantial knowledge redundancy. Agents duplicate efforts in retrieval and reasoning processes. This inefficiency stems from a deeper issue: current architectures lack mechanisms to ensure agents share minimal sufficient information at each operatio…

    Submitted 25 February, 2026; v1 submitted 12 October, 2025; originally announced October 2025.

    Comments: This submission has been withdrawn by the authors due to a fundamental error in the methodology that affects the validity of the main results

  20. arXiv:2510.10581   

    cs.GR

    GraphTracer: Graph-Guided Failure Tracing in LLM Agents for Robust Multi-Turn Deep Search

    Authors: Heng Zhang, Yuling Shi, Xiaodong Gu, Haochen You, Zijian Zhang, Lubin Gan, Yilei Yuan, Jin Huang

    Abstract: Multi-agent systems powered by Large Language Models excel at complex tasks through coordinated collaboration, yet they face high failure rates in multi-turn deep search scenarios. Existing temporal attribution methods struggle to accurately diagnose root causes, particularly when errors propagate across multiple agents. Attempts to automate failure attribution by analyzing action sequences remain…

    Submitted 22 December, 2025; v1 submitted 12 October, 2025; originally announced October 2025.

    Comments: This submission has been withdrawn by the authors due to a fundamental error in the methodology that affects the validity of the main results

  21. arXiv:2510.01585  [pdf, ps, other]

    cs.CL cs.NI

    ReSSFormer: A Recursive Sparse Structured Transformer for Scalable and Long-Context Reasoning

    Authors: Haochen You, Baojing Liu

    Abstract: While Transformer architectures have demonstrated impressive scalability across domains, they continue to face challenges in long-context reasoning, computational efficiency, and structural generalization - largely due to rigid layer stacking, dense attention, and reliance on positional encodings. We present ReSSFormer, a Recursive Sparse Structured Transformer that integrates three complementary…

    Submitted 1 October, 2025; originally announced October 2025.

    Comments: Accepted as a short paper at ACM Multimedia Asia 2025

  22. arXiv:2510.01578  [pdf, ps, other]

    cs.LG

    Gradient Shaping Beyond Clipping: A Functional Perspective on Update Magnitude Control

    Authors: Haochen You, Baojing Liu

    Abstract: Gradient clipping is widely used to stabilize deep network training, but its formulation as a hard, fixed threshold limits flexibility and ignores gradient distribution dynamics. We propose SPAMP (Statistical Per-layer Adaptive Modulation and Projection), a unified framework that generalizes clipping into smooth, per-layer gradient shaping. SPAMP tracks local gradient statistics, dynamically estim…

    Submitted 1 October, 2025; originally announced October 2025.

    Comments: Accepted as a conference paper at ACM Multimedia Asia 2025

  23. arXiv:2509.22294  [pdf, ps, other]

    cs.LG math.CO

    A Multi-Level Framework for Multi-Objective Hypergraph Partitioning: Combining Minimum Spanning Tree and Proximal Gradient

    Authors: Yingying Li, Mingxuan Xie, Hailong You, Yongqiang Yao, Hongwei Liu

    Abstract: This paper proposes an efficient hypergraph partitioning framework based on a novel multi-objective non-convex constrained relaxation model. A modified accelerated proximal gradient algorithm is employed to generate diverse $k$-dimensional vertex features to avoid local optima and enhance partition quality. Two MST-based strategies are designed for different data scales: for small-scale data, the…

    Submitted 26 September, 2025; originally announced September 2025.

  24. arXiv:2509.21526  [pdf, ps, other]

    cs.LG cs.CV

    TRiCo: Triadic Game-Theoretic Co-Training for Robust Semi-Supervised Learning

    Authors: Hongyang He, Xinyuan Song, Yangfan He, Zeyu Zhang, Yanshu Li, Haochen You, Lifan Sun, Wenqiao Zhang

    Abstract: We introduce TRiCo, a novel triadic game-theoretic co-training framework that rethinks the structure of semi-supervised learning by incorporating a teacher, two students, and an adversarial generator into a unified training paradigm. Unlike existing co-training or teacher-student approaches, TRiCo formulates SSL as a structured interaction among three roles: (i) two student classifiers trained on…

    Submitted 30 November, 2025; v1 submitted 25 September, 2025; originally announced September 2025.

    Comments: Accepted by NeurIPS 2025

  25. arXiv:2509.16197  [pdf, ps, other]

    cs.CV cs.CL cs.LG

    MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

    Authors: Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, Hyunjik Kim, Chao Jia, Zhenbang Wang, Yinfei Yang, Mingfei Gao, Zi-Yi Dou, Wenze Hu, Chang Gao, Dongxu Li, Philipp Dufter, Zirui Wang, Guoli Yin, Zhengdong Zhang, Chen Chen, Yang Zhao, et al. (2 additional authors not shown)

    Abstract: Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training re…

    Submitted 19 September, 2025; originally announced September 2025.

  26. arXiv:2509.12715   

    cs.CV cs.RO

    AsyMoE: Leveraging Modal Asymmetry for Enhanced Expert Specialization in Large Vision-Language Models

    Authors: Heng Zhang, Haichuan Hu, Yaomin Shen, Weihao Yu, Yilei Yuan, Haochen You, Guo Cheng, Zijian Zhang, Lubin Gan, Huihui Wei, Hao Zhang, Jin Huang

    Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. However, existing Mixture of Experts (MoE) approaches face challenges due to the asymmetry between visual and linguistic processing. Visual information is spatially complete, while language requires maintaining sequential context. As a result, MoE m…

    Submitted 22 December, 2025; v1 submitted 16 September, 2025; originally announced September 2025.

    Comments: This submission has been withdrawn by the authors due to a fundamental error in the methodology that affects the validity of the main results

  27. arXiv:2509.12638  [pdf, ps, other]

    cs.CE

    FinSentLLM: Multi-LLM and Structured Semantic Signals for Enhanced Financial Sentiment Forecasting

    Authors: Zijian Zhang, Rong Fu, Yangfan He, Xinze Shen, Yanlong Wang, Xiaojing Du, Haochen You, Jiazhao Shi, Simon Fong

    Abstract: Financial sentiment analysis (FSA) has attracted significant attention, and recent studies increasingly explore large language models (LLMs) for this field. Yet most work evaluates only classification metrics, leaving unclear whether sentiment signals align with market behavior. We propose FinSentLLM, a lightweight multi-LLM framework that integrates an expert panel of sentiment forecasting LLMs,…

    Submitted 15 September, 2025; originally announced September 2025.

  28. arXiv:2509.08742  [pdf, ps, other]

    q-fin.CP cs.AI

    FinZero: Launching Multi-modal Financial Time Series Forecast with Large Reasoning Model

    Authors: Yanlong Wang, Jian Xu, Fei Ma, Hongkang Zhang, Hang Yu, Tiantian Gao, Yu Wang, Haochen You, Shao-Lun Huang, Danny Dongning Sun, Xiao-Ping Zhang

    Abstract: Financial time series forecasting is both highly significant and challenging. Previous approaches typically standardized time series data before feeding it into forecasting models, but this encoding process inherently leads to a loss of important information. Moreover, past time series models generally require fixed numbers of variables or lookback window lengths, which further limits the scalabil…

    Submitted 10 September, 2025; originally announced September 2025.

  29. arXiv:2509.08232  [pdf, ps, other]

    cs.CV

    GTA-Crime: A Synthetic Dataset and Generation Framework for Fatal Violence Detection with Adversarial Snippet-Level Domain Adaptation

    Authors: Seongho Kim, Sejong Ryu, Hyoukjun You, Je Hyeong Hong

    Abstract: Recent advancements in video anomaly detection (VAD) have enabled identification of various criminal activities in surveillance videos, but detecting fatal incidents such as shootings and stabbings remains difficult due to their rarity and ethical issues in data collection. Recognizing this limitation, we introduce GTA-Crime, a fatal video anomaly dataset and generation framework using Grand Theft…

    Submitted 9 September, 2025; originally announced September 2025.

  30. arXiv:2509.06219  [pdf, ps, other]

    cs.LG cs.MM

    MCIGLE: Multimodal Exemplar-Free Class-Incremental Graph Learning

    Authors: Haochen You, Baojing Liu

    Abstract: Exemplar-free class-incremental learning enables models to learn new classes over time without storing data from old ones. As multimodal graph-structured data becomes increasingly prevalent, existing methods struggle with challenges like catastrophic forgetting, distribution bias, memory limits, and weak generalization. We propose MCIGLE, a novel framework that addresses these issues by extracting…

    Submitted 7 September, 2025; originally announced September 2025.

    Comments: Accepted as a conference paper at KSEM 2025

  31. arXiv:2509.06214  [pdf, ps, other]

    cs.LG

    Metric Embedding Initialization-Based Differentially Private and Explainable Graph Clustering

    Authors: Haochen You, Baojing Liu

    Abstract: Graph clustering under the framework of differential privacy, which aims to process graph-structured data while protecting individual privacy, has been receiving increasing attention. Despite significant achievements in current research, challenges such as high noise, low efficiency and poor interpretability continue to severely constrain the development of this field. In this paper, we construct…

    Submitted 7 September, 2025; originally announced September 2025.

    Comments: Accepted as a conference paper at KSEM 2025

  32. arXiv:2508.17426  [pdf, ps, other]

    cs.LG

    Modular MeanFlow: Towards Stable and Scalable One-Step Generative Modeling

    Authors: Haochen You, Baojing Liu, Hongyang He

    Abstract: One-step generative modeling seeks to generate high-quality data samples in a single function evaluation, significantly improving efficiency over traditional diffusion or flow-based models. In this work, we introduce Modular MeanFlow (MMF), a flexible and theoretically grounded approach for learning time-averaged velocity fields. Our method derives a family of loss functions based on a differentia…

    Submitted 24 August, 2025; originally announced August 2025.

    Comments: Accepted as a conference paper at PRCV 2025

  33. arXiv:2508.17254  [pdf, ps, other]

    cs.CV cs.AI

    A biological vision inspired framework for machine perception of abutting grating illusory contours

    Authors: Xiao Zhang, Kai-Fu Yang, Xian-Shi Zhang, Hong-Zhi You, Hong-Mei Yan, Yong-Jie Li

    Abstract: Higher levels of machine intelligence demand alignment with human perception and cognition. Deep neural networks (DNN) dominated machine intelligence have demonstrated exceptional performance across various real-world tasks. Nevertheless, recent evidence suggests that DNNs fail to perceive illusory contours like the abutting grating, a discrepancy that misaligns with human perception patterns. Dep…

    Submitted 24 August, 2025; originally announced August 2025.

  34. arXiv:2508.12149  [pdf, ps, other]

    cs.AI

    MOVER: Multimodal Optimal Transport with Volume-based Embedding Regularization

    Authors: Haochen You, Baojing Liu

    Abstract: Recent advances in multimodal learning have largely relied on pairwise contrastive objectives to align different modalities, such as text, video, and audio, in a shared embedding space. While effective in bi-modal setups, these approaches struggle to generalize across multiple modalities and often lack semantic structure in high-dimensional spaces. In this paper, we propose MOVER, a novel framewor…

    Submitted 16 August, 2025; originally announced August 2025.

    Comments: Accepted as a conference paper at CIKM 2025

  35. arXiv:2507.20198  [pdf, ps, other]

    cs.CV

    A Survey of Token Compression for Efficient Multimodal Large Language Models

    Authors: Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang

    Abstract: Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio input. While this ability significantly enhances MLLM capabilities, it introduces substantial computational challenges, primarily due to the quadratic complexity of self-…

    Submitted 1 February, 2026; v1 submitted 27 July, 2025; originally announced July 2025.

    Comments: For ongoing updates and to track the latest advances in this promising area, we maintain a public repository: https://github.com/cokeshao/Awesome-Multimodal-Token-Compression

  36. arXiv:2507.14204  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models

    Authors: Dachuan Shi, Yonggan Fu, Xiangchi Yuan, Zhongzhi Yu, Haoran You, Sixu Li, Xin Dong, Jan Kautz, Pavlo Molchanov, Yingyan Lin

    Abstract: Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As sequence lengths increase, the number of Key-Value (KV) pairs in LLMs escalates, creating a significant efficiency bottleneck. In this paper, we propose a new K…

    Submitted 14 July, 2025; originally announced July 2025.

    Comments: ICML 2025. Code: https://github.com/GATECH-EIC/LaCache

  37. arXiv:2507.13575  [pdf, ps, other]

    cs.LG cs.AI

    Apple Intelligence Foundation Language Models: Tech Report 2025

    Authors: Ethan Li, Anders Boesen Lindbo Larsen, Chen Zhang, Xiyou Zhou, Jun Qin, Dian Ang Yap, Narendran Raghavan, Xuankai Chang, Margit Bowler, Eray Yildiz, John Peebles, Hannah Gillis Coleman, Matteo Ronchi, Peter Gray, Keen You, Anthony Spalvieri-Kruse, Ruoming Pang, Reed Li, Yuli Yang, Emad Soroush, Zhiyun Lu, Crystal Xiao, Rong Situ, Jordan Huffaker, David Griffiths, et al. (373 additional authors not shown)

    Abstract: We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: (i) a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and (ii) a scalable server model built on a novel Parallel-Track Mixture-of-Experts (PT-MoE) transform…

    Submitted 27 August, 2025; v1 submitted 17 July, 2025; originally announced July 2025.

  38. arXiv:2507.13405  [pdf, ps, other]

    cs.CV cs.LG

    COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark

    Authors: Ishant Chintapatla, Kazuma Choji, Naaisha Agarwal, Andrew Lin, Hannah You, Charles Duong, Kevin Zhu, Sean O'Brien, Vasu Sharma

    Abstract: Recently, many benchmarks and datasets have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA) pairs, and models have shown significant accuracy improvements. However, these benchmarks rarely test the model's ability to accurately complete visual entailment, for instance, accepting or refuting a hypothesis based on the image. To address this, we propose…

    Submitted 17 July, 2025; originally announced July 2025.

  39. arXiv:2507.12483  [pdf, ps, other]

    cs.SE

    A Survey of Reinforcement Learning for Software Engineering

    Authors: Dong Wang, Hanmo You, Lingwei Zhu, Kaiwei Lin, Zheng Chen, Chen Yang, Junji Yu, Zan Wang, Junjie Chen

    Abstract: Reinforcement Learning (RL) has emerged as a powerful paradigm for sequential decision-making and has attracted growing interest across various domains, particularly following the advent of Deep Reinforcement Learning (DRL) in 2015. Simultaneously, the rapid advancement of Large Language Models (LLMs) has further fueled interest in integrating RL with LLMs to enable more adaptive and intelligent s…

    Submitted 14 July, 2025; originally announced July 2025.

  40. arXiv:2507.01160  [pdf, ps, other]

    cs.CL

    Event-based evaluation of abstractive news summarization

    Authors: Huiling You, Samia Touileb, Erik Velldal, Lilja Øvrelid

    Abstract: An abstractive summary of a news article contains its most important information in a condensed version. The evaluation of automatically generated summaries by generative language models relies heavily on human-authored summaries as gold references, by calculating overlapping units or similarity scores. News articles report events, and ideally so should the summaries. In this work, we propose to e…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: to appear at GEM2 workshop@ACL 2025

  41. arXiv:2506.15613  [pdf, ps, other]

    cs.AR

    From Block to Byte: Transforming PCIe SSDs with CXL Memory Protocol and Instruction Annotation

    Authors: Miryeong Kwon, Donghyun Gouk, Junhyeok Jang, Jinwoo Baek, Hyunwoo You, Sangyoon Ji, Hongjoo Jung, Junseok Moon, Seungkwan Kang, Seungjun Lee, Myoungsoo Jung

    Abstract: This paper explores how Compute Express Link (CXL) can transform PCIe-based block storage into a scalable, byte-addressable working memory. We address the challenges of adapting block storage to CXL's memory-centric model by emphasizing cacheability as a key enabler and advocating for Type 3 endpoint devices, referred to as CXL-SSDs. To validate our approach, we prototype a CXL-SSD on a custom FPG…

    Submitted 18 June, 2025; originally announced June 2025.

  42. arXiv:2506.12530  [pdf, ps, other]

    cs.CV

    Towards Seamless Borders: A Method for Mitigating Inconsistencies in Image Inpainting and Outpainting

    Authors: Xingzhong Hou, Jie Wu, Boxiao Liu, Yi Zhang, Guanglu Song, Yunpeng Liu, Yu Liu, Haihang You

    Abstract: Image inpainting is the task of reconstructing missing or damaged parts of an image in a way that seamlessly blends with the surrounding content. With the advent of advanced generative models, especially diffusion models and generative adversarial networks, inpainting has achieved remarkable improvements in visual quality and coherence. However, achieving seamless continuity remains a significant…

    Submitted 14 June, 2025; originally announced June 2025.

  43. arXiv:2505.21334  [pdf, ps, other]

    cs.CV

    HoliTom: Holistic Token Merging for Fast Video Large Language Models

    Authors: Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang

    Abstract: Video large language models (video LLMs) excel at video comprehension but face significant computational inefficiency due to redundant video tokens. Existing token pruning methods offer solutions. However, approaches operating within the LLM (inner-LLM pruning), such as FastV, incur intrinsic computational overhead in shallow layers. In contrast, methods performing token pruning before the LLM (ou…

    Submitted 10 October, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

    Comments: code link: https://github.com/cokeshao/HoliTom

  44. arXiv:2505.19952  [pdf, ps, other]

    cs.CV cs.IR

    Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval

    Authors: Rong-Cheng Tu, Wenhao Sun, Hanzhe You, Yingjie Wang, Jiaxing Huang, Li Shen, Dacheng Tao

    Abstract: Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a compositional query, consisting of a reference image and a modifying text, without relying on annotated training data. Existing approaches often generate a synthetic target text using large language models (LLMs) to serve as an intermediate anchor between the compositional query and the target image. Models are then…

    Submitted 26 May, 2025; originally announced May 2025.

  45. arXiv:2505.10646  [pdf, ps, other]

    cs.LG cs.RO

    Accelerating Visual-Policy Learning through Parallel Differentiable Simulation

    Authors: Haoxiang You, Yilang Liu, Ian Abraham

    Abstract: In this work, we propose a computationally efficient algorithm for visual policy learning that leverages differentiable simulation and first-order analytical policy gradients. Our approach decouples the rendering process from the computation graph, enabling seamless integration with existing differentiable simulation ecosystems without the need for specialized differentiable rendering software. Thi…

    Submitted 10 November, 2025; v1 submitted 15 May, 2025; originally announced May 2025.

  46. arXiv:2505.08854  [pdf, ps, other]

    cs.CV cs.AI cs.RO

    Generative AI for Autonomous Driving: Frontiers and Opportunities

    Authors: Yuping Wang, Shuo Xing, Cui Can, Renjie Li, Hongyuan Hua, Kexin Tian, Zhaobin Mo, Xiangbo Gao, Keshu Wu, Sulong Zhou, Hengxu You, Juntong Peng, Junge Zhang, Zehao Wang, Rui Song, Mingxuan Yan, Walter Zimmer, Xingcheng Zhou, Peiran Li, Zhaohan Lu, Chia-Ju Chen, Yue Huang, Ryan A. Rossi, Lichao Sun, Hongkai Yu, et al. (22 additional authors not shown)

    Abstract: Generative Artificial Intelligence (GenAI) constitutes a transformative technological wave that reconfigures industries through its unparalleled capabilities for content creation, reasoning, planning, and multimodal understanding. This revolutionary force offers the most promising path yet toward solving one of engineering's grandest challenges: achieving reliable, fully autonomous driving, partic…

    Submitted 13 May, 2025; originally announced May 2025.

  47. arXiv:2504.09606  [pdf, other]

    cs.CV

    Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training

    Authors: Lexington Whalen, Zhenbang Du, Haoran You, Chaojian Li, Sixu Li, Yingyan Lin

    Abstract: Training diffusion models (DMs) requires substantial computational resources due to multiple forward and backward passes across numerous timesteps, motivating research into efficient training techniques. In this paper, we propose EB-Diff-Train, a new efficient DM training approach that is orthogonal to other methods of accelerating DM training, by investigating and leveraging Early-Bird (EB) ticke…

    Submitted 13 April, 2025; originally announced April 2025.

    Comments: 10 pages, 5 figures. Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025

  48. arXiv:2503.16257  [pdf, ps, other]

    cs.CV

    Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models

    Authors: Keda Tao, Haoxuan You, Yang Sui, Can Qin, Huan Wang

    Abstract: Video large language models (VideoLLMs) have demonstrated the capability to process longer video inputs and enable complex reasoning and analysis. However, due to the thousands of visual tokens from the video frames, the key-value (KV) cache can significantly increase memory requirements, becoming a bottleneck for inference speed and memory usage. KV cache quantization is a widely used approach to…

    Submitted 28 September, 2025; v1 submitted 20 March, 2025; originally announced March 2025.

    Comments: 12 pages

  49. arXiv:2503.02171  [pdf, other]

    cs.LG math.OC

    Is Bellman Equation Enough for Learning Control?

    Authors: Haoxiang You, Lekan Molu, Ian Abraham

    Abstract: The Bellman equation and its continuous-time counterpart, the Hamilton-Jacobi-Bellman (HJB) equation, serve as necessary conditions for optimality in reinforcement learning and optimal control. While the value function is known to be the unique solution to the Bellman equation in tabular settings, we demonstrate that this uniqueness fails to hold in continuous state spaces. Specifically, for linea…

    Submitted 5 March, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

  50. arXiv:2502.12669  [pdf, ps, other]

    cs.AI

    Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research

    Authors: Xiang Liu, Penglei Sun, Shuyan Chen, Longhan Zhang, Peijie Dong, Huajie You, Yongqi Zhang, Chang Yan, Xiaowen Chu, Tong-yi Zhang

    Abstract: The rapid advancement of perovskite solar cells (PSCs) has led to an exponential growth in research publications, creating an urgent need for efficient knowledge management and reasoning systems in this domain. We present a comprehensive knowledge-enhanced system for PSCs that integrates three key components. First, we develop Perovskite-KG, a domain-specific knowledge graph constructed from 1,517…

    Submitted 9 October, 2025; v1 submitted 18 February, 2025; originally announced February 2025.

    Comments: EMNLP 2025 Findings; NeurIPS 2025 AI for Science Workshop