Showing 1–5 of 5 results for author: Zhuo, F

Searching in archive cs.
  1. arXiv:2604.07958  [pdf, ps, other]

    cs.CV

    ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks

    Authors: Jiayang Xu, Fan Zhuo, Majun Zhang, Changhao Pan, Zehan Wang, Siyu Chen, Xiaoda Yang, Tao Jin, Zhou Zhao

    Abstract: Current video editing models often rely on expensive paired video data, which limits their practical scalability. In essence, most video editing tasks can be formulated as a decoupled spatiotemporal process, where the temporal dynamics of the pretrained model are preserved while spatial content is selectively and precisely modified. Based on this insight, we propose ImVideoEdit, an efficient frame…

    Submitted 9 April, 2026; originally announced April 2026.

  2. arXiv:2603.14889  [pdf, ps, other]

    eess.AS cs.CL cs.LG

    Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness

    Authors: Jingyu Lu, Yuhan Wang, Fan Zhuo, Xize Cheng, Changhao Pan, Xueyi Pu, Yifu Chen, Chenyuhao Wen, Tianle Liang, Zhou Zhao

    Abstract: The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challe…

    Submitted 16 March, 2026; originally announced March 2026.

  3. arXiv:2512.02622  [pdf, ps, other]

    cs.CV

    RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence

    Authors: Xuming He, Zehao Fan, Hengjia Li, Fan Zhuo, Hankun Xu, Senlin Cheng, Di Weng, Haifeng Liu, Can Ye, Boxi Wu

    Abstract: Recent advances in video generation have enabled the synthesis of videos with strong temporal consistency and impressive visual quality, marking a crucial step toward vision foundation models. To evaluate these video generation models, existing benchmarks primarily focus on factors related to visual perception and understanding, like visual aesthetics, instruction adherence, and temporal coherence…

    Submitted 2 December, 2025; originally announced December 2025.

  4. arXiv:2507.01884  [pdf, ps, other]

    cs.CV

    Self-Reinforcing Prototype Evolution with Dual-Knowledge Cooperation for Semi-Supervised Lifelong Person Re-Identification

    Authors: Kunlun Xu, Fan Zhuo, Jiangmeng Li, Xu Zou, Jiahuan Zhou

    Abstract: Current lifelong person re-identification (LReID) methods predominantly rely on fully labeled data streams. However, in real-world scenarios where annotation resources are limited, a vast amount of unlabeled data coexists with scarce labeled samples, leading to the Semi-Supervised LReID (Semi-LReID) problem where LReID methods suffer severe performance degradation. Existing LReID methods, even whe…

    Submitted 23 July, 2025; v1 submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV 2025

  5. arXiv:2409.06307  [pdf, other]

    cs.SD cs.AI eess.AS

    An End-to-End Approach for Chord-Conditioned Song Generation

    Authors: Shuochen Gao, Shun Lei, Fan Zhuo, Hangyu Liu, Feng Liu, Boshi Tang, Qiaochu Huang, Shiyin Kang, Zhiyong Wu

    Abstract: The Song Generation task aims to synthesize music composed of vocals and accompaniment from given lyrics. While the existing method, Jukebox, has explored this task, its constrained control over the generations often leads to deficiency in music performance. To mitigate the issue, we introduce an important concept from music composition, namely chords, to song generation networks. Chords form the…

    Submitted 10 September, 2024; originally announced September 2024.