-
GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality
Authors:
Zhiwei Zhang,
Xingyuan Zeng,
Xinkai Kong,
Kunquan Zhang,
Haoyuan Liang,
Bohan Shi,
Juepeng Zheng,
Jianxi Huang,
Yutong Lu,
Haohuan Fu
Abstract:
Agricultural parcel extraction plays an important role in remote sensing-based agricultural monitoring, supporting parcel surveying, precision management, and ecological assessment. However, existing public benchmarks mainly focus on regular and relatively flat farmland scenes. In contrast, terraced parcels in mountainous regions exhibit stepped terrain, pronounced elevation variation, irregular boundaries, and strong cross-regional heterogeneity, making parcel extraction a more challenging problem that jointly requires visual recognition, semantic discrimination, and terrain-aware geometric understanding. Although recent studies have advanced visual parcel benchmarks and image-text farmland understanding, a unified benchmark for complex terraced parcel extraction under aligned image-text-DEM settings remains absent. To fill this gap, we present GTPBD-MM, the first multimodal benchmark for global terraced parcel extraction. Built upon GTPBD, GTPBD-MM integrates high-resolution optical imagery, structured text descriptions, and DEM data, and supports systematic evaluation under Image-only, Image+Text, and Image+Text+DEM settings. We further propose Elevation and Text guided Terraced parcel network (ETTerra), a multimodal baseline for terraced parcel delineation. Extensive experiments demonstrate that textual semantics and terrain geometry provide complementary cues beyond visual appearance alone, yielding more accurate, coherent, and structurally consistent delineation results in complex terraced scenes.
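The paper does not spell out its evaluation metrics in this abstract, but parcel-delineation quality is commonly scored with mask IoU between predicted and reference parcels under each input setting (Image-only, Image+Text, Image+Text+DEM). A minimal sketch of that metric (the function and toy arrays are illustrative, not from GTPBD-MM):

```python
import numpy as np

def mask_iou(pred: np.ndarray, ref: np.ndarray) -> float:
    """Intersection-over-union between two binary parcel masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    inter = np.logical_and(pred, ref).sum()
    union = np.logical_or(pred, ref).sum()
    return float(inter / union) if union else 1.0

# Toy 4x4 masks: predicted parcel vs. reference parcel.
pred = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]])
ref  = np.array([[1, 1, 1, 0],
                 [1, 1, 1, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]])
print(mask_iou(pred, ref))  # 4 / 6
```

The same score would be computed per modality setting to quantify how much the text and DEM inputs improve delineation.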
Submitted 14 April, 2026;
originally announced April 2026.
-
NTIRE 2026 Challenge on Short-form UGC Video Restoration in the Wild with Generative Models: Datasets, Methods and Results
Authors:
Xin Li,
Jiachao Gong,
Xijun Wang,
Shiyao Xiong,
Bingchen Li,
Suhang Yao,
Chao Zhou,
Zhibo Chen,
Radu Timofte,
Yuxiang Chen,
Shibo Yin,
Yilian Zhong,
Yushun Fang,
Xilei Zhu,
Yahui Wang,
Chen Lu,
Meisong Zheng,
Xiaoxu Chen,
Jing Yang,
Zhaokun Hu,
Jiahui Liu,
Ying Chen,
Haoran Bai,
Sibin Deng,
Shengxi Li
, et al. (53 additional authors not shown)
Abstract:
This paper presents an overview of the NTIRE 2026 Challenge on Short-form UGC Video Restoration in the Wild with Generative Models. This challenge utilizes a new short-form UGC (S-UGC) video restoration benchmark, termed KwaiVIR, which is contributed by USTC and Kuaishou Technology. It contains both synthetically distorted videos and real-world short-form UGC videos in the wild. For this edition, the released data include 200 synthetic training videos, 48 wild training videos, 11 validation videos, and 20 testing videos. The primary goal of this challenge is to establish a strong and practical benchmark for restoring short-form UGC videos under complex real-world degradations, especially in the emerging paradigm of generative-model-based S-UGC video restoration. This challenge has two tracks: (i) the primary track is a subjective track, where the evaluation is based on a user study; (ii) the second track is an objective track. These two tracks enable a comprehensive assessment of restoration quality. In total, 95 teams registered for this competition, and 12 teams submitted valid final solutions and fact sheets for the testing phase. The submitted methods achieved strong performance on the KwaiVIR benchmark, demonstrating encouraging progress in short-form UGC video restoration in the wild.
Submitted 12 April, 2026;
originally announced April 2026.
-
HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
Authors:
Xinyu Zhang,
Zurong Mai,
Qingmei Li,
Zjin Liao,
Yibin Wen,
Yuhang Chen,
Xiaoya Fan,
Chan Tsz Ho,
Bi Tianyuan,
Haoyuan Liang,
Ruifeng Su,
Zihao Qian,
Juepeng Zheng,
Jianxi Huang,
Yutong Lu,
Haohuan Fu
Abstract:
While multimodal large language models (MLLMs) have made significant strides in natural image understanding, their ability to perceive and reason over hyperspectral images (HSI), a vital modality in remote sensing, remains underexplored. The high dimensionality and intricate spectral-spatial properties of HSI pose unique challenges for models primarily trained on RGB data. To address this gap, we introduce the Hyperspectral Multimodal Benchmark (HM-Bench), the first benchmark designed specifically to evaluate MLLMs in HSI understanding. We curate a large-scale dataset of 19,337 question-answer pairs across 13 task categories, ranging from basic perception to spectral reasoning. Given that existing MLLMs are not equipped to process raw hyperspectral cubes natively, we propose a dual-modality evaluation framework that transforms HSI data into two complementary representations: PCA-based composite images and structured textual reports. This approach facilitates a systematic comparison of different representations in terms of model performance. Extensive evaluations on 18 representative MLLMs reveal significant difficulties in handling complex spatial-spectral reasoning tasks. Furthermore, our results demonstrate that visual inputs generally outperform textual inputs, highlighting the importance of grounding in spectral-spatial evidence for effective HSI understanding. The dataset and appendix can be accessed at https://github.com/HuoRiLi-Yu/HM-Bench.
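The PCA-based composites mentioned above can be produced by projecting each pixel's spectrum onto the cube's top principal components. A minimal numpy sketch assuming an (H, W, B) cube (illustrative, not HM-Bench's exact pipeline):

```python
import numpy as np

def pca_composite(cube: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project an (H, W, B) hyperspectral cube onto its top principal
    components, yielding an (H, W, n_components) false-colour composite."""
    h, w, b = cube.shape
    x = cube.reshape(-1, b).astype(np.float64)
    x -= x.mean(axis=0)                          # centre each band
    cov = x.T @ x / (x.shape[0] - 1)             # B x B band covariance
    vals, vecs = np.linalg.eigh(cov)
    top = vecs[:, np.argsort(vals)[::-1][:n_components]]
    return (x @ top).reshape(h, w, n_components)

cube = np.random.rand(16, 16, 64)                # toy 64-band cube
rgb = pca_composite(cube)
print(rgb.shape)                                 # (16, 16, 3)
```

The three-component output can then be rescaled to [0, 255] and fed to an RGB-only MLLM.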
Submitted 9 April, 2026;
originally announced April 2026.
-
HCRE: LLM-based Hierarchical Classification for Cross-Document Relation Extraction with a Prediction-then-Verification Strategy
Authors:
Guoqi Ma,
Liang Zhang,
Hongyao Tu,
Hao Fu,
Hui Li,
Yujie Lin,
Longyue Wang,
Weihua Luo,
Jinsong Su
Abstract:
Cross-document relation extraction (RE) aims to identify relations between the head and tail entities located in different documents. Existing approaches typically adopt the paradigm of ``\textit{Small Language Model (SLM) + Classifier}''. However, the limited language understanding ability of SLMs hinders further improvement of their performance. In this paper, we conduct a preliminary study to explore the performance of Large Language Models (LLMs) in cross-document RE. Despite their extensive parameters, our findings indicate that LLMs do not consistently surpass existing SLMs. Further analysis suggests that the underperformance is largely attributed to the challenges posed by the numerous predefined relations. To overcome this issue, we propose an LLM-based \underline{H}ierarchical \underline{C}lassification model for cross-document \underline{RE} (HCRE), which consists of two core components: 1) an LLM for relation prediction and 2) a \textit{hierarchical relation tree} derived from the predefined relation set. This tree enables the LLM to perform hierarchical classification, where the target relation is inferred level by level. Since the number of child nodes is much smaller than the size of the entire predefined relation set, the hierarchical relation tree significantly reduces the number of relation options that the LLM needs to consider during inference. However, hierarchical classification introduces the risk of error propagation across levels. To mitigate this, we propose a \textit{prediction-then-verification} inference strategy that improves prediction reliability through multi-view verification at each level. Extensive experiments show that HCRE outperforms existing baselines, validating its effectiveness.
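The level-by-level narrowing enabled by the hierarchical relation tree can be sketched as follows; the tree, relation names, and keyword scorer are toy stand-ins for the LLM (the paper's multi-view verification step is not reproduced here):

```python
# Toy relation tree: coarse groups at level 1, specific relations below.
TREE = {
    "ROOT": ["personal", "organizational"],
    "personal": ["spouse", "sibling"],
    "organizational": ["employer", "founder"],
}

def node_score(context: str, node: str) -> float:
    """Score a node; internal nodes inherit their best descendant's score
    (a toy stand-in for the LLM judging coarse relation groups)."""
    if node in TREE:
        return max(node_score(context, c) for c in TREE[node])
    return float(node in context)

def hierarchical_predict(context: str, node: str = "ROOT") -> str:
    """Infer the relation level by level: at each step only the current
    node's children are candidates, never the full relation set."""
    while node in TREE:
        node = max(TREE[node], key=lambda c: node_score(context, c))
    return node

print(hierarchical_predict("A is the founder of B"))  # founder
```

With R relations arranged in a tree of branching factor b, each inference step considers only b options instead of R.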
Submitted 9 April, 2026;
originally announced April 2026.
-
Predicting Alzheimer's disease progression using rs-fMRI and a history-aware graph neural network
Authors:
Mahdi Moghaddami,
Mohammad-Reza Siadat,
Austin Toma,
Connor Laming,
Huirong Fu
Abstract:
Alzheimer's disease (AD) is a neurodegenerative disorder that affects more than seven million people in the United States alone. AD currently has no cure, but there are ways to potentially slow its progression if caught early enough. In this study, we propose a graph neural network (GNN)-based model for predicting whether a subject will transition to a more severe stage of cognitive impairment at their next clinical visit. We consider three stages of cognitive impairment in order of severity: cognitively normal (CN), mild cognitive impairment (MCI), and AD. We use functional connectivity graphs derived from resting-state functional magnetic resonance imaging (rs-fMRI) scans of 303 subjects, each with a different number of visits. Our GNN-based model incorporates a recurrent neural network (RNN) block, enabling it to process data from the subject's entire visit history. It can also work with irregular time gaps between visits by incorporating visit distance information into our input features. Our model demonstrates robust predictive performance, even with missing visits in the subjects' visit histories. It achieves an accuracy of 82.9%, with an especially impressive accuracy of 68.8% on CN to MCI conversions - a task that poses a substantial challenge in the field. Our results highlight the effectiveness of rs-fMRI in predicting the onset of MCI or AD and, in conjunction with other modalities, could offer a viable method for enabling timely interventions to slow the progression of cognitive impairment.
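The functional connectivity graphs mentioned above are typically built by correlating regional rs-fMRI time series and thresholding; a minimal sketch (the threshold and toy series are illustrative, not the paper's preprocessing):

```python
import numpy as np

def connectivity_graph(ts: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Build a binary functional-connectivity adjacency matrix from
    region time series `ts` of shape (n_regions, n_timepoints) by
    thresholding pairwise Pearson correlations."""
    corr = np.corrcoef(ts)                    # (n_regions, n_regions)
    adj = (np.abs(corr) >= threshold).astype(int)
    np.fill_diagonal(adj, 0)                  # no self-loops
    return adj

rng = np.random.default_rng(0)
base = rng.standard_normal(100)
ts = np.stack([base + 0.1 * rng.standard_normal(100),   # region 0
               base + 0.1 * rng.standard_normal(100),   # region 1 (coupled)
               rng.standard_normal(100)])                # region 2 (independent)
adj = connectivity_graph(ts)
print(adj[0, 1], adj[0, 2])   # coupled pair linked, independent pair not
```

The resulting adjacency matrix, one per visit, is what the GNN consumes.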
Submitted 7 April, 2026;
originally announced April 2026.
-
VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG
Authors:
Honghao Fu,
Miao Xu,
Yiwei Wang,
Dailing Zhang,
Liu Jun,
Yujun Cai
Abstract:
Scaling multimodal large language models (MLLMs) to long videos is constrained by limited context windows. While retrieval-augmented generation (RAG) is a promising remedy that organizes query-relevant visual evidence into a compact context, most existing methods (i) flatten videos into independent segments, breaking their inherent spatio-temporal structure, and (ii) depend on explicit semantic matching, which can miss cues that are implicitly relevant to the query's intent. To overcome these limitations, we propose VideoStir, a structured and intent-aware long-video RAG framework. It first structures a video as a spatio-temporal graph at the clip level, and then performs multi-hop retrieval to aggregate evidence across distant yet contextually related events. Furthermore, it introduces an MLLM-backed intent-relevance scorer that retrieves frames based on their alignment with the query's reasoning intent. To support this capability, we curate IR-600K, a large-scale dataset tailored for learning frame-query intent alignment. Experiments show that VideoStir is competitive with state-of-the-art baselines without relying on auxiliary information, highlighting the promise of shifting long-video RAG from flattened semantic matching to structured, intent-aware reasoning. Code and checkpoints are available on GitHub.
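The multi-hop retrieval step can be pictured as a bounded expansion over the clip graph, starting from clips matched to the query; a toy sketch with illustrative clip identifiers (VideoStir's actual retrieval is learned, not plain breadth-first search):

```python
from collections import deque

# Toy spatio-temporal clip graph: edges link temporally adjacent or
# entity-sharing clips (identifiers are illustrative).
GRAPH = {
    "clip0": ["clip1"],
    "clip1": ["clip0", "clip4"],
    "clip2": ["clip3"],
    "clip3": ["clip2"],
    "clip4": ["clip1"],
}

def multi_hop(seeds, hops=2):
    """Expand seed clips along graph edges for `hops` steps, gathering
    evidence from distant but contextually linked events."""
    seen, frontier = set(seeds), deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for nxt in GRAPH.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return sorted(seen)

print(multi_hop(["clip0"]))  # clip0 -> clip1 -> clip4
```

The retrieved clip set forms the compact context handed to the MLLM.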
Submitted 12 April, 2026; v1 submitted 7 April, 2026;
originally announced April 2026.
-
Statistics of blob properties in two types of coronal streamers
Authors:
Haiyi Li,
Zhenghua Huang,
Maria S. Madjarska,
Youqian Qi,
Hui Fu,
Ming Xiong,
Lidong Xia
Abstract:
Previous studies have shown that a streamer blob might originate in the lower corona and thus be affected by activity in that region. While the base of one streamer might differ from that of another, streamers can be categorized into two distinct types: active region streamers (ARSs), which have active regions at their base, and quiet equatorial streamers (QESs), which do not have an active region underneath. The difference between the blob properties in ARSs and those in QESs remains unknown. By analyzing the whole-year observations from SOHO/LASCO/C2 in 2018, we carried out a statistical analysis of the properties of propagating blobs in ARSs and QESs. We found that the properties of streamer blobs vary considerably from one blob to another. The occurrence rate of blobs in ARSs is about twice as high as that in QESs. On average, the ARS blobs have significantly higher initial velocities and slightly higher accelerations, but slightly lower heights of first appearance than the QES blobs. There is a weak positive correlation between the initial velocities and heights of first appearance in the two groups of streamer blobs. The correlation between the accelerations and heights of first appearance in ARS blobs is negative, while that in QES blobs is positive. Our results provide statistical evidence that a higher degree of activity at the coronal base of a streamer can cause more dynamic blobs higher up, and that it affects the structures of the solar wind originating in the region.
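Initial velocities and accelerations of this kind are typically obtained by fitting each blob's height-time track with a quadratic; a sketch with synthetic numbers (illustrative, not the paper's measurements or method):

```python
import numpy as np

# Synthetic blob track h(t) = h0 + v0*t + a*t^2/2 (values illustrative):
# heights in Mm, time in ks, so v0 is in Mm/ks (= km/s) and a in Mm/ks^2.
h0, v0, acc = 2500.0, 150.0, 4.0
t = np.linspace(0.0, 10.8, 12)                 # ~3 h of LASCO/C2 frames
h = h0 + v0 * t + 0.5 * acc * t**2

half_a, fit_v0, fit_h0 = np.polyfit(t, h, 2)   # quadratic fit
print(round(fit_v0, 3), round(2 * half_a, 3))  # 150.0 4.0
```

On real tracks the fitted coefficients give the initial velocity and acceleration statistics compared between ARS and QES blobs.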
Submitted 6 April, 2026;
originally announced April 2026.
-
PRIME: Prototype-Driven Multimodal Pretraining for Cancer Prognosis with Missing Modalities
Authors:
Kai Yu,
Shuang Zhou,
Yiran Song,
Zaifu Zhan,
Jie Peng,
Kaixiong Zhou,
Tianlong Chen,
Feng Xie,
Meng Wang,
Huazhu Fu,
Mingquan Lin,
Rui Zhang
Abstract:
Multimodal self-supervised pretraining offers a promising route to cancer prognosis by integrating histopathology whole-slide images, gene expression, and pathology reports, yet most existing approaches require fully paired and complete inputs. In practice, clinical cohorts are fragmented and often miss one or more modalities, limiting both supervised fusion and scalable multimodal pretraining. We propose PRIME, a missing-aware multimodal self-supervised pretraining framework that learns robust and transferable representations from partially observed cohorts. PRIME maps heterogeneous modality embeddings into a unified token space and introduces a shared prototype memory bank for latent-space semantic imputation via patient-level consensus retrieval, producing structurally aligned tokens without reconstructing raw signals. Two complementary pretraining objectives, inter-modality alignment and post-fusion consistency under structured missingness augmentation, jointly learn representations that remain predictive under arbitrary modality subsets. We evaluate PRIME on The Cancer Genome Atlas with label-free pretraining on 32 cancer types and downstream 5-fold evaluation on five cohorts across overall survival prediction, 3-year mortality classification, and 3-year recurrence classification. PRIME achieves the best macro-average performance among all compared methods, reaching 0.653 C-index, 0.689 AUROC, and 0.637 AUROC on the three tasks, respectively, while improving robustness under test-time missingness and supporting parameter-efficient and label-efficient adaptation. These results support missing-aware multimodal pretraining as a practical strategy for prognosis modeling in fragmented clinical data settings.
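The C-index reported above measures how well predicted risks order observed survival times; a compact reference implementation written from the standard Harrell definition (not the paper's code):

```python
def c_index(times, events, risks):
    """Harrell's concordance index: over comparable pairs (the earlier
    time had an observed event), count pairs where the higher predicted
    risk belongs to the earlier failure; ties in risk score 0.5."""
    num = den = 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:   # i failed before j
                den += 1
                if risks[i] > risks[j]:
                    num += 1
                elif risks[i] == risks[j]:
                    num += 0.5
    return num / den

# Perfectly ordered risks on three uncensored patients.
print(c_index([2, 5, 9], [1, 1, 1], [0.9, 0.6, 0.1]))  # 1.0
```

A value of 0.5 corresponds to random ordering; PRIME's 0.653 indicates a moderately concordant risk ranking.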
Submitted 5 April, 2026;
originally announced April 2026.
-
Enhancing Medical Visual Grounding via Knowledge-guided Spatial Prompts
Authors:
Yifan Gao,
Tao Zhou,
Yi Zhou,
Ke Zou,
Yizhe Zhang,
Huazhu Fu
Abstract:
Medical Visual Grounding (MVG) aims to identify diagnostically relevant phrases from free-text radiology reports and localize their corresponding regions in medical images, providing interpretable visual evidence to support clinical decision-making. Although recent Vision-Language Models (VLMs) exhibit promising multimodal reasoning ability, their grounding still lacks sufficient spatial precision, largely due to a lack of explicit localization priors when relying solely on latent embeddings. In this work, we analyze this limitation from an attention perspective and propose KnowMVG, a Knowledge-prior and global-local attention enhancement framework for MVG in VLMs that explicitly strengthens spatial awareness during decoding. Specifically, we present a knowledge-enhanced prompting strategy that encodes phrase-related medical knowledge into compact embeddings, together with a global-local attention mechanism that jointly leverages coarse global information and refined local cues to guide precise region localization. This design bridges high-level semantic understanding and fine-grained visual perception without introducing extra textual reasoning overhead. Extensive experiments on four MVG benchmarks demonstrate that our KnowMVG consistently outperforms existing approaches, achieving gains of 3.0% in AP50 and 2.6% in mIoU over prior state-of-the-art methods. Qualitative and ablation studies further validate the effectiveness of each component.
Submitted 2 April, 2026;
originally announced April 2026.
-
VistaGEN: Consistent Driving Video Generation with Fine-Grained Control Using Multiview Visual-Language Reasoning
Authors:
Li-Heng Chen,
Ke Cheng,
Yahui Liu,
Lei Shi,
Shi-Sheng Huang,
Hongbo Fu
Abstract:
Driving video generation has achieved much progress in controllability, video resolution, and length, but existing methods fail to support fine-grained object-level controllability for diverse driving videos while preserving spatiotemporal consistency, especially in long video generation. In this paper, we present a new driving video generation technique, called VistaGEN, which enables fine-grained control of specific entities, including 3D objects, images, and text descriptions, while maintaining spatiotemporal consistency in long video sequences. Our key innovation is the incorporation of multiview visual-language reasoning into long driving video generation. To this end, we inject visual-language features into a multiview video generator to enable fine-grained controllability. More importantly, we propose a multiview vision-language evaluator (MV-VLM) to intelligently and automatically evaluate the spatiotemporal consistency of the generated content, thus formulating a novel generation-evaluation-regeneration closed-loop generation mechanism. This mechanism ensures high-quality, coherent outputs, facilitating the creation of complex and reliable driving scenarios. Moreover, within the closed-loop generation, we introduce an object-level refinement module that refines the unsatisfactory results identified by the MV-VLM and feeds them back to the video generator for regeneration. Extensive evaluation shows that our VistaGEN achieves diverse driving video generation results with fine-grained controllability, especially for long-tail objects, and much better spatiotemporal consistency than previous approaches.
Submitted 30 March, 2026;
originally announced March 2026.
-
Statistics of transition-region loop brightenings and their heating implication
Authors:
Xiuhui Zuo,
Zhenghua Huang,
Maria S. Madjarska,
Hui Fu,
Hengyuan Wei,
Xinzheng Shi,
Lidong Xia
Abstract:
Transition-region loops are a type of critical magnetic structure in the solar atmosphere, yet their physical properties and evolutionary characteristics remain statistically poorly constrained. We aim to statistically characterize the physical properties of propagating brightening events in transition-region loops and to explore the underlying heating mechanism responsible for these brightenings. Using coordinated observations from the Extreme Ultraviolet Imager onboard the Solar Orbiter and the Atmospheric Imaging Assembly (AIA) onboard the Solar Dynamics Observatory, we analyze 42 propagating brightening events in loops that are unambiguously detected in the data from both instruments. Each of these events evolves simultaneously in the AIA 94, 131, 171, 193, 211, 304, and 335 passband images, suggesting that they are in the transition-region or low-coronal temperature range. Our analyses show that these brightenings are impulsive, with an average brightening time of 118.4 s and a mean intensity decreasing time of 159.4 s. The propagating brightenings are predominantly subsonic, with velocities in the range of 0-90 km/s and an average of 51.3 km/s. The lengths of the brightenings range from 3 to 11 Mm, with an average of 6.3 Mm, and are closely related to the propagation velocity and the lifetime. The initial brightening sites are predominantly located near the footpoints of these loops, and the number of brightening events decreases systematically with increasing loop height. Our results are consistent with an energizing mechanism regulated by enthalpy flows and radiative cooling.
Submitted 30 March, 2026;
originally announced March 2026.
-
Derived Weil Representation and Relative Langlands Duality
Authors:
Haoshuo Fu
Abstract:
The Weil representation is a particularly significant linear representation of the metaplectic group, used in the study of theta correspondence. In this paper, I introduce a derived category version of the Weil representation in the local field case. For the dual pair $ (\mathrm{GL}_n,\mathrm{GL}_m) $, I give a coherent description of this category, in the philosophy of relative Langlands duality.
Submitted 27 March, 2026;
originally announced March 2026.
-
DUGC-VRNet: Joint VR Recognition and Channel Estimation for Spatially Non-Stationary XL-MIMO
Authors:
Jinhao Nie,
Guangchi Zhang,
Miao Cui,
Hao Fu,
Xiaoli Chu
Abstract:
In this letter, we address spatially non-stationary near-field channel estimation for extremely large-scale multiple-input multiple-output (XL-MIMO) systems with a hybrid combining architecture. One key challenge is that conventional channel estimation algorithms typically struggle to effectively identify and adapt to the partial antenna visibility caused by varying visibility regions (VRs), thereby compromising estimation accuracy. To perform joint VR recognition and channel estimation, we integrate a deep unfolding network (DUN) with a graph convolution network (GCN), leading to a Deep Unfolding and Graph Convolution coupled, Visibility Region Aware Network (DUGC-VRNet). By leveraging the channel's graph structure, the GCN infers and feeds back VR information to dynamically guide the DUN's updates, thereby enhancing reliable channel estimation under spatial non-stationarity. To reduce DUGC-VRNet's complexity, we apply weight pruning to obtain a lightweight network. Simulation results demonstrate that DUGC-VRNet and its pruned variant achieve superior channel estimation and more accurate VR recognition under spatially non-stationary conditions.
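The GCN component propagates VR information over a graph of antenna elements. The paper's architecture is not reproduced here, but the basic graph-convolution step such networks build on (Kipf-Welling style normalization; the toy graph and weights below are illustrative) looks like:

```python
import numpy as np

def gcn_layer(adj: np.ndarray, h: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One graph-convolution step: symmetrically normalise the adjacency
    (with self-loops added) and propagate node features through a linear
    map followed by ReLU."""
    a_hat = adj + np.eye(adj.shape[0])                    # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ a_hat @ d_inv_sqrt @ h @ w, 0.0)

# Toy antenna-subarray graph: 3 nodes in a chain, 2-dim node features.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
h = np.eye(3, 2)          # initial per-node features
w = np.eye(2)             # layer weights
out = gcn_layer(adj, h, w)
print(out.shape)          # (3, 2)
```

Stacking such layers lets each antenna node's VR estimate absorb information from its neighbours, which is then fed back to guide the unfolded estimator's updates.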
Submitted 24 March, 2026;
originally announced March 2026.
-
Propagating Kink Waves in Chromospheric Jet-like Structures and Coronal Plumelets
Authors:
Youqian Qi,
Mingzhe Guo,
Zhenghua Huang,
Tom Van Doorsselaere,
Bo Li,
Lidong Xia,
Hengyuan Wei,
Hui Fu
Abstract:
Coronal plumes and chromospheric jet-like structures are believed to be highly dynamic. We report the first direct observations of a propagating kink wave in a chromospheric jet-like structure and its associated plumelet structure in the upper corona of the solar polar region, using data from the High Resolution Imager (HRI) of the Extreme Ultraviolet Imager (EUI) on board Solar Orbiter (SO). The dark jet-like structure exhibits transverse oscillation during upward propagation, with a period of approximately 95 s and a displacement of about 193 km. The corresponding plumelet also displays transverse motion, with an oscillation period of around 99 s and a displacement of about 315 km. Given that both the dark jet-like structure and the plumelet share the same magnetic skeleton and have similar oscillation periods, we suggest that these oscillations are the same transverse propagating wave originating in the chromosphere. This scenario is further supported by a 3D magnetohydrodynamic (MHD) simulation, in which both vertical and transverse perturbations were introduced in a stratified magnetic flux tube. The simulation successfully reproduces the upward propagation of a kink wave through both the chromospheric jet-like structure and the coronal plumelet. These results highlight the potential role of transverse waves in transferring energy from the lower solar atmosphere to the corona.
Submitted 25 March, 2026;
originally announced March 2026.
-
Multiple Topological States in LaAgAs2, a Failed Square-Net Semimetal
Authors:
Yang Liu,
Tongrui Li,
Xixi Yuan,
Nour Maraytta,
Alexei V. Fedorov,
Asish K. Kundu,
Turgut Yilmaz,
Elio Vescovo,
Xueliang Wu,
Long Zhang,
Mingquan He,
Yisheng Chai,
Xiaoyuan Zhou,
Michael Merz,
Zhe Sun,
Huixia Fu,
Tonica Valla,
Aifeng Wang
Abstract:
The rational design of new materials, based on an understanding of the correlation between crystal and electronic structures, has emerged as an important route to new topological materials. In this paper, we perform a comprehensive study of the crystal and electronic structures of LaAgAs2 through a combination of single-crystal x-ray diffraction (XRD), quantum oscillation, and angle-resolved photoemission spectroscopy (ARPES) measurements, and density functional theory (DFT) calculations. Single-crystal XRD measurements reveal that LaAgAs2 crystallizes into a HfCuSi2-derived structure with the square net distorted into cis-trans chains. Quantum oscillation measurements reveal two frequencies with small effective masses and quasi-two-dimensional (2D) characters. ARPES measurements reveal an electronic structure strikingly different from that of square-net-based semimetals. The Fermi surface is quasi-2D, with Dirac-like hole pockets at the zone center and a quasi-1D elliptical electron pocket at the zone boundary. Based on the DFT calculations, the measured electronic structure can be well understood in terms of the cis-trans distortion, which transforms the 2D square-net-derived Dirac bands into quasi-1D trivial bands. Intriguingly, multiple topological states can be identified around the zone center, including a nontrivial Z2 topological surface state and a bulk Dirac state. Our study clarifies the impact of the cis-trans distortion and identifies LaAgAs2 as a topological material with multiple topological states near the Fermi level, providing a guideline for intentionally designing new topological materials.
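Quantum-oscillation frequencies map to extremal Fermi-surface cross-sections through the Onsager relation A_F = (2*pi*e/hbar)*F; a quick conversion sketch (the 100 T input is illustrative, not one of the paper's measured frequencies):

```python
import math

HBAR = 1.054571817e-34   # reduced Planck constant, J*s
E = 1.602176634e-19      # elementary charge, C

def fermi_area_from_frequency(f_tesla: float) -> float:
    """Onsager relation: extremal Fermi-surface cross-section
    A_F = (2*pi*e/hbar) * F in m^-2, from a quantum-oscillation
    frequency F given in tesla."""
    return 2 * math.pi * E / HBAR * f_tesla

area = fermi_area_from_frequency(100.0)
print(f"{area:.3e}")   # ~9.5e17 m^-2
```

Small frequencies thus correspond to small pockets, consistent with the light, quasi-2D carriers reported above.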
Submitted 25 March, 2026;
originally announced March 2026.
-
Dark Matter Detection through Rydberg Atom Transducer
Authors:
J. F. Chen,
Haokun Fu,
Christina Gao,
Jing Shu,
Geng-Bo Wu,
Peiran Yin,
Yi-Ming Zhong,
Ying Zuo
Abstract:
Ultralight bosonic dark matter with masses in the meV range, corresponding to terahertz (THz) Compton frequencies, remains largely unexplored due to the difficulty of achieving both efficient signal conversion and single-photon-sensitive detection at THz frequencies. We propose a hybrid detection architecture that integrates a dielectric haloscope, a Rydberg-atom transducer, and a superconducting nanowire single-photon detector within a unified cryogenic platform operating at $\lesssim 1\,\text{K}$. The dielectric haloscope converts dark matter into THz photons via phase-matched resonant enhancement, achieving form factors $C \sim 0.4$ and loaded quality factors $Q_L \sim 10^4$. A cold $^{87}$Rb ensemble then coherently up-converts the THz signal to the optical domain through six-wave mixing among Rydberg states. The intrinsic directionality and narrow bandwidth ($\Delta\nu_{\mathrm{atomic}} \sim 1\,\text{MHz}$) of this process provide extra suppression of isotropic thermal backgrounds. With 10 days of integration at $0.3\,\text{K}$, we project sensitivity to the axion-photon coupling $g_{a\gamma\gamma} \sim 10^{-13}\,\mathrm{GeV}^{-1}$ at $m_a \sim 0.4\,\text{meV}$, reaching the QCD axion band and opening the THz window for searches for both axion and dark photon dark matter.
Submitted 24 March, 2026;
originally announced March 2026.
-
Moral Mazes in the Era of LLMs
Authors:
Dang Nguyen,
Harvey Yiyun Fu,
Peter West,
Ari Holtzman,
Chenhao Tan
Abstract:
Navigating complex social situations is an integral part of corporate life, ranging from giving critical feedback without hurting morale to rejecting requests without alienating teammates. Although large language models (LLMs) are permeating the workplace, it is unclear how well they can navigate these norms. To investigate this question, we created HR Simulator, a game where users roleplay as an HR officer and write emails to tackle challenging workplace scenarios, evaluated with GPT-4o as a judge based on scenario-specific rubrics. We analyze over 600 human and LLM emails and find systematic differences in style: LLM emails are more formal and empathetic. Furthermore, humans underperform LLMs (e.g., 23.5% vs. 48-54% scenario pass rate), but human emails rewritten by LLMs can outperform both, which indicates a hybrid advantage. On the evaluation side, judges can exhibit differences in their email preferences: an analysis of 10 judge models reveals evidence for emergent tact, where weaker models prefer direct, blunt communication but stronger models prefer more subtle messages. Judges also agree with each other more as they scale, which hints at a convergence toward shared communicative norms that may differ from humans'. Overall, our results suggest LLMs could substantially reshape communication in the workplace if they are widely adopted in professional correspondence.
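The evaluation protocol described above — rubric-based pass/fail verdicts from judge models, plus cross-judge agreement analysis — can be sketched as follows. The function names and the simple percent-agreement statistic are illustrative assumptions, not the paper's exact metrics:

```python
from itertools import combinations

def pass_rate(verdicts):
    """Fraction of emails a judge marks as passing its scenario rubric.
    `verdicts` is a list of 0/1 pass-fail outcomes."""
    return sum(verdicts) / len(verdicts)

def mean_pairwise_agreement(judge_verdicts):
    """Average fraction of emails on which two judges return the same
    pass/fail verdict, taken over all judge pairs."""
    pairs = list(combinations(judge_verdicts, 2))
    rates = [sum(a == b for a, b in zip(u, v)) / len(u) for u, v in pairs]
    return sum(rates) / len(rates)
```

Under this toy metric, "judges agree with each other more as they scale" would show up as a higher mean pairwise agreement among the verdict vectors of larger judge models.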
Submitted 6 April, 2026; v1 submitted 6 March, 2026;
originally announced March 2026.
-
PerformRecast: Expression and Head Pose Disentanglement for Portrait Video Editing
Authors:
Jiadong Liang,
Bojun Xiong,
Jie Tian,
Hua Li,
Xiao Long,
Yong Zheng,
Huan Fu
Abstract:
This paper investigates the task of expression-only portrait video performance editing based on a driving video, which plays a crucial role in the animation and film industries. Most existing research focuses on portrait animation, which aims to animate a static portrait image according to the facial motion in the driving video. As a consequence, such methods struggle to disentangle facial expression from head pose rotation and thus lack the ability to edit facial expression independently. In this paper, we propose PerformRecast, a versatile expression-only video editing method dedicated to recasting the performance in existing film and animation. The key insight of our method comes from the characteristics of the 3D Morphable Face Model (3DMM), which models the face identity, facial expression, and head pose of a 3D face mesh with separate parameters. We therefore improve the keypoint transformation formula of previous methods to make it more consistent with the 3DMM, achieving better disentanglement and providing users with much more fine-grained control. Furthermore, to avoid misalignment around the face boundary in the generated results, we decouple the facial and non-facial regions of input portrait images and pre-train a teacher model to provide separate supervision for them. Extensive experiments show that our method produces high-quality results that are more faithful to the driving video, outperforming existing methods in both controllability and efficiency. Our code, data, and trained models are available at https://youku-aigc.github.io/PerformRecast.
Submitted 20 March, 2026;
originally announced March 2026.
-
Parameter-Efficient Modality-Balanced Symmetric Fusion for Multimodal Remote Sensing Semantic Segmentation
Authors:
Haocheng Li,
Juepeng Zheng,
Shuangxi Miao,
Ruibo Lu,
Guosheng Cai,
Haohuan Fu,
Jianxi Huang
Abstract:
Multimodal remote sensing semantic segmentation enhances scene interpretation by exploiting complementary physical cues from heterogeneous data. Although pretrained Vision Foundation Models (VFMs) provide strong general-purpose representations, adapting them to multimodal tasks often incurs substantial computational overhead and is prone to modality imbalance, where the contribution of auxiliary modalities is suppressed during optimization. To address these challenges, we propose MoBaNet, a parameter-efficient and modality-balanced symmetric fusion framework. Built upon a largely frozen VFM backbone, MoBaNet adopts a symmetric dual-stream architecture to preserve generalizable representations while minimizing the number of trainable parameters. Specifically, we design a Cross-modal Prompt-Injected Adapter (CPIA) to enable deep semantic interaction by generating shared prompts and injecting them into bottleneck adapters under the frozen backbone. To obtain compact and discriminative multimodal representations for decoding, we further introduce a Difference-Guided Gated Fusion Module (DGFM), which adaptively fuses paired stage features by explicitly leveraging cross-modal discrepancy to guide feature selection. Furthermore, we propose a Modality-Conditional Random Masking (MCRM) strategy to mitigate modality imbalance by randomly masking one modality during training and imposing hard-pixel auxiliary supervision on modality-specific branches. Extensive experiments on the ISPRS Vaihingen and Potsdam benchmarks demonstrate that MoBaNet achieves state-of-the-art performance with significantly fewer trainable parameters than full fine-tuning, validating its effectiveness for robust and balanced multimodal fusion. The source code for this work is available at https://github.com/sauryeo/MoBaNet.
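The modality-masking idea behind MCRM — randomly dropping one modality per sample during training so the fused model cannot lean solely on the dominant stream — can be sketched as follows. The function name, the per-sample masking probability `p_mask`, and zero-masking are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def mcrm(x_opt, x_aux, p_mask=0.3, rng=None):
    """Modality-conditional random masking (illustrative sketch).

    With probability p_mask per sample, zero out exactly one of the two
    modality inputs. Returns masked copies of the batches plus a record
    of which modality was dropped (0 = none, 1 = first, 2 = second).
    """
    if rng is None:
        rng = np.random.default_rng()
    x_opt, x_aux = x_opt.copy(), x_aux.copy()  # leave caller's batch intact
    dropped = np.zeros(x_opt.shape[0], dtype=int)
    for i in range(x_opt.shape[0]):
        if rng.random() < p_mask:
            if rng.random() < 0.5:
                x_opt[i] = 0.0
                dropped[i] = 1
            else:
                x_aux[i] = 0.0
                dropped[i] = 2
    return x_opt, x_aux, dropped
```

The `dropped` record is what would let a training loop route the masked samples to the paper's modality-specific auxiliary losses.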
Submitted 18 March, 2026;
originally announced March 2026.
-
Probing the equivalence of chiral LCSRs in $D \to \pi e \nu_e$ decays and extraction of $|V_{cd}|$
Authors:
Xiu-Fen Wang,
Hai-Jiang Tian,
Yin-Long Yang,
Long Zeng,
Hai-Bing Fu
Abstract:
In this paper, we study the $D\to\pi$ semileptonic decay process. We employ two different currents to study the $D\to\pi$ transition form factors (TFFs) using light-cone sum rules within the framework of the chiral current approach. First, we adopt right-handed and left-handed currents for the correlators to obtain expressions for the vector form factors up to next-to-leading-order and leading-order accuracy, respectively. Here the twist-2 and twist-3 light-cone distribution amplitudes are constructed with the light-cone harmonic oscillator model. After extrapolating the TFFs to the whole physical $q^2$-region with the simplified $z$-series expansion, we obtain the branching fractions $\mathcal{B}(D^0\to \pi^-e^+\nu_e)_{\text{I}} = 0.31_{-0.05}^{+0.05}$, $\mathcal{B}(D^+\to \pi^0e^+\nu_e)_{\text{I}} = 0.39_{-0.06}^{+0.06}$, $\mathcal{B}(D^0\to \pi^-e^+\nu_e)_{\text{II}} = 0.27_{-0.03}^{+0.05}$, $\mathcal{B}(D^+\to \pi^0e^+\nu_e)_{\text{II}} = 0.34_{-0.04}^{+0.06}$, and extract the CKM matrix element $|V_{cd}|_{\text{I}} = ( 0.21^{+0.02}_{-0.02} )\times 10^{-2}$ as well as $|V_{cd}|_{\text{II}} = ( 0.23^{+0.02}_{-0.02}) \times 10^{-2}$. To verify the credibility of our calculations, these results are further compared with existing findings in the literature, showing good agreement within uncertainties.
Submitted 16 March, 2026;
originally announced March 2026.
-
AGCD: Agent-Guided Cross-Modal Decoding for Weather Forecasting
Authors:
Jing Wu,
Yang Liu,
Lin Zhang,
Junbo Zeng,
Jiabin Wang,
Zi Ye,
Guowen Li,
Shilei Cao,
Jiashun Cheng,
Fang Wang,
Meng Jin,
Yerong Feng,
Hong Cheng,
Yutong Lu,
Haohuan Fu,
Juepeng Zheng
Abstract:
Accurate weather forecasting is more than grid-wise regression: it must preserve coherent synoptic structures and the physical consistency of meteorological fields, especially under autoregressive rollouts where small one-step errors can amplify into structural bias. Existing physics-prior approaches typically impose global, once-for-all constraints via architectures, regularization, or NWP coupling, offering limited state-adaptive and sample-specific controllability at deployment. To bridge this gap, we propose Agent-Guided Cross-modal Decoding (AGCD), a plug-and-play decoding-time prior-injection paradigm that derives state-conditioned physics priors from the current multivariate atmosphere and injects them into forecasters in a controllable and reusable way. Specifically, we design a multi-agent meteorological narration pipeline to generate state-conditioned physics priors, utilizing MLLMs to extract various meteorological elements effectively. To apply the priors effectively, AGCD further introduces cross-modal region interaction decoding, which performs region-aware multi-scale tokenization and efficient physics-prior injection to refine visual features without changing the backbone interface. Experiments on WeatherBench demonstrate consistent gains for 6-hour forecasting across two resolutions (5.625 degree and 1.40625 degree) and diverse backbones (generic and weather-specialized), including strictly causal 48-hour autoregressive rollouts that reduce early-stage error accumulation and improve long-horizon stability.
Submitted 16 March, 2026;
originally announced March 2026.
-
AgroNVILA: Perception-Reasoning Decoupling for Multi-view Agricultural Multimodal Large Language Models
Authors:
Jiarui Zhang,
Junqi Hu,
Zurong Mai,
Yuhang Chen,
Shuohong Lou,
Henglian Huang,
Lingyuan Zhao,
Jianxi Huang,
Yutong Lu,
Haohuan Fu,
Juepeng Zheng
Abstract:
Agricultural multimodal reasoning requires robust spatial understanding across varying scales, from ground-level close-ups to top-down UAV and satellite imagery. Existing Multi-modal Large Language Models (MLLMs) suffer from a significant "terrestrial-centric" bias, causing scale confusion and logic drift during complex agricultural planning. To address this, we introduce AgroOmni (288K), the first large-scale multi-view training corpus designed to capture diverse spatial topologies and scales in modern precision agriculture. Built on this dataset, we propose AgroNVILA, an MLLM that utilizes a novel Perception-Reasoning Decoupling (PRD) architecture. On the perception side, we incorporate a View-Conditioned Meta-Net (VCMN), which injects macroscopic spatial context into visual tokens, resolving scale ambiguities with minimal computational overhead. On the reasoning side, Agriculture-aware Relative Policy Optimization (ARPO) leverages reinforcement learning to align the model's decision-making with expert agricultural logic, preventing statistical shortcuts. Extensive experiments demonstrate that AgroNVILA outperforms state-of-the-art MLLMs, achieving significant improvements (+15.18%) in multi-altitude agricultural reasoning, reflecting its robust capability for holistic agricultural spatial planning.
Submitted 15 March, 2026;
originally announced March 2026.
-
ShapeMark: Robust and Diversity-Preserving Watermarking for Diffusion Models
Authors:
Yuqi Qian,
Yun Cao,
Haocheng Fu,
Meiyang Lv,
Meineng Zhu
Abstract:
Diffusion models have made substantial advances in recent years, enabling high-quality image synthesis; however, the widespread dissemination and reuse of their outputs have introduced new challenges in intellectual property protection and content provenance. Image watermarking offers a solution to these challenges, and recent work has increasingly explored Noise-as-Watermark (NaW) approaches that integrate watermarking directly into the diffusion process. However, existing NaW methods fail to balance robustness and diversity. We attribute this weakness to value encoding, which encodes watermark bits into individual sampled values and is therefore extremely fragile in practical application scenarios. To address this, we encode watermark bits into the structured noise pattern, so that the watermark is preserved even when individual values are perturbed. To further ensure generation diversity, we introduce a dedicated randomization design that reshuffles the positions of noise elements without changing their values, preventing the watermark from inducing fixed noise patterns or spatial locations. Extensive experiments demonstrate that our method achieves state-of-the-art robustness while maintaining high generation quality across a wide range of lossy scenarios.
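To illustrate the contrast between value encoding and pattern encoding, the sketch below stores each bit in a group-level statistic (the sign of a group's mean, set by a marginal-preserving sign flip) and then shuffles the group so that no individual noise value carries the bit. The grouping scheme and `group_size` are our assumptions, not the paper's construction:

```python
import numpy as np

def embed_bits(bits, group_size=64, rng=None):
    """Pattern-style NaW embedding (illustrative, not the paper's scheme).

    Each bit owns a group of `group_size` i.i.d. N(0,1) samples; the bit
    is stored in the sign of the group mean via a global sign flip, which
    preserves the Gaussian marginal. The group is then shuffled, so the
    bit survives perturbation of any individual value or position.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    groups = []
    for b in bits:
        g = rng.standard_normal(group_size)
        target = 1.0 if b else -1.0
        if np.sign(g.mean()) != target:
            g = -g                 # flip: N(0,1) is symmetric about 0
        rng.shuffle(g)             # reshuffle positions, values unchanged
        groups.append(g)
    return np.concatenate(groups)

def extract_bits(noise, n_bits, group_size=64):
    """Recover bits from group-mean signs."""
    return [int(noise[i * group_size:(i + 1) * group_size].mean() > 0)
            for i in range(n_bits)]
```

Because decoding reads only an aggregate statistic, flipping or perturbing a few individual values leaves the message intact, which is the robustness argument against per-value encoding.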
Submitted 10 March, 2026;
originally announced March 2026.
-
Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge
Authors:
Shuai Lu,
Meng Wang,
Jia Guo,
Jiawei Du,
Bo Liu,
Shengzhu Yang,
Weihang Zhang,
Huazhu Fu,
Huiqi Li
Abstract:
Large Vision Language Models (LVLMs) show immense potential for automated ophthalmic diagnosis. However, their clinical deployment is severely hindered by a lack of domain-specific knowledge. In this work, we identify two structural deficiencies that hinder reliable medical reasoning: 1) the Perception Gap, where general-purpose visual encoders fail to resolve fine-grained pathological cues (e.g., microaneurysms); and 2) the Reasoning Gap, where sparse visual evidence is progressively overridden by massive language priors in deeper transformer layers, leading to ungrounded hallucinations. To bridge these gaps, we propose EyExIn, a data-efficient framework designed to anchor retinal VLMs with expert knowledge via a Deep Expert Injection mechanism. Our architecture employs an Expert-Aware Dual-Stream encoding strategy that decouples visual representation into a general stream for anatomical context and a specialized expert stream for pathological semantics. To ensure high-fidelity integration, we design a Semantic-Adaptive Gated Fusion module, which dynamically amplifies subtle lesion signals while filtering irrelevant background noise. Furthermore, we introduce Adaptive Deep Expert Injection to embed persistent "Vision Anchors" by integrating fused visual features as residual biases directly into intermediate LLM layers. This mechanism creates a visual shortcut that forces the reasoning stack to remain strictly grounded in visual evidence. Extensive experiments across four benchmarks demonstrate that our model consistently outperforms massive proprietary systems. EyExIn significantly enhances domain-specific knowledge embedding and achieves state-of-the-art precision in ophthalmic visual question answering, advancing the development of trustworthy ophthalmic AI.
Submitted 19 March, 2026; v1 submitted 7 March, 2026;
originally announced March 2026.
-
Enhancing User Fairness in Two-Layer RSMA: A Movable Antenna Approach
Authors:
Ji Luo,
Yaxuan Chen,
Guangchi Zhang,
Miao Cui,
Hao Fu,
Changsheng You
Abstract:
Enhancing user fairness in advanced multi-user systems like two-layer rate-splitting multiple access (RSMA) is a critical yet challenging task. This letter proposes a novel movable antenna (MA) approach to address this challenge. We formulate a max-min fairness problem, maximizing the minimum user rate, a key metric for fairness, through the joint optimization of the beamforming matrices, user clustering, common rate allocation, and the antenna position vector (APV). To solve this non-convex problem, we develop an efficient two-loop iterative algorithm. The outer loop leverages the dynamic neighborhood pruning particle swarm optimization method to find a high-quality APV, while the inner loop optimizes the remaining variables for a given APV. Simulation results validate our approach, demonstrating that the proposed scheme yields significant fairness gains over various benchmark schemes.
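The two-loop structure can be sketched with a minimal outer PSO loop, abstracting the inner-loop solver into an objective callback that maps a candidate APV to a scalar cost (e.g., the negated minimum user rate after optimizing the remaining variables). The hyperparameters and the plain PSO update — without the paper's dynamic neighborhood pruning — are illustrative assumptions:

```python
import numpy as np

def pso_minimize(objective, dim, n_particles=20, iters=100,
                 bounds=(-1.0, 1.0), seed=0):
    """Minimal particle swarm optimization as an outer-loop sketch.

    `objective` stands in for the inner loop: given a candidate antenna
    position vector (APV), it returns the cost to minimize.
    """
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))   # particle positions
    v = np.zeros_like(x)                          # velocities
    pbest = x.copy()                              # per-particle bests
    pbest_val = np.array([objective(p) for p in x])
    gbest = pbest[pbest_val.argmin()].copy()      # swarm-wide best
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)                # keep APV in feasible box
        vals = np.array([objective(p) for p in x])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = x[better], vals[better]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, float(pbest_val.min())
```

In the letter's setting, each `objective` call would itself run the inner convex-optimization step, which is why pruning the neighborhood search matters for runtime.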
Submitted 7 March, 2026;
originally announced March 2026.
-
CORE-Seg: Reasoning-Driven Segmentation for Complex Lesions via Reinforcement Learning
Authors:
Yuxin Xie,
Yuming Chen,
Yishan Yang,
Yi Zhou,
Tao Zhou,
Zhen Zhao,
Jiacheng Liu,
Huazhu Fu
Abstract:
Medical image segmentation is undergoing a paradigm shift from conventional visual pattern matching to cognitive reasoning analysis. Although Multimodal Large Language Models (MLLMs) have shown promise in integrating linguistic and visual knowledge, significant gaps remain: existing general MLLMs possess broad common sense but lack the specialized visual reasoning required for complex lesions, whereas traditional segmentation models excel at pixel-level segmentation but lack logical interpretability. In this paper, we introduce ComLesion-14K, the first diverse Chain-of-Thought (CoT) benchmark for reasoning-driven complex lesion segmentation. To accomplish this task, we propose CORE-Seg, an end-to-end framework integrating reasoning with segmentation through a Semantic-Guided Prompt Adapter. We design a progressive training strategy from SFT to GRPO, equipped with an adaptive dual-granularity reward mechanism to mitigate reward sparsity. Our method achieves state-of-the-art results with a mean Dice of 37.06% (14.89% higher than the second-best baseline), while reducing the failure rate to 18.42%. Project Page: https://xyxl024.github.io/CORE-Seg.github.io/
Submitted 5 March, 2026;
originally announced March 2026.
-
Electrically tunable circular photocurrent via local-field induced symmetry breaking at a metal-MoTe2 interface
Authors:
Butian Zhang,
Kexin Wang,
Jun-Tao Ma,
Yiya Guo,
Chengyu Yan,
Xin Yi,
Luojun Du,
Youwei Zhang,
Hua-Hua Fu,
Shun Wang
Abstract:
Transition metal dichalcogenides (TMDCs) constitute a promising platform for symmetry-engineered responses to circularly polarized light. The high crystal symmetry of centrosymmetric 2H-phase TMDCs inherently forbids the circular photogalvanic effect, thereby necessitating external stimuli such as electric fields or strain to lower the symmetry for its activation. While Schottky junctions provide a ubiquitous built-in field that could potentially induce circular photocurrents, the mechanism for the generation and control of circular photocurrents in TMDCs remains poorly understood. In this study, we fabricated a localized gold-MoTe2 heterostructure and demonstrated a pronounced circular photocurrent at the interface under normal incidence. The photocurrent is attributed to the circular photogalvanic effect governed by the strength and direction of the built-in electric field, enabling continuous modulation via an external bias. First-principles calculations show that the gold interface induces a spin splitting in the valence bands of MoTe2, establishing a valley-dependent spin ordering. The observed circular photocurrent from multilayer 2H-MoTe2 under normal incidence indicates the breaking of C3 rotational symmetry by the local in-plane field. These results establish an effective strategy for developing voltage-tunable circularly polarized photodetectors and valleytronic devices.
Submitted 5 March, 2026;
originally announced March 2026.
-
Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs
Authors:
Lianyu Wang,
Meng Wang,
Huazhu Fu,
Daoqiang Zhang
Abstract:
The rapid adoption of vision-language models (VLMs) has heightened the demand for robust intellectual property (IP) protection of these high-value pretrained models. Effective IP protection should proactively confine model deployment within authorized domains and prevent unauthorized transfers. However, existing methods rely on static training-time definitions, limiting flexibility in dynamic environments and often producing opaque responses to unauthorized inputs. To address these limitations, we propose AoD-IP, a novel framework for dynamic authorization with legality-aware intellectual property protection in VLMs that supports authorize-on-demand operation and legality-aware assessment. AoD-IP introduces a lightweight dynamic authorization module that enables flexible, user-controlled authorization, allowing users to actively specify or switch authorized domains on demand at deployment time. This enables the model to adapt seamlessly as application scenarios evolve and provides substantially greater extensibility than existing static-domain approaches. In addition, AoD-IP incorporates a dual-path inference mechanism that jointly predicts input legality and task-specific outputs. Comprehensive experimental results on multiple cross-domain benchmarks demonstrate that AoD-IP maintains strong authorized-domain performance and reliable unauthorized-input detection, while supporting user-controlled authorization for adaptive deployment in dynamic environments.
Submitted 5 March, 2026;
originally announced March 2026.
-
RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies
Authors:
Yinpei Dai,
Hongze Fu,
Jayjun Lee,
Yuejiang Liu,
Haoran Zhang,
Jianing Yang,
Chelsea Finn,
Nima Fazeli,
Joyce Chai
Abstract:
Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits their systematic understanding, comparison, and progress measurement. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website https://robomme.github.io.
Submitted 4 March, 2026;
originally announced March 2026.
-
Nature of $K^*(1680)$ and $q\bar{q}$-hybrid mixing as the SU(3) partner of $\eta_{1}(1855)$ in the strange sector
Authors:
Samee Ullah,
Ye Cao,
Ming-Xiao Duan,
Hai-Bing Fu,
Qiang Zhao
Abstract:
We present an investigation of the $K^*(1680)$ state via its strong decays into two-body final states within the flux-tube model and the quark pair creation model. Since charge conjugation parity is not conserved in the strange sector, the conventional $q\bar{q}$ states with $J^{P(C)}=1^{-(-)}$ can mix with the lowest hybrid states with $J^{P(C)}=1^{-(+)}$. Our analysis of the $K^*(1680)$ two-body strong decays indicates that its decay pattern cannot be explained by the conventional $q\bar{q}$ scenario. Meanwhile, strong evidence supports a $q\bar{q}$-hybrid mixing mechanism in the strange sector. The phenomenological consequences of such mixing are also discussed. Our study provides guidance for future searches for hybrid multiplets in experiments at BESIII, LHCb, and Belle II.
Submitted 2 March, 2026;
originally announced March 2026.
-
OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens
Authors:
Yiying Yang,
Wei Cheng,
Sijin Chen,
Honghao Fu,
Xianfang Zeng,
Yujun Cai,
Gang Yu,
Xingjun Ma
Abstract:
OmniLottie is a versatile framework that generates high-quality vector animations from multi-modal instructions. For flexible motion and visual content control, we focus on Lottie, a lightweight JSON format for representing both shapes and animation behaviors. However, raw Lottie JSON files contain extensive invariant structural metadata and formatting tokens, posing significant challenges for learning vector animation generation. We therefore introduce a well-designed Lottie tokenizer that transforms JSON files into structured sequences of commands and parameters representing shapes, animation functions, and control parameters. This tokenizer enables us to build OmniLottie upon pretrained vision-language models to follow multi-modal interleaved instructions and generate high-quality vector animations. To further advance research in vector animation generation, we curate MMLottie-2M, a large-scale dataset of professionally designed vector animations paired with textual and visual annotations. With extensive experiments, we validate that OmniLottie produces vivid and semantically aligned vector animations that adhere closely to multi-modal human instructions.
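The tokenizer idea — flattening a Lottie JSON tree into command/parameter tokens while discarding invariant metadata — can be sketched as follows. The content-key set and the `(COMMAND, value)` token format are illustrative assumptions, not the actual OmniLottie vocabulary:

```python
# Keys treated as animation content in a simplified view of Lottie JSON;
# the key set below is an assumption for illustration only.
CONTENT_KEYS = {"ty": "SHAPE", "o": "OPACITY", "r": "ROTATION",
                "p": "POSITION", "s": "SCALE"}

def tokenize(node, tokens=None):
    """Flatten a Lottie-style JSON tree into (command, params) tokens,
    dropping structural metadata and formatting keys along the way."""
    if tokens is None:
        tokens = []
    if isinstance(node, dict):
        for key, val in node.items():
            if key in CONTENT_KEYS and not isinstance(val, dict):
                tokens.append((CONTENT_KEYS[key], val))  # emit content token
            else:
                tokenize(val, tokens)   # recurse; metadata scalars drop out
    elif isinstance(node, list):
        for item in node:
            tokenize(item, tokens)
    return tokens
```

Metadata such as the version string `"v"` or layer name `"nm"` never reaches the token stream, which is the point: the model sees only the commands and parameters that matter for generation.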
Submitted 2 March, 2026;
originally announced March 2026.
-
MergeDJD: A Fast Constructive Algorithm with Piece Merging for the Two-Dimensional Irregular Bin Packing Problem
Authors:
Yi Zhou,
Haocheng Fu,
Yiping Liu,
Jian Mao,
Zhang-Hua Fu,
Yuyi Wang
Abstract:
The two-dimensional irregular bin packing problem (2DIBPP) aims to pack a given set of irregular polygons, referred to as pieces, into fixed-size rectangular bins without overlap, while maximizing bin utilization. Although numerous metaheuristic algorithms have been proposed for the 2DIBPP, many industrial applications favor simpler constructive heuristics due to their deterministic behavior and low computational overhead. Among such methods, the DJD algorithm proposed by López-Camacho et al. is one of the most competitive constructive heuristics for the 2DIBPP. However, DJD is less effective for cutting instances, in which many pieces can be seamlessly combined into larger polygons. To address this issue, we propose MergeDJD, a novel constructive algorithm that integrates and extends the DJD framework. MergeDJD first preprocesses the instance by iteratively identifying groups of pieces that can be combined into larger and more regular pieces. It then employs an improved version of DJD, in which the placement strategy is enhanced to better handle non-convex and combined shapes, to pack all resulting pieces into bins. Computational experiments on 1,089 well-known benchmark instances show that MergeDJD consistently outperforms DJD on 1,083 instances while maintaining short runtimes. Notably, MergeDJD attains new best-known values on 515 instances. Ablation studies further confirm the effectiveness of the proposed components. To facilitate reproducibility and future research, we have open-sourced the complete implementation and provided interfaces for visualizing packing results.
Submitted 28 February, 2026;
originally announced March 2026.
-
Array-Carrying Symbolic Execution for Function Contract Generation
Authors:
Weijie Lu,
Jingyu Ke,
Hongfei Fu,
Zhouyue Sun,
Yi Zhou,
Guoqiang Li,
Haokun Li
Abstract:
Function contract generation is a classical problem in program analysis that targets the automated analysis of functions in a program with multiple procedures. The problem is fundamental in inter-procedural analysis, where properties of functions are first obtained via the generation of function contracts and the generated contracts are then used as building blocks to analyze the whole program. Typical objectives in function contract generation include pre-/post-conditions and assigns information (which specifies the modification information over program variables and memory segments during function execution). In programs with array manipulations, a crucial point in function contract generation is the treatment of array segments, which poses challenges for inferring invariants and assigns information over such segments. To address this challenge, we propose a novel symbolic execution framework that carries invariants and assigns information over contiguous segments of arrays. We implement our framework as a prototype within LLVM, and further integrate our prototype with the ACSL assertion format and the Frama-C software verification platform. Experimental evaluation over a variety of benchmarks from the literature and functions from realistic libraries shows that our framework is capable of handling array-manipulating functions that indeed involve the carrying of array information and lie beyond the reach of existing approaches.
Submitted 27 February, 2026; v1 submitted 26 February, 2026;
originally announced February 2026.
-
MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline
Authors:
Fang-Duo Tsai,
Yi-An Lai,
Fei-Yueh Chen,
Hsueh-Wei Fu,
Li Chai,
Wei-Jaw Lee,
Hao-Chung Cheng,
Yi-Hsuan Yang
Abstract:
Song generation aims to produce full songs with vocals and accompaniment from lyrics and text descriptions, yet end-to-end models remain data- and compute-intensive and provide limited editability. We advocate a compositional alternative that decomposes the task into melody composition, singing voice synthesis, and singing accompaniment generation. Central to our approach is MIDI-informed singing accompaniment generation (MIDI-SAG), which conditions accompaniment on the symbolic vocal-melody MIDI to improve rhythmic and harmonic alignment between singing and instrumentation. Moreover, beyond conventional SAG settings that assume continuously sung vocals, compositional song generation features intermittent vocals; we address this by combining explicit rhythmic/harmonic controls with audio continuation to keep the backing track consistent across vocal and non-vocal regions. With lightweight newly trained components requiring only 2.5k hours of audio on a single RTX 3090, our pipeline approaches the perceptual quality of recent open-source end-to-end baselines in several metrics. We provide audio demos and will open-source our model at https://composerflow.github.io/web/.
Submitted 24 February, 2026;
originally announced February 2026.
-
SEF-MAP: Subspace-Decomposed Expert Fusion for Robust Multimodal HD Map Prediction
Authors:
Haoxiang Fu,
Lingfeng Zhang,
Hao Li,
Ruibing Hu,
Zhengrong Li,
Guanjing Liu,
Zimu Tan,
Long Chen,
Hangjun Ye,
Xiaoshuai Hao
Abstract:
High-definition (HD) maps are essential for autonomous driving, yet multi-modal fusion often suffers from inconsistency between camera and LiDAR modalities, leading to performance degradation under low-light conditions, occlusions, or sparse point clouds. To address this, we propose SEF-MAP, a Subspace-Decomposed Expert Fusion framework for robust multimodal HD map prediction. The key idea is to explicitly disentangle BEV features into four semantic subspaces: LiDAR-private, Image-private, Shared, and Interaction. Each subspace is assigned a dedicated expert, thereby preserving modality-specific cues while capturing cross-modal consensus. To adaptively combine expert outputs, we introduce an uncertainty-aware gating mechanism at the BEV-cell level, where unreliable experts are down-weighted based on predictive variance, complemented by a usage balance regularizer to prevent expert collapse. To enhance robustness in degraded conditions and promote role specialization, we further propose distribution-aware masking: during training, modality-drop scenarios are simulated using EMA-statistical surrogate features, and a specialization loss enforces distinct behaviors of private, shared, and interaction experts across complete and masked inputs. Experiments on the nuScenes and Argoverse2 benchmarks demonstrate that SEF-MAP achieves state-of-the-art performance, surpassing prior methods by +4.2% and +4.8% in mAP, respectively. SEF-MAP provides a robust and effective solution for multi-modal HD map prediction under diverse and degraded conditions.
Submitted 25 February, 2026;
originally announced February 2026.
-
Probing $D_s^+ \to η^{(\prime)} \ell^+ν_\ell$ semileptonic decay within LCSR under chiral heavy quark effective field theory
Authors:
Ruiyu Zhou,
Hai-Bing Fu,
Yi Zhang,
Wei Cheng
Abstract:
Motivated by the successful application of heavy quark effective field theory in describing decays of heavy mesons to light mesons, this work explores its applicability to the semileptonic decays of charmed mesons. In this paper we investigate the $D_s^+\to η^{(\prime)} \ell^+ ν_\ell$ transition form factors using the light-cone sum rules approach within the framework of heavy-quark effective field theory. To address the large uncertainties arising from the $η^{(\prime)}$-meson twist-3 distribution amplitudes, we employ the right-handed chiral correlation function. By applying the converging simplified series expansion method, we extrapolate the form factors to the entire physical $q^2$-region. Our analysis yields precise predictions for the branching fractions of the semileptonic decays $D_s^+\to η^{(\prime)}\ell^+ν_\ell$: $\mathcal{B}(D_s^+\toη\ell^+ν_\ell)=2.300^{+0.230}_{-0.227}\%$ ($\ell = e$) and $2.249_{-0.206}^{+0.209}\%$ ($\ell = μ$); $\mathcal{B}(D_s^+\toη^\prime \ell^+ν_\ell)=0.861^{+0.095}_{-0.093}\%$ ($\ell = e$) and $0.821^{+0.082}_{-0.080}\%$ ($\ell = μ$). The derived lepton flavor universality ratios $R^η_{μ,e}=0.977^{+0.008}_{-0.006}$ and $R^{η^{\prime}}_{μ,e} = 0.953^{+0.011}_{-0.009}$ are consistent with the latest BESIII experimental measurements. Additionally, the forward-backward asymmetry parameters $\langle \mathcal{A}^η_{\rm FB}\rangle=-0.034^{+0.003}_{-0.003}$ and $\langle \mathcal{A}^{η^\prime}_{\rm FB}\rangle=-0.073^{+0.007}_{-0.008}$ suggest no significant violation of lepton flavor universality in this decay process.
Submitted 23 February, 2026;
originally announced February 2026.
-
PhantomRun: Auto Repair of Compilation Errors in Embedded Open Source Software
Authors:
Han Fu,
Andreas Ermedahl,
Sigrid Eldh,
Kristian Wiklund,
Philipp Haller,
Cyrille Artho
Abstract:
Continuous Integration (CI) pipelines for embedded software sometimes fail during compilation, consuming significant developer time for debugging. We study four major open-source embedded system projects, spanning over 4000 build failures from the projects' CI runs. We find that hardware dependencies account for the majority of compilation failures, followed by syntax errors and build-script issues. Most repairs need relatively small changes, making automated repair potentially suitable as long as the diverse setups and lack of test data can be handled.
In this paper, we present PhantomRun, an automated framework that leverages large language models (LLMs) to generate and validate fixes for CI compilation failures. The framework addresses the challenge of diverse build infrastructures and toolchains across embedded system projects by providing an adaptation layer for GitHub Actions, GitLab CI, and four different build systems. PhantomRun utilizes build logs, source code, historical fixes, and compiler error messages to synthesize fixes using LLMs. Our evaluations show that PhantomRun successfully repairs up to 45% of CI compilation failures across the targeted projects, demonstrating the viability of LLM-based repairs for embedded-system CI pipelines.
Submitted 23 February, 2026;
originally announced February 2026.
-
QSolver: A Quantum Constraint Solver
Authors:
Shangzhou Xia,
Haitao Fu,
Jianjun Zhao
Abstract:
With the growing interest in quantum programs, ensuring their correctness is a fundamental challenge. Although constraint-solving techniques can overcome some limitations of traditional testing and verification, they have not yet been sufficiently explored in the context of quantum programs. To address this gap, we present QSolver, the first quantum constraint solver. QSolver provides a structured framework for handling five types of quantum constraints and incorporates an automated assertion generation module to verify quantum states. QSolver transforms quantum programs and multi-moment constraints into symbolic representations, and utilizes an SMT solver to obtain quantum states that satisfy these constraints. To validate the correctness of the generated input states, QSolver automatically generates assertion programs corresponding to each constraint. Experimental results show that QSolver efficiently processes commonly used quantum gates and demonstrates good scalability across quantum programs of different sizes.
Submitted 10 February, 2026;
originally announced February 2026.
-
OODBench: Out-of-Distribution Benchmark for Large Vision-Language Models
Authors:
Ling Lin,
Yang Bai,
Heng Su,
Congcong Zhu,
Yaoxing Wang,
Yang Zhou,
Huazhu Fu,
Jingrun Chen
Abstract:
Existing Vision-Language Models (VLMs) have achieved significant progress by being trained on massive-scale datasets, typically under the assumption that data are independent and identically distributed (IID). However, in real-world scenarios, it is often impractical to expect that all data processed by an AI system satisfy this assumption. Furthermore, failure to appropriately handle out-of-distribution (OOD) objects may introduce safety risks in real-world applications (e.g., autonomous driving or medical assistance). Unfortunately, current research has not yet provided valid benchmarks that can comprehensively assess the performance of VLMs on OOD data. Therefore, we propose OODBench, a predominantly automated method with minimal human verification, for constructing new benchmarks and evaluating the ability of VLMs to process OOD data. OODBench contains 40K instance-level OOD instance-category pairs, and we show that current VLMs still exhibit notable performance degradation on OODBench, even when the underlying image categories are common. In addition, we propose a reliable automated assessment metric that employs a Basic-to-Advanced Progression of prompted questions to more fully assess the impact of OOD data on questions of varying difficulty. Lastly, we summarize substantial findings and insights to facilitate future research in the acquisition and evaluation of OOD data.
Submitted 20 February, 2026;
originally announced February 2026.
-
Skeleton2Stage: Reward-Guided Fine-Tuning for Physically Plausible Dance Generation
Authors:
Jidong Jia,
Youjian Zhang,
Huan Fu,
Dacheng Tao
Abstract:
Despite advances in dance generation, most methods are trained in the skeletal domain and ignore mesh-level physical constraints. As a result, motions that look plausible as joint trajectories often exhibit body self-penetration and Foot-Ground Contact (FGC) anomalies when visualized with a human body mesh, reducing the aesthetic appeal of generated dances and limiting their real-world applications. We address this skeleton-to-mesh gap by deriving physics-based rewards from the body mesh and applying Reinforcement Learning Fine-Tuning (RLFT) to steer the diffusion model toward physically plausible motion synthesis under mesh visualization. Our reward design combines (i) an imitation reward that measures a motion's general plausibility by its imitability in a physical simulator (penalizing penetration and foot skating), and (ii) a Foot-Ground Deviation (FGD) reward with test-time FGD guidance to better capture the dynamic foot-ground interaction in dance. However, we find that the physics-based rewards tend to push the model to generate freezing motions in pursuit of fewer physical anomalies and better imitability. To mitigate this, we propose an anti-freezing reward that preserves motion dynamics while maintaining physical plausibility. Experiments on multiple dance datasets consistently demonstrate that our method can significantly improve the physical plausibility of generated motions, yielding more realistic and aesthetically pleasing dances. The project page is available at: https://jjd1123.github.io/Skeleton2Stage/
Submitted 14 February, 2026;
originally announced February 2026.
-
Jointly Optimizing Debiased CTR and Uplift for Coupons Marketing: A Unified Causal Framework
Authors:
Siyun Yang,
Shixiao Yang,
Jian Wang,
Di Fan,
Kehe Cai,
Haoyan Fu,
Jiaming Zhang,
Wenjin Wu,
Peng Jiang
Abstract:
In online advertising, marketing interventions such as coupons introduce significant confounding bias into Click-Through Rate (CTR) prediction. Observed clicks reflect a mixture of users' intrinsic preferences and the uplift induced by these interventions. This causes conventional models to miscalibrate base CTRs, which distorts downstream ranking and billing decisions. Furthermore, marketing interventions often operate as multi-valued treatments with varying magnitudes, introducing additional complexity to CTR prediction.
To address these issues, we propose the \textbf{Uni}fied \textbf{M}ulti-\textbf{V}alued \textbf{T}reatment Network (UniMVT). Specifically, UniMVT disentangles confounding factors from treatment-sensitive representations, enabling a full-space counterfactual inference module to jointly reconstruct the debiased base CTR and intensity-response curves. To handle the complexity of multi-valued treatments, UniMVT employs an auxiliary intensity estimation task to capture treatment propensities and devises a unit uplift objective that normalizes the intervention effect. This ensures comparable estimation across the continuous coupon-value spectrum. UniMVT simultaneously achieves debiased CTR prediction for accurate system calibration and precise uplift estimation for incentive allocation. Extensive experiments on synthetic and industrial datasets demonstrate UniMVT's superiority in both predictive accuracy and calibration. Furthermore, real-world A/B tests confirm that UniMVT significantly improves business metrics through more effective coupon distribution.
Submitted 13 February, 2026;
originally announced February 2026.
-
Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling
Authors:
Gongye Liu,
Bo Yang,
Yida Zhi,
Zhizhou Zhong,
Lei Ke,
Didan Deng,
Han Gao,
Yongxiang Huang,
Kaihao Zhang,
Hongbo Fu,
Wenhan Luo
Abstract:
Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
Submitted 11 February, 2026;
originally announced February 2026.
-
PRISM: A Principled Framework for Multi-Agent Reasoning via Gain Decomposition
Authors:
Yiming Yang,
Zhuoyuan Li,
Fanxiang Zeng,
Hao Fu,
Yue Liu
Abstract:
Multi-agent collaboration has emerged as a promising paradigm for enhancing reasoning capabilities of Large Language Models (LLMs). However, existing approaches remain largely heuristic, lacking principled guidance on what drives performance gains and how to systematically optimize multi-agent reasoning. Specifically, it remains unclear why multi-agent collaboration outperforms single-agent reason…
▽ More
Multi-agent collaboration has emerged as a promising paradigm for enhancing reasoning capabilities of Large Language Models (LLMs). However, existing approaches remain largely heuristic, lacking principled guidance on what drives performance gains and how to systematically optimize multi-agent reasoning. Specifically, it remains unclear why multi-agent collaboration outperforms single-agent reasoning and which design choices contribute most to these gains, making it difficult to build better systems.
We address this gap by introducing a unified theoretical framework that decomposes multi-agent reasoning gains into three conceptually independent dimensions: Exploration for diverse solution coverage, Information for high-fidelity feedback, and Aggregation for principled consensus. Through this lens, existing methods can be understood as special cases that optimize only subsets of these dimensions. Building upon this decomposition, a novel framework called PRISM (Propose-Review-Integrate Synthesis for Multi-agent Reasoning) is proposed, which jointly maximizes all three dimensions through role-based diversity, execution-grounded feedback with evidence-based cross-evaluation, and iterative synthesis with closed-loop validation. Extensive experiments across mathematical reasoning, code generation, and function calling benchmarks demonstrate that PRISM achieves state-of-the-art performance with superior compute-efficiency compared to methods optimizing partial dimensions. The theoretical framework provides actionable design principles for future multi-agent reasoning systems.
Submitted 10 February, 2026; v1 submitted 9 February, 2026;
originally announced February 2026.
-
Manifolds with harmonic Weyl curvature and curvature operator of the second kind
Authors:
Haiping Fu,
Yao Lu
Abstract:
We prove that a compact Riemannian manifold of dimension $n\ge 8$ with harmonic Weyl curvature and $\frac{3(n-1)(n+2)}{4(3n-1)}$-nonnegative curvature operator of the second kind is either globally conformally equivalent to a space of positive constant curvature or isometric to a flat manifold. In particular, we also give a classification of four-dimensional manifolds with harmonic Weyl curvature satisfying a cone condition. This result generalizes the work in \cite{DFY24,FLD,Li22}.
Submitted 6 February, 2026;
originally announced February 2026.
-
Enhancing Open-Vocabulary Object Detection through Multi-Level Fine-Grained Visual-Language Alignment
Authors:
Tianyi Zhang,
Antoine Simoulin,
Kai Li,
Sana Lakdawala,
Shiqing Yu,
Arpit Mittal,
Hongyu Fu,
Yu Lin
Abstract:
Traditional object detection systems are typically constrained to predefined categories, limiting their applicability in dynamic environments. In contrast, open-vocabulary object detection (OVD) enables the identification of objects from novel classes not present in the training set. Recent advances in visual-language modeling have led to significant progress of OVD. However, prior works face challenges in either adapting the single-scale image backbone from CLIP to the detection framework or ensuring robust visual-language alignment. We propose Visual-Language Detection (VLDet), a novel framework that revamps feature pyramid for fine-grained visual-language alignment, leading to improved OVD performance. With the VL-PUB module, VLDet effectively exploits the visual-language knowledge from CLIP and adapts the backbone for object detection through feature pyramid. In addition, we introduce the SigRPN block, which incorporates a sigmoid-based anchor-text contrastive alignment loss to improve detection of novel categories. Through extensive experiments, our approach achieves 58.7 AP for novel classes on COCO2017 and 24.8 AP on LVIS, surpassing all state-of-the-art methods and achieving significant improvements of 27.6% and 6.9%, respectively. Furthermore, VLDet also demonstrates superior zero-shot performance on closed-set object detection.
Submitted 31 January, 2026;
originally announced February 2026.
-
Too many or too massive? Investigating the high-$z$ demography of active SMBHs from JWST
Authors:
Daniel Roberts,
Francesco Shankar,
Vieri Cammelli,
Fabio Fontanot,
Alessandro Trinca,
Laura Bisigello,
Elena Dalla Bonta,
Hao Fu,
Roberto Gilli,
Andrea Grazian,
Luca Graziani,
Andrea Lapi,
Nicola Menci,
Jan Scholtz,
Karthik Mahesh Varadarajan
Abstract:
Recent JWST observations have unveiled a numerous population of low-luminosity active galactic nuclei (AGN) at $4< z<10$, with space densities roughly an order of magnitude above pre-JWST estimates, and many of these AGN have masses orders of magnitude above the local black hole mass-stellar mass ($M_{\rm BH}-M_{*}$) scaling relations. We investigate the consistency of these observations within a data-driven framework that links the galaxy stellar mass function to the supermassive black hole (SMBH) mass function and AGN luminosity functions using different $M_{\rm BH}-M_{*}$ relations and the observed Eddington-ratio distribution. By comparing our predictions against observed AGN luminosity functions at $z\sim 5.5$ we find that observations can be reproduced either by highly-elevated $M_{\rm BH}-M_{*}$ relations paired with low duty cycles, or moderate relations with higher duty cycles. Through the Soltan argument, we find that $M_{\rm BH}-M_{*}$ relations that are modestly above the local relation for AGN produce consistency between multiple tracers of the SMBH demography at $z\sim 5.5$, while more extreme normalisations would require a weakly-evolving luminosity function at $z> 5.5$. Continuity-equation modelling shows that initially high $M_{\rm BH}-M_{*}$ relations predict a strong two-phase evolutionary scenario and very steep low-mass SMBH mass functions in tension with several current estimates, while more moderate relations generate local SMBH mass functions in better agreement with present determinations and near-constant scaling relations. Our results favour a scenario where SMBHs at $z \sim 5$ on average lie modestly above local AGN scaling relations, with elevated but physically plausible duty cycles. Future wide-field clustering and demographic studies will help break the remaining degeneracies between SMBH scaling relations and AGN duty cycles at early cosmic times.
Submitted 30 January, 2026;
originally announced January 2026.
-
Mapping the Extended Lyman-Alpha Emission within the Circumgalactic Medium of Quasars Hosted by Dusty Starbursts with CubeCarve
Authors:
Kevin Hall,
Hai Fu
Abstract:
We present a study of extended Ly$α$ emission around four quasars hosted by dusty starbursts, which are composite systems thought to represent a transitional stage in quasar evolution. To extract faint CGM emission in the presence of bright point sources, we introduce {\it CubeCarve}, a dual-channel deconvolution algorithm that separates unresolved quasar emission from spatially extended structure. This approach enables reliable recovery of Ly$α$ emission projected onto the quasar position without introducing subtraction artifacts. Using {\it CubeCarve}, we find that the Ly$α$ surface brightness profiles of these systems are, on average, fainter and shallower than those of quasars of similar bolometric luminosities. We also find that the total integrated Ly$α$ luminosities of the nebulae are lower in systems whose host galaxies exhibit brighter far-infrared emission. These results suggest that the CGM conditions in composite systems differ from those in the broader quasar population. Our study highlights both the physical diversity of quasar CGM environments and the effectiveness of {\it CubeCarve} for recovering diffuse emission in modern IFU datasets.
Submitted 29 January, 2026;
originally announced January 2026.
-
SketchDynamics: Exploring Free-Form Sketches for Dynamic Intent Expression in Animation Generation
Authors:
Boyu Li,
Lin-Ping Yuan,
Zeyu Wang,
Hongbo Fu
Abstract:
Sketching provides an intuitive way to convey dynamic intent in animation authoring (i.e., how elements change over time and space), making it a natural medium for automatic content creation. Yet existing approaches often constrain sketches to fixed command tokens or predefined visual forms, overlooking their free-form nature and the central role of humans in shaping intention. To address this, we introduce an interaction paradigm in which users convey dynamic intent to a vision-language model via free-form sketching, instantiated here in a sketch-storyboard-to-motion-graphics workflow. We implement an interface and improve it through a three-stage study with 24 participants. The study shows how sketches convey motion with minimal input, how their inherent ambiguity requires users to be involved for clarification, and how sketches can visually guide video refinement. Our findings reveal the potential of sketch-and-AI interaction to bridge the gap between intention and outcome, and demonstrate its applicability to 3D animation and video generation.
Submitted 28 January, 2026;
originally announced January 2026.
-
Structure-constrained Language-informed Diffusion Model for Unpaired Low-dose Computed Tomography Angiography Reconstruction
Authors:
Genyuan Zhang,
Zihao Wang,
Zhifan Gao,
Lei Xu,
Zhen Zhou,
Haijun Yu,
Jianjia Zhang,
Xiujian Liu,
Weiwei Zhang,
Shaoyu Wang,
Huazhu Fu,
Fenglin Liu,
Weiwen Wu
Abstract:
The application of iodinated contrast media (ICM) improves the sensitivity and specificity of computed tomography (CT) for a wide range of clinical indications. However, an overdose of ICM can cause problems such as kidney damage and life-threatening allergic reactions. Deep learning methods can generate CT images with normal-dose ICM from low-dose ICM, reducing the required dose while maintaining diagnostic power. However, existing methods struggle to achieve accurate enhancement with incompletely paired images, mainly because of their limited ability to recognize specific structures. To overcome this limitation, we propose a Structure-constrained Language-informed Diffusion Model (SLDM), a unified medical generation model that integrates structural synergy and spatial intelligence. First, the structural prior information of the image is effectively extracted to constrain the model inference process, ensuring structural consistency during enhancement. Subsequently, a semantic supervision strategy with spatial intelligence is introduced, integrating visual perception and spatial reasoning to prompt the model toward accurate enhancement. Finally, a subtraction angiography enhancement module is applied to improve the contrast of the ICM region to an interval suitable for observation. Qualitative visual comparisons and quantitative results across several metrics demonstrate the effectiveness of our method in angiographic reconstruction for low-dose contrast-medium CT angiography.
Submitted 28 January, 2026;
originally announced January 2026.
-
Gravity Wave Interactions in the Stratocumulus-Topped Boundary Layer
Authors:
Arun Balakrishna,
Hao Fu,
Parviz Moin,
Morgan O'Neill
Abstract:
This work studies the breakup propensity of the stratocumulus-topped boundary layer (STBL) interacting with gravity waves using large-eddy simulation with a uniform vertical grid of $5$ m and horizontal spacing of $30$ m. A radiative-convective equilibrium (RCE) state is constructed to enforce stationarity in the STBL, and the gravity waves are introduced via a vertical momentum forcing mimicking a packet of plane waves. A nondimensionalization using the inversion height and mean horizontal base wind as length and velocity scales is proposed to provide a framework for analyzing the forcing parameter space. The magnitude of the scaled forcing amplitude ($\mathcal{A}$) is critical to understanding the various STBL breakup conditions. Breakup is classified based on the reduction of the liquid water path for each forced STBL case. We find that breakup does not occur for $\mathcal{A}<1$, and we observe modest reductions in cloud cover for $1<\mathcal{A}<2$, with the deck slowly recovering to the stationary state after the single-period forcing ceases. Fixing $\mathcal{A}\sim 2$ shows that forcings with longer duration and wider locality promote breakup. However, when the forcing is a linear combination of waves of two different periods, the percentage of cleared cloud dramatically increases, though recovery of RCE is still observed in some cases. $\mathcal{A}\geq2.5$ marks a critical threshold beyond which the STBL breaks up entirely and remains patchy. We further explore the connection between these bulk breakup results and the turbulent state by examining energy budgets and the anisotropy induced by the forcing.
Submitted 24 January, 2026;
originally announced January 2026.