-
Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models
Authors:
Yu Jiang,
Hanwen Jiang,
Ahmed Abdelkader,
Wen-Sheng Chu,
Brandon Y. Feng,
Zhangyang Wang,
Qixing Huang
Abstract:
With the emergence of 3D foundation models, there is growing interest in fine-tuning them for downstream tasks, where LoRA is the dominant fine-tuning paradigm. As 3D datasets exhibit distinct variations in texture, geometry, camera motion, and lighting, there are interesting fundamental questions: 1) Are there LoRA subspaces associated with each type of variation? 2) Are these subspaces disentangled (i.e., orthogonal to each other)? 3) How do we compute them effectively? This paper provides answers to all these questions. We introduce a robust approach that generates synthetic datasets with controlled variations, fine-tunes a LoRA adapter on each dataset, and extracts a LoRA sub-space associated with each type of variation. We show that these subspaces are approximately disentangled. Integrating them leads to a reduced LoRA subspace that enables efficient LoRA fine-tuning with improved prediction accuracy for downstream tasks. In particular, we show that such a reduced LoRA subspace, despite being derived entirely from synthetic data, generalizes to real datasets. An ablation study validates the effectiveness of the choices in our approach.
Submitted 11 April, 2026;
originally announced April 2026.
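A note on the entry above: one common way to check whether two subspaces are "approximately disentangled" (close to orthogonal) is to look at the principal angles between them. The following is a minimal sketch under that assumption, not the authors' procedure; the names `texture` and `geometry`, the dimensions, and the random bases are hypothetical stand-ins for subspaces extracted from LoRA adapters.

```python
import numpy as np

def principal_cosines(U, V):
    """Cosines of the principal angles between the column spans of U and V."""
    Qu, _ = np.linalg.qr(U)  # orthonormal basis for span(U)
    Qv, _ = np.linalg.qr(V)  # orthonormal basis for span(V)
    # Singular values of Qu^T Qv are the cosines of the principal angles.
    return np.linalg.svd(Qu.T @ Qv, compute_uv=False)

rng = np.random.default_rng(0)
d, r = 64, 4  # hypothetical ambient dimension and subspace rank

# Build two mutually orthogonal subspaces from one orthonormal basis.
basis, _ = np.linalg.qr(rng.standard_normal((d, 2 * r)))
texture = basis[:, :r]    # hypothetical "texture" subspace
geometry = basis[:, r:]   # hypothetical "geometry" subspace

overlap_orth = principal_cosines(texture, geometry).max()  # near 0: disentangled
overlap_same = principal_cosines(texture, texture).min()   # near 1: identical
```

Cosines near 0 indicate nearly orthogonal subspaces, while cosines near 1 indicate shared directions.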
-
Neural Federated Learning for Livestock Growth Prediction
Authors:
Shoujin Wang,
Mingze Ni,
Wei Liu,
Victor W. Chu,
Bryan Zheng,
Ayush Kanwal,
Roy Jing Yang,
Kenneth Sabir,
Fang Chen
Abstract:
Livestock growth prediction is essential for optimising farm management and improving the efficiency and sustainability of livestock production, yet it remains underexplored due to limited large-scale datasets and privacy concerns surrounding farm-level data. Existing biophysical models rely on fixed formulations, while most machine learning approaches are trained on small, isolated datasets, limiting their robustness and generalisability. To address these challenges, we propose LivestockFL, the first federated learning framework specifically designed for livestock growth prediction. LivestockFL enables collaborative model training across distributed farms without sharing raw data, thereby preserving data privacy while alleviating data sparsity, particularly for farms with limited historical records. The framework employs a neural architecture based on a Gated Recurrent Unit combined with a multilayer perceptron to model temporal growth patterns from historical weight records and auxiliary features. We further introduce LivestockPFL, a novel personalised federated learning framework that extends LivestockFL with a personalised prediction head trained on each farm's local data, producing farm-specific predictors. Experiments on a real-world dataset demonstrate the effectiveness and practicality of the proposed approaches.
Submitted 1 April, 2026; v1 submitted 30 March, 2026;
originally announced March 2026.
-
Summation Formulae for Binomial Moments
Authors:
Marta Na Chen,
Wenchang Chu
Abstract:
By combining the telescoping method with an algebraic relation, four classes of binomial moments are examined. Several explicit summation formulae are established.
Submitted 26 March, 2026;
originally announced March 2026.
-
Frequency Switching Mechanism for Parameter-Efficient Multi-Task Learning
Authors:
Shih-Wen Liu,
Yen-Chang Chen,
Wei-Ta Chu,
Fu-En Yang,
Yu-Chiang Frank Wang
Abstract:
Multi-task learning (MTL) aims to enable a single model to solve multiple tasks efficiently; however, current parameter-efficient fine-tuning (PEFT) methods remain largely limited to single-task adaptation. We introduce Free Sinewich, a parameter-efficient multi-task learning framework that enables near-zero-cost weight modulation via frequency switching (Free). Specifically, a Sine-AWB (Sinewich) layer combines low-rank factors and convolutional priors into a single kernel, which is then modulated elementwise by a sinusoidal transformation to produce task-specialized weights. A lightweight Clock Net is introduced to produce bounded frequencies that stabilize this modulation during training. Theoretically, sine modulation enhances the rank of low-rank adapters, while frequency separation decorrelates the weights of different tasks. On dense prediction benchmarks, Free Sinewich achieves state-of-the-art performance-efficiency trade-offs (e.g., up to +5.39% improvement over single-task fine-tuning with only 6.53M trainable parameters), offering a compact and scalable paradigm based on frequency-based parameter sharing. Project page: https://casperliuliuliu.github.io/projects/Free-Sinewich/
Submitted 22 March, 2026;
originally announced March 2026.
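A note on the entry above: the claim that an elementwise sine raises the rank of a low-rank product can be checked numerically. The sketch below is only an illustration under simplifying assumptions: it omits the convolutional prior and the Clock Net, and the frequency `omega`, the shapes, and the form `sin(omega * B @ A)` are hypothetical choices rather than the paper's exact layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 2  # hypothetical layer width and adapter rank

# Standard low-rank adapter update: rank is at most r.
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))
low_rank = B @ A

# Elementwise sinusoidal modulation with a hypothetical task frequency.
omega = 3.0
modulated = np.sin(omega * low_rank)

rank_before = np.linalg.matrix_rank(low_rank)
rank_after = np.linalg.matrix_rank(modulated)
print(rank_before, rank_after)
```

The elementwise nonlinearity mixes in higher-order Hadamard powers of the low-rank matrix, which is why the numerical rank grows while the number of trainable parameters stays the same.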
-
Multiscale Switch for Semi-Supervised and Contrastive Learning in Medical Ultrasound Image Segmentation
Authors:
Jingguo Qu,
Xinyang Han,
Yao Pu,
Man-Lik Chui,
Simon Takadiyi Gunda,
Ziman Chen,
Jing Qin,
Ann Dorothy King,
Winnie Chiu-Wing Chu,
Jing Cai,
Michael Tin-Cheung Ying
Abstract:
Medical ultrasound image segmentation faces significant challenges due to limited labeled data and characteristic imaging artifacts including speckle noise and low-contrast boundaries. While semi-supervised learning (SSL) approaches have emerged to address data scarcity, existing methods suffer from suboptimal unlabeled data utilization and lack robust feature representation mechanisms. In this paper, we propose Switch, a novel SSL framework with two key innovations: (1) Multiscale Switch (MSS) strategy that employs hierarchical patch mixing to achieve uniform spatial coverage; (2) Frequency Domain Switch (FDS) with contrastive learning that performs amplitude switching in Fourier space for robust feature representations. Our framework integrates these components within a teacher-student architecture to effectively leverage both labeled and unlabeled data. Comprehensive evaluation across six diverse ultrasound datasets (lymph nodes, breast lesions, thyroid nodules, and prostate) demonstrates consistent superiority over state-of-the-art methods. At 5% labeling ratio, Switch achieves remarkable improvements: 80.04% Dice on LN-INT, 85.52% Dice on DDTI, and 83.48% Dice on Prostate datasets, with our semi-supervised approach even exceeding fully supervised baselines. The method maintains parameter efficiency (1.8M parameters) while delivering superior performance, validating its effectiveness for resource-constrained medical imaging applications. The source code is publicly available at https://github.com/jinggqu/Switch
Submitted 19 March, 2026;
originally announced March 2026.
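A note on the entry above: "amplitude switching in Fourier space" is commonly realized by keeping one image's Fourier phase and substituting another image's Fourier amplitude. The sketch below shows that generic operation only; it is an assumption about the general technique, not the authors' FDS implementation, and `switch_amplitude` and the random test images are hypothetical.

```python
import numpy as np

def switch_amplitude(img_a, img_b):
    """Return an image with the Fourier phase of img_a and the amplitude of img_b."""
    Fa = np.fft.fft2(img_a)
    Fb = np.fft.fft2(img_b)
    # Recombine: amplitude from b, phase from a.
    mixed = np.abs(Fb) * np.exp(1j * np.angle(Fa))
    # The recombined spectrum stays conjugate-symmetric for real inputs,
    # so the imaginary part after inversion is numerical noise.
    return np.real(np.fft.ifft2(mixed))

rng = np.random.default_rng(0)
a = rng.random((16, 16))  # hypothetical stand-ins for two ultrasound images
b = rng.random((16, 16))

out = switch_amplitude(a, b)
same = switch_amplitude(a, a)  # switching with itself recovers the input
```

Because phase carries most of the structural layout while amplitude carries intensity statistics, the output keeps the layout of `a` with the spectral magnitude of `b`.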
-
Pressure-induced Superconductivity in AgSbTe2
Authors:
Sudaice Kazibwe,
Bishnu Karki,
Wencheng Lu,
Zhongxin Liang,
Minghong Sui,
Melissa Gooch,
Zhifeng Ren,
Pavan Hosur,
Timothy A. Strobel,
Ching-Wu Chu,
Liangzi Deng
Abstract:
AgSbTe2 is a well-known thermoelectric material with a high Seebeck coefficient and intrinsically low thermal conductivity, but its behavior under pressure remains largely unexplored. Here we report a systematic investigation of the structural, electronic, and transport properties of non-stoichiometric AgSbTe2 under high pressure. At ambient pressure, the material can be described as having a cubic crystal structure that remains stable up to 21.7 GPa beyond which it loses long-range structural order, while its crystal system fully recovers upon decompression. Remarkably, superconductivity emerges at a very low pressure of 0.38 GPa with an onset superconducting critical temperature (Tc) of 3.2 K. Tc increases with increasing pressure, reaching 6.9 K at 31.9 GPa, and peaks at 7.4 K during decompression. Magnetic-field-dependent transport measurements and electronic structure calculations reveal an evolution of the superconducting state driven by an enhanced electronic density of states at the Fermi level under compression. Our findings uncover pressure-induced superconductivity in AgSbTe2 and demonstrate that pressure can effectively tune the electronic ground state of thermoelectric materials, extending their functionality beyond thermoelectric energy conversion.
Submitted 18 March, 2026;
originally announced March 2026.
-
KGS-GCN: Enhancing Sparse Skeleton Sensing via Kinematics-Driven Gaussian Splatting and Probabilistic Topology for Action Recognition
Authors:
Yuhan Chen,
Yicui Shi,
Guofa Li,
Liping Zhang,
Jie Li,
Jiaxin Gao,
Wenbo Chu
Abstract:
Skeleton-based action recognition is widely utilized in sensor systems including human-computer interaction and intelligent surveillance. Nevertheless, current sensor devices typically generate sparse skeleton data as discrete coordinates, which inevitably discards fine-grained spatiotemporal details during highly dynamic movements. Moreover, the rigid constraints of predefined physical sensor topologies hinder the modeling of latent long-range dependencies. To overcome these limitations, we propose KGS-GCN, a graph convolutional network that integrates kinematics-driven Gaussian splatting with probabilistic topology. Our framework explicitly addresses the challenges of sensor data sparsity and topological rigidity by transforming discrete joints into continuous generative representations. Firstly, a kinematics-driven Gaussian splatting module is designed to dynamically construct anisotropic covariance matrices using instantaneous joint velocity vectors. This module enhances visual representation by rendering sparse skeleton sequences into multi-view continuous heatmaps rich in spatiotemporal semantics. Secondly, to transcend the limitations of fixed physical connections, a probabilistic topology construction method is proposed. This approach generates an adaptive prior adjacency matrix by quantifying statistical correlations via the Bhattacharyya distance between joint Gaussian distributions. Ultimately, the GCN backbone is adaptively modulated by the rendered visual features via a visual context gating mechanism. Empirical results demonstrate that KGS-GCN significantly enhances the modeling of complex spatiotemporal dynamics. By addressing the inherent limitations of sparse inputs, our framework offers a robust solution for processing low-fidelity sensor data. This approach establishes a practical pathway for improving perceptual reliability in real-world sensing applications.
Submitted 16 March, 2026;
originally announced March 2026.
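A note on the entry above: the Bhattacharyya distance between two multivariate Gaussians has a standard closed form, D_B = (1/8) (mu1 - mu2)^T S^{-1} (mu1 - mu2) + (1/2) ln(det S / sqrt(det S1 det S2)) with S = (S1 + S2) / 2. The sketch below implements that textbook formula; how KGS-GCN turns these distances into an adjacency matrix is not specified here, and the function name and example values are hypothetical.

```python
import numpy as np

def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between N(mu1, cov1) and N(mu2, cov2)."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    # Mean term: (1/8) * diff^T cov^{-1} diff, using solve instead of an explicit inverse.
    term_mean = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Covariance term, computed with log-determinants for numerical stability.
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term_cov = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return term_mean + term_cov

eye = np.eye(3)
joint_a = np.array([0.0, 0.0, 0.0])  # hypothetical joint positions
joint_b = np.array([2.0, 0.0, 0.0])

d_same = bhattacharyya_gaussian(joint_a, eye, joint_a, eye)
d_shift = bhattacharyya_gaussian(joint_a, eye, joint_b, eye)
```

With identical covariances the covariance term vanishes, so `d_shift` reduces to one eighth of the squared Mahalanobis distance between the means.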
-
LenghuSky-8: An 8-Year All-Sky Cloud Dataset with Star-Aware Masks and Alt-Az Calibration for Segmentation and Nowcasting
Authors:
Yicheng Rui,
Xiao-Wei Duan,
Licai Deng,
Fan Yang,
Zhengming Dang,
Zhengjun Du,
Junhao Peng,
Wenhao Chu,
Umut Mahmut,
Kexin Li,
Yiyun Wu,
Fabo Feng
Abstract:
Ground-based time-domain observatories require minute-by-minute, site-scale awareness of cloud cover, yet existing all-sky datasets are short, daylight-biased, or lack astrometric calibration. We present LenghuSky-8, an eight-year (2018-2025) all-sky imaging dataset from a premier astronomical site, comprising 429,620 $512 \times 512$ frames with 81.2% night-time coverage, star-aware cloud masks, background masks, and per-pixel altitude-azimuth (Alt-Az) calibration. For robust cloud segmentation across day, night, and lunar phases, we train a linear probe on DINOv3 local features and obtain 93.3% $\pm$ 1.1% overall accuracy on a balanced, manually labeled set of 1,111 images. Using stellar astrometry, we map each pixel to local alt-az coordinates and measure calibration uncertainties of approximately 0.37 deg at zenith and approximately 1.34 deg at 30 deg altitude, sufficient for integration with telescope schedulers. Beyond segmentation, we introduce a short-horizon nowcasting benchmark over per-pixel three-class logits (sky/cloud/contamination) with four baselines: persistence (copying the last frame), optical flow, ConvLSTM, and VideoGPT. ConvLSTM performs best but yields only limited gains over persistence, underscoring the difficulty of near-term cloud evolution. We release the dataset, calibrations, and an open-source toolkit for loading, evaluation, and scheduler-ready alt-az maps to boost research in segmentation, nowcasting, and autonomous observatory operations.
Submitted 17 March, 2026;
originally announced March 2026.
-
DyJR: Preserving Diversity in Reinforcement Learning with Verifiable Rewards via Dynamic Jensen-Shannon Replay
Authors:
Long Li,
Zhijian Zhou,
Tianyi Wang,
Weidi Xu,
Zuming Huang,
Wei Chu,
Zhe Wang,
Shirui Pan,
Chao Qu,
Yuan Qi
Abstract:
While Reinforcement Learning (RL) enhances Large Language Model reasoning, on-policy algorithms like GRPO are sample-inefficient as they discard past rollouts. Existing experience replay methods address this by reusing accurate samples for direct policy updates, but this often incurs high computational costs and causes mode collapse via overfitting. We argue that historical data should prioritize sustaining diversity rather than simply reinforcing accuracy. To this end, we propose Dynamic Jensen-Shannon Replay (DyJR), a simple yet effective regularization framework using a dynamic reference distribution from recent trajectories. DyJR introduces two innovations: (1) A Time-Sensitive Dynamic Buffer that uses FIFO and adaptive sizing to retain only temporally proximal samples, synchronizing with model evolution; and (2) Jensen-Shannon Divergence Regularization, which replaces direct gradient updates with a distributional constraint to prevent diversity collapse. Experiments on mathematical reasoning and Text-to-SQL benchmarks demonstrate that DyJR significantly outperforms GRPO as well as baselines such as RLEP and Ex-GRPO, while maintaining training efficiency comparable to the original GRPO. Furthermore, from the perspective of Rank-$k$ token probability evolution, we show that DyJR enhances diversity and mitigates over-reliance on Rank-1 tokens, elucidating how specific sub-modules of DyJR influence the training dynamics.
Submitted 17 March, 2026;
originally announced March 2026.
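A note on the entry above: the Jensen-Shannon divergence used as a regularizer is the symmetrized, bounded relative of KL divergence, JS(p, q) = (1/2) KL(p || m) + (1/2) KL(q || m) with m = (p + q) / 2. The sketch below computes it for discrete distributions; it illustrates the divergence only, not DyJR's buffer or training loop, and the example token distributions are hypothetical.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in nats, skipping entries where p is zero."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats; symmetric and bounded by ln 2."""
    m = 0.5 * (p + q)  # m > 0 wherever p > 0 or q > 0, so the KL terms are finite
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical next-token distributions from a current and a reference policy.
current = np.array([0.7, 0.2, 0.1])
reference = np.array([0.4, 0.4, 0.2])

js_value = js_divergence(current, reference)
js_disjoint = js_divergence(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Unlike KL, the value stays finite even when the two distributions have disjoint support, which is one reason it is convenient as a distributional constraint.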
-
Ambient-pressure 151-K superconductivity in HgBa2Ca2Cu3O8+δ via pressure quench
Authors:
Liangzi Deng,
Thacien Habamahoro,
Artin Safezoddeh,
Bishnu Karki,
Sudaice Kazibwe,
Daniel J. Schulze,
Zheng Wu,
Matthew Julian,
Rohit P. Prasankumar,
Hua Zhou,
Jesse S. Smith,
Pavan R. Hosur,
Ching-Wu Chu
Abstract:
Superconductivity has been a vigorously researched topic since its discovery in 1911. Raising the superconducting transition temperature (Tc) has been the main driving force behind such long-sustained efforts due to its potential for impacting humanity and the fundamental knowledge gained from understanding this macroscopic coherent quantum state at high temperatures. The successful development of high-Tc superconductivity will make possible extraordinarily efficient generation, delivery, and utilization of energy, and could also enable the development of controlled fusion while impacting other burgeoning fields like quantum computation and quantum electronics. However, progress has been hindered by a longstanding plateau in the record ambient-pressure Tc, unchanged since 1993. Subsequent significant advancements in Tc have been achieved only under high pressures, preventing the realization of superconductivity's full potential. To directly address this challenge, we developed a pressure-quench protocol (PQP) to stabilize pressure-induced/-enhanced superconducting states at ambient pressure. Here we achieve a record ambient-pressure Tc of 151 K in the cuprate HgBa2Ca2Cu3O8+δ via PQP. The experimental results are further supported by synchrotron X-ray diffraction measurements and phonon and electronic structure calculations. This breakthrough opens new avenues for stabilizing and exploring ambient-pressure high-Tc superconducting states and other quantum states that have been previously only accessible under pressure, paving the way for deeper understanding and practical applications of high-Tc superconductivity and beyond.
Submitted 12 March, 2026;
originally announced March 2026.
-
Evaluating Few-Shot Pill Recognition Under Visual Domain Shift
Authors:
W. I. Chu,
G. Tarroni,
L. Li
Abstract:
Adverse drug events are a significant source of preventable harm, which has led to the development of automated pill recognition systems to enhance medication safety. Real-world deployment of these systems is hindered by visually complex conditions, including cluttered scenes, overlapping pills, reflections, and diverse acquisition environments. This study investigates few-shot pill recognition from a deployment-oriented perspective, prioritizing generalization under realistic cross-dataset domain shifts over architectural innovation. A two-stage object detection framework is employed, involving base training followed by few-shot fine-tuning. Models are adapted to novel pill classes using one, five, or ten labeled examples per class and are evaluated on a separate deployment dataset featuring multi-object, cluttered scenes. The evaluation focuses on classification-centric and error-based metrics to address heterogeneous annotation strategies. Findings indicate that semantic pill recognition adapts rapidly with few-shot supervision, with classification performance reaching saturation even with a single labeled example. However, stress testing under overlapping and occluded conditions demonstrates a marked decline in localization and recall, despite robust semantic classification. Models trained on visually realistic, multi-pill data consistently exhibit greater robustness in low-shot scenarios, underscoring the importance of training data realism and the diagnostic utility of few-shot fine-tuning for deployment readiness.
Submitted 11 March, 2026;
originally announced March 2026.
-
A dataset of medication images with instance segmentation masks for preventing adverse drug events
Authors:
W. I. Chu,
S. Hirani,
G. Tarroni,
L. Li
Abstract:
Medication errors and adverse drug events (ADEs) pose significant risks to patient safety, often arising from difficulties in reliably identifying pharmaceuticals in real-world settings. AI-based pill recognition models offer a promising solution, but the lack of comprehensive datasets hinders their development. Existing pill image datasets rarely capture real-world complexities such as overlapping pills, varied lighting, and occlusions. MEDISEG addresses this gap by providing instance segmentation annotations for 32 distinct pill types across 8262 images, encompassing diverse conditions from individual pill images to cluttered dosette boxes. We trained YOLOv8 and YOLOv9 on MEDISEG to demonstrate their usability, achieving mean average precision at IoU 0.5 of 99.5 percent on the 3-Pills subset and 80.1 percent on the 32-Pills subset. We further evaluate MEDISEG under a few-shot detection protocol, demonstrating that base training on MEDISEG significantly improves recognition of unseen pill classes in occluded multi-pill scenarios compared to existing datasets. These results highlight the dataset's ability not only to support robust supervised training but also to promote transferable representations under limited supervision, making it a valuable resource for developing and benchmarking AI-driven systems for medication safety.
Submitted 11 March, 2026;
originally announced March 2026.
-
Marginalized Bundle Adjustment: Multi-View Camera Pose from Monocular Depth Estimates
Authors:
Shengjie Zhu,
Ahmed Abdelkader,
Mark J. Matthews,
Xiaoming Liu,
Wen-Sheng Chu
Abstract:
Structure-from-Motion (SfM) is a fundamental 3D vision task for recovering camera parameters and scene geometry from multi-view images. While recent deep learning advances enable accurate Monocular Depth Estimation (MDE) from single images without depending on camera motion, integrating MDE into SfM remains a challenge. Unlike conventional triangulated sparse point clouds, MDE produces dense depth maps with significantly higher error variance. Inspired by modern RANSAC estimators, we propose Marginalized Bundle Adjustment (MBA) to mitigate MDE error variance leveraging its density. With MBA, we show that MDE depth maps are sufficiently accurate to yield SoTA or competitive results in SfM and camera relocalization tasks. Through extensive evaluations, we demonstrate consistently robust performance across varying scales, ranging from few-frame setups to large multi-view systems with thousands of images. Our method highlights the significant potential of MDE in multi-view 3D vision.
Submitted 21 February, 2026;
originally announced February 2026.
-
Noisy nonlocal aggregation model with gradient flow structures
Authors:
Su Yang,
Weiqi Chu,
Panayotis G. Kevrekidis
Abstract:
Interacting particle systems provide a fundamental framework for modeling collective behavior in biological, social, and physical systems. In many applications, stochastic perturbations are essential for capturing environmental variability and individual uncertainty, yet their impact on long-term dynamics and equilibrium structure remains incompletely understood, particularly in the presence of nonlocal interactions. We investigate a stochastic interacting particle system governed by potential-driven interactions and its continuum density formulation in the large-population limit. We introduce an energy functional and show that the macroscopic density evolution has a gradient-flow structure in the Wasserstein-2 space. The associated variational framework yields equilibrium states through constrained energy minimization and illustrates how noise regulates the density and mitigates singular concentration. We demonstrate the connection between microscopic and macroscopic descriptions through numerical examples in one and two dimensions. Within the variational framework, we compute energy minimizers and perform a linear stability analysis. The numerical results show that the stable minimizers agree with the long-time dynamics of the macroscopic density model.
Submitted 3 February, 2026;
originally announced February 2026.
-
Symmetry Adapted Analysis of Screw Dislocation: Electronic Structure and Carrier Recombination Mechanisms in GaN
Authors:
Yuncheng Xie,
Haozhe Shi,
Menglin Huang,
Weibin Chu,
Shiyou Chen,
Xin-Gao Gong
Abstract:
As fundamental one-dimensional defects, screw dislocations profoundly reshape the energy landscape and carrier dynamics of crystalline materials. By restoring the exact algebra of the screw dislocation group, we unveil the latent symmetry constraints that govern the electronic structure, providing a more rigorous physical picture than conventional treatments. When applied to GaN, the method yields a band-connectivity constraint and rigorous dipole selection rules for polarization-resolved transitions. Combined with the computed Hamiltonian matrix, the approach gives symmetry-filtered radiative and dielectric calculations and reveals a piezoelectric effect at the dislocation core that strongly suppresses radiative recombination. The pronounced dominance of non-radiative capture over radiative recombination highlights the detrimental impact of screw dislocations on the luminous efficiency of GaN, providing a theoretical foundation for optimizing dislocation-limited optoelectronic devices.
Submitted 27 January, 2026;
originally announced January 2026.
-
LL-GaussianImage: Efficient Image Representation for Zero-shot Low-Light Enhancement with 2D Gaussian Splatting
Authors:
Yuhan Chen,
Wenxuan Yu,
Guofa Li,
Yijun Xu,
Ying Fang,
Yicui Shi,
Long Cao,
Wenbo Chu,
Keqiang Li
Abstract:
2D Gaussian Splatting (2DGS) is an emerging explicit scene representation method with significant potential for image compression due to high fidelity and high compression ratios. However, existing low-light enhancement algorithms operate predominantly within the pixel domain. Processing 2DGS-compressed images necessitates a cumbersome decompression-enhancement-recompression pipeline, which compromises efficiency and introduces secondary degradation. To address these limitations, we propose LL-GaussianImage, the first zero-shot unsupervised framework designed for low-light enhancement directly within the 2DGS compressed representation domain. Three primary advantages are offered by this framework. First, a semantic-guided Mixture-of-Experts enhancement framework is designed. Dynamic adaptive transformations are applied to the sparse attribute space of 2DGS using rendered images as guidance to enable compression-as-enhancement without full decompression to a pixel grid. Second, a multi-objective collaborative loss function system is established to strictly constrain smoothness and fidelity during enhancement, suppressing artifacts while improving visual quality. Third, a two-stage optimization process is utilized to achieve reconstruction-as-enhancement. The accuracy of the base representation is ensured through single-scale reconstruction and network robustness is enhanced. High-quality enhancement of low-light images is achieved while high compression ratios are maintained. The feasibility and superiority of the paradigm for direct processing within the compressed representation domain are validated through experimental results.
Submitted 22 January, 2026;
originally announced January 2026.
-
LL-GaussianMap: Zero-shot Low-Light Image Enhancement via 2D Gaussian Splatting Guided Gain Maps
Authors:
Yuhan Chen,
Ying Fang,
Guofa Li,
Wenxuan Yu,
Yicui Shi,
Jingrui Zhang,
Kefei Qian,
Wenbo Chu,
Keqiang Li
Abstract:
Significant progress has been made in low-light image enhancement with respect to visual quality. However, most existing methods primarily operate in the pixel domain or rely on implicit feature representations. As a result, the intrinsic geometric structural priors of images are often neglected. 2D Gaussian Splatting (2DGS) has emerged as a prominent explicit scene representation technique characterized by superior structural fitting capabilities and high rendering efficiency. Despite these advantages, the utilization of 2DGS in low-level vision tasks remains unexplored. To bridge this gap, LL-GaussianMap is proposed as the first unsupervised framework incorporating 2DGS into low-light image enhancement. Distinct from conventional methodologies, the enhancement task is formulated as a gain map generation process guided by 2DGS primitives. The proposed method comprises two primary stages. First, high-fidelity structural reconstruction is executed utilizing 2DGS. Then, data-driven enhancement dictionary coefficients are rendered via the rasterization mechanism of Gaussian splatting through an innovative unified enhancement module. This design effectively incorporates the structural perception capabilities of 2DGS into gain map generation, thereby preserving edges and suppressing artifacts during enhancement. Additionally, the reliance on paired data is circumvented through unsupervised learning. Experimental results demonstrate that LL-GaussianMap achieves superior enhancement performance with an extremely low storage footprint, highlighting the effectiveness of explicit Gaussian representations for image enhancement.
Submitted 27 January, 2026; v1 submitted 22 January, 2026;
originally announced January 2026.
-
Simulations and Advancements in MRI-Guided Power-Driven Ferric Tools for Wireless Therapeutic Interventions
Authors:
Wenhui Chu,
Aobo Jin,
Hardik A. Gohel
Abstract:
Designing a robotic system that functions effectively within the specific environment of a Magnetic Resonance Imaging (MRI) scanner requires solving numerous technical issues, such as maintaining the robot's precision and stability under strong magnetic fields. This research focuses on enhancing MRI's role in medical imaging, especially in its application to guide intravascular interventions using robot-assisted devices. A newly developed computational system is introduced, designed for seamless integration with the MRI scanner, including a computational unit and user interface. This system processes MR images to delineate the vascular network, establishing virtual paths and boundaries within vessels to prevent procedural damage. Key findings reveal the system's capability to create tailored magnetic field gradient patterns for device control, considering the vessel's geometry and safety norms, and adapting to different blood flow characteristics for finer navigation. Additionally, the system's modeling aspect assesses the safety and feasibility of navigating pre-set vascular paths. Conclusively, this system, based on the Qt framework and C/C++, with specialized software modules, represents a major step forward in merging imaging technology with robotic aid, significantly enhancing precision and safety in intravascular procedures.
Submitted 4 January, 2026;
originally announced January 2026.
-
A Novel Deep Learning Method for Segmenting the Left Ventricle in Cardiac Cine MRI
Authors:
Wenhui Chu,
Aobo Jin,
Hardik A. Gohel
Abstract:
This research aims to develop a novel deep learning network, GBU-Net, utilizing a group-batch-normalized U-Net framework, specifically designed for the precise semantic segmentation of the left ventricle in short-axis cine MRI scans. The methodology includes a down-sampling pathway for feature extraction and an up-sampling pathway for detail restoration, enhanced for medical imaging. Key modifications include techniques for better contextual understanding crucial in cardiac MRI segmentation. The dataset consists of 805 left ventricular MRI scans from 45 patients, with comparative analysis using established metrics such as the dice coefficient and mean perpendicular distance. GBU-Net significantly improves the accuracy of left ventricle segmentation in cine MRI scans. Its design outperforms existing methods in tests on standard metrics such as the dice coefficient and mean perpendicular distance. The approach is unique in its ability to capture contextual information, often missed in traditional CNN-based segmentation. An ensemble of GBU-Net models attains a 97% dice score on the SunnyBrook testing dataset. GBU-Net offers enhanced precision and contextual understanding in left ventricle segmentation for surgical robotics and medical analysis.
Submitted 4 January, 2026;
originally announced January 2026.
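The dice coefficient cited as an evaluation metric in the abstract above is the standard overlap score 2|A∩B| / (|A| + |B|) between a predicted mask A and a ground-truth mask B. The following is a minimal sketch of that formula only, assuming flat binary masks stored as Python lists; the function name `dice_coefficient` and the convention of returning 1.0 for two empty masks are illustrative assumptions, not taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice overlap 2|A∩B| / (|A| + |B|) for two flat binary masks (lists of 0/1).

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels that are foreground in both masks.
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    ground_truth = [1, 0, 0, 0]
    print(dice_coefficient(predicted, ground_truth))
```

A score of 1.0 means the masks coincide exactly and 0.0 means they share no foreground pixels.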
-
Family of High-Chern-Number Orbital Magnets in Twisted Rhombohedral Graphene
Authors:
Xirui Wang,
L. Antonio Benítez,
Vo Tien Phong,
Wai In Chu,
Kenji Watanabe,
Takashi Taniguchi,
Cyprian Lewandowski,
Pablo Jarillo-Herrero
Abstract:
Realizing Chern insulators with Chern numbers greater than one remains a major goal in quantum materials research. Such platforms promise multichannel dissipationless chiral transport and access to correlated phases beyond the conventional C = 1 paradigm. Here, we discover a family of high-Chern-number orbital magnets in twisted monolayer-multilayer rhombohedral graphene, denoted (1+n) with n = 3, 4, and 5. Magnetotransport measurements show pronounced anomalous Hall effects at one and three electrons per moiré unit cell when they are polarized away from the moiré interface. Across the (1+n) systems, we observe a clear topological hierarchy C = n, revealed by the Středa trajectories and the quantized Hall resistance. Our experimental observations are supported by self-consistent mean-field calculations. Moreover, we realize both electrical and magnetic switching of the high-Chern-number states by flipping the valley polarization. Together, these results establish a tunable hierarchy of orbital Chern magnets in twisted rhombohedral graphene, offering systematic control of Chern number and topology through layer engineering in pristine graphene moiré systems.
Submitted 3 January, 2026;
originally announced January 2026.
-
Simulations of MRI Guided and Powered Ferric Applicators for Tetherless Delivery of Therapeutic Interventions
Authors:
Wenhui Chu,
Khang Tran,
Nikolaos V. Tsekos
Abstract:
Magnetic Resonance Imaging (MRI) is a well-established modality for pre-operative planning and is also explored for intra-operative guidance of procedures such as intravascular interventions. Among the experimental robot-assisted technologies, the magnetic field gradients of the MRI scanner are used to power and maneuver ferromagnetic applicators for accessing sites in the patient's body via the vascular network. In this work, we propose a computational platform for preoperative planning and modeling of MRI-powered applicators inside blood vessels. This platform was implemented as a two-way data and command pipeline that links the MRI scanner, the computational core, and the operator. The platform first processes multi-slice MR data to extract the vascular bed and then fits a virtual corridor inside the vessel. This corridor serves as a virtual fixture (VF), a forbidden region for the applicators to avoid vessel perforation or collision. The geometric features of the vessel centerline, the VF, and MRI safety compliance (dB/dt, max available gradient) are then used to generate magnetic field gradient waveforms. Different blood flow profiles can be user-selected, and those parameters are used for modeling the applicator's maneuvering. The modeling module further generates cues about whether the selected vascular path can be safely maneuvered. Given future experimental studies that require a real-time operation, the platform was implemented on the Qt framework (C/C++) with software modules performing specific tasks running on dedicated threads: PID controller, generation of VF, generation of MR gradient waveforms.
Submitted 2 January, 2026;
originally announced January 2026.
-
Two Deep Learning Approaches for Automated Segmentation of Left Ventricle in Cine Cardiac MRI
Authors:
Wenhui Chu,
Nikolaos V. Tsekos
Abstract:
Left ventricle (LV) segmentation is critical for clinical quantification and diagnosis of cardiac images. In this work, we propose two novel deep learning architectures called LNU-Net and IBU-Net for left ventricle segmentation from short-axis cine MRI images. LNU-Net is derived from layer normalization (LN) U-Net architecture, while IBU-Net is derived from the instance-batch normalized (IB) U-Net for medical image segmentation. The architectures of LNU-Net and IBU-Net have a down-sampling path for feature extraction and an up-sampling path for precise localization. We use the original U-Net as the baseline segmentation approach and compare it with our proposed architectures. The two architectures differ in how they normalize features: LNU-Net applies layer normalization in each convolutional block, while IBU-Net incorporates instance and batch normalization together in the first convolutional block and passes its result to the next layer. Our method incorporates affine transformations and elastic deformations for image data processing. Our dataset, which contains 805 left ventricle MRI images from 45 patients, is used for evaluation. Experimental results show that the proposed approaches outperform other state-of-the-art approaches in terms of the dice coefficient and the average perpendicular distance.
Submitted 2 January, 2026;
originally announced January 2026.
-
Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
Authors:
Qihao Liu,
Chengzhi Mao,
Yaojie Liu,
Alan Yuille,
Wen-Sheng Chu
Abstract:
Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
Submitted 18 December, 2025;
originally announced December 2025.
-
CTkvr: KV Cache Retrieval for Long-Context LLMs via Centroid then Token Indexing
Authors:
Kuan Lu,
Shuhang Lin,
Sai Wu,
Yichen Yao,
Junhan Yang,
Huan Li,
Wei Chu,
Xu Yinghui,
Yuan Qi,
Gang Chen
Abstract:
Large language models (LLMs) are increasingly applied in long-context scenarios such as multi-turn conversations. However, long contexts pose significant challenges for inference efficiency, including high memory overhead from Key-Value (KV) cache and increased latency due to excessive memory accesses. Recent methods for dynamic KV selection struggle with trade-offs: block-level indexing degrades accuracy by retrieving irrelevant KV entries, while token-level indexing incurs high latency from inefficient retrieval mechanisms. In this paper, we propose CTKVR, a novel centroid-then-token KV retrieval scheme that addresses these limitations. CTKVR leverages a key observation: query vectors adjacent in position exhibit high similarity after Rotary Position Embedding (RoPE) and share most of their top-k KV cache entries. Based on this insight, CTKVR employs a two-stage retrieval strategy: lightweight centroids are precomputed during prefilling for centroid-grained indexing, followed by token-level refinement for precise KV retrieval. This approach balances retrieval efficiency and accuracy. To further enhance performance, we implement an optimized system for indexing construction and search using CPU-GPU co-execution. Experimentally, CTKVR achieves superior performance across multiple benchmarks with less than 1% accuracy degradation. Meanwhile, CTKVR delivers 3 times and 4 times throughput speedups on Llama-3-8B and Yi-9B at 96K context length across diverse GPU hardware.
Submitted 17 December, 2025;
originally announced December 2025.
-
Mull-Tokens: Modality-Agnostic Latent Thinking
Authors:
Arijit Ray,
Ahmed Abdelkader,
Chengzhi Mao,
Bryan A. Plummer,
Kate Saenko,
Ranjay Krishna,
Leonidas Guibas,
Wen-Sheng Chu
Abstract:
Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative -- Mull-Tokens -- modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offers a simple solution to abstractly think in multiple modalities.
Submitted 11 December, 2025;
originally announced December 2025.
-
SmartAlert: Implementing Machine Learning-Driven Clinical Decision Support for Inpatient Lab Utilization Reduction
Authors:
April S. Liang,
Fatemeh Amrollahi,
Yixing Jiang,
Conor K. Corbin,
Grace Y. E. Kim,
David Mui,
Trevor Crowell,
Aakash Acharya,
Sreedevi Mony,
Soumya Punnathanam,
Jack McKeown,
Margaret Smith,
Steven Lin,
Arnold Milstein,
Kevin Schulman,
Jason Hom,
Michael A. Pfeffer,
Tho D. Pham,
David Svec,
Weihan Chu,
Lisa Shieh,
Christopher Sharp,
Stephen P. Ma,
Jonathan H. Chen
Abstract:
Repetitive laboratory testing unlikely to yield clinically useful information is a common practice that burdens patients and increases healthcare costs. Education and feedback interventions have limited success, while general test ordering restrictions and electronic alerts impede appropriate clinical care. We introduce and evaluate SmartAlert, a machine learning (ML)-driven clinical decision support (CDS) system integrated into the electronic health record that predicts stable laboratory results to reduce unnecessary repeat testing. This case study describes the implementation process, challenges, and lessons learned from deploying SmartAlert targeting complete blood count (CBC) utilization in a randomized controlled pilot across 9270 admissions in eight acute care units across two hospitals between August 15, 2024, and March 15, 2025. Results show significant decrease in number of CBC results within 52 hours of SmartAlert display (1.54 vs 1.82, p <0.01) without adverse effect on secondary safety outcomes, representing a 15% relative reduction in repetitive testing. Implementation lessons learned include interpretation of probabilistic model predictions in clinical contexts, stakeholder engagement to define acceptable model behavior, governance processes for deploying a complex model in a clinical environment, user interface design considerations, alignment with clinical operational priorities, and the value of qualitative feedback from end users. In conclusion, a machine learning-driven CDS system backed by a deliberate implementation and governance process can provide precision guidance on inpatient laboratory testing to safely reduce unnecessary repetitive testing.
Submitted 3 December, 2025;
originally announced December 2025.
-
A Lightweight Real-Time Low-Light Enhancement Network for Embedded Automotive Vision Systems
Authors:
Yuhan Chen,
Yicui Shi,
Guofa Li,
Guangrui Bai,
Jinyuan Shao,
Xiangfei Huang,
Wenbo Chu,
Keqiang Li
Abstract:
In low-light environments like nighttime driving, image degradation severely challenges in-vehicle camera safety. Since existing enhancement algorithms are often too computationally intensive for vehicular applications, we propose UltraFast-LieNET, a lightweight multi-scale shifted convolutional network for real-time low-light image enhancement. We introduce a Dynamic Shifted Convolution (DSConv) kernel with only 12 learnable parameters for efficient feature extraction. By integrating DSConv with varying shift distances, a Multi-scale Shifted Residual Block (MSRB) is constructed to significantly expand the receptive field. To mitigate lightweight network instability, a residual structure and a novel multi-level gradient-aware loss function are incorporated. UltraFast-LieNET allows flexible parameter configuration, with a minimum size of only 36 parameters. Results on the LOLI-Street dataset show a PSNR of 26.51 dB, outperforming state-of-the-art methods by 4.6 dB while utilizing only 180 parameters. Experiments across four benchmark datasets validate its superior balance of real-time performance and enhancement quality under limited resources. Code is available at https://github.com/YuhanChen2024/UltraFast-LiNET
Submitted 2 December, 2025;
originally announced December 2025.
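The PSNR figures quoted in the abstract above follow the standard definition PSNR = 10 · log10(MAX² / MSE). The sketch below illustrates that formula only, assuming 8-bit pixel values held in flat Python lists; the function name `psnr` and the `max_value` default of 255 are illustrative assumptions and are not drawn from the UltraFast-LieNET code.

```python
import math


def psnr(reference, enhanced, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists.

    Hypothetical helper for illustration; assumes 8-bit pixels by default.
    """
    if len(reference) != len(enhanced) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    clean = [100, 120, 140, 160]
    noisy = [102, 118, 143, 157]
    print(round(psnr(clean, noisy), 2))
```

Because the scale is logarithmic, a gain of 4.6 dB corresponds to a mean squared error roughly 2.9 times lower (10^0.46 ≈ 2.88).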
-
MOTION: ML-Assisted On-Device Low-Latency Motion Recognition
Authors:
Veeramani Pugazhenthi,
Wei-Hsiang Chu,
Junwei Lu,
Jadyn N. Miyahira,
Mahdi Eslamimehr,
Pratik Satam,
Rozhin Yasaei,
Soheil Salehi
Abstract:
The use of tiny devices capable of low-latency gesture recognition is gaining momentum in everyday human-computer interaction and especially in medical monitoring fields. Embedded solutions such as fall detection, rehabilitation tracking, and patient supervision require fast and efficient tracking of movements while avoiding unwanted false alarms. This study presents an efficient approach to building motion-based models using only triaxial accelerometer sensors. We explore the capability of AutoML pipelines to extract the most important features from the data segments. This approach also involves training multiple lightweight machine learning algorithms using the extracted features. We use WeBe Band, a multi-sensor wearable device equipped with an MCU powerful enough to perform gesture recognition entirely on the device. Of the models explored, we found that the neural network provided the best balance between accuracy, latency, and memory use. Our results also demonstrate that reliable real-time gesture recognition can be achieved on WeBe Band, with great potential for real-time medical monitoring solutions that require a secure and fast response time.
Submitted 9 February, 2026; v1 submitted 13 October, 2025;
originally announced December 2025.
-
Layer-Aware Video Composition via Split-then-Merge
Authors:
Ozgur Kara,
Yujia Chen,
Ming-Hsuan Yang,
James M. Rehg,
Wen-Sheng Chu,
Du Tran
Abstract:
We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods relying on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that utilizes a multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show StM outperforms SoTA methods in both quantitative benchmarks and in humans/VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io
Submitted 25 November, 2025;
originally announced November 2025.
-
Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing
Authors:
Baode Wang,
Biao Wu,
Weizhen Li,
Meng Fang,
Zuming Huang,
Jun Huang,
Haozhe Wang,
Yanjie Liang,
Ling Chen,
Wei Chu,
Yuan Qi
Abstract:
Document parsing from scanned images into structured formats remains a significant challenge due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Existing supervised fine-tuning methods often struggle to generalize across diverse document types, leading to poor performance, particularly on out-of-distribution data. This issue is further exacerbated by the limited availability of high-quality training data for layout-aware parsing tasks. To address these challenges, we introduce LayoutRL, a reinforcement learning framework that optimizes layout understanding through composite rewards integrating normalized edit distance, paragraph count accuracy, and reading order preservation. To support this training, we construct the Infinity-Doc-400K dataset, which we use to train Infinity-Parser, a vision-language model demonstrating robust generalization across various domains. Extensive evaluations on benchmarks including OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet show that Infinity-Parser consistently achieves state-of-the-art performance across a broad range of document types, languages, and structural complexities, substantially outperforming both specialized document parsing systems and general-purpose vision-language models. We will release our code, dataset, and model to facilitate reproducible research in document parsing.
Submitted 20 October, 2025; v1 submitted 17 October, 2025;
originally announced October 2025.
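One component of the composite reward named in the abstract above is normalized edit distance. The following is a rough sketch of how such a term could be computed, assuming character-level Levenshtein distance normalized by the longer string's length and turned into a reward as one minus that value. These choices, and the names `levenshtein`, `normalized_edit_distance`, and `edit_reward`, are assumptions for illustration; the abstract does not specify the exact formulation used in LayoutRL.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_edit_distance(predicted, reference):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(predicted), len(reference))
    return 0.0 if longest == 0 else levenshtein(predicted, reference) / longest


def edit_reward(predicted, reference):
    """Hypothetical reward term: 1.0 for an exact match, lower as text diverges."""
    return 1.0 - normalized_edit_distance(predicted, reference)


if __name__ == "__main__":
    print(edit_reward("# Title\nBody text", "# Title\nBody test"))
```

In a composite reward, a term like this would be combined with the other components the abstract lists (paragraph count accuracy and reading order preservation), with weights that are not given in the abstract.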
-
Efficient Construction of Implicit Surface Models From a Single Image for Motion Generation
Authors:
Wei-Teng Chu,
Tianyi Zhang,
Matthew Johnson-Roberson,
Weiming Zhi
Abstract:
Implicit representations have been widely applied in robotics for obstacle avoidance and path planning. In this paper, we explore the problem of constructing an implicit distance representation from a single image. Past methods for implicit surface reconstruction, such as NeuS and its variants generally require a large set of multi-view images as input, and require long training times. In this work, we propose Fast Image-to-Neural Surface (FINS), a lightweight framework that can reconstruct high-fidelity surfaces and SDF fields based on a single or a small set of images. FINS integrates a multi-resolution hash grid encoder with lightweight geometry and color heads, making the training via an approximate second-order optimizer highly efficient and capable of converging within a few seconds. Additionally, we achieve the construction of a neural surface requiring only a single RGB image, by leveraging pre-trained foundation models to estimate the geometry inherent in the image. Our experiments demonstrate that under the same conditions, our method outperforms state-of-the-art baselines in both convergence speed and accuracy on surface reconstruction and SDF field estimation. Moreover, we demonstrate the applicability of FINS for robot surface following tasks and show its scalability to a variety of benchmark datasets. Code is publicly available at https://github.com/waynechu1109/FINS.
Submitted 11 March, 2026; v1 submitted 24 September, 2025;
originally announced September 2025.
-
Nonreciprocal optical circuit switching
Authors:
Zhifeng Tu,
Yucong Yang,
Yiran Wei,
Shuyuan Liu,
Fangchen Hu,
Peng Zou,
Chengkun Yang,
Tianchi Zhang,
Di Wu,
Ruoyu Shen,
Bingzhou Hong,
Haiwen Cai,
Lei Bi,
Wei Chu
Abstract:
Directly switching optical signals outperforms conventional optoelectronic hardware in terms of cost, latency, and energy efficiency, and is expected to address the growing demand for data node capacity driven by the development of machine learning and artificial intelligence (AI) technologies. Therefore, optical circuit switching (OCS) technology has piqued widespread research interest in various technical solutions, including silicon photonics. However, silicon-based integrated OCS remains constrained by challenges such as network performance and port scalability. Here we propose a magneto-optical heterogeneous integrated nonreciprocal OCS (NOCS) network based on a silicon photonics platform, achieving bidirectional full-duplex nonreciprocal transmission by programming reciprocal and nonreciprocal phase shifters. We demonstrate that, compared with the existing OCS architecture, NOCS has the advantages of ultra-high reconfiguration speed, large-scale integration compatibility, and bidirectional channel isolation, which reduces the number of required ports. NOCS could meet the programming speed requirements of the AI backend network, or support nonreciprocal optical switching applications without multiplexing technology.
Submitted 24 September, 2025;
originally announced September 2025.
-
The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward
Authors:
Long Li,
Zhijian Zhou,
Jiaran Hao,
Jason Klein Liu,
Yanting Miao,
Wei Pang,
Xiaoyu Tan,
Wei Chu,
Zhe Wang,
Shirui Pan,
Chao Qu,
Yuan Qi
Abstract:
A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance (Pass@k) despite improvements in single-attempt accuracy (Pass@1). This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. While various methods have been proposed, the choice and function of the divergence term have been surprisingly unexamined as a proactive solution. We argue that standard RLVR objectives -- both those using the mode-seeking reverse KL-divergence and those forgoing a divergence term entirely -- lack a crucial mechanism for knowledge retention. The reverse-KL actively accelerates this decay by narrowing the policy, while its absence provides no safeguard against the model drifting from its diverse knowledge base. We propose a fundamental shift in perspective: using the divergence term itself as the solution. Our framework, Diversity-Preserving Hybrid RL (DPH-RL), leverages mass-covering f-divergences (like forward-KL and JS-divergence) to function as a rehearsal mechanism. By continuously referencing the initial policy, this approach forces the model to maintain broad solution coverage. Extensive experiments on math and SQL generation demonstrate that DPH-RL not only resolves the Pass@k degradation but improves both Pass@1 and Pass@k in- and out-of-domain. Additionally, DPH-RL is more training-efficient because it computes f-divergence using generator functions, requiring only sampling from the initial policy and no online reference model. Our work highlights a crucial, overlooked axis for improving RLVR, demonstrating that the proper selection of a divergence measure is a powerful tool for building more general and diverse reasoning models.
Submitted 3 March, 2026; v1 submitted 9 September, 2025;
originally announced September 2025.
-
MedFactEval and MedAgentBrief: A Framework and Workflow for Generating and Evaluating Factual Clinical Summaries
Authors:
François Grolleau,
Emily Alsentzer,
Timothy Keyes,
Philip Chung,
Akshay Swaminathan,
Asad Aali,
Jason Hom,
Tridu Huynh,
Thomas Lew,
April S. Liang,
Weihan Chu,
Natasha Z. Steele,
Christina F. Lin,
Jingkun Yang,
Kameron C. Black,
Stephen P. Ma,
Fateme N. Haredasht,
Nigam H. Shah,
Kevin Schulman,
Jonathan H. Chen
Abstract:
Evaluating factual accuracy in Large Language Model (LLM)-generated clinical text is a critical barrier to adoption, as expert review is unscalable for the continuous quality assurance these systems require. We address this challenge with two complementary contributions. First, we introduce MedFactEval, a framework for scalable, fact-grounded evaluation where clinicians define high-salience key facts and an "LLM Jury"--a multi-LLM majority vote--assesses their inclusion in generated summaries. Second, we present MedAgentBrief, a model-agnostic, multi-step workflow designed to generate high-quality, factual discharge summaries. To validate our evaluation framework, we established a gold-standard reference using a seven-physician majority vote on clinician-defined key facts from inpatient cases. The MedFactEval LLM Jury achieved almost perfect agreement with this panel (Cohen's kappa=81%), a performance statistically non-inferior to that of a single human expert (kappa=67%, P < 0.001). Our work provides both a robust evaluation framework (MedFactEval) and a high-performing generation workflow (MedAgentBrief), offering a comprehensive approach to advance the responsible deployment of generative AI in clinical workflows.
Submitted 6 September, 2025;
originally announced September 2025.
-
OleSpeech-IV: A Large-Scale Multispeaker and Multilingual Conversational Speech Dataset with Diverse Topics
Authors:
Wei Chu,
Yuanzhe Dong,
Ke Tan,
Dong Han,
Xavier Menendez-Pidal,
Ruchao Fan,
Chenfeng Miao,
Chanwoo Kim,
Bhiksha Raj,
Rita Singh
Abstract:
The OleSpeech-IV dataset is a large-scale multispeaker and multilingual conversational speech dataset with diverse topics. The audio content comes from publicly available English podcasts, talk shows, teleconferences, and other conversations. Speaker names, turns, and transcripts are human-sourced and refined by a proprietary pipeline, while additional information such as timestamps and confidence scores is derived from the pipeline. The IV denotes its position as Tier IV in the Olewave dataset series. In addition, we have open-sourced a subset, OleSpeech-IV-2025-EN-AR-100, for non-commercial research use.
Submitted 4 September, 2025;
originally announced September 2025.
-
400-Gbps/$λ$ Ultrafast Silicon Microring Modulator for Scalable Optical Compute Interconnects
Authors:
Fangchen Hu,
Fengxin Yu,
Xingyu Liu,
Aoxue Wang,
Xiao Hu,
Haiwen Cai,
Wei Chu
Abstract:
The exponential growth of artificial intelligence (AI) workloads is driving an urgent demand for optical interconnects with ultrahigh bandwidth, energy efficiency, and scalability. Silicon photonics, with its CMOS compatibility and wafer-scale manufacturability, has emerged as a promising platform for optical interconnect architectures. Silicon microring modulators (MRMs), with their compact footprint, low power consumption, and high modulation efficiency, have become ideal devices for modulation in interconnects. However, silicon MRMs have so far been constrained by the trade-off between modulation efficiency and bandwidth, hindering their potential for 400 Gbps-per-wavelength operation. To mitigate this trade-off, here we demonstrate a wafer-level fabricated and high-bandwidth silicon MRM with a novel heavily-doped trench-integrated structure on a 300-mm silicon photonic platform, achieving both outstanding device performance and remarkable wafer-scale uniformity. Exploiting dual operation modes (self-biasing for energy-efficient scale-up interconnects and depletion driving for ultrafast scale-out links), the device supports error-free 32-Gbps NRZ transmission over 2-km SSMF with only 0.43-Vpp drive and zero electrical bias, yielding energy efficiency of 0.97 fJ/bit without DSP. At higher swings, it further supports 280-Gbps PAM4 and error-free 80-Gbps NRZ optical modulation. For scale-out interconnects, open eye diagrams are achieved at 200 Gbps (NRZ), 360 Gbps (PAM4), and a record 400 Gbps (PAM6), establishing the first wafer-scale silicon MRM solution reaching 400 Gbps/$λ$. The sub-fJ/bit energy efficiency and high bandwidth demonstrated in this work establish the MRM as a scalable, high-performance solution for next-generation optical interconnect architectures in AI computing networks.
Submitted 1 September, 2025;
originally announced September 2025.
-
What Do Bouncing Balls Tell Us About the Universe? A Journey into Billiard Systems
Authors:
Weiqi Chu,
Matthew Dobson
Abstract:
Have you ever played or watched a game of pool? If so, you have already seen a billiard system in action. In mathematics and physics, a billiard system describes a ball that moves in straight lines and bounces off walls. Despite these simple rules, billiard systems can produce remarkably rich behaviors: some table shapes generate regular, periodic patterns, while others give rise to complete chaos. Scientists also study what happens when we shrink the ball down to the size of an electron to a world where quantum effects take over and the familiar reflection rules no longer apply. In this article, we discuss billiard systems in their many forms and show how such a simple setup can reveal fundamental insights into the behavior of nature at both classical and quantum scales.
Submitted 25 August, 2025;
originally announced August 2025.
-
The Complexity of Extreme Climate Events on the New Zealand's Kiwifruit Industry
Authors:
Boyuan Zheng,
Victor W. Chu,
Zhidong Li,
Evan Webster,
Ashley Rootsey
Abstract:
Climate change has intensified the frequency and severity of extreme weather events, presenting unprecedented challenges to the agricultural industry worldwide. In this investigation, we focus on kiwifruit farming in New Zealand. We propose to examine the impacts of climate-induced extreme events, specifically frost, drought, extreme rainfall, and heatwave, on kiwifruit harvest yields. These four events were selected due to their significant impacts on crop productivity and their prevalence as recorded by climate monitoring institutions in the country. We employed Isolation Forest, an unsupervised anomaly detection method, to analyse climate history and recorded extreme events alongside kiwifruit yields. Our analysis reveals considerable variability in how different types of extreme events affect kiwifruit yields, underscoring notable discrepancies between climatic extremes and individual farms' yield outcomes. Additionally, our study highlights critical limitations of current anomaly detection approaches, particularly in accurately identifying events such as frost. These findings emphasise the need for integrating supplementary features like farm management strategies with climate adaptation practices. Our further investigation will employ ensemble methods that consolidate nearby farms' yield data and regional climate station features to reduce variance, thereby enhancing the accuracy and reliability of extreme event detection and the formulation of response strategies.
Submitted 4 August, 2025;
originally announced August 2025.
-
Multi-Hazard Early Warning Systems for Agriculture with Featural-Temporal Explanations
Authors:
Boyuan Zheng,
Victor W. Chu
Abstract:
Climate extremes present escalating risks to agriculture, intensifying the need for reliable multi-hazard early warning systems (EWS). The situation is evolving due to climate change, and hence such systems should have the intelligence to continue to learn from recent climate behaviours. However, traditional single-hazard forecasting methods fall short in capturing complex interactions among concurrent climatic events. To address this deficiency, in this paper, we combine sequential deep learning models and advanced Explainable Artificial Intelligence (XAI) techniques to introduce a multi-hazard forecasting framework for agriculture. In our experiments, we utilize meteorological data from four prominent agricultural regions in the United States (between 2010 and 2023) to validate the predictive accuracy of our framework on multiple severe event types, which are extreme cold, floods, frost, hail, heatwaves, and heavy rainfall, with tailored models for each area. The framework uniquely integrates attention mechanisms with TimeSHAP (a recurrent XAI explainer for time series) to provide comprehensive temporal explanations revealing not only which climatic features are influential but precisely when their impacts occur. Our results demonstrate strong predictive accuracy, particularly with the BiLSTM architecture, and highlight the system's capacity to inform nuanced, proactive risk management strategies. This research significantly advances the explainability and applicability of multi-hazard EWS, fostering interdisciplinary trust and effective decision-making processes for climate risk management in the agricultural industry.
Submitted 30 July, 2025;
originally announced July 2025.
-
MVA 2025 Small Multi-Object Tracking for Spotting Birds Challenge: Dataset, Methods, and Results
Authors:
Yuki Kondo,
Norimichi Ukita,
Riku Kanayama,
Yuki Yoshida,
Takayuki Yamaguchi,
Xiang Yu,
Guang Liang,
Xinyao Liu,
Guan-Zhang Wang,
Wei-Ta Chu,
Bing-Cheng Chuang,
Jia-Hua Lee,
Pin-Tseng Kuo,
I-Hsuan Chu,
Yi-Shein Hsiao,
Cheng-Han Wu,
Po-Yi Wu,
Jui-Chien Tsou,
Hsuan-Chi Liu,
Chun-Yi Lee,
Yuan-Fu Yang,
Kosuke Shigematsu,
Asuka Shin,
Ba Tran
Abstract:
Small Multi-Object Tracking (SMOT) is particularly challenging when targets occupy only a few dozen pixels, rendering detection and appearance-based association unreliable. Building on the success of the MVA2023 SOD4SB challenge, this paper introduces the SMOT4SB challenge, which leverages temporal information to address limitations of single-frame detection. Our three main contributions are: (1) the SMOT4SB dataset, consisting of 211 UAV video sequences with 108,192 annotated frames under diverse real-world conditions, designed to capture motion entanglement where both camera and targets move freely in 3D; (2) SO-HOTA, a novel metric combining Dot Distance with HOTA to mitigate the sensitivity of IoU-based metrics to small displacements; and (3) a competitive MVA2025 challenge with 78 participants and 308 submissions, where the winning method achieved a 5.1x improvement over the baseline. This work lays a foundation for advancing SMOT in UAV scenarios with applications in bird strike avoidance, agriculture, fisheries, and ecological monitoring.
Submitted 17 July, 2025;
originally announced July 2025.
-
Discrete Diffusion Trajectory Alignment via Stepwise Decomposition
Authors:
Jiaqi Han,
Austin Wang,
Minkai Xu,
Wenda Chu,
Meihua Dang,
Haotian Ye,
Huayu Chen,
Yisong Yue,
Stefano Ermon
Abstract:
Discrete diffusion models have demonstrated great promise in modeling various sequence data, ranging from human language to biological sequences. Inspired by the success of RL in language models, there is growing interest in further improving the models by alignment with a certain reward. In this work, we propose an offline preference optimization method to approach trajectory alignment for discrete diffusion models. Instead of applying the reward on the final output and backpropagating the gradient to the entire denoising process, we decompose the problem into a set of stepwise alignment objectives by matching the per-step posterior. This framework enables efficient diffusion optimization, is compatible with arbitrary reward functions, and importantly, yields an equivalent optimal solution under additive factorization of the trajectory reward. Experiments across multiple domains including DNA sequence design, protein inverse folding, and language modeling consistently demonstrate the superiority of our approach. Notably, it achieves up to a 12% improvement over the most competitive RL-based baseline in terms of predicted activity on DNA sequence design, and further improves the GSM8K score from 78.6 to 81.2 on LLaDA-8B-Instruct for language modeling.
Submitted 31 January, 2026; v1 submitted 7 July, 2025;
originally announced July 2025.
-
Histopathology Image Report Generation by Vision Language Model with Multimodal In-Context Learning
Authors:
Shih-Wen Liu,
Hsuan-Yu Fan,
Wei-Ta Chu,
Fu-En Yang,
Yu-Chiang Frank Wang
Abstract:
Automating medical report generation from histopathology images is a critical challenge requiring effective visual representations and domain-specific knowledge. Inspired by the common practices of human experts, we propose an in-context learning framework called PathGenIC that integrates context derived from the training set with a multimodal in-context learning (ICL) mechanism. Our method dynamically retrieves semantically similar whole slide image (WSI)-report pairs and incorporates adaptive feedback to enhance contextual relevance and generation quality. Evaluated on the HistGen benchmark, the framework achieves state-of-the-art results, with significant improvements across BLEU, METEOR, and ROUGE-L metrics, and demonstrates robustness across diverse report lengths and disease categories. By maximizing training data utility and bridging vision and language with ICL, our work offers a solution for AI-driven histopathology reporting, setting a strong foundation for future advancements in multimodal clinical applications.
Submitted 21 June, 2025;
originally announced June 2025.
-
Generative 4D Scene Gaussian Splatting with Object View-Synthesis Priors
Authors:
Wen-Hsuan Chu,
Lei Ke,
Jianmeng Liu,
Mingxiao Huo,
Pavel Tokmakov,
Katerina Fragkiadaki
Abstract:
We tackle the challenge of generating dynamic 4D scenes from monocular, multi-object videos with heavy occlusions, and introduce GenMOJO, a novel approach that integrates rendering-based deformable 3D Gaussian optimization with generative priors for view synthesis. While existing models perform well on novel view synthesis for isolated objects, they struggle to generalize to complex, cluttered scenes. To address this, GenMOJO decomposes the scene into individual objects, optimizing a differentiable set of deformable Gaussians per object. This object-wise decomposition allows leveraging object-centric diffusion models to infer unobserved regions in novel viewpoints. It performs joint Gaussian splatting to render the full scene, capturing cross-object occlusions, and enabling occlusion-aware supervision. To bridge the gap between object-centric priors and the global frame-centric coordinate system of videos, GenMOJO uses differentiable transformations that align generative and rendering constraints within a unified framework. The resulting model generates 4D object reconstructions over space and time, and produces accurate 2D and 3D point tracks from monocular input. Quantitative evaluations and perceptual human studies confirm that GenMOJO generates more realistic novel views of scenes and produces more accurate point tracks compared to existing approaches.
Submitted 15 June, 2025;
originally announced June 2025.
-
Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis
Authors:
Jingguo Qu,
Xinyang Han,
Jia Ai,
Juan Wu,
Tong Zhao,
Tonghuan Xiao,
Sheng Ning,
Yuqi Yang,
Jing Qin,
Ann Dorothy King,
Winnie Chiu-Wing Chu,
Jing Cai,
Michael Tin-Cheung Ying
Abstract:
Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities, yet their application to medical ultrasound remains constrained by the significant domain shift between natural images and sonographic data. The unique physics of ultrasound, manifesting as speckle noise, shadowing, and variable artifacts, often leads to suboptimal performance when applying off-the-shelf foundation models. To address this, we propose a novel Hybrid-tuning (HT) strategy for the efficient adaptation of CLIP-based models to ultrasound analysis. Our method introduces a lightweight adapter module integrated into the frozen visual backbone, featuring frequency-domain filtering to suppress periodic artifacts and dynamic noise estimation to calibrate feature representations. Furthermore, we design specialized segmentation and classification heads that employ multi-scale feature aggregation to maximize the utility of pre-trained semantic priors. Extensive evaluations across six multi-center datasets (covering lymph nodes, breast, thyroid, and prostate) reveal that our HT-enhanced models significantly outperform existing state-of-the-art methods, including BiomedCLIP and standard LoRA fine-tuning. The results highlight the superior data efficiency and robustness of our approach, paving the way for practical, foundational intelligence in automated ultrasound diagnosis. The source code is available at https://github.com/jinggqu/NextGen-UIA.
Submitted 7 January, 2026; v1 submitted 10 June, 2025;
originally announced June 2025.
-
AllTracker: Efficient Dense Point Tracking at High Resolution
Authors:
Adam W. Harley,
Yang You,
Xinglong Sun,
Yang Zheng,
Nikhil Raghuraman,
Yunqi Gu,
Sheldon Liang,
Wen-Hsuan Chu,
Achal Dave,
Pavel Tokmakov,
Suya You,
Rares Ambrus,
Katerina Fragkiadaki,
Leonidas J. Guibas
Abstract:
We introduce AllTracker: a model that estimates long-range point tracks by way of estimating the flow field between a query frame and every other frame of a video. Unlike existing point tracking methods, our approach delivers high-resolution and dense (all-pixel) correspondence fields, which can be visualized as flow maps. Unlike existing optical flow methods, our approach corresponds one frame to hundreds of subsequent frames, rather than just the next frame. We develop a new architecture for this task, blending techniques from existing work in optical flow and point tracking: the model performs iterative inference on low-resolution grids of correspondence estimates, propagating information spatially via 2D convolution layers, and propagating information temporally via pixel-aligned attention layers. The model is fast and parameter-efficient (16 million parameters), and delivers state-of-the-art point tracking accuracy at high resolution (i.e., tracking 768x1024 pixels, on a 40G GPU). A benefit of our design is that we can train jointly on optical flow datasets and point tracking datasets, and we find that doing so is crucial for top performance. We provide an extensive ablation study on our architecture details and training recipe, making it clear which details matter most. Our code and model weights are available at https://alltracker.github.io
Submitted 1 August, 2025; v1 submitted 8 June, 2025;
originally announced June 2025.
-
Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing
Authors:
Baode Wang,
Biao Wu,
Weizhen Li,
Meng Fang,
Zuming Huang,
Jun Huang,
Haozhe Wang,
Yanjie Liang,
Ling Chen,
Wei Chu,
Yuan Qi
Abstract:
Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned document parsing data with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outpacing specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.
Submitted 20 October, 2025; v1 submitted 1 June, 2025;
originally announced June 2025.
-
arXiv:2506.02588
cond-mat.soft
cond-mat.dis-nn
cond-mat.mtrl-sci
cond-mat.stat-mech
physics.chem-ph
Emergent rigidity percolation of five-fold aggregates enables controllable glass properties
Authors:
Wei Chu,
Zheng Wang,
Christopher Ness,
Konrad Samwer,
Alessio Zaccone,
Lina Hu
Abstract:
Metallic glasses possess outstanding mechanical and physical properties, making them promising candidates for advanced structural and functional applications; however, the lack of understanding and control over their glass transition and solidification processes remains a significant barrier to practical design. The glass transition from liquid to amorphous solid has remained an open problem in physics despite many theories and recent advances in computational efforts. The question of identifying a clear and well-defined diverging length scale accompanying the glass transition has remained unanswered, as has the nature of the transition and, indeed, the presence of a transition at all, as opposed to a mere dynamical crossover. Here we answer these questions using numerical results and theoretical analysis showing that, in atomic (metallic) glass formers, the glass transition coincides with, and is caused by, a continuous rigidity percolation transition from a liquid-like to a solid-like material. The transition occurs as five-fold symmetric atomic clusters progressively aggregate, forming a system-spanning rigid network that marks the onset of mechanical stability. This percolation-driven rigidity growth is accompanied by a sharp increase in the shear modulus G', indicating the emergence of macroscopic solid-like behavior. Beyond this point, which coincides with the Maxwell isostatic point of the percolating structure, dynamical arrest or "freezing-in" prevents further evolution. The long-sought diverging length scale is thus identified as the percolation-driven growth of rigid five-fold clusters, providing a direct link between local structural motifs and macroscopic mechanical properties at the glass transition. These insights offer practical routes to rationally engineer metallic glasses with targeted mechanical stiffness, hardness, and toughness.
△ Less
Submitted 3 June, 2025;
originally announced June 2025.
-
Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning
Authors:
Shuyao Xu,
Cheng Peng,
Jiangxuan Long,
Weidi Xu,
Wei Chu,
Yuan Qi
Abstract:
Recent advances in model distillation show that data from advanced reasoning models can effectively train smaller student models. However, standard practices discard incorrect reasoning traces -- valuable, yet underutilized data. This paper addresses the critical question: How can both positive and negative distilled reasoning traces be effectively leveraged to maximize LLM reasoning performance in an offline setting? We employ a two-stage training recipe: first, Supervised Fine-Tuning (SFT) on positive traces, followed by a refinement stage using both positive and negative traces. We find that a simple REINFORCE-style objective, which we term the Reinforcement Distillation (REDI) objective, outperforms established preference optimization methods like DPO and SimPO in this distillation context. Our empirical evaluations demonstrate the effectiveness of this approach. Notably, our Qwen-REDI-1.5B model, trained on just 131k traces from the open Open-R1 dataset, achieves an 83.1% score on MATH-500. Its performance matches that of DeepSeek-R1-Distill-Qwen-1.5B, a model trained on 800k proprietary data. This result showcases the remarkable data efficiency of utilizing previously discarded negative traces.
Submitted 14 December, 2025; v1 submitted 30 May, 2025;
originally announced May 2025.
-
Quantitative Macromolecular Proton Fraction Imaging using Pulsed Spin-Lock
Authors:
Qianxue Shan,
Ziqiang Yu,
Baiyan Jiang,
Jian Hou,
Qiuyi Shen,
Winnie CW Chu,
Vincent WS Wong,
Weitian Chen
Abstract:
Purpose: Recent studies have shown that spin-lock MRI can simplify quantitative magnetization transfer (MT) by eliminating its dependency on water pool parameters, removing the need for a T1 map in macromolecular proton fraction (MPF) quantification. However, its application is often limited by the requirement for long radiofrequency (RF) pulse durations, which are constrained by RF hardware capabilities despite remaining within specific absorption rate (SAR) safety limits.
Methods: To address this challenge, we propose a novel method, MPF mapping using pulsed spin-lock (MPF-PSL). MPF-PSL employs a pulsed spin-lock train with intermittent free precession periods, enabling extended total spin-lock durations without exceeding hardware and specific absorption rate limits. A comprehensive analytical framework was developed to model the magnetization dynamics of the two-pool MT system under pulsed spin-lock, demonstrating that MPF-PSL achieves MT-specific quantification while minimizing confounding effects from the water pool. The proposed method is validated with Bloch-McConnell simulations, phantoms, and in vivo studies at 3T.
Results: Both Bloch-McConnell simulations and phantom validation demonstrated that MPF-PSL exhibits robust insensitivity to water pool parameters while enabling high-SNR MPF quantification. In vivo validation studies confirmed the method's clinical utility in detecting collagen deposition in patients with liver fibrosis.
Conclusion: MPF-PSL presents a practical solution for quantitative MT imaging, with strong potential for clinical applications.
Submitted 27 May, 2025;
originally announced May 2025.
-
Steering Generative Models with Experimental Data for Protein Fitness Optimization
Authors:
Jason Yang,
Wenda Chu,
Daniel Khalil,
Raul Astudillo,
Bruce J. Wittmann,
Frances H. Arnold,
Yisong Yue
Abstract:
Protein fitness optimization involves finding a protein sequence that maximizes desired quantitative properties in a combinatorially large design space of possible sequences. Recent advances in steering protein generative models (e.g., diffusion models and language models) with labeled data offer a promising approach. However, most previous studies have optimized surrogate rewards and/or utilized large amounts of labeled data for steering, making it unclear how well existing methods perform and compare to each other in real-world optimization campaigns where fitness is measured through low-throughput wet-lab assays. In this study, we explore fitness optimization using small amounts (hundreds) of labeled sequence-fitness pairs and comprehensively evaluate strategies such as classifier guidance and posterior sampling for guiding generation from different discrete diffusion models of protein sequences. We also demonstrate how guidance can be integrated into adaptive sequence selection akin to Thompson sampling in Bayesian optimization, showing that plug-and-play guidance strategies offer advantages over alternatives such as reinforcement learning with protein language models. Overall, we provide practical insights into how to effectively steer modern generative models for next-generation protein fitness optimization.
Submitted 20 October, 2025; v1 submitted 21 May, 2025;
originally announced May 2025.