-
T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
Authors:
Zhe Cao,
Tao Wang,
Jiaming Wang,
Yanghai Wang,
Yuanxing Zhang,
Jialu Chen,
Miao Deng,
Jiahao Wang,
Yubin Guo,
Chenxi Liao,
Yize Zhang,
Zhaoxiang Zhang,
Jiaheng Liu
Abstract:
Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. In addition, T2AV-Compass introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment. Extensive evaluation of 11 representative T2AV systems reveals that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in audio realism, fine-grained synchronization, instruction following, etc. These results indicate substantial room for improvement for future models and highlight the value of T2AV-Compass as a challenging and diagnostic testbed for advancing text-to-audio-video generation.
Submitted 24 December, 2025;
originally announced December 2025.
-
L4: Low-Latency and Load-Balanced LLM Serving via Length-Aware Scheduling
Authors:
Yitao Yuan,
Chenqi Zhao,
Bohan Zhao,
Zane Cao,
Yongchao He,
Wenfei Wu
Abstract:
Efficiently harnessing GPU compute is critical to improving user experience and reducing operational costs in large language model (LLM) services. However, current inference engine schedulers overlook the attention backend's sensitivity to request-length heterogeneity within a batch. As state-of-the-art models now support context windows exceeding 128K tokens, this once-tolerable inefficiency has escalated into a primary system bottleneck, causing severe performance degradation through GPU underutilization and increased latency. We present L4, a runtime system that dynamically reschedules requests across multiple instances serving the same LLM to mitigate per-instance length heterogeneity. L4 partitions these instances into length-specialized groups, each handling requests within a designated length range, naturally forming a pipeline as requests flow through them. L4 devises a dynamic programming algorithm to efficiently find the stage partition with the best QoE, and employs runtime range refinement together with decentralized load (re)balancing both across and within groups, achieving a balanced and efficient multi-instance service. Our evaluation shows that, under the same configuration, L4 reduces end-to-end latency by up to 67% and tail latency by up to 69%, while improving overall system throughput by up to 2.89 times compared to state-of-the-art multi-instance scheduling systems.
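As a rough illustration of the length-specialized grouping idea rather than L4's actual algorithm, the hedged sketch below routes each request to a group by prompt-length range and then to the least-loaded instance in that group; the ranges, group layout, and load rule are hypothetical.

    # Hypothetical sketch of length-aware request routing across instance groups.
    # Ranges and the load-balancing rule are illustrative, not the L4 algorithm.
    from collections import defaultdict

    # Each group serves requests whose prompt length falls in [lo, hi).
    GROUPS = [
        {"name": "short",  "range": (0, 4_096),        "instances": ["i0", "i1"]},
        {"name": "medium", "range": (4_096, 32_768),   "instances": ["i2", "i3"]},
        {"name": "long",   "range": (32_768, 131_072), "instances": ["i4"]},
    ]

    load = defaultdict(int)  # outstanding requests per instance

    def route(request_len: int) -> str:
        """Pick the length-matched group, then the least-loaded instance in it."""
        for g in GROUPS:
            lo, hi = g["range"]
            if lo <= request_len < hi:
                inst = min(g["instances"], key=lambda i: load[i])
                load[inst] += 1
                return inst
        raise ValueError("request exceeds supported context window")

    print(route(1_000))    # routed to the short group
    print(route(50_000))   # routed to the long group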
Submitted 22 December, 2025;
originally announced December 2025.
-
MeniMV: A Multi-view Benchmark for Meniscus Injury Severity Grading
Authors:
Shurui Xu,
Siqi Yang,
Jiapin Ren,
Zhong Cao,
Hongwei Yang,
Mengzhen Fan,
Yuyu Sun,
Shuyan Li
Abstract:
Precise grading of meniscal horn tears is critical in knee injury diagnosis but remains underexplored in automated MRI analysis. Existing methods often rely on coarse study-level labels or binary classification, lacking localization and severity information. In this paper, we introduce MeniMV, a multi-view benchmark dataset specifically designed for horn-specific meniscus injury grading. MeniMV comprises 3,000 annotated knee MRI exams from 750 patients across three medical centers, providing 6,000 co-registered sagittal and coronal images. Each exam is meticulously annotated with four-tier (grade 0-3) severity labels for both anterior and posterior meniscal horns, verified by chief orthopedic physicians. Notably, MeniMV offers more than double the pathology-labeled data volume of prior datasets while uniquely capturing the dual-view diagnostic context essential in clinical practice. To demonstrate the utility of MeniMV, we benchmark multiple state-of-the-art CNN and Transformer-based models. Our extensive experiments establish strong baselines and highlight challenges in severity grading, providing a valuable foundation for future research in automated musculoskeletal imaging.
Submitted 20 December, 2025;
originally announced December 2025.
-
CrystalFormer-CSP: Thinking Fast and Slow for Crystal Structure Prediction
Authors:
Zhendong Cao,
Shigang Ou,
Lei Wang
Abstract:
Crystal structure prediction is a fundamental problem in materials science. We present CrystalFormer-CSP, an efficient framework that unifies data-driven heuristic and physics-driven optimization approaches to predict stable crystal structures for given chemical compositions. The approach combines pretrained generative models for space-group-informed structure generation and a universal machine learning force field for energy minimization. Reinforcement fine-tuning can be employed to further boost the accuracy of the framework. We demonstrate the effectiveness of CrystalFormer-CSP on benchmark problems and showcase its usage via web interface and language model integration.
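The following toy sketch illustrates the generate-then-relax loop described above under stated assumptions: the structure generator and the energy function are stand-in stubs, not CrystalFormer or an actual machine learning force field.

    # Illustrative generate-then-relax loop; the generator and energy model below
    # are toy stand-ins, not CrystalFormer or a universal ML force field.
    import random

    def sample_structure(composition: str) -> dict:
        # Stand-in for space-group-informed generative sampling.
        return {"composition": composition, "lattice_a": random.uniform(3.0, 6.0)}

    def relax_and_energy(structure: dict) -> float:
        # Stand-in for force-field relaxation; returns a fake energy per atom.
        return (structure["lattice_a"] - 4.2) ** 2 - 1.0

    def predict(composition: str, n_candidates: int = 32) -> dict:
        candidates = [sample_structure(composition) for _ in range(n_candidates)]
        return min(candidates, key=relax_and_energy)  # keep the lowest-energy structure

    print(predict("MgO"))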
Submitted 20 December, 2025;
originally announced December 2025.
-
TimeSeries2Report prompting enables adaptive large language model management of lithium-ion batteries
Authors:
Jiayang Yang,
Chunhui Zhao,
Martin Guay,
Zhixing Cao
Abstract:
Large language models (LLMs) offer promising capabilities for interpreting multivariate time-series data, yet their application to real-world battery energy storage system (BESS) operation and maintenance remains largely unexplored. Here, we present TimeSeries2Report (TS2R), a prompting framework that converts raw lithium-ion battery operational time-series into structured, semantically enriched reports, enabling LLMs to reason, predict, and make decisions in BESS management scenarios. TS2R encodes short-term temporal dynamics into natural language through a combination of segmentation, semantic abstraction, and rule-based interpretation, effectively bridging low-level sensor signals with high-level contextual insights. We benchmark TS2R across both lab-scale and real-world datasets, evaluating report quality and downstream task performance in anomaly detection, state-of-charge prediction, and charging/discharging management. Compared with vision-, embedding-, and text-based prompting baselines, report-based prompting via TS2R consistently improves LLM performance across accuracy, robustness, and explainability metrics. Notably, TS2R-integrated LLMs achieve expert-level decision quality and predictive consistency without retraining or architecture modification, establishing a practical path for adaptive, LLM-driven battery intelligence.
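To make the segmentation-plus-rules idea concrete, here is a minimal sketch that maps one time-series segment to a short textual report; the thresholds, field names, and wording are hypothetical and are not the TS2R rule set.

    # Toy sketch of a time-series-to-report conversion: segment statistics are
    # mapped to short natural-language statements via hand-written rules.
    # Thresholds and wording are hypothetical, not the TS2R rules.
    import numpy as np

    def segment_report(voltage: np.ndarray, temperature: np.ndarray) -> str:
        lines = []
        dv = voltage[-1] - voltage[0]
        trend = "rising" if dv > 0.01 else "falling" if dv < -0.01 else "stable"
        lines.append(f"Cell voltage is {trend} ({voltage[0]:.2f} V -> {voltage[-1]:.2f} V).")
        if temperature.max() > 45.0:
            lines.append(f"Temperature peak {temperature.max():.1f} C exceeds the 45 C warning level.")
        else:
            lines.append("Temperature stays within the normal operating band.")
        return " ".join(lines)

    v = np.linspace(3.70, 3.95, 60)          # one segment of voltage samples
    t = 25.0 + 2.0 * np.random.rand(60)      # matching temperature samples
    print(segment_report(v, t))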
Submitted 18 December, 2025;
originally announced December 2025.
-
Beyond openness: Inclusiveness and usability of Chinese scholarly data in OpenAlex
Authors:
Lin Zhang,
Zhe Cao,
Jianhua Liu,
Nees Jan van Eck
Abstract:
OpenAlex, launched in 2022 as a fully open scholarly data source, promises greater inclusiveness compared to traditional proprietary databases. This study evaluates whether OpenAlex delivers on that promise by examining its coverage and metadata quality for Chinese-language journals and their articles. Using the 2023 edition of A Guide to the Core Journals of China (GCJC) and Wanfang Data as a benchmark, we analyze three aspects: (1) journal-level coverage, (2) article-level coverage, and (3) completeness and accuracy of metadata fields. Results show that OpenAlex indexes only 37% of GCJC journals and 24% of their articles, with substantial disciplinary and temporal variation. Metadata quality is uneven: while basic fields such as title and publication year are complete, bibliographic details, author affiliations, and cited references are frequently missing or inaccurate. DOI coverage is limited, and language information is often incorrect, with most Chinese-language articles labeled as English. These findings highlight significant challenges for achieving full inclusiveness and usability in research evaluation and related activities. We conclude with recommendations for improving data aggregation strategies, DOI registration practices, and metadata standardization to enhance the integration of local scholarly outputs into global open infrastructures.
Submitted 18 December, 2025;
originally announced December 2025.
-
CHIP: Adaptive Compliance for Humanoid Control through Hindsight Perturbation
Authors:
Sirui Chen,
Zi-ang Cao,
Zhengyi Luo,
Fernando Castañeda,
Chenran Li,
Tingwu Wang,
Ye Yuan,
Linxi "Jim" Fan,
C. Karen Liu,
Yuke Zhu
Abstract:
Recent progress in humanoid robots has unlocked agile locomotion skills, including backflipping, running, and crawling. Yet it remains challenging for a humanoid robot to perform forceful manipulation tasks such as moving objects, wiping, and pushing a cart. We propose adaptive Compliance for Humanoid control through hIndsight Perturbation (CHIP), a plug-and-play module that enables controllable end-effector stiffness while preserving agile tracking of dynamic reference motions. CHIP is easy to implement and requires neither data augmentation nor additional reward tuning. We show that a generalist motion-tracking controller trained with CHIP can perform a diverse set of forceful manipulation tasks that require different end-effector compliance, such as multi-robot collaboration, wiping, box delivery, and door opening.
Submitted 16 December, 2025;
originally announced December 2025.
-
A Unified Sparse Attention via Multi-Granularity Compression
Authors:
Siran Liu,
Zane Cao,
Yongchao He
Abstract:
Efficient long-context understanding and reasoning are increasingly vital for large language model (LLM) applications such as multi-turn dialogue and program analysis. However, the core self-attention mechanism scales quadratically with sequence length, creating a fundamental computational bottleneck. Existing sparse attention methods alleviate this issue but face trade-offs: training-based methods are costly and cannot be directly applied as acceleration plugins for other models, while inference-time methods often compromise efficiency or cross-modal generality. To address these limitations, we present UniSparse, a unified mechanism that introduces the notion of composite tokens--compact representations that aggregate multi-granularity contextual information. Building on this abstraction, UniSparse dynamically constructs sparse attention through multi-granularity compression and block-level selection, enabling efficient and hardware-friendly execution on GPU. Across multiple modalities and tasks ranging from synthetic benchmarks to real-world applications, UniSparse consistently surpasses state-of-the-art sparse attention methods (e.g., MInference, XAttention, FlexPrefill) in both accuracy and efficiency, achieving $\ge$ 99% of full-attention accuracy and up to 2.61$\times$ faster attention computation than FlashAttention.
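A minimal sketch of block-level selection with pooled "composite tokens", assuming a single head and a single query; the block size, top-k, and mean pooling are illustrative choices rather than the UniSparse kernel.

    # Minimal block-selection sparse attention in NumPy, assuming a single head and
    # one query vector; block size and top-k are illustrative, not UniSparse's design.
    import numpy as np

    def block_sparse_attention(q, K, V, block=64, keep=4):
        n = K.shape[0]
        n_blocks = (n + block - 1) // block
        # Multi-granularity stand-in: one mean-pooled "composite token" per block.
        pooled = np.stack([K[i*block:(i+1)*block].mean(axis=0) for i in range(n_blocks)])
        scores = pooled @ q                            # coarse relevance per block
        top = np.argsort(scores)[-keep:]               # keep the most relevant blocks
        idx = np.concatenate([np.arange(i*block, min((i+1)*block, n)) for i in sorted(top)])
        att = K[idx] @ q / np.sqrt(K.shape[1])
        w = np.exp(att - att.max()); w /= w.sum()      # softmax over selected tokens only
        return w @ V[idx]

    rng = np.random.default_rng(0)
    K = rng.normal(size=(1024, 64)); V = rng.normal(size=(1024, 64)); q = rng.normal(size=64)
    print(block_sparse_attention(q, K, V).shape)   # (64,)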
Submitted 15 December, 2025;
originally announced December 2025.
-
Multiband gravitational wave observations of eccentric escaping binary black holes from globular clusters
Authors:
Yuetong Zhao,
Abbas Askar,
Youjun Lu,
Zhoujian Cao,
Mirek Giersz,
Grzegorz Wiktorowicz,
Arkadiusz Hypki,
Lucas Hellstrom,
Sohaib Ali,
Wei-Tou Ni
Abstract:
Stellar-mass binary black holes (sBBHs) formed in globular clusters (GCs) are promising sources for multiband gravitational wave (GW) observations, particularly with low- and middle-frequency detectors. These sBBHs can retain detectable eccentricities when they enter the sensitivity bands of low-frequency GW observatories. We study multiband GW observations of eccentric sBBHs that escape from GC models simulated with the MOCCA code, focusing on how low- and middle-frequency detectors can constrain their eccentricities and other parameters. Using Monte Carlo simulations, we generate ten realizations of cosmic sBBHs by combining the MOCCA sample with a cosmological model for GC formation and evolution. We then assess their detectability and the precision of parameter estimation. Our results show that LISA, Taiji, the LISA-Taiji network (LT), and AMIGO could detect $0.8\pm0.7$, $11.6\pm2.0$, $15.4\pm2.7$, and $7.9\pm1.3$ escaping sBBHs, respectively, over four years, while LT-AMIGO could detect $20.6\pm3.0$ multiband sBBHs in the same period. LT and AMIGO can measure initial eccentricities with relative errors of approximately $10^{-6}-2\times10^{-4}$ and $10^{-3}-0.7$, respectively. Joint LT-AMIGO observations have a similar ability to estimate eccentricities as LT alone.
Submitted 15 December, 2025;
originally announced December 2025.
-
Sequence of Expert: Boosting Imitation Planners for Autonomous Driving through Temporal Alternation
Authors:
Xiang Li,
Gang Liu,
Weitao Zhou,
Hongyi Zhu,
Zhong Cao
Abstract:
Imitation learning (IL) has emerged as a central paradigm in autonomous driving. While IL excels in matching expert behavior in open-loop settings by minimizing per-step prediction errors, its performance degrades unexpectedly in closed-loop due to the gradual accumulation of small, often imperceptible errors over time. Over successive planning cycles, these errors compound, potentially resulting in severe failures. Current research efforts predominantly rely on increasingly sophisticated network architectures or high-fidelity training datasets to enhance the robustness of IL planners against error accumulation, focusing on state-level robustness at a single time point. However, autonomous driving is inherently a continuous-time process, and leveraging the temporal scale to enhance robustness may provide a new perspective for addressing this issue. To this end, we propose a method termed Sequence of Experts (SoE), a temporal alternation policy that enhances closed-loop performance without increasing model size or data requirements. Our experiments on the large-scale autonomous driving benchmark nuPlan demonstrate that the SoE method consistently and significantly improves the performance of all the evaluated models and achieves state-of-the-art performance. This module may provide key and widely applicable support for improving the training efficiency of autonomous driving models.
Submitted 15 December, 2025;
originally announced December 2025.
-
$\mathcal{N} = (0, 2)$ higher-spin supergravity in AdS$_3$
Authors:
Zisong Cao
Abstract:
In this paper we generalize Vasiliev's higher-spin gravity theory in 3d to the $\mathcal{N} = (0, 2)$ case, by which we mean that the asymptotic symmetry of such a gravity theory has the structure of the 2d $\mathcal{N} = (0, 2)$ superconformal algebra. While the construction is limited to the linearized level, the asymptotic symmetry and possible matter content of such theories are discussed. Also, the 1-loop partition function of this theory around thermal Euclidean AdS space-time, with different matter fields, is calculated by the heat-kernel method.
Submitted 14 December, 2025;
originally announced December 2025.
-
A Survey of OAM-Encoded High-Dimensional Quantum Key Distribution: Foundations, Experiments, and Recent Trends
Authors:
Huan Zhang,
Zhenyu Cao,
Yu Sun,
Hu Jin
Abstract:
High-dimensional quantum key distribution (HD-QKD) enhances information efficiency and noise tolerance by encoding data in large Hilbert spaces. The orbital angular momentum (OAM) of light provides a scalable basis for such encoding and supports high-dimensional photonic communication. Practical OAM-based implementations remain constrained by challenges in state generation, transmission, and detection. This survey offers a consolidated overview of OAM-encoded HD-QKD, outlining fundamental principles, representative experiments, and system-level limitations. Recent progress in hybrid encodings, mode sorting, adaptive optics, and twin-field (TF), continuous-variable (CV), measurement-device-independent (MDI), and device-independent (DI) frameworks is summarized with emphasis on practical feasibility.
Submitted 12 December, 2025;
originally announced December 2025.
-
Gravitational Wave Detection Based on Gravitomagnetic Effects
Authors:
Yu-Qi Dong,
Zhoujian Cao,
Yu-Xiao Liu
Abstract:
In this paper, we explore the feasibility of detecting gravitomagnetic effects generated by gravitational waves, by monitoring the relative orientation of the angular momentum vectors of test particles. We analyze the response of the relative angular momentum direction to all six polarization modes of gravitational waves and estimate the magnitude of its variation during gravitational wave events. Our findings indicate that when test particles possess magnetic moments, applying an external magnetic field of appropriate strength can induce resonant precession of the angular momentum direction under the influence of gravitational waves. This resonance may significantly amplify the gravitational wave signal, potentially enabling its detection with future gyroscope-based detectors. Such detectors would complement existing gravitational wave observatories that rely on gravitoelectric effects.
Submitted 11 December, 2025;
originally announced December 2025.
-
Classifier Reconstruction Through Counterfactual-Aware Wasserstein Prototypes
Authors:
Xuan Zhao,
Zhuo Cao,
Arya Bangun,
Hanno Scharr,
Ira Assent
Abstract:
Counterfactual explanations provide actionable insights by identifying minimal input changes required to achieve a desired model prediction. Beyond their interpretability benefits, counterfactuals can also be leveraged for model reconstruction, where a surrogate model is trained to replicate the behavior of a target model. In this work, we demonstrate that model reconstruction can be significantly improved by recognizing that counterfactuals, which typically lie close to the decision boundary, can serve as informative though less representative samples for both classes. This is particularly beneficial in settings with limited access to labeled data. We propose a method that integrates original data samples with counterfactuals to approximate class prototypes using the Wasserstein barycenter, thereby preserving the underlying distributional structure of each class. This approach enhances the quality of the surrogate model and mitigates the issue of decision boundary shift, which commonly arises when counterfactuals are naively treated as ordinary training instances. Empirical results across multiple datasets show that our method improves fidelity between the surrogate and target models, validating its effectiveness.
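A simplified sketch of the idea of mixing counterfactuals with ordinary samples when forming class prototypes for a surrogate; a plain weighted mean stands in for the Wasserstein barycenter used in the paper, and the 0.3 weight is hypothetical.

    # Simplified sketch: blend ordinary samples and counterfactuals into class
    # prototypes for a nearest-prototype surrogate. A weighted mean stands in
    # for the Wasserstein barycenter; the 0.3 counterfactual weight is made up.
    import numpy as np

    def class_prototype(X_real, X_cf, cf_weight=0.3):
        return (1 - cf_weight) * X_real.mean(axis=0) + cf_weight * X_cf.mean(axis=0)

    def predict(x, prototypes):
        dists = {c: np.linalg.norm(x - p) for c, p in prototypes.items()}
        return min(dists, key=dists.get)

    rng = np.random.default_rng(1)
    X0, X1 = rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))
    cf0, cf1 = rng.normal(-0.2, 0.3, (10, 2)), rng.normal(+0.2, 0.3, (10, 2))  # near the boundary
    protos = {0: class_prototype(X0, cf0), 1: class_prototype(X1, cf1)}
    print(predict(np.array([0.8, 0.5]), protos))   # expected: class 1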
Submitted 11 December, 2025;
originally announced December 2025.
-
Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution
Authors:
Zouying Cao,
Jiaji Deng,
Li Yu,
Weikang Zhou,
Zhaoyang Liu,
Bolin Ding,
Hai Zhao
Abstract:
Procedural memory enables large language model (LLM) agents to internalize "how-to" knowledge, theoretically reducing redundant trial-and-error. However, existing frameworks predominantly suffer from a "passive accumulation" paradigm, treating memory as a static append-only archive. To bridge the gap between static storage and dynamic reasoning, we propose $\textbf{ReMe}$ ($\textit{Remember Me, Refine Me}$), a comprehensive framework for experience-driven agent evolution. ReMe innovates across the memory lifecycle via three mechanisms: 1) $\textit{multi-faceted distillation}$, which extracts fine-grained experiences by recognizing success patterns, analyzing failure triggers, and generating comparative insights; 2) $\textit{context-adaptive reuse}$, which tailors historical insights to new contexts via scenario-aware indexing; and 3) $\textit{utility-based refinement}$, which autonomously adds valid memories and prunes outdated ones to maintain a compact, high-quality experience pool. Extensive experiments on BFCL-V3 and AppWorld demonstrate that ReMe establishes a new state of the art in agent memory systems. Crucially, we observe a significant memory-scaling effect: Qwen3-8B equipped with ReMe outperforms a larger, memoryless Qwen3-14B, suggesting that self-evolving memory provides a computation-efficient pathway for lifelong learning. We release our code and the $\texttt{reme.library}$ dataset to facilitate further research.
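A toy sketch of the memory lifecycle under stated assumptions: scenario-tagged entries, utility updates after reuse, and pruning of low-utility entries. The field names and the 0.2 pruning threshold are hypothetical, not ReMe's schema.

    # Toy procedural-memory pool: scenario-tagged entries, utility feedback after
    # reuse, and pruning of low-utility entries. Fields and thresholds are made up.
    from dataclasses import dataclass

    @dataclass
    class Experience:
        scenario: str
        insight: str
        utility: float = 0.5   # running estimate of usefulness

    class MemoryPool:
        def __init__(self):
            self.entries = []

        def add(self, scenario: str, insight: str):
            self.entries.append(Experience(scenario, insight))

        def retrieve(self, scenario: str, k: int = 3):
            hits = [e for e in self.entries if e.scenario == scenario]
            return sorted(hits, key=lambda e: e.utility, reverse=True)[:k]

        def feedback(self, exp: Experience, success: bool, lr: float = 0.3):
            exp.utility += lr * ((1.0 if success else 0.0) - exp.utility)

        def prune(self, threshold: float = 0.2):
            self.entries = [e for e in self.entries if e.utility >= threshold]

    pool = MemoryPool()
    pool.add("web_booking", "Confirm the date filter before submitting the form.")
    hit = pool.retrieve("web_booking")[0]
    pool.feedback(hit, success=True)
    pool.prune()
    print(hit.utility)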
Submitted 11 December, 2025;
originally announced December 2025.
-
REASAN: Learning Reactive Safe Navigation for Legged Robots
Authors:
Qihao Yuan,
Ziyu Cao,
Ming Cao,
Kailai Li
Abstract:
We present a novel modularized end-to-end framework for legged reactive navigation in complex dynamic environments using a single light detection and ranging (LiDAR) sensor. The system comprises four simulation-trained modules: three reinforcement-learning (RL) policies for locomotion, safety shielding, and navigation, and a transformer-based exteroceptive estimator that processes raw point-cloud inputs. This modular decomposition of complex legged motor-control tasks enables lightweight neural networks with simple architectures, trained using standard RL practices with targeted reward shaping and curriculum design, without reliance on heuristics or sophisticated policy-switching mechanisms. We conduct comprehensive ablations to validate our design choices and demonstrate improved robustness compared to existing approaches in challenging navigation tasks. The resulting reactive safe navigation (REASAN) system achieves fully onboard and real-time reactive navigation across both single- and multi-robot settings in complex environments. We release our training and deployment code at https://github.com/ASIG-X/REASAN.
Submitted 10 December, 2025;
originally announced December 2025.
-
Physics-Guided Diffusion Priors for Multi-Slice Reconstruction in Scientific Imaging
Authors:
Laurentius Valdy,
Richard D. Paul,
Alessio Quercia,
Zhuo Cao,
Xuan Zhao,
Hanno Scharr,
Arya Bangun
Abstract:
Accurate multi-slice reconstruction from limited measurement data is crucial to speed up the acquisition process in medical and scientific imaging. However, it remains challenging due to the ill-posed nature of the problem and the high computational and memory demands. We propose a framework that addresses these challenges by integrating partitioned diffusion priors with physics-based constraints. By doing so, we substantially reduce memory usage per GPU while preserving high reconstruction quality, outperforming both physics-only and full multi-slice reconstruction baselines for different modalities, namely Magnetic Resonance Imaging (MRI) and four-dimensional Scanning Transmission Electron Microscopy (4D-STEM). Additionally, we show that the proposed method improves in-distribution accuracy as well as strong generalization to out-of-distribution datasets.
Submitted 7 December, 2025;
originally announced December 2025.
-
Evolutionary System 2 Reasoning: An Empirical Proof
Authors:
Zeyuan Ma,
Wenqi Huang,
Guo-Huan Song,
Hongshu Guo,
Sijie Ma,
Zhiguang Cao,
Yue-Jiao Gong
Abstract:
Machine intelligence embodies the long-standing goal of making machines' intelligence comparable to that of human beings. While recent progress in Large Language Models (LLMs) shows substantial task-specific skill across a wide array of downstream tasks, these models still fall short in general intelligence. Following the correlation between intelligence and system 2 reasoning (slow thinking), in this paper we aim to answer a worthwhile research question: can machine intelligence such as LLMs be evolved to acquire reasoning ability (rather than specific skills), just like human beings? To this end, we propose the evolutionary reasoning optimization (ERO) framework, which performs survival of the fittest over a population of LLMs to search for individuals with strong reasoning ability. Given a reasoning task, ERO first initializes multiple LLMs as a population, after which an evolutionary strategy evolves the population to maximize the quantified reasoning score of the best individual. Based on experiments on representative test suites, we report two surprising empirical findings: i) the latest LLMs such as GPT-5 still show limited system 2 reasoning ability; ii) with the simple evolution loop of ERO, a relatively weak model (Qwen-7B) can be enhanced to exhibit powerful reasoning ability. Our project can be accessed at https://github.com/MetaEvo/ERO for reproduction needs.
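A minimal sketch of the survival-of-the-fittest loop, with a placeholder scoring function standing in for the quantified reasoning score that ERO would obtain by evaluating each LLM individual on a reasoning task.

    # Minimal survival-of-the-fittest loop over candidate model configurations.
    # score_reasoning() is a placeholder objective, not ERO's actual evaluation.
    import random

    def score_reasoning(individual: dict) -> float:
        # Stand-in objective: pretend a "temperature" near 0.3 reasons best.
        return -abs(individual["temperature"] - 0.3)

    def mutate(ind: dict) -> dict:
        return {"temperature": min(1.0, max(0.0, ind["temperature"] + random.gauss(0, 0.1)))}

    population = [{"temperature": random.random()} for _ in range(8)]
    for generation in range(20):
        population.sort(key=score_reasoning, reverse=True)
        survivors = population[:4]                       # keep the fittest half
        population = survivors + [mutate(random.choice(survivors)) for _ in range(4)]

    print(max(population, key=score_reasoning))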
Submitted 5 December, 2025;
originally announced December 2025.
-
Comparative Analysis of Barrier-like Function Methods for Reach-Avoid Verification in Stochastic Discrete-Time Systems
Authors:
Zhipeng Cao,
Peixin Wang,
Luke Ong,
Đorđe Žikelić,
Dominik Wagner,
Bai Xue
Abstract:
In this paper, we compare several representative barrier-like conditions from the literature for infinite-horizon reach-avoid verification of stochastic discrete-time systems. Our comparison examines both their theoretical properties and computational tractability, highlighting each condition's strengths and limitations that affect applicability and conservativeness. Finally, we illustrate their practical performance through computational experiments using semidefinite programming (SDP) and counterexample-guided inductive synthesis (CEGIS).
Submitted 4 December, 2025;
originally announced December 2025.
-
Light-X: Generative 4D Video Rendering with Camera and Illumination Control
Authors:
Tianqi Liu,
Zhaoxi Chen,
Zihao Huang,
Shaocong Xu,
Saining Zhang,
Chongjie Ye,
Bohan Li,
Zhiguo Cao,
Wei Li,
Hao Zhao,
Ziwei Liu
Abstract:
Recent advances in illumination control extend image-based methods to video, yet still face a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse-mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.
Submitted 15 December, 2025; v1 submitted 4 December, 2025;
originally announced December 2025.
-
Efficient Spatially-Variant Convolution via Differentiable Sparse Kernel Complex
Authors:
Zhizhen Wu,
Zhe Cao,
Yuchi Huo
Abstract:
Image convolution with complex kernels is a fundamental operation in photography, scientific imaging, and animation effects, yet direct dense convolution is computationally prohibitive on resource-limited devices. Existing approximations, such as simulated annealing or low-rank decompositions, either lack efficiency or fail to capture non-convex kernels. We introduce a differentiable kernel decomposition framework that represents a target spatially-variant, dense, complex kernel using a set of sparse kernel samples. Our approach features (i) a decomposition that enables differentiable optimization of sparse kernels, (ii) a dedicated initialization strategy for non-convex shapes to avoid poor local minima, and (iii) a kernel-space interpolation scheme that extends single-kernel filtering to spatially varying filtering without retraining and additional runtime overhead. Experiments on Gaussian and non-convex kernels show that our method achieves higher fidelity than simulated annealing and significantly lower cost than low-rank decompositions. Our approach provides a practical solution for mobile imaging and real-time rendering, while remaining fully differentiable for integration into broader learning pipelines.
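A simplified sketch of fitting a sparse kernel approximation by gradient descent: the sample positions are fixed and only their weights are optimized against a dense Gaussian target, whereas the paper also optimizes placement and interpolates kernels spatially; both are omitted here.

    # Simplified sketch: approximate a dense 2D kernel with a fixed set of sparse
    # sample positions whose weights are fit by gradient descent on the L2 error.
    import numpy as np

    size = 15
    y, x = np.mgrid[-7:8, -7:8]
    target = np.exp(-(x**2 + y**2) / (2 * 3.0**2))       # dense Gaussian target kernel
    target /= target.sum()

    rng = np.random.default_rng(0)
    flat = rng.choice(size * size, size=32, replace=False)
    pos = np.stack(np.unravel_index(flat, (size, size)), axis=1)  # 32 sparse sample positions
    w = np.zeros(32)

    for _ in range(500):                                  # least-squares fit by gradient descent
        approx = np.zeros_like(target)
        approx[pos[:, 0], pos[:, 1]] = w                   # render the sparse kernel densely
        resid = approx - target
        w -= 0.5 * resid[pos[:, 0], pos[:, 1]]             # gradient step on the weights

    print(float(np.abs(resid).sum()))                      # remaining L1 approximation error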
Submitted 4 December, 2025;
originally announced December 2025.
-
MAFNet: Multi-frequency Adaptive Fusion Network for Real-time Stereo Matching
Authors:
Ao Xu,
Rujin Zhao,
Xiong Xu,
Boceng Huang,
Yujia Jia,
Hongfeng Long,
Fuxuan Chen,
Zilong Cao,
Fangyuan Chen
Abstract:
Existing stereo matching networks typically rely on either cost-volume construction based on 3D convolutions or deformation methods based on iterative optimization. The former incurs significant computational overhead during cost aggregation, whereas the latter often lacks the ability to model non-local contextual information. These methods exhibit poor compatibility on resource-constrained mobile devices, limiting their deployment in real-time applications. To address this, we propose a Multi-frequency Adaptive Fusion Network (MAFNet), which can produce high-quality disparity maps using only efficient 2D convolutions. Specifically, we design an adaptive frequency-domain filtering attention module that decomposes the full cost volume into high-frequency and low-frequency volumes, performing frequency-aware feature aggregation separately. Subsequently, we introduce a Linformer-based low-rank attention mechanism to adaptively fuse high- and low-frequency information, yielding more robust disparity estimation. Extensive experiments demonstrate that the proposed MAFNet significantly outperforms existing real-time methods on public datasets such as Scene Flow and KITTI 2015, showing a favorable balance between accuracy and real-time performance.
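A toy sketch of the frequency-decomposition step only: an FFT low-pass mask splits a cost-volume slice into low- and high-frequency parts, and scalar gates stand in for the adaptive fusion attention.

    # Toy frequency decomposition of a cost-volume slice: an FFT low-pass mask
    # separates low- and high-frequency components, which are then recombined
    # with scalar gates standing in for the adaptive fusion attention.
    import numpy as np

    def split_frequencies(cost_slice: np.ndarray, cutoff: float = 0.15):
        f = np.fft.fftshift(np.fft.fft2(cost_slice))
        h, w = cost_slice.shape
        yy, xx = np.mgrid[-h//2:h - h//2, -w//2:w - w//2]
        lowpass = (np.sqrt((yy / h)**2 + (xx / w)**2) <= cutoff).astype(float)
        low = np.fft.ifft2(np.fft.ifftshift(f * lowpass)).real
        high = cost_slice - low
        return low, high

    cost = np.random.rand(64, 128)
    low, high = split_frequencies(cost)
    fused = 0.7 * low + 0.3 * high            # gate values are placeholders
    print(np.allclose(low + high, cost), fused.shape)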
Submitted 3 December, 2025;
originally announced December 2025.
-
Consistent Synthetic Sequences Unlock Structural Diversity in Fully Atomistic De Novo Protein Design
Authors:
Danny Reidenbach,
Zhonglin Cao,
Zuobai Zhang,
Kieran Didi,
Tomas Geffner,
Guoqing Zhou,
Jian Tang,
Christian Dallago,
Arash Vahdat,
Emine Kucukbenli,
Karsten Kreis
Abstract:
High-quality training datasets are crucial for the development of effective protein design models, but existing synthetic datasets often include unfavorable sequence-structure pairs, impairing generative model performance. We leverage ProteinMPNN, whose sequences are experimentally favorable as well as amenable to folding, together with structure prediction models to align high-quality synthetic structures with recoverable synthetic sequences. In that way, we create a new dataset designed specifically for training expressive, fully atomistic protein generators. By retraining La-Proteina, which models discrete residue type and side chain structure in a continuous latent space, on this dataset, we achieve new state-of-the-art results, with improvements of +54% in structural diversity and +27% in co-designability. To validate the broad utility of our approach, we further introduce Proteina Atomistica, a unified flow-based framework that jointly learns the distribution of protein backbone structure, discrete sequences, and atomistic side chains without latent variables. We again find that training on our new sequence-structure data dramatically boosts benchmark performance, improving Proteina Atomistica's structural diversity by +73% and co-designability by +5%. Our work highlights the critical importance of aligned sequence-structure data for training high-performance de novo protein design models. Our new dataset, the Consistency Distilled Synthetic Protein Database, is made available as an open-source resource at https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/resources/proteina-atomistica_data/files?version=release.
Submitted 10 December, 2025; v1 submitted 1 December, 2025;
originally announced December 2025.
-
ViT$^3$: Unlocking Test-Time Training in Vision
Authors:
Dongchen Han,
Yining Li,
Tianyu Li,
Zixuan Cao,
Ziming Wang,
Jun Song,
Yu Cheng,
Bo Zheng,
Gao Huang
Abstract:
Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, in this paper, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training (ViT$^3$) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate ViT$^3$ across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT$^3$ consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT$^3$ baseline can facilitate future work on visual TTT models. Code is available at https://github.com/LeapLabTHU/ViTTT.
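For readers unfamiliar with the TTT formulation, here is a minimal "TTT-linear"-style inner loop, following the general recipe rather than ViT$^3$'s exact design: an inner weight matrix is updated online by one gradient step per key-value pair and then applied to the query.

    # Minimal test-time-training inner loop with a linear inner model: for each
    # token, take one gradient step so that W maps the key toward the value, then
    # read out W @ q. Dimensions and the learning rate are illustrative.
    import numpy as np

    def ttt_linear(Q, K, V, lr=0.1):
        d = K.shape[1]
        W = np.zeros((V.shape[1], d))           # inner model, updated at test time
        outputs = []
        for q, k, v in zip(Q, K, V):
            err = W @ k - v                      # inner loss: 0.5 * ||W k - v||^2
            W -= lr * np.outer(err, k)           # online gradient step
            outputs.append(W @ q)
        return np.stack(outputs)

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(16, 32)); K = rng.normal(size=(16, 32)); V = rng.normal(size=(16, 32))
    print(ttt_linear(Q, K, V).shape)             # (16, 32)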
Submitted 1 December, 2025;
originally announced December 2025.
-
SIMPLE: Disaggregating Sampling from GPU Inference into a Decision Plane for Faster Distributed LLM Serving
Authors:
Bohan Zhao,
Zane Cao,
Yongchao He
Abstract:
As large language models (LLMs) scale out with tensor parallelism (TP) and pipeline parallelism (PP) and production stacks have aggressively optimized the data plane (attention/GEMM and KV cache), sampling, the decision plane that turns logits into tokens, becomes a new bottleneck. This creates a structural holdout: sampling neither expands with TP nor balances across PP stages, so its share of iteration time grows as GPUs get faster and it caps pipeline frequency at the last stage. We present SIMPLE, a stage-agnostic, sequence-parallel, overlappable decision plane that disaggregates sampling into a CPU-side service and shrinks its runtime footprint back to a minor, hidden role. SIMPLE combines: (1) sequence-parallel sampling, which shards work along the batch dimension and removes vocabulary-axis collectives; (2) a CPU-based algorithm with column-wise penalties and truncation-first filtering to realize single-pass, linear-time kernels; and (3) speculative hot-vocab sampling (SHVS), which samples on a small hot set with rejection-correctness and uses a simple sizing model to choose the hot-vocab size that maximizes throughput. In evaluation, SIMPLE improves end-to-end throughput by up to 96% and reduces P95 latency by 20-65%. Crucially, SIMPLE requires no user-side code changes and composes with existing data-plane optimizations, unlocking scaling benefits that compound with future GPU generations.
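A toy sketch of truncation-first filtering: top-k truncation is applied before the repetition penalty, so the penalty only touches surviving candidates; k, the penalty value, and the use of top-k alone are illustrative choices, not the SIMPLE kernels.

    # Toy truncation-first sampling: top-k truncation happens before the
    # repetition penalty, so the penalty only touches surviving candidates.
    import numpy as np

    def sample_truncation_first(logits, recent_tokens, k=50, penalty=1.2, rng=None):
        rng = rng or np.random.default_rng()
        keep = np.argpartition(logits, -k)[-k:]          # candidate set after truncation
        cand = logits[keep].astype(float)
        for i, tok in enumerate(keep):                    # column-wise penalty on candidates only
            if tok in recent_tokens:
                cand[i] = cand[i] / penalty if cand[i] > 0 else cand[i] * penalty
        p = np.exp(cand - cand.max()); p /= p.sum()
        return int(keep[rng.choice(k, p=p)])

    logits = np.random.default_rng(0).normal(size=32_000)
    print(sample_truncation_first(logits, recent_tokens={17, 123}))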
Submitted 29 November, 2025;
originally announced December 2025.
-
Clinical-R1: Empowering Large Language Models for Faithful and Comprehensive Reasoning with Clinical Objective Relative Policy Optimization
Authors:
Boyang Gu,
Hongjian Zhou,
Bradley Max Segal,
Jinge Wu,
Zeyu Cao,
Hantao Zhong,
Lei Clifton,
Fenglin Liu,
David A. Clifton
Abstract:
Recent advances in large language models (LLMs) have shown strong reasoning capabilities through large-scale pretraining and post-training reinforcement learning, as demonstrated by DeepSeek-R1. However, current post-training methods, such as Grouped Relative Policy Optimization (GRPO), mainly reward correctness, which is not aligned with the multi-dimensional objectives required in high-stakes fields such as medicine, where reasoning must also be faithful and comprehensive. We introduce Clinical-Objective Relative Policy Optimization (CRPO), a scalable, multi-objective, verifiable reinforcement learning method designed to align LLM post-training with clinical reasoning principles. CRPO integrates rule-based and verifiable reward signals that jointly optimize accuracy, faithfulness, and comprehensiveness without relying on human annotation. To demonstrate its effectiveness, we train Clinical-R1-3B, a 3B-parameter model for clinical reasoning. Experiments on three benchmarks demonstrate that CRPO substantially improves reasoning truthfulness and completeness over standard GRPO while maintaining solid accuracy gains. This framework provides a scalable pathway to align LLM reasoning with clinical objectives, enabling safer and more collaborative AI systems for healthcare while also highlighting the potential of multi-objective, verifiable RL methods in post-training scaling of LLMs for medical domains.
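A hedged sketch of what a multi-objective verifiable reward could look like: a weighted sum of rule-based accuracy, faithfulness, and comprehensiveness checks. The checkers and the weights are placeholders, not CRPO's actual reward design.

    # Toy multi-objective reward: a weighted sum of rule-based accuracy,
    # faithfulness, and comprehensiveness checks. Checkers and weights are made up.
    def clinical_reward(answer: str, gold: str, evidence: list, checklist: list) -> float:
        accuracy = 1.0 if gold.lower() in answer.lower() else 0.0
        cited = sum(1 for e in evidence if e.lower() in answer.lower())
        faithfulness = cited / len(evidence) if evidence else 1.0
        covered = sum(1 for item in checklist if item.lower() in answer.lower())
        comprehensiveness = covered / len(checklist) if checklist else 1.0
        return 0.5 * accuracy + 0.3 * faithfulness + 0.2 * comprehensiveness

    ans = "Diagnosis: community-acquired pneumonia; supported by fever and consolidation on X-ray."
    print(clinical_reward(ans, gold="pneumonia",
                          evidence=["fever", "consolidation"],
                          checklist=["diagnosis", "x-ray"]))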
Submitted 3 December, 2025; v1 submitted 29 November, 2025;
originally announced December 2025.
-
Red Teaming Large Reasoning Models
Authors:
Jiawei Chen,
Yang Yang,
Chao Yu,
Yu Tian,
Zhi Cao,
Linghao Li,
Hang Su,
Zhaoxia Yin
Abstract:
Large Reasoning Models (LRMs) have emerged as a powerful advancement in multi-step reasoning tasks, offering enhanced transparency and logical consistency through explicit chains of thought (CoT). However, these models introduce novel safety and reliability risks, such as CoT-hijacking and prompt-induced inefficiencies, which are not fully captured by existing evaluation methods. To address this gap, we propose RT-LRM, a unified benchmark designed to assess the trustworthiness of LRMs. RT-LRM evaluates three core dimensions: truthfulness, safety and efficiency. Beyond metric-based evaluation, we further introduce the training paradigm as a key analytical perspective to investigate the systematic impact of different training strategies on model trustworthiness. We achieve this by designing a curated suite of 30 reasoning tasks from an observational standpoint. We conduct extensive experiments on 26 models and identify several valuable insights into the trustworthiness of LRMs. For example, LRMs generally face trustworthiness challenges and tend to be more fragile than Large Language Models (LLMs) when encountering reasoning-induced risks. These findings uncover previously underexplored vulnerabilities and highlight the need for more targeted evaluations. In addition, we release a scalable toolbox for standardized trustworthiness research to support future advancements in this important field. Our code and datasets will be open-sourced.
Submitted 29 November, 2025;
originally announced December 2025.
-
JBE-QA: Japanese Bar Exam QA Dataset for Assessing Legal Domain Knowledge
Authors:
Zhihan Cao,
Fumihito Nishino,
Hiroaki Yamada,
Nguyen Ha Thanh,
Yusuke Miyao,
Ken Satoh
Abstract:
We introduce JBE-QA, a Japanese Bar Exam Question-Answering dataset to evaluate large language models' legal knowledge. Derived from the multiple-choice (tanto-shiki) section of the Japanese bar exam (2015-2024), JBE-QA provides the first comprehensive benchmark for Japanese legal-domain evaluation of LLMs. It covers the Civil Code, the Penal Code, and the Constitution, extending beyond the Civil Code focus of prior Japanese resources. Each question is decomposed into independent true/false judgments with structured contextual fields. The dataset contains 3,464 items with balanced labels. We evaluate 26 LLMs, including proprietary, open-weight, Japanese-specialised, and reasoning models. Our results show that proprietary models with reasoning enabled perform best, and the Constitution questions are generally easier than the Civil Code or the Penal Code questions.
Submitted 27 November, 2025;
originally announced November 2025.
-
RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding
Authors:
Xiyan Liu,
Han Wang,
Yuhu Wang,
Junjie Cai,
Zhe Cao,
Jianzhong Yang,
Zhen Lu
Abstract:
Understanding mid-level road semantics, which capture the structural and contextual cues that link low-level perception to high-level planning, is essential for reliable autonomous driving and digital map construction. However, existing benchmarks primarily target perception tasks such as detection or segmentation, overlooking the reasoning capabilities required to infer road topology and dynamic scene structure. To address this gap, we present RoadSceneBench, a lightweight yet information-rich benchmark designed to evaluate and advance visual reasoning in complex road environments. Unlike large-scale perception datasets, RoadSceneBench emphasizes relational understanding and structural consistency, encouraging models to capture the underlying logic of real-world road scenes. Furthermore, to enhance reasoning reliability, we propose Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T), a training framework for Vision-Language Models (VLMs) in which reward signals adaptively promote spatial coherence and semantic alignment throughout the reasoning process. This paradigm enables models to move beyond static recognition toward geometry-aware and temporally consistent reasoning. Extensive experiments demonstrate that our method achieves state-of-the-art performance across diverse road configurations. RoadSceneBench thus provides a compact yet powerful foundation for studying mid-level road semantics and fostering structure-aware autonomous perception. Our dataset is available at https://github.com/XiyanLiu/RoadSceneBench.
Submitted 27 November, 2025;
originally announced November 2025.
-
Planet Migration in Protoplanetary Disks with Rims
Authors:
Zhuoya Cao,
Ya-Ping Li,
Douglas N. C. Lin,
Shude Mao
Abstract:
Complex structures, including sharp edges, rings and gaps, have been commonly observed in protoplanetary disks with or without planetary candidates. Here we consider the possibility that they are the intrinsic consequences of angular momentum transfer mechanisms, and investigate how they may influence the dynamical evolution of embedded planets. With the aid of numerical hydrodynamic simulations, we show that gas giants have a tendency to migrate away from sharp edges, whereas super-Earths embedded in the annuli tend to be retained. This implies that, observationally, Jupiters are preferentially detected in dark rings (gaps), whereas super-Earths tend to be found in bright rings (density bumps). Moreover, planets' tidal torques provide feedback, though not necessarily predominant, on the surface density profile. This tendency implies that Jupiter's gap-opening process deepens and widens the density gap associated with the dark ring, while super-Earths can be halted by a steep surface density gradient near the disk or ring boundaries. Hence, we expect a desert of super-Earths in the surface density gap.
Submitted 26 November, 2025;
originally announced November 2025.
-
Flexible mm-Wave Frequency and High-Speed Arbitrary IQ Signal Synthesis by a Photonic System on Chip
Authors:
Bowen Zhu,
Tao Zhu,
Yazhi Pi,
Chunyang Ma,
Xiaochuan Xu,
Zizheng Cao,
Lei Wang,
Shaohua Yu
Abstract:
Photonics-assisted signal generation in the millimeter-wave and terahertz bands offers significant advantages over traditional electronic methods by leveraging the inherent benefits of optical components, including broad bandwidth, low power consumption, and minimal insertion loss. This work utilizes a silicon photonic chip in conjunction with a reconfigurable optical frequency comb to demonstrate the synthesis of signals in the millimeter-wave range. The implemented photonic system performs on-chip filtering and modulation, producing high-bandwidth single-frequency, multi-frequency, and vector signals suitable for arbitrary IQ signal construction. These results highlight the flexible and reconfigurable capabilities of the proposed approach, providing new perspectives for applications in radio-over-fiber systems and beyond.
Submitted 26 November, 2025;
originally announced November 2025.
-
AutoLink: Autonomous Schema Exploration and Expansion for Scalable Schema Linking in Text-to-SQL at Scale
Authors:
Ziyang Wang,
Yuanlei Zheng,
Zhenbiao Cao,
Xiaojin Zhang,
Zhongyu Wei,
Pei Fu,
Zhenbo Luo,
Wei Chen,
Xiang Bai
Abstract:
For industrial-scale text-to-SQL, supplying the entire database schema to Large Language Models (LLMs) is impractical due to context window limits and irrelevant noise. Schema linking, which filters the schema to a relevant subset, is therefore critical. However, existing methods incur prohibitive costs, struggle to trade off recall and noise, and scale poorly to large databases. We present \textbf{AutoLink}, an autonomous agent framework that reformulates schema linking as an iterative, agent-driven process. Guided by an LLM, AutoLink dynamically explores and expands the linked schema subset, progressively identifying necessary schema components without inputting the full database schema. Our experiments demonstrate AutoLink's superior performance, achieving state-of-the-art strict schema linking recall of \textbf{97.4\%} on Bird-Dev and \textbf{91.2\%} on Spider-2.0-Lite, with competitive execution accuracy, i.e., \textbf{68.7\%} EX on Bird-Dev (better than CHESS) and \textbf{34.9\%} EX on Spider-2.0-Lite (ranking 2nd on the official leaderboard). Crucially, AutoLink exhibits \textbf{exceptional scalability}, \textbf{maintaining high recall}, \textbf{efficient token consumption}, and \textbf{robust execution accuracy} on large schemas (e.g., over 3,000 columns) where existing methods severely degrade, making it a highly scalable, high-recall schema-linking solution for industrial text-to-SQL systems.
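A minimal sketch of what an iterative, agent-driven schema-linking loop of this kind could look like is given below. The `llm_propose` helper is a hypothetical stand-in for an LLM call, and the convergence rule and prompts are assumptions, not AutoLink's exact procedure.

```python
# Schematic iterative schema-linking loop in the spirit of AutoLink.
def link_schema(question, schema_index, llm_propose, max_rounds=5):
    linked = set()                      # currently linked columns, e.g. "orders.order_id"
    for _ in range(max_rounds):
        # Ask the LLM which additional schema elements look necessary,
        # given only the question and the schema linked so far.
        proposals = llm_propose(question, sorted(linked))
        # Keep only proposals that actually exist in the database schema.
        valid = {c for c in proposals if c in schema_index}
        if valid <= linked:             # no new elements -> converged
            break
        linked |= valid
    return sorted(linked)

# Toy usage with a stub "LLM" that discovers one more column per round.
schema_index = {"orders.order_id", "orders.user_id", "users.user_id", "users.name"}
stub = lambda q, linked: ["orders.user_id", "users.user_id", "users.name"][:len(linked) + 1]
print(link_schema("Which users placed orders?", schema_index, stub))
```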
Submitted 21 November, 2025;
originally announced November 2025.
-
Rigidity of five-dimensional quasi-Einstein manifolds with constant scalar curvature
Authors:
Zhongxian Cao
Abstract:
Let $(M^5,g)$ be a five-dimensional non-trivial simply-connected compact quasi-Einstein manifold with boundary. If $M$ has constant scalar curvature $R$, Johnatan Costa, Ernani Ribeiro Jr, and Detang Zhou showed that $R = \frac{(m-5)k+20}{m-k+4}\lambda$ for some $k\in\{0,2,3,4\}$. The cases $k=0$ and $k=4$ have already been classified. In this paper we prove that the case $k=3$ is rigid.
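For reference, substituting each admissible $k$ into the quoted formula gives the corresponding constant scalar curvature values; this is a direct substitution, not an additional result of the paper.
\[
R=\frac{(m-5)k+20}{m-k+4}\,\lambda,\qquad
R\big|_{k=0}=\frac{20}{m+4}\,\lambda,\quad
R\big|_{k=2}=\frac{2(m+5)}{m+2}\,\lambda,\quad
R\big|_{k=3}=\frac{3m+5}{m+1}\,\lambda,\quad
R\big|_{k=4}=\frac{4m}{m}\,\lambda=4\lambda.
\]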
Submitted 20 November, 2025;
originally announced November 2025.
-
BokehFlow: Depth-Free Controllable Bokeh Rendering via Flow Matching
Authors:
Yachuan Huang,
Xianrui Luo,
Qiwen Wang,
Liao Shen,
Jiaqi Li,
Huiqiang Sun,
Zihao Huang,
Wei Jiang,
Zhiguo Cao
Abstract:
Bokeh rendering simulates the shallow depth-of-field effect in photography, enhancing visual aesthetics and guiding viewer attention to regions of interest. Although recent approaches perform well, rendering controllable bokeh without additional depth inputs remains a significant challenge. Existing classical and neural controllable methods rely on accurate depth maps, while generative approaches often struggle with limited controllability and efficiency. In this paper, we propose BokehFlow, a depth-free framework for controllable bokeh rendering based on flow matching. BokehFlow directly synthesizes photorealistic bokeh effects from all-in-focus images, eliminating the need for depth inputs. It employs a cross-attention mechanism to enable semantic control over both focus regions and blur intensity via text prompts. To support training and evaluation, we collect and synthesize four datasets. Extensive experiments demonstrate that BokehFlow achieves visually compelling bokeh effects and offers precise control, outperforming existing depth-dependent and generative methods in both rendering quality and efficiency.
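The abstract does not spell out the training objective, so the sketch below shows a generic conditional flow-matching step of the kind such a model could use: a straight path between the all-in-focus image and the bokeh image, with the network regressing the path's velocity under text/blur conditioning. The linear path, the conditioning interface, and the stub model are assumptions, not the paper's implementation.

```python
import torch

def flow_matching_loss(model, sharp, bokeh, cond_embed):
    """One conditional flow-matching step (schematic).

    sharp/bokeh: all-in-focus and bokeh images, shape (B, C, H, W);
    cond_embed: text/blur-strength conditioning, shape (B, D).
    """
    b = sharp.size(0)
    t = torch.rand(b, 1, 1, 1, device=sharp.device)      # random time in [0, 1]
    x_t = (1 - t) * sharp + t * bokeh                     # straight path between endpoints
    target_velocity = bokeh - sharp                       # d/dt of the straight path
    pred_velocity = model(x_t, t.flatten(), cond_embed)   # network predicts the velocity field
    return torch.mean((pred_velocity - target_velocity) ** 2)

# Toy usage with a stub "model" that ignores its inputs.
stub = lambda x, t, c: torch.zeros_like(x)
sharp = torch.rand(2, 3, 8, 8); bokeh = torch.rand(2, 3, 8, 8); cond = torch.rand(2, 16)
print(flow_matching_loss(stub, sharp, bokeh, cond).item())
```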
Submitted 18 November, 2025;
originally announced November 2025.
-
nuCarla: A nuScenes-Style Bird's-Eye View Perception Dataset for CARLA Simulation
Authors:
Zhijie Qiao,
Zhong Cao,
Henry X. Liu
Abstract:
End-to-end (E2E) autonomous driving heavily relies on closed-loop simulation, where perception, planning, and control are jointly trained and evaluated in interactive environments. Yet, most existing datasets are collected from the real world under non-interactive conditions, primarily supporting open-loop learning while offering limited value for closed-loop testing. Due to the lack of standardized, large-scale, and thoroughly verified datasets to facilitate learning of meaningful intermediate representations, such as bird's-eye-view (BEV) features, closed-loop E2E models remain far behind even simple rule-based baselines. To address this challenge, we introduce nuCarla, a large-scale, nuScenes-style BEV perception dataset built within the CARLA simulator. nuCarla features (1) full compatibility with the nuScenes format, enabling seamless transfer of real-world perception models; (2) a dataset scale comparable to nuScenes, but with more balanced class distributions; (3) direct usability for closed-loop simulation deployment; and (4) high-performance BEV backbones that achieve state-of-the-art detection results. By providing both data and models as open benchmarks, nuCarla substantially accelerates closed-loop E2E development, paving the way toward reliable and safety-aware research in autonomous driving.
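Because the dataset advertises full nuScenes-format compatibility, the standard nuscenes-devkit should in principle be able to index it; a hedged loading example is shown below. The version string and data root are placeholders, not names published with the dataset.

```python
from nuscenes.nuscenes import NuScenes

# Placeholder paths/version: adjust to wherever the nuCarla release is unpacked.
nusc = NuScenes(version="v1.0-trainval", dataroot="/data/nucarla", verbose=True)

sample = nusc.sample[0]                      # first keyframe
cam_token = sample["data"]["CAM_FRONT"]      # front-camera sample_data token
cam = nusc.get("sample_data", cam_token)
anns = [nusc.get("sample_annotation", t) for t in sample["anns"]]
print(cam["filename"], len(anns), "annotated boxes")
```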
Submitted 12 November, 2025;
originally announced November 2025.
-
PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
Authors:
Ziang Cao,
Fangzhou Hong,
Zhaoxi Chen,
Liang Pan,
Ziwei Liu
Abstract:
3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193x, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning and significantly improving generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we construct a new dataset, PhysX-Mobility, which expands the object categories in prior physical 3D datasets by over 2x and includes more than 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images demonstrate that PhysX-Anything delivers strong generative performance and robust generalization. Furthermore, simulation-based experiments in a MuJoCo-style environment validate that our sim-ready assets can be directly used for contact-rich robotic policy learning. We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.
Submitted 17 November, 2025;
originally announced November 2025.
-
Robust Client-Server Watermarking for Split Federated Learning
Authors:
Jiaxiong Tang,
Zhengchunmin Dai,
Liantao Wu,
Peng Sun,
Honglong Chen,
Zhenfu Cao
Abstract:
Split Federated Learning (SFL) is renowned for its privacy-preserving nature and low computational overhead among decentralized machine learning paradigms. In this framework, clients employ lightweight models to process private data locally and transmit intermediate outputs to a powerful server for further computation. However, SFL is a double-edged sword: while it enables edge computing and enhances privacy, it also introduces intellectual property ambiguity as both clients and the server jointly contribute to training. Existing watermarking techniques fail to protect both sides since no single participant possesses the complete model. To address this, we propose RISE, a Robust model Intellectual property protection scheme using client-Server watermark Embedding for SFL. Specifically, RISE adopts an asymmetric client-server watermarking design: the server embeds feature-based watermarks through a loss regularization term, while clients embed backdoor-based watermarks by injecting predefined trigger samples into private datasets. This co-embedding strategy enables both clients and the server to verify model ownership. Experimental results on standard datasets and multiple network architectures show that RISE achieves over $95\%$ watermark detection rate ($p$-value $< 0.03$) across most settings. It exhibits no mutual interference between client- and server-side watermarks and remains robust against common removal attacks.
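A rough sketch of the two embedding mechanisms named above follows: a server-side feature regularizer that pushes projected activations toward a secret signature, and a client-side backdoor that stamps trigger samples into private data. The exact losses, keys, and trigger design used by RISE are not specified here, so this follows common watermarking recipes and every name and constant is an assumption.

```python
import torch

def server_watermark_loss(features, key_matrix, signature_bits, strength=0.01):
    """Feature-based watermark regularizer (server side), schematic only.

    Projects batch-averaged activations through a secret key matrix and pushes
    the signs toward a binary signature via a hinge penalty.
    """
    projected = features.mean(dim=0) @ key_matrix            # (D,) @ (D, K) -> (K,)
    target = signature_bits.float() * 2 - 1                  # {0,1} -> {-1,+1}
    return strength * torch.nn.functional.relu(1 - target * projected).mean()

def inject_client_triggers(images, labels, trigger, target_label, rate=0.05):
    """Backdoor-style watermark (client side): stamp a trigger patch on a
    small fraction of private samples and relabel them."""
    n = max(1, int(rate * images.size(0)))
    images = images.clone(); labels = labels.clone()
    images[:n, :, -trigger.size(-2):, -trigger.size(-1):] = trigger
    labels[:n] = target_label
    return images, labels

# Toy usage.
feats = torch.randn(32, 128); key = torch.randn(128, 64); sig = torch.randint(0, 2, (64,))
print(server_watermark_loss(feats, key, sig).item())
imgs, labs = inject_client_triggers(torch.rand(20, 3, 32, 32),
                                    torch.zeros(20, dtype=torch.long),
                                    torch.ones(3, 4, 4), target_label=7)
print(labs[:3])
```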
Submitted 17 November, 2025;
originally announced November 2025.
-
Semi-Supervised High Dynamic Range Image Reconstructing via Bi-Level Uncertain Area Masking
Authors:
Wei Jiang,
Jiahao Cui,
Yizheng Wu,
Zhan Peng,
Zhiyu Pan,
Zhiguo Cao
Abstract:
Reconstructing high dynamic range (HDR) images from low dynamic range (LDR) bursts plays an essential role in computational photography. Impressive progress has been achieved by learning-based algorithms, which require LDR-HDR image pairs. However, these pairs are hard to obtain, which motivates researchers to delve into the problem of annotation-efficient HDR image reconstruction: how to achieve comparable performance with limited HDR ground truths (GTs). This work attempts to address this problem from the view of semi-supervised learning, where a teacher model generates pseudo HDR GTs for the LDR samples without GTs and a student model learns from these pseudo GTs. Nevertheless, confirmation bias, i.e., the risk that the student learns from artifacts in the pseudo HDR GTs, presents an impediment. To remove this impediment, an uncertainty-based masking process is proposed to discard unreliable parts of pseudo GTs at both the pixel and patch levels, so that the student learns only from trusted areas. With this novel masking process, our semi-supervised HDR reconstruction method not only outperforms previous annotation-efficient algorithms, but also achieves performance comparable to up-to-date fully-supervised methods while using only 6.7% of the HDR GTs.
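A minimal sketch of a bi-level (pixel- and patch-level) masking step of this kind is shown below: a pixel mask from an uncertainty threshold, and a patch mask that keeps a patch only if most of its pixels are trusted. The thresholds and the 16x16 patch size are illustrative choices, not values from the paper.

```python
import torch
import torch.nn.functional as F

def bi_level_mask(uncertainty, pixel_thr=0.1, patch=16, patch_keep=0.9):
    """Bi-level masking of pseudo HDR ground truths (schematic).

    uncertainty: (B, 1, H, W) per-pixel uncertainty of the teacher's pseudo GT.
    Returns a (B, 1, H, W) binary mask of trusted regions.
    """
    pixel_mask = (uncertainty < pixel_thr).float()
    # Fraction of trusted pixels within each patch.
    frac = F.avg_pool2d(pixel_mask, kernel_size=patch, stride=patch)
    patch_mask = (frac > patch_keep).float()
    # Upsample the patch decision back to pixel resolution and combine.
    patch_mask = F.interpolate(patch_mask, scale_factor=patch, mode="nearest")
    return pixel_mask * patch_mask

u = torch.rand(1, 1, 64, 64) * 0.2
mask = bi_level_mask(u)
print("trusted fraction:", mask.mean().item())
```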
Submitted 16 November, 2025;
originally announced November 2025.
-
Generative Photographic Control for Scene-Consistent Video Cinematic Editing
Authors:
Huiqiang Sun,
Liao Shen,
Zhan Peng,
Kun Wang,
Size Wu,
Yuhang Zang,
Tianqi Liu,
Zihao Huang,
Xingyu Zeng,
Zhiguo Cao,
Wei Li,
Chen Change Loy
Abstract:
Cinematic storytelling is profoundly shaped by the artful manipulation of photographic elements such as depth of field and exposure. These effects are crucial in conveying mood and creating aesthetic appeal. However, controlling these effects in generative video models remains highly challenging, as most existing methods are restricted to camera motion control. In this paper, we propose CineCtrl, the first video cinematic editing framework that provides fine control over professional camera parameters (e.g., bokeh, shutter speed). We introduce a decoupled cross-attention mechanism to disentangle camera motion from photographic inputs, allowing fine-grained, independent control without compromising scene consistency. To overcome the shortage of training data, we develop a comprehensive data generation strategy that leverages simulated photographic effects with a dedicated real-world collection pipeline, enabling the construction of a large-scale dataset for robust model training. Extensive experiments demonstrate that our model generates high-fidelity videos with precisely controlled, user-specified photographic camera effects.
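One way to read "decoupled cross-attention" is as two parallel cross-attention branches, one attending to camera-motion tokens and one to photographic-parameter tokens, whose outputs are fused back into the video tokens. The sketch below implements that reading; the dimensions, token counts, and additive fusion are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    """Two independent cross-attention branches (schematic): one for
    camera-motion tokens, one for photographic-parameter tokens
    (bokeh, shutter speed, ...)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.motion_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.photo_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, motion_tokens, photo_tokens):
        m, _ = self.motion_attn(video_tokens, motion_tokens, motion_tokens)
        p, _ = self.photo_attn(video_tokens, photo_tokens, photo_tokens)
        return video_tokens + m + p

layer = DecoupledCrossAttention()
out = layer(torch.rand(2, 77, 256), torch.rand(2, 8, 256), torch.rand(2, 4, 256))
print(out.shape)
```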
Submitted 16 November, 2025;
originally announced November 2025.
-
Multi-Agent Reinforcement Learning for Heterogeneous Satellite Cluster Resources Optimization
Authors:
Mohamad A. Hady,
Siyi Hu,
Mahardhika Pratama,
Zehong Cao,
Ryszard Kowalczyk
Abstract:
This work investigates resource optimization in heterogeneous satellite clusters performing autonomous Earth Observation (EO) missions using Reinforcement Learning (RL). In the proposed setting, two optical satellites and one Synthetic Aperture Radar (SAR) satellite operate cooperatively in low Earth orbit to capture ground targets and manage their limited onboard resources efficiently. Traditional optimization methods struggle to handle the real-time, uncertain, and decentralized nature of EO operations, motivating the use of RL and Multi-Agent Reinforcement Learning (MARL) for adaptive decision-making. This study systematically formulates the optimization problem from single-satellite to multi-satellite scenarios, addressing key challenges including energy and memory constraints, partial observability, and agent heterogeneity arising from diverse payload capabilities. Using a near-realistic simulation environment built on the Basilisk and BSK-RL frameworks, we evaluate the performance and stability of state-of-the-art MARL algorithms such as MAPPO, HAPPO, and HATRPO. Results show that MARL enables effective coordination across heterogeneous satellites, balancing imaging performance and resource utilization while mitigating non-stationarity and inter-agent reward coupling. The findings provide practical insights into scalable, autonomous satellite operations and contribute a foundation for future research on intelligent EO mission planning under heterogeneous and dynamic conditions.
Submitted 16 November, 2025;
originally announced November 2025.
-
Mock Observations for the CSST Mission: Multi-Channel Imager--Instrument Simulation
Authors:
Zhao-Jun Yan,
Huan-Yuan Shan,
Zhen-Ya Zheng,
Xi-Yan Peng,
Zhao-Xiang Qi,
Chun Xu,
Lin Lin,
Xin-Rong Wen,
Chun-Yan Jiang,
Li-Xin Zheng,
Jing Zhong,
Fang-Ting Yuan,
Zhen-Lei Chen,
Wei Chen,
Mao-Chun Wu,
Zhen-Sen Fu,
Ke-Xin Li,
Lin Nie,
Chao Liu,
Nan Li,
Qiao Wang,
Zi-Huang Cao,
Shuai Feng,
Guo-Liang Li,
Lei Wang
, et al. (18 additional authors not shown)
Abstract:
The Chinese Space Station Survey Telescope (CSST), a two-meter aperture astronomical space telescope under China's manned space program, is equipped with multiple back-end scientific instruments. As an astronomical precision measurement module of the CSST, the Multi-Channel Imager (MCI) can cover a wide wavelength range from ultraviolet to near-infrared with three-color simultaneous high-precision photometry and imaging, which meets the scientific requirements for various fields. The diverse scientific objectives of MCI require not only a robust spaceborne platform, advanced optical systems, and observing facilities but also comprehensive software support for scientific operations and research. To this end, it is essential to develop realistic observational simulation software to thoroughly evaluate the MCI data stream and provide calibration tools for future scientific investigations. The MCI instrument simulation software will serve as a foundation for the development of the MCI data processing pipeline and will facilitate improvements in both hardware and software, as well as in the observational operation strategy, in alignment with the mission's scientific goals. In conclusion, we present a comprehensive overview of the MCI instrument simulation and the corresponding performance of the MCI data processing pipeline.
Submitted 16 November, 2025;
originally announced November 2025.
-
On the coincidence between the close passage of HD7977 and the Pliocene-Pleistocene transition
Authors:
Zhuoya Cao,
Abraham Loeb,
Morgan MacLeod
Abstract:
The Oort Cloud's dynamical evolution is significantly influenced by both the galactic tide and stellar flybys. This study investigates the particular case of HD7977's close encounter 2.47 Myr ago, which likely repopulated the Inner Oort Cloud and potentially triggered a significant comet shower on Earth. Our results demonstrate that the shower's intensity strongly depends on HD7977's impact parameter ($b$), with possible flyby distances ranging from 2,300 AU to $\sim$ 13,000 AU. For the closest approach ($b \sim 2,300$ AU), the terrestrial impact probability of 1 km comets increases by an order of magnitude compared to the steady state, slightly exceeding the asteroid impact probability at this size scale. We propose an analytical method to compute the probability of comet showers impacting Earth, which saves considerable computation time compared to N-body simulations. We identify a threshold diameter $D_0 = 2.25$ km that yields $P = 1$ in our model, with $D_0$ following a logarithmic dependence on $b$. These findings suggest that HD7977's flyby may have caused an enhanced comet flux during the Pliocene-Pleistocene transition, which could plausibly be related to the environmental changes during this era.
Submitted 14 November, 2025;
originally announced November 2025.
-
Raman fingerprint of high-temperature superconductivity in compressed hydrides
Authors:
Philip Dalladay-Simpson,
Guglielmo Marchese,
Zi-Yu Cao,
Paolo Barone,
Lara Benfatto,
Gaston Garbarino,
Francesco Mauri,
Federico Aiace Gorelli
Abstract:
The discovery of high-temperature superconductivity in hydrogen-rich compounds under extreme pressures has prompted great excitement, intense research, but also debate over the past decade. Electrical transport has been the primary diagnostic tool for identifying superconductivity in these systems, whereas complementary probes, including magnetic, spectroscopic, tunnelling and ultrafast methods, remain mostly qualitative due to experimental constraints and sample heterogeneity. Recent concerns over their reliability have fuelled controversy, leading to scepticism and pointing out the need for alternative, quantitative approaches. In this study, we acquired unprecedented high-quality Raman spectra of hexagonal LaH10 at approximately 145 GPa and low temperatures, in conjunction with electrical transport measurements. Upon cooling, we observe a drop of resistivity and simultaneous remarkable variations of phonon frequencies and linewidths. These effects are interpreted and perfectly reproduced by the Migdal-Eliashberg theory, providing a definitive proof of phonon-mediated superconductivity and enabling a quantitative determination of the superconducting energy gap. Our results establish Raman spectroscopy as a robust, contact-free probe with micrometric resolution for studying high temperature superconductivity, opening a powerful route to its discovery and characterization.
Submitted 13 November, 2025;
originally announced November 2025.
-
AgentEvolver: Towards Efficient Self-Evolving Agent System
Authors:
Yunpeng Zhai,
Shuchang Tao,
Cheng Chen,
Anni Zou,
Ziqian Chen,
Qingxu Fu,
Shinji Mai,
Li Yu,
Jiaji Deng,
Zouying Cao,
Zhaoyang Liu,
Bolin Ding,
Jingren Zhou
Abstract:
Autonomous agents powered by large language models (LLMs) have the potential to significantly enhance human productivity by reasoning, using tools, and executing complex tasks in diverse environments. However, current approaches to developing such agents remain costly and inefficient, as they typically require manually constructed task datasets and reinforcement learning (RL) pipelines with extensive random exploration. These limitations lead to prohibitively high data-construction costs, low exploration efficiency, and poor sample utilization. To address these challenges, we present AgentEvolver, a self-evolving agent system that leverages the semantic understanding and reasoning capabilities of LLMs to drive autonomous agent learning. AgentEvolver introduces three synergistic mechanisms: (i) self-questioning, which enables curiosity-driven task generation in novel environments, reducing dependence on handcrafted datasets; (ii) self-navigating, which improves exploration efficiency through experience reuse and hybrid policy guidance; and (iii) self-attributing, which enhances sample efficiency by assigning differentiated rewards to trajectory states and actions based on their contribution. By integrating these mechanisms into a unified framework, AgentEvolver enables scalable, cost-effective, and continual improvement of agent capabilities. Preliminary experiments indicate that AgentEvolver achieves more efficient exploration, better sample utilization, and faster adaptation compared to traditional RL-based baselines.
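A schematic loop wiring together the three named mechanisms is sketched below. The tiny stub agent and experience pool exist only so the sketch runs; none of the class or method names correspond to AgentEvolver's actual API.

```python
# Schematic self-evolution loop: (i) self-questioning proposes tasks,
# (ii) self-navigating reuses prior experience during rollouts,
# (iii) self-attributing assigns per-step credit before the policy update.
class ExperiencePool:
    def __init__(self): self.items = []
    def retrieve(self, task): return [t for k, t in self.items if k == task]
    def add(self, task, traj): self.items.append((task, traj))

class StubAgent:
    def propose_tasks(self, env): return [f"explore:{env}:{i}" for i in range(2)]           # self-questioning
    def run_episode(self, env, task, hints): return {"task": task, "steps": 3 + len(hints)}  # self-navigating
    def attribute_rewards(self, traj): return [1.0 / traj["steps"]] * traj["steps"]          # self-attributing
    def update_policy(self, traj, step_rewards): pass

def evolve(agent, env, experience, rounds=2):
    for _ in range(rounds):
        for task in agent.propose_tasks(env):
            traj = agent.run_episode(env, task, hints=experience.retrieve(task))
            agent.update_policy(traj, agent.attribute_rewards(traj))
            experience.add(task, traj)
    return experience

pool = evolve(StubAgent(), "web-env", ExperiencePool())
print(len(pool.items), "trajectories collected")
```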
Submitted 13 November, 2025;
originally announced November 2025.
-
Bridging Synthetic and Real Routing Problems via LLM-Guided Instance Generation and Progressive Adaptation
Authors:
Jianghan Zhu,
Yaoxin Wu,
Zhuoyi Lin,
Zhengyuan Zhang,
Haiyan Yin,
Zhiguang Cao,
Senthilnath Jayavelu,
Xiaoli Li
Abstract:
Recent advances in Neural Combinatorial Optimization (NCO) methods have significantly improved the capability of neural solvers to handle synthetic routing instances. Nonetheless, existing neural solvers typically struggle to generalize effectively from synthetic, uniformly-distributed training data to real-world VRP scenarios, including widely recognized benchmark instances from TSPLib and CVRPLib. To bridge this generalization gap, we present Evolutionary Realistic Instance Synthesis (EvoReal), which leverages an evolutionary module guided by large language models (LLMs) to generate synthetic instances characterized by diverse and realistic structural patterns. Specifically, the evolutionary module produces synthetic instances whose structural attributes statistically mimic those observed in authentic real-world instances. Subsequently, pre-trained NCO models are progressively refined, first aligning them with these structurally enriched synthetic distributions and then further adapting them through direct fine-tuning on actual benchmark instances. Extensive experimental evaluations demonstrate that EvoReal markedly improves the generalization capabilities of state-of-the-art neural solvers, yielding a notably reduced performance gap relative to the optimal solutions on the TSPLib (1.05%) and CVRPLib (2.71%) benchmarks across a broad spectrum of problem scales.
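To illustrate the overall idea (not EvoReal's actual algorithm), the sketch below evolves the parameters of a toy clustered-instance generator so that the structural statistics of its synthetic instances approach those measured on a "real" instance; a random perturbation stands in for the LLM-guided mutation, and the descriptor, generator, and loop are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def structure_stats(coords):
    """Crude structural descriptor: mean and std of pairwise node distances
    (a stand-in for the richer statistics an instance-synthesis module would match)."""
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    return np.array([d.mean(), d.std()])

def generate(params, n=50):
    """Toy clustered-instance generator; params = (num_clusters, spread)."""
    k, spread = int(params[0]), params[1]
    centers = rng.random((k, 2))
    return centers[rng.integers(k, size=n)] + rng.normal(0, spread, (n, 2))

# Target statistics, as if measured on a real-world benchmark instance.
target = structure_stats(generate(np.array([3, 0.02])))

# Evolution loop: mutate generator parameters, keep the candidate whose
# instances best match the target structure (an LLM would propose mutations here).
best, best_err = np.array([8.0, 0.3]), np.inf
for _ in range(200):
    cand = best + rng.normal(0, [1.0, 0.05])
    cand[0] = np.clip(cand[0], 1, 12); cand[1] = abs(cand[1]) + 1e-3
    err = np.linalg.norm(structure_stats(generate(cand)) - target)
    if err < best_err:
        best, best_err = cand, err
print("evolved generator params:", best, "error:", round(float(best_err), 4))
```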
Submitted 13 November, 2025;
originally announced November 2025.
-
HI-TransPA: Hearing Impairments Translation Personal Assistant
Authors:
Zhiming Ma,
Shiyu Gan,
Junhao Zhao,
Xianming Li,
Qingyun Pan,
Peidong Wang,
Mingjun Pan,
Yuhao Mo,
Jiajie Cheng,
Chengxin Chen,
Zhonglun Cao,
Chonghan Liu,
Shi Cheng
Abstract:
Hearing-impaired individuals often face significant barriers in daily communication due to the inherent challenges of producing clear speech. To address this, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with lip dynamics, enabling both translation and dialogue within a single multimodal framework. To address the distinctive pronunciation patterns of hearing-impaired speech and the limited adaptability of existing models, we develop a multimodal preprocessing and curation pipeline that detects facial landmarks, stabilizes the lip region, and quantitatively evaluates sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. Architecturally, we employ a novel unified 3D-Resampler to efficiently encode lip dynamics, which is critical for accurate interpretation. Experiments on the purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. Our work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.
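A quality-score-guided curriculum of the kind described above could be as simple as the sketch below: start from high-confidence samples and lower the admission threshold over epochs. The linear schedule and cut-off values are illustrative, not the paper's settings.

```python
def curriculum_batches(samples, quality, epochs=5, start_q=0.9, end_q=0.3):
    """Quality-guided curriculum (schematic): begin with clean, high-confidence
    samples and progressively admit harder ones.

    samples: list of training items; quality: matching list of scores in [0, 1].
    """
    for epoch in range(epochs):
        thr = start_q + (end_q - start_q) * epoch / max(1, epochs - 1)
        admitted = [s for s, q in zip(samples, quality) if q >= thr]
        yield epoch, thr, admitted

data = ["s0", "s1", "s2", "s3"]
scores = [0.95, 0.7, 0.5, 0.2]
for epoch, thr, batch in curriculum_batches(data, scores):
    print(f"epoch {epoch}: threshold {thr:.2f}, {len(batch)} samples")
```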
Submitted 14 November, 2025; v1 submitted 12 November, 2025;
originally announced November 2025.
-
Enabling Agents to Communicate Entirely in Latent Space
Authors:
Zhuoyun Du,
Runze Wang,
Huiyu Bai,
Zouying Cao,
Xiaoyong Zhu,
Bo Zheng,
Wei Chen,
Haochao Ying
Abstract:
While natural language is the de facto communication medium for LLM-based agents, it presents a fundamental constraint. The process of downsampling rich, internal latent states into discrete tokens inherently limits the depth and nuance of information that can be transmitted, thereby hindering collaborative problem-solving. Inspired by human mind-reading, we propose Interlat (Inter-agent Latent Space Communication), a paradigm that leverages the last hidden states of an LLM as a representation of its mind for direct transmission (termed latent communication). An additional compression process further compresses latent communication via entirely latent space reasoning. Experiments demonstrate that Interlat outperforms both fine-tuned chain-of-thought (CoT) prompting and single-agent baselines, promoting more exploratory behavior and enabling genuine utilization of latent information. Further compression not only substantially accelerates inference but also maintains competitive performance through an efficient information-preserving mechanism. We position this work as a feasibility study of entirely latent space inter-agent communication, and our results highlight its potential, offering valuable insights for future research.
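In spirit, latent communication means feeding one agent's final hidden states to another agent instead of emitting tokens. The sketch below does this with two GPT-2 models and a single linear projector; GPT-2 is only a stand-in, and the untrained projector is an assumption, not Interlat's learned bridge.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
agent_a = AutoModelForCausalLM.from_pretrained("gpt2")
agent_b = AutoModelForCausalLM.from_pretrained("gpt2")
# Map A's hidden size to B's input embedding size (here they coincide).
projector = torch.nn.Linear(agent_a.config.n_embd, agent_b.config.n_embd)

with torch.no_grad():
    ids = tok("Plan the next step for the task.", return_tensors="pt").input_ids
    # Agent A's "mind": last-layer hidden states, never decoded into tokens.
    hidden = agent_a(ids, output_hidden_states=True).hidden_states[-1]
    message = projector(hidden)
    # Agent B consumes the latent message directly as input embeddings.
    out = agent_b(inputs_embeds=message)
print(out.logits.shape)
```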
Submitted 12 November, 2025;
originally announced November 2025.
-
SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control
Authors:
Zhengyi Luo,
Ye Yuan,
Tingwu Wang,
Chenran Li,
Sirui Chen,
Fernando Castañeda,
Zi-Ang Cao,
Jiefeng Li,
David Minor,
Qingwei Ben,
Xingye Da,
Runyu Ding,
Cyrus Hogg,
Lina Song,
Edy Lim,
Eugene Jeong,
Tairan He,
Haoru Xue,
Wenli Xiao,
Zi Wang,
Simon Yuen,
Jan Kautz,
Yan Chang,
Umar Iqbal,
Linxi "Jim" Fan
, et al. (1 additional authors not shown)
Abstract:
Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs over several days. We show that scaling up model capacity, data, and compute yields a generalist humanoid controller capable of creating natural and robust whole-body movements. Specifically, we posit motion tracking as a natural and scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (from 1.2M to 42M parameters), dataset volume (over 100M frames, 700 hours of high-quality motion data), and compute (9k GPU hours). Beyond demonstrating the benefits of scale, we show the practical utility of our model through two mechanisms: (1) a real-time universal kinematic planner that bridges motion tracking to downstream task execution, enabling natural and interactive control, and (2) a unified token space that supports various motion input interfaces, such as VR teleoperation devices, human videos, and vision-language-action (VLA) models, all using the same policy. Scaling motion tracking exhibits favorable properties: performance improves steadily with increased compute and data diversity, and learned representations generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.
Submitted 4 December, 2025; v1 submitted 10 November, 2025;
originally announced November 2025.
-
Multipartite steering verification with imprecise measurements
Authors:
Zeyang Lu,
Chan Li,
Gang Wang,
Zhu Cao
Abstract:
Quantum steering is a fundamental quantum correlation that plays a pivotal role in quantum technologies, but its verification crucially relies on precise measurements -- an assumption often undermined by practical imperfections. Here, we investigate multipartite steering verification under imprecise measurements and develop a quantitative method that effectively eliminates false positives induced by measurement imprecision. A comparison with a device-independent approach demonstrates that our method accurately delineates the scope of valid verification. In a special case, our method also enables the verification of multipartite entanglement under nonideal conditions. These results substantially enhance the robustness of multipartite steering and entanglement verification against measurement imprecision, thereby promoting their applicability in realistic quantum technologies.
Submitted 10 November, 2025;
originally announced November 2025.
-
Mock Observations for the CSST Mission: Main Surveys--An Overview of Framework and Simulation Suite
Authors:
Cheng-Liang Wei,
Guo-Liang Li,
Yue-Dong Fang,
Xin Zhang,
Yu Luo,
Hao Tian,
De-Zi Liu,
Xian-Ming Meng,
Zhang Ban,
Xiao-Bo Li,
Zun Luo,
Jing-Tian Xian,
Wei Wang,
Xi-Yan Peng,
Nan Li,
Ran Li,
Li Shao,
Tian-Meng Zhang,
Jing Tang,
Yang Chen,
Zhao-Xiang Qi,
Zi-Huang Cao,
Huan-Yuan Shan,
Lin Nie,
Lei Wang
, et al. (4 additional authors not shown)
Abstract:
The Chinese Space Station Survey Telescope (CSST) is a flagship space-based observatory. Its main survey camera is designed to conduct high spatial resolution near-ultraviolet to near-infrared imaging and low-resolution spectroscopic surveys. To maximize the scientific output of CSST, we have developed a comprehensive, high-fidelity simulation pipeline for reproducing both imaging and spectroscopic observations. This paper presents an overview of the simulation framework, detailing its implementation and components. Built upon the GalSim package and incorporating the latest CSST instrumental specifications, our pipeline generates pixel-level mock observations that closely replicate the expected instrumental and observational conditions. The simulation suite integrates realistic astrophysical object catalogs, instrumental effects, point spread function (PSF) modeling, and observational noise to produce accurate synthetic data. We describe the key processing stages of the simulation, from constructing the input object catalogs to modeling the telescope optics and detector responses. Furthermore, we introduce the most recent release of simulated datasets, which provides a crucial testbed for data processing pipeline development, calibration strategies, and scientific analyses, ensuring that CSST will meet its stringent requirements. Our pipeline serves as a vital tool for optimizing CSST main survey strategies and ensuring robust cosmological measurements.
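Since the pipeline is stated to be built on GalSim, the snippet below shows the kind of pixel-level rendering step such a framework performs: an analytic galaxy convolved with a PSF, drawn onto a pixel grid with noise. The profile parameters, pixel scale, and noise level are placeholders, not the CSST instrument model described in the paper.

```python
import galsim

rng = galsim.BaseDeviate(1234)
galaxy = galsim.Sersic(n=1.5, half_light_radius=0.6, flux=1.0e4)   # arcsec / ADU, illustrative
psf = galsim.Gaussian(fwhm=0.15)                                   # placeholder PSF
obj = galsim.Convolve([galaxy, psf])

image = obj.drawImage(nx=64, ny=64, scale=0.074)                   # example pixel scale in arcsec
image.addNoise(galsim.GaussianNoise(rng, sigma=5.0))               # simple read-noise-like term
print("peak pixel value:", image.array.max())
```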
Submitted 10 November, 2025;
originally announced November 2025.