-
Counting tight Hamilton cycles in Dirac hypergraphs
Authors:
Felix Joos,
Xinyue Xie
Abstract:
Suppose $G$ is a $k$-uniform hypergraph on $n$ vertices such that every $(k-1)$-subset $S$ of $V(G)$ belongs to at least $\delta n$ edges, where $\delta > 1/2$. Let $\Psi(G)$ denote the number of tight Hamilton cycles in $G$, that is, cyclic orderings of $V(G)$ in which every $k$ consecutive vertices form an edge. We prove that $\log\Psi(G)\ge kh(G)-n\log{n\choose k-1}+n\log n-n\log e-o(n)$, where $h(G)$ is the hypergraph entropy of $G$, defined via perfect fractional matchings. This bound is tight, for example, for all (nearly) regular hypergraphs, in particular for the binomial random hypergraph. It also implies a conjecture by Ferber, Hardiman and Mond, stating that $\Psi(G)\ge (\delta-o(1))^n n!$.
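The final conjecture can be sanity-checked numerically: by Stirling's approximation, $\log(\delta^n n!) \approx n\log n - n\log e + n\log\delta$, which matches the leading terms of the main bound. A small illustration (the values of $n$ and $\delta$ below are arbitrary):

```python
import math

def fhm_log_bound(n, delta):
    # log(delta^n * n!) = n*log(delta) + log Gamma(n+1)
    return n * math.log(delta) + math.lgamma(n + 1)

# Stirling: log n! = n log n - n + O(log n), so the conjectured bound
# equals n log n - n log e + n log(delta) up to lower-order terms.
n, delta = 1000, 0.6
exact = fhm_log_bound(n, delta)
stirling = n * math.log(n) - n + n * math.log(delta)
```

The two quantities differ only by the $O(\log n)$ Stirling correction, absorbed by the $o(n)$ term in the theorem.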
Submitted 16 April, 2026;
originally announced April 2026.
-
Theta-regularized Kriging: Modelling and Algorithms
Authors:
Xuelin Xie,
Xiliang Lu
Abstract:
To obtain more accurate model parameters and improve prediction accuracy, we propose a regularized Kriging model that penalizes the hyperparameter theta in the Gaussian stochastic process, termed Theta-regularized Kriging. We derive the optimization problem for this model from a maximum likelihood perspective and present implementation details for the iterative process, including the regularized optimization algorithm and the geometric search cross-validation tuning algorithm. Three penalty methods, Lasso, Ridge, and Elastic-net regularization, are considered. The proposed Theta-regularized Kriging models are tested on nine common numerical functions and two practical engineering examples. The results demonstrate that, compared with other penalized Kriging models, the proposed model performs better in terms of accuracy and stability.
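A rough sketch of the idea, assuming the usual zero-mean Gaussian-correlation Kriging model with the process variance profiled out; the `lam` value, nugget, and Ridge form of the penalty below are illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def gauss_kernel(X, theta):
    # Anisotropic Gaussian correlation: R_ij = exp(-sum_k theta_k (x_ik - x_jk)^2)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2 * theta).sum(axis=-1)
    return np.exp(-d2)

def penalized_nll(theta, X, y, lam=1e-2, nugget=1e-6):
    """Concentrated negative log-likelihood of a zero-mean GP plus a Ridge
    penalty on the length-scale vector theta. Swap in lam*np.abs(theta).sum()
    for a Lasso penalty, or a weighted mix of the two for Elastic-net."""
    n = len(y)
    R = gauss_kernel(X, theta) + nugget * np.eye(n)
    L = np.linalg.cholesky(R)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    sigma2 = float(y @ alpha) / n            # profiled-out process variance
    logdet = 2.0 * np.log(np.diag(L)).sum()
    return 0.5 * (n * np.log(sigma2) + logdet) + lam * float(theta @ theta)
```

In practice theta would then be chosen by numerically minimizing `penalized_nll`, with `lam` tuned by cross-validation.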
Submitted 16 April, 2026;
originally announced April 2026.
-
SGA-MCTS: Decoupling Planning from Execution via Training-Free Atomic Experience Retrieval
Authors:
Xin Xie,
Dongyun Xue,
Wuguannan Yao,
Mingxiao Feng,
Wengang Zhou,
Xiang Qi,
Houqiang Li,
Peng Zhang
Abstract:
LLM-powered systems require complex multi-step decision-making abilities to solve real-world tasks, yet current planning approaches face a trade-off between the high latency of inference-time search and the limited generalization of supervised fine-tuning. To address this limitation, we introduce \textbf{SGA-MCTS}, a framework that casts LLM planning as non-parametric retrieval. Offline, we leverage Monte Carlo Tree Search (MCTS) to explore the solution space and distill high-fidelity trajectories into State-Goal-Action (SGA) atoms. These atoms are de-lexicalized primitives that abstract concrete entities into symbolic slots, preserving reusable causal logic while discarding domain-specific noise. Online, a retrieval-augmented agent employs a hybrid symbolic-semantic mechanism to fetch relevant SGAs and re-ground them into the current context as soft reasoning hints. Empirical results on complex benchmarks demonstrate that this paradigm enables frozen, open-weights models to match the performance of SOTA systems (e.g., GPT-5) without task-specific fine-tuning. By effectively amortizing the heavy computational cost of search, SGA-MCTS achieves System 2 reasoning depth at System 1 inference speeds, rendering autonomous planning both scalable and feasible in real time.
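A toy illustration of the de-lexicalize/retrieve/re-ground loop. The atom store, slot names, and exact-match lookup below are invented for illustration; the paper's retrieval is a hybrid symbolic-semantic mechanism, not a dictionary lookup:

```python
# Hypothetical SGA atom store: (state pattern, goal pattern) -> abstract action.
ATOMS = {
    ("holding(<OBJ>)", "on(<OBJ>, <DEST>)"): "place <OBJ> on <DEST>",
    ("at(<LOC>)", "holding(<OBJ>)"): "pick up <OBJ> at <LOC>",
}

def delexicalize(fact, entities):
    # Abstract concrete entity names into symbolic slots (assumes names
    # do not collide as substrings; real systems need proper tokenization).
    for name, slot in entities.items():
        fact = fact.replace(name, slot)
    return fact

def retrieve_action(state, goal, entities):
    key = (delexicalize(state, entities), delexicalize(goal, entities))
    template = ATOMS.get(key)
    if template is None:
        return None
    # Re-ground: substitute the concrete entities back into the slots.
    action = template
    for name, slot in entities.items():
        action = action.replace(slot, name)
    return action
```

The point of the de-lexicalization step is that one stored atom serves every concrete (mug, shelf, kitchen, ...) instantiation of the same causal pattern.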
Submitted 16 April, 2026;
originally announced April 2026.
-
TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration
Authors:
Zerun Ma,
Guoqiang Wang,
Xinchen Xie,
Yicheng Chen,
He Du,
Bowen Li,
Yanan Sun,
Wenran Liu,
Kai Chen,
Yining Li
Abstract:
While Large Language Models (LLMs) have empowered AI research agents to perform isolated scientific tasks, automating complex, real-world workflows, such as LLM training, remains a significant challenge. In this paper, we introduce TREX, a multi-agent system that automates the entire LLM training life-cycle. By orchestrating collaboration between two core modules, the Researcher and the Executor, the system seamlessly performs requirement analysis, open-domain literature and data research, formulation of training strategies, preparation of data recipes, and model training and evaluation. The multi-round experimental process is modeled as a search tree, enabling the system to efficiently plan exploration paths, reuse historical results, and distill high-level insights from iterative trials. To evaluate the capability of automated LLM training, we construct FT-Bench, a benchmark comprising 10 tasks derived from real-world scenarios, ranging from optimizing fundamental model capabilities to enhancing performance on domain-specific tasks. Experimental results demonstrate that the TREX agent consistently optimizes model performance on target tasks.
Submitted 15 April, 2026;
originally announced April 2026.
-
Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection
Authors:
Zhen Liu,
Xinyu Ning,
Zhe Hu,
Xinxin Xie,
Weize Li,
Zhipeng Tang,
Chongyu Wang,
Zejun Yang,
Hanlin Wang,
Yitong Liu,
Zhongzhu Pu
Abstract:
Recent vision-language-action (VLA) systems have demonstrated strong capabilities in embodied manipulation. However, most existing VLA policies rely on limited observation windows and end-to-end action prediction, which makes them brittle in long-horizon, memory-dependent tasks with partial observability, occlusions, and multi-stage dependencies. Such tasks require not only precise visuomotor control, but also persistent memory, adaptive task decomposition, and explicit recovery from execution failures. To address these limitations, we propose a dual-system framework for long-horizon embodied manipulation.
Our framework explicitly separates high-level semantic reasoning from low-level motor execution. A high-level planner, implemented as a VLM-based agentic module, maintains structured task memory and performs goal decomposition, outcome verification, and error-driven correction. A low-level executor, instantiated as a VLA-based visuomotor controller, carries out each sub-task through diffusion-based action generation conditioned on geometry-preserving filtered observations. Together, the two systems form a closed loop between planning and execution, enabling memory-aware reasoning, adaptive replanning, and robust online recovery. Experiments on representative RMBench tasks show that the proposed framework substantially outperforms representative baselines, achieving a 32.4% average success rate compared with 9.8% for the strongest baseline. Ablation studies further confirm the importance of structured memory and closed-loop recovery for long-horizon manipulation.
Submitted 15 April, 2026;
originally announced April 2026.
-
On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation
Authors:
Jia Feng,
Zhanyue Qin,
Cuiyun Gao,
Ruiqi Wang,
Chaozheng Wang,
Yingwei Ma,
Xiaoyuan Xie
Abstract:
Repository-level code intelligence tasks require large language models (LLMs) to process long, multi-file contexts. Such inputs introduce three challenges: crucial context can be obscured by noise, inputs can be truncated by limited context windows, and inference latency increases. Context compression mitigates these risks by condensing inputs. While studied in NLP, its applicability to code tasks remains largely unexplored. We present the first systematic empirical study of context compression for repository-level code intelligence, organizing eight methods into three paradigms: discrete token sequences, continuous latent vectors, and visual tokens. We evaluate them on code completion and generation, measuring performance and efficiency. Results show context compression is effective: at 4x compression, continuous latent vector methods surpass full-context performance by up to 28.3% in BLEU score, indicating they filter noise rather than just truncating. On efficiency, all paradigms reduce inference cost. Both visual and text-based compression achieve up to 50% reduction in end-to-end latency at high ratios, approaching the cost of inference without repository context. These findings establish context compression as a viable approach and provide guidance for paradigm selection.
Submitted 15 April, 2026;
originally announced April 2026.
-
WebMAC: A Multi-Agent Collaborative Framework for Scenario Testing of Web Systems
Authors:
Zhenyu Wan,
Gong Chen,
Qing Huang,
Xiaoyuan Xie
Abstract:
Scenario testing is an important technique for detecting errors in web systems. Testers draft test scenarios and convert them into test scripts for execution. Early methods relied on testers to convert test scenarios into test scripts. Recent LLM-based scenario testing methods can generate test scripts from natural language descriptions of test scenarios. However, these methods are not only limited by the incompleteness of descriptions but also overlook test adequacy criteria, making it difficult to detect potential errors. To address these limitations, this paper proposes WebMAC, a multi-agent collaborative framework for scenario testing of web systems. WebMAC can complete natural language descriptions of test scenarios through interactive clarification and transform them into adequate instantiated test scenarios via equivalence class partitioning. WebMAC consists of three multi-agent modules, respectively responsible for completing natural language descriptions of test scenarios, transforming test scenarios, and converting test scenarios into test scripts. We evaluated WebMAC on four web systems. Compared with the SOTA method, WebMAC improves the execution success rate of generated test scripts by 30%-60%, increases testing efficiency by 29%, and reduces token consumption by 47.6%. Furthermore, WebMAC can effectively detect more errors in web systems.
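For a numeric input field, equivalence class partitioning reduces to picking one representative value per class; a minimal sketch (the class names and boundary choices are illustrative, not WebMAC's actual scheme):

```python
def partition_range(lo, hi):
    """Representative test inputs for an integer field constrained to [lo, hi]:
    one value per equivalence class, covering both invalid classes and the
    boundaries of the valid class."""
    return {
        "invalid_low": lo - 1,       # just below the valid range
        "lower_boundary": lo,        # smallest valid value
        "interior": (lo + hi) // 2,  # a typical valid value
        "upper_boundary": hi,        # largest valid value
        "invalid_high": hi + 1,      # just above the valid range
    }
```

Each scenario instantiated from one representative per class stands in for the whole class, which is what makes the resulting test set "adequate" without enumerating every input.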
Submitted 15 April, 2026;
originally announced April 2026.
-
From Exploration to Specification: LLM-Based Property Generation for Mobile App Testing
Authors:
Yiheng Xiong,
Shiwen Song,
Bo Ma,
Ting Su,
Xiaofei Xie
Abstract:
Mobile apps often suffer from functional bugs that do not cause crashes but instead manifest as incorrect behaviors under specific user interactions. Such bugs are difficult to detect automatically because they often lack explicit test oracles. Property-based testing can effectively expose them by checking intended behavioral properties under diverse interactions. However, its use largely depends on manually written properties, whose construction is difficult and expensive, limiting its practical use for mobile apps.
To address this limitation, we propose PropGen, an automated approach for generating properties for Android apps. However, this task is challenging for two reasons: app functionalities are often hard to systematically uncover and execute, and properties are difficult to derive accurately from observed behaviors. To this end, PropGen performs functionality-guided exploration to collect behavioral evidence from app executions, synthesizes properties from the collected evidence, and refines imprecise properties based on testing feedback. We implemented PropGen and evaluated it on 12 real-world Android apps. The results show that PropGen can effectively identify and execute valid app functionalities, generate valid properties, and repair most imprecise ones. Across all apps, PropGen identified 1,210 valid functionalities and correctly executed 977 of them, compared with 491 and 187 for the baseline. It generated 985 properties, 912 of which were valid, and repaired 118 of 127 imprecise ones exposed during testing. With the resulting properties, we found 25 previously unknown functional bugs in the latest versions of the subject apps, many of which were missed by existing functional testing techniques.
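The kind of behavioral property PropGen targets can be expressed as an executable check over app interactions. A minimal hand-written sketch with a toy app (PropGen generates such properties automatically; `TodoApp` and the property here are invented for illustration):

```python
import random

class TodoApp:
    """Toy stand-in for an app under test."""
    def __init__(self):
        self.items = []
    def add(self, text):
        self.items.append(text)
    def delete(self, idx):
        if 0 <= idx < len(self.items):
            del self.items[idx]

def prop_add_increases_count(app, text):
    # Property: adding an item always increases the item count by one.
    before = len(app.items)
    app.add(text)
    return len(app.items) == before + 1

def check_property(prop, trials=100, seed=0):
    """Drive the app through randomized interactions; the property must
    hold in every reachable state, not just the initial one."""
    rng = random.Random(seed)
    app = TodoApp()
    for i in range(trials):
        if rng.random() < 0.3:
            app.delete(rng.randrange(10))   # random perturbation of app state
        if not prop(app, f"task-{i}"):
            return False
    return True
```

The oracle is the property itself, so no expected output has to be written per test case, which is exactly what makes this approach suited to non-crashing functional bugs.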
Submitted 15 April, 2026;
originally announced April 2026.
-
Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations
Authors:
Tong Zhang,
Jiangning Zhang,
Zhucun Xue,
Juntao Jiang,
Yicheng Xu,
Chengming Xu,
Teng Hu,
Xingyu Xie,
Xiaobin Hu,
Yabiao Wang,
Yong Liu,
Shuicheng Yan
Abstract:
Balancing convergence speed, generalization capability, and computational efficiency remains a core challenge in deep learning optimization. First-order gradient descent methods, epitomized by stochastic gradient descent (SGD) and Adam, serve as the cornerstone of modern training pipelines. However, large-scale model training, stringent differential privacy requirements, and distributed learning paradigms expose critical limitations in these conventional approaches regarding privacy protection and memory efficiency. To mitigate these bottlenecks, researchers explore second-order optimization techniques to surpass first-order performance ceilings, while zeroth-order methods reemerge to alleviate memory constraints inherent to large-scale training. Despite this proliferation of methodologies, the field lacks a cohesive framework that unifies underlying principles and delineates application scenarios for these disparate approaches. In this work, we retrospectively analyze the evolutionary trajectory of deep learning optimization algorithms and present a comprehensive empirical evaluation of mainstream optimizers across diverse model architectures and training scenarios. We distill key emerging trends and fundamental design trade-offs, pinpointing promising directions for future research. By synthesizing theoretical insights with extensive empirical evidence, we provide actionable guidance for designing next-generation highly efficient, robust, and trustworthy optimization methods. The code is available at https://github.com/APRIL-AIGC/Awesome-Optimizer.
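For reference, the first-order workhorse the survey centers on: one Adam step, following Kingma and Ba's update rule, fits in a few lines (the quadratic toy objective below is just a demonstration):

```python
import numpy as np

def adam_step(p, g, m, v, t, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected moment estimates."""
    m = b1 * m + (1 - b1) * g        # first moment: running mean of gradients
    v = b2 * v + (1 - b2) * g * g    # second moment: running mean of squared gradients
    m_hat = m / (1 - b1 ** t)        # correct the bias from zero initialization
    v_hat = v / (1 - b2 ** t)
    return p - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(x) = x^2 (gradient 2x) starting from x = 1.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 301):
    x, m, v = adam_step(x, 2 * x, m, v, t)
```

The per-coordinate `sqrt(v_hat)` scaling is what distinguishes Adam from plain SGD, and also what the zeroth- and second-order methods discussed above trade against memory and privacy constraints.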
Submitted 14 April, 2026;
originally announced April 2026.
-
Can Persona-Prompted LLMs Emulate Subgroup Values? An Empirical Analysis of Generalisability and Fairness in Cultural Alignment
Authors:
Bryan Chen Zhengyu Tan,
Zhengyuan Liu,
Xiaoyuan Yi,
Jing Yao,
Xing Xie,
Nancy F. Chen,
Roy Ka-Wei Lee
Abstract:
Despite their global prevalence, many Large Language Models (LLMs) are aligned to a monolithic, often Western-centric set of values. This paper investigates the more challenging task of fine-grained value alignment: examining whether LLMs can emulate the distinct cultural values of demographic subgroups. Using Singapore as a case study and the World Values Survey (WVS), we examine the value landscape and show that even state-of-the-art models like GPT-4.1 achieve only 57.4% accuracy in predicting subgroup modal preferences. We construct a dataset of over 20,000 samples to train and evaluate a range of models. We demonstrate that simple fine-tuning on structured numerical preferences yields substantial gains, improving accuracy on unseen, out-of-distribution subgroups by an average of 17.4%. These gains partially transfer to open-ended generation. However, we find significant pre-existing performance biases, where models better emulate young, male, Chinese, and Christian personas. Furthermore, while fine-tuning improves average performance, it widens the disparity between subgroups when measured by distance-aware metrics. Our work offers insights into the limits and fairness implications of subgroup-level cultural alignment.
Submitted 14 April, 2026;
originally announced April 2026.
-
Spatial-Spectral Adaptive Fidelity and Noise Prior Reduction Guided Hyperspectral Image Denoising
Authors:
Xuelin Xie,
Xiliang Lu,
Zhengshan Wang,
Yang Zhang,
Long Chen
Abstract:
The core challenge of hyperspectral image (HSI) denoising is striking the right balance between data fidelity and noise prior modeling. Most existing methods place too much emphasis on the intrinsic priors of the image while overlooking diverse noise assumptions and the dynamic trade-off between fidelity and priors. To address these issues, we propose a denoising framework that integrates noise prior reduction and a spatial-spectral adaptive fidelity term. This framework considers comprehensive noise priors with fewer parameters and introduces an adaptive weight tensor to dynamically balance the fidelity and prior regularization terms. Within this framework, we further develop a fast and robust pixel-wise model combined with the representative coefficient total variation regularizer to accurately remove mixed noise in HSIs. The proposed method not only efficiently handles various types of noise but also accurately captures the spectral low-rank structure and local smoothness of HSIs. An efficient optimization algorithm based on the alternating direction method of multipliers is designed to ensure stable and fast convergence. Extensive experiments on simulated and real-world datasets demonstrate that the proposed model achieves superior denoising performance while maintaining competitive computational efficiency.
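The optimization backbone here is generic ADMM. A minimal sketch of the splitting pattern on the simplest instance, a Lasso problem rather than the paper's HSI model (all constants below are illustrative):

```python
import numpy as np

def admm_lasso(A, b, lam=0.1, rho=1.0, iters=1000):
    """min_x 0.5*||Ax - b||^2 + lam*||x||_1 via ADMM with the splitting x = z."""
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    AtA, Atb = A.T @ A, A.T @ b
    M = np.linalg.inv(AtA + rho * np.eye(n))   # factor once, reuse every iteration
    for _ in range(iters):
        x = M @ (Atb + rho * (z - u))          # quadratic (data-fidelity) x-update
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)  # soft-threshold
        u = u + x - z                          # dual update on the constraint x = z
    return z
```

The same alternation, a fidelity solve, a proximal step for the regularizer, and a dual update, carries over when the regularizer is a total-variation term instead of the l1 norm.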
Submitted 14 April, 2026;
originally announced April 2026.
-
FRTSearch: Unified Detection and Parameter Inference of Fast Radio Transients using Instance Segmentation
Authors:
Bin Zhang,
Yabiao Wang,
Xiaoyao Xie,
Shanping You,
Xuhong Yu,
Qiuhua Li,
Hongwei Li,
Shaowen Du,
Chenchen Miao,
Dengke Zhou,
Jianhua Fang,
Jiafu Wu,
Pei Wang,
Di Li
Abstract:
The exponential growth of data from modern radio telescopes presents a significant challenge to traditional single-pulse search algorithms, which are computationally intensive and prone to high false-positive rates due to Radio Frequency Interference (RFI). In this work, we introduce FRTSearch, an end-to-end framework unifying the detection and physical characterization of Fast Radio Transients (FRTs). Leveraging the morphological universality of dispersive trajectories in time-frequency dynamic spectra, we reframe FRT detection as a pattern recognition problem governed by the cold plasma dispersion relation. To facilitate this, we constructed CRAFTS-FRT, a pixel-level annotated dataset derived from the Commensal Radio Astronomy FAST Survey (CRAFTS), comprising 2,392 instances across diverse source classes. This dataset enables the training of a Mask R-CNN model for precise trajectory segmentation. Coupled with our physics-driven IMPIC algorithm, the framework maps the geometric coordinates of segmented trajectories to directly infer the Dispersion Measure (DM) and Time of Arrival (ToA). Benchmarking on the FAST-FREX dataset shows that FRTSearch achieves a 98.0\% recall, competitive with exhaustive search methods, while reducing false positives by over 99.9\% compared to PRESTO and delivering a processing speedup of up to $13.9\times$. Furthermore, the framework demonstrates robust cross-facility generalization, detecting all 19 tested FRBs from the ASKAP survey without retraining. By shifting the paradigm from "search-then-identify" to "detect-and-infer," FRTSearch provides a scalable, high-precision solution for real-time discovery in the era of petabyte-scale radio astronomy.
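The cold plasma dispersion relation that shapes those trajectories gives an arrival-time delay quadratic in inverse frequency, which is what allows trajectory geometry to be mapped back to DM and ToA. A minimal version of that mapping (IMPIC itself does more, e.g. fitting the full segmented curve):

```python
K_DM = 4.148808e3  # cold-plasma dispersion constant, MHz^2 pc^-1 cm^3 s

def dispersion_delay_s(dm, f_lo_mhz, f_hi_mhz):
    """Arrival-time delay (seconds) at f_lo relative to f_hi for a given DM
    (pc cm^-3), from delta_t = K_DM * DM * (f_lo^-2 - f_hi^-2)."""
    return K_DM * dm * (f_lo_mhz ** -2 - f_hi_mhz ** -2)

def dm_from_delay(delay_s, f_lo_mhz, f_hi_mhz):
    """Invert the relation: infer DM from a delay measured between two
    frequencies on a segmented trajectory."""
    return delay_s / (K_DM * (f_lo_mhz ** -2 - f_hi_mhz ** -2))
```

For example, a DM of 500 pc cm^-3 sweeps a pulse across a 1000-1500 MHz band in roughly 1.15 s, the characteristic curved track the segmentation model learns to detect.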
Submitted 14 April, 2026;
originally announced April 2026.
-
Delving Aleatoric Uncertainty in Medical Image Segmentation via Vision Foundation Models
Authors:
Ruiyang Li,
Fang Liu,
Licheng Jiao,
Xinglin Xie,
Jiayao Hao,
Shuo Li,
Xu Liu,
Jingyi Yang,
Lingling Li,
Puhua Chen,
Wenping Ma
Abstract:
Medical image segmentation supports clinical workflows by precisely delineating anatomical structures and lesions. However, medical image datasets suffer from acquisition noise and annotation ambiguity, causing pervasive data uncertainty that substantially undermines model robustness. Existing research focuses primarily on model architectural improvements and predictive reliability estimation, while systematic exploration of the intrinsic data uncertainty remains insufficient. To address this gap, this work proposes leveraging the universal representation capabilities of visual foundation models to estimate inherent data uncertainty. Specifically, we analyze the feature diversity of the model's decoded representations and quantify their singular value energy to define the semantic perception scale for each class, thereby measuring sample difficulty and aleatoric uncertainty. Based on this foundation, we design two uncertainty-driven application strategies: (1) the aleatoric uncertainty-aware data filtering mechanism to eliminate potentially noisy samples and enhance model learning quality; (2) the dynamic uncertainty-aware optimization strategy that adaptively adjusts class-specific loss weights during training based on the semantic perception scale, combined with a label denoising mechanism to improve training stability. Experimental results on five public datasets encompassing CT and MRI modalities and involving multi-organ and tumor segmentation tasks demonstrate that our method achieves significant and robust performance improvements across various mainstream network architectures, revealing the broad application potential of aleatoric uncertainty in medical image understanding and segmentation tasks.
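One plausible instantiation of "singular value energy" as an effective-rank score over a class's decoded features, a sketch under stated assumptions rather than the paper's exact recipe (the 0.95 energy threshold is an illustrative choice):

```python
import numpy as np

def semantic_perception_scale(features, energy_frac=0.95):
    """Effective rank of a (samples x dims) feature matrix: the number of
    singular values needed to capture `energy_frac` of the total squared
    singular-value energy. Assumes `features` is nonzero. A diverse,
    hard-to-segment class needs more directions, hence a larger scale."""
    s = np.linalg.svd(features, compute_uv=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(energy, energy_frac) + 1)
```

A class whose features all lie along one direction gets scale 1; a class with spread-out features gets a larger scale, which could then drive the filtering and loss-weighting strategies described above.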
Submitted 12 April, 2026;
originally announced April 2026.
-
On quadratic binomial vectorial functions with maximal bent components
Authors:
Xianhong Xie,
Yi Ouyang,
Shenxing Zhang
Abstract:
Assume $n=2m\geq 2$ and let $F(x)=x^{d_1}+x^{d_2}$ be a binomial vectorial function over $\F_{2^n}$ possessing the maximal number (i.e. $2^n-2^m$) of bent components. Suppose that the $2$-adic Hamming weights $\wt_2(d_1)$ and $\wt_2(d_2)$ are both at most $2$. We prove that $F(x)$ is affine equivalent to either $x^{2^m+1}$ or $x^{2^i}(x+x^{2^m})$, provided that \[ \ell(n):=\min_{\gamma:~\F_2(\gamma)=\F_{2^n}} \dim_{\F_2}\F_2[\sigma]\gamma>m, \] where $\sigma$ is the Frobenius map $(x\mapsto x^2)$ on $\F_{2^n}$, and $\gcd(d_1,d_2,2^m-1)>1$. Under this condition, we also establish two bounds on the nonlinearity and the differential uniformity of $F$ by means of the cardinality of its image set.
Submitted 9 April, 2026;
originally announced April 2026.
-
Rotation Equivariant Convolutions in Deformable Registration of Brain MRI
Authors:
Arghavan Rezvani,
Kun Han,
Anthony T. Wu,
Pooya Khosravi,
Xiaohui Xie
Abstract:
Image registration is a fundamental task that aligns anatomical structures between images. While CNNs perform well, they lack rotation equivariance - a rotated input does not produce a correspondingly rotated output. This hinders performance by failing to exploit the rotational symmetries inherent in anatomical structures, particularly in brain MRI. In this work, we integrate rotation-equivariant convolutions into deformable brain MRI registration networks. We evaluate this approach by replacing standard encoders with equivariant ones in three baseline architectures, testing on multiple public brain MRI datasets.
Our experiments demonstrate that equivariant encoders have three key advantages: 1) They achieve higher registration accuracy while reducing network parameters, confirming the benefit of this anatomical inductive bias. 2) They outperform baselines on rotated input pairs, demonstrating robustness to orientation variations common in clinical practice. 3) They show improved performance with less training data, indicating greater sample efficiency. Our results demonstrate that incorporating geometric priors is a critical step toward building more robust, accurate, and efficient registration models.
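The core idea of a rotation-equivariant ("lifting") convolution can be shown in a few lines of numpy for the C4 group of 90-degree rotations; this is a didactic sketch, whereas production models use dedicated steerable-CNN libraries and handle continuous rotations:

```python
import numpy as np

def corr2d(x, k):
    # Plain valid cross-correlation of a 2-D image with a 2-D kernel.
    H, W = x.shape
    h, w = k.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + h, j:j + w] * k).sum()
    return out

def c4_lift_conv(x, k):
    """Lift one filter to the C4 group: one output channel per 90-degree
    rotation of the kernel. Rotating the input then rotates every channel
    and cyclically permutes the channel index -- the equivariance property
    a standard CNN lacks."""
    return np.stack([corr2d(x, np.rot90(k, r)) for r in range(4)])
```

Because the four channels share one set of weights, the layer also has a quarter of the parameters a four-filter standard layer would need, which is the parameter reduction the experiments above observe.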
Submitted 9 April, 2026;
originally announced April 2026.
-
Benchmarking Deep Learning for Future Liver Remnant Segmentation in Colorectal Liver Metastasis
Authors:
Anthony T. Wu,
Arghavan Rezvani,
Kela Liu,
Roozbeh Houshyar,
Pooya Khosravi,
Whitney Li,
Xiaohui Xie
Abstract:
Accurate segmentation of the future liver remnant (FLR) is critical for surgical planning in colorectal liver metastases (CRLM) to prevent fatal post-hepatectomy liver failure. However, this segmentation task is technically challenging due to complex resection boundaries, convoluted hepatic vasculature and diffuse metastatic lesions. A primary bottleneck in developing automated AI tools has been the lack of high-fidelity, validated data. We address this gap by manually refining all 197 volumes from the public CRLM-CT-Seg dataset, creating the first open-source, validated benchmark for this task. We then establish the first segmentation baselines, comparing cascaded (Liver->CRLM->FLR) and end-to-end (E2E) strategies using nnU-Net, SwinUNETR, and STU-Net. We find a cascaded nnU-Net achieves the best final FLR segmentation Dice (0.767), while the pretrained STU-Net provides superior CRLM segmentation (0.620 Dice) and is significantly more robust to cascaded errors. This work provides the first validated benchmark and a reproducible framework to accelerate research in AI-assisted surgical planning.
Submitted 9 April, 2026;
originally announced April 2026.
-
Controllable Chirality Sorting of Particles via Topological Optical Quasiparticles
Authors:
Hao Zhang,
Xi Xie,
Yijie Shen
Abstract:
The manipulation and sorting of chiral nanoparticles are of fundamental importance in multidisciplinary fields ranging from biochemistry to nanophotonics. In this study, we propose a novel and controllable chirality sorting mechanism for continuous particle separation using focused topological optical quasiparticles. Specifically, we investigate the sorting dynamics driven by tightly focused optical skyrmions and bimerons consisting of tailored spatial modes. By tightly focusing free-space topological structured light fields, we generate intricate non-paraxial focal fields with tailored intensity and topological polarization textures. The sorting dynamics are systematically evaluated under the dipole approximation for fused silica nanoparticles. Our analytical calculations demonstrate that optical forces exert opposite directional pushes on particles of opposite chiralities, enabling highly efficient spatial separation. Notably, we demonstrate that this sorting process is controllable; by tuning the topological charges, the sorting distance can be flexibly tailored and expanded. The dynamic sorting process in customized topological structures introduces a promising new paradigm for tunable, wide-range chirality sorting of micro- and nano-particles.
Submitted 8 April, 2026;
originally announced April 2026.
-
Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook
Authors:
Jaehyeok Lee,
Xiaoyuan Yi,
Jing Yao,
Hyunjin Hwang,
Roy Ka-Wei Lee,
Xing Xie,
JinYeong Bak
Abstract:
As LLMs are globally deployed, aligning their cultural value orientations is critical for safety and user engagement. However, existing benchmarks face the Construct-Composition-Context ($C^3$) challenge: they rely on discriminative, multiple-choice formats that probe value knowledge rather than true orientations, overlook subcultural heterogeneity, and mismatch real-world open-ended generation. We introduce DOVE, a distributional evaluation framework that directly compares human-written text distributions with LLM-generated outputs. DOVE utilizes a rate-distortion variational optimization objective to construct a compact value codebook from 10K documents, mapping text into a structured value space to filter semantic noise. Alignment is measured using unbalanced optimal transport, capturing intra-cultural distributional structures and sub-group diversity. Experiments across 12 LLMs show that DOVE achieves superior predictive validity, attaining a 31.56% correlation with downstream tasks, while maintaining high reliability with as few as 500 samples per culture.
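DOVE's unbalanced optimal transport is more involved than an abstract can convey, but the underlying idea of comparing two empirical distributions over a discrete value codebook can be illustrated with a toy 1-D earth mover's distance. This is an illustrative stand-in, not DOVE's actual objective:

```python
def emd_1d(p, q):
    """Earth mover's distance between two histograms over the same
    ordered bins (a toy stand-in for optimal-transport alignment of
    value distributions). Assumes p and q each sum to 1; the cost is
    the accumulated mass that must be carried across bin boundaries.
    """
    assert abs(sum(p) - 1) < 1e-9 and abs(sum(q) - 1) < 1e-9
    dist, carry = 0.0, 0.0
    for pi, qi in zip(p, q):
        carry += pi - qi      # surplus mass carried to the next bin
        dist += abs(carry)    # cost of moving it one bin over
    return dist

# all mass must travel two bins: distance 2.0
print(emd_1d([1, 0, 0], [0, 0, 1]))  # 2.0
```

Unbalanced OT generalizes this by also allowing mass creation/destruction at a penalty, which is what lets DOVE compare distributions of unequal total weight across sub-groups.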
Submitted 8 April, 2026; v1 submitted 16 March, 2026;
originally announced April 2026.
-
Human Values Matter: Investigating How Misalignment Shapes Collective Behaviors in LLM Agent Communities
Authors:
Xiangxu Zhang,
Jiamin Wang,
Qinlin Zhao,
Hanze Guo,
Linzhuo Li,
Jing Yao,
Xiao Zhou,
Xiaoyuan Yi,
Xing Xie
Abstract:
As LLMs become increasingly integrated into human society, evaluating their orientations on human values from social science has drawn growing attention. Nevertheless, it is still unclear why human values matter for LLMs, especially in LLM-based multi-agent systems, where group-level failures may accumulate from individually misaligned actions. We ask whether misalignment with human values alters the collective behavior of LLM agents and, if so, what changes it induces. In this work, we introduce CIVA, a controlled multi-agent environment grounded in social science theories, where LLM agents form a community and autonomously communicate, explore, and compete for resources, enabling systematic manipulation of value prevalence and behavioral analysis. Through comprehensive simulation experiments, we reveal three key findings. (1) We identify several structurally critical values that substantially shape the community's collective dynamics, including those diverging from LLMs' original orientations. Triggered by the misspecification of these values, we (2) detect system failure modes, e.g., catastrophic collapse, at the macro level, and (3) observe emergent behaviors like deception and power-seeking at the micro level. These results offer quantitative evidence that human values are essential for collective outcomes in LLMs and motivate future multi-agent value alignment.
Submitted 6 April, 2026;
originally announced April 2026.
-
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
Authors:
Chaoyou Fu,
Haozhi Yuan,
Yuhao Dong,
Yi-Fan Zhang,
Yunhang Shen,
Xiaoxing Hu,
Xueying Li,
Jinsen Su,
Chengwu Long,
Xiaoyao Xie,
Yongkang Xie,
Xiawu Zheng,
Xue Yang,
Haoyu Cao,
Yunsheng Wu,
Ziwei Liu,
Xing Sun,
Caifeng Shan,
Ran He
Abstract:
With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a \textbf{progressive tri-level hierarchy} that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. Moreover, in contrast to conventional per-question accuracy, we propose a \textbf{group-based non-linear evaluation} strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline, involving 12 annotators and 50 independent reviewers. Backed by \textbf{3,300 human-hours} and up to \textbf{5 rounds} of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.
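The all-or-nothing flavor of group-based evaluation can be sketched in a few lines. This is a simplified version: the benchmark's non-linear weighting and its reasoning-validity checks are not reproduced here:

```python
def group_score(results):
    """Group-based scoring: a group of related queries earns credit
    only when every query in it is answered correctly, so fragmented
    or guess-based correctness within a group scores zero.
    `results` is a list of groups, each a list of per-question booleans.
    """
    return sum(all(g) for g in results) / len(results)

# two groups of related questions: only the first is fully consistent,
# so per-question accuracy would be 0.75 but the group score is 0.5
print(group_score([[True, True], [True, False]]))  # 0.5
```

The gap between per-question accuracy (0.75) and group score (0.5) in the toy example is exactly the signal such schemes use to detect lucky guessing.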
Submitted 6 April, 2026;
originally announced April 2026.
-
SkillX: Automatically Constructing Skill Knowledge Bases for Agents
Authors:
Chenxi Wang,
Zhuoyun Yu,
Xin Xie,
Wuguannan Yao,
Runnan Fang,
Shuofei Qiao,
Kexin Cao,
Guozhou Zheng,
Xiang Qi,
Peng Zhang,
Shumin Deng
Abstract:
Learning from experience is critical for building capable large language model (LLM) agents, yet prevailing self-evolving paradigms remain inefficient: agents learn in isolation and repeatedly rediscover similar behaviors from limited experience, resulting in redundant exploration and poor generalization. To address this problem, we propose SkillX, a fully automated framework for constructing a \textbf{plug-and-play skill knowledge base} that can be reused across agents and environments. SkillX operates through a fully automated pipeline built on three synergistic innovations: \textit{(i) Multi-Level Skills Design}, which distills raw trajectories into a three-tiered hierarchy of strategic plans, functional skills, and atomic skills; \textit{(ii) Iterative Skills Refinement}, which automatically revises skills based on execution feedback to continuously improve library quality; and \textit{(iii) Exploratory Skills Expansion}, which proactively generates and validates novel skills to expand coverage beyond seed training data. Using a strong backbone agent (GLM-4.6), we automatically build a reusable skill library and evaluate its transferability on challenging long-horizon, user-interactive benchmarks, including AppWorld, BFCL-v3, and $τ^2$-Bench. Experiments show that SkillKB consistently improves task success and execution efficiency when plugged into weaker base agents, highlighting the importance of structured, hierarchical experience representations for generalizable agent learning. Our code will be publicly available soon at https://github.com/zjunlp/SkillX.
Submitted 6 April, 2026;
originally announced April 2026.
-
Position: Science of AI Evaluation Requires Item-level Benchmark Data
Authors:
Han Jiang,
Susu Zhang,
Xiaoyuan Yi,
Xing Xie,
Ziang Xiao
Abstract:
AI evaluations have become the primary evidence for deploying generative AI systems across high-stakes domains. However, current evaluation paradigms often exhibit systemic validity failures. These issues, ranging from unjustified design choices to misaligned metrics, remain intractable without a principled framework for gathering validity evidence and conducting granular diagnostic analysis. In this position paper, we argue that item-level AI benchmark data is essential for establishing a rigorous science of AI evaluation. Item-level analysis enables fine-grained diagnostics and principled validation of benchmarks. We substantiate this position by dissecting current validity failures and revisiting evaluation paradigms across computer science and psychometrics. Through illustrative analyses of item properties and latent constructs, we demonstrate the unique insights afforded by item-level data. To catalyze community-wide adoption, we introduce OpenEval, a growing repository of item-level benchmark data designed to support evidence-centered AI evaluation.
Submitted 26 February, 2026;
originally announced April 2026.
-
Evolution from Landau Quantization to Discrete Scale Invariance Revealed by Quantum Oscillations in Topological Materials
Authors:
Jiayi Yang,
Nannan Tang,
Yunxing Li,
Jiawei Luo,
Huakun Zuo,
Gangjian Jin,
Ziqiao Wang,
Haiwen Liu,
Yanzhao Liu,
Donghui Guo,
XinCheng Xie,
Jian Wang,
Huichao Wang
Abstract:
Dirac materials have been a unique solid-state platform for exploring relativistic quantum phenomena, including supercritical atomic collapse, which leads to emergent discrete scale symmetry and log-periodic quantum oscillations. In the relativistic regime, vacuum polarization, a fundamental effect in quantum electrodynamics, can further modulate the atomic-collapse-like state by screening bare charges, but it is rarely harnessed in condensed matter systems. Here, we report a continuous progression from low-field Shubnikov-de Haas oscillations to high-field log-periodic oscillations in the Dirac material HfTe5, with both phenomena modulated by Fermi surface anisotropy. This maps the transition from single-particle Landau levels to an interaction-driven, discrete-scale-invariant energy spectrum of quasi-bound states. Crucially, our findings suggest that vacuum polarization provides a compelling mechanism for renormalizing the effective impurity charge, quantitatively explaining the carrier-density-dependent scale factor. By revealing the intricate interplay between Landau quantization, many-body electronic screening, and scale-symmetry breaking, our results establish Dirac solids as a controllable platform for exploring relativistic vacuum effects and emergent novel symmetries.
Submitted 2 April, 2026;
originally announced April 2026.
-
Physics-informed neural networks for solving two-phase flow problems with moving interfaces
Authors:
Qijia Zhai,
Pengtao Sun,
Xiaoping Xie,
Xingwen Zhu,
Chen-Song Zhang
Abstract:
In this paper, a meshfree method using physics-informed neural networks (PINNs) is developed for solving two-phase flow problems with moving interfaces, where two immiscible fluids bearing different material properties are separated by a dynamically evolving interface and interact with each other through interface conditions. Two distinct scenarios of interface motion are addressed: the prescribed interface motion, whose moving velocity is explicitly given, and the solution-driven interface motion, whose evolution is determined by the velocity field of the two-phase flow. Based upon piecewise deep neural networks and spatiotemporal sampling points/training sets in each fluid subdomain, the proposed PINNs framework reformulates the two-phase flow moving interface problem as a least-squares (LS) minimization problem, which involves all residuals of the governing equations, interface conditions, boundary conditions and initial conditions. Furthermore, approximation properties of the proposed PINNs approach are analyzed rigorously for the presented two-phase flow model by employing the Reynolds transport theorem in evolving domains. Moreover, a comprehensive error estimation is provided to account for additional complexities introduced by the moving interface and the coupling between fluid dynamics and interface evolution. Numerical experiments are carried out to illustrate the effectiveness of the proposed PINNs approach for various configurations of two-phase flow moving interface problems, and to validate the theoretical findings. Practical guidance is thus provided for an efficient training set distribution when applying the proposed PINNs approach to two-phase flow moving interface problems in practice.
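The least-squares reformulation at the heart of a PINN is, schematically, a weighted sum of mean squared residuals over the different condition sets. A minimal sketch with placeholder residual vectors follows; the paper's two-phase and interface terms are of course richer than this:

```python
def pinn_loss(pde_res, iface_res, bc_res, ic_res, weights=(1, 1, 1, 1)):
    """Composite least-squares objective of the generic PINN form:
    a weighted sum of mean squared residuals over PDE collocation
    points, interface conditions, boundary and initial conditions.
    (Schematic only; residuals would come from automatic
    differentiation of the network in a real implementation.)
    """
    terms = (pde_res, iface_res, bc_res, ic_res)
    mse = lambda r: sum(x * x for x in r) / len(r)
    return sum(w * mse(r) for w, r in zip(weights, terms))

# residuals at a handful of hypothetical sampling points
print(pinn_loss([0.1, -0.1], [0.0], [0.2], [0.0]))
```

Training then minimizes this scalar over the network parameters, which is why the distribution of sampling points across the subdomains and the interface matters so much in practice.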
Submitted 1 April, 2026;
originally announced April 2026.
-
AutoEG: Exploiting Known Third-Party Vulnerabilities in Black-Box Web Applications
Authors:
Ruozhao Yang,
Mingfei Cheng,
Gelei Deng,
Junjie Wang,
Tianwei Zhang,
Xiaofei Xie
Abstract:
Large-scale web applications are widely deployed with complex third-party components, inheriting security risks arising from component vulnerabilities. Security assessment is therefore required to determine whether such known vulnerabilities remain practically exploitable in real applications. Penetration testing is a widely adopted approach that validates exploitability by launching concrete attacks against known vulnerabilities in real-world black-box systems. However, existing approaches often fail to automatically generate reliable exploits, limiting their effectiveness in practical security assessment. This limitation mainly stems from two issues: (1) precisely triggering vulnerabilities with correct technical details, and (2) adapting exploits to diverse real-world deployment settings.
In this paper, we propose AutoEG, a fully automated multi-agent framework for exploit generation targeting black-box web applications. AutoEG has two phases: First, AutoEG extracts precise vulnerability trigger logic from unstructured vulnerability information and encapsulates it into reusable trigger functions. Second, AutoEG uses trigger functions for concrete attack objectives and iteratively refines exploits through feedback-driven interaction with the target application. We evaluate AutoEG on 104 real-world vulnerabilities with 29 attack objectives, resulting in 660 exploitation tasks and 55,440 exploit attempts. AutoEG achieves an average success rate of 82.41%, substantially outperforming state-of-the-art baselines, whose best performance reaches only 32.88%.
Submitted 1 April, 2026;
originally announced April 2026.
-
OmniRoam: World Wandering via Long-Horizon Panoramic Video Generation
Authors:
Yuheng Liu,
Xin Lin,
Xinke Li,
Baihan Yang,
Chen Wang,
Kalyan Sunkavalli,
Yannick Hold-Geoffroy,
Hao Tan,
Kai Zhang,
Xiaohui Xie,
Zifan Shi,
Yiwei Hu
Abstract:
Modeling scenes using video generation models has garnered growing research interest in recent years. However, most existing approaches rely on perspective video models that synthesize only limited observations of a scene, leading to issues of completeness and global consistency. We propose OmniRoam, a controllable panoramic video generation framework that exploits the rich per-frame scene coverage and inherent long-term spatial and temporal consistency of panoramic representations, enabling long-horizon scene wandering. Our framework begins with a preview stage, where a trajectory-controlled video generation model creates a quick overview of the scene from a given input image or video. Then, in the refine stage, this video is temporally extended and spatially upsampled to produce long-range, high-resolution videos, thus enabling high-fidelity world wandering. To train our model, we introduce two panoramic video datasets that incorporate both synthetic and real-world captured videos. Experiments show that our framework consistently outperforms state-of-the-art methods in terms of visual quality, controllability, and long-term scene consistency, both qualitatively and quantitatively. We further showcase several extensions of this framework, including real-time video generation and 3D reconstruction. Code is available at https://github.com/yuhengliu02/OmniRoam.
Submitted 31 March, 2026;
originally announced March 2026.
-
VectorGym: A Multitask Benchmark for SVG Code Generation, Sketching, and Editing
Authors:
Juan Rodriguez,
Haotian Zhang,
Abhay Puri,
Tianyang Zhang,
Rishav Pramanik,
Meng Lin,
Xiaoqing Xie,
Marco Terral,
Darsh Kaushik,
Aly Shariff,
Perouz Taslakian,
Spandana Gella,
Sai Rajeswar,
David Vazquez,
Christopher Pal,
Marco Pedersoli
Abstract:
We introduce VectorGym, a comprehensive benchmark suite for Scalable Vector Graphics (SVG) that spans generation from text and sketches, complex editing, and visual understanding. VectorGym addresses the lack of realistic, challenging benchmarks aligned with professional design workflows. Our benchmark comprises four tasks with expert human-authored annotations: the novel Sketch2SVG task (VG-Sketch); a new SVG editing dataset (VG-Edit) featuring complex, multi-step edits with higher-order primitives; Text2SVG generation (VG-Text); and SVG captioning (VG-Cap). Unlike prior benchmarks that rely on synthetic edits, VectorGym provides gold-standard human annotations that require semantic understanding and design intent. We also propose a multi-task reinforcement learning approach that jointly optimizes across all four tasks using rendering-based rewards. Our method, built on GRPO with curriculum learning, trains a Qwen3-VL 8B model that achieves state-of-the-art performance among open-source models, surpassing much larger models including Qwen3-VL 235B and matching GPT-4o. We also introduce a VLM-as-a-Judge metric for SVG generation, validated through human correlation studies. Our evaluation of frontier VLMs reveals significant performance gaps, positioning VectorGym as a rigorous framework for advancing visual code generation. VectorGym is publicly available on huggingface.co/datasets/ServiceNow/VectorGym.
Submitted 22 February, 2026;
originally announced March 2026.
-
SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild
Authors:
Patrick Rim,
Kevin Harris,
Braden Copple,
Shangchen Han,
Xu Xie,
Ivan Shugurov,
Sizhe An,
He Wen,
Alex Wong,
Tomas Hodan,
Kun He
Abstract:
Accurate 3D understanding of human hands and objects during manipulation remains a significant challenge for egocentric computer vision. Existing hand-object interaction datasets are predominantly captured in controlled studio settings, which limits both environmental diversity and the ability of models trained on such data to generalize to real-world scenarios. To address this challenge, we introduce a novel marker-less multi-camera system that allows for nearly unconstrained mobility in genuinely in-the-wild conditions, while still having the ability to generate precise 3D annotations of hands and objects. The capture system consists of a lightweight, back-mounted, multi-camera rig that is synchronized and calibrated with a user-worn VR headset. For 3D ground-truth annotation of hands and objects, we develop an ego-exo tracking pipeline and rigorously evaluate its quality. Finally, we present SHOW3D, the first large-scale dataset with 3D annotations that show hands interacting with objects in diverse real-world environments, including outdoor settings. Our approach significantly reduces the fundamental trade-off between environmental realism and accuracy of 3D annotations, which we validate with experiments on several downstream tasks. show3d-dataset.github.io
Submitted 30 March, 2026;
originally announced March 2026.
-
Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
Authors:
He Du,
Qiming Ge,
Jiakai Hu,
Aijun Yang,
Zheng Cai,
Zixian Huang,
Sheng Yuan,
Qinxiu Cheng,
Xinchen Xie,
Yicheng Chen,
Yining Li,
Jiaxing Xie,
Huanan Dong,
Yaguang Wu,
Xiangjun Huang,
Jian Yang,
Hui Wang,
Bowen Zhou,
Bowen Li,
Qipeng Guo,
Kai Chen
Abstract:
We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and Maca on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with the NVIDIA Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting its potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.
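The archive-based evolutionary loop described above can be caricatured in a few lines. In the sketch below, `mutate` and `score` are hypothetical stand-ins for the LLM reviser and the kernel benchmark, and a numeric toy objective replaces actual GPU-kernel speedup:

```python
import random

def evolve(seed, mutate, score, steps=200, archive_size=4,
           rng=random.Random(0)):
    """Toy evaluation-driven evolutionary search: keep an archive of
    top candidates, revise one at a time, and retain only revisions
    that beat the archive's current worst member. (`mutate`/`score`
    are illustrative placeholders, not the paper's API.)
    """
    archive = [seed]
    for _ in range(steps):
        parent = rng.choice(archive)
        child = mutate(parent, rng)
        if score(child) > min(score(p) for p in archive):
            archive.append(child)
            archive.sort(key=score, reverse=True)
            archive = archive[:archive_size]  # keep the elite set
    return archive[0]

# maximize score(x) = -(x - 3)^2 starting from 0, mutating by small steps
best = evolve(0.0, lambda x, r: x + r.uniform(-1, 1),
              lambda x: -(x - 3) ** 2)
print(best)  # should land close to 3
```

The real system replaces the numeric mutation with LLM-generated code revisions and the analytic score with compile/correctness/speedup feedback, but the accept-if-better archive dynamic is the same.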
Submitted 30 March, 2026;
originally announced March 2026.
-
MCPT-Solver: A Monte Carlo Algorithm Solver Using MTJ Devices for Particle Transport Problems
Authors:
Siqing Fu,
Lizhou Wu,
Tiejun Li,
Xuchao Xie,
Chunyuan Zhang,
Sheng Ma,
Jianmin Zhang,
Yuhan Tang,
Jixuan Tang
Abstract:
Monte Carlo particle transport problems play a vital role in scientific computing, but solving them on existing von Neumann architectures suffers from random branching and irregular memory access, causing computing inefficiency due to a fundamental mismatch between stochastic algorithms and deterministic hardware. To bridge this gap, we propose MCPT-Solver, a spin-based hardware true random number generator (TRNG) with tunable output probability enabled by a Bayesian inference network architecture. It is dedicated to efficiently solving stochastic applications, including Monte Carlo particle transport problems. First, we leverage the stochastic switching property of spin devices to provide a high-quality entropy source for the TRNG and achieve high generation throughput and process-voltage-temperature tolerance through optimized control logic and write mechanism designs. Next, we propose a hardware Bayesian inference network to enable probability-tunable random number outputs. Finally, we present a system-level simulation framework to evaluate MCPT-Solver. Experimental results show that MCPT-Solver achieves a mean squared error of 7.6e-6 for solving transport problems while demonstrating a dramatic acceleration effect over general-purpose processors. Additionally, the MCPT-Solver's throughput reaches 185 Mb/s with an area of 27.8 um^2/bit and energy consumption of 8.6 pJ/bit, making it the first spin-based TRNG that offers both process-voltage-temperature tolerance and adjustable probability.
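The primitive the hardware provides, a bit stream with tunable bias, has a simple software analogue. In this sketch a seeded uniform draw stands in for the MTJ's stochastic switching; the hardware tunes the bias physically rather than by thresholding:

```python
import random

def tunable_bit(p, rng=random.random):
    """Software analogue of a probability-tunable TRNG output:
    emit 1 with probability p. (In MCPT-Solver this bias comes from
    MTJ switching probability; here we threshold a uniform draw.)
    """
    return 1 if rng() < p else 0

# a biased stream like this is the building block for sampling
# scattering/absorption events in Monte Carlo particle transport
random.seed(0)
bits = [tunable_bit(0.7) for _ in range(10000)]
print(sum(bits) / len(bits))  # empirical rate, close to 0.7
```

Feeding such biased bits directly to the sampling step is what removes the random-branching bottleneck that general-purpose processors hit.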
Submitted 30 March, 2026;
originally announced March 2026.
-
KAT-Coder-V2 Technical Report
Authors:
Fengxiang Li,
Han Zhang,
Haoyang Huang,
Jinghui Wang,
Jinhua Hao,
Kun Yuan,
Mengtong Li,
Minglei Zhang,
Pengcheng Xu,
Wenhao Zhuang,
Yizhen Shao,
Zongxian Feng,
Can Tang,
Chao Wang,
Chengxiao Tong,
Fan Yang,
Gang Xiong,
Haixuan Gao,
Han Gao,
Hao Wang,
Haochen Liu,
Hongliang Sun,
Jiabao Li,
Jingwen Chang,
Jun Du
, et al. (21 additional authors not shown)
Abstract:
We present KAT-Coder-V2, an agentic coding model developed by the KwaiKAT team at Kuaishou. KAT-Coder-V2 adopts a "Specialize-then-Unify" paradigm that decomposes agentic coding into five expert domains - SWE, WebCoding, Terminal, WebSearch, and General - each undergoing independent supervised fine-tuning and reinforcement learning, before being consolidated into a single model via on-policy distillation. We develop KwaiEnv, a modular infrastructure sustaining tens of thousands of concurrent sandbox instances, and scale RL training along task complexity, intent alignment, and scaffold generalization. We further propose MCLA for stabilizing MoE RL training and Tree Training for eliminating redundant computation over tree-structured trajectories with up to 6.2x speedup. KAT-Coder-V2 achieves 79.6% on SWE-bench Verified (vs. Claude Opus 4.6 at 80.8%), 88.7 on PinchBench (surpassing GLM-5 and MiniMax M2.7), ranks first across all three frontend aesthetics scenarios, and maintains strong generalist scores on Terminal-Bench Hard (46.8) and tau^2-Bench (93.9). Our model is publicly available at https://streamlake.com/product/kat-coder.
Submitted 29 March, 2026;
originally announced March 2026.
-
EvA: An Evidence-First Audio Understanding Paradigm for LALMs
Authors:
Xinyuan Xie,
Shunian Chen,
Zhiheng Liu,
Yuhao Zhang,
Zhiqiang Lv,
Liyin Liang,
Benyou Wang
Abstract:
Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic evidence before reasoning begins. We call this failure the evidence bottleneck: state-of-the-art systems show larger deficits in evidence extraction than in downstream reasoning, suggesting that the main limitation lies in upstream perception rather than reasoning policy. To address this problem, we propose EvA (Evidence-First Audio), a dual-path architecture that combines Whisper and CED-Base through non-compressive, time-aligned fusion. EvA first aggregates intermediate CED layers to preserve multi-scale acoustic cues, then aligns the aggregated CED features to the Whisper timeline and adds the two streams without changing sequence length. We also build EvA-Perception, a large-scale open-source training set with about 54K event-ordered captions (150 h) and about 500K QA pairs. Under a unified zero-shot protocol, EvA achieves the best open-source Perception scores on MMAU, MMAR, and MMSU, and improves over Kimi-Audio-7B on all reported metrics, with the largest gains on perception-heavy splits. These results support the evidence-first hypothesis: stronger audio understanding depends on preserving acoustic evidence before reasoning.
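The non-compressive, time-aligned fusion idea, resampling one feature stream onto another's timeline and adding them without changing sequence length, can be sketched on 1-D toy features. This is an illustrative simplification of EvA's Whisper/CED fusion, not its implementation:

```python
def align_and_add(a, b):
    """Time-aligned fusion sketch: linearly resample stream `b` onto
    the timeline of stream `a`, then add element-wise, so the fused
    sequence keeps `a`'s length (no compression of either stream).
    `a` and `b` are per-frame scalar features standing in for
    embedding vectors.
    """
    n, m = len(a), len(b)
    out = []
    for i in range(n):
        # map frame i of `a` onto the (shorter or longer) timeline of `b`
        t = i * (m - 1) / (n - 1) if n > 1 else 0.0
        lo = int(t)
        hi = min(lo + 1, m - 1)
        frac = t - lo
        out.append(a[i] + (1 - frac) * b[lo] + frac * b[hi])
    return out

# a 2-frame stream is stretched across a 4-frame timeline and added
print(align_and_add([0, 0, 0, 0], [1.0, 3.0]))
```

Because neither stream is pooled or truncated, the multi-scale acoustic cues in the second stream survive into the fused representation, which is the point of the evidence-first design.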
Submitted 29 March, 2026;
originally announced March 2026.
-
Stability Analysis of Monolithic Globally Divergence-Free ALE-HDG Methods for Fluid-Structure Interaction
Authors:
Shuaijun Liu,
Xiaoping Xie
Abstract:
In this paper, we propose two monolithic fully discrete finite element methods for fluid-structure interaction (FSI) based on a novel Piola-type Arbitrary Lagrangian-Eulerian (ALE) mapping. For the temporal discretization, we apply the backward Euler method to both the non-conservative and conservative formulations. For the spatial discretization, we adopt arbitrary-order hybridizable discontinuous Galerkin (HDG) methods for the incompressible Navier-Stokes and linear elasticity equations, and a continuous Galerkin (CG) method for the fluid mesh movement. We derive stability results for both the temporal semi-discretization and the full discretization, and show that the velocity approximations of the fully discrete schemes are globally divergence-free. Several numerical experiments are performed to verify the performance of the proposed methods.
Submitted 8 April, 2026; v1 submitted 29 March, 2026;
originally announced March 2026.
-
UMI-Underwater: Learning Underwater Manipulation without Underwater Teleoperation
Authors:
Hao Li,
Long Yin Chung,
Jack Goler,
Ryan Zhang,
Xiaochi Xie,
Huy Ha,
Shuran Song,
Mark Cutkosky
Abstract:
Underwater robotic grasping is difficult due to degraded, highly variable imagery and the expense of collecting diverse underwater demonstrations. We introduce a system that (i) autonomously collects successful underwater grasp demonstrations via a self-supervised data collection pipeline and (ii) transfers grasp knowledge from on-land human demonstrations through a depth-based affordance representation that bridges the on-land-to-underwater domain gap and is robust to lighting and color shift. An affordance model trained on on-land handheld demonstrations is deployed underwater zero-shot via geometric alignment, and an affordance-conditioned diffusion policy is then trained on underwater demonstrations to generate control actions. In pool experiments, our approach improves grasping performance and robustness to background shifts, and enables generalization to objects seen only in on-land data, outperforming RGB-only baselines. Code, videos, and additional results are available at https://umi-under-water.github.io.
Submitted 27 March, 2026;
originally announced March 2026.
-
Search-Induced Issues in Web-Augmented LLM Code Generation: Detecting and Repairing Error-Inducing Pages
Authors:
Guoqing Wang,
Zeyu Sun,
Xiaofei Xie,
Yizhou Chen,
Yanchao Tan,
Yifan Zhao,
Dan Hao
Abstract:
Web-augmented large language models (LLMs) offer promising capabilities for automatic code generation. However, integrating live web search exposes models to unreliable or malicious content, leading to Search-Induced Issues (SII), a novel failure mode in which external pages mislead LLMs into producing incorrect code. This paper presents a comprehensive empirical study of the prevalence and impact of SII across three commercial search APIs and six advanced LLMs. Our analysis reveals that all evaluated web-augmented LLMs are vulnerable to SII, with root causes arising from either misaligned specifications or flawed code implementations in the searched Error-Inducing Pages (EIPs).
To address this challenge, we propose Sherlock, an automated framework that enables LLM service providers to proactively safeguard web-augmented generation systems at scale. Sherlock operates as a continuous pipeline that first detects potential SII instances, then debugs them to identify the responsible EIPs and pinpoint their root causes, and finally repairs them by either annotating misaligned content or replacing erroneous code snippets with evaluated solutions from trusted sources. Experiments show that Sherlock identifies EIPs with an F1 score of up to 95% and repairs 71% to 100% of affected generations across the evaluated models, with modest computational overhead. Our findings and framework provide practical guidance for improving the reliability of web-augmented LLM-based code generation systems in real-world software engineering scenarios.
Submitted 27 March, 2026;
originally announced March 2026.
-
Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale
Authors:
Yicheng Zou,
Dongsheng Zhu,
Lin Zhu,
Tong Zhu,
Yunhua Zhou,
Peiheng Zhou,
Xinyu Zhou,
Dongzhan Zhou,
Zhiwang Zhou,
Yuhao Zhou,
Bowen Zhou,
Zhanping Zhong,
Zhijie Zhong,
Haiteng Zhao,
Penghao Zhao,
Xiaomeng Zhao,
Zhiyuan Zhao,
Yechen Zhang,
Jin Zhang,
Wenwei Zhang,
Hongjie Zhang,
Zhuo Zhang,
Wenlong Zhang,
Bo Zhang,
Chao Zhang
, et al. (152 additional authors not shown)
Abstract:
We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancement across both general and scientific domains. Beyond stronger reasoning and image-text understanding capabilities, its intelligence is augmented with advanced agent capabilities. Simultaneously, its scientific expertise has been vastly expanded to master over 100 specialized tasks across critical science fields, including chemistry, materials, life sciences, and earth sciences. Achieving this massive scale is made possible by the robust infrastructure support of XTuner and LMDeploy, which facilitates highly efficient Reinforcement Learning (RL) training at the 1-trillion parameter level while ensuring strict precision consistency between training and inference. By seamlessly integrating these advancements, Intern-S1-Pro further fortifies the fusion of general and specialized intelligence, working as a Specializable Generalist, demonstrating its position in the top tier of open-source models for general capabilities, while outperforming proprietary models in the depth of specialized scientific tasks.
Submitted 2 April, 2026; v1 submitted 26 March, 2026;
originally announced March 2026.
-
Electron Dynamics Reconstruction and Nontrivial Transport by Acoustic Waves
Authors:
Zi-Qian Zhou,
Zhi-Fan Zhang,
Cong Xiao,
Hua Jiang,
X. C. Xie
Abstract:
Surface acoustic waves (SAWs) have become a popular driving source in modern condensed matter physics, but most existing theories simplify them as electric fields and ignore the non-uniform Brillouin-zone folding effect. We develop a semiclassical framework and reconstruct the electron dynamics by treating the SAW as a quasi-periodic potential modulating the electronic momentum distribution. This framework naturally explains the experimentally observed DC drag current and predicts an acousto-electric Hall effect. The theory further reveals various SAW-driven transport phenomena, including emergent anomalous Hall, thermal Hall, and Nernst effects in time-reversal-symmetric systems. Illustrated in bilayer graphene and $\mathrm{MX_2}$ (M = Mo, W; X = S, Se, Te), the angular-dependent acousto-electric Hall effect provides an experimental probe of the Berry curvature distribution.
Submitted 25 March, 2026;
originally announced March 2026.
-
Visuospatial Perspective Taking in Multimodal Language Models
Authors:
Jonathan Prunty,
Seraphina Zhang,
Patrick Quinn,
Jianxun Lian,
Xing Xie,
Lucy Cheke
Abstract:
As multimodal language models (MLMs) are increasingly used in social and collaborative settings, it is crucial to evaluate their perspective-taking abilities. Existing benchmarks largely rely on text-based vignettes or static scene understanding, leaving visuospatial perspective-taking (VPT) underexplored. We adapt two evaluation tasks from human studies: the Director Task, assessing VPT in a referential communication paradigm, and the Rotating Figure Task, probing perspective-taking across angular disparities. Across tasks, MLMs show pronounced deficits in Level 2 VPT, which requires inhibiting one's own perspective to adopt another's. These results expose critical limitations in current MLMs' ability to represent and reason about alternative perspectives, with implications for their use in collaborative contexts.
Submitted 4 March, 2026;
originally announced March 2026.
-
A Large-Scale Remote Sensing Dataset and VLM-based Algorithm for Fine-Grained Road Hierarchy Classification
Authors:
Ting Han,
Xiangyi Xie,
Yiping Chen,
Yumeng Du,
Jin Ma,
Aiguang Li,
Jiaan Liu,
Yin Gao
Abstract:
In this work, we present SYSU-HiRoads, a large-scale hierarchical road dataset, and RoadReasoner, a vision-language-geometry framework for automatic multi-grade road mapping from remote sensing imagery. SYSU-HiRoads is built from GF-2 imagery covering 3631 km$^2$ in Henan Province, China, and contains 1079 image tiles at 0.8 m spatial resolution. Each tile is annotated with dense road masks, vectorized centerlines, and three-level hierarchy labels, enabling the joint training and evaluation of segmentation, topology reconstruction, and hierarchy classification. Building on this dataset, RoadReasoner is designed to generate robust road surface masks, topology-preserving road networks, and semantically coherent hierarchy assignments. We strengthen road feature representation and network connectivity by explicitly enhancing frequency-sensitive cues and multi-scale context. Moreover, we perform hierarchy inference at the skeleton-segment level with geometric descriptors and geometry-aware textual prompts, queried by vision-language models to obtain linguistically interpretable grade decisions. Experiments on SYSU-HiRoads and the CHN6-CUG dataset show that RoadReasoner surpasses state-of-the-art road extraction baselines and produces accurate and semantically consistent road hierarchy maps with 72.6% OA, 64.2% F1 score, and 60.6% SegAcc. The dataset and code will be publicly released to support automated transport infrastructure mapping, road inventory updating, and broader infrastructure management applications.
Submitted 22 March, 2026;
originally announced March 2026.
-
Local Convergence Analysis of ADMM for Nonconvex Composite Optimization
Authors:
Xiyuan Xie,
Lihua Yang,
Qia li
Abstract:
In this paper, we study the local convergence of the standard ADMM scheme for a class of nonconvex composite problems arising from modern imaging and machine learning models. This problem is constrained by a closed convex set, while its objective is the sum of a continuously differentiable (possibly nonconvex) smooth term and a polyhedral convex nonsmooth term composed with a linear mapping. Our analysis is mainly motivated by the recent works of Rockafellar [29,30]. We begin with an elementary proof of a key local strong convexity property of the Moreau envelope of polyhedral convex functions. Building on this property, we show that the strong variational sufficiency condition holds for the considered problem under appropriate assumptions. Using the strong variational sufficiency condition, we further derive a descent inequality for the ADMM iterates, in a form analogous to the classical descent analysis of ADMM for convex problems. As a consequence, for a suitable choice of the penalty parameter, we establish local convergence of the ADMM scheme to a primal-dual solution, and a local linear convergence rate for the case where the constraint set is polyhedral convex. Finally, we present three analytic examples to illustrate the applicability of our local convergence result and the necessity of the local assumptions.
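For orientation, the standard ADMM scheme whose local behavior is analyzed here can be stated in textbook form (the paper's precise problem class, constraint set, and notation may differ; this is the generic composite template): for $\min_{x,z} f(x)+g(z)$ subject to $Ax+Bz=c$, with augmented Lagrangian $L_ρ(x,z,y)=f(x)+g(z)+\langle y, Ax+Bz-c\rangle+\frac{ρ}{2}\|Ax+Bz-c\|^2$, the iterates are

$x^{k+1}=\arg\min_x L_ρ(x,z^k,y^k), \quad z^{k+1}=\arg\min_z L_ρ(x^{k+1},z,y^k), \quad y^{k+1}=y^k+ρ\,(Ax^{k+1}+Bz^{k+1}-c),$

where $ρ>0$ is the penalty parameter whose suitable choice the local convergence result depends on.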
Submitted 21 March, 2026;
originally announced March 2026.
-
Precise parameter determination of the open cluster NGC 1647 via asteroseismology of p-mode pulsators
Authors:
Mingfeng Qin,
Jian-Ning Fu,
Weikai Zong,
Tianqi Cang,
Antonio Frasca,
Gang Meng,
Xiran Xie
Abstract:
Asteroseismology of member pulsators provides robust physical constraints on cluster parameters by linking internal stellar structure to the global properties of the host cluster. However, the parameters of NGC 1647 remain poorly constrained due to limited investigation, a situation that cluster asteroseismology can significantly refine. In this study, we identified 271 high-confidence cluster members of NGC 1647 using HDBSCAN clustering with radial-velocity validation. Its initial age is determined to lie in the range 1250-280 Myr, derived from isochrone fitting based on multi-survey metallicities and extinction-corrected Gaia photometry. Among the members, we found 96 periodic variables in TESS and K2 photometry, including nine p-mode pulsators (five δ Sct and four hybrid δ Sct-γ Dor stars). Assuming a common cluster age and initial chemical composition, joint asteroseismic modeling is performed based on measured large frequency separations and individual mode frequencies. This yields a metallicity of [Fe/H] = $-0.08^{+0.04}_{-0.01}$, consistent with the spectroscopic determinations, and a seismic age of $178^{+11}_{-9}$ Myr, more precise than isochrone-based estimates. This work demonstrates the diagnostic potential of δ Sct asteroseismology in young open clusters and establishes a high-precision benchmark for future studies of NGC 1647 and other open clusters.
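For context, the large frequency separation that anchors the joint seismic modeling is, to leading order, a probe of the mean stellar density; the standard asteroseismic scaling relation (stated here for orientation, not taken from this paper) reads

$\Delta\nu \propto \sqrt{\bar{ρ}}, \qquad \Delta\nu/\Delta\nu_{\odot} \approx \sqrt{(M/M_{\odot})\,(R/R_{\odot})^{-3}},$

so matching the measured separations and individual mode frequencies against models sharing a single cluster age and composition tightly constrains the stellar, and hence cluster, parameters.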
Submitted 20 March, 2026;
originally announced March 2026.
-
Does YOLO Really Need to See Every Training Image in Every Epoch?
Authors:
Xingxing Xie,
Jiahua Dong,
Junwei Han,
Gong Cheng
Abstract:
YOLO detectors are known for their fast inference speed, yet training them remains unexpectedly time-consuming due to their exhaustive pipeline that processes every training image in every epoch, even when many images have already been sufficiently learned. This stands in clear contrast to the efficiency suggested by the ``You Only Look Once'' philosophy. This naturally raises an important question: \textit{Does YOLO really need to see every training image in every epoch?} To explore this, we propose an Anti-Forgetting Sampling Strategy (AFSS) that dynamically determines which images should be used and which can be skipped during each epoch, allowing the detector to learn more effectively and efficiently. Specifically, AFSS measures the learning sufficiency of each training image as the minimum of its detection recall and precision, and dynamically categorizes training images into easy, medium, or hard levels accordingly. Easy training images are sparsely resampled during training in a continuous-review manner, with priority given to those that have not been used for a long time, to reduce redundancy and prevent forgetting. Medium training images are partially selected, prioritizing recently unused ones and randomly choosing the rest from unselected images to ensure coverage and prevent forgetting. Hard training images are fully sampled in every epoch to ensure sufficient learning. The learning sufficiency of each training image is periodically updated, enabling detectors to adaptively shift their focus toward informative training images over time while progressively discarding redundant ones. On widely used natural image detection benchmarks (MS COCO 2017 and PASCAL VOC 2007) and remote sensing detection datasets (DOTA-v1.0 and DIOR-R), AFSS achieves more than $1.43\times$ training speedup for YOLO-series detectors while also improving accuracy.
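The sampling rule described in the abstract can be sketched as follows. This is a minimal illustration: the thresholds, review period, selection fraction, and helper names are hypothetical, not AFSS's actual hyperparameters, and the medium-level selection is simplified to staleness ordering only.

```python
def learning_sufficiency(recall, precision):
    """AFSS measures a training image's learning sufficiency as the
    minimum of its detection recall and precision."""
    return min(recall, precision)

def select_epoch_images(stats, last_used, epoch,
                        easy_thr=0.9, hard_thr=0.5,
                        easy_period=5, medium_frac=0.5):
    """Choose which images to train on this epoch (illustrative sketch).

    stats: {image_id: (recall, precision)} from the periodic update.
    last_used: {image_id: last epoch the image was sampled}.
    """
    easy, medium, hard = [], [], []
    for img, (r, p) in stats.items():
        s = learning_sufficiency(r, p)
        (easy if s >= easy_thr else medium if s >= hard_thr else hard).append(img)

    # Hard images: fully sampled in every epoch.
    selected = list(hard)

    # Easy images: sparse continuous review; resample only those unused
    # for at least easy_period epochs, to prevent forgetting.
    selected += [i for i in easy
                 if epoch - last_used.get(i, -easy_period) >= easy_period]

    # Medium images: partially selected, prioritizing the longest-unused.
    medium.sort(key=lambda i: last_used.get(i, -1))
    selected += medium[:int(len(medium) * medium_frac)]

    for img in selected:
        last_used[img] = epoch
    return selected
```

Re-running the categorization whenever `stats` is refreshed reproduces the adaptive shift toward informative images that the abstract describes.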
Submitted 18 March, 2026;
originally announced March 2026.
-
Facial beauty prediction fusing transfer learning and broad learning system
Authors:
Junying Gan,
Xiaoshan Xie,
Yikui Zhai,
Guohui He,
Chaoyun Mai,
Heng Luo
Abstract:
Facial beauty prediction (FBP) is an important and challenging problem in computer vision and machine learning. Not only is it prone to overfitting due to the lack of large-scale, high-quality data, but it is also difficult to quickly build robust and effective facial beauty evaluation models because of the variability of facial appearance and the complexity of human perception. Transfer learning can reduce the dependence on large amounts of data and help avoid overfitting, while the broad learning system (BLS) can quickly complete model building and training. For this purpose, transfer learning is fused with BLS for FBP in this paper. Firstly, a feature extractor is constructed from CNN models based on transfer learning (EfficientNets in this paper) to extract facial features, and the fused facial beauty features are transferred to BLS for FBP; we call this E-BLS. Secondly, on the basis of E-BLS, a connection layer is designed to link the feature extractor and BLS, called ER-BLS. Finally, experimental results show that, compared with existing BLS and CNN methods, E-BLS and ER-BLS improve the accuracy of FBP, demonstrating the effectiveness and superiority of the presented method, which can also be widely applied in pattern recognition, object detection, and image classification.
Submitted 13 March, 2026;
originally announced March 2026.
-
Gluon TMDs for tensor polarized deuteron in a spectator model
Authors:
Xiupeng Xie,
Dian-Yong Chen,
Zhun Lu
Abstract:
We present a model calculation of the transverse-momentum-dependent distributions (TMDs) for gluons in a tensor-polarized deuteron. Our model is based on the assumption that an on-shell deuteron can emit a time-like off-shell gluon, while the remaining system is treated as a single on-shell spectator particle whose mass can take on a continuous range of real values, described by a spectral function. For spin-1 hadrons, the polarization is characterized not only by a spin vector $S$ but also by a symmetric traceless spin tensor $T$. The deuteron-gluon-spectator coupling is described by an effective vertex containing three form factors. We obtain analytical expressions for thirteen T-even gluon TMDs. We also provide numerical results for the $x$-dependence and $\bm{k}_T$-dependence of these TMDs. Our analysis reveals that these gluon TMDs are non-negligible, especially for tensor-polarized hadrons, and could potentially be explored in future experimental measurements.
Submitted 16 March, 2026;
originally announced March 2026.
-
Attention Residuals
Authors:
Kimi Team,
Guangyu Chen,
Yu Zhang,
Jianlin Su,
Weixin Xu,
Siyuan Pan,
Yaoyu Wang,
Yucheng Wang,
Guanduo Chen,
Bohong Yin,
Yutian Chen,
Junjie Yan,
Ming Wei,
Y. Zhang,
Fanqing Meng,
Chao Hong,
Xiaotong Xie,
Shaowei Liu,
Enzhe Lu,
Yunpeng Tai,
Yanru Chen,
Xin Men,
Haiqing Guo,
Y. Charles,
Haoyu Lu
, et al. (12 additional authors not shown)
Abstract:
Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead.
Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.
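The core replacement, softmax attention over preceding layer outputs instead of a fixed unit-weight sum, can be sketched in isolation. This is a toy numpy illustration; the query/key parameterization and dimensions are assumptions, not the actual AttnRes design.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attn_residual(layer_outputs, q, keys):
    """Aggregate preceding layer outputs with input-dependent weights.

    layer_outputs: list of (D,) outputs of layers 0..l-1.
    q: (Dk,) query derived from the current layer's input (assumed).
    keys: list of (Dk,) keys, one per preceding layer (assumed).

    A standard PreNorm residual stream would use sum(layer_outputs)
    with fixed unit weights, so its norm grows with depth; here the
    weights are a softmax, so they sum to 1 and each layer can
    selectively emphasize earlier representations.
    """
    scores = np.array([q @ k for k in keys])
    w = softmax(scores)                   # input-dependent, sums to 1
    H = np.stack(layer_outputs, axis=0)   # (l, D)
    return w @ H                          # (D,) selective aggregation
```

Block AttnRes would apply the same operation over block-level representations rather than every individual layer output, trading some selectivity for memory.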
Submitted 16 March, 2026;
originally announced March 2026.
-
SFedHIFI: Fire Rate-Based Heterogeneous Information Fusion for Spiking Federated Learning
Authors:
Ran Tao,
Qiugang Zhan,
Shantian Yang,
Xiurui Xie,
Qi Tian,
Guisong Liu
Abstract:
Spiking Federated Learning (SFL) has been widely studied owing to the energy efficiency of Spiking Neural Networks (SNNs). However, existing SFL methods require model homogeneity and assume all clients have sufficient computational resources, which excludes some resource-constrained clients. To address the system heterogeneity prevalent in real-world scenarios, it is crucial to enable heterogeneous SFL systems in which clients adaptively deploy models of different scales based on their local resources. To this end, we introduce SFedHIFI, a novel Spiking Federated Learning framework with Fire Rate-Based Heterogeneous Information Fusion. Specifically, SFedHIFI employs channel-wise matrix decomposition to deploy SNN models of adaptive complexity on clients with heterogeneous resources. Building on this, the proposed heterogeneous information fusion module enables cross-scale aggregation among models of different widths, thereby enhancing the utilization of diverse local knowledge. Extensive experiments on three public benchmarks demonstrate that SFedHIFI effectively enables heterogeneous SFL, consistently outperforming all three baseline methods. Compared with ANN-based FL, it achieves significant energy savings with only a marginal trade-off in accuracy.
Submitted 16 March, 2026;
originally announced March 2026.
-
LibraGen: Playing a Balance Game in Subject-Driven Video Generation
Authors:
Jiahao Zhu,
Shanshan Lao,
Lijie Liu,
Gen Li,
Tianhao Qi,
Wei Han,
Bingchuan Li,
Fangfang Liu,
Zhuowei Chen,
Tianxiang Ma,
Qian HE,
Yi Zhou,
Xiaohua Xie
Abstract:
With the advancement of video generation foundation models (VGFMs), customized generation, particularly subject-to-video (S2V), has attracted growing attention. However, a key challenge lies in balancing the intrinsic priors of a VGFM, such as motion coherence, visual aesthetics, and prompt alignment, with its newly derived S2V capability. Existing methods often neglect this balance by enhancing one aspect at the expense of others. To address this, we propose LibraGen, a novel framework that views extending foundation models for S2V generation as a balance game between intrinsic VGFM strengths and S2V capability. Specifically, guided by the core philosophy of "Raising the Fulcrum, Tuning to Balance," we identify data quality as the fulcrum and advocate a quality-over-quantity approach. We construct a hybrid pipeline that combines automated and manual data filtering to improve overall data quality. To further harmonize the VGFM's native capabilities with its S2V extension, we introduce a Tune-to-Balance post-training paradigm. During supervised fine-tuning, both cross-pair and in-pair data are incorporated, and model merging is employed to achieve an effective trade-off. Subsequently, two tailored direct preference optimization (DPO) pipelines, namely Consis-DPO and Real-Fake DPO, are designed and merged to consolidate this balance. During inference, we introduce a time-dependent dynamic classifier-free guidance scheme to enable flexible and fine-grained control. Experimental results demonstrate that LibraGen outperforms both open-source and commercial S2V models using only thousand-scale training data.
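The time-dependent dynamic classifier-free guidance mentioned at inference can be illustrated with the standard CFG update. The linear schedule and scale values below are invented for the sketch; the abstract does not specify LibraGen's actual schedule.

```python
def dynamic_cfg(eps_uncond, eps_cond, t, s_max=7.5, s_min=1.0):
    """Classifier-free guidance with a time-dependent scale.

    eps_uncond / eps_cond: unconditional / subject-conditioned predictions.
    t in [0, 1]: normalized diffusion time, 1 = noisiest step.
    Noisy early steps get strong guidance toward the condition; late
    steps relax toward the base model's priors (an assumed linear
    schedule; any monotone s(t) fits the same template).
    """
    s = s_min + (s_max - s_min) * t
    return eps_uncond + s * (eps_cond - eps_uncond)
```

Making the scale a function of the timestep is what gives the "flexible and fine-grained control" over how strongly subject conditioning dominates at each stage of denoising.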
Submitted 17 March, 2026; v1 submitted 13 March, 2026;
originally announced March 2026.
-
ActionPlan: Future-Aware Streaming Motion Synthesis via Frame-Level Action Planning
Authors:
Eric Nazarenus,
Chuqiao Li,
Yannan He,
Xianghui Xie,
Jan Eric Lenssen,
Gerard Pons-Moll
Abstract:
We present ActionPlan, a unified motion diffusion framework that bridges real-time streaming with high-quality offline generation within a single model. The core idea is to introduce a per-frame action plan: the model predicts frame-level text latents that act as dense semantic anchors throughout denoising, and uses them to denoise the full motion sequence with combined semantic and motion cues. To support this structured workflow, we design latent-specific diffusion steps, allowing each motion latent to be denoised independently and sampled in flexible orders at inference. As a result, ActionPlan can run in a history-conditioned, future-aware mode for real-time streaming, while also supporting high-quality offline generation. The same mechanism further enables zero-shot motion editing and in-betweening without additional models. Experiments demonstrate that our real-time streaming is 5.25x faster while also achieving 18% motion quality improvement over the best previous method in terms of FID.
Submitted 13 March, 2026;
originally announced March 2026.
-
Human in the Loop for Fuzz Testing: Literature Review and the Road Ahead
Authors:
Jiongchi Yu,
Xiaolin Wen,
Sizhe Cheng,
Xiaofei Xie,
Qiang Hu,
Yong Wang
Abstract:
Fuzz testing is one of the most effective techniques for detecting bugs and vulnerabilities in software. However, as the basis of fuzz testing, automated heuristics often fail to uncover deep or complex vulnerabilities. As a result, the performance of fuzz testing remains limited. One promising way to address this limitation is to integrate human expert guidance into the paradigm of fuzz testing. Even though some works have been proposed in this direction, there is still a lack of a systematic research roadmap for combining Human-in-the-Loop (HITL) and fuzz testing, hindering the potential for further enhancing fuzzing effectiveness.
To bridge this gap, this paper outlines a forward-looking research roadmap for HITL fuzz testing. Specifically, we highlight the promise of visualization techniques for interpretable fuzzing processes, as well as on-the-fly interventions that enable experts to guide fuzzing toward hard-to-reach program behaviors. Moreover, the rise of Large Language Models (LLMs) introduces new opportunities and challenges, raising questions about how humans can efficiently provide actionable knowledge, how expert meta-knowledge can be leveraged, and what roles humans should play in the intelligent fuzzing loop with LLMs. To address these questions, we survey existing work on HITL fuzz testing and propose a research agenda emphasizing future opportunities in (1) human monitoring, (2) human steering, and (3) human-LLM collaboration. We call for a paradigm shift toward interactive, human-guided fuzzing systems that integrate expert insight with AI-powered automation in the next-generation fuzzing ecosystem.
Submitted 12 March, 2026;
originally announced March 2026.
-
Hoi3DGen: Generating High-Quality Human-Object-Interactions in 3D
Authors:
Agniv Sharma,
Xianghui Xie,
Tom Fischer,
Eddy Ilg,
Gerard Pons-Moll
Abstract:
Modeling and generating 3D human-object interactions from text is crucial for applications in AR, XR, and gaming. Existing approaches often rely on score distillation from text-to-image models, but their results suffer from the Janus problem and do not follow text prompts faithfully due to the scarcity of high-quality interaction data. We introduce Hoi3DGen, a framework that generates high-quality textured meshes of human-object interactions that precisely follow the input interaction descriptions. We first curate realistic and high-quality interaction data leveraging multimodal large language models, and then create a full text-to-3D pipeline, which achieves orders-of-magnitude improvements in interaction fidelity. Our method surpasses baselines by 4-15x in text consistency and 3-7x in 3D model quality, exhibiting strong generalization to diverse categories and interaction types, while maintaining high-quality 3D generation.
Submitted 12 March, 2026;
originally announced March 2026.