-
Enormous Fluid Antenna Systems (E-FAS) under Correlated Surface-Wave Leakage: Physical Layer Security
Authors:
Farshad Rostami Ghadi,
Kai-Kit Wong,
Masoud Kaveh,
Mohammad Javad Ahmadi,
Kin-Fai Tong,
Hyundong Shin
Abstract:
Enormous fluid antenna systems (E-FAS) have recently emerged as a surface-wave (SW)-enabled architecture that can induce controllable large-scale channel gains through guided electromagnetic routing. This paper develops a secrecy analysis framework for E-FAS-assisted downlink transmission with practical pilot-based channel estimation. We consider a multiple-input single-output (MISO) wiretap setting in which the base station (BS) performs minimum mean-square-error (MMSE) channel estimation and adopts maximum-ratio transmission (MRT) with artificial noise (AN). To capture the leakage of SW routing in E-FAS, we introduce a correlated SW-leakage model that accounts for statistical coupling between the legitimate and eavesdropper channels caused by partially overlapping SW propagation paths. Exploiting the two-timescale nature (slowly varying routing gain and small-scale block fading), we then derive a closed-form conditional expression for the secrecy outage probability (SOP) and a tractable characterization of the ergodic secrecy rate (ESR) in the presence of correlated quadratic forms. Our analysis yields three key insights: (i) secrecy collapses at high transmit power if and only if AN is not present, whereas any strictly positive AN can prevent asymptotic collapse; (ii) the optimal data-AN power split is achieved by a strictly interior solution; and (iii) routing gain improves both the received signal strength and the channel-estimation quality, creating a nonlinear coupling that raises the signal-to-interference-plus-noise ratio (SINR) ceiling in the high signal-to-noise ratio (SNR) regime and disperses secrecy across routing states. Numerical results indicate that E-FAS markedly enlarges the secure operating region compared with conventional space-wave transmission.
Submitted 26 March, 2026;
originally announced March 2026.
-
AgentRVOS: Reasoning over Object Tracks for Zero-Shot Referring Video Object Segmentation
Authors:
Woojeong Jin,
Jaeho Lee,
Heeseong Shin,
Seungho Jang,
Junhwan Heo,
Seungryong Kim
Abstract:
Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training-free methods for this task follow a common pipeline: an MLLM selects keyframes, grounds the referred object within those frames, and a video segmentation model propagates the results. While intuitive, this design asks the MLLM to make temporal decisions before any object-level evidence is available, limiting both reasoning quality and spatio-temporal coverage. To overcome this, we propose AgentRVOS, a training-free agentic pipeline built on the complementary strengths of SAM3 and an MLLM. Given a concept derived from the query, SAM3 provides reliable perception over the full spatio-temporal extent through generated mask tracks. The MLLM then identifies the target through query-grounded reasoning over this object-level evidence, with iterative pruning guided by SAM3's temporal existence information. Extensive experiments show that AgentRVOS achieves state-of-the-art performance among training-free methods across multiple benchmarks, with consistent results across diverse MLLM backbones. Our project page is available at: https://cvlab-kaist.github.io/AgentRVOS/.
Submitted 24 March, 2026;
originally announced March 2026.
-
PACE-RAG: Patient-Aware Contextual and Evidence-based Policy RAG for Clinical Drug Recommendation
Authors:
Chaeyoung Huh,
Hyunmin Hwang,
Jung Hwan Shin,
Jinse Park,
Jong Chul Ye
Abstract:
Drug recommendation requires a deep understanding of individual patient context, especially for complex conditions like Parkinson's disease. While LLMs possess broad medical knowledge, they fail to capture the subtle nuances of actual prescribing patterns. Existing RAG methods also struggle with these complexities because guideline-based retrieval remains too generic and similar-patient retrieval often replicates majority patterns without accounting for the unique clinical nuances of individual patients. To bridge this gap, we propose PACE-RAG (Patient-Aware Contextual and Evidence-based Policy RAG), a novel framework designed to synthesize individual patient context with the prescribing tendencies of similar cases. By analyzing treatment patterns tailored to specific clinical signals, PACE-RAG identifies optimal prescriptions and generates an explainable clinical summary. Evaluated on a Parkinson's cohort and the MIMIC-IV benchmark using Llama-3.1-8B and Qwen3-8B, PACE-RAG achieved state-of-the-art performance, reaching F1 scores of 80.84% and 47.22%, respectively. These results validate PACE-RAG as a robust, clinically grounded solution for personalized decision support. Our code is available at: https://github.com/ChaeYoungHuh/PACE-RAG.
Submitted 18 March, 2026;
originally announced March 2026.
-
STONE Dataset: A Scalable Multi-Modal Surround-View 3D Traversability Dataset for Off-Road Robot Navigation
Authors:
Konyul Park,
Daehun Kim,
Jiyong Oh,
Seunghoon Yu,
Junseo Park,
Jaehyun Park,
Hongjae Shin,
Hyungchan Cho,
Jungho Kim,
Jun Won Choi
Abstract:
Reliable off-road navigation requires accurate estimation of traversable regions and robust perception under diverse terrain and sensing conditions. However, existing datasets lack both scalability and multi-modality, which limits progress in 3D traversability prediction. In this work, we introduce STONE, a large-scale multi-modal dataset for off-road navigation. STONE provides (1) trajectory-guided 3D traversability maps generated by a fully automated, annotation-free pipeline, and (2) comprehensive surround-view sensing with synchronized 128-channel LiDAR, six RGB cameras, and three 4D imaging radars. The dataset covers a wide range of environments and conditions, including day and night, grasslands, farmlands, construction sites, and lakes. Our auto-labeling pipeline reconstructs dense terrain surfaces from LiDAR scans, extracts geometric attributes such as slope, elevation, and roughness, and assigns traversability labels beyond the robot's trajectory using a Mahalanobis-distance-based criterion. This design enables scalable, geometry-aware ground-truth construction without manual annotation. Finally, we establish a benchmark for voxel-level 3D traversability prediction and provide strong baselines under both single-modal and multi-modal settings. STONE is available at: https://konyul.github.io/STONE-dataset/
Submitted 12 March, 2026; v1 submitted 10 March, 2026;
originally announced March 2026.
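A Mahalanobis-distance-based labeling criterion of the kind the STONE abstract mentions could be sketched roughly as follows. This is only an illustration, not the authors' pipeline: the feature layout (slope, elevation change, roughness), the function name `mahalanobis_labels`, the threshold of 3.0, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def mahalanobis_labels(traversed_feats, candidate_feats, threshold=3.0):
    """Label candidate cells by their distance to the traversed-cell distribution.

    traversed_feats: (n, d) geometric features of cells the robot drove over.
    candidate_feats: (m, d) features of cells away from the trajectory.
    threshold: hypothetical cut-off on the Mahalanobis distance.
    """
    mean = traversed_feats.mean(axis=0)
    cov = np.cov(traversed_feats, rowvar=False)
    # A small ridge term keeps the covariance matrix invertible.
    cov_inv = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))
    diff = candidate_feats - mean
    # Squared Mahalanobis distance for every candidate row.
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    dist = np.sqrt(d2)
    return dist <= threshold, dist

# Synthetic example: columns are (slope in degrees, elevation change, roughness).
rng = np.random.default_rng(0)
traversed = rng.normal(loc=[5.0, 0.0, 0.02], scale=[2.0, 0.1, 0.01], size=(500, 3))
candidates = np.array([
    [5.0, 0.0, 0.02],    # similar to terrain the robot actually crossed
    [40.0, 1.5, 0.30],   # steep and rough, far from the traversed statistics
])
labels, dist = mahalanobis_labels(traversed, candidates)
```

In this sketch a cell counts as traversable when its features lie within the chosen distance of the statistics gathered along the driven trajectory; the actual criterion and threshold used for the dataset may differ.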
-
Joint Optimization of Model Partitioning and Resource Allocation for Anti-Jamming Collaborative Inference Systems
Authors:
Mengru Wu,
Jiawei Li,
Jiaqi Wei,
Bin Lyu,
Kai-Kit Wong,
Hyundong Shin
Abstract:
With the increasing computational demands of deep neural network (DNN) inference on resource-constrained devices, DNN partitioning-based device-edge collaborative inference has emerged as a promising paradigm. However, the transmission of intermediate feature data is vulnerable to malicious jamming, which significantly degrades the overall inference performance. To counter this threat, this letter focuses on an anti-jamming collaborative inference system in the presence of a malicious jammer. In this system, a DNN model is partitioned into two distinct segments, which are executed by wireless devices and edge servers, respectively. We first analyze the effects of jamming and DNN partitioning on inference accuracy via data regression. Based on this, our objective is to maximize the system's revenue of delay and accuracy (RDA) under inference accuracy and computing resource constraints by jointly optimizing computation resource allocation, devices' transmit power, and DNN partitioning. To address the mixed-integer nonlinear programming problem, we propose an efficient alternating optimization-based algorithm, which decomposes the problem into three subproblems that are solved via Karush-Kuhn-Tucker conditions, convex optimization methods, and a quantum genetic algorithm, respectively. Extensive simulations demonstrate that our proposed scheme outperforms baselines in terms of RDA.
Submitted 2 March, 2026;
originally announced March 2026.
-
Orchestrating Multimodal DNN Workloads in Wireless Neural Processing
Authors:
Sai Xu,
Kai-Kit Wong,
Yanan Du,
Hyundong Shin
Abstract:
In edge inference, wireless resource allocation and accelerator-level deep neural network (DNN) scheduling have yet to be co-optimized in an end-to-end manner. The lack of coordination between wireless transmission and accelerator-level DNN execution prevents efficient overlap, leading to higher end-to-end inference latency. To address this issue, this paper investigates multimodal DNN workload orchestration in wireless neural processing (WNP), a paradigm that integrates wireless transmission and multi-core accelerator execution into a unified end-to-end pipeline. First, we develop a unified communication-computation model for multimodal DNN execution and formulate the corresponding optimization problem. Second, we propose O-WiN, a framework that orchestrates DNN workloads in WNP through two tightly coupled stages: simulation-based optimization and runtime execution. Third, we develop two algorithms, RTFS and PACS. RTFS schedules communication and computation sequentially, whereas PACS interleaves them to enable pipeline parallelism by overlapping wireless data transfer with accelerator-level DNN execution. Simulation results demonstrate that PACS significantly outperforms RTFS under high modality heterogeneity by better masking wireless latency through communication-computation overlap, thereby highlighting the effectiveness of communication-computation pipelining in accelerating multimodal DNN execution in WNP.
Submitted 2 March, 2026;
originally announced March 2026.
-
SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World
Authors:
Jungho Kim,
Jiyong Oh,
Seunghoon Yu,
Hongjae Shin,
Donghyuk Kwak,
Jun Won Choi
Abstract:
The end-to-end (E2E) paradigm, which maps sensor inputs directly to driving decisions, has recently attracted significant attention due to its unified modeling capability and scalability. However, ensuring safety in this unified framework remains one of the most critical challenges. In this work, we propose SafeDrive, an E2E planning framework designed to perform explicit and interpretable safety reasoning through a trajectory-conditioned Sparse World Model. SafeDrive comprises two complementary networks: the Sparse World Network (SWNet) and the Fine-grained Reasoning Network (FRNet). SWNet constructs trajectory-conditioned sparse worlds that simulate the future behaviors of critical dynamic agents and road entities, providing interaction-centric representations for downstream reasoning. FRNet then evaluates agent-specific collision risks and temporal adherence to drivable regions, enabling precise identification of safety-critical events across future timesteps. SafeDrive achieves state-of-the-art performance on both open-loop and closed-loop benchmarks. On NAVSIM, it records a PDMS of 91.6 and an EPDMS of 87.5, with only 61 collisions out of 12,146 scenarios (0.5%). On Bench2Drive, SafeDrive attains a 66.8% driving score.
Submitted 31 March, 2026; v1 submitted 21 February, 2026;
originally announced February 2026.
-
Physics-Informed Laplace Neural Operator for Solving Partial Differential Equations
Authors:
Heechang Kim,
Qianying Cao,
Hyomin Shin,
Seungchul Lee,
George Em Karniadakis,
Minseok Choi
Abstract:
Neural operators have emerged as fast surrogate solvers for parametric partial differential equations (PDEs). However, purely data-driven models often require extensive training data and can generalize poorly, especially in small-data regimes and under unseen (out-of-distribution) input functions that are not represented in the training data. To address these limitations, we propose the Physics-Informed Laplace Neural Operator (PILNO), which enhances the Laplace Neural Operator (LNO) by embedding governing physics into training through PDE, boundary condition, and initial condition residuals. To improve expressivity, we first introduce an Advanced LNO (ALNO) backbone that retains a pole-residue transient representation while replacing the steady-state branch with an FNO-style Fourier multiplier. To make physics-informed training both data-efficient and robust, PILNO further leverages (i) virtual inputs: an unlabeled ensemble of input functions spanning a broad spectral range that provides abundant physics-only supervision and explicitly targets out-of-distribution (OOD) regimes; and (ii) temporal-causality weighting: a time-decaying reweighting of the physics residual that prioritizes early-time dynamics and stabilizes optimization for time-dependent PDEs. Across four representative benchmarks -- Burgers' equation, Darcy flow, a reaction-diffusion system, and a forced KdV equation -- PILNO consistently improves accuracy in small-data settings (e.g., N_train <= 27), reduces run-to-run variability across random seeds, and achieves stronger OOD generalization than purely data-driven baselines.
Submitted 13 February, 2026;
originally announced February 2026.
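The temporal-causality weighting described in the PILNO abstract, a time-decaying reweighting of the physics residual, might look something like the sketch below. The abstract does not give the weighting formula, so the exponential form, the `decay` parameter, and the function names are assumptions chosen purely for illustration.

```python
import numpy as np

def time_decay_weights(t, decay=2.0):
    """Hypothetical weights w(t) = exp(-decay * t / T), rescaled to mean 1.

    Assumes t.max() > 0. Earlier times receive larger weights.
    """
    t = np.asarray(t, dtype=float)
    w = np.exp(-decay * t / t.max())
    return w / w.mean()

def weighted_physics_loss(residual, t, decay=2.0):
    """Time-weighted mean-squared PDE residual.

    residual: array of shape (n_times, n_points) holding residual values.
    t: array of shape (n_times,) with the matching time stamps.
    """
    w = time_decay_weights(t, decay)
    per_time = np.mean(residual**2, axis=1)  # mean-squared residual per time step
    return float(np.mean(w * per_time))

t = np.linspace(0.0, 1.0, 5)
w = time_decay_weights(t)
loss = weighted_physics_loss(np.ones((5, 4)), t)
```

With weights normalized to mean 1, a residual that is uniform in time yields the same loss as the unweighted case, while errors at early times are penalized more heavily than errors at late times.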
-
EVOKE: Emotion Vocabulary Of Korean and English
Authors:
Yoonwon Jung,
Hagyeong Shin,
Benjamin K. Bergen
Abstract:
This paper introduces EVOKE (Emotion Vocabulary of Korean and English), a Korean-English parallel dataset of emotion words. The dataset offers comprehensive coverage of emotion words in each language, in addition to many-to-many translations between words in the two languages and identification of language-specific emotion words. The dataset contains 1,426 Korean words and 1,397 English words, and we systematically annotate 819 Korean and 924 English adjectives and verbs. We also annotate multiple meanings of each word and their relationships, identifying polysemous emotion words and emotion-related metaphors. The dataset is, to our knowledge, the most systematic and theory-agnostic dataset of emotion words in both Korean and English to date. It can serve as a practical tool for emotion science, psycholinguistics, computational linguistics, and natural language processing, allowing researchers to adopt different views on the resource reflecting their needs and theoretical perspectives. The dataset is publicly available at https://github.com/yoonwonj/EVOKE.
Submitted 9 April, 2026; v1 submitted 10 February, 2026;
originally announced February 2026.
-
SleepMaMi: A Universal Sleep Foundation Model for Integrating Macro- and Micro-structures
Authors:
Keondo Park,
Younghoon Na,
Yourim Choi,
Hyunwoo Ryu,
Hyun-Woo Shin,
Hyung-Sin Kim
Abstract:
While the shift toward unified foundation models has revolutionized many deep learning domains, sleep medicine remains largely restricted to task-specific models that focus on localized micro-structure features. These approaches often neglect the rich, multi-modal context of Polysomnography (PSG) and fail to capture the global macro-structure of a full night's sleep. To address this, we introduce SleepMaMi, a Sleep Foundation Model engineered to master both hour-long sleep architectures and fine-grained signal morphologies. Our framework utilizes a hierarchical dual-encoder design: a Macro-Encoder to model full-night temporal dependencies and a Micro-Encoder to capture short-term characteristics from biosignals. The Macro-Encoder is trained via Demographic-Guided Contrastive Learning, which aligns overnight sleep patterns with objective subject metadata, such as age, sex, and BMI, to refine global representations. The Micro-Encoder is optimized via a hybrid Masked Autoencoder (MAE) and multi-modal contrastive objective. Pre-trained on a massive corpus of >20,000 PSG recordings (158K hours), SleepMaMi outperforms existing foundation models across a diverse suite of downstream tasks, demonstrating superior generalizability and label-efficient adaptation for clinical sleep analysis.
Submitted 7 February, 2026;
originally announced February 2026.
-
AI-Limited Fluid Antenna-Aided Integrated Sensing and Communication Systems
Authors:
Farshad Rostami Ghadi,
Kai-Kit Wong,
F. Javier Lopez-Martinez,
Zhentian Zhang,
Hyundong Shin,
Christos Masouros
Abstract:
This paper characterizes the fundamental limits of integrated sensing and communication (ISAC) when the transmitter is subject to an artificial intelligence (AI) representation bottleneck and the receiver employs a fluid antenna system (FAS). Specifically, the message is first encoded into an ideal Gaussian waveform and mapped by an AI encoder into a finite-capacity latent representation that constitutes the physical channel input, while the FAS receiver selects the port experiencing the most favorable channel conditions. We reveal that the AI bottleneck is equivalent to an additive representation noise, which reduces both the communication and sensing signal-to-noise ratios (SNRs) at the selected port. We then derive the resulting ISAC capacity-distortion region and establish tight converse and achievability bounds under general fading models, including Jakes-correlated channels. Leveraging the spatial degrees of freedom (DoF) characterization of Jakes' model, we furthermore prove that the port-selection gain is fundamentally constrained by the physical length of the FAS region: the effective diversity order equals the numerical rank of the Jakes correlation matrix and increases only with the FAS length. Consequently, enlarging the FAS length allows the selected-port SNR to approach the AI-imposed ceiling, driving the achievable communication rate and sensing mean square error (MSE) toward their AI-limited fundamental bounds. Numerical results corroborate the analysis and scaling laws.
Submitted 5 February, 2026;
originally announced February 2026.
-
Learning-based Adaptive Control of Quadruped Robots for Active Stabilization on Moving Platforms
Authors:
Minsung Yoon,
Heechan Shin,
Jeil Jeong,
Sung-Eui Yoon
Abstract:
A quadruped robot faces balancing challenges on a six-degrees-of-freedom moving platform, such as a subway, bus, airplane, or yacht, due to independent platform motions and the resulting diverse inertial forces on the robot. To alleviate these challenges, we present Learning-based Active Stabilization on Moving Platforms (LAS-MP), featuring a self-balancing policy and system state estimators. The policy adaptively adjusts the robot's posture in response to the platform's motion. The estimators infer robot and platform states based on proprioceptive sensor data. For a systematic training scheme across various platform motions, we introduce platform trajectory generation and scheduling methods. Our evaluation demonstrates superior balancing performance across multiple metrics compared to three baselines. Furthermore, we conduct a detailed analysis of LAS-MP, including ablation studies and evaluation of the estimators, to validate the effectiveness of each component.
Submitted 8 February, 2026; v1 submitted 3 February, 2026;
originally announced February 2026.
-
Layer-wise Swapping for Generalizable Multilingual Safety
Authors:
Hyunseo Shin,
Wonseok Hwang
Abstract:
Despite the rapid advancements of Large Language Models (LLMs), safety risks remain a critical challenge for low-resource languages. Existing safety datasets are predominantly English-centric, limiting progress in multilingual safety alignment. As a result, low-resource expert models, fine-tuned on their respective instruction datasets, tend to exhibit higher unsafety rates compared to their high-resource counterparts. In this work, we propose a safety-aware layer swapping method that transfers safety alignment from an English safety expert to low-resource language experts without additional training. To further enhance transfer ability, our method adaptively selects or blends modules based on their degree of specialization. Our approach preserves performance on general language understanding tasks while enhancing safety in the target languages. Experimental results show that the proposed method achieves comparable performance to the language expert on general benchmarks such as MMMLU, BELEBELE, and MGSM, while producing more aligned and less harmful responses on the MultiJail safety benchmark.
Submitted 13 February, 2026; v1 submitted 30 January, 2026;
originally announced January 2026.
-
IROS: A Dual-Process Architecture for Real-Time VLM-Based Indoor Navigation
Authors:
Joonhee Lee,
Hyunseung Shin,
Jeonggil Ko
Abstract:
Indoor mobile robot navigation requires fast responsiveness and robust semantic understanding, yet existing methods struggle to provide both. Classical geometric approaches such as SLAM offer reliable localization but depend on detailed maps and cannot interpret human-targeted cues (e.g., signs, room numbers) essential for indoor reasoning. Vision-Language-Action (VLA) models introduce semantic grounding but remain strictly reactive, basing decisions only on visible frames and failing to anticipate unseen intersections or reason about distant textual cues. Vision-Language Models (VLMs) provide richer contextual inference but suffer from high computational latency, making them unsuitable for real-time operation on embedded platforms. In this work, we present IROS, a real-time navigation framework that combines VLM-level contextual reasoning with the efficiency of lightweight perceptual modules on low-cost, on-device hardware. Inspired by Dual Process Theory, IROS separates fast reflexive decisions (System One) from slow deliberative reasoning (System Two), invoking the VLM only when necessary. Furthermore, by augmenting compact VLMs with spatial and textual cues, IROS delivers robust, human-like navigation with minimal latency. Across five real-world buildings, IROS improves decision accuracy and reduces latency by 66% compared to continuous VLM-based navigation.
Submitted 29 January, 2026;
originally announced January 2026.
-
Finite-Aperture Fluid Antenna Array Design: Analysis and Algorithm
Authors:
Zhentian Zhang,
Kai-Kit Wong,
Hao Jiang,
Farshad Rostami Ghadi,
Hyundong Shin,
Yangyang Zhang
Abstract:
Finite-aperture constraints render array design nontrivial and can undermine the effectiveness of classical sparse geometries. This letter provides universal guidance for fluid antenna array (FAA) design under a fixed aperture. We derive a closed-form Cramér-Rao bound (CRB) that unifies conventional and reconfigurable arrays by explicitly linking the Fisher information to the geometric variance of port locations. We further obtain a closed-form probability density function of the minimum spacing under random FAA placement, which yields a principled lower bound for the minimum-spacing constraint. Building upon these analytical insights, we then propose a gradient-based algorithm to optimize continuous port locations. Utilizing a simple gradient update design, the optimized FAA can achieve about a 30% CRB reduction and a 42.5% reduction in mean-squared error.
Submitted 26 January, 2026;
originally announced January 2026.
-
MANGO: A Global Single-Date Paired Dataset for Mangrove Segmentation
Authors:
Junhyuk Heo,
Beomkyu Choi,
Hyunjin Shin,
Darongsae Kwon
Abstract:
Mangroves are critical for climate-change mitigation, requiring reliable monitoring for effective conservation. While deep learning has emerged as a powerful tool for mangrove detection, its progress is hindered by the limitations of existing datasets. In particular, many resources provide only annual map products without curated single-date image-mask pairs, are limited to specific regions rather than global coverage, or remain inaccessible to the public. To address these challenges, we introduce MANGO, a large-scale global dataset comprising 42,703 labeled image-mask pairs across 124 countries. To construct this dataset, we retrieve all available Sentinel-2 imagery within the year 2020 for mangrove regions and select the best single-date observations that align with the mangrove annual mask. This selection is performed using a target detection-driven approach that leverages pixel-wise coordinate references to ensure adaptive and representative image-mask pairings. We also provide a benchmark across diverse semantic segmentation architectures under a country-disjoint split, establishing a foundation for scalable and reliable global mangrove monitoring.
Submitted 20 January, 2026;
originally announced January 2026.
-
Improving the Accuracy of Community Detection on Signed Networks via Community Refinement and Contrastive Learning
Authors:
Hyunuk Shin,
Hojin Kim,
Chanyoung Lee,
Yeon-Chang Lee,
David Yoon Suk Kang
Abstract:
Community detection (CD) on signed networks is crucial for understanding how positive and negative relations jointly shape network structure. However, existing CD methods often yield inconsistent communities due to noisy or conflicting edge signs. In this paper, we propose ReCon, a model-agnostic post-processing framework that progressively refines community structures through four iterative steps: (1) structural refinement, (2) boundary refinement, (3) contrastive learning, and (4) clustering. Extensive experiments on eighteen synthetic and four real-world networks using four CD methods demonstrate that ReCon consistently enhances community detection accuracy, serving as an effective and easily integrable solution for reliable CD across diverse network properties.
Submitted 22 January, 2026;
originally announced January 2026.
-
Resource Allocation and Sharing for UAV-Assisted Integrated TN-NTN with Multi-Connectivity
Authors:
Abd Ullah Khan,
Wali Ullah Khan,
Haejoon Jung,
Hyundong Shin
Abstract:
Unmanned aerial vehicles (UAVs) with multi-connectivity (MC) capabilities efficiently and reliably transfer data between terrestrial networks (TNs) and non-terrestrial networks (NTNs). However, optimally sharing and allocating spectrum and power resources to maintain MC while ensuring reliable connectivity and optimal performance remains challenging in such networks. Channel variations induced by mobility in UAV networks, coupled with the varying quality of service (QoS) demands of heterogeneous devices, resource sharing, and fairness requirements in capacity distribution pose challenges to optimal resource allocation. Thus, this paper investigates resource allocation for QoS-constrained, MC-enabled, dynamic UAVs in an integrated TN-NTN environment with spectrum sharing and fairness considerations. To this end, we consider three types of links: UAV-to-radio base station (RBS), UAV-to-UAV, and UAV-to-HAP. We also assume two types of UAVs with diverse QoS requirements to reflect a practical scenario. Consequently, we propose two algorithms. The first algorithm maximizes the capacity of the UAV-RBS and UAV-HAP links while ensuring the reliability of the UAV-UAV link. To achieve this, the algorithm maximizes the collective throughput of the UAVs by optimizing the sum capacity of all the UAV-RBS and UAV-HAP links. Next, to provide constant capacity to all links and ensure fairness, we propose another algorithm that maximizes the minimum capacity across all links. We validate the performance of both algorithms through simulation.
Submitted 25 January, 2026; v1 submitted 21 January, 2026;
originally announced January 2026.
-
How Diplomacy Reshapes Online Discourse: Asymmetric Persistence in Online Framing of North Korea
Authors:
Hunjun Shin,
Hoonbae Moon,
Mohit Singhal
Abstract:
Public opinion toward foreign adversaries shapes and constrains diplomatic options. Prior research has largely relied on sentiment analysis and survey-based measures, providing limited insight into how sustained narrative changes (beyond transient emotional reactions) might follow diplomatic engagement. This study examines the extent to which high-stakes diplomatic summits shape how adversaries are framed in online discourse. We analyze U.S.-North Korea summit diplomacy (2018-2019) using a Difference-in-Differences (DiD) design on Reddit discussions. Using multiple control groups (China, Iran, Russia) to adjust for concurrent geopolitical shocks, we integrate a validated Codebook LLM framework for framing classification with graph-based discourse network analysis that examines both edge-level relationships and community-level narrative structures. Our results reveal short-term asymmetric persistence in framing responses to diplomacy. While both post-level and comment-level sentiment proved transient (improving during the Singapore Summit but fully reverting after the Hanoi failure), framing exhibited significant stability: the shift from threat-oriented to diplomacy-oriented framing was only partially reversed. Structurally, the proportion of threat-oriented edges decreased substantially (48% -> 28%) while diplomacy-oriented structures expanded, and these shifts resisted complete reversion after diplomatic failure. These findings suggest that diplomatic success can leave a short-term but lasting imprint on how adversaries are framed in online discourse, even when subsequent negotiations fail.
Submitted 14 January, 2026;
originally announced January 2026.
-
OSCAR: Optical-aware Semantic Control for Aleatoric Refinement in SAR-to-Optical Translation
Authors:
Hyunseo Lee,
Sang Min Kim,
Ho Kyung Shin,
Taeheon Kim,
Woo-Jeoung Nam
Abstract:
Synthetic Aperture Radar (SAR) provides robust all-weather imaging capabilities; however, translating SAR observations into photo-realistic optical images remains a fundamentally ill-posed problem. Current approaches are often hindered by the inherent speckle noise and geometric distortions of SAR data, which frequently result in semantic misinterpretation, ambiguous texture synthesis, and structural hallucinations. To address these limitations, a novel SAR-to-Optical (S2O) translation framework is proposed, integrating three core technical contributions: (i) Cross-Modal Semantic Alignment, which establishes an Optical-Aware SAR Encoder by distilling robust semantic priors from an Optical Teacher into a SAR Student; (ii) Semantically-Grounded Generative Guidance, realized by a Semantically-Grounded ControlNet that integrates class-aware text prompts for global context with hierarchical visual prompts for local spatial guidance; and (iii) an Uncertainty-Aware Objective, which explicitly models aleatoric uncertainty to dynamically modulate the reconstruction focus, effectively mitigating artifacts caused by speckle-induced ambiguity. Extensive experiments demonstrate that the proposed method achieves superior perceptual quality and semantic consistency compared to state-of-the-art approaches.
Submitted 11 January, 2026;
originally announced January 2026.
-
Adaptive Hybrid Optimizer based Framework for Lumpy Skin Disease Identification
Authors:
Ubaidullah,
Muhammad Abid Hussain,
Mohsin Raza Jafri,
Rozi Khan,
Moid Sandhu,
Abd Ullah Khan,
Hyundong Shin
Abstract:
Lumpy Skin Disease (LSD) is a contagious viral infection that significantly deteriorates livestock health, thereby posing a serious threat to the global economy and food security. Owing to its rapid spread, early and precise identification is crucial to prevent outbreaks and ensure timely intervention. In this paper, we propose a hybrid deep learning-based approach called LUMPNet for the early detection of LSD. LUMPNet utilizes image data to detect and classify skin nodules, the primary indicator of LSD. To this end, LUMPNet uses YOLOv11, an EfficientNet-based CNN classifier with compound scaling, and a novel adaptive hybrid optimizer. More precisely, LUMPNet detects and localizes LSD skin nodules and lesions on cattle images. It exploits EfficientNet to classify the localized cattle images into LSD-affected or healthy categories. To stabilize and accelerate the training of the hybrid YOLOv11 and EfficientNet model, a novel adaptive hybrid optimizer is proposed and utilized. We evaluate LUMPNet at various stages of LSD using a publicly available dataset. Results indicate that the proposed scheme achieves 99% LSD detection training accuracy and outperforms existing schemes. The model also achieves a validation accuracy of 98%. Moreover, for further evaluation, we conduct a case study using an optimized EfficientNet-B0 model trained with the AdamW optimizer, and compare its performance with LUMPNet. The results show that LUMPNet achieves superior performance.
Submitted 5 January, 2026;
originally announced January 2026.
-
Sparse Bayesian Message Passing under Structural Uncertainty
Authors:
Yoonhyuk Choi,
Jiho Choi,
Chanran Kim,
Yumin Lee,
Hawon Shin,
Yeowon Jeon,
Minjeong Kim,
Jiwoo Kang
Abstract:
Semi-supervised learning on real-world graphs is frequently challenged by heterophily, where the observed graph is unreliable or label-disassortative. Many existing graph neural networks either rely on a fixed adjacency structure or attempt to handle structural noise through regularization. In this work, we explicitly capture structural uncertainty by modeling a posterior distribution over signed adjacency matrices, allowing each edge to be positive, negative, or absent. We propose a sparse signed message passing network that is naturally robust to edge noise and heterophily, which can be interpreted from a Bayesian perspective. By combining (i) posterior marginalization over signed graph structures with (ii) sparse signed message aggregation, our approach offers a principled way to handle both edge noise and heterophily. Experimental results demonstrate that our method outperforms strong baseline models on heterophilic benchmarks under both synthetic and real-world structural noise.
Submitted 3 January, 2026;
originally announced January 2026.
-
Quantum Intelligence Meets BD-RIS-Enabled AmBC: Challenges, Opportunities, and Practical Insights
Authors:
Abd Ullah Khan,
Uman Khalid,
Trung Q. Duong,
Hyundong Shin
Abstract:
A beyond-diagonal reconfigurable intelligent surface (BD-RIS) is an innovative type of reconfigurable intelligent surface (RIS) that has recently been proposed and is considered a revolutionary advancement in wave manipulation. Unlike the mutually disconnected arrangement of elements in traditional RISs, BD-RIS creates cost-effective and simple inter-element connections, allowing for greater freedom in configuring the amplitude and phase of impinging waves. However, there are numerous underlying challenges in realizing the advantages associated with BD-RIS, prompting the research community to actively investigate cutting-edge schemes and algorithms in this direction. Particularly, the passive beamforming design for BD-RIS under specific environmental conditions has become a major focus in this research area. In this article, we provide a systematic introduction to BD-RIS, elaborating on its functional principles concerning architectural design, promising advantages, and classification. Subsequently, we present recent advances and identify a series of challenges and opportunities. Additionally, we consider a specific case study where beamforming is designed using four different algorithms, and we analyze their performance with respect to sum rate and computation cost. To augment the beamforming capabilities in 6G BD-RIS with quantum enhancement, we analyze various hybrid quantum-classical machine learning (ML) models to improve beam prediction performance, employing real-world communication Scenario 8 from the DeepSense 6G dataset. Consequently, we derive useful insights about the practical implications of BD-RIS.
Submitted 31 December, 2025; v1 submitted 29 December, 2025;
originally announced December 2025.
-
Multiconnectivity for SAGIN: Current Trends, Challenges, AI-driven Solutions, and Opportunities
Authors:
Abd Ullah Khan,
Adnan Shahid,
Haejoon Jung,
Hyundong Shin
Abstract:
Space-air-ground-integrated network (SAGIN)-enabled multiconnectivity (MC) is emerging as a key enabler for next-generation networks, enabling users to simultaneously utilize multiple links across multi-layer non-terrestrial networks (NTN) and multi-radio access technology (multi-RAT) terrestrial networks (TN). However, the heterogeneity of TN and NTN introduces complex architectural challenges that complicate MC implementation. Specifically, the diversity of link types, spanning air-to-air, air-to-space, space-to-space, space-to-ground, and ground-to-ground communications, renders optimal resource allocation highly complex. Recent advancements in reinforcement learning (RL) and agentic artificial intelligence (AI) have shown remarkable effectiveness in optimal decision-making in complex and dynamic environments. In this paper, we review the current developments in SAGIN-enabled MC and outline the key challenges associated with its implementation. We further highlight the transformative potential of AI-driven approaches for resource optimization in a heterogeneous SAGIN environment. To this end, we present a case study on resource allocation optimization enabled by agentic RL for SAGIN-enabled MC involving diverse radio access technologies (RATs). Results show that learning-based methods can effectively handle complex scenarios and substantially enhance network performance in terms of latency and capacity while incurring a moderate increase in power consumption as an acceptable tradeoff. Finally, open research problems and future directions are presented to realize efficient SAGIN-enabled MC.
Submitted 25 January, 2026; v1 submitted 25 December, 2025;
originally announced December 2025.
-
Tracing Energy Flow: Learning Tactile-based Grasping Force Control to Prevent Slippage in Dynamic Object Interaction
Authors:
Cheng-Yu Kuo,
Hirofumi Shin,
Takamitsu Matsubara
Abstract:
Regulating grasping force to reduce slippage during dynamic object interaction remains a fundamental challenge in robotic manipulation, especially when objects are manipulated by multiple rolling contacts, have unknown properties (such as mass or surface conditions), and when external sensing is unreliable. In contrast, humans can quickly regulate grasping force by touch, even without visual cues. Inspired by this ability, we aim to enable robotic hands to rapidly explore objects and learn tactile-driven grasping force control under motion and limited sensing. We propose a physics-informed energy abstraction that models the object as a virtual energy container. The inconsistency between the fingers' applied power and the object's retained energy provides a physically grounded signal for inferring slip-aware stability. Building on this abstraction, we employ model-based learning and planning to efficiently model energy dynamics from tactile sensing and perform real-time grasping force optimization. Experiments in both simulation and hardware demonstrate that our method can learn grasping force control from scratch within minutes, effectively reduce slippage, and extend grasp duration across diverse motion-object pairs, all without relying on external sensing or prior object knowledge.
Submitted 24 December, 2025;
originally announced December 2025.
-
Prototype-Guided Non-Exemplar Continual Learning for Cross-subject EEG Decoding
Authors:
Dan Li,
Hye-Bin Shin,
Yeon-Woo Choi
Abstract:
Due to the significant variability in electroencephalogram (EEG) signals across individuals, knowledge acquired from previous subjects is often overwritten as new subjects are introduced in continual EEG decoding tasks. Existing methods mainly rely on storing historical data from seen subjects as replay buffers to mitigate forgetting, which is impractical under privacy or memory constraints. To address this issue, we propose a Prototype-guided Non-Exemplar Continual Learning (ProNECL) framework that preserves prior knowledge without accessing historical EEG samples. ProNECL summarizes subject-specific discriminative representations into class-level prototypes and incrementally aligns new subject representations with a global prototype memory through prototype-based feature regularization and cross-subject alignment. Experiments on the BCI Competition IV 2a and 2b datasets demonstrate that ProNECL effectively balances knowledge retention and adaptability, achieving superior performance in cross-subject continual EEG decoding tasks.
Submitted 15 January, 2026; v1 submitted 24 November, 2025;
originally announced November 2025.
-
Accelerating Reinforcement Learning via Error-Related Human Brain Signals
Authors:
Suzie Kim,
Hye-Bin Shin,
Hyo-Jeong Jang
Abstract:
In this work, we investigate how implicit neural feedback can accelerate reinforcement learning in complex robotic manipulation settings. While prior electroencephalogram (EEG)-guided reinforcement learning studies have primarily focused on navigation or low-dimensional locomotion tasks, we aim to understand whether such neural evaluative signals can improve policy learning in high-dimensional manipulation tasks involving obstacles and precise end-effector control. We integrate error-related potentials decoded from offline-trained EEG classifiers into reward shaping and systematically evaluate the impact of human-feedback weighting. Experiments on a 7-DoF manipulator in an obstacle-rich reaching environment show that neural feedback accelerates reinforcement learning and, depending on the human-feedback weighting, can yield task success rates that at times exceed those of sparse-reward baselines. Moreover, when applying the best-performing feedback weighting across all subjects, we observe consistent acceleration of reinforcement learning relative to the sparse-reward setting. Furthermore, leave-one-subject-out evaluations confirm that the proposed framework remains robust despite the intrinsic inter-individual variability in EEG decodability. Our findings demonstrate that EEG-based reinforcement learning can scale beyond locomotion tasks and provide a viable pathway for human-aligned manipulation skill acquisition.
Submitted 24 November, 2025;
originally announced November 2025.
-
Multi-Port Selection for FAMA: Massive Connectivity with Fewer RF Chains than Users
Authors:
Hanjiang Hong,
Kai-Kit Wong,
Xusheng Zhu,
Hao Xu,
Han Xiao,
Farshad Rostami Ghadi,
Hyundong Shin
Abstract:
Fluid antenna multiple access (FAMA) is an emerging technology in massive access designed to meet the demands of future wireless communication networks by naturally mitigating multiuser interference through the utilization of the fluid antenna system (FAS) at RF-chain-limited mobile devices. The transition from single-active-port to multi-active-port on a shared RF chain for slow FAMA can greatly enhance its multiplexing capability but is not well understood. Motivated by this, this paper proposes and studies three port selection methods: the optimal exhaustive-search port selection (EPS) as a performance upper bound, and two suboptimal, low-complexity algorithms, namely incremental port selection (IPS) and decremental port selection (DPS). Then, the performance of multi-active-port slow FAMA is analyzed, and the complexity of the proposed methods is compared. Simulation results indicate that the proposed methods outperform current state-of-the-art multi-port FAMA techniques. In particular, IPS achieves near-optimal performance while maintaining manageable computational complexity. This research provides a more general framework for port selection in FAMA systems.
Submitted 21 November, 2025;
originally announced November 2025.
-
Fluid Antenna System-Enabled UAV-to-Ground Communications
Authors:
Xusheng Zhu,
Kai-Kit Wong,
Qingqing Wu,
Hyundong Shin,
Yangyang Zhang
Abstract:
Fluid antenna systems (FAS) have emerged as a revolutionary technology offering enhanced spatial diversity within a compact form factor. Concurrently, unmanned aerial vehicles (UAVs) are integral to future networks, necessitating channel models that capture both multipath fading and shadowing. This letter presents a novel performance analysis of a UAV-to-ground link, where the receiver is equipped with an $N$-port FAS operating over the challenging double-shadowing fading channel. By adapting a tractable eigenvalue-based approximation for the correlated FAS ports, we derive new analytical expressions for the end-to-end signal-to-noise ratio statistics, namely the cumulative distribution function and the probability density function. Based on these statistics, we present exact integral expressions for the outage probability, average bit error rate, and average channel capacity. We further derive new, tractable closed-form solutions for the average bit error rate and capacity for the practical dual-rank, independent but non-identically distributed case. Finally, a key asymptotic analysis reveals that the system achieves a multiplicative diversity order of $G_d = M \times d$, which is precisely the product of the FAS spatial rank $M$ and the intrinsic channel diversity order $d$. Simulation results are provided to validate the high accuracy of our entire theoretical framework.
Submitted 21 November, 2025;
originally announced November 2025.
-
Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset
Authors:
Geon Choi,
Hangyul Yoon,
Hyunju Shin,
Hyunki Park,
Sang Hoon Seo,
Eunho Yang,
Edward Choi
Abstract:
The applicability of current lesion segmentation models for chest X-rays (CXRs) has been limited both by a small number of target labels and the reliance on complex, expert-level text inputs, creating a barrier to practical use. To address these limitations, we introduce instruction-guided lesion segmentation (ILS), a medical-domain adaptation of referring image segmentation (RIS) designed to segment diverse lesion types based on simple, user-friendly instructions. Under this task, we construct MIMIC-ILS, the first large-scale instruction-answer dataset for CXR lesion segmentation, using our fully automated multimodal pipeline that generates annotations from CXR images and their corresponding reports. MIMIC-ILS contains 1.1M instruction-answer pairs derived from 192K images and 91K unique segmentation masks, covering seven major lesion types. To empirically demonstrate its utility, we present ROSALIA, a LISA model fine-tuned on the MIMIC-ILS dataset. ROSALIA can segment diverse lesions and provide textual explanations in response to user instructions. The model achieves high accuracy in our newly proposed task, highlighting the effectiveness of our pipeline and the value of MIMIC-ILS as a foundational resource for pixel-level CXR lesion grounding. The dataset and model are available at https://github.com/checkoneee/ROSALIA.
Submitted 26 March, 2026; v1 submitted 19 November, 2025;
originally announced November 2025.
-
Cross-Modal Consistency-Guided Active Learning for Affective BCI Systems
Authors:
Hyo-Jeong Jang,
Hye-Bin Shin,
Kang Yin
Abstract:
Deep learning models perform best with abundant, high-quality labels, yet such conditions are rarely achievable in EEG-based emotion recognition. Electroencephalogram (EEG) signals are easily corrupted by artifacts and individual variability, while emotional labels often stem from subjective and inconsistent reports, making robust affective decoding particularly difficult. We propose an uncertainty-aware active learning framework that enhances robustness to label noise by jointly leveraging model uncertainty and cross-modal consistency. Instead of relying solely on EEG-based uncertainty estimates, the method evaluates cross-modal alignment to determine whether uncertainty originates from cognitive ambiguity or sensor noise. A representation alignment module embeds EEG and face features into a shared latent space, enforcing semantic coherence between modalities. Residual discrepancies are treated as noise-induced inconsistencies, and these samples are selectively queried for oracle feedback during active learning. This feedback-driven process guides the network toward reliable, informative samples and reduces the impact of noisy labels. Experiments on the ASCERTAIN dataset examine the efficiency and robustness of our framework, highlighting its potential as a data-efficient and noise-tolerant approach for EEG-based affective decoding in brain-computer interface systems.
Submitted 19 November, 2025;
originally announced November 2025.
-
NeuroLex: A Lightweight Domain Language Model for EEG Report Understanding and Generation
Authors:
Kang Yin,
Hye-Bin Shin
Abstract:
Clinical electroencephalogram (EEG) reports encode domain-specific linguistic conventions that general-purpose language models (LMs) fail to capture. We introduce NeuroLex, a lightweight domain-adaptive language model trained purely on EEG report text from the Harvard Electroencephalography Database. Unlike existing biomedical LMs, NeuroLex is tailored to the linguistic and diagnostic characteristics of EEG reporting, enabling it to serve as both an independent textual model and a decoder backbone for multimodal EEG-language systems. Using span-corruption pretraining and instruction-style fine-tuning on report polishing, paragraph summarization, and terminology question answering, NeuroLex learns the syntax and reasoning patterns characteristic of EEG interpretation. Comprehensive evaluations show that it achieves lower perplexity, higher extraction and summarization accuracy, better label efficiency, and improved robustness to negation and factual hallucination compared with general models of the same scale. With an EEG-aware linguistic backbone, NeuroLex bridges biomedical text modeling and brain-computer interface applications, offering a foundation for interpretable and language-driven neural decoding.
Submitted 16 November, 2025;
originally announced November 2025.
-
Toward Adaptive BCIs: Enhancing Decoding Stability via User State-Aware EEG Filtering
Authors:
Yeon-Woo Choi,
Hye-Bin Shin,
Dan Li
Abstract:
Brain-computer interfaces (BCIs) often suffer from limited robustness and poor long-term adaptability. Model performance rapidly degrades when user attention fluctuates, brain states shift over time, or irregular artifacts appear during interaction. To mitigate these issues, we introduce a user state-aware electroencephalogram (EEG) filtering framework that refines neural representations before decoding user intentions. The proposed method continuously estimates the user's cognitive state (e.g., focus or distraction) from EEG features and filters unreliable segments by applying adaptive weighting based on the estimated attention level. This filtering stage suppresses noisy or out-of-focus epochs, thereby reducing distributional drift and improving the consistency of subsequent decoding. Experiments on multiple EEG datasets that emulate real BCI scenarios demonstrate that the proposed state-aware filtering enhances classification accuracy and stability across different user states and sessions compared with conventional preprocessing pipelines. These findings highlight that leveraging brain-derived state information--even without additional user labels--can substantially improve the reliability of practical EEG-based BCIs.
Submitted 11 November, 2025;
originally announced November 2025.
-
RCScore: Quantifying Response Consistency in Large Language Models
Authors:
Dongjun Jang,
Youngchae Ahn,
Hyopil Shin
Abstract:
Current LLM evaluations often rely on a single instruction template, overlooking models' sensitivity to instruction style, a critical aspect for real-world deployments. We present RCScore, a multi-dimensional framework quantifying how instruction formulation affects model responses. By systematically transforming benchmark problems into multiple instruction styles, RCScore reveals performance variations undetected by conventional metrics. Our experiments across ten LLMs on four reasoning benchmarks demonstrate that instruction style can shift accuracy by up to 16.7 percentage points. We introduce Cross-Response Similarity (CRS), a method applying RCScore metrics to measure stylistic self-consistency, and establish its strong correlation with task accuracy, suggesting consistency as a valuable proxy for model reliability. Additional findings show that deterministic decoding produces more stylistically stable outputs, and model scale correlates positively with cross-style consistency. RCScore offers a principled approach to assess instruction robustness.
Submitted 30 October, 2025;
originally announced October 2025.
-
Dual Mixture-of-Experts Framework for Discrete-Time Survival Analysis
Authors:
Hyeonjun Lee,
Hyungseob Shin,
Gunhee Nam,
Hyeonsoo Lee
Abstract:
Survival analysis is the task of modeling the time until an event of interest occurs, and it is widely used in clinical and biomedical research. A key challenge is to model patient heterogeneity while also adapting risk predictions to both individual characteristics and temporal dynamics. We propose a dual mixture-of-experts (MoE) framework for discrete-time survival analysis. Our approach combines a feature-encoder MoE for subgroup-aware representation learning with a hazard MoE that leverages patient features and time embeddings to capture temporal dynamics. This dual-MoE design flexibly integrates with existing deep-learning-based survival pipelines. On the METABRIC and GBSG breast cancer datasets, our method consistently improves performance, boosting the time-dependent C-index by up to 0.04 on the test sets, and yields further gains when incorporated into the Consurv framework.
Submitted 29 October, 2025;
originally announced October 2025.
-
PANORAMA: A Dataset and Benchmarks Capturing Decision Trails and Rationales in Patent Examination
Authors:
Hyunseung Lim,
Sooyohn Nam,
Sungmin Na,
Ji Yong Cho,
June Yong Yang,
Hyungyu Shin,
Yoonjoo Lee,
Juho Kim,
Moontae Lee,
Hwajung Hong
Abstract:
Patent examination remains an ongoing challenge in the NLP literature even after the advent of large language models (LLMs), as it requires extensive yet nuanced human judgment on whether a submitted claim meets the statutory standards of novelty and non-obviousness against previously granted claims (prior art) in expert domains. Previous NLP studies have approached this challenge as a prediction task (e.g., forecasting grant outcomes) with high-level proxies such as similarity metrics or classifiers trained on historical labels. However, this approach often overlooks the step-by-step evaluations that examiners must make with in-depth information, including the rationales for decisions provided in office action documents, which also makes it harder to measure the current state of techniques in patent review processes. To fill this gap, we construct PANORAMA, a dataset of 8,143 U.S. patent examination records that preserves the full decision trails, including original applications, all cited references, Non-Final Rejections, and Notices of Allowance. In addition, PANORAMA decomposes the trails into sequential benchmarks that emulate patent professionals' review processes and allow researchers to examine large language models' capabilities at each step. Our findings indicate that, although LLMs are relatively effective at retrieving relevant prior art and pinpointing the pertinent paragraphs, they struggle to assess the novelty and non-obviousness of patent claims. We discuss these results and argue that advancing NLP, including LLMs, in the patent domain requires a deeper understanding of real-world patent examination. Our dataset is openly available at https://huggingface.co/datasets/LG-AI-Research/PANORAMA.
Submitted 24 October, 2025;
originally announced October 2025.
-
Cluster-wise processing in fronthaul-aware cell-free massive MIMO systems
Authors:
Zahra Mobini,
Ahmet Hasim Gokceoglu,
Li Wang,
Gunnar Peters,
Hyundong Shin,
Hien Quoc Ngo
Abstract:
We exploit a general cluster-based network architecture for a fronthaul-limited user-centric cell-free massive multiple-input multiple-output (CF-mMIMO) system under different degrees of cooperation among the access points (APs) to achieve scalable implementation. In particular, we consider a CF-mMIMO system wherein the available APs are grouped into multiple processing clusters (PCs) to share channel state information (CSI), ensuring that they have knowledge of the CSI for all users assigned to the given cluster for the purposes of designing resource allocation and precoding. We utilize the sum pseudo-SE metric, which accounts for intra-cluster interference and inter-cluster leakage, providing a close approximation to the true sum achievable SE. For a given PC, we formulate two optimization problems to maximize the cluster-wise weighted sum pseudo-SE under fronthaul constraints, relying solely on local CSI. These optimization problems are associated with different computational complexity requirements. The first optimization problem jointly designs precoding, user association, and power allocation, and is performed at the small-scale fading time scale. The second optimization problem optimizes user association and power allocation at the large-scale fading time scale. Accordingly, we develop a novel application of a modified weighted minimum mean square error (WMMSE)-based approach to solve the challenging non-convex mixed-integer problems formulated above.
Submitted 18 October, 2025;
originally announced October 2025.
-
Exploring Conditions for Diffusion models in Robotic Control
Authors:
Heeseong Shin,
Byeongho Heo,
Dongyoon Han,
Seungryong Kim,
Taekyung Kim
Abstract:
While pre-trained visual representations have significantly advanced imitation learning, they are often task-agnostic as they remain frozen during policy learning. In this work, we explore leveraging pre-trained text-to-image diffusion models to obtain task-adaptive visual representations for robotic control, without fine-tuning the model itself. However, we find that naively applying textual conditions, a successful strategy in other vision domains, yields minimal or even negative gains in control tasks. We attribute this to the domain gap between the diffusion model's training data and robotic control environments, leading us to argue for conditions that consider the specific, dynamic visual information required for control. To this end, we propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details. Through facilitating task-adaptive representations with our newly devised conditions, our approach achieves state-of-the-art performance on various robotic control benchmarks, significantly surpassing prior methods.
Submitted 8 April, 2026; v1 submitted 17 October, 2025;
originally announced October 2025.
-
XGrasp: Gripper-Aware Grasp Detection with Multi-Gripper Data Generation
Authors:
Yeonseo Lee,
Jungwook Mun,
Hyosup Shin,
Guebin Hwang,
Junhee Nam,
Taeyeop Lee,
Sungho Jo
Abstract:
Real-world robotic systems frequently require diverse end-effectors for different tasks; however, most existing grasp detection methods are optimized for a single gripper type, demanding retraining or optimization for each novel gripper configuration. This gripper-specific retraining paradigm is neither scalable nor practical. We propose XGrasp, a real-time gripper-aware grasp detection framework that generalizes to novel gripper configurations without additional training or optimization. To resolve data scarcity, we augment existing single-gripper datasets with multi-gripper annotations by incorporating the physical characteristics and closing trajectories of diverse grippers. Each gripper is represented as a two-channel 2D image encoding its static shape (Gripper Mask) and dynamic closing trajectory (Gripper Path). XGrasp employs a hierarchical two-stage architecture consisting of a Grasp Point Predictor (GPP) and an Angle-Width Predictor (AWP). In the AWP, contrastive learning with a quality-aware anchor builds a gripper-agnostic embedding space, enabling generalization to novel grippers without additional training. Experimental results demonstrate that XGrasp outperforms existing gripper-aware methods in both grasp success rate and inference speed across diverse gripper types. Project page: https://sites.google.com/view/xgrasp
Submitted 12 March, 2026; v1 submitted 13 October, 2025;
originally announced October 2025.
-
RIS-Assisted XL-MIMO for Near-Field and Far-Field Communications
Authors:
Xiaomin Cao,
Mohammadali Mohammadi,
Hien Quoc Ngo,
Hyundong Shin,
Michail Matthaiou
Abstract:
We consider a reconfigurable intelligent surface (RIS)-assisted extremely large-scale multiple-input multiple-output (XL-MIMO) downlink system, where an XL-MIMO array serves two groups of single-antenna users, namely near-field users (NFUEs) and far-field users (FFUEs). FFUEs are subject to blockage, and their communication is facilitated through the RIS. We consider three precoding schemes at the XL-MIMO array, namely central zero-forcing (CZF), local zero-forcing (LZF) and maximum ratio transmission (MRT). Closed-form expressions for the spectral efficiency (SE) of all users are derived for MRT precoding, while statistical-form expressions are obtained for CZF and LZF processing. A heuristic visibility region (VR) selection algorithm is also introduced to help reduce the computational complexity of the precoding scheme. Furthermore, we devise a two-stage phase-shift design and power control algorithm to maximize the sum of the weighted minimum SE of the two groups of users with CZF, LZF and MRT precoding schemes. The simulation results indicate that, when equal priority is given to NFUEs and FFUEs, the proposed design improves the sum of the weighted minimum SE by 31.9%, 37.8%, and 119.2% with CZF, LZF, and MRT, respectively, compared to the case with equal power allocation and random phase-shift design. CZF achieves the best performance, while LZF offers comparable results with lower complexity. When prioritizing NFUEs or FFUEs, LZF achieves strong performance for the prioritized group, whereas CZF ensures balanced performance between NFUEs and FFUEs.
Submitted 27 September, 2025;
originally announced September 2025.
-
RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards
Authors:
Zhilin Wang,
Jiaqi Zeng,
Olivier Delalleau,
Ellie Evans,
Daniel Egert,
Hoo-Chang Shin,
Felipe Soares,
Yi Dong,
Oleksii Kuchaiev
Abstract:
Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost). Models: https://huggingface.co/collections/nvidia/reward-models-10-2025
Submitted 30 October, 2025; v1 submitted 25 September, 2025;
originally announced September 2025.
-
Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers
Authors:
Chaehyun Kim,
Heeseong Shin,
Eunbeen Hong,
Heeji Yoon,
Anurag Arnab,
Paul Hongsuck Seo,
Sunghwan Hong,
Seungryong Kim
Abstract:
Text-to-image diffusion models excel at translating language prompts into photorealistic images by implicitly grounding textual concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this by introducing joint self-attention over concatenated image and text tokens, enabling richer and more scalable cross-modal alignment. However, a detailed understanding of how and where these attention maps contribute to image generation remains limited. In this paper, we introduce Seg4Diff (Segmentation for Diffusion), a systematic framework for analyzing the attention structures of MM-DiT, with a focus on how specific layers propagate semantic information from text to image. Through comprehensive analysis, we identify a semantic grounding expert layer, a specific MM-DiT block that consistently aligns text tokens with spatially coherent image regions, naturally producing high-quality semantic segmentation masks. We further demonstrate that applying a lightweight fine-tuning scheme with mask-annotated image data enhances the semantic grouping capabilities of these layers and thereby improves both segmentation performance and generated image fidelity. Our findings demonstrate that semantic grouping is an emergent property of diffusion transformers and can be selectively amplified to advance both segmentation and generation performance, paving the way for unified models that bridge visual perception and generation.
Submitted 22 September, 2025;
originally announced September 2025.
-
MAESTRO: Task-Relevant Optimization via Adaptive Feature Enhancement and Suppression for Multi-task 3D Perception
Authors:
Changwon Kang,
Jisong Kim,
Hongjae Shin,
Junseo Park,
Jun Won Choi
Abstract:
The goal of multi-task learning is to learn to conduct multiple tasks simultaneously based on a shared data representation. While this approach can improve learning efficiency, it may also cause performance degradation due to task conflicts that arise when optimizing the model for different objectives. To address this challenge, we introduce MAESTRO, a structured framework designed to generate task-specific features and mitigate feature interference in multi-task 3D perception, including 3D object detection, bird's-eye view (BEV) map segmentation, and 3D occupancy prediction. MAESTRO comprises three components: the Class-wise Prototype Generator (CPG), the Task-Specific Feature Generator (TSFG), and the Scene Prototype Aggregator (SPA). CPG groups class categories into foreground and background groups and generates group-wise prototypes. The foreground and background prototypes are assigned to the 3D object detection task and the map segmentation task, respectively, while both are assigned to the 3D occupancy prediction task. TSFG leverages these prototype groups to retain task-relevant features while suppressing irrelevant features, thereby enhancing the performance for each task. SPA enhances the prototype groups assigned for 3D occupancy prediction by utilizing the information produced by the 3D object detection head and the map segmentation head. Extensive experiments on the nuScenes and Occ3D benchmarks demonstrate that MAESTRO consistently outperforms existing methods across 3D object detection, BEV map segmentation, and 3D occupancy prediction tasks.
Submitted 22 September, 2025;
originally announced September 2025.
-
Mitigating Intra-Speaker Variability in Diarization with Style-Controllable Speech Augmentation
Authors:
Miseul Kim,
Soo Jin Park,
Kyungguen Byun,
Hyeon-Kyeong Shin,
Sunkuk Moon,
Shuhua Zhang,
Erik Visser
Abstract:
Speaker diarization systems often struggle with high intrinsic intra-speaker variability, such as shifts in emotion, health, or content. This can cause segments from the same speaker to be misclassified as different individuals, for example, when one raises their voice or speaks faster during conversation. To address this, we propose a style-controllable speech generation model that augments speech across diverse styles while preserving the target speaker's identity. The proposed system starts with diarized segments from a conventional diarizer. For each diarized segment, it generates augmented speech samples enriched with phonetic and stylistic diversity. Then, speaker embeddings from both the original and generated audio are blended to enhance the system's robustness in grouping segments with high intrinsic intra-speaker variability. We validate our approach on a simulated emotional speech dataset and the truncated AMI dataset, demonstrating significant improvements, with error rate reductions of 49% and 35%, respectively.
Submitted 18 September, 2025;
originally announced September 2025.
-
Large Language Model-Empowered Decision Transformer for UAV-Enabled Data Collection
Authors:
Zhixion Chen,
Jiangzhou Wang,
Hyundong Shin,
Arumugam Nallanathan
Abstract:
The deployment of unmanned aerial vehicles (UAVs) for reliable and energy-efficient data collection from spatially distributed devices holds great promise in supporting diverse Internet of Things (IoT) applications. Nevertheless, the limited endurance and communication range of UAVs necessitate intelligent trajectory planning. While reinforcement learning (RL) has been extensively explored for UAV trajectory optimization, its interactive nature entails high costs and risks in real-world environments. Offline RL mitigates these issues but remains susceptible to unstable training and relies heavily on expert-quality datasets. To address these challenges, we formulate a joint UAV trajectory planning and resource allocation problem to maximize the energy efficiency of data collection. The resource allocation subproblem is first transformed into an equivalent linear programming formulation and solved optimally with polynomial-time complexity. Then, we propose a large language model (LLM)-empowered critic-regularized decision transformer (DT) framework, termed LLM-CRDT, to learn effective UAV control policies. In LLM-CRDT, we incorporate critic networks to regularize the DT model training, thereby integrating the sequence modeling capabilities of DT with critic-based value guidance to enable learning effective policies from suboptimal datasets. Furthermore, to mitigate the data-hungry nature of transformer models, we employ a pre-trained LLM as the transformer backbone of the DT model and adopt a parameter-efficient fine-tuning strategy, i.e., LoRA, enabling rapid adaptation to UAV control tasks with small-scale datasets and low computational overhead. Extensive simulations demonstrate that LLM-CRDT outperforms benchmark online and offline RL methods, achieving up to 36.7% higher energy efficiency than the current state-of-the-art DT approaches.
Submitted 19 September, 2025; v1 submitted 17 September, 2025;
originally announced September 2025.
-
Fluid Antenna Systems: A Geometric Approach to Error Probability and Fundamental Limits
Authors:
Xusheng Zhu,
Kai-Kit Wong,
Hao Xu,
Han Xiao,
Hanjiang Hong,
Hyundong Shin,
Yangyang Zhang
Abstract:
The fluid antenna system (FAS) concept is an emerging paradigm that exploits shape and position reconfigurability in antennas to broaden the design of wireless communication systems. This also means that spatial diversity can be exploited in an unconventional way. However, a rigorous framework for error probability analysis of FAS under realistic spatially correlated channels has been lacking. In this paper, we fill this gap by deriving a tight, closed-form asymptotic expression for the symbol error rate (SER) that establishes the fundamental scaling law linking the system's SER to the channel's spatial correlation structure. A key insight of our analysis is that the achievable diversity gain is governed not by the number of antenna ports, but by the channel's effective rank. To find this critical parameter, we propose a novel dual-pronged approach. First, we develop a geometry-based algorithm that extracts distinct performance thresholds from the channel's eigenvalue spectrum. Second, we theoretically prove that the effective rank converges to a fundamental limit dictated solely by the antenna's normalized aperture width. We further establish the equivalence between the threshold identified by the geometric algorithm and the derived theoretical limit, providing rigorous validation for the proposed method. Our effective rank model achieves higher accuracy than existing approaches in the literature. Building on this framework, we offer a complete characterization of diversity and coding gains. The analysis leads to a definitive design insight: FAS performance improvements are fundamentally driven by enlarging the antenna's explorable aperture, which increases the effective channel rank, whereas increasing port density within a fixed aperture yields diminishing returns.
Submitted 10 September, 2025;
originally announced September 2025.
-
Visual Representation Alignment for Multimodal Large Language Models
Authors:
Heeji Yoon,
Jaewoo Jung,
Junwan Kim,
Hyungyu Choi,
Heeseong Shin,
Sangbeom Lim,
Honggyu An,
Chaehyun Kim,
Jisang Han,
Donghyun Kim,
Chanho Eom,
Sunghwan Hong,
Seungryong Kim
Abstract:
Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to supplement them with additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.
Submitted 10 October, 2025; v1 submitted 9 September, 2025;
originally announced September 2025.
-
NiceWebRL: a Python library for human subject experiments with reinforcement learning environments
Authors:
Wilka Carvalho,
Vikram Goddla,
Ishaan Sinha,
Hoon Shin,
Kunal Jha
Abstract:
We present NiceWebRL, a research tool that enables researchers to use machine reinforcement learning (RL) environments for online human subject experiments. NiceWebRL is a Python library that allows any Jax-based environment to be transformed into an online interface, supporting both single-agent and multi-agent environments. As such, NiceWebRL enables AI researchers to compare their algorithms to human performance, cognitive scientists to test ML algorithms as theories for human cognition, and multi-agent researchers to develop algorithms for human-AI collaboration. We showcase NiceWebRL with 3 case studies that demonstrate its potential to help develop Human-like AI, Human-compatible AI, and Human-assistive AI. In the first case study (Human-like AI), NiceWebRL enables the development of a novel RL model of cognition. Here, NiceWebRL facilitates testing this model against human participants in both a grid world and Craftax, a 2D Minecraft domain. In our second case study (Human-compatible AI), NiceWebRL enables the development of a novel multi-agent RL algorithm that can generalize to human partners in the Overcooked domain. Finally, in our third case study (Human-assistive AI), we show how NiceWebRL can allow researchers to study how an LLM can assist humans on complex tasks in XLand-Minigrid, an environment with millions of hierarchical tasks. The library is available at https://github.com/KempnerInstitute/nicewebrl.
Submitted 21 August, 2025;
originally announced August 2025.
-
Making Pose Representations More Expressive and Disentangled via Residual Vector Quantization
Authors:
Sukhyun Jeong,
Hong-Gi Shin,
Yong-Hoon Choi
Abstract:
Recent progress in text-to-motion has advanced both 3D human motion generation and text-based motion control. Controllable motion generation (CoMo), which enables intuitive control, typically relies on pose code representations, but discrete pose codes alone cannot capture fine-grained motion details, limiting expressiveness. To overcome this, we propose a method that augments pose code-based latent representations with continuous motion features using residual vector quantization (RVQ). This design preserves the interpretability and manipulability of pose codes while effectively capturing subtle motion characteristics such as high-frequency details. Experiments on the HumanML3D dataset show that our model reduces Fréchet inception distance (FID) from 0.041 to 0.015 and improves Top-1 R-Precision from 0.508 to 0.510. Qualitative analysis of pairwise direction similarity between pose codes further confirms the model's controllability for motion editing.
Submitted 20 August, 2025;
originally announced August 2025.
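For readers unfamiliar with residual vector quantization, the following is a minimal generic sketch of the idea, not the authors' implementation: each stage picks the nearest code for whatever residual the previous stage left behind. The function names `rvq_encode` and `rvq_decode`, the toy codebooks, and the use of NumPy are assumptions made for illustration only.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the residual of the previous one.

    codebooks: list of arrays, each of shape (num_codes, dim). Hypothetical toy data.
    Returns the chosen code index per stage and the final unquantized residual.
    """
    residual = np.asarray(x, dtype=np.float64).copy()
    indices = []
    for cb in codebooks:
        # Squared Euclidean distance from the current residual to every code.
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        # Subtract the selected code; later stages refine what is left.
        residual = residual - cb[idx]
    return indices, residual

def rvq_decode(indices, codebooks):
    """Reconstruct the vector as the sum of the selected codes from every stage."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

# Toy example: a coarse codebook followed by a finer one.
coarse = np.array([[0.0, 0.0], [1.0, 1.0]])
fine = np.array([[0.0, 0.0], [0.5, 0.0]])
codes, leftover = rvq_encode([1.4, 1.0], [coarse, fine])
approx = rvq_decode(codes, [coarse, fine])
```

In this toy case the first stage selects the coarse code `[1.0, 1.0]` and the second stage adds `[0.5, 0.0]`, so the reconstruction is `[1.5, 1.0]`. The coarse index plays the role of an interpretable discrete code, while later stages carry the finer detail.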
-
NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
Authors:
NVIDIA,
Aarti Basant,
Abhijit Khairnar,
Abhijit Paithankar,
Abhinav Khattar,
Adithya Renduchintala,
Aditya Malte,
Akhiad Bercovich,
Akshay Hazare,
Alejandra Rico,
Aleksander Ficek,
Alex Kondratenko,
Alex Shaposhnikov,
Alexander Bukharin,
Ali Taghibakhshi,
Amelia Barton,
Ameya Sunil Mahabaleshwarkar,
Amy Shen,
Andrew Tao,
Ann Guan,
Anna Shors,
Anubhav Mandarwal,
Arham Mehta,
Arun Venkatesan
, et al. (192 additional authors not shown)
Abstract:
We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22 GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.
Submitted 2 September, 2025; v1 submitted 20 August, 2025;
originally announced August 2025.
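As a rough back-of-the-envelope check on the memory budget mentioned in the abstract, the sketch below estimates the weight footprint in bfloat16 (2 bytes per parameter). It assumes the nominal parameter counts of exactly 12e9 and 9e9 implied by the model names, and it ignores activations, recurrent state, and any key-value cache; the helper `weight_gib` is hypothetical.

```python
GIB = 2**30            # bytes in one GiB
BYTES_PER_BF16 = 2     # bfloat16 stores each parameter in 2 bytes

def weight_gib(num_params):
    """Approximate size of the model weights alone, in GiB, at bfloat16 precision."""
    return num_params * BYTES_PER_BF16 / GIB

# Nominal parameter counts taken from the model names (an assumption, not exact figures).
for name, n in [("12B base", 12e9), ("9B compressed", 9e9)]:
    print(f"{name}: {weight_gib(n):.2f} GiB of weights")
```

Under these assumptions the 12B weights come to about 22.35 GiB, which already exceeds the 22 GiB budget, while the 9B weights come to about 16.76 GiB and leave headroom for long-context inference state.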