-
CrowdVLA: Embodied Vision-Language-Action Agents for Context-Aware Crowd Simulation
Authors:
Juyeong Hwang,
Seong-Eun Hong,
Jinhyun Kim,
JaeYoung Seon,
Giljoo Nam,
Hanyoung Jang,
HyeongYeop Kang
Abstract:
Crowds do not merely move; they decide. Human navigation is inherently contextual: people interpret the meaning of space, social norms, and potential consequences before acting. Sidewalks invite walking, crosswalks invite crossing, and deviations are weighed against urgency and safety. Yet most crowd simulation methods reduce navigation to geometry and collision avoidance, producing motion that is plausible but rarely intentional. We introduce CrowdVLA, a new formulation of crowd simulation that models each pedestrian as a Vision-Language-Action (VLA) agent. Instead of replaying recorded trajectories, CrowdVLA enables agents to interpret scene semantics and social norms from visual observations and language instructions, and to select actions through consequence-aware reasoning. CrowdVLA addresses three key challenges (limited agent-centric supervision in crowd datasets, unstable per-frame control, and success-biased datasets) through: (i) agent-centric visual supervision via semantically reconstructed environments and Low-Rank Adaptation (LoRA) fine-tuning of a pretrained vision-language model, (ii) a motion skill action space that bridges symbolic decision making and continuous locomotion, and (iii) exploration-based question answering that exposes agents to counterfactual actions and their outcomes through simulation rollouts. Our results shift crowd simulation from motion-centric synthesis toward perception-driven, consequence-aware decision making, enabling crowds that move not just realistically, but meaningfully.
Submitted 7 April, 2026;
originally announced April 2026.
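As context for the LoRA fine-tuning of a pretrained vision-language model mentioned in the abstract, the sketch below shows how low-rank adapters are typically attached with the Hugging Face peft library. The backbone checkpoint, target modules, and hyperparameters are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch: attaching LoRA adapters to a pretrained vision-language model.
# Base model, target modules, and hyperparameters are hypothetical choices.
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

base_id = "llava-hf/llava-1.5-7b-hf"  # hypothetical backbone
processor = AutoProcessor.from_pretrained(base_id)
model = AutoModelForVision2Seq.from_pretrained(base_id)

lora_cfg = LoraConfig(
    r=16,                                  # rank of the low-rank update
    lora_alpha=32,                         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```

LoRA keeps the pretrained weights frozen and trains only the small adapter matrices, which is why it is a common choice when agent-centric supervision is scarce.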
-
Reasoning-Native Agentic Communication for 6G
Authors:
Hyowoon Seo,
Joonho Seon,
Jin Young Kim,
Mehdi Bennis,
Wan Choi,
Dong In Kim
Abstract:
Future 6G networks will interconnect not only devices but also autonomous machines that continuously sense, reason, and act. In such environments, communication can no longer be understood solely as delivering bits or even preserving semantic meaning. Even when two agents interpret the same information correctly, they may still behave inconsistently if their internal reasoning processes evolve differently. We refer to this emerging challenge as belief divergence. This article introduces reasoning-native agentic communication, a new paradigm in which communication is explicitly designed to address belief divergence rather than merely transmitting representations. Instead of triggering transmissions based only on channel conditions or data relevance, the proposed framework activates communication according to predicted misalignment in agents' internal belief states. We present a reasoning-native architecture that augments the conventional communication stack with a coordination plane grounded in a shared knowledge structure and bounded belief modeling. Through enabling mechanisms and representative multi-agent scenarios, we illustrate how such an approach can prevent coordination drift and maintain coherent behavior across heterogeneous systems. By reframing communication as a regulator of distributed reasoning, reasoning-native agentic communication enables 6G networks to act as an active harmonizer of autonomous intelligence.
Submitted 18 February, 2026;
originally announced February 2026.
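A minimal sketch of the divergence-triggered transmission idea described above, assuming each agent keeps its own belief distribution and a bounded model of a peer's belief. The KL-divergence trigger and the threshold value are illustrative assumptions, not the article's exact mechanism.

```python
# Minimal sketch: communicate on predicted belief misalignment, not channel state.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def should_transmit(own_belief, modeled_peer_belief, threshold=0.1):
    """Trigger a message only when beliefs are predicted to have drifted apart."""
    return kl_divergence(own_belief, modeled_peer_belief) > threshold

own = [0.7, 0.2, 0.1]         # agent's current belief over three hypotheses
peer_model = [0.4, 0.4, 0.2]  # agent's bounded estimate of the peer's belief
if should_transmit(own, peer_model):
    print("beliefs have diverged: send a corrective update")
```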
-
Event-T2M: Event-level Conditioning for Complex Text-to-Motion Synthesis
Authors:
Seong-Eun Hong,
JaeYoung Seon,
JuYeong Hwang,
JongHwan Shin,
HyeongYeop Kang
Abstract:
Text-to-motion generation has advanced with diffusion models, yet existing systems often collapse complex multi-action prompts into a single embedding, leading to omissions, reordering, or unnatural transitions. In this work, we shift perspective by introducing a principled definition of an event as the smallest semantically self-contained action or state change in a text prompt that can be temporally aligned with a motion segment. Building on this definition, we propose Event-T2M, a diffusion-based framework that decomposes prompts into events, encodes each with a motion-aware retrieval model, and integrates them through event-based cross-attention in Conformer blocks. Existing benchmarks mix simple and multi-event prompts, making it unclear whether models that succeed on single actions generalize to multi-action cases. To address this, we construct HumanML3D-E, the first benchmark stratified by event count. Experiments on HumanML3D, KIT-ML, and HumanML3D-E show that Event-T2M matches state-of-the-art baselines on standard tests while outperforming them as event complexity increases. Human studies validate the plausibility of our event definition, the reliability of HumanML3D-E, and the superiority of Event-T2M in generating multi-event motions that preserve event order and achieve naturalness close to the ground truth. These results establish event-level conditioning as a generalizable principle for advancing text-to-motion generation beyond single-action prompts.
Submitted 4 February, 2026;
originally announced February 2026.
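A minimal sketch of event-level conditioning via cross-attention, assuming the prompt has already been split into per-event text embeddings. The dimensions, splitting rule, and single attention layer are illustrative assumptions, not the Event-T2M architecture.

```python
# Minimal sketch: motion tokens attend over per-event embeddings instead of one pooled prompt vector.
import torch
import torch.nn as nn

d_model, n_heads = 256, 4
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

motion_tokens = torch.randn(1, 120, d_model)  # queries: one token per motion frame
event_embeds = torch.randn(1, 3, d_model)     # keys/values: one embedding per event

# Different segments of the sequence can bind to different events through the attention weights.
conditioned, attn_weights = cross_attn(query=motion_tokens, key=event_embeds, value=event_embeds)
print(conditioned.shape, attn_weights.shape)  # (1, 120, 256), (1, 120, 3)
```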
-
How Does a Virtual Agent Decide Where to Look? Symbolic Cognitive Reasoning for Embodied Head Rotation
Authors:
Juyeong Hwang,
Seong-Eun Hong,
JaeYoung Seon,
Hyeongyeop Kang
Abstract:
Natural head rotation is critical for believable embodied virtual agents, yet this micro-level behavior remains largely underexplored. While head-rotation prediction algorithms could, in principle, reproduce this behavior, they typically focus on visually salient stimuli and overlook the cognitive motives that guide head rotation. This yields agents that look at conspicuous objects while overlooking obstacles or task-relevant cues, diminishing realism in a virtual environment. We introduce SCORE, a data-agnostic Symbolic Cognitive Reasoning framework for Embodied Head Rotation that produces context-aware head movements without task-specific training or hand-tuned heuristics. A controlled VR study (N=20) identifies five motivational drivers of human head movements: Interest, Information Seeking, Safety, Social Schema, and Habit. SCORE encodes these drivers as symbolic predicates, perceives the scene with a Vision-Language Model (VLM), and plans head poses with a Large Language Model (LLM). The framework employs a hybrid workflow: the VLM-LLM reasoning is executed offline, after which a lightweight FastVLM performs online validation to suppress hallucinations while maintaining responsiveness to scene dynamics. The result is an agent that predicts not only where to look but also why, generalizing to unseen scenes and multi-agent crowds while retaining behavioral plausibility.
Submitted 6 January, 2026; v1 submitted 12 August, 2025;
originally announced August 2025.
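A minimal sketch of scoring gaze candidates with the five symbolic drivers named in the abstract (Interest, Information Seeking, Safety, Social Schema, Habit). In the framework above the predicate values would come from a VLM scene description; here they are hard-coded, and the weights are illustrative assumptions.

```python
# Minimal sketch: combine symbolic driver predicates into a gaze-target score.
DRIVER_WEIGHTS = {
    "interest": 1.0,
    "information_seeking": 1.2,
    "safety": 2.0,        # safety-relevant targets dominate, e.g. an approaching cyclist
    "social_schema": 0.8,
    "habit": 0.5,
}

def score_target(predicates):
    """Combine per-driver predicate activations (0..1) into a single gaze score."""
    return sum(DRIVER_WEIGHTS[d] * v for d, v in predicates.items())

candidates = {
    "storefront_sign": {"interest": 0.9, "information_seeking": 0.2, "safety": 0.0,
                        "social_schema": 0.0, "habit": 0.1},
    "oncoming_cyclist": {"interest": 0.3, "information_seeking": 0.4, "safety": 0.9,
                         "social_schema": 0.2, "habit": 0.0},
}
best = max(candidates, key=lambda name: score_target(candidates[name]))
print(best)  # the safety-driven target wins despite lower visual salience
```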
-
Regularizing Dynamic Radiance Fields with Kinematic Fields
Authors:
Woobin Im,
Geonho Cha,
Sebin Lee,
Jumin Lee,
Juhyeong Seon,
Dongyoon Wee,
Sung-Eui Yoon
Abstract:
This paper presents a novel approach for reconstructing dynamic radiance fields from monocular videos. We integrate kinematics with dynamic radiance fields, bridging the gap between the sparse nature of monocular videos and real-world physics. Our method introduces the kinematic field, capturing motion through kinematic quantities: velocity, acceleration, and jerk. The kinematic field is jointly learned with the dynamic radiance field by minimizing the photometric loss without motion ground truth. We further augment our method with physics-driven regularizers grounded in kinematics, which ensure the physical validity of predicted kinematic quantities, including advective acceleration and jerk. Additionally, we control the motion trajectory based on rigidity equations formed with the predicted kinematic quantities. In experiments, our method outperforms state-of-the-art methods by capturing physical motion patterns within challenging real-world monocular videos.
Submitted 19 July, 2024;
originally announced July 2024.
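A minimal sketch of a kinematics-consistency regularizer of the kind the abstract alludes to: the predicted acceleration should match the material (advective) derivative of the predicted velocity, a ≈ ∂v/∂t + (v · ∇)v, estimated here with finite differences. Field shapes, step sizes, and the toy fields are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: penalize mismatch between predicted acceleration and the advective derivative of velocity.
import torch

def advective_acceleration(vel_fn, x, t, eps=1e-3):
    """Finite-difference estimate of Dv/Dt for a velocity field vel_fn(x, t) -> (N, 3)."""
    v = vel_fn(x, t)
    dv_dt = (vel_fn(x, t + eps) - vel_fn(x, t - eps)) / (2 * eps)   # partial derivative in time
    adv = torch.zeros_like(v)
    for i in range(3):                                              # (v . grad) v, one axis at a time
        dx = torch.zeros_like(x)
        dx[:, i] = eps
        dvdxi = (vel_fn(x + dx, t) - vel_fn(x - dx, t)) / (2 * eps)
        adv += v[:, i:i + 1] * dvdxi
    return dv_dt + adv

def kinematic_consistency_loss(vel_fn, acc_fn, x, t):
    return torch.mean((acc_fn(x, t) - advective_acceleration(vel_fn, x, t)) ** 2)

vel = lambda x, t: torch.sin(x + t)   # toy velocity field
acc = lambda x, t: torch.cos(x + t)   # its plain time derivative, omitting the advective term
x, t = torch.rand(8, 3), torch.tensor(0.5)
print(kinematic_consistency_loss(vel, acc, x, t))  # nonzero: the missing advective term is penalized
```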
-
Extending Segment Anything Model into Auditory and Temporal Dimensions for Audio-Visual Segmentation
Authors:
Juhyeong Seon,
Woobin Im,
Sebin Lee,
Jumin Lee,
Sung-Eui Yoon
Abstract:
Audio-visual segmentation (AVS) aims to segment sound sources in the video sequence, requiring a pixel-level understanding of audio-visual correspondence. As the Segment Anything Model (SAM) has strongly impacted extensive fields of dense prediction problems, prior works have investigated introducing SAM into AVS with audio as a new prompt modality. Nevertheless, constrained by SAM's single-frame segmentation scheme, the temporal context across multiple frames of audio-visual data remains insufficiently utilized. To this end, we study the extension of SAM's capabilities to sequences of audio-visual scenes by analyzing contextual cross-modal relationships across the frames. To achieve this, we propose a Spatio-Temporal, Bidirectional Audio-Visual Attention (ST-BAVA) module integrated between SAM's image encoder and mask decoder. It adaptively updates the audio-visual features to convey the spatio-temporal correspondence between the video frames and audio streams. Extensive experiments demonstrate that our proposed model outperforms the state-of-the-art methods on AVS benchmarks, especially with an 8.3% mIoU gain on a challenging multi-source subset.
Submitted 10 June, 2024;
originally announced June 2024.
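A minimal sketch of bidirectional spatio-temporal audio-visual attention: visual tokens from all frames attend to per-frame audio tokens and vice versa. The token shapes and the single shared layer are illustrative assumptions, not the ST-BAVA module itself.

```python
# Minimal sketch: two cross-attention directions shared over all frames of a clip.
import torch
import torch.nn as nn

d_model, n_heads, T, hw = 256, 4, 5, 64           # 5 frames, 64 spatial tokens each
a2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
v2a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

visual = torch.randn(1, T * hw, d_model)          # flattened spatio-temporal visual tokens
audio = torch.randn(1, T, d_model)                # one audio token per frame

visual_upd, _ = a2v(query=visual, key=audio, value=audio)   # audio -> visual
audio_upd, _ = v2a(query=audio, key=visual, value=visual)   # visual -> audio
print(visual_upd.shape, audio_upd.shape)          # (1, 320, 256), (1, 5, 256)
```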
-
SemCity: Semantic Scene Generation with Triplane Diffusion
Authors:
Jumin Lee,
Sebin Lee,
Changho Jo,
Woobin Im,
Juhyeong Seon,
Sung-Eui Yoon
Abstract:
We present "SemCity," a 3D diffusion model for semantic scene generation in real-world outdoor environments. Most 3D diffusion models focus on generating a single object, synthetic indoor scenes, or synthetic outdoor scenes, while the generation of real-world outdoor scenes is rarely addressed. In this paper, we concentrate on generating a real-outdoor scene through learning a diffusion model on a…
▽ More
We present "SemCity," a 3D diffusion model for semantic scene generation in real-world outdoor environments. Most 3D diffusion models focus on generating a single object, synthetic indoor scenes, or synthetic outdoor scenes, while the generation of real-world outdoor scenes is rarely addressed. In this paper, we concentrate on generating a real-outdoor scene by learning a diffusion model on a real-world outdoor dataset. In contrast to synthetic data, real-outdoor datasets often contain more empty spaces due to sensor limitations, causing challenges in learning real-outdoor distributions. To address this issue, we exploit a triplane representation as a proxy form of scene distributions to be learned by our diffusion model. Furthermore, we propose a triplane manipulation that integrates seamlessly with our triplane diffusion model. The manipulation improves our diffusion model's applicability to a variety of downstream tasks related to outdoor scene generation, such as scene inpainting, scene outpainting, and semantic scene completion refinement. In experimental results, we demonstrate that our triplane diffusion model shows meaningful generation results compared with existing work on a real-outdoor dataset, SemanticKITTI. We also show that our triplane manipulation facilitates seamlessly adding, removing, or modifying objects within a scene, and that it enables expanding scenes toward a city-level scale. Finally, we evaluate our method on semantic scene completion refinement, where our diffusion model enhances predictions of semantic scene completion networks by learning the scene distribution. Our code is available at https://github.com/zoomin-lee/SemCity.
Submitted 17 March, 2024; v1 submitted 12 March, 2024;
originally announced March 2024.
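A minimal sketch of how a triplane representation is typically queried: a 3D point is projected onto the xy, xz, and yz feature planes, bilinearly sampled, and the three features are aggregated. Resolutions, channel counts, and summation as the aggregation rule are illustrative assumptions, not SemCity's exact design.

```python
# Minimal sketch: per-point feature lookup from three axis-aligned feature planes.
import torch
import torch.nn.functional as F

C, R = 32, 128
planes = {name: torch.randn(1, C, R, R) for name in ("xy", "xz", "yz")}

def query_triplane(points):
    """points: (N, 3) in [-1, 1]^3 -> summed per-point features (N, C)."""
    coords = {"xy": points[:, [0, 1]], "xz": points[:, [0, 2]], "yz": points[:, [1, 2]]}
    feat = 0
    for name, uv in coords.items():
        grid = uv.view(1, -1, 1, 2)                    # (1, N, 1, 2) sampling grid
        sampled = F.grid_sample(planes[name], grid, align_corners=True)
        feat = feat + sampled.view(C, -1).t()          # (N, C)
    return feat

print(query_triplane(torch.rand(10, 3) * 2 - 1).shape)  # torch.Size([10, 32])
```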
-
Predicting challenge moments from students' discourse: A comparison of GPT-4 to two traditional natural language processing approaches
Authors:
Wannapon Suraworachet,
Jennifer Seon,
Mutlu Cukurova
Abstract:
Effective collaboration requires groups to strategically regulate themselves to overcome challenges. Research has shown that groups may fail to regulate due to differences in members' perceptions of challenges, situations that may benefit from external support. In this study, we investigated the potential of three distinct natural language processing approaches: an expert-knowledge rule-based model, a supervised machine learning (ML) model, and a large language model (LLM), for challenge detection and challenge dimension identification (cognitive, metacognitive, emotional, and technical/other challenges) from student discourse. The results show that the supervised ML and LLM approaches performed well in both tasks, in contrast to the rule-based approach, whose efficacy heavily relies on features engineered by experts. The paper provides an extensive discussion of the three approaches' performance for automated detection and support of students' challenge moments in collaborative learning activities. It argues that, although LLMs provide many advantages, they are unlikely to be a panacea for detecting and providing feedback on socially shared regulation of learning, owing to their lack of reliability as well as issues of validity evaluation, privacy, and confabulation. We conclude the paper with a discussion of additional considerations, including model transparency, for exploring feasible and meaningful analytical feedback for students and educators using LLMs.
Submitted 3 January, 2024;
originally announced January 2024.
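A minimal sketch of a supervised-ML baseline for challenge detection from discourse, in the spirit of the comparison above: TF-IDF features with a logistic-regression classifier. The toy utterances and labels are illustrative assumptions, not the study's data or features.

```python
# Minimal sketch: a TF-IDF + logistic-regression challenge detector on toy utterances.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

utterances = [
    "I don't understand what the task is asking us to do",    # challenge
    "The shared document keeps crashing on my laptop",         # challenge
    "Great, let's move on to the next section",                # no challenge
    "I think we finished that part already",                   # no challenge
]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(utterances, labels)
print(clf.predict(["I'm confused about what we are supposed to submit"]))
```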
-
Topological RANSAC for instance verification and retrieval without fine-tuning
Authors:
Guoyuan An,
Juhyung Seon,
Inkyu An,
Yuchi Huo,
Sung-Eui Yoon
Abstract:
This paper presents an innovative approach to enhancing explainable image retrieval, particularly in situations where a fine-tuning set is unavailable. The widely used SPatial verification (SP) method, despite its efficacy, relies on a spatial model and the hypothesis-testing strategy for instance recognition, leading to inherent limitations, including the assumption of planar structures and neglect of topological relations among features. To address these shortcomings, we introduce a pioneering technique that replaces the spatial model with a topological one within the RANSAC process. We propose bio-inspired saccade and fovea functions to verify the topological consistency among features, effectively circumventing the issues associated with SP's spatial model. Our experimental results demonstrate that our method significantly outperforms SP, achieving state-of-the-art performance in non-fine-tuning retrieval. Furthermore, our approach can enhance performance when used in conjunction with fine-tuned features. Importantly, our method retains high explainability and is lightweight, offering a practical and adaptable solution for a variety of real-world applications.
Submitted 10 October, 2023;
originally announced October 2023.
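A minimal sketch of the general idea of swapping the verification model inside a RANSAC-style loop: instead of fitting a planar spatial model, a sampled subset of correspondences is scored by how well it preserves nearest-neighbor topology across the two images. This is a generic illustration of topology-based verification, not the paper's saccade/fovea formulation.

```python
# Minimal sketch: RANSAC-style hypothesis scoring by k-nearest-neighbor consistency.
import numpy as np

rng = np.random.default_rng(0)

def knn_sets(points, k=3):
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    return [set(np.argsort(row)[1:k + 1]) for row in d]

def topological_score(pts_a, pts_b, k=3):
    """Fraction of shared k-nearest neighbors between matched keypoint layouts."""
    na, nb = knn_sets(pts_a, k), knn_sets(pts_b, k)
    return float(np.mean([len(sa & sb) / k for sa, sb in zip(na, nb)]))

def ransac_topological(matches_a, matches_b, iters=100, subset=8):
    best = (-1.0, None)
    for _ in range(iters):
        idx = rng.choice(len(matches_a), size=subset, replace=False)
        score = topological_score(matches_a[idx], matches_b[idx])
        best = max(best, (score, idx), key=lambda t: t[0])
    return best

pts_a = rng.random((30, 2))
pts_b = pts_a + 0.01 * rng.random((30, 2))   # nearly topology-preserving matches
print(ransac_topological(pts_a, pts_b)[0])   # high score for consistent correspondences
```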
-
A Numerical Method to Analyze Geometric Factors of a Space Particle Detector Relative to Omnidirectional Proton and Electron Fluxes
Authors:
Sungmin Pak,
Yuchul Shin,
Ju Woo,
Jongho Seon
Abstract:
A numerical method is proposed to calculate the response of detectors measuring particle energies from incident isotropic fluxes of electrons and positive ions. The isotropic flux is generated by injecting particles moving radially inward on a hypothetical, spherical surface encompassing the detectors. A geometric projection of the field-of-view from the detectors onto the spherical surface allows for the identification of initial positions and momenta corresponding to the clear field-of-view of the detectors. The contamination of detector responses by particles penetrating through, or scattering off, the structure is similarly identified by tracing the initial positions and momenta of the detected particles. The relative contribution from the contaminating particles is calculated using GEANT4 to obtain the geometric factor of the instrument as a function of energy. This calculation clearly shows that the geometric factor is a strong function of incident particle energies. The current investigation provides a simple and decisive method to analyze the instrument geometric factor, which is a complicated function of contributions from the anticipated field-of-view particles, together with penetrating or scattered particles.
Submitted 1 September, 2018;
originally announced September 2018.
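For context on how geometric factors are commonly estimated numerically, the sketch below shows the conventional Monte-Carlo approach: particles are launched from a sphere of radius R with a cosine-law angular distribution (reproducing an isotropic flux), and the detected fraction is scaled by the sphere's gathering power 4*pi^2*R^2. This is the standard textbook estimator, not the radially-inward injection and field-of-view projection method proposed in the paper.

```python
# Minimal sketch: conventional Monte-Carlo geometric-factor estimate, G = (N_det / N_gen) * 4 * pi^2 * R^2.
import numpy as np

rng = np.random.default_rng(0)

def geometric_factor(hits_detector, R=1.0, n=200_000):
    # Uniform random start points on the generation sphere of radius R.
    u = rng.normal(size=(n, 3))
    start = R * u / np.linalg.norm(u, axis=1, keepdims=True)
    # Cosine-law directions about the inward normal reproduce an isotropic flux inside.
    inward = -start / R
    cos_t = np.sqrt(rng.random(n))            # p(theta) proportional to cos(theta) sin(theta)
    phi = 2 * np.pi * rng.random(n)
    t1 = np.cross(inward, [0.0, 0.0, 1.0])
    t1 /= np.linalg.norm(t1, axis=1, keepdims=True)
    t2 = np.cross(inward, t1)
    sin_t = np.sqrt(1.0 - cos_t**2)
    direction = (cos_t[:, None] * inward
                 + sin_t[:, None] * (np.cos(phi)[:, None] * t1 + np.sin(phi)[:, None] * t2))
    detected = hits_detector(start, direction)
    return detected.mean() * 4 * np.pi**2 * R**2   # units: cm^2 sr if R is in cm

def disc_hit(start, direction, r=0.1):
    """Toy detector: a flat disc of radius r at the origin in the z = 0 plane, counting hits from both sides."""
    dz = np.where(direction[:, 2] == 0.0, 1e-12, direction[:, 2])  # avoid divide-by-zero
    t = -start[:, 2] / dz
    xy = start[:, :2] + t[:, None] * direction[:, :2]
    return (t > 0) & (np.hypot(xy[:, 0], xy[:, 1]) < r)

print(geometric_factor(disc_hit))  # roughly 0.197, i.e. the analytic 2 * pi^2 * r^2 for a double-sided disc
```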