arXiv:2604.07375 [pdf, ps, other]

Assessing the Feasibility of a Video-Based Conversational Chatbot Survey for Measuring Perceived Cycling Safety: A Pilot Study in New York City

Authors: Feiyang Ren, Zhaoxi Zhang, Tamir Mendel, Takahiro Yabe

Abstract: Bicycle safety is important for bikeability and transportation efficiency. However, conventional surveys often fall short in capturing how people actually perceive cycling environments because they rely heavily on respondents' recall rather than in-the-moment experience. By leveraging large language models (LLMs), this study proposes a new method of combining video-based surveys with a conversatio… ▽ More Bicycle safety is important for bikeability and transportation efficiency. However, conventional surveys often fall short in capturing how people actually perceive cycling environments because they rely heavily on respondents' recall rather than in-the-moment experience. By leveraging large language models (LLMs), this study proposes a new method of combining video-based surveys with a conversational AI chatbot to collect human perceptions of cycling safety and the reasons behind these perceptions. The paper developed the AI chatbot using a modular LLM architecture, integrating prompt engineering, state management, and rule-based control to support the structure of human-AI interaction. This paper evaluates the feasibility of the proposed video-based conversational chatbot using complete responses from sixteen participants to the pilot survey across nine street segments in New York City. The method feasibility was assessed using a seven-point scale rating for user experience (i.e., ease of use, supportiveness, efficiency) and a five-point scale for chatbot usability (i.e., personality, roboticness, friendliness), yielding positive results with mean scores of 5.00 out of 7 (standard deviation = 1.6) and 3.47 out of 5 (standard deviation = 0.43), respectively. The data feasibility was assessed using multiple techniques: (1) Natural language processing (NLP), such as KeyBERT, for overall safety and feature analysis to extract built-environment attributes; (2) K-means clustering for semantic analysis to identify reasons and suggestions; and (3) regression to estimate the effects of built-environment and demographic variables on perceived safety outcomes. The results show the potential of AI chatbots as a novel approach to collecting data on human perception, behavior, and future visions for transport planning. △ Less

Submitted 7 April, 2026; originally announced April 2026.

arXiv:2604.05887 [pdf, ps, other]

HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference

Authors: Bowen Zeng, Feiyang Ren, Jun Zhang, Xiaoling Gu, Ke Chen, Lidan Shou, Huan Li

Abstract: Multimodal Large Language Models (MLLMs) have advanced unified reasoning over text, images, and videos, but their inference is hindered by the rapid growth of key-value (KV) caches. Each visual input expands into thousands of tokens, causing caches to scale linearly with context length and remain resident in GPU memory throughout decoding, which leads to prohibitive memory overhead and latency eve… ▽ More Multimodal Large Language Models (MLLMs) have advanced unified reasoning over text, images, and videos, but their inference is hindered by the rapid growth of key-value (KV) caches. Each visual input expands into thousands of tokens, causing caches to scale linearly with context length and remain resident in GPU memory throughout decoding, which leads to prohibitive memory overhead and latency even on high-end GPUs. A common solution is to compress caches under a fixed allocated budget at different granularities: token-level uniformly discards less important tokens, layer-level varies retention across layers, and head-level redistributes budgets across heads. Yet these approaches stop at allocation and overlook the heterogeneous behaviors of attention heads that require distinct compression strategies. We propose HybridKV, a hybrid KV cache compression framework that integrates complementary strategies in three stages: heads are first classified into static or dynamic types using text-centric attention; then a top-down budget allocation scheme hierarchically assigns KV budgets; finally, static heads are compressed by text-prior pruning and dynamic heads by chunk-wise retrieval. Experiments on 11 multimodal benchmarks with Qwen2.5-VL-7B show that HybridKV reduces KV cache memory by up to $7.9\times$ and achieves $1.52\times$ faster decoding, with almost no performance drop or even higher relative to the full-cache MLLM. △ Less

Submitted 7 April, 2026; originally announced April 2026.

arXiv:2604.05546 [pdf, ps, other]

Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects

Authors: Jun Zhang, Yicheng Ji, Feiyang Ren, Yihang Li, Bowen Zeng, Zonghao Chen, Ke Chen, Lidan Shou, Gang Chen, Huan Li

Abstract: Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime interplay between high-resolution feature extraction, quadratic attention scaling, and memory bandwidth constraints. We present a systematic taxonomy of efficiency tech… ▽ More Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime interplay between high-resolution feature extraction, quadratic attention scaling, and memory bandwidth constraints. We present a systematic taxonomy of efficiency techniques structured around the inference lifecycle, consisting of encoding, prefilling, and decoding. Unlike prior reviews focused on isolated optimizations, we analyze the end-to-end pipeline to reveal how upstream decisions dictate downstream bottlenecks, covering compute-bound visual encoding, the intensive prefilling of massive contexts, and the ''visual memory wall'' in bandwidth-bound decoding. By decoupling the efficiency landscape into the axes of shaping information density, managing long-context attention, and overcoming memory limits, this work provides a structured analysis of how isolated optimizations compose to navigate the trade-off between visual fidelity and system efficiency. The survey concludes by outlining four future frontiers supported by pilot empirical insights, including hybrid compression based on functional unit sensitivity, modality-aware decoding with relaxed verification, progressive state management for streaming continuity, and stage-disaggregated serving through hardware-algorithm co-design. The submitted software contains a snapshot of our literature repository, which is designed to be maintained as a living resource for the community. △ Less

Submitted 7 April, 2026; originally announced April 2026.

Comments: Accepted to ACL 2026 Findings

arXiv:2604.05363 [pdf, ps, other]

Rethinking IRSTD: Single-Point Supervision Guided Encoder-only Framework is Enough for Infrared Small Target Detection

Authors: Rixiang Ni, Boyang Li, Jun Chen, Yonghao Li, Feiyu Ren, Yuji Wang, Haoyang Yuan, Wujiao He, Wei An

Abstract: Infrared small target detection (IRSTD) aims to separate small targets from clutter backgrounds. Extensive research is dedicated to the pixel-level supervision-guided "encoder-decoder" segmentation paradigm. Although having achieved promising performance, they neglect the fact that small targets only occupy a few pixels and are usually accompanied with blurred boundary caused by clutter background… ▽ More Infrared small target detection (IRSTD) aims to separate small targets from clutter backgrounds. Extensive research is dedicated to the pixel-level supervision-guided "encoder-decoder" segmentation paradigm. Although having achieved promising performance, they neglect the fact that small targets only occupy a few pixels and are usually accompanied with blurred boundary caused by clutter backgrounds. Based on this observation, we argue that the first principle of IRSTD should be target localization instead of separating all target region accompanied with indistinguishable background noise. In this paper, we reformulate IRSTD as a centroid regression task and propose a novel Single-Point Supervision guided Infrared Probabilistic Response Encoding method (namely, SPIRE), which is indeed challenging due to the mismatch between reduced supervision network and equivalent output. Specifically, we first design a Point-Response Prior Supervision (PRPS), which transforms single-point annotations into probabilistic response map consistent with infrared point-target response characteristics, with a High-Resolution Probabilistic Encoder (HRPE) that enables encoder-only, end-to-end regression without decoder reconstruction. By preserving high-resolution features and increasing effective supervision density, SPIRE alleviates optimization instability under sparse target distributions. Finally, extensive experiments on various IRSTD benchmarks, including SIRST-UAVB and SIRST4 demonstrate that SPIRE achieves competitive target-level detection performance with consistently low false alarm rate (Fa) and significantly reduced computational cost. Code is publicly available at: https://github.com/NIRIXIANG/SPIRE-IRSTD. △ Less

Submitted 6 April, 2026; originally announced April 2026.

arXiv:2604.02713 [pdf, ps, other]

Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts

Authors: Jiawen Deng, Wentao Zhang, Ziyun Jiao, Fuji Ren

Abstract: Conversational AI is increasingly deployed in emotionally charged and ethically sensitive interactions. Previous research has primarily concentrated on emotional benchmarks or static safety checks, overlooking how alignment unfolds in evolving conversation. We explore the research question: what breakdowns arise when conversational agents confront emotionally and ethically sensitive behaviors, and… ▽ More Conversational AI is increasingly deployed in emotionally charged and ethically sensitive interactions. Previous research has primarily concentrated on emotional benchmarks or static safety checks, overlooking how alignment unfolds in evolving conversation. We explore the research question: what breakdowns arise when conversational agents confront emotionally and ethically sensitive behaviors, and how do these affect dialogue quality? To stress-test chatbot performance, we develop a persona-conditioned user simulator capable of engaging in multi-turn dialogue with psychological personas and staged emotional pacing. Our analysis reveals that mainstream models exhibit recurrent breakdowns that intensify as emotional trajectories escalate. We identify several common failure patterns, including affective misalignments, ethical guidance failures, and cross-dimensional trade-offs where empathy supersedes or undermines responsibility. We organize these patterns into a taxonomy and discuss the design implications, highlighting the necessity to maintain ethical coherence and affective sensitivity throughout dynamic interactions. The study offers the HCI community a new perspective on the diagnosis and improvement of conversational AI in value-sensitive and emotionally charged contexts. △ Less

Submitted 3 April, 2026; originally announced April 2026.

Comments: 22 pages, ACM CHI 2026

arXiv:2604.00368 [pdf, ps, other]

TENT: A Declarative Slice Spraying Engine for Performant and Resilient Data Movement in Disaggregated LLM Serving

Authors: Feng Ren, Ruoyu Qin, Teng Ma, Shangming Cai, Zheng Liu, Chao Lei, Dejiang Zhu, Ke Yang, Zheming Li, Jialei Cui, Weixiao Huang, Yikai Zhao, Yineng Zhang, Hao Wu, Xiang Gao, Yuhao Fu, Jinlei Jiang, Yongwei Wu, Mingxing Zhang

Abstract: Modern GPU clusters are built upon a complex hierarchy of heterogeneous interconnects, ranging from multi-rail RDMA to proprietary fabrics such as Multi-Node NVLink and Ascend UB. Orchestrating these diverse links effectively remains a critical challenge in disaggregated LLM serving. Operating Mooncake TE on thousands of GPUs exposed a critical limitation shared by existing frameworks: imperative,… ▽ More Modern GPU clusters are built upon a complex hierarchy of heterogeneous interconnects, ranging from multi-rail RDMA to proprietary fabrics such as Multi-Node NVLink and Ascend UB. Orchestrating these diverse links effectively remains a critical challenge in disaggregated LLM serving. Operating Mooncake TE on thousands of GPUs exposed a critical limitation shared by existing frameworks: imperative, statically bound path selection. This rigidity forces engines to rely on state-blind striping that ignores congestion signals, creating communication silos, wasting multi-rail bandwidth due to head-of-line blocking, and leading to operational fragility where routine faults require manual intervention. We present TENT, a data-movement engine that decouples transfer intent from physical execution. Instead of locking workloads to fixed backends, TENT unifies heterogeneous interconnects into a single dynamic resource pool. Applications simply declare transfer intents, while TENT dynamically decomposes elephant flows into fine-grained slices and "sprays" them across links based on instantaneous link quality. This telemetry-driven orchestration eliminates head-of-line blocking and enables transparent, sub-50 ms self-healing by rerouting slices around failures without application logic. TENT serves as the production data plane for LLM inference and RL pipelines at multiple industrial sites. Our evaluation on H800 HGX clusters shows that TENT outperforms state-of-the-art baselines, including Mooncake TE, NIXL, and UCCL. In LLM inference with SGLang HiCache, TENT achieves up to 1.36x higher throughput and 26% lower P90 TTFT than Mooncake TE. In RL pipelines, TENT accelerates parameter updates in Moonshot Checkpoint Engine by 20-26%. △ Less

Submitted 31 March, 2026; originally announced April 2026.

arXiv:2603.16008 [pdf, ps, other]

CoDesignAI: An AI-Enabled Multi-Agent, Multi-User System for Collaborative Urban Design at the Conceptual Stage

Authors: Zhaoxi Zhang, Ruolin Wu, Feiyang Ren, Sridevi Turaga, Tamir Mendel

Abstract: Public participation has become increasingly important in collaborative urban design; yet, existing processes often face challenges in achieving efficient and scalable citizen engagement. To address this gap, this study explores how large language models (LLMs) can support cooperation among community members in participatory design. We introduce CoDesignAI, a collaborative urban design tool that c… ▽ More Public participation has become increasingly important in collaborative urban design; yet, existing processes often face challenges in achieving efficient and scalable citizen engagement. To address this gap, this study explores how large language models (LLMs) can support cooperation among community members in participatory design. We introduce CoDesignAI, a collaborative urban design tool that combines multiple users, representing residents or stakeholders, with multiple AI agents, representing domain experts who provide facilitation and professional knowledge during the conceptual stage of urban design. This paper presents the system architecture and main components of the tool, illustrating how users interact with AI agents within a collaborative and iterative design workflow. Specifically, the system integrates generative AI with spatial mapping services to support street-level visualization of design proposals. AI agents assist users by summarizing discussion content, extracting shared design intentions, and generating prompts for presenting design interventions. The system also enables users to revise and refine their ideas over multiple rounds while documenting the design process. By combining conversational AI, multi-user interaction, and image-based design grounded in real-world urban contexts, this study argues that AI-enabled design systems can help shift urban design from an expert-centered practice to a more open and participatory process. The paper contributes a new web-based platform for AI-assisted collaborative design and offers an early exploration of how AI agents may expand the capacity for public participation in urban design. △ Less

Submitted 16 March, 2026; originally announced March 2026.

arXiv:2603.10637 [pdf, ps, other]

Q-StaR: A Quasi-Static Routing Scheme for NoCs

Authors: Yang Zhang, Yiren Zhao, Xu Wang, Fengyuan Ren

Abstract: In networks-on-chip, static routing schemes are favored for their simplicity and predictability, but they cannot effectively balance network load due to the unawareness of runtime load distribution. Q-StaR discovers two factors (topology and traffic distribution) that determine the long-term trend of load distribution, and proposes N-Rank to extract this trend. The obtained information is used to… ▽ More In networks-on-chip, static routing schemes are favored for their simplicity and predictability, but they cannot effectively balance network load due to the unawareness of runtime load distribution. Q-StaR discovers two factors (topology and traffic distribution) that determine the long-term trend of load distribution, and proposes N-Rank to extract this trend. The obtained information is used to guide BiDOR's route selection at runtime, thereby improving load balancing while retaining simplicity and predictability. Simulation validates that Q-StaR significantly outperforms the typical dimension-order routing (throughput under uniform traffic improved by 42.9\%, and mean/maximum latency under realistic workloads reduced by 86.4\%/95.3\%). △ Less

Submitted 11 March, 2026; originally announced March 2026.

Comments: 7 pages,9 figures

arXiv:2603.10088 [pdf, ps, other]

ES-dLLM: Efficient Inference for Diffusion Large Language Models by Early-Skipping

Authors: Zijian Zhu, Fei Ren, Zhanhong Tan, Kaisheng Ma

Abstract: Diffusion large language models (dLLMs) are emerging as a promising alternative to autoregressive models (ARMs) due to their ability to capture bidirectional context and the potential for parallel generation. Despite the advantages, dLLM inference remains computationally expensive as the full input context is processed at every iteration. In this work, we analyze the generation dynamics of dLLMs a… ▽ More Diffusion large language models (dLLMs) are emerging as a promising alternative to autoregressive models (ARMs) due to their ability to capture bidirectional context and the potential for parallel generation. Despite the advantages, dLLM inference remains computationally expensive as the full input context is processed at every iteration. In this work, we analyze the generation dynamics of dLLMs and find that intermediate representations, including key, value, and hidden states, change only subtly across successive iterations. Leveraging this insight, we propose \textbf{ES-dLLM}, a training-free inference acceleration framework for dLLM that reduces computation by skipping tokens in early layers based on the estimated importance. Token importance is computed with intermediate tensor variation and confidence scores of previous iterations. Experiments on LLaDA-8B and Dream-7B demonstrate that ES-dLLM achieves throughput of up to 226.57 and 308.51 tokens per second (TPS), respectively, on an NVIDIA H200 GPU, delivering 5.6$\times$ to 16.8$\times$ speedup over the vanilla implementation and up to 1.85$\times$ over the state-of-the-art caching method, while preserving generation quality. △ Less

Submitted 10 March, 2026; originally announced March 2026.

Comments: Accepted at ICLR 2026

arXiv:2603.00632 [pdf, ps, other]

Stop Treating Collisions Equally: Qualification-Aware Semantic ID Learning for Recommendation at Industrial Scale

Authors: Zheng Hu, Yuxin Chen, Yongsen Pan, Xu Yuan, Yuting Yin, Daoyuan Wang, Boyang Xia, Zefei Luo, Hongyang Wang, Songhao Ni, Dongxu Liang, Jun Wang, Shimin Cai, Tao Zhou, Fuji Ren, Wenwu Ou

Abstract: Semantic IDs (SIDs) are compact discrete representations derived from multimodal item features, serving as a unified abstraction for ID-based and generative recommendation. However, learning high-quality SIDs remains challenging due to two issues. (1) Collision problem: the quantized token space is prone to collisions, in which semantically distinct items are assigned identical or overly similar S… ▽ More Semantic IDs (SIDs) are compact discrete representations derived from multimodal item features, serving as a unified abstraction for ID-based and generative recommendation. However, learning high-quality SIDs remains challenging due to two issues. (1) Collision problem: the quantized token space is prone to collisions, in which semantically distinct items are assigned identical or overly similar SID compositions, resulting in semantic entanglement. (2) Collision-signal heterogeneity: collisions are not uniformly harmful. Some reflect genuine conflicts between semantically unrelated items, while others stem from benign redundancy or systematic data effects. To address these challenges, we propose Qualification-Aware Semantic ID Learning (QuaSID), an end-to-end framework that learns collision-qualified SIDs by selectively repelling qualified conflict pairs and scaling the repulsion strength by collision severity. QuaSID consists of two mechanisms: Hamming-guided Margin Repulsion, which translates low-Hamming SID overlaps into explicit, severity-scaled geometric constraints on the encoder space; and Conflict-Aware Valid Pair Masking, which masks protocol-induced benign overlaps to denoise repulsion supervision. In addition, QuaSID incorporates a dual-tower contrastive objective to inject collaborative signals into tokenization. Experiments on public benchmarks and industrial data validate QuaSID. On public datasets, QuaSID consistently outperforms strong baselines, improving top-K ranking quality by 5.9% over the best baseline while increasing SID composition diversity. In an online A/B test on Kuaishou e-commerce with a 5% traffic split, QuaSID increases ranking GMV-S2 by 2.38% and improves completed orders on cold-start retrieval by up to 6.42%. Finally, we show that the proposed repulsion loss is plug-and-play and enhances a range of SID learning frameworks across datasets. △ Less

Submitted 28 February, 2026; originally announced March 2026.

arXiv:2602.03786 [pdf, ps, other]

AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration

Authors: Jianhao Ruan, Zhihao Xu, Yiran Peng, Fashen Ren, Zhaoyang Yu, Xinbing Liang, Jinyu Xiang, Yongru Chen, Bang Liu, Chenglin Wu, Yuyu Luo, Jiayi Zhang

Abstract: Language agents have shown strong promise for task automation. Realizing this promise for increasingly complex, long-horizon tasks has driven the rise of a sub-agent-as-tools paradigm for multi-turn task solving. However, existing designs still lack a dynamic abstraction view of sub-agents, thereby hurting adaptability. We address this challenge with a unified, framework-agnostic agent abstraction… ▽ More Language agents have shown strong promise for task automation. Realizing this promise for increasingly complex, long-horizon tasks has driven the rise of a sub-agent-as-tools paradigm for multi-turn task solving. However, existing designs still lack a dynamic abstraction view of sub-agents, thereby hurting adaptability. We address this challenge with a unified, framework-agnostic agent abstraction that models any agent as a tuple Instruction, Context, Tools, Model. This tuple acts as a compositional recipe for capabilities, enabling the system to spawn specialized executors for each task on demand. Building on this abstraction, we introduce an agentic system AOrchestra, where the central orchestrator concretizes the tuple at each step: it curates task-relevant context, selects tools and models, and delegates execution via on-the-fly automatic agent creation. Such designs enable reducing human engineering efforts, and remain framework-agnostic with plug-and-play support for diverse agents as task executors. It also enables a controllable performance-cost trade-off, allowing the system to approach Pareto-efficient. Across three challenging benchmarks (GAIA, SWE-Bench, Terminal-Bench), AOrchestra achieves 16.28% relative improvement against the strongest baseline when paired with Gemini-3-Flash. The code is available at: https://github.com/FoundationAgents/AOrchestra △ Less

Submitted 7 February, 2026; v1 submitted 3 February, 2026; originally announced February 2026.

arXiv:2601.16527 [pdf, ps, other]

Beyond Superficial Unlearning: Sharpness-Aware Robust Erasure of Hallucinations in Multimodal LLMs

Authors: Xianya Fang, Feiyang Ren, Xiang Chen, Yu Tian, Zhen Bi, Haiyang Yu, Sheng-Jun Huang

Abstract: Multimodal LLMs are powerful but prone to object hallucinations, which describe non-existent entities and harm reliability. While recent unlearning methods attempt to mitigate this, we identify a critical flaw: structural fragility. We empirically demonstrate that standard erasure achieves only superficial suppression, trapping the model in sharp minima where hallucinations catastrophically resurg… ▽ More Multimodal LLMs are powerful but prone to object hallucinations, which describe non-existent entities and harm reliability. While recent unlearning methods attempt to mitigate this, we identify a critical flaw: structural fragility. We empirically demonstrate that standard erasure achieves only superficial suppression, trapping the model in sharp minima where hallucinations catastrophically resurge after lightweight relearning. To ensure geometric stability, we propose SARE, which casts unlearning as a targeted min-max optimization problem and uses a Targeted-SAM mechanism to explicitly flatten the loss landscape around hallucinated concepts. By suppressing hallucinations under simulated worst-case parameter perturbations, our framework ensures robust removal stable against weight shifts. Extensive experiments demonstrate that SARE significantly outperforms baselines in erasure efficacy while preserving general generation quality. Crucially, it maintains persistent hallucination suppression against relearning and parameter updates, validating the effectiveness of geometric stabilization. △ Less

Submitted 23 January, 2026; originally announced January 2026.

arXiv:2601.07507 [pdf, ps, other]

High-Rank Structured Modulation for Parameter-Efficient Fine-Tuning

Authors: Yongkang Liu, Xing Li, Mengjie Zhao, Shanru Zhang, Zijing Wang, Qian Li, Shi Feng, Feiliang Ren, Daling Wang, Hinrich Schütze

Abstract: As the number of model parameters increases, parameter-efficient fine-tuning (PEFT) has become the go-to choice for tailoring pre-trained large language models. Low-rank Adaptation (LoRA) uses a low-rank update method to simulate full parameter fine-tuning, which is widely used to reduce resource requirements. However, decreasing the rank encounters challenges with limited representational capacit… ▽ More As the number of model parameters increases, parameter-efficient fine-tuning (PEFT) has become the go-to choice for tailoring pre-trained large language models. Low-rank Adaptation (LoRA) uses a low-rank update method to simulate full parameter fine-tuning, which is widely used to reduce resource requirements. However, decreasing the rank encounters challenges with limited representational capacity when compared to full parameter fine-tuning. We present \textbf{SMoA}, a high-rank \textbf{S}tructured \textbf{MO}dulation \textbf{A}dapter that uses fewer trainable parameters while maintaining a higher rank, thereby improving the model's representational capacity and offering improved performance potential. The core idea is to freeze the original pretrained weights and selectively amplify or suppress important features of the original weights across multiple subspaces. The subspace mechanism provides an efficient way to increase the capacity and complexity of a model. We conduct both theoretical analyses and empirical studies on various tasks. Experiment results show that SMoA outperforms LoRA and its variants on 10 tasks, with extensive ablation studies validating its effectiveness. △ Less

Submitted 12 January, 2026; originally announced January 2026.

Comments: under review

arXiv:2512.12530 [pdf, ps, other]

Principled Performance Tunability in Operating System Kernels

Authors: Zhongjie Chen, Wentao Zhang, Yulong Tang, Ran Shu, Fengyuan Ren, Tianyin Xu, Jing Liu

Abstract: The Linux kernel source code contains numerous constant values that critically influence system performance. Many of these constants, which we term perf-consts, are magic numbers that encode brittle assumptions about hardware and workloads. As systems and workloads evolve, such constants often become suboptimal. Unfortunately, deployed kernels lack support for safe and efficient in-situ tuning of… ▽ More The Linux kernel source code contains numerous constant values that critically influence system performance. Many of these constants, which we term perf-consts, are magic numbers that encode brittle assumptions about hardware and workloads. As systems and workloads evolve, such constants often become suboptimal. Unfortunately, deployed kernels lack support for safe and efficient in-situ tuning of perf-consts without a long and disruptive process of rebuilding and redeploying the kernel image. This paper advocates principled OS performance tunability. We present KernelX, a system that provides a safe, efficient, and programmable interface for in-situ tuning of arbitrary perf-consts on a running kernel. KernelX transforms any perf-const into a tunable knob on demand using a novel mechanism called Scoped Indirect Execution (SIE). SIE precisely identifies the binary boundaries where a perf-const influences system state and redirects execution to synthesized instructions that update the state as if new values were used. KernelX goes beyond version atomicity to guarantee side-effect safety, a property not provided by existing kernel update mechanisms. KernelX also provides a programmable interface that allows policies to incorporate application hints, hardware heuristics, and fine-grained isolation, without modifying kernel source code or disrupting deployed OS kernels. Case studies across multiple kernel subsystems demonstrate that KernelX enables significant performance improvements by making previously untunable perf-consts safely tunable at runtime, while supporting millisecond-scale policy updates. △ Less

Submitted 13 December, 2025; originally announced December 2025.

Comments: 12 pages

arXiv:2511.10325 [pdf, ps, other]

TMDC: A Two-Stage Modality Denoising and Complementation Framework for Multimodal Sentiment Analysis with Missing and Noisy Modalities

Authors: Yan Zhuang, Minhao Liu, Yanru Zhang, Jiawen Deng, Fuji Ren

Abstract: Multimodal Sentiment Analysis (MSA) aims to infer human sentiment by integrating information from multiple modalities such as text, audio, and video. In real-world scenarios, however, the presence of missing modalities and noisy signals significantly hinders the robustness and accuracy of existing models. While prior works have made progress on these issues, they are typically addressed in isolati… ▽ More Multimodal Sentiment Analysis (MSA) aims to infer human sentiment by integrating information from multiple modalities such as text, audio, and video. In real-world scenarios, however, the presence of missing modalities and noisy signals significantly hinders the robustness and accuracy of existing models. While prior works have made progress on these issues, they are typically addressed in isolation, limiting overall effectiveness in practical settings. To jointly mitigate the challenges posed by missing and noisy modalities, we propose a framework called Two-stage Modality Denoising and Complementation (TMDC). TMDC comprises two sequential training stages. In the Intra-Modality Denoising Stage, denoised modality-specific and modality-shared representations are extracted from complete data using dedicated denoising modules, reducing the impact of noise and enhancing representational robustness. In the Inter-Modality Complementation Stage, these representations are leveraged to compensate for missing modalities, thereby enriching the available information and further improving robustness. Extensive evaluations on MOSI, MOSEI, and IEMOCAP demonstrate that TMDC consistently achieves superior performance compared to existing methods, establishing new state-of-the-art results. △ Less

Submitted 13 November, 2025; originally announced November 2025.

Comments: Accepted by AAAI 2026

arXiv:2510.24668 [pdf, ps, other]

InteractComp: Evaluating Search Agents With Ambiguous Queries

Authors: Mingyi Deng, Lijun Huang, Yani Fan, Jiayi Zhang, Fashen Ren, Jinyi Bai, Fuzhen Yang, Dayi Miao, Zhaoyang Yu, Yifan Wu, Yanfei Zhang, Fengwei Teng, Yingjia Wan, Song Hu, Yude Li, Xin Jin, Conghao Hu, Haoyu Li, Qirui Fu, Tai Zhong, Xinyu Wang, Xiangru Tang, Nan Tang, Chenglin Wu, Yuyu Luo

Abstract: Language agents have demonstrated remarkable potential in web search and information retrieval. However, these search agents assume user queries are complete and unambiguous, an assumption that diverges from reality where users begin with incomplete queries requiring clarification through interaction. Yet most agents lack interactive mechanisms during the search process, and existing benchmarks ca… ▽ More Language agents have demonstrated remarkable potential in web search and information retrieval. However, these search agents assume user queries are complete and unambiguous, an assumption that diverges from reality where users begin with incomplete queries requiring clarification through interaction. Yet most agents lack interactive mechanisms during the search process, and existing benchmarks cannot assess this capability. To address this gap, we introduce InteractComp, a benchmark designed to evaluate whether search agents can recognize query ambiguity and actively interact to resolve it during search. Following the principle of easy to verify, interact to disambiguate, we construct 210 expert-curated questions across 9 domains through a target-distractor methodology that creates genuine ambiguity resolvable only through interaction. Evaluation of 17 models reveals striking failure: the best model achieves only 13.73% accuracy despite 71.50% with complete context, exposing systematic overconfidence rather than reasoning deficits. Forced interaction produces dramatic gains, demonstrating latent capability current strategies fail to engage. Longitudinal analysis shows interaction capabilities stagnated over 15 months while search performance improved seven-fold, revealing a critical blind spot. This stagnation, coupled with the immediate feedback inherent to search tasks, makes InteractComp a valuable resource for both evaluating and training interaction capabilities in search agents. The code is available at https://github.com/FoundationAgents/InteractComp. △ Less

Submitted 28 October, 2025; originally announced October 2025.

arXiv:2510.24577 [pdf]

Physics-Informed Extreme Learning Machine (PIELM): Opportunities and Challenges

Authors: He Yang, Fei Ren, Francesco Calabro, Hai-Sui Yu, Xiaohui Chen, Pei-Zhi Zhuang

Abstract: We are delighted to see the recent development of physics-informed extreme learning machine (PIELM) for its higher computational efficiency and accuracy compared to other physics-informed machine learning (PIML) paradigms. Since a comprehensive summary or review of PIELM is currently unavailable, we would like to take this opportunity to share our perspectives and experiences on this promising res… ▽ More We are delighted to see the recent development of physics-informed extreme learning machine (PIELM) for its higher computational efficiency and accuracy compared to other physics-informed machine learning (PIML) paradigms. Since a comprehensive summary or review of PIELM is currently unavailable, we would like to take this opportunity to share our perspectives and experiences on this promising research direction. We can see that many efforts have been made to solve ordinary/partial differential equations (ODEs/PDEs) characterized by sharp gradients, nonlinearities, high-frequency behavior, hard constraints, uncertainty, multiphysics coupling, and interpretability. Despite these encouraging successes, many pressing challenges remain to be tackled, which also provides opportunities to develop more robust, interpretable, and generalizable PIELM frameworks for scientific and engineering applications. △ Less

Submitted 2 November, 2025; v1 submitted 28 October, 2025; originally announced October 2025.

arXiv:2510.21426 [pdf]

A Rapid Physics-Informed Machine Learning Framework Based on Extreme Learning Machine for Inverse Stefan Problems

Authors: Pei-Zhi Zhuang, Ming-Yue Yang, Fei Ren, Hong-Ya Yue, He Yang

Abstract: The inverse Stefan problem, as a typical phase-change problem with moving boundaries, finds extensive applications in science and engineering. Recent years have seen the applications of physics-informed neural networks (PINNs) to solving Stefan problems, yet they still exhibit shortcomings in hyperparameter dependency, training efficiency, and prediction accuracy. To address this, this paper devel… ▽ More The inverse Stefan problem, as a typical phase-change problem with moving boundaries, finds extensive applications in science and engineering. Recent years have seen the applications of physics-informed neural networks (PINNs) to solving Stefan problems, yet they still exhibit shortcomings in hyperparameter dependency, training efficiency, and prediction accuracy. To address this, this paper develops a physics-informed extreme learning machine (PIELM), a rapid physics-informed learning method framework for inverse Stefan problems. PIELM replaces conventional deep neural networks with an extreme learning machine network. The input weights are fixed in the PIELM framework, and the output weights are determined by optimizing a loss vector of physical laws composed by initial and boundary conditions and governing partial differential equations (PDEs). Then, solving inverse Stefan problems is transformed into finding the Moore-Penrose generalized inverse by the least squares method. Case studies show that the PIELM can increase the prediction accuracy by 3-7 order of magnitude in terms of the relative L2 error, and meanwhile saving more than 94% training time, compared to conventional PINNs. △ Less

Submitted 24 October, 2025; originally announced October 2025.

arXiv:2510.19980 [pdf, ps, other]

Abstain Mask Retain Core: Time Series Prediction by Adaptive Masking Loss with Representation Consistency

Authors: Renzhao Liang, Sizhe Xu, Chenggang Xie, Jingru Chen, Feiyang Ren, Shu Yang, Takahiro Yabe

Abstract: Time series forecasting plays a pivotal role in critical domains such as energy management and financial markets. Although deep learning-based approaches (e.g., MLP, RNN, Transformer) have achieved remarkable progress, the prevailing "long-sequence information gain hypothesis" exhibits inherent limitations. Through systematic experimentation, this study reveals a counterintuitive phenomenon: appro… ▽ More Time series forecasting plays a pivotal role in critical domains such as energy management and financial markets. Although deep learning-based approaches (e.g., MLP, RNN, Transformer) have achieved remarkable progress, the prevailing "long-sequence information gain hypothesis" exhibits inherent limitations. Through systematic experimentation, this study reveals a counterintuitive phenomenon: appropriately truncating historical data can paradoxically enhance prediction accuracy, indicating that existing models learn substantial redundant features (e.g., noise or irrelevant fluctuations) during training, thereby compromising effective signal extraction. Building upon information bottleneck theory, we propose an innovative solution termed Adaptive Masking Loss with Representation Consistency (AMRC), which features two core components: 1) Dynamic masking loss, which adaptively identified highly discriminative temporal segments to guide gradient descent during model training; 2) Representation consistency constraint, which stabilized the mapping relationships among inputs, labels, and predictions. Experimental results demonstrate that AMRC effectively suppresses redundant feature learning while significantly improving model performance. This work not only challenges conventional assumptions in temporal modeling but also provides novel theoretical insights and methodological breakthroughs for developing efficient and robust forecasting models. △ Less

Submitted 22 October, 2025; originally announced October 2025.

Comments: 20 pages, 4 figures. Accepted as Spotlight poster in NeurIPS 2025

arXiv:2510.12293 [pdf]

General Fourier Feature Physics-Informed Extreme Learning Machine (GFF-PIELM) for High-Frequency PDEs

Authors: Fei Ren, Sifan Wang, Pei-Zhi Zhuang, Hai-Sui Yu, He Yang

Abstract: Conventional physics-informed extreme learning machine (PIELM) often faces challenges in solving partial differential equations (PDEs) involving high-frequency and variable-frequency behaviors. To address these challenges, we propose a general Fourier feature physics-informed extreme learning machine (GFF-PIELM). We demonstrate that directly concatenating multiple Fourier feature mappings (FFMs) a… ▽ More Conventional physics-informed extreme learning machine (PIELM) often faces challenges in solving partial differential equations (PDEs) involving high-frequency and variable-frequency behaviors. To address these challenges, we propose a general Fourier feature physics-informed extreme learning machine (GFF-PIELM). We demonstrate that directly concatenating multiple Fourier feature mappings (FFMs) and an extreme learning machine (ELM) network makes it difficult to determine frequency-related hyperparameters. Fortunately, we find an alternative to establish the GFF-PIELM in three main steps. First, we integrate a variation of FFM into ELM as the Fourier-based activation function, so there is still one hidden layer in the GFF-PIELM framework. Second, we assign a set of frequency coefficients to the hidden neurons, which enables ELM network to capture diverse frequency components of target solutions. Finally, we develop an innovative, straightforward initialization method for these hyperparameters by monitoring the distribution of ELM output weights. GFF-PIELM not only retains the high accuracy, efficiency, and simplicity of the PIELM framework but also inherits the ability of FFMs to effectively handle high-frequency problems. We carry out five case studies with a total of ten numerical examples to highlight the feasibility and validity of the proposed GFF-PIELM, involving high frequency, variable frequency, multi-scale behaviour, irregular boundary and inverse problems. Compared to conventional PIELM, the GFF-PIELM approach significantly improves predictive accuracy without additional cost in training time and architecture complexity. Our results confirm that that PIELM can be extended to solve high-frequency and variable-frequency PDEs with high accuracy, and our initialization strategy may further inspire advances in other physics-informed machine learning (PIML) frameworks. △ Less

Submitted 14 October, 2025; originally announced October 2025.

arXiv:2510.00698 [pdf]

Physics-Informed Extreme Learning Machine (PIELM) for Tunnelling-Induced Soil-Pile Interactions

Authors: Fu-Chen Guo, Pei-Zhi Zhuang, Fei Ren, Hong-Ya Yue, He Yang

Abstract: Physics-informed machine learning has been a promising data-driven and physics-informed approach in geotechnical engineering. This study proposes a physics-informed extreme learning machine (PIELM) framework for analyzing tunneling-induced soil-pile interactions. The pile foundation is modeled as an Euler-Bernoulli beam, and the surrounding soil is modeled as a Pasternak foundation. The soil-pile… ▽ More Physics-informed machine learning has been a promising data-driven and physics-informed approach in geotechnical engineering. This study proposes a physics-informed extreme learning machine (PIELM) framework for analyzing tunneling-induced soil-pile interactions. The pile foundation is modeled as an Euler-Bernoulli beam, and the surrounding soil is modeled as a Pasternak foundation. The soil-pile interaction is formulated into a fourth-order ordinary differential equation (ODE) that constitutes the physics-informed component, while measured data are incorporated into PIELM as the data-driven component. Combining physics and data yields a loss vector of the extreme learning machine (ELM) network, which is trained within 1 second by the least squares method. After validating the PIELM approach by the boundary element method (BEM) and finite difference method (FDM), parametric studies are carried out to examine the effects of ELM network architecture, data monitoring locations and numbers on the performance of PIELM. The results indicate that monitored data should be placed at positions where the gradients of pile deflections are significant, such as at the pile tip/top and near tunneling zones. Two application examples highlight the critical role of physics-informed and data-driven approach for tunnelling-induced soil-pile interactions. The proposed approach shows great potential for real-time monitoring and safety assessment of pile foundations, and benefits for intelligent early-warning systems in geotechnical engineering. △ Less

Submitted 1 October, 2025; originally announced October 2025.

arXiv:2509.21151 [pdf, ps, other]

Retrieval over Classification: Integrating Relation Semantics for Multimodal Relation Extraction

Authors: Lei Hei, Tingjing Liao, Yingxin Pei, Yiyang Qi, Jiaqi Wang, Ruiting Li, Feiliang Ren

Abstract: Relation extraction (RE) aims to identify semantic relations between entities in unstructured text. Although recent work extends traditional RE to multimodal scenarios, most approaches still adopt classification-based paradigms with fused multimodal features, representing relations as discrete labels. This paradigm has two significant limitations: (1) it overlooks structural constraints like entit… ▽ More Relation extraction (RE) aims to identify semantic relations between entities in unstructured text. Although recent work extends traditional RE to multimodal scenarios, most approaches still adopt classification-based paradigms with fused multimodal features, representing relations as discrete labels. This paradigm has two significant limitations: (1) it overlooks structural constraints like entity types and positional cues, and (2) it lacks semantic expressiveness for fine-grained relation understanding. We propose \underline{R}etrieval \underline{O}ver \underline{C}lassification (ROC), a novel framework that reformulates multimodal RE as a retrieval task driven by relation semantics. ROC integrates entity type and positional information through a multimodal encoder, expands relation labels into natural language descriptions using a large language model, and aligns entity-relation pairs via semantic similarity-based contrastive learning. Experiments show that our method achieves state-of-the-art performance on the benchmark datasets MNRE and MORE and exhibits stronger robustness and interpretability. △ Less

Submitted 25 September, 2025; originally announced September 2025.

Comments: Accepted by EMNLP 2025 Main Conference

arXiv:2509.12875 [pdf, ps, other]

LTA-thinker: Latent Thought-Augmented Training Framework for Large Language Models on Complex Reasoning

Authors: Jiaqi Wang, Binquan Ji, Haibo Luo, Yiyang Qi, Ruiting Li, Huiyan Wang, Yuantao Han, Cangyi Yang, jiaxu Zhang, Feiliang Ren

Abstract: Complex Reasoning in Large Language Models can be dynamically optimized using Test-Time Scaling (TTS) to mitigate Overthinking. Methods such as Coconut, SoftCoT and its variant are effective in continuous latent space inference, the core bottleneck still lies in the efficient generation and utilization of high-quality Latent Thought. Drawing from the theory of SoftCoT++ that a larger variance in t… ▽ More Complex Reasoning in Large Language Models can be dynamically optimized using Test-Time Scaling (TTS) to mitigate Overthinking. Methods such as Coconut, SoftCoT and its variant are effective in continuous latent space inference, the core bottleneck still lies in the efficient generation and utilization of high-quality Latent Thought. Drawing from the theory of SoftCoT++ that a larger variance in the generated Latent Thought distribution more closely approximates the golden truth distribution, we propose a Latent Thought-Augmented Training Framework--LTA-Thinker, which improves distributional variance and enhances reasoning performance from two perspectives. First, LTA-Thinker constructs a Latent Thought generation architecture based on a learnable prior. This architecture aims to increase the variance distribution of generated Latent Thought Vectors in order to simplify the overall structure and raise the performance ceiling. Second, LTA-Thinker introduces a distribution-based directional optimization paradigm that jointly constrains both distribution locality and distribution scale. This mechanism improves information efficiency and computational cost through a multi-objective co-training strategy, which combines standard Supervised Fine-Tuning (SFT) loss with two novel losses: Semantic Alignment Loss, which utilizes KL divergence to ensure that the Latent Thought is highly relevant to the semantics of the question; Reasoning Focus Loss, which utilizes a contrastive learning mechanism to guide the model to focus on the most critical reasoning steps. Experiments show that LTA-thinker achieves state-of-the-art (SOTA) performance among various baselines and demonstrates a higher performance ceiling and better scaling effects. △ Less

Submitted 16 December, 2025; v1 submitted 16 September, 2025; originally announced September 2025.

arXiv:2509.12811 [pdf, ps, other]

ConvergeWriter: Data-Driven Bottom-Up Article Construction

Authors: Binquan Ji, Jiaqi Wang, Ruiting Li, Xingchen Han, Yiyang Qi, Shichao Wang, Yifei Lu, Yuantao Han, Feiliang Ren

Abstract: Large Language Models (LLMs) have shown remarkable prowess in text generation, yet producing long-form, factual documents grounded in extensive external knowledge bases remains a significant challenge. Existing "top-down" methods, which first generate a hypothesis or outline and then retrieve evidence, often suffer from a disconnect between the model's plan and the available knowledge, leading to… ▽ More Large Language Models (LLMs) have shown remarkable prowess in text generation, yet producing long-form, factual documents grounded in extensive external knowledge bases remains a significant challenge. Existing "top-down" methods, which first generate a hypothesis or outline and then retrieve evidence, often suffer from a disconnect between the model's plan and the available knowledge, leading to content fragmentation and factual inaccuracies. To address these limitations, we propose a novel "bottom-up," data-driven framework that inverts the conventional generation pipeline. Our approach is predicated on a "Retrieval-First for Knowledge, Clustering for Structure" strategy, which first establishes the "knowledge boundaries" of the source corpus before any generative planning occurs. Specifically, we perform exhaustive iterative retrieval from the knowledge base and then employ an unsupervised clustering algorithm to organize the retrieved documents into distinct "knowledge clusters." These clusters form an objective, data-driven foundation that directly guides the subsequent generation of a hierarchical outline and the final document content. This bottom-up process ensures that the generated text is strictly constrained by and fully traceable to the source material, proactively adapting to the finite scope of the knowledge base and fundamentally mitigating the risk of hallucination. Experimental results on both 14B and 32B parameter models demonstrate that our method achieves performance comparable to or exceeding state-of-the-art baselines, and is expected to demonstrate unique advantages in knowledge-constrained scenarios that demand high fidelity and structural coherence. Our work presents an effective paradigm for generating reliable, structured, long-form documents, paving the way for more robust LLM applications in high-stakes, knowledge-intensive domains. △ Less

Submitted 16 September, 2025; originally announced September 2025.

arXiv:2509.11628 [pdf, ps, other]

doi 10.1145/3746027.3755331

SpeCa: Accelerating Diffusion Transformers with Speculative Feature Caching

Authors: Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Fei Ren, Shaobo Wang, Kaixin Li, Linfeng Zhang

Abstract: Diffusion models have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. These models face two fundamental challenges: strict temporal dependencies preventing parallelization, and computationally intensive forward passes required at each denoising step. Drawing inspiration from speculative decoding in large languag… ▽ More Diffusion models have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. These models face two fundamental challenges: strict temporal dependencies preventing parallelization, and computationally intensive forward passes required at each denoising step. Drawing inspiration from speculative decoding in large language models, we present SpeCa, a novel 'Forecast-then-verify' acceleration framework that effectively addresses both limitations. SpeCa's core innovation lies in introducing Speculative Sampling to diffusion models, predicting intermediate features for subsequent timesteps based on fully computed reference timesteps. Our approach implements a parameter-free verification mechanism that efficiently evaluates prediction reliability, enabling real-time decisions to accept or reject each prediction while incurring negligible computational overhead. Furthermore, SpeCa introduces sample-adaptive computation allocation that dynamically modulates resources based on generation complexity, allocating reduced computation for simpler samples while preserving intensive processing for complex instances. Experiments demonstrate 6.34x acceleration on FLUX with minimal quality degradation (5.5% drop), 7.3x speedup on DiT while preserving generation fidelity, and 79.84% VBench score at 6.1x acceleration for HunyuanVideo. The verification mechanism incurs minimal overhead (1.67%-3.5% of full inference costs), establishing a new paradigm for efficient diffusion model inference while maintaining generation quality even at aggressive acceleration ratios. Our codes have been released in Github: \textbf{https://github.com/Shenyi-Z/Cache4Diffusion} △ Less

Submitted 15 September, 2025; originally announced September 2025.

Comments: 15 pages, 9 figures, ACM Multimedia 2025

arXiv:2508.17062 [pdf, ps, other]

SSG-Dit: A Spatial Signal Guided Framework for Controllable Video Generation

Authors: Peng Hu, Yu Gu, Liang Luo, Fuji Ren

Abstract: Controllable video generation aims to synthesize video content that aligns precisely with user-provided conditions, such as text descriptions and initial images. However, a significant challenge persists in this domain: existing models often struggle to maintain strong semantic consistency, frequently generating videos that deviate from the nuanced details specified in the prompts. To address this… ▽ More Controllable video generation aims to synthesize video content that aligns precisely with user-provided conditions, such as text descriptions and initial images. However, a significant challenge persists in this domain: existing models often struggle to maintain strong semantic consistency, frequently generating videos that deviate from the nuanced details specified in the prompts. To address this issue, we propose SSG-DiT (Spatial Signal Guided Diffusion Transformer), a novel and efficient framework for high-fidelity controllable video generation. Our approach introduces a decoupled two-stage process. The first stage, Spatial Signal Prompting, generates a spatially aware visual prompt by leveraging the rich internal representations of a pre-trained multi-modal model. This prompt, combined with the original text, forms a joint condition that is then injected into a frozen video DiT backbone via our lightweight and parameter-efficient SSG-Adapter. This unique design, featuring a dual-branch attention mechanism, allows the model to simultaneously harness its powerful generative priors while being precisely steered by external spatial signals. Extensive experiments demonstrate that SSG-DiT achieves state-of-the-art performance, outperforming existing models on multiple key metrics in the VBench benchmark, particularly in spatial relationship control and overall consistency. △ Less

Submitted 23 August, 2025; originally announced August 2025.

arXiv:2508.02173 [pdf, ps, other]

doi 10.1145/3746059.3747659

EchoLadder: Progressive AI-Assisted Design of Immersive VR Scenes

Authors: Zhuangze Hou, Jingze Tian, Nianlong Li, Farong Ren, Can Liu

Abstract: Mixed reality platforms allow users to create virtual environments, yet novice users struggle with both ideation and execution in spatial design. While existing AI models can automatically generate scenes based on user prompts, the lack of interactive control limits users' ability to iteratively steer the output. In this paper, we present EchoLadder, a novel human-AI collaboration pipeline that le… ▽ More Mixed reality platforms allow users to create virtual environments, yet novice users struggle with both ideation and execution in spatial design. While existing AI models can automatically generate scenes based on user prompts, the lack of interactive control limits users' ability to iteratively steer the output. In this paper, we present EchoLadder, a novel human-AI collaboration pipeline that leverages large vision-language model (LVLM) to support interactive scene modification in virtual reality. EchoLadder accepts users' verbal instructions at varied levels of abstraction and spatial specificity, generates concrete design suggestions throughout a progressive design process. The suggestions can be automatically applied, regenerated and retracted by users' toggle control.Our ablation study showed effectiveness of our pipeline components. Our user study found that, compared to baseline without showing suggestions, EchoLadder better supports user creativity in spatial design. It also contributes insights on users' progressive design strategies under AI assistance, providing design implications for future systems. △ Less

Submitted 4 August, 2025; originally announced August 2025.

Comments: To appear at UIST 2025

arXiv:2507.17147 [pdf, ps, other]

CogDual: Enhancing Dual Cognition of LLMs via Reinforcement Learning with Implicit Rule-Based Rewards

Authors: Cheng Liu, Yifei Lu, Fanghua Ye, Jian Li, Xingyu Chen, Feiliang Ren, Zhaopeng Tu, Xiaolong Li

Abstract: Role-Playing Language Agents (RPLAs) have emerged as a significant application direction for Large Language Models (LLMs). Existing approaches typically rely on prompt engineering or supervised fine-tuning to enable models to imitate character behaviors in specific scenarios, but often neglect the underlying \emph{cognitive} mechanisms driving these behaviors. Inspired by cognitive psychology, we… ▽ More Role-Playing Language Agents (RPLAs) have emerged as a significant application direction for Large Language Models (LLMs). Existing approaches typically rely on prompt engineering or supervised fine-tuning to enable models to imitate character behaviors in specific scenarios, but often neglect the underlying \emph{cognitive} mechanisms driving these behaviors. Inspired by cognitive psychology, we introduce \textbf{CogDual}, a novel RPLA adopting a \textit{cognize-then-respond } reasoning paradigm. By jointly modeling external situational awareness and internal self-awareness, CogDual generates responses with improved character consistency and contextual alignment. To further optimize the performance, we employ reinforcement learning with two general-purpose reward schemes designed for open-domain text generation. Extensive experiments on the CoSER benchmark, as well as Cross-MR and LifeChoice, demonstrate that CogDual consistently outperforms existing baselines and generalizes effectively across diverse role-playing tasks. △ Less

Submitted 22 July, 2025; originally announced July 2025.

arXiv:2506.17692 [pdf, ps, other]

Resource-Friendly Dynamic Enhancement Chain for Multi-Hop Question Answering

Authors: Binquan Ji, Haibo Luo, Yifei Lu, Lei Hei, Jiaqi Wang, Tingjing Liao, Lingyu Wang, Shichao Wang, Feiliang Ren

Abstract: Knowledge-intensive multi-hop question answering (QA) tasks, which require integrating evidence from multiple sources to address complex queries, often necessitate multiple rounds of retrieval and iterative generation by large language models (LLMs). However, incorporating many documents and extended contexts poses challenges -such as hallucinations and semantic drift-for lightweight LLMs with few… ▽ More Knowledge-intensive multi-hop question answering (QA) tasks, which require integrating evidence from multiple sources to address complex queries, often necessitate multiple rounds of retrieval and iterative generation by large language models (LLMs). However, incorporating many documents and extended contexts poses challenges -such as hallucinations and semantic drift-for lightweight LLMs with fewer parameters. This work proposes a novel framework called DEC (Dynamic Enhancement Chain). DEC first decomposes complex questions into logically coherent subquestions to form a hallucination-free reasoning chain. It then iteratively refines these subquestions through context-aware rewriting to generate effective query formulations. For retrieval, we introduce a lightweight discriminative keyword extraction module that leverages extracted keywords to achieve targeted, precise document recall with relatively low computational overhead. Extensive experiments on three multi-hop QA datasets demonstrate that DEC performs on par with or surpasses state-of-the-art benchmarks while significantly reducing token consumption. Notably, our approach attains state-of-the-art results on models with 8B parameters, showcasing its effectiveness in various scenarios, particularly in resource-constrained environments. △ Less

Submitted 21 June, 2025; originally announced June 2025.

arXiv:2506.08381 [pdf]

Physics-informed extreme learning machine for Terzaghi consolidation problems and interpretation of coefficient of consolidation based on CPTu data

Authors: He Yang, Pin-Qiang Mo, Fei Ren, Hai-Sui Yu, Xueyu Geng, Pei-Zhi Zhuang

Abstract: This paper conducts a preliminary study to investigate the feasibility of a physics-informed extreme learning machine (PIELM) for solving the Terzaghi consolidation equation and interpreting the coefficient of consolidation of soil from piezocone penetration tests (CPTu). In the PIELM framework, the target solution is approximated by a single-layer feed-forward extreme learning machine (ELM) netwo… ▽ More This paper conducts a preliminary study to investigate the feasibility of a physics-informed extreme learning machine (PIELM) for solving the Terzaghi consolidation equation and interpreting the coefficient of consolidation of soil from piezocone penetration tests (CPTu). In the PIELM framework, the target solution is approximated by a single-layer feed-forward extreme learning machine (ELM) network, instead of the deep neural networks typically employed in physics-informed neural networks (PINNs). Physical laws and measured data are integrated into a loss vector, which is minimized via least squares methods during ELM training. As a result, training efficiency is significantly improved by avoiding the gradient-descent optimisation commonly used in PINNs. The performance of PIELM is evaluated using three forward-problem case studies. Notably, a time-stepping strategy is incorporated into the PIELM framework to alleviate sharp gradients caused by inconsistent initial and boundary conditions. This paper further applies PIELM to estimate the soil consolidation coefficient, given that initial distributions of excess water pressure are often unavailable in CPTu dissipation tests (conducted following the pauses of penetration). By combining physical laws (excluding initial conditions) with measured data (i.e., excess pore-water pressure at the probe surface), the results demonstrate that PIELM is an effective tool for interpreting CPTu dissipation tests, owing to its ability to fuse data with physical constraints. This study contributes to the interpretation of consolidation coefficients from CPTu dissipation tests, particularly in scenarios where initial distributions of excess water pressure are not prior-known. △ Less

Submitted 6 February, 2026; v1 submitted 9 June, 2025; originally announced June 2025.

arXiv:2505.20310 [pdf, ps, other]

Manalyzer: End-to-end Automated Meta-analysis with Multi-agent System

Authors: Wanghan Xu, Wenlong Zhang, Fenghua Ling, Ben Fei, Yusong Hu, Runmin Ma, Bo Zhang, Fangxuan Ren, Jintai Lin, Wanli Ouyang, Lei Bai

Abstract: Meta-analysis is a systematic research methodology that synthesizes data from multiple existing studies to derive comprehensive conclusions. This approach not only mitigates limitations inherent in individual studies but also facilitates novel discoveries through integrated data analysis. Traditional meta-analysis involves a complex multi-stage pipeline including literature retrieval, paper screen… ▽ More Meta-analysis is a systematic research methodology that synthesizes data from multiple existing studies to derive comprehensive conclusions. This approach not only mitigates limitations inherent in individual studies but also facilitates novel discoveries through integrated data analysis. Traditional meta-analysis involves a complex multi-stage pipeline including literature retrieval, paper screening, and data extraction, which demands substantial human effort and time. However, while LLM-based methods can accelerate certain stages, they still face significant challenges, such as hallucinations in paper screening and data extraction. In this paper, we propose a multi-agent system, Manalyzer, which achieves end-to-end automated meta-analysis through tool calls. The hybrid review, hierarchical extraction, self-proving, and feedback checking strategies implemented in Manalyzer significantly alleviate these two hallucinations. To comprehensively evaluate the performance of meta-analysis, we construct a new benchmark comprising 729 papers across 3 domains, encompassing text, image, and table modalities, with over 10,000 data points. Extensive experiments demonstrate that Manalyzer achieves significant performance improvements over the LLM baseline in multi meta-analysis tasks. Project page: https://black-yt.github.io/meta-analysis-page/ . △ Less

Submitted 20 January, 2026; v1 submitted 22 May, 2025; originally announced May 2025.

arXiv:2505.15715 [pdf, ps, other]

Beyond Empathy: Integrating Diagnostic and Therapeutic Reasoning with Large Language Models for Mental Health Counseling

Authors: He Hu, Yucheng Zhou, Juzheng Si, Qianning Wang, Hengheng Zhang, Fuji Ren, Fei Ma, Laizhong Cui, Qi Tian

Abstract: Large language models (LLMs) hold significant potential for mental health support, capable of generating empathetic responses and simulating therapeutic conversations. However, existing LLM-based approaches often lack the clinical grounding necessary for real-world psychological counseling, particularly in explicit diagnostic reasoning aligned with standards like the DSM/ICD and incorporating dive… ▽ More Large language models (LLMs) hold significant potential for mental health support, capable of generating empathetic responses and simulating therapeutic conversations. However, existing LLM-based approaches often lack the clinical grounding necessary for real-world psychological counseling, particularly in explicit diagnostic reasoning aligned with standards like the DSM/ICD and incorporating diverse therapeutic modalities beyond basic empathy or single strategies. To address these critical limitations, we propose PsyLLM, the first large language model designed to systematically integrate both diagnostic and therapeutic reasoning for mental health counseling. To develop PsyLLM, we design a novel automated data synthesis pipeline that processes real-world mental health posts collected from Reddit, where users frequently share psychological distress and seek community support. This pipeline processes real-world mental health posts, generates multi-turn dialogue structures, and leverages LLMs guided by international diagnostic standards (e.g., DSM/ICD) and multiple therapeutic frameworks (e.g., CBT, ACT, psychodynamic) to simulate detailed clinical reasoning processes. Rigorous multi-dimensional filtering ensures the generation of high-quality, clinically aligned dialogue data. In addition, we introduce a new benchmark and evaluation protocol, assessing counseling quality across four key dimensions. Our experiments demonstrate that PsyLLM significantly outperforms state-of-the-art baseline models on this benchmark. The model weights and dataset have been publicly released at https://github.com/Emo-gml/PsyLLM. △ Less

Submitted 3 November, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

arXiv:2505.09261 [pdf, ps, other]

Instantiating Standards: Enabling Standard-Driven Text TTP Extraction with Evolvable Memory

Authors: Cheng Meng, ZhengWei Jiang, QiuYun Wang, XinYi Li, ChunYan Ma, FangMing Dong, FangLi Ren, BaoXu Liu

Abstract: Extracting MITRE ATT\&CK Tactics, Techniques, and Procedures (TTPs) from natural language threat reports is crucial yet challenging. Existing methods primarily focus on performance metrics using data-driven approaches, often neglecting mechanisms to ensure faithful adherence to the official standard. This deficiency compromises reliability and consistency of TTP assignments, creating intelligence… ▽ More Extracting MITRE ATT\&CK Tactics, Techniques, and Procedures (TTPs) from natural language threat reports is crucial yet challenging. Existing methods primarily focus on performance metrics using data-driven approaches, often neglecting mechanisms to ensure faithful adherence to the official standard. This deficiency compromises reliability and consistency of TTP assignments, creating intelligence silos and contradictory threat assessments across organizations. To address this, we introduce a novel framework that converts abstract standard definitions into actionable, contextualized knowledge. Our method utilizes Large Language Model (LLM) to generate, update, and apply this knowledge. This framework populates an evolvable memory with dual-layer situational knowledge instances derived from labeled examples and official definitions. The first layer identifies situational contexts (e.g., "Communication with C2 using encoded subdomains"), while the second layer captures distinctive features that differentiate similar techniques (e.g., distinguishing T1132 "Data Encoding" from T1071 "Application Layer Protocol" based on whether the focus is on encoding methods or protocol usage). This structured approach provides a transparent basis for explainable TTP assignments and enhanced human oversight, while also helping to standardize other TTP extraction systems. Experiments show our framework (using Qwen2.5-32B) boosts Technique F1 scores by 11\% over GPT-4o. Qualitative analysis confirms superior standardization, enhanced transparency, and improved explainability in real-world threat intelligence scenarios. To the best of our knowledge, this is the first work that uses the LLM to generate, update, and apply the a new knowledge for TTP extraction. △ Less

Submitted 14 May, 2025; originally announced May 2025.

arXiv:2504.17307 [pdf, ps, other]

An Extensible Software Transport Layer for GPU Networking

Authors: Yang Zhou, Zhongjie Chen, Ziming Mao, ChonLam Lao, Shuo Yang, Pravein Govindan Kannan, Jiaqi Gao, Yilong Zhao, Yongji Wu, Kaichao You, Fengyuan Ren, Zhiying Xu, Costin Raiciu, Ion Stoica

Abstract: Fast-evolving machine learning (ML) workloads have increasing requirements for networking. However, host network transport on RDMA NICs is hard to evolve, causing problems for ML workloads. For example, single-path RDMA traffic is prone to flow collisions that severely degrade collective communication performance. We present UCCL, an extensible software transport layer to evolve GPU networking. UC… ▽ More Fast-evolving machine learning (ML) workloads have increasing requirements for networking. However, host network transport on RDMA NICs is hard to evolve, causing problems for ML workloads. For example, single-path RDMA traffic is prone to flow collisions that severely degrade collective communication performance. We present UCCL, an extensible software transport layer to evolve GPU networking. UCCL decouples the data path and control path of existing RDMA NICs and efficiently runs the control-path transport on host CPUs. This software extensibility brings in transport innovations that cannot be achieved in hardware for ML workloads, e.g., a multipath transport to resolve flow collisions. ML collectives atop UCCL achieve up to 4.5x higher performance compared to existing RDMA NICs. △ Less

Submitted 4 August, 2025; v1 submitted 24 April, 2025; originally announced April 2025.

arXiv:2503.20840 [pdf, ps, other]

CodeTool: Enhancing Programmatic Tool Invocation of LLMs via Process Supervision

Authors: Yifei Lu, Fanghua Ye, Jian Li, Qiang Gao, Cheng Liu, Haibo Luo, Nan Du, Xiaolong Li, Feiliang Ren

Abstract: Tool invocation significantly enhances the capabilities of Large Language Models (LLMs), yet challenges persist, particularly in complex task scenarios. Current methods, such as instruction-enhanced reasoning and supervised fine-tuning, often result in unnecessarily long reasoning paths and face difficulties in verifying the correctness of intermediate steps. In this paper, we propose CodeTool, a… ▽ More Tool invocation significantly enhances the capabilities of Large Language Models (LLMs), yet challenges persist, particularly in complex task scenarios. Current methods, such as instruction-enhanced reasoning and supervised fine-tuning, often result in unnecessarily long reasoning paths and face difficulties in verifying the correctness of intermediate steps. In this paper, we propose CodeTool, a novel framework for stepwise code generation that improves LLM tool invocation by leveraging the concise and easily verifiable nature of code. CodeTool incorporates two distinct process rewards: the On-the-spot Reward, which provides immediate feedback on the accuracy of each tool invocation, and the Latent Reward, which assesses the contribution of each step toward overall task completion. By maximizing the cumulative reward of the On-the-spot and Latend Rewards at each step, LLMs are guided to follow efficient and accurate reasoning paths. Extensive experiments on StableToolBench and RestBench-TMDB demonstrate the superiority of CodeTool over existing approaches. △ Less

Submitted 12 June, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

Comments: ACL2025

arXiv:2503.11342 [pdf, other]

Road Rage Reasoning with Vision-language Models (VLMs): Task Definition and Evaluation Dataset

Authors: Yibing Weng, Yu Gu, Fuji Ren

Abstract: Road rage, triggered by driving-related stimuli such as traffic congestion and aggressive driving, poses a significant threat to road safety. Previous research on road rage regulation has primarily focused on response suppression, lacking proactive prevention capabilities. With the advent of Vision-Language Models (VLMs), it has become possible to reason about trigger events visually and then enga… ▽ More Road rage, triggered by driving-related stimuli such as traffic congestion and aggressive driving, poses a significant threat to road safety. Previous research on road rage regulation has primarily focused on response suppression, lacking proactive prevention capabilities. With the advent of Vision-Language Models (VLMs), it has become possible to reason about trigger events visually and then engage in dialog-based comforting before drivers' anger escalates. To this end, we propose the road rage reasoning task, along with a finely annotated test dataset and evaluation metrics, to assess the capabilities of current mainstream VLMs in scene understanding, event recognition, and road rage reasoning. The results indicate that current VLMs exhibit significant shortcomings in scene understanding within the visual modality, as well as in comprehending the spatial relationships between objects in the textual modality. Improving VLMs' performance in these areas will greatly benefit downstream tasks like antecedent-focused road rage regulation. △ Less

Submitted 14 March, 2025; originally announced March 2025.

arXiv:2503.00974 [pdf, other]

SAF: Scalable Acceleration Framework for dynamic and flexible scaling of FPGAs

Authors: Masudul Hassan Quraishi, Michael Riera, Fengbo Ren, Aman Arora, Aviral Shrivastava

Abstract: FPGAs are increasingly gaining traction in cloud and edge computing environments due to their hardware flexibility, low latency, and low energy consumption. However, the existing hardware stack of FPGA and the host-FPGA connectivity does not allow flexible scaling and simultaneous reconfiguration of multiple devices, which limits the adoption of FPGA at scale. In this paper, we present SAF -- an E… ▽ More FPGAs are increasingly gaining traction in cloud and edge computing environments due to their hardware flexibility, low latency, and low energy consumption. However, the existing hardware stack of FPGA and the host-FPGA connectivity does not allow flexible scaling and simultaneous reconfiguration of multiple devices, which limits the adoption of FPGA at scale. In this paper, we present SAF -- an Ethernet-based scalable acceleration framework that allows FPGA to be hot-plugged into a network in a stand-alone fashion without connecting to a local host CPU, which enables flexible scalability. SAF provides a custom FPGA shell and a set of Ethernet protocols that allow FPGAs to connect with a remote host to accelerate application kernels. SAF can configure multiple FPGAs simultaneously, which significantly reduces the reconfiguration time in scaling effort. We implemented the SAF framework using Intel FPGA SDK for OpenCL and 20 Bittware 385A cards with Arria-10 FPGAs. We analyze a case study and conduct experiments to compare SAF with state-of-the-art multi-FPGA clusters. Results show that SAF provides 13X faster reconfiguration than sequential PCIe programming, reduces the hardware setup costs by 38%, application runtime by 25%, and energy consumption by 27%. We evaluated the performance scalability of SAF using the PTRANS benchmark of the HPCC FPGA benchmark suite and showed an almost linear speedup for strong and weak scaling scenarios. △ Less

Submitted 2 March, 2025; originally announced March 2025.

arXiv:2502.18013 [pdf, other]

Exploring the Effects of Traditional Chinese Medicine Scents on Mitigating Driving Fatigue

Authors: Nengyue Su, Liang Luo, Yu Gu, Fuji Ren

Abstract: The rise of autonomous driving technology has led to concerns about inactivity-induced fatigue. This paper explores Traditional Chinese Medicine (TCM) scents for mitigating. Two human-involved studies have been conducted in a high-fidelity driving simulator. Study 1 maps six prevalent TCM scents onto the arousal/valence circumplex to select proper candidates, i.e., argy wormwood (with the highest… ▽ More The rise of autonomous driving technology has led to concerns about inactivity-induced fatigue. This paper explores Traditional Chinese Medicine (TCM) scents for mitigating. Two human-involved studies have been conducted in a high-fidelity driving simulator. Study 1 maps six prevalent TCM scents onto the arousal/valence circumplex to select proper candidates, i.e., argy wormwood (with the highest arousal) and tangerine peel (with the highest valence). Study 2 tests both scents in an auto-driving course. Statistics show both scents can improve driver alertness and reaction-time, but should be used in different ways: argy wormwood is suitable for short-term use due to its higher intensity but poor acceptance, while tangerine peel is ideal for long-term use due to its higher likeness. These findings provide insights for in-car fatigue mitigation to enhance driver safety and well-being. However, issues such as scent longevity as for aromatherapy and automatic fatigue prediction remain unresolved. △ Less

Submitted 25 February, 2025; originally announced February 2025.

arXiv:2502.06855 [pdf, ps, other]

Self-Supervised Prompt Optimization

Authors: Jinyu Xiang, Jiayi Zhang, Zhaoyang Yu, Xinbing Liang, Fengwei Teng, Jinhao Tu, Fashen Ren, Xiangru Tang, Sirui Hong, Chenglin Wu, Yuyu Luo

Abstract: Well-designed prompts are crucial for enhancing Large language models' (LLMs) reasoning capabilities while aligning their outputs with task requirements across diverse domains. However, manually designed prompts require expertise and iterative experimentation. While existing prompt optimization methods aim to automate this process, they rely heavily on external references such as ground truth or b… ▽ More Well-designed prompts are crucial for enhancing Large language models' (LLMs) reasoning capabilities while aligning their outputs with task requirements across diverse domains. However, manually designed prompts require expertise and iterative experimentation. While existing prompt optimization methods aim to automate this process, they rely heavily on external references such as ground truth or by humans, limiting their applicability in real-world scenarios where such data is unavailable or costly to obtain. To address this, we propose Self-Supervised Prompt Optimization (SPO), a cost-efficient framework that discovers effective prompts for both closed and open-ended tasks without requiring external reference. Motivated by the observations that prompt quality manifests directly in LLM outputs and LLMs can effectively assess adherence to task requirements, we derive evaluation and optimization signals purely from output comparisons. Specifically, SPO selects superior prompts through pairwise output comparisons evaluated by an LLM evaluator, followed by an LLM optimizer that aligns outputs with task requirements. Extensive experiments demonstrate that SPO outperforms state-of-the-art prompt optimization methods, achieving comparable or superior results with significantly lower costs (e.g., 1.1% to 5.6% of existing methods) and fewer samples (e.g., three samples). The code is available at https://github.com/FoundationAgents/SPO. △ Less

Submitted 21 August, 2025; v1 submitted 7 February, 2025; originally announced February 2025.

arXiv:2502.00405 [pdf, ps, other]

Spectral Sufficient Conditions for Graph Factors

Authors: Fengyun Ren, Shumin Zhang, Ke Wang

Abstract: The $\{K_{1,1}, K_{1,2},C_m: m\geq3\}$-factor of a graph is a spanning subgraph whose each component is an element of $\{K_{1,1}, K_{1,2},C_m: m\geq3\}$. In this paper, through the graph spectral methods, we establish the lower bound of the signless Laplacian spectral radius and the upper bound of the distance spectral radius to determine whether a graph admits a $\{K_2\}$-factor. We get a lower b… ▽ More The $\{K_{1,1}, K_{1,2},C_m: m\geq3\}$-factor of a graph is a spanning subgraph whose each component is an element of $\{K_{1,1}, K_{1,2},C_m: m\geq3\}$. In this paper, through the graph spectral methods, we establish the lower bound of the signless Laplacian spectral radius and the upper bound of the distance spectral radius to determine whether a graph admits a $\{K_2\}$-factor. We get a lower bound on the size (resp. the spectral radius) of $G$ to guarantee that $G$ contains a $\{K_{1,1}, K_{1,2},C_m: m\geq3\}$-factor. Then we determine an upper bound on the distance spectral radius of $G$ to ensure that $G$ has a $\{K_{1,1}, K_{1,2},C_m: m\geq3\}$-factor. Furthermore, by constructing extremal graphs, we show that the above all bounds are best possible. △ Less

Submitted 1 February, 2025; originally announced February 2025.

arXiv:2501.15000 [pdf, ps, other]

MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models

Authors: Zhongpu Chen, Yinfeng Liu, Long Shi, Xingyan Chen, Yu Zhao, Fuji Ren

Abstract: Large language models (LLMs) are expected to offer structured Markdown responses for the sake of readability in web chatbots (e.g., ChatGPT). Although there are a myriad of metrics to evaluate LLMs, they fail to evaluate the readability from the view of output content structure. To this end, we focus on an overlooked yet important metric -- Markdown Awareness, which directly impacts the readabilit… ▽ More Large language models (LLMs) are expected to offer structured Markdown responses for the sake of readability in web chatbots (e.g., ChatGPT). Although there are a myriad of metrics to evaluate LLMs, they fail to evaluate the readability from the view of output content structure. To this end, we focus on an overlooked yet important metric -- Markdown Awareness, which directly impacts the readability and structure of the content generated by these language models. In this paper, we introduce MDEval, a comprehensive benchmark to assess Markdown Awareness for LLMs, by constructing a dataset with 20K instances covering 10 subjects in English and Chinese. Unlike traditional model-based evaluations, MDEval provides excellent interpretability by combining model-based generation tasks and statistical methods. Our results demonstrate that MDEval achieves a Spearman correlation of 0.791 and an accuracy of 84.1% with human, outperforming existing methods by a large margin. Extensive experimental results also show that through fine-tuning over our proposed dataset, less performant open-source models are able to achieve comparable performance to GPT-4o in terms of Markdown Awareness. To ensure reproducibility and transparency, MDEval is open sourced at https://github.com/SWUFE-DB-Group/MDEval-Benchmark. △ Less

Submitted 26 August, 2025; v1 submitted 24 January, 2025; originally announced January 2025.

Comments: WWW 2025

arXiv:2501.13570 [pdf, other]

Occamy: A Preemptive Buffer Management for On-chip Shared-memory Switches

Authors: Danfeng Shan, Yunguang Li, Jinchao Ma, Zhenxing Zhang, Zeyu Liang, Xinyu Wen, Hao Li, Wanchun Jiang, Nan Li, Fengyuan Ren

Abstract: Today's high-speed switches employ an on-chip shared packet buffer. The buffer is becoming increasingly insufficient as it cannot scale with the growing switching capacity. Nonetheless, the buffer needs to face highly intense bursts and meet stringent performance requirements for datacenter applications. This imposes rigorous demand on the Buffer Management (BM) scheme, which dynamically allocates… ▽ More Today's high-speed switches employ an on-chip shared packet buffer. The buffer is becoming increasingly insufficient as it cannot scale with the growing switching capacity. Nonetheless, the buffer needs to face highly intense bursts and meet stringent performance requirements for datacenter applications. This imposes rigorous demand on the Buffer Management (BM) scheme, which dynamically allocates the buffer across queues. However, the de facto BM scheme, designed over two decades ago, is ill-suited to meet the requirements of today's network. In this paper, we argue that shallow-buffer switches, intense bursts, along with dynamic traffic call for a highly agile BM that can quickly adjust the buffer allocation as traffic changes. However, the agility of the current BM is fundamentally limited by its non-preemptive nature. Nonetheless, we find that preemptive BM, considered unrealizable in history, is now feasible on modern switch chips. We propose Occamy, a preemptive BM that can quickly adjust buffer allocation. Occamy utilizes the redundant memory bandwidth to actively reclaim and reallocate the over-allocated buffer. Testbed experiments and large-scale simulations show that Occamy can improve the end-to-end performance by up to ~55%. △ Less

Submitted 23 January, 2025; originally announced January 2025.

Comments: 18 pages, 23 figures

arXiv:2412.13544 [pdf, other]

Bridging the User-side Knowledge Gap in Knowledge-aware Recommendations with Large Language Models

Authors: Zheng Hu, Zhe Li, Ziyun Jiao, Satoshi Nakagawa, Jiawen Deng, Shimin Cai, Tao Zhou, Fuji Ren

Abstract: In recent years, knowledge graphs have been integrated into recommender systems as item-side auxiliary information, enhancing recommendation accuracy. However, constructing and integrating structural user-side knowledge remains a significant challenge due to the improper granularity and inherent scarcity of user-side features. Recent advancements in Large Language Models (LLMs) offer the potential… ▽ More In recent years, knowledge graphs have been integrated into recommender systems as item-side auxiliary information, enhancing recommendation accuracy. However, constructing and integrating structural user-side knowledge remains a significant challenge due to the improper granularity and inherent scarcity of user-side features. Recent advancements in Large Language Models (LLMs) offer the potential to bridge this gap by leveraging their human behavior understanding and extensive real-world knowledge. Nevertheless, integrating LLM-generated information into recommender systems presents challenges, including the risk of noisy information and the need for additional knowledge transfer. In this paper, we propose an LLM-based user-side knowledge inference method alongside a carefully designed recommendation framework to address these challenges. Our approach employs LLMs to infer user interests based on historical behaviors, integrating this user-side information with item-side and collaborative data to construct a hybrid structure: the Collaborative Interest Knowledge Graph (CIKG). Furthermore, we propose a CIKG-based recommendation framework that includes a user interest reconstruction module and a cross-domain contrastive learning module to mitigate potential noise and facilitate knowledge transfer. We conduct extensive experiments on three real-world datasets to validate the effectiveness of our method. Our approach achieves state-of-the-art performance compared to competitive baselines, particularly for users with sparse interactions. △ Less

Submitted 18 December, 2024; originally announced December 2024.

Comments: Accepted at AAAI 2025

arXiv:2412.07116 [pdf, other]

A Review of Human Emotion Synthesis Based on Generative Technology

Authors: Fei Ma, Yukan Li, Yifan Xie, Ying He, Yi Zhang, Hongwei Ren, Zhou Liu, Wei Yao, Fuji Ren, Fei Richard Yu, Shiguang Ni

Abstract: Human emotion synthesis is a crucial aspect of affective computing. It involves using computational methods to mimic and convey human emotions through various modalities, with the goal of enabling more natural and effective human-computer interactions. Recent advancements in generative models, such as Autoencoders, Generative Adversarial Networks, Diffusion Models, Large Language Models, and Seque… ▽ More Human emotion synthesis is a crucial aspect of affective computing. It involves using computational methods to mimic and convey human emotions through various modalities, with the goal of enabling more natural and effective human-computer interactions. Recent advancements in generative models, such as Autoencoders, Generative Adversarial Networks, Diffusion Models, Large Language Models, and Sequence-to-Sequence Models, have significantly contributed to the development of this field. However, there is a notable lack of comprehensive reviews in this field. To address this problem, this paper aims to address this gap by providing a thorough and systematic overview of recent advancements in human emotion synthesis based on generative models. Specifically, this review will first present the review methodology, the emotion models involved, the mathematical principles of generative models, and the datasets used. Then, the review covers the application of different generative models to emotion synthesis based on a variety of modalities, including facial images, speech, and text. It also examines mainstream evaluation metrics. Additionally, the review presents some major findings and suggests future research directions, providing a comprehensive understanding of the role of generative technology in the nuanced domain of emotion synthesis. △ Less

Submitted 9 December, 2024; originally announced December 2024.

Comments: 25 pages, 10 figures

arXiv:2411.15180 [pdf, other]

Multi-layer matrix factorization for cancer subtyping using full and partial multi-omics dataset

Authors: Yingxuan Ren, Fengtao Ren, Bo Yang

Abstract: Cancer, with its inherent heterogeneity, is commonly categorized into distinct subtypes based on unique traits, cellular origins, and molecular markers specific to each type. However, current studies primarily rely on complete multi-omics datasets for predicting cancer subtypes, often overlooking predictive performance in cases where some omics data may be missing and neglecting implicit relations… ▽ More Cancer, with its inherent heterogeneity, is commonly categorized into distinct subtypes based on unique traits, cellular origins, and molecular markers specific to each type. However, current studies primarily rely on complete multi-omics datasets for predicting cancer subtypes, often overlooking predictive performance in cases where some omics data may be missing and neglecting implicit relationships across multiple layers of omics data integration. This paper introduces Multi-Layer Matrix Factorization (MLMF), a novel approach for cancer subtyping that employs multi-omics data clustering. MLMF initially processes multi-omics feature matrices by performing multi-layer linear or nonlinear factorization, decomposing the original data into latent feature representations unique to each omics type. These latent representations are subsequently fused into a consensus form, on which spectral clustering is performed to determine subtypes. Additionally, MLMF incorporates a class indicator matrix to handle missing omics data, creating a unified framework that can manage both complete and incomplete multi-omics data. Extensive experiments conducted on 10 multi-omics cancer datasets, both complete and with missing values, demonstrate that MLMF achieves results that are comparable to or surpass the performance of several state-of-the-art approaches. △ Less

Submitted 18 November, 2024; originally announced November 2024.

arXiv:2411.12676 [pdf, other]

IoT-Based 3D Pose Estimation and Motion Optimization for Athletes: Application of C3D and OpenPose

Authors: Fei Ren, Chao Ren, Tianyi Lyu

Abstract: This study proposes the IoT-Enhanced Pose Optimization Network (IE-PONet) for high-precision 3D pose estimation and motion optimization of track and field athletes. IE-PONet integrates C3D for spatiotemporal feature extraction, OpenPose for real-time keypoint detection, and Bayesian optimization for hyperparameter tuning. Experimental results on NTURGB+D and FineGYM datasets demonstrate superior p… ▽ More This study proposes the IoT-Enhanced Pose Optimization Network (IE-PONet) for high-precision 3D pose estimation and motion optimization of track and field athletes. IE-PONet integrates C3D for spatiotemporal feature extraction, OpenPose for real-time keypoint detection, and Bayesian optimization for hyperparameter tuning. Experimental results on NTURGB+D and FineGYM datasets demonstrate superior performance, with AP$^p50$ scores of 90.5 and 91.0, and mAP scores of 74.3 and 74.0, respectively. Ablation studies confirm the essential roles of each module in enhancing model accuracy. IE-PONet provides a robust tool for athletic performance analysis and optimization, offering precise technical insights for training and injury prevention. Future work will focus on further model optimization, multimodal data integration, and developing real-time feedback mechanisms to enhance practical applications. △ Less

Submitted 19 November, 2024; originally announced November 2024.

Comments: 17 pages

arXiv:2408.10841 [pdf, other]

DELIA: Diversity-Enhanced Learning for Instruction Adaptation in Large Language Models

Authors: Yuanhao Zeng, Fei Ren, Xinpeng Zhou, Yihang Wang, Yingxia Shao

Abstract: Although instruction tuning is widely used to adjust behavior in Large Language Models (LLMs), extensive empirical evidence and research indicates that it is primarily a process where the model fits to specific task formats, rather than acquiring new knowledge or capabilities. We propose that this limitation stems from biased features learned during instruction tuning, which differ from ideal task… ▽ More Although instruction tuning is widely used to adjust behavior in Large Language Models (LLMs), extensive empirical evidence and research indicates that it is primarily a process where the model fits to specific task formats, rather than acquiring new knowledge or capabilities. We propose that this limitation stems from biased features learned during instruction tuning, which differ from ideal task-specfic features, leading to learn less underlying semantics in downstream tasks. However, ideal features are unknown and incalculable, constraining past work to rely on prior knowledge to assist reasoning or training, which limits LLMs' capabilities to the developers' abilities, rather than data-driven scalable learning. In our paper, through our novel data synthesis method, DELIA (Diversity-Enhanced Learning for Instruction Adaptation), we leverage the buffering effect of extensive diverse data in LLMs training to transform biased features in instruction tuning into approximations of ideal features, without explicit prior ideal features. Experiments show DELIA's better performance compared to common instruction tuning and other baselines. It outperforms common instruction tuning by 17.07%-33.41% on Icelandic-English translation bleurt score (WMT-21 dataset, gemma-7b-it) and improves accuracy by 36.1% on formatted text generation (Llama2-7b-chat). Notably, among knowledge injection methods we've known, DELIA uniquely align the internal representations of new special tokens with their prior semantics. △ Less

Submitted 19 August, 2024; originally announced August 2024.

Comments: 8 pages, 5 figures

arXiv:2408.08709 [pdf, other]

Multimodal Relational Triple Extraction with Query-based Entity Object Transformer

Authors: Lei Hei, Ning An, Tingjing Liao, Qi Ma, Jiaqi Wang, Feiliang Ren

Abstract: Multimodal Relation Extraction is crucial for constructing flexible and realistic knowledge graphs. Recent studies focus on extracting the relation type with entity pairs present in different modalities, such as one entity in the text and another in the image. However, existing approaches require entities and objects given beforehand, which is costly and impractical. To address the limitation, we… ▽ More Multimodal Relation Extraction is crucial for constructing flexible and realistic knowledge graphs. Recent studies focus on extracting the relation type with entity pairs present in different modalities, such as one entity in the text and another in the image. However, existing approaches require entities and objects given beforehand, which is costly and impractical. To address the limitation, we propose a novel task, Multimodal Entity-Object Relational Triple Extraction, which aims to extract all triples (entity span, relation, object region) from image-text pairs. To facilitate this study, we modified a multimodal relation extraction dataset MORE, which includes 21 relation types, to create a new dataset containing 20,264 triples, averaging 5.75 triples per image-text pair. Moreover, we propose QEOT, a query-based model with a selective attention mechanism, to dynamically explore the interaction and fusion of textual and visual information. In particular, the proposed method can simultaneously accomplish entity extraction, relation classification, and object detection with a set of queries. Our method is suitable for downstream applications and reduces error accumulation due to the pipeline-style approaches. Extensive experimental results demonstrate that our proposed method outperforms the existing baselines by 8.06% and achieves state-of-the-art performance. △ Less

Submitted 16 August, 2024; originally announced August 2024.

Comments: 15 pages, 7 figures, preprint

arXiv:2407.03640 [pdf, other]

Generative Technology for Human Emotion Recognition: A Scope Review

Authors: Fei Ma, Yucheng Yuan, Yifan Xie, Hongwei Ren, Ivan Liu, Ying He, Fuji Ren, Fei Richard Yu, Shiguang Ni

Abstract: Affective computing stands at the forefront of artificial intelligence (AI), seeking to imbue machines with the ability to comprehend and respond to human emotions. Central to this field is emotion recognition, which endeavors to identify and interpret human emotional states from different modalities, such as speech, facial images, text, and physiological signals. In recent years, important progre… ▽ More Affective computing stands at the forefront of artificial intelligence (AI), seeking to imbue machines with the ability to comprehend and respond to human emotions. Central to this field is emotion recognition, which endeavors to identify and interpret human emotional states from different modalities, such as speech, facial images, text, and physiological signals. In recent years, important progress has been made in generative models, including Autoencoder, Generative Adversarial Network, Diffusion Model, and Large Language Model. These models, with their powerful data generation capabilities, emerge as pivotal tools in advancing emotion recognition. However, up to now, there remains a paucity of systematic efforts that review generative technology for emotion recognition. This survey aims to bridge the gaps in the existing literature by conducting a comprehensive analysis of over 320 research papers until June 2024. Specifically, this survey will firstly introduce the mathematical principles of different generative models and the commonly used datasets. Subsequently, through a taxonomy, it will provide an in-depth analysis of how generative techniques address emotion recognition based on different modalities in several aspects, including data augmentation, feature extraction, semi-supervised learning, cross-domain, etc. Finally, the review will outline future research directions, emphasizing the potential of generative models to advance the field of emotion recognition and enhance the emotional intelligence of AI systems. △ Less

Submitted 4 July, 2024; originally announced July 2024.

Comments: Under Review

arXiv:2404.19154 [pdf, other]

RTF: Region-based Table Filling Method for Relational Triple Extraction

Authors: Ning An, Lei Hei, Yong Jiang, Weiping Meng, Jingjing Hu, Boran Huang, Feiliang Ren

Abstract: Relational triple extraction is crucial work for the automatic construction of knowledge graphs. Existing methods only construct shallow representations from a token or token pair-level. However, previous works ignore local spatial dependencies of relational triples, resulting in a weakness of entity pair boundary detection. To tackle this problem, we propose a novel Region-based Table Filling met… ▽ More Relational triple extraction is crucial work for the automatic construction of knowledge graphs. Existing methods only construct shallow representations from a token or token pair-level. However, previous works ignore local spatial dependencies of relational triples, resulting in a weakness of entity pair boundary detection. To tackle this problem, we propose a novel Region-based Table Filling method (RTF). We devise a novel region-based tagging scheme and bi-directional decoding strategy, which regard each relational triple as a region on the relation-specific table, and identifies triples by determining two endpoints of each region. We also introduce convolution to construct region-level table representations from a spatial perspective which makes triples easier to be captured. In addition, we share partial tagging scores among different relations to improve learning efficiency of relation classifier. Experimental results show that our method achieves state-of-the-art with better generalization capability on three variants of two widely used benchmark datasets. △ Less

Submitted 13 June, 2024; v1 submitted 29 April, 2024; originally announced April 2024.

Comments: Rejected by EMNLP 2023

Showing 1–50 of 121 results for author: Ren, F