
Showing 1–50 of 206 results for author: Lyu, M R

Searching in archive cs.
  1. arXiv:2603.27333  [pdf, ps, other]

    cs.SE cs.AI

    ComBench: A Repo-level Real-world Benchmark for Compilation Error Repair

    Authors: Jia Li, Zeyang Zhuang, Zhuangbin Chen, Yuxin Su, Wei Meng, Michael R. Lyu

    Abstract: Compilation errors pose pervasive and critical challenges in software development, significantly hindering productivity. Therefore, Automated Compilation Error Repair (ACER) techniques have been proposed to mitigate these issues. Despite recent advancements in ACER, its real-world performance remains poorly evaluated. This can be largely attributed to the limitations of existing benchmarks, i.e., decontex…

    Submitted 28 March, 2026; originally announced March 2026.

  2. arXiv:2603.00468  [pdf, ps, other]

    cs.SE

    Cloud-OpsBench: A Reproducible Benchmark for Agentic Root Cause Analysis in Cloud Systems

    Authors: Yilun Wang, Guangba Yu, Haiyu Huang, Zirui Wang, Yujie Huang, Pengfei Chen, Michael R. Lyu

    Abstract: The transition to agentic Root Cause Analysis (RCA) necessitates benchmarks that evaluate active reasoning rather than passive classification. However, current frameworks fail to reconcile ecological validity with reproducibility. We introduce Cloud-OpsBench, a large-scale benchmark that employs a State Snapshot Paradigm to construct a deterministic digital twin of the cloud, featuring 452 distinc…

    Submitted 28 February, 2026; originally announced March 2026.

    Comments: 22 pages, 4 figures

  3. arXiv:2603.00155  [pdf, ps, other]

    cs.CV cs.AI cs.IR

    EfficientPosterGen: Semantic-aware Efficient Poster Generation via Token Compression and Accurate Violation Detection

    Authors: Wenxin Tang, Jingyu Xiao, Yanpei Gong, Fengyuan Ran, Tongchuan Xia, Junliang Liu, Man Ho Lam, Wenxuan Wang, Michael R. Lyu

    Abstract: Automated academic poster generation aims to distill lengthy research papers into concise, visually coherent presentations. Existing Multimodal Large Language Model (MLLM)-based approaches, however, suffer from three critical limitations: low information density in full-paper inputs, excessive token consumption, and unreliable layout verification. We present EfficientPosterGen, an end-to-end fra…

    Submitted 25 February, 2026; originally announced March 2026.

  4. arXiv:2602.19276  [pdf, ps, other]

    cs.SE

    ComUICoder: Component-based Reusable UI Code Generation for Complex Websites via Semantic Segmentation and Element-wise Feedback

    Authors: Jingyu Xiao, Jiantong Qin, Shuoqi Li, Man Ho Lam, Yuxuan Wan, Jen-tse Huang, Yintong Huo, Michael R. Lyu

    Abstract: Multimodal Large Language Models (MLLMs) have demonstrated strong performance on the UI-to-code task, which aims to generate UI code from design mock-ups. However, when applied to long and complex websites, they often struggle with fragmented segmentation, redundant code generation for repetitive components, and frequent UI inconsistencies. To systematically investigate and address these challenge…

    Submitted 22 February, 2026; originally announced February 2026.

  5. arXiv:2601.15808  [pdf, ps, other]

    cs.AI

    Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification

    Authors: Yuxuan Wan, Tianqing Fang, Zaitang Li, Yintong Huo, Wenxuan Wang, Haitao Mi, Dong Yu, Michael R. Lyu

    Abstract: Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. While the majority of existing efforts focus on enhancing policy capabilities via post-training, we propose an alternative paradigm: self-evolving the agent's ability by iteratively verifying the policy model's outputs, guided by meticulously crafted rubrics. This approach gives rise…

    Submitted 22 January, 2026; originally announced January 2026.

  6. arXiv:2601.13655  [pdf, ps, other]

    cs.SE cs.AI cs.DC

    Why Does the LLM Stop Computing: An Empirical Study of User-Reported Failures in Open-Source LLMs

    Authors: Guangba Yu, Zirui Wang, Yujie Huang, Renyi Zhong, Yuedong Zhong, Yilun Wang, Michael R. Lyu

    Abstract: The democratization of open-source Large Language Models (LLMs) allows users to fine-tune and deploy models on local infrastructure but exposes them to a First Mile deployment landscape. Unlike black-box API consumption, the reliability of user-managed orchestration remains a critical blind spot. To bridge this gap, we conduct the first large-scale empirical study of 705 real-world failures from t…

    Submitted 20 January, 2026; originally announced January 2026.

  7. arXiv:2601.09393  [pdf, ps, other]

    cs.SE cs.DC cs.PF

    AI-NativeBench: An Open-Source White-Box Agentic Benchmark Suite for AI-Native Systems

    Authors: Zirui Wang, Guangba Yu, Michael R. Lyu

    Abstract: The transition from Cloud-Native to AI-Native architectures is fundamentally reshaping software engineering, replacing deterministic microservices with probabilistic agentic services. However, this shift renders traditional black-box evaluation paradigms insufficient: existing benchmarks measure raw model capabilities while remaining blind to system-level execution dynamics. To bridge this gap, we…

    Submitted 14 January, 2026; originally announced January 2026.

  8. arXiv:2601.03731  [pdf, ps, other]

    cs.SE cs.AI

    From Laboratory to Real-World Applications: Benchmarking Agentic Code Reasoning at the Repository Level

    Authors: Jia Li, Yuxin Su, Michael R. Lyu

    Abstract: As large language models (LLMs) evolve into autonomous agents, evaluating repository-level reasoning, the ability to maintain logical consistency across massive, real-world, interdependent file systems, has become critical. Current benchmarks typically fluctuate between isolated code snippets and black-box evaluations. We present RepoReason, a white-box diagnostic benchmark centered on abductive a…

    Submitted 9 January, 2026; v1 submitted 7 January, 2026; originally announced January 2026.

  9. arXiv:2511.18528  [pdf, ps, other]

    cs.SE

    End-to-End Automated Logging via Multi-Agent Framework

    Authors: Renyi Zhong, Yintong Huo, Wenwei Gu, Yichen Li, Michael R. Lyu

    Abstract: Software logging is critical for system observability, yet developers face a dual crisis of costly overlogging and risky underlogging. Existing automated logging tools often overlook the fundamental whether-to-log decision and struggle with the composite nature of logging. In this paper, we propose Autologger, a novel hybrid framework that addresses the complete end-to-end logging pipeline. Au…

    Submitted 23 November, 2025; originally announced November 2025.

  10. arXiv:2510.24706  [pdf, ps, other]

    cs.CL cs.AI cs.HC cs.SE

    ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?

    Authors: Shuqing Li, Jiayi Yan, Chenyu Niu, Jen-tse Huang, Yun Peng, Wenxuan Wang, Yepang Liu, Michael R. Lyu

    Abstract: Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans intuitively perform this translation based on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces a benchma…

    Submitted 28 October, 2025; originally announced October 2025.

  11. arXiv:2510.22986  [pdf, ps, other]

    cs.SE cs.DC cs.MA

    CodeAD: Synthesize Code of Rules for Log-based Anomaly Detection with LLMs

    Authors: Junjie Huang, Minghua He, Jinyang Liu, Yintong Huo, Domenico Bianculli, Michael R. Lyu

    Abstract: Log-based anomaly detection (LogAD) is critical for maintaining the reliability and availability of large-scale online service systems. While machine learning, deep learning, and large language model (LLM)-based methods have advanced LogAD, they often suffer from limited interpretability, high inference costs, and extensive preprocessing requirements, limiting their practicality for real-tim…

    Submitted 27 October, 2025; originally announced October 2025.

  12. arXiv:2510.21094  [pdf, ps, other]

    cs.SE

    BDiff: Block-aware and Accurate Text-based Code Differencing

    Authors: Yao Lu, Wanwei Liu, Tanghaoran Zhang, Kang Yang, Yang Zhang, Wenyu Xu, Longfei Sun, Xinjun Mao, Shuzheng Gao, Michael R. Lyu

    Abstract: Code differencing is a fundamental technique in software engineering practice and research. While researchers have proposed text-based differencing techniques capable of identifying line changes over the past decade, existing methods exhibit a notable limitation in identifying edit actions (EAs) that operate on text blocks spanning multiple lines. Such EAs are common in developers' practice, such…

    Submitted 23 October, 2025; originally announced October 2025.

  13. arXiv:2510.17163  [pdf, ps, other]

    cs.SE cs.AI

    TREAT: A Code LLMs Trustworthiness / Reliability Evaluation and Testing Framework

    Authors: Shuzheng Gao, Eric John Li, Man Ho Lam, Jingyu Xiao, Yuxuan Wan, Chaozheng Wang, Ng Man Tik, Michael R. Lyu

    Abstract: Large foundation models are fundamentally transforming the software engineering landscape, demonstrating exceptional capabilities across diverse tasks such as code generation, debugging, and testing. Despite this rapid progress, a significant gap remains in how to comprehensively evaluate these models' trustworthiness in real-world software engineering scenarios. Existing benchmarks suffer from li…

    Submitted 20 October, 2025; originally announced October 2025.

  14. arXiv:2510.17130  [pdf, ps, other]

    cs.SE

    SEER: Enhancing Chain-of-Thought Code Generation through Self-Exploring Deep Reasoning

    Authors: Shuzheng Gao, Chaozheng Wang, Cuiyun Gao, Michael R. Lyu

    Abstract: Code generation, the task of creating executable programs from natural language requirements, has recently seen tremendous advances through Chain-of-Thought (CoT) reasoning, which enables Large Language Models (LLMs) to develop high-level reasoning plans before writing code. Recent research has proposed various methods to enhance models' CoT reasoning for code generation such as prompt engineering…

    Submitted 19 October, 2025; originally announced October 2025.

    Comments: The paper was completed in Feb. 2025, submitted to ICSE 2026 in Mar. 2025, received a major revision in Jun. 2025, and was finally accepted in Oct. 2025

  15. arXiv:2510.01182  [pdf, ps, other]

    cs.SE

    When Shared Worlds Break: Demystifying Defects in Multi-User Extended Reality Software Systems

    Authors: Shuqing Li, Chenran Zhang, Binchang Li, Cuiyun Gao, Michael R. Lyu

    Abstract: Multi-user Extended Reality (XR) systems enable transformative shared experiences but introduce unique software defects that compromise user experience. Understanding software defects in multi-user XR systems is crucial for enhancing system reliability, yet remains underexplored. To fill the gap, this paper presents the first large-scale empirical study of multi-user XR defects, analyzing 2,649 re…

    Submitted 1 October, 2025; originally announced October 2025.

  16. arXiv:2509.26161  [pdf, ps, other]

    cs.AI cs.SE

    90% Faster, 100% Code-Free: MLLM-Driven Zero-Code 3D Game Development

    Authors: Runxin Yang, Yuxuan Wan, Shuqing Li, Michael R. Lyu

    Abstract: Developing 3D games requires specialized expertise across multiple domains, including programming, 3D modeling, and engine configuration, which limits access to millions of potential creators. Recently, researchers have begun to explore automated game development. However, existing approaches face three primary challenges: (1) limited scope to 2D content generation or isolated code snippets; (2) r…

    Submitted 30 September, 2025; originally announced September 2025.

  17. arXiv:2509.25874  [pdf, ps, other]

    cs.SE

    LogPilot: Intent-aware and Scalable Alert Diagnosis for Large-scale Online Service Systems

    Authors: Zhihan Jiang, Jinyang Liu, Yichen Li, Haiyu Huang, Xiao He, Tieying Zhang, Jianjun Chen, Yi Li, Rui Shi, Michael R. Lyu

    Abstract: Effective alert diagnosis is essential for ensuring the reliability of large-scale online service systems. However, on-call engineers are often burdened with manually inspecting massive volumes of logs to identify root causes. While various automated tools have been proposed, they struggle in practice due to alert-agnostic log scoping and the inability to organize complex data effectively for reas…

    Submitted 30 September, 2025; originally announced September 2025.

    Comments: Accepted by the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE 2025)

  18. arXiv:2509.25297  [pdf, ps, other]

    cs.SE cs.AI

    Automatically Generating Web Applications from Requirements Via Multi-Agent Test-Driven Development

    Authors: Yuxuan Wan, Tingshuo Liang, Jiakai Xu, Jingyu Xiao, Yintong Huo, Michael R. Lyu

    Abstract: Developing full-stack web applications is complex and time-intensive, demanding proficiency across diverse technologies and frameworks. Although recent advances in multimodal large language models (MLLMs) enable automated webpage generation from visual inputs, current solutions remain limited to front-end tasks and fail to deliver fully functional applications. In this work, we introduce TDDev, th…

    Submitted 1 October, 2025; v1 submitted 29 September, 2025; originally announced September 2025.

  19. arXiv:2509.24215  [pdf, ps, other]

    cs.SE cs.AI cs.CL cs.MM

    Metamorphic Testing for Audio Content Moderation Software

    Authors: Wenxuan Wang, Yongjiang Wu, Junyuan Zhang, Shuqing Li, Yun Peng, Wenting Chen, Shuai Wang, Michael R. Lyu

    Abstract: The rapid growth of audio-centric platforms and applications such as WhatsApp and Twitter has transformed the way people communicate and share audio content in modern society. However, these platforms are increasingly misused to disseminate harmful audio content, such as hate speech, deceptive advertisements, and explicit material, which can have significant negative consequences (e.g., detrimenta…

    Submitted 28 September, 2025; originally announced September 2025.

    Comments: Accepted by ASE 2025

  20. arXiv:2509.13852  [pdf, ps, other]

    cs.SE

    Trace Sampling 2.0: Code Knowledge Enhanced Span-level Sampling for Distributed Tracing

    Authors: Yulun Wu, Guangba Yu, Zhihan Jiang, Yichen Li, Michael R. Lyu

    Abstract: Distributed tracing is an essential diagnostic tool in microservice systems, but the sheer volume of traces places a significant burden on backend storage. A common approach to mitigating this issue is trace sampling, which selectively retains traces based on specific criteria, often preserving only anomalous ones. However, this method frequently discards valuable information, including normal tra…

    Submitted 17 September, 2025; originally announced September 2025.

  21. arXiv:2509.12159  [pdf, ps, other]

    cs.SE cs.AI

    EfficientUICoder: Efficient MLLM-based UI Code Generation via Input and Output Token Compression

    Authors: Jingyu Xiao, Zhongyi Zhang, Yuxuan Wan, Yintong Huo, Yang Liu, Michael R. Lyu

    Abstract: Multimodal Large Language Models have demonstrated exceptional performance in UI2Code tasks, significantly enhancing website development efficiency. However, these tasks incur substantially higher computational overhead than traditional code generation due to the large number of input image tokens and extensive output code tokens required. Our comprehensive study identifies significant redundancie…

    Submitted 15 September, 2025; originally announced September 2025.

  22. arXiv:2509.11312  [pdf, ps, other]

    cs.SE cs.AI

    Weakly Supervised Vulnerability Localization via Multiple Instance Learning

    Authors: Wenchao Gu, Yupan Chen, Yanlin Wang, Hongyu Zhang, Cuiyun Gao, Michael R. Lyu

    Abstract: Software vulnerability detection has emerged as a significant concern in the field of software security recently, capturing the attention of numerous researchers and developers. Most previous approaches focus on coarse-grained vulnerability detection, such as at the function or file level. However, developers would still encounter the challenge of manually inspecting a large volume of code ins…

    Submitted 14 September, 2025; originally announced September 2025.

  23. arXiv:2508.10074  [pdf, ps, other]

    cs.SE cs.LG

    Next Edit Prediction: Learning to Predict Code Edits from Context and Interaction History

    Authors: Ruofan Lu, Yintong Huo, Meng Zhang, Yichen Li, Michael R. Lyu

    Abstract: The rapid advancement of large language models (LLMs) has led to the widespread adoption of AI-powered coding assistants integrated into a development environment. On one hand, low-latency code completion offers completion suggestions but is fundamentally constrained to the cursor's current position. On the other hand, chat-based editing can perform complex modifications, yet forces developers to…

    Submitted 14 September, 2025; v1 submitted 13 August, 2025; originally announced August 2025.

  24. arXiv:2508.06926  [pdf, ps, other]

    cs.SE

    Integrating Rules and Semantics for LLM-Based C-to-Rust Translation

    Authors: Feng Luo, Kexing Ji, Cuiyun Gao, Shuzheng Gao, Jia Feng, Kui Liu, Xin Xia, Michael R. Lyu

    Abstract: Automated translation of legacy C code into Rust aims to ensure memory safety while reducing the burden of manual migration. Early approaches in code translation rely on static rule-based methods, but they suffer from limited coverage due to dependence on predefined rule patterns. Recent works regard the task as a sequence-to-sequence problem by leveraging large language models (LLMs). Although th…

    Submitted 9 August, 2025; originally announced August 2025.

    Comments: Accepted in ICSME 25 Industry Track

  25. arXiv:2508.00593  [pdf, ps, other]

    cs.SE

    Can User Feedback Help Issue Detection? An Empirical Study on a One-billion-user Online Service System

    Authors: Shuyao Jiang, Jiazhen Gu, Wujie Zheng, Yangfan Zhou, Michael R. Lyu

    Abstract: Background: It has long been suggested that user feedback, typically written in natural language by end-users, can help issue detection. However, for large-scale online service systems that receive a tremendous amount of feedback, it remains a challenging task to identify severe issues from user feedback. Aims: To develop a better feedback-based issue detection approach, it is crucial first to gai…

    Submitted 1 August, 2025; originally announced August 2025.

    Comments: Accepted by the 19th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2025)

  26. arXiv:2508.00546  [pdf, ps, other]

    cs.SE cs.AI

    SPENCER: Self-Adaptive Model Distillation for Efficient Code Retrieval

    Authors: Wenchao Gu, Zongyi Lyu, Yanlin Wang, Hongyu Zhang, Cuiyun Gao, Michael R. Lyu

    Abstract: Code retrieval aims to provide users with desired code snippets based on users' natural language queries. With the development of deep learning technologies, adopting pre-trained models for this task has become mainstream. Considering the retrieval efficiency, most of the previous approaches adopt a dual-encoder for this task, which encodes the description and code snippet into representation vect…

    Submitted 1 August, 2025; originally announced August 2025.

  27. arXiv:2507.22827  [pdf, ps, other]

    cs.CV

    ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

    Authors: Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue

    Abstract: Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While multimodal large language models (MLLMs) can translate images to code, they often fail on complex UIs, struggling to unify visual perception, layout planning, and code synthesis within a single monolithic model, w…

    Submitted 20 October, 2025; v1 submitted 30 July, 2025; originally announced July 2025.

    Comments: ScreenCoder-v2

  28. arXiv:2507.22099  [pdf, ps, other]

    cs.CV cs.AI cs.MM cs.SE

    Runtime Failure Hunting for Physics Engine Based Software Systems: How Far Can We Go?

    Authors: Shuqing Li, Qiang Chen, Xiaoxue Ren, Michael R. Lyu

    Abstract: Physics Engines (PEs) are fundamental software frameworks that simulate physical interactions in applications ranging from entertainment to safety-critical systems. Despite their importance, PEs suffer from physics failures, deviations from expected physical behaviors that can compromise software reliability, degrade user experience, and potentially cause critical failures in autonomous vehicles o…

    Submitted 29 July, 2025; originally announced July 2025.

  29. arXiv:2507.18625  [pdf, ps, other]

    cs.CV cs.AI cs.MM cs.SE

    3D Software Synthesis Guided by Constraint-Expressive Intermediate Representation

    Authors: Shuqing Li, Anson Y. Lam, Yun Peng, Wenxuan Wang, Michael R. Lyu

    Abstract: Graphical user interface (UI) software has undergone a fundamental transformation from traditional two-dimensional (2D) desktop/web/mobile interfaces to spatial three-dimensional (3D) environments. While existing work has made remarkable success in automated 2D software generation, such as HTML/CSS and mobile app interface code synthesis, the generation of 3D software still remains under-explored.…

    Submitted 17 December, 2025; v1 submitted 24 July, 2025; originally announced July 2025.

    Comments: Accepted by the IEEE/ACM International Conference on Software Engineering (ICSE) 2026, Rio de Janeiro, Brazil

  30. arXiv:2507.06056  [pdf, ps, other]

    cs.CL cs.AI

    Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs

    Authors: Yizhan Huang, Zhe Yang, Meifang Chen, Huang Nianchen, Jianping Zhang, Michael R. Lyu

    Abstract: Large Language Models (LLMs) are known to memorize portions of their training data, sometimes reproducing content verbatim when prompted appropriately. In this work, we investigate a fundamental yet under-explored question in the domain of memorization: How to characterize memorization difficulty of training data in LLMs? Through empirical experiments on OLMo, a family of open models, we present t…

    Submitted 27 September, 2025; v1 submitted 8 July, 2025; originally announced July 2025.

  31. arXiv:2506.20558  [pdf, ps, other]

    cs.SE

    CCISolver: End-to-End Detection and Repair of Method-Level Code-Comment Inconsistency

    Authors: Renyi Zhong, Yintong Huo, Wenwei Gu, Jinxi Kuang, Zhihan Jiang, Guangba Yu, Yichen Li, David Lo, Michael R. Lyu

    Abstract: Comments within code serve as a crucial foundation for software documentation, facilitating developers to communicate and understand the code effectively. However, code-comment inconsistency (CCI) can negatively affect software development, testing, and maintenance. Recent efforts to mitigate this issue have emerged, but existing studies often suffer from inaccurate datasets and inadequate solutio…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: This manuscript is under review

  32. arXiv:2506.07964  [pdf, ps, other]

    cs.CV cs.AI

    SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design

    Authors: Wenxin Tang, Jingyu Xiao, Wenxuan Jiang, Xi Xiao, Yuhang Wang, Xuxin Tang, Qing Li, Yuehe Ma, Junliang Liu, Shisong Tang, Michael R. Lyu

    Abstract: Manual slide creation is labor-intensive and requires expert prior knowledge. Existing natural language-based LLM generation methods struggle to capture the visual and structural nuances of slide designs. To address this, we formalize the Reference Image to Slide Generation task and propose Slide2Code, the first benchmark with difficulty-tiered samples based on a novel Slide Complexity Metric. We…

    Submitted 9 June, 2025; originally announced June 2025.

  33. arXiv:2506.06251  [pdf, ps, other]

    cs.SE cs.AI

    DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation

    Authors: Jingyu Xiao, Ming Wang, Man Ho Lam, Yuxuan Wan, Junliang Liu, Yintong Huo, Michael R. Lyu

    Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in automated front-end engineering, e.g., generating UI code from visual designs. However, existing front-end UI code generation benchmarks have the following limitations: (1) While framework-based development becomes predominant in modern front-end programming, current benchmarks fail to incorporate mainstream deve…

    Submitted 15 March, 2026; v1 submitted 6 June, 2025; originally announced June 2025.

  34. arXiv:2506.04569  [pdf, ps, other]

    cs.SE

    KPIRoot+: An Efficient Integrated Framework for Anomaly Detection and Root Cause Analysis in Large-Scale Cloud Systems

    Authors: Wenwei Gu, Renyi Zhong, Guangba Yu, Xinying Sun, Jinyang Liu, Yintong Huo, Zhuangbin Chen, Jianping Zhang, Jiazhen Gu, Yongqiang Yang, Michael R. Lyu

    Abstract: To ensure the reliability of cloud systems, their performance is monitored using KPIs (key performance indicators). When issues arise, root cause localization identifies KPIs responsible for service degradation, aiding in quick diagnosis and resolution. Traditional methods rely on similarity calculations, which can be ineffective in complex, interdependent cloud environments. While deep learning-b…

    Submitted 4 June, 2025; originally announced June 2025.

  35. arXiv:2505.21130  [pdf, other]

    cs.CR cs.SE

    ColorGo: Directed Concolic Execution

    Authors: Jia Li, Jiacheng Shen, Yuxin Su, Michael R. Lyu

    Abstract: Directed fuzzing is a critical technique in cybersecurity, targeting specific sections of a program. This approach is essential in various security-related domains such as crash reproduction, patch testing, and vulnerability detection. Despite its importance, current directed fuzzing methods exhibit a trade-off between efficiency and effectiveness. For instance, directed grey-box fuzzing, while ef…

    Submitted 27 May, 2025; originally announced May 2025.

  36. arXiv:2505.16590  [pdf, ps, other]

    cs.SE

    Larger Is Not Always Better: Exploring Small Open-source Language Models in Logging Statement Generation

    Authors: Renyi Zhong, Yichen Li, Guangba Yu, Wenwei Gu, Jinxi Kuang, Yintong Huo, Michael R. Lyu

    Abstract: Developers use logging statements to create logs that document system behavior and aid in software maintenance. As such, high-quality logging is essential for effective maintenance; however, manual logging often leads to errors and inconsistency. Recent methods emphasize using large language models (LLMs) for automated logging statement generation, but these present privacy and resource issues, hi…

    Submitted 4 September, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

  37. arXiv:2505.00342  [pdf, other]

    cs.SE

    LLMPrism: Black-box Performance Diagnosis for Production LLM Training Platforms

    Authors: Zhihan Jiang, Rui Ren, Guangba Yu, Yulun Wu, Wenwei Gu, Yichen Li, Yujie Huang, Cong Feng, Zengyin Yang, Yongqiang Yang, Michael R. Lyu

    Abstract: Large Language Models (LLMs) have brought about revolutionary changes in diverse fields, rendering LLM training of utmost importance for modern enterprises. To meet this demand, multi-tenant large-scale LLM training platforms have been built to offer LLM training services. Nevertheless, due to the complexity and synchronous nature of the LLM training process, performance issues occur frequently and ca…

    Submitted 1 May, 2025; originally announced May 2025.

  38. arXiv:2504.14119  [pdf, ps, other]

    cs.AI cs.SE

    CodeCrash: Exposing LLM Fragility to Misleading Natural Language in Code Reasoning

    Authors: Man Ho Lam, Chaozheng Wang, Jen-tse Huang, Michael R. Lyu

    Abstract: Large Language Models (LLMs) have recently demonstrated strong capabilities in code-related tasks, but their robustness in code reasoning under perturbations remains underexplored. We introduce CodeCrash, a stress-testing framework with 1,279 questions from CruxEval and LiveCodeBench, designed to evaluate reasoning reliability under structural perturbations and misleading natural language (NL) con…

    Submitted 11 October, 2025; v1 submitted 18 April, 2025; originally announced April 2025.

    Comments: NeurIPS 2025; 10 pages of main text; 25 pages of appendices. Website - https://cuhk-arise.github.io/CodeCrash/

  39. arXiv:2504.05738  [pdf, ps, other]

    cs.SE

    MioHint: LLM-assisted Mutation for Whitebox API Testing

    Authors: Jia Li, Jiacheng Shen, Yuxin Su, Michael R. Lyu

    Abstract: Cloud applications heavily rely on APIs to communicate with each other and exchange data. To ensure the reliability of cloud applications, cloud providers widely adopt API testing techniques. Unfortunately, existing API testing approaches are insufficient to reach strict conditions, a problem known as fitness plateaus, due to the lack of gradient provided by coverage metrics. To address this issue…

    Submitted 5 March, 2026; v1 submitted 8 April, 2025; originally announced April 2025.

    Comments: Accepted by ICSE 2026 (research track)

  40. arXiv:2504.03702  [pdf, ps, other]

    cs.DC

    Hierarchical Prediction-based Management for LMaaS Systems

    Authors: Zhihan Jiang, Yujie Huang, Guangba Yu, Junjie Huang, Jiazhen Gu, Michael R. Lyu

    Abstract: Large Language Models (LLMs) have revolutionized numerous domains, driving the rise of Language-Model-as-a-Service (LMaaS) platforms that process millions of queries daily. These platforms must minimize latency and meet Service Level Objectives (SLOs) while optimizing resource usage. However, conventional cloud service management techniques, designed for traditional workloads, are suboptimal for L…

    Submitted 19 October, 2025; v1 submitted 25 March, 2025; originally announced April 2025.

    Comments: This paper has been accepted by the 48th IEEE/ACM International Conference on Software Engineering (ICSE'26)

  41. arXiv:2503.23051  [pdf, other]

    cs.SE

    COCA: Generative Root Cause Analysis for Distributed Systems with Code Knowledge

    Authors: Yichen Li, Yulun Wu, Jinyang Liu, Zhihan Jiang, Zhuangbin Chen, Guangba Yu, Michael R. Lyu

    Abstract: Runtime failures are commonplace in modern distributed systems. When such issues arise, users often turn to platforms such as GitHub or JIRA to report them and request assistance. Automatically identifying the root cause of these failures is critical for ensuring high reliability and availability. However, prevailing automatic root cause analysis (RCA) approaches rely significantly on comprehensiv…

    Submitted 29 March, 2025; originally announced March 2025.

    Comments: Accepted by the 47th IEEE/ACM International Conference on Software Engineering (ICSE'25)

  42. arXiv:2503.20263  [pdf, other]

    cs.SE cs.DC

    L4: Diagnosing Large-scale LLM Training Failures via Automated Log Analysis

    Authors: Zhihan Jiang, Junjie Huang, Zhuangbin Chen, Yichen Li, Guangba Yu, Cong Feng, Yongqiang Yang, Zengyin Yang, Michael R. Lyu

    Abstract: As Large Language Models (LLMs) show their capabilities across various applications, training customized LLMs has become essential for modern enterprises. However, due to the complexity of LLM training, which requires massive computational resources and extensive training time, failures are inevitable during the training process. These failures result in considerable waste of resources and time, hi…

    Submitted 26 March, 2025; originally announced March 2025.

    Comments: To appear in companion proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering (FSE'25). 13 pages

  43. arXiv:2503.19519  [pdf, other

    cs.CR

    Towards Imperceptible Adversarial Attacks for Time Series Classification with Local Perturbations and Frequency Analysis

    Authors: Wenwei Gu, Renyi Zhong, Jianping Zhang, Michael R. Lyu

    Abstract: Adversarial attacks in time series classification (TSC) models have recently gained attention due to their potential to compromise model robustness. Imperceptibility is crucial, as adversarial examples detected by the human vision system (HVS) can render attacks ineffective. Many existing methods fail to produce high-quality imperceptible examples, often generating perturbations with more percepti…

    Submitted 25 March, 2025; originally announced March 2025.

  44. arXiv:2502.05849  [pdf, ps, other

    cs.CL

    Where Fact Ends and Fairness Begins: Redefining AI Bias Evaluation through Cognitive Biases

    Authors: Jen-tse Huang, Yuhang Yan, Linqi Liu, Yixin Wan, Wenxuan Wang, Kai-Wei Chang, Michael R. Lyu

    Abstract: Recent failures such as Google Gemini generating people of color in Nazi-era uniforms illustrate how AI outputs can be factually plausible yet socially harmful. AI models are increasingly evaluated for "fairness," yet existing benchmarks often conflate two fundamentally different dimensions: factual correctness and normative fairness. A model may generate responses that are factually accurate but…

    Submitted 29 September, 2025; v1 submitted 9 February, 2025; originally announced February 2025.

    Comments: Accepted to EMNLP 2025 (Findings)

  45. arXiv:2501.10711  [pdf, ps, other

    cs.SE cs.AI cs.CL

    Rigor, Reliability, and Reproducibility Matter: A Decade-Scale Survey of 572 Code Benchmarks

    Authors: Jialun Cao, Yuk-Kit Chan, Zixuan Ling, Wenxuan Wang, Shuqing Li, Mingwei Liu, Ruixi Qiao, Yuting Han, Chaozheng Wang, Boxi Yu, Pinjia He, Shuai Wang, Zibin Zheng, Michael R. Lyu, Shing-Chi Cheung

    Abstract: Code-related benchmarks play a critical role in evaluating large language models (LLMs), yet their quality fundamentally shapes how the community interprets model capabilities. In the past few years, awareness of benchmark quality has grown. Yet, after a decade-scale (2014-2025) survey over 572 code benchmarks, we observed a lag between growing awareness and actual practice. For example, in 2025 a…

    Submitted 8 February, 2026; v1 submitted 18 January, 2025; originally announced January 2025.

    Comments: 65 pages

  46. arXiv:2412.20100  [pdf, other

    cs.SE

    Distinguishability-guided Test Program Generation for WebAssembly Runtime Performance Testing

    Authors: Shuyao Jiang, Ruiying Zeng, Yangfan Zhou, Michael R. Lyu

    Abstract: WebAssembly (Wasm) is a binary instruction format designed as a portable compilation target, which has been widely used on both the web and server sides in recent years. As high performance is a critical design goal of Wasm, it is essential to conduct performance testing for Wasm runtimes. However, existing research on Wasm runtime performance testing still suffers from insufficient high-quality t…

    Submitted 28 December, 2024; originally announced December 2024.

    Comments: Accepted by the 32nd edition of the IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER 2025)

  47. arXiv:2412.15310  [pdf, other

    cs.SE cs.AI cs.IR

    MRWeb: An Exploration of Generating Multi-Page Resource-Aware Web Code from UI Designs

    Authors: Yuxuan Wan, Yi Dong, Jingyu Xiao, Yintong Huo, Wenxuan Wang, Michael R. Lyu

    Abstract: Multi-page websites dominate modern web development. However, existing design-to-code methods rely on simplified assumptions, limited to single-page, self-contained webpages without external resource connections. To address this gap, we introduce the Multi-Page Resource-Aware Webpage (MRWeb) generation task, which transforms UI designs into multi-page, functional web UIs with internal/external nav…

    Submitted 19 December, 2024; originally announced December 2024.

  48. arXiv:2412.11728  [pdf, other

    cs.SE

    SECRET: Towards Scalable and Efficient Code Retrieval via Segmented Deep Hashing

    Authors: Wenchao Gu, Ensheng Shi, Yanlin Wang, Lun Du, Shi Han, Hongyu Zhang, Dongmei Zhang, Michael R. Lyu

    Abstract: Code retrieval, which retrieves code snippets based on users' natural language descriptions, is widely used by developers and plays a pivotal role in real-world software development. The advent of deep learning has shifted the retrieval paradigm from lexical matching towards leveraging deep learning models to encode source code and queries into vector representations, facilitating code retri…

    Submitted 16 December, 2024; originally announced December 2024.

  49. arXiv:2412.06759  [pdf, ps, other

    cs.SE cs.AI cs.CR cs.HC

    XRZoo: A Large-Scale and Versatile Dataset of Extended Reality (XR) Applications

    Authors: Shuqing Li, Chenran Zhang, Cuiyun Gao, Michael R. Lyu

    Abstract: The rapid advancement of Extended Reality (XR, encompassing AR, MR, and VR) and spatial computing technologies forms a foundational layer for the emerging Metaverse, enabling innovative applications across healthcare, education, manufacturing, and entertainment. However, research in this area is often limited by the lack of large, representative, and high-quality application datasets that can suppo…

    Submitted 1 October, 2025; v1 submitted 9 December, 2024; originally announced December 2024.

  50. arXiv:2412.04947  [pdf, ps, other

    cs.CL

    C$^2$LEVA: Toward Comprehensive and Contamination-Free Language Model Evaluation

    Authors: Yanyang Li, Tin Long Wong, Cheung To Hung, Jianqiao Zhao, Duo Zheng, Ka Wai Liu, Michael R. Lyu, Liwei Wang

    Abstract: Recent advances in large language models (LLMs) have shown significant promise, yet their evaluation raises concerns, particularly regarding data contamination due to the lack of access to proprietary training data. To address this issue, we present C$^2$LEVA, a comprehensive bilingual benchmark featuring systematic contamination prevention. C$^2$LEVA first offers a holistic evaluation encompass…

    Submitted 29 May, 2025; v1 submitted 6 December, 2024; originally announced December 2024.

    Comments: Findings of ACL 2025; Project Page: https://github.com/LaVi-Lab/C2LEVA