Showing 1–50 of 60 results for author: Gu, K

Searching in archive cs.
  1. arXiv:2512.08296  [pdf, ps, other]

    cs.AI

    Towards a Science of Scaling Agent Systems

    Authors: Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A. Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Mark Malhotra, Paul Pu Liang, Hae Won Park, Yuzhe Yang, Xuhai Xu, Yilun Du, Shwetak Patel, Tim Althoff, Daniel McDuff, Xin Liu

    Abstract: Agents, language model-based systems that are capable of reasoning, planning, and acting, are becoming the dominant paradigm for real-world AI applications. Despite this widespread adoption, the principles that determine their performance remain underexplored. We address this by deriving quantitative scaling principles for agent systems. We first formalize a definition for agentic evaluation and ch…

    Submitted 16 December, 2025; v1 submitted 9 December, 2025; originally announced December 2025.

  2. arXiv:2511.03128  [pdf, ps, other]

    cs.LG cs.CL

    From Insight to Exploit: Leveraging LLM Collaboration for Adaptive Adversarial Text Generation

    Authors: Najrin Sultana, Md Rafi Ur Rashid, Kang Gu, Shagufta Mehnaz

    Abstract: LLMs can provide substantial zero-shot performance on diverse tasks using a simple task prompt, eliminating the need for training or fine-tuning. However, when applying these models to sensitive tasks, it is crucial to thoroughly assess their robustness against adversarial inputs. In this work, we introduce Static Deceptor (StaDec) and Dynamic Deceptor (DyDec), two innovative attack frameworks des…

    Submitted 4 November, 2025; originally announced November 2025.

    Comments: Findings of the Association for Computational Linguistics: EMNLP 2025 (camera-ready)

  3. arXiv:2510.25744  [pdf]

    cs.CL cs.AI

    Completion $\neq$ Collaboration: Scaling Collaborative Effort with Agents

    Authors: Shannon Zejiang Shen, Valerie Chen, Ken Gu, Alexis Ross, Zixian Ma, Jillian Ross, Alex Gu, Chenglei Si, Wayne Chi, Andi Peng, Jocelyn J Shen, Ameet Talwalkar, Tongshuang Wu, David Sontag

    Abstract: Current evaluations of agents remain centered around one-shot task completion, failing to account for the inherently iterative and collaborative nature of many real-world problems, where human goals are often underspecified and evolve. We argue for a shift from building and assessing task completion agents to developing collaborative agents, assessed not only by the quality of their final outputs…

    Submitted 30 October, 2025; v1 submitted 29 October, 2025; originally announced October 2025.

    Comments: 22 pages, 5 figures, 3 tables

  4. arXiv:2510.24427  [pdf, ps, other]

    cs.CL

    SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models

    Authors: Ken Gu, Advait Bhat, Mike A Merrill, Robert West, Xin Liu, Daniel McDuff, Tim Althoff

    Abstract: Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task…

    Submitted 30 October, 2025; v1 submitted 28 October, 2025; originally announced October 2025.

  5. arXiv:2510.23511  [pdf, ps, other]

    cs.RO

    Dexbotic: Open-Source Vision-Language-Action Toolbox

    Authors: Bin Xie, Erjin Zhou, Fan Jia, Hao Shi, Haoqiang Fan, Haowei Zhang, Hebei Li, Jianjian Sun, Jie Bin, Junwen Huang, Kai Liu, Kaixin Liu, Kefan Gu, Lin Sun, Meng Zhang, Peilong Han, Ruitao Hao, Ruitao Zhang, Saike Huang, Songhan Xie, Tiancai Wang, Tianle Liu, Wenbin Tang, Wenqi Zhu, Yang Chen, et al. (14 additional authors not shown)

    Abstract: In this paper, we present Dexbotic, an open-source Vision-Language-Action (VLA) model toolbox based on PyTorch. It aims to provide a one-stop VLA research service for professionals in the field of embodied intelligence. It offers a codebase that supports multiple mainstream VLA policies simultaneously, allowing users to reproduce various VLA methods with just a single environment setup. The toolbo…

    Submitted 27 October, 2025; originally announced October 2025.

    Comments: Authors are listed in alphabetical order. The official website is located at https://dexbotic.com/. Code is available at https://github.com/Dexmal/dexbotic

  6. arXiv:2510.17950  [pdf, ps, other]

    cs.RO

    RoboChallenge: Large-scale Real-robot Evaluation of Embodied Policies

    Authors: Adina Yakefu, Bin Xie, Chongyang Xu, Enwen Zhang, Erjin Zhou, Fan Jia, Haitao Yang, Haoqiang Fan, Haowei Zhang, Hongyang Peng, Jing Tan, Junwen Huang, Kai Liu, Kaixin Liu, Kefan Gu, Qinglun Zhang, Ruitao Zhang, Saike Huang, Shen Cheng, Shuaicheng Liu, Tiancai Wang, Tiezhen Wang, Wei Sun, Wenbin Tang, Yajun Wei, et al. (12 additional authors not shown)

    Abstract: Testing on real machines is indispensable for robotic control algorithms. In the context of learning-based algorithms, especially VLA models, demand for large-scale evaluation, i.e. testing a large number of models on a large number of tasks, is becoming increasingly urgent. However, doing this right is highly non-trivial, especially when scalability and reproducibility are taken into account. In t…

    Submitted 20 October, 2025; originally announced October 2025.

    Comments: Authors are listed in alphabetical order. The official website is located at https://robochallenge.ai

  7. arXiv:2510.11660  [pdf, ps, other]

    cs.RO cs.AI

    ManiAgent: An Agentic Framework for General Robotic Manipulation

    Authors: Yi Yang, Kefan Gu, Yuqing Wen, Hebei Li, Yucheng Zhao, Tiancai Wang, Xudong Liu

    Abstract: While Vision-Language-Action (VLA) models have demonstrated impressive capabilities in robotic manipulation, their performance in complex reasoning and long-horizon task planning is limited by data scarcity and model capacity. To address this, we introduce ManiAgent, an agentic architecture for general manipulation tasks that achieves end-to-end output from task descriptions and environmental inpu…

    Submitted 13 October, 2025; v1 submitted 13 October, 2025; originally announced October 2025.

    Comments: 8 pages, 6 figures, conference

  8. arXiv:2510.07778  [pdf, ps, other]

    cs.RO cs.AI cs.CV

    IntentionVLA: Generalizable and Efficient Embodied Intention Reasoning for Human-Robot Interaction

    Authors: Yandu Chen, Kefan Gu, Yuqing Wen, Yucheng Zhao, Tiancai Wang, Liqiang Nie

    Abstract: Vision-Language-Action (VLA) models leverage pretrained vision-language models (VLMs) to couple perception with robotic control, offering a promising path toward general-purpose embodied intelligence. However, current SOTA VLAs are primarily pretrained on multimodal tasks with limited relevance to embodied scenarios, and then finetuned to map explicit instructions to actions. Consequently, due to…

    Submitted 9 October, 2025; originally announced October 2025.

  9. arXiv:2510.00475  [pdf, ps, other]

    cs.LG cs.CV

    Diagnosing Shortcut-Induced Rigidity in Continual Learning: The Einstellung Rigidity Index (ERI)

    Authors: Kai Gu, Weishi Shi

    Abstract: Deep neural networks frequently exploit shortcut features, defined as incidental correlations between inputs and labels without causal meaning. Shortcut features undermine robustness and reduce reliability under distribution shifts. In continual learning (CL), the consequences of shortcut exploitation can persist and intensify: weights inherited from earlier tasks bias representation reuse toward…

    Submitted 30 September, 2025; originally announced October 2025.

    Comments: 10 pages, 6 figures

  10. arXiv:2509.22718  [pdf, ps, other]

    eess.AS cs.MM cs.SD

    PerformSinger: Multimodal Singing Voice Synthesis Leveraging Synchronized Lip Cues from Singing Performance Videos

    Authors: Ke Gu, Zhicong Wu, Peng Bai, Sitong Qiao, Zhiqi Jiang, Junchen Lu, Xiaodong Shi, Xinyuan Qian

    Abstract: Existing singing voice synthesis (SVS) models largely rely on fine-grained, phoneme-level durations, which limits their practical application. These methods overlook the complementary role of visual information in duration prediction. To address these issues, we propose PerformSinger, a pioneering multimodal SVS framework, which incorporates lip cues from video as a visual modality, enabling high-q…

    Submitted 24 September, 2025; originally announced September 2025.

  11. arXiv:2509.17336  [pdf, ps, other]

    cs.MM cs.CL cs.CV

    Mano Technical Report

    Authors: Tianyu Fu, Anyang Su, Chenxu Zhao, Hanning Wang, Minghui Wu, Zhe Yu, Fei Hu, Mingjia Shi, Wei Dong, Jiayao Wang, Yuyang Chen, Ruiyang Yu, Siran Peng, Menglin Li, Nan Huang, Haitian Wei, Jiawei Yu, Yi Xin, Xilin Zhao, Kai Gu, Ping Jiang, Sifan Zhou, Shuo Wang

    Abstract: Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decision-making cap…

    Submitted 31 October, 2025; v1 submitted 21 September, 2025; originally announced September 2025.

  12. arXiv:2509.06932  [pdf, ps, other]

    cs.RO cs.CV

    LLaDA-VLA: Vision Language Diffusion Action Models

    Authors: Yuqing Wen, Hebei Li, Kefan Gu, Yucheng Zhao, Tiancai Wang, Xiaoyan Sun

    Abstract: The rapid progress of auto-regressive vision-language models (VLMs) has inspired growing interest in vision-language-action models (VLA) for robotic manipulation. Recently, masked diffusion models, a paradigm distinct from autoregressive models, have begun to demonstrate competitive performance in text generation and multimodal applications, leading to the development of a series of diffusion-base…

    Submitted 10 September, 2025; v1 submitted 8 September, 2025; originally announced September 2025.

  13. arXiv:2508.20148  [pdf]

    cs.AI cs.HC cs.MA

    The Anatomy of a Personal Health Agent

    Authors: A. Ali Heydari, Ken Gu, Vidya Srinivas, Hong Yu, Zhihan Zhang, Yuwei Zhang, Akshay Paruchuri, Qian He, Hamid Palangi, Nova Hammerquist, Ahmed A. Metwally, Brent Winslow, Yubin Kim, Kumar Ayush, Yuzhe Yang, Girish Narayanswamy, Maxwell A. Xu, Jake Garrison, Amy Armento Lee, Jenny Vafeiadou, Ben Graef, Isaac R. Galatzer-Levy, Erik Schenck, Andrew Barakat, Javier Perez, et al. (13 additional authors not shown)

    Abstract: Health is a fundamental pillar of human wellness, and the rapid advancements in large language models (LLMs) have driven the development of a new generation of health agents. However, the application of health agents to fulfill the diverse needs of individuals in daily non-clinical settings is underexplored. In this work, we aim to build a comprehensive personal health agent that is able to reason…

    Submitted 18 September, 2025; v1 submitted 27 August, 2025; originally announced August 2025.

    Comments: Minor updates to the manuscript (V2)

  14. arXiv:2508.15285  [pdf, ps, other]

    cs.DB

    Efficient Cloud-Edge-Device Query Execution Based on Collaborative Scan Operator

    Authors: Chunyu Zhao, Hongzhi Wang, Kaixin Zhang, Hongliang Li, Yihan Zhang, Jiawei Zhang, Kunkai Gu, Yuan Tian, Xiangdong Huang, Jingyi Xu

    Abstract: In cloud-edge-device (CED) collaborative query (CQ) processing, by leveraging CED collaboration, the advantages of both cloud computing and edge resources can be fully integrated. However, it is difficult to implement collaborative operators that can flexibly switch between the cloud and the edge during query execution. Thus, in this paper, we aim to improve the query performance when the edge res…

    Submitted 21 August, 2025; originally announced August 2025.

    Comments: 12 pages, 23 figures. Submitted to IEEE Transactions on ICDE

    ACM Class: C.2.4; H.2.4; D.2.11

  15. arXiv:2508.05023  [pdf, ps, other]

    cs.CL cs.AI

    Dialogues Aspect-based Sentiment Quadruple Extraction via Structural Entropy Minimization Partitioning

    Authors: Kun Peng, Cong Cao, Hao Peng, Zhifeng Hao, Lei Jiang, Kongjing Gu, Yanbing Liu, Philip S. Yu

    Abstract: Dialogues Aspect-based Sentiment Quadruple Extraction (DiaASQ) aims to extract all target-aspect-opinion-sentiment quadruples from a given multi-round, multi-participant dialogue. Existing methods typically learn word relations across entire dialogues, assuming a uniform distribution of sentiment elements. However, we find that dialogues often contain multiple semantically independent sub-dialogue…

    Submitted 7 August, 2025; originally announced August 2025.

    Comments: Accepted by CIKM2025

  16. arXiv:2506.13679  [pdf, ps, other]

    cs.RO cs.AI cs.CV

    ROSA: Harnessing Robot States for Vision-Language and Action Alignment

    Authors: Yuqing Wen, Kefan Gu, Haoxuan Liu, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiaoyan Sun

    Abstract: Vision-Language-Action (VLA) models have recently made significant advances in multi-task, end-to-end robotic control, due to the strong generalization capabilities of Vision-Language Models (VLMs). A fundamental challenge in developing such models is effectively aligning the vision-language space with the robotic action space. Existing approaches typically rely on directly fine-tuning VLMs using e…

    Submitted 16 June, 2025; originally announced June 2025.

  17. arXiv:2506.08249  [pdf, ps, other]

    cs.DB cs.CL

    RADAR: Benchmarking Language Models on Imperfect Tabular Data

    Authors: Ken Gu, Zhihan Zhang, Kate Lin, Yuwei Zhang, Akshay Paruchuri, Hong Yu, Mehran Kazemi, Kumar Ayush, A. Ali Heydari, Maxwell A. Xu, Girish Narayanswamy, Yun Liu, Ming-Zher Poh, Yuzhe Yang, Mark Malhotra, Shwetak Patel, Hamid Palangi, Xuhai Xu, Daniel McDuff, Tim Althoff, Xin Liu

    Abstract: Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness -- the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies -- remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compro…

    Submitted 30 October, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

    Comments: NeurIPS 2025 Dataset and Benchmark Track

  18. arXiv:2506.05321  [pdf, other]

    cs.LG

    LSM-2: Learning from Incomplete Wearable Sensor Data

    Authors: Maxwell A. Xu, Girish Narayanswamy, Kumar Ayush, Dimitris Spathis, Shun Liao, Shyam A. Tailor, Ahmed Metwally, A. Ali Heydari, Yuwei Zhang, Jake Garrison, Samy Abdel-Ghaffar, Xuhai Xu, Ken Gu, Jacob Sunshine, Ming-Zher Poh, Yun Liu, Tim Althoff, Shrikanth Narayanan, Pushmeet Kohli, Mark Malhotra, Shwetak Patel, Yuzhe Yang, James M. Rehg, Xin Liu, Daniel McDuff

    Abstract: Foundation models, a cornerstone of recent advancements in machine learning, have predominantly thrived on complete and well-structured data. Wearable sensor data frequently suffers from significant missingness, posing a substantial challenge for self-supervised learning (SSL) models that typically assume complete data inputs. This paper introduces the second generation of Large Sensor Model (LSM-…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: Xu and Narayanswamy are co-first authors. McDuff and Liu are co-last authors

  19. arXiv:2505.01224  [pdf, ps, other]

    cs.CV eess.IV

    VRS-UIE: Value-Driven Reordering Scanning for Underwater Image Enhancement

    Authors: Kui Jiang, Yan Luo, Junjun Jiang, Ke Gu, Nan Ma, Xianming Liu

    Abstract: State Space Models (SSMs) have emerged as a promising backbone for vision tasks due to their linear complexity and global receptive field. However, in the context of Underwater Image Enhancement (UIE), the standard sequential scanning mechanism is fundamentally challenged by the unique statistical distribution characteristics of underwater scenes. The predominance of large-portion, homogeneous but…

    Submitted 15 October, 2025; v1 submitted 2 May, 2025; originally announced May 2025.

  20. arXiv:2503.24306  [pdf, other]

    cs.CV

    Point Tracking in Surgery--The 2024 Surgical Tattoos in Infrared (STIR) Challenge

    Authors: Adam Schmidt, Mert Asim Karaoglu, Soham Sinha, Mingang Jang, Ho-Gun Ha, Kyungmin Jung, Kyeongmo Gu, Ihsan Ullah, Hyunki Lee, Jonáš Šerých, Michal Neoral, Jiří Matas, Rulin Zhou, Wenlong He, An Wang, Hongliang Ren, Bruno Silva, Sandro Queirós, Estêvão Lima, João L. Vilaça, Shunsuke Kikuchi, Atsushi Kouno, Hiroki Matsuzaki, Tongtong Li, Yulu Chen, et al. (15 additional authors not shown)

    Abstract: Understanding tissue motion in surgery is crucial to enable applications in downstream tasks such as segmentation, 3D reconstruction, virtual tissue landmarking, autonomous probe-based scanning, and subtask autonomy. Labeled data are essential to enabling algorithms in these downstream tasks since they allow us to quantify and train algorithms. This paper introduces a point tracking challenge to a…

    Submitted 31 March, 2025; originally announced March 2025.

  21. arXiv:2503.11194  [pdf, other]

    cs.CV

    Online Test-time Adaptation for 3D Human Pose Estimation: A Practical Perspective with Estimated 2D Poses

    Authors: Qiuxia Lin, Kerui Gu, Linlin Yang, Angela Yao

    Abstract: Online test-time adaptation for 3D human pose estimation is used for video streams that differ from training data. Ground truth 2D poses are used for adaptation, but only estimated 2D poses are available in practice. This paper addresses adapting models to streaming videos with estimated 2D poses. Comparing adaptations reveals the challenge of limiting estimation errors while preserving accurate p…

    Submitted 14 March, 2025; originally announced March 2025.

  22. arXiv:2502.10724  [pdf, other]

    cs.CV

    Semantics-aware Test-time Adaptation for 3D Human Pose Estimation

    Authors: Qiuxia Lin, Rongyu Chen, Kerui Gu, Angela Yao

    Abstract: This work highlights a semantics misalignment in 3D human pose estimation. For the task of test-time adaptation, the misalignment manifests as overly smoothed and unguided predictions. The smoothing settles predictions towards some average pose. Furthermore, when there are occlusions or truncations, the adaptation becomes fully unguided. To this end, we pioneer the integration of a semantics-aware…

    Submitted 28 May, 2025; v1 submitted 15 February, 2025; originally announced February 2025.

    Comments: 10 pages, 4 figures

  23. arXiv:2412.10838  [pdf, other]

    cond-mat.mtrl-sci cs.AI physics.app-ph

    Deep Learning Models for Colloidal Nanocrystal Synthesis

    Authors: Kai Gu, Yingping Liang, Jiaming Su, Peihan Sun, Jia Peng, Naihua Miao, Zhimei Sun, Ying Fu, Haizheng Zhong, Jun Zhang

    Abstract: Colloidal synthesis of nanocrystals usually includes complex chemical reactions and multi-step crystallization processes. Despite the great success in the past 30 years, it remains challenging to clarify the correlations between synthetic parameters of chemical reaction and physical properties of nanocrystals. Here, we developed a deep learning-based nanocrystal synthesis model that correlates syn…

    Submitted 14 December, 2024; originally announced December 2024.

  24. arXiv:2412.00749  [pdf, other]

    cs.DB cs.AI

    CONCERTO: Complex Query Execution Mechanism-Aware Learned Cost Estimation

    Authors: Kaixin Zhang, Hongzhi Wang, Kunkai Gu, Ziqi Li, Chunyu Zhao, Yingze Li, Yu Yan

    Abstract: With the growing demand for massive data analysis, many DBMSs have adopted complex underlying query execution mechanisms, including vectorized operators, parallel execution, and dynamic pipeline modifications. However, there remains a lack of targeted Query Performance Prediction (QPP) methods for these complex execution mechanisms and their interactions, as most existing approaches focus on tradi…

    Submitted 28 March, 2025; v1 submitted 1 December, 2024; originally announced December 2024.

  25. arXiv:2411.18628  [pdf]

    cs.CY

    Cohort profile: the Northwest China Real-world and Population-based Cohort

    Authors: Qi Huang, Yanjun Li, Bo Yin, Yaoguo Wang, Yujuan Yuan, Yanying Guo, Kuiying Gu, Yining Yang, Qian Di

    Abstract: The Northwest China Real-World and Population-based cohort is an ongoing prospective cohort with a population of more than 25 million, covering almost all residents across approximately 1.66 million square kilometers in northwest China. The cohort integrates data from various sources, including health profiles, examination records, electronic health records, mortality records, statistical yearbooks, an…

    Submitted 13 November, 2024; originally announced November 2024.

    Comments: 32 pages, 2 tables, 2 figures, and 1 appendix

  26. arXiv:2409.06381  [pdf, other]

    cs.CV

    A Cross-Font Image Retrieval Network for Recognizing Undeciphered Oracle Bone Inscriptions

    Authors: Zhicong Wu, Qifeng Su, Ke Gu, Xiaodong Shi

    Abstract: Oracle Bone Inscription (OBI) is the earliest mature writing system in China, which represents a crucial stage in the development of hieroglyphs. Nevertheless, the substantial quantity of undeciphered OBI characters remains a significant challenge for scholars, while conventional methods of ancient script research are both time-consuming and labor-intensive. In this paper, we propose a cross-font…

    Submitted 25 December, 2024; v1 submitted 10 September, 2024; originally announced September 2024.

  27. arXiv:2408.09667  [pdf, ps, other]

    cs.CL

    BLADE: Benchmarking Language Model Agents for Data-Driven Science

    Authors: Ken Gu, Ruoxi Shang, Ruien Jiang, Keying Kuang, Richard-John Lin, Donghe Lyu, Yue Mao, Youran Pan, Teng Wu, Jiaqian Yu, Yikun Zhang, Tianmai M. Zhang, Lanyi Zhu, Mike A. Merrill, Jeffrey Heer, Tim Althoff

    Abstract: Data-driven scientific discovery requires the iterative integration of scientific domain knowledge, statistical expertise, and an understanding of data semantics to make nuanced analytical decisions, e.g., about which variables, transformations, and statistical models to consider. LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-dri…

    Submitted 10 November, 2025; v1 submitted 18 August, 2024; originally announced August 2024.

    Comments: EMNLP 2024

  28. arXiv:2407.11009  [pdf, other]

    cs.CL cs.LG

    CharED: Character-wise Ensemble Decoding for Large Language Models

    Authors: Kevin Gu, Eva Tuecke, Dmitriy Katz, Raya Horesh, David Alvarez-Melis, Mikhail Yurochkin

    Abstract: Large language models (LLMs) have shown remarkable potential for problem solving, with open source models achieving increasingly impressive performance on benchmarks measuring areas from logical reasoning to mathematical ability. Ensembling models can further improve capabilities across a variety of domains. However, conventional methods of combining models at inference time such as shallow fusion…

    Submitted 25 June, 2024; originally announced July 2024.

    Comments: 9 pages, 4 figures

  29. arXiv:2407.00574  [pdf, other]

    cs.CV

    Humans as Checkerboards: Calibrating Camera Motion Scale for World-Coordinate Human Mesh Recovery

    Authors: Fengyuan Yang, Kerui Gu, Ha Linh Nguyen, Tze Ho Elden Tse, Angela Yao

    Abstract: Accurate camera motion estimation is essential for recovering global human motion in world coordinates from RGB video inputs. SLAM is widely used for estimating camera trajectory and point cloud, but monocular SLAM does so only up to an unknown scale factor. Previous works estimate the scale factor through optimization, but this is unreliable and time-consuming. This paper presents an optimization…

    Submitted 12 December, 2024; v1 submitted 29 June, 2024; originally announced July 2024.

    Comments: 13 pages, 11 figures, 6 tables

  30. arXiv:2406.19888  [pdf, other]

    cs.AI

    Fine-tuning of Geospatial Foundation Models for Aboveground Biomass Estimation

    Authors: Michal Muszynski, Levente Klein, Ademir Ferreira da Silva, Anjani Prasad Atluri, Carlos Gomes, Daniela Szwarcman, Gurkanwar Singh, Kewen Gu, Maciel Zortea, Naomi Simumba, Paolo Fraccaro, Shraddha Singh, Steve Meliksetian, Campbell Watson, Daiki Kimura, Harini Srinivasan

    Abstract: Global vegetation structure mapping is critical for understanding the global carbon cycle and maximizing the efficacy of nature-based carbon sequestration initiatives. Moreover, vegetation structure mapping can help reduce the impacts of climate change by, for example, guiding actions to improve water security, increase biodiversity and reduce flood risk. Global satellite measurements provide an i…

    Submitted 28 June, 2024; originally announced June 2024.

  31. arXiv:2405.19833  [pdf, other]

    cs.CV

    KITRO: Refining Human Mesh by 2D Clues and Kinematic-tree Rotation

    Authors: Fengyuan Yang, Kerui Gu, Angela Yao

    Abstract: 2D keypoints are commonly used as an additional cue to refine estimated 3D human meshes. Current methods optimize the pose and shape parameters with a reprojection loss on the provided 2D keypoints. Such an approach, while simple and intuitive, has limited effectiveness because the optimal solution is hard to find in ambiguous parameter space and may sacrifice depth. Additionally, divergent gradie…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: Accepted by CVPR24

  32. arXiv:2403.14863  [pdf, other]

    physics.med-ph cs.CV cs.LG

    Distribution-informed and wavelength-flexible data-driven photoacoustic oximetry

    Authors: Janek Gröhl, Kylie Yeung, Kevin Gu, Thomas R. Else, Monika Golinska, Ellie V. Bunce, Lina Hacker, Sarah E. Bohndiek

    Abstract: Significance: Photoacoustic imaging (PAI) promises to measure spatially-resolved blood oxygen saturation, but suffers from a lack of accurate and robust spectral unmixing methods to deliver on this promise. Accurate blood oxygenation estimation could have important clinical applications, from cancer detection to quantifying inflammation. Aim: This study addresses the inflexibility of existing da…

    Submitted 21 March, 2024; originally announced March 2024.

    Comments: 37 pages, 7 figures

    ACM Class: F.2.1

  33. arXiv:2403.10557  [pdf, other]

    cs.LG cs.AI cs.CL

    Second-Order Information Matters: Revisiting Machine Unlearning for Large Language Models

    Authors: Kang Gu, Md Rafi Ur Rashid, Najrin Sultana, Shagufta Mehnaz

    Abstract: With the rapid development of Large Language Models (LLMs), we have witnessed intense competition among the major LLM products like ChatGPT, LLaMa, and Gemini. However, various issues (e.g. privacy leakage and copyright violation) of the training corpus still remain underexplored. For example, the Times sued OpenAI and Microsoft for infringing on its copyrights by using millions of its articles fo…

    Submitted 13 March, 2024; originally announced March 2024.

  34. arXiv:2402.16281  [pdf, other]

    cs.RO cs.AI

    RobKiNet: Robotic Kinematics Informed Neural Network for Optimal Robot Configuration Prediction

    Authors: Yanlong Peng, Zhigang Wang, Yisheng Zhang, Pengxu Chang, Ziwen He, Kai Gu, Hongshen Zhang, Ming Chen

    Abstract: Task and Motion Planning (TAMP) is essential for robots to interact with the world and accomplish complex tasks. The TAMP problem involves a critical gap: exploring the robot's configuration parameters (such as chassis position and robotic arm joint angles) within continuous space to ensure that task-level global constraints are met while also enhancing the efficiency of subsequent motion planning…

    Submitted 4 March, 2025; v1 submitted 25 February, 2024; originally announced February 2024.

  35. arXiv:2401.13956  [pdf, other]

    cs.CV

    A New Image Quality Database for Multiple Industrial Processes

    Authors: Xuanchao Ma, Yanlin Jiang, Hongyan Liu, Chengxu Zhou, Ke Gu

    Abstract: Recent years have witnessed a broader range of applications of image processing technologies in multiple industrial processes, such as smoke detection, security monitoring, and workpiece inspection. Different kinds of distortion types and levels must be introduced into an image during the processes of acquisition, compression, transmission, storage, and display, which might heavily degrade the ima…

    Submitted 15 February, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

  36. arXiv:2401.02823  [pdf, other]

    cs.CL cs.IR

    DocGraphLM: Documental Graph Language Model for Information Extraction

    Authors: Dongsheng Wang, Zhiqiang Ma, Armineh Nourbakhsh, Kang Gu, Sameena Shah

    Abstract: Advances in Visually Rich Document Understanding (VrDU) have enabled information extraction and question answering over documents with complex layouts. Two tropes of architectures have emerged -- transformer-based models inspired by LLMs, and Graph Neural Networks. In this paper, we introduce DocGraphLM, a novel framework that combines pre-trained language models with graph semantics. To achieve t…

    Submitted 5 January, 2024; originally announced January 2024.

    Comments: Published at SIGIR'23 (repost for easier access)

  37. arXiv:2312.00462  [pdf, other]

    cs.CV

    Learning Unorthogonalized Matrices for Rotation Estimation

    Authors: Kerui Gu, Zhihao Li, Shiyong Liu, Jianzhuang Liu, Songcen Xu, Youliang Yan, Michael Bi Mi, Kenji Kawaguchi, Angela Yao

    Abstract: Estimating 3D rotations is a common procedure for 3D computer vision. The accuracy depends heavily on the rotation representation. One form of representation -- rotation matrices -- is popular due to its continuity, especially for pose estimation tasks. The learning process usually incorporates orthogonalization to ensure orthonormal matrices. Our work reveals, through gradient analysis, that comm…

    Submitted 1 December, 2023; originally announced December 2023.

  38. arXiv:2311.17105  [pdf, other]

    cs.CV

    On the Calibration of Human Pose Estimation

    Authors: Kerui Gu, Rongyu Chen, Angela Yao

    Abstract: Most 2D human pose estimation frameworks estimate keypoint confidence in an ad-hoc manner, using heuristics such as the maximum value of heatmaps. The confidence is part of the evaluation scheme, e.g., AP for the MSCOCO dataset, yet has been largely overlooked in the development of state-of-the-art methods. This paper takes the first steps in addressing miscalibration in pose estimation. From a ca…

    Submitted 28 November, 2023; originally announced November 2023.

  39. arXiv:2310.16152  [pdf, ps, other]

    cs.CR cs.LG

    Gradient-Free Privacy Leakage in Federated Language Models through Selective Weight Tampering

    Authors: Md Rafi Ur Rashid, Vishnu Asutosh Dasu, Kang Gu, Najrin Sultana, Shagufta Mehnaz

    Abstract: Federated learning (FL) has become a key component in various language modeling applications such as machine translation, next-word prediction, and medical record analysis. These applications are trained on datasets from many FL participants that often include privacy-sensitive data, such as healthcare records, phone/credit card numbers, login credentials, etc. Although FL enables computation with…

    Submitted 9 December, 2025; v1 submitted 24 October, 2023; originally announced October 2023.

    Comments: 21 pages (including bibliography and Appendix), Submitted to PETS'26

  40. arXiv:2309.10947  [pdf, other]

    cs.HC

    How Do Analysts Understand and Verify AI-Assisted Data Analyses?

    Authors: Ken Gu, Ruoxi Shang, Tim Althoff, Chenglong Wang, Steven M. Drucker

    Abstract: Data analysis is challenging as it requires synthesizing domain knowledge, statistical expertise, and programming skills. Assistants powered by large language models (LLMs), such as ChatGPT, can assist analysts by translating natural language instructions into code. However, AI-assistant responses and analysis code can be misaligned with the analyst's intent or be seemingly correct but lead to inc…

    Submitted 4 March, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

    Comments: Accepted to CHI 2024

  41. arXiv:2309.10108  [pdf, other]

    cs.HC

    How Do Data Analysts Respond to AI Assistance? A Wizard-of-Oz Study

    Authors: Ken Gu, Madeleine Grunde-McLaughlin, Andrew M. McNutt, Jeffrey Heer, Tim Althoff

    Abstract: Data analysis is challenging as analysts must navigate nuanced decisions that may yield divergent conclusions. AI assistants have the potential to support analysts in planning their analyses, enabling more robust decision making. Though AI-based assistants that target code execution (e.g., GitHub Copilot) have received significant attention, limited research addresses assistance for both analysis…

    Submitted 4 March, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

    Comments: Accepted to CHI 2024

  42. arXiv:2308.04160  [pdf, other]

    cs.RO

    S&Reg: End-to-End Learning-Based Model for Multi-Goal Path Planning Problem

    Authors: Yuan Huang, Kairui Gu, Hee-hyol Lee

    Abstract: In this paper, we propose a novel end-to-end approach for solving the multi-goal path planning problem in obstacle environments. Our proposed model, called S&Reg, integrates multi-task learning networks with a TSP solver and a path planner to quickly compute a closed and feasible path visiting all goals. Specifically, the model first predicts promising regions that potentially contain the optimal…

    Submitted 8 August, 2023; originally announced August 2023.

    Comments: 7 pages, 12 figures. Accepted at IEEE International Conference on Robot and Human Interactive Communication (ROMAN), 2023

  43. arXiv:2301.10431  [pdf, other]

    cs.CV

    Bias-Compensated Integral Regression for Human Pose Estimation

    Authors: Kerui Gu, Linlin Yang, Michael Bi Mi, Angela Yao

    Abstract: In human and hand pose estimation, heatmaps are a crucial intermediate representation for a body or hand keypoint. Two popular methods to decode the heatmap into a final joint coordinate are via an argmax, as done in heatmap detection, or via softmax and expectation, as done in integral regression. Integral regression is learnable end-to-end, but has lower accuracy than detection. This paper uncov…

    Submitted 25 January, 2023; originally announced January 2023.

  44. arXiv:2210.03804  [pdf, other]

    cs.HC cs.SE

    Understanding and Supporting Debugging Workflows in Multiverse Analysis

    Authors: Ken Gu, Eunice Jun, Tim Althoff

    Abstract: Multiverse analysis, a paradigm for statistical analysis that considers all combinations of reasonable analysis choices in parallel, promises to improve transparency and reproducibility. Although recent tools help analysts specify multiverse analyses, they remain difficult to use in practice. In this work, we identify debugging as a key barrier due to the latency from running analyses to detecting…

    Submitted 4 June, 2023; v1 submitted 7 October, 2022; originally announced October 2022.

    Comments: CHI 2023

    Journal ref: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23), April 23-28, 2023, Hamburg, Germany. ACM, New York, NY, USA

  45. arXiv:2205.03574  [pdf, other]

    cs.CV eess.IV

    Utility-Oriented Underwater Image Quality Assessment Based on Transfer Learning

    Authors: Weiling Chen, Rongfu Lin, Honggang Liao, Tiesong Zhao, Ke Gu, Patrick Le Callet

    Abstract: The widespread image applications have greatly promoted the vision-based tasks, in which the Image Quality Assessment (IQA) technique has become an increasingly significant issue. For user enjoyment in multimedia systems, the IQA exploits image fidelity and aesthetics to characterize user experience; while for other tasks such as popular object recognition, there exists a low correlation between u…

    Submitted 7 May, 2022; originally announced May 2022.

  46. arXiv:2107.11413  [pdf, other]

    cs.LG cs.HC

    An Instance-Dependent Simulation Framework for Learning with Label Noise

    Authors: Keren Gu, Xander Masotto, Vandana Bachani, Balaji Lakshminarayanan, Jack Nikodem, Dong Yin

    Abstract: We propose a simulation framework for generating instance-dependent noisy labels via a pseudo-labeling paradigm. We show that the distribution of the synthetic noisy labels generated with our framework is closer to human labels compared to independent and class-conditional random flipping. Equipped with controllable label noise, we study the negative impact of noisy labels across a few practical s…

    Submitted 17 October, 2021; v1 submitted 23 July, 2021; originally announced July 2021.

    Comments: Datasets released at https://github.com/deepmind/deepmind-research/tree/master/noisy_label

  47. Feature Selection for Multivariate Time Series via Network Pruning

    Authors: Kang Gu, Soroush Vosoughi, Temiloluwa Prioleau

    Abstract: In recent years, there has been an ever increasing amount of multivariate time series (MTS) data in various domains, typically generated by a large family of sensors such as wearable devices. This has led to the development of novel learning methods on MTS data, with deep learning models dominating the most recent advancements. Prior literature has primarily focused on designing new network archit…

    Submitted 21 October, 2021; v1 submitted 11 February, 2021; originally announced February 2021.

    Comments: In ICDM 2021 Workshop on Systematic Feature Engineering for Time-Series Data Mining (SFE-TSDM)

  48. arXiv:2006.14002  [pdf, other]

    cs.CE cs.LG stat.ML

    Bi-Level Graph Neural Networks for Drug-Drug Interaction Prediction

    Authors: Yunsheng Bai, Ken Gu, Yizhou Sun, Wei Wang

    Abstract: We introduce Bi-GNN for modeling biological link prediction tasks such as drug-drug interaction (DDI) and protein-protein interaction (PPI). Taking drug-drug interaction as an example, existing methods using machine learning either only utilize the link structure between drugs without using the graph representation of each drug molecule, or only leverage the individual drug compound structures wit…

    Submitted 11 June, 2020; originally announced June 2020.

  49. Shop The Look: Building a Large Scale Visual Shopping System at Pinterest

    Authors: Raymond Shiau, Hao-Yu Wu, Eric Kim, Yue Li Du, Anqi Guo, Zhiyuan Zhang, Eileen Li, Kunlong Gu, Charles Rosenberg, Andrew Zhai

    Abstract: As online content becomes ever more visual, the demand for searching by visual queries grows correspondingly stronger. Shop The Look is an online shopping discovery service at Pinterest, leveraging visual search to enable users to find and buy products within an image. In this work, we provide a holistic view of how we built Shop The Look, a shopping oriented visual search system, along with lesso…

    Submitted 18 June, 2020; originally announced June 2020.

    Comments: 10 pages, 7 figures, Accepted to KDD'20

    ACM Class: I.2.10; I.4.8; I.4.9; I.4.10; I.5.4; K.4.4

  50. Bootstrapping Complete The Look at Pinterest

    Authors: Eileen Li, Eric Kim, Andrew Zhai, Josh Beal, Kunlong Gu

    Abstract: Putting together an ideal outfit is a process that involves creativity and style intuition. This makes it a particularly difficult task to automate. Existing styling products generally involve human specialists and a highly curated set of fashion items. In this paper, we will describe how we bootstrapped the Complete The Look (CTL) system at Pinterest. This is a technology that aims to learn the s…

    Submitted 29 June, 2020; v1 submitted 18 June, 2020; originally announced June 2020.

    Comments: 9 pages, 12 figures, To be published in KDD '20