
Showing 1–50 of 78 results for author: Bai, F

Searching in archive cs.
  1. arXiv:2604.08044  [pdf, ps, other]

    cs.AR

    A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators

    Authors: Cong Li, Chenhao Xue, Yi Ren, Xiping Dong, Yu Cheng, Yinbo Hu, Fujun Bai, Yixin Guo, Xiping Jiang, Qiang Wu, Zhi Yang, Zhe Cheng, Yuan Xie, Guangyu Sun

    Abstract: Large language models (LLMs) exhibit memory-intensive behavior during decoding, making it a key bottleneck in LLM inference. To accelerate decoding execution, hybrid-bonding-based 3D-DRAM has been adopted in LLM accelerators. While this emerging technology provides strong performance gains over existing hardware, current 3D-DRAM accelerators (3D-Accelerators) rely on closed-source evaluation tools…

    Submitted 9 April, 2026; originally announced April 2026.

  2. arXiv:2603.24093  [pdf, ps, other]

    cs.LG cs.AI

    Towards Effective Experiential Learning: Dual Guidance for Utilization and Internalization

    Authors: Fei Bai, Zhipeng Chen, Chuan Hao, Ming Yang, Ran Tao, Bryan Dai, Wayne Xin Zhao, Jian Yang, Hongteng Xu

    Abstract: Recently, reinforcement learning (RL) has become an important approach for improving the capabilities of large language models (LLMs). In particular, reinforcement learning from verifiable rewards (RLVR) has emerged as a promising paradigm for reasoning tasks. However, existing RL-based training remains only a rough approximation to human learning. Human learners leverage both external and i…

    Submitted 25 March, 2026; originally announced March 2026.
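    The RLVR paradigm named in this abstract relies on rule-based rewards that can be checked programmatically instead of by a learned reward model. As an illustrative sketch only (the boxed-answer format and the function name `verifiable_reward` are assumptions for illustration, not this paper's implementation):

    ```python
    import re

    def verifiable_reward(completion: str, gold_answer: str) -> float:
        """Rule-based reward: 1.0 if the completion's final \\boxed{...}
        answer exactly matches the reference, else 0.0."""
        match = re.search(r"\\boxed\{([^}]*)\}", completion)
        if match is None:
            return 0.0
        return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

    # No reward model is needed; the rule itself verifies correctness.
    print(verifiable_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
    print(verifiable_reward("no final answer given", "42"))             # 0.0
    ```

    Such rewards are cheap and unambiguous, which is what makes RLVR attractive for reasoning tasks where answers can be checked mechanically.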

  3. arXiv:2603.09095  [pdf, ps, other]

    cs.CL cs.CV

    Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

    Authors: Kaiser Sun, Xiaochuang Yuan, Hongjun Liu, Chen Zhao, Cheng Zhang, Mark Dredze, Fan Bai

    Abstract: Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this "modality gap" by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both synthetically rendered text and realistic document images from arXiv PDFs to Wikipedia pages. We find…

    Submitted 9 March, 2026; originally announced March 2026.

  4. arXiv:2603.04797  [pdf, ps, other]

    cs.AR

    Hardware-Software Co-design for 3D-DRAM-based LLM Serving Accelerator

    Authors: Cong Li, Yihan Yin, Chenhao Xue, Zhao Wang, Fujun Bai, Yixin Guo, Xiping Jiang, Qiang Wu, Yuan Xie, Guangyu Sun

    Abstract: Large language models (LLMs) have been widely deployed for online generative services, where numerous LLM instances jointly handle workloads with fluctuating request arrival rates and variable request lengths. To efficiently execute coexisting compute-intensive and memory-intensive operators, computing paradigms based on near-memory processing (NMP) have been extensively proposed. However, existing NMP…

    Submitted 4 March, 2026; originally announced March 2026.

  5. arXiv:2602.02905  [pdf, ps, other]

    cs.AI

    FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights

    Authors: Zhen Wang, Fan Bai, Zhongyan Luo, Jinyan Su, Kaiser Sun, Xinle Yu, Jieyuan Liu, Kun Zhou, Claire Cardie, Mark Dredze, Eric P. Xing, Zhiting Hu

    Abstract: Autonomous agents powered by large language models (LLMs) promise to accelerate scientific discovery end-to-end, but rigorously evaluating their capacity for verifiable discovery remains a central challenge. Existing benchmarks face a trade-off: they either heavily rely on LLM-as-judge evaluations of automatically generated research outputs or optimize convenient yet isolated performance metrics t…

    Submitted 2 February, 2026; originally announced February 2026.

    Comments: 30 pages, 4 figures, 10 tables

  6. arXiv:2601.11590  [pdf, ps, other]

    cs.DC cs.AI

    EPD-Serve: A Flexible Multimodal EPD Disaggregation Inference Serving System On Ascend

    Authors: Fan Bai, Pai Peng, Zhengzhi Tang, Zhe Wang, Gong Chen, Xiang Lu, Yinuo Li, Huan Lin, Weizhe Lin, Yaoyuan Wang, Xiaosong Li

    Abstract: With the widespread adoption of large multimodal models, efficient inference across text, image, audio, and video modalities has become critical. However, existing multimodal inference systems typically employ monolithic architectures that tightly couple the Encode, Prefill, and Decode stages on homogeneous hardware, neglecting the heterogeneous computational characteristics of each stage. This de…

    Submitted 4 January, 2026; originally announced January 2026.

  7. arXiv:2512.08995  [pdf]

    cs.HC cs.IR

    PoultryTalk: A Multi-modal Retrieval-Augmented Generation (RAG) System for Intelligent Poultry Management and Decision Support

    Authors: Kapalik Khanal, Biswash Khatiwada, Stephen Afrifa, Ranjan Sapkota, Sanjay Shah, Frank Bai, Ramesh Bahadur Bist

    Abstract: The poultry industry plays a vital role in global food security, yet small- and medium-scale farmers frequently lack timely access to expert-level support for disease diagnosis, nutrition planning, and management decisions. With rising climate stress, unpredictable feed prices, and persistent disease threats, poultry producers often struggle to make quick, informed decisions. Therefore, there is a…

    Submitted 8 December, 2025; originally announced December 2025.

  8. arXiv:2510.14703  [pdf, ps, other]

    cs.AI

    ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling

    Authors: Jianghao Lin, Yuanyuan Shi, Xin Peng, Renjie Ding, Hairui Wang, Yuxuan Peng, Bizhe Bai, Weixi Song, Fengshuo Bai, Huacan Chai, Weinan Zhang, Fei Huang, Ying Wen

    Abstract: Large language models (LLMs) are increasingly demonstrating strong capabilities as autonomous agents, with function calling serving as a core mechanism for interaction with the environment. Meanwhile, inference scaling has become a cutting-edge technique to enhance LLM performance by allocating more computational resources during the inference process. However, current research on inference scalin…

    Submitted 16 October, 2025; originally announced October 2025.

  9. arXiv:2510.07043  [pdf, ps, other]

    cs.LG

    COMPASS: A Multi-Turn Benchmark for Tool-Mediated Planning & Preference Optimization

    Authors: Tian Qin, Felix Bai, Ting-Yao Hu, Raviteja Vemulapalli, Hema Swetha Koppula, Zhiyang Xu, Bowen Jin, Mert Cemri, Jiarui Lu, Zirui Wang, Meng Cao

    Abstract: Real-world large language model (LLM) agents must master strategic tool use and user preference optimization through multi-turn interactions to assist users with complex planning tasks. We introduce COMPASS (Constrained Optimization through Multi-turn Planning and Strategic Solutions), a benchmark that evaluates agents on realistic travel-planning scenarios. We cast travel planning as a constraine…

    Submitted 8 October, 2025; originally announced October 2025.

  10. arXiv:2509.23829  [pdf, ps, other]

    cs.RO

    DexFlyWheel: A Scalable and Self-improving Data Generation Framework for Dexterous Manipulation

    Authors: Kefei Zhu, Fengshuo Bai, YuanHao Xiang, Yishuai Cai, Xinglin Chen, Ruochong Li, Xingtao Wang, Hao Dong, Yaodong Yang, Xiaopeng Fan, Yuanpei Chen

    Abstract: Dexterous manipulation is critical for advancing robot capabilities in real-world applications, yet diverse and high-quality datasets remain scarce. Existing data collection methods either rely on human teleoperation, require significant human engineering, or generate data with limited diversity, which restricts their scalability and generalization. In this paper, we introduce DexFlyWheel, a sca…

    Submitted 28 September, 2025; originally announced September 2025.

    Comments: NeurIPS 2025, Spotlight

  11. arXiv:2508.08053  [pdf, ps, other]

    cs.AI

    AdaptFlow: Adaptive Workflow Optimization via Meta-Learning

    Authors: Runchuan Zhu, Bowen Jiang, Lingrui Mei, Fangkai Yang, Lu Wang, Haoxiang Gao, Fengshuo Bai, Pu Zhao, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

    Abstract: Recent advances in large language models (LLMs) have sparked growing interest in agentic workflows, which are structured sequences of LLM invocations intended to solve complex tasks. However, existing approaches often rely on static templates or manually designed workflows, which limit adaptability to diverse tasks and hinder scalability. We propose AdaptFlow, a natural language-based meta-learnin…

    Submitted 11 August, 2025; originally announced August 2025.

  12. arXiv:2508.07534  [pdf, ps, other]

    cs.CL

    From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR

    Authors: Jia Deng, Jie Chen, Zhipeng Chen, Daixuan Cheng, Fei Bai, Beichen Zhang, Yinqian Min, Yanzipeng Gao, Wayne Xin Zhao, Ji-Rong Wen

    Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). Unlike traditional RL approaches, RLVR leverages rule-based feedback to guide LLMs in generating and refining complex reasoning chains -- a process critically dependent on effective exploration strategies. While prior work has demonstrat…

    Submitted 16 August, 2025; v1 submitted 10 August, 2025; originally announced August 2025.

    Comments: 27 pages, 25 figures. arXiv admin note: text overlap with arXiv:2508.02260

  13. arXiv:2507.16337  [pdf, ps, other]

    cs.CV

    One Polyp Identifies All: One-Shot Polyp Segmentation with SAM via Cascaded Priors and Iterative Prompt Evolution

    Authors: Xinyu Mao, Xiaohan Xing, Fei Meng, Jianbang Liu, Fan Bai, Qiang Nie, Max Meng

    Abstract: Polyp segmentation is vital for early colorectal cancer detection, yet traditional fully supervised methods struggle with morphological variability and domain shifts, requiring frequent retraining. Additionally, reliance on large-scale annotations is a major bottleneck due to the time-consuming and error-prone nature of polyp boundary labeling. Recently, vision foundation models like Segment Anyth…

    Submitted 22 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV 2025

  14. arXiv:2507.13575  [pdf, ps, other]

    cs.LG cs.AI

    Apple Intelligence Foundation Language Models: Tech Report 2025

    Authors: Ethan Li, Anders Boesen Lindbo Larsen, Chen Zhang, Xiyou Zhou, Jun Qin, Dian Ang Yap, Narendran Raghavan, Xuankai Chang, Margit Bowler, Eray Yildiz, John Peebles, Hannah Gillis Coleman, Matteo Ronchi, Peter Gray, Keen You, Anthony Spalvieri-Kruse, Ruoming Pang, Reed Li, Yuli Yang, Emad Soroush, Zhiyun Lu, Crystal Xiao, Rong Situ, Jordan Huffaker, David Griffiths , et al. (373 additional authors not shown)

    Abstract: We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: (i) a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and (ii) a scalable server model built on a novel Parallel-Track Mixture-of-Experts (PT-MoE) transform…

    Submitted 27 August, 2025; v1 submitted 17 July, 2025; originally announced July 2025.

  15. arXiv:2507.01925  [pdf, ps, other]

    cs.RO

    A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

    Authors: Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, Zhiquan Qi, Yitao Liang, Yuanpei Chen, Yaodong Yang

    Abstract: The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation have sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: 70 pages, 5 figures

  16. arXiv:2506.06485  [pdf, ps, other]

    cs.CL cs.AI

    Task Matters: Knowledge Requirements Shape LLM Responses to Context-Memory Conflict

    Authors: Kaiser Sun, Fan Bai, Mark Dredze

    Abstract: Large language models (LLMs) draw on both contextual information and parametric memory, yet these sources can conflict. Prior studies have largely examined this issue in contextual question answering, implicitly assuming that tasks should rely on the provided context, leaving unclear how LLMs behave when tasks require different types and degrees of knowledge utilization. We address this gap with a…

    Submitted 6 January, 2026; v1 submitted 6 June, 2025; originally announced June 2025.

    Comments: Major revision

  17. arXiv:2505.24480  [pdf, other]

    cs.CL cs.AI cs.LG

    Towards Effective Code-Integrated Reasoning

    Authors: Fei Bai, Yingqian Min, Beichen Zhang, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen

    Abstract: In this paper, we investigate code-integrated reasoning, where models generate code when necessary and integrate feedback by executing it through a code interpreter. To acquire this capability, models must learn when and how to use external code tools effectively, which is supported by tool-augmented reinforcement learning (RL) through interactive learning. Despite its benefits, tool-augmented RL…

    Submitted 30 May, 2025; originally announced May 2025.

    Comments: Technical Report on Slow Thinking with LLMs: Code-Integrated Reasoning

  18. arXiv:2505.23722  [pdf, ps, other]

    cs.CL

    LLMs are Better Than You Think: Label-Guided In-Context Learning for Named Entity Recognition

    Authors: Fan Bai, Hamid Hassanzadeh, Ardavan Saeedi, Mark Dredze

    Abstract: In-context learning (ICL) enables large language models (LLMs) to perform new tasks using only a few demonstrations. However, in Named Entity Recognition (NER), existing ICL methods typically rely on task-agnostic semantic similarity for demonstration retrieval, which often yields less relevant examples and leads to inferior results. We introduce DEER, a training-free ICL approach that enables LLM…

    Submitted 29 October, 2025; v1 submitted 29 May, 2025; originally announced May 2025.

    Comments: Accepted to EMNLP 2025
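    The baseline this abstract critiques, retrieving ICL demonstrations by task-agnostic semantic similarity, can be sketched minimally. The bag-of-words cosine below stands in for a real sentence embedder, and the pool contents are invented for illustration; neither reflects DEER itself:

    ```python
    from collections import Counter
    from math import sqrt

    def cosine(a: Counter, b: Counter) -> float:
        """Cosine similarity between two term-count vectors."""
        dot = sum(a[t] * b[t] for t in a)
        na = sqrt(sum(v * v for v in a.values()))
        nb = sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def retrieve_demos(query: str, pool: list[tuple[str, str]], k: int = 2):
        """Rank labeled (sentence, entities) pairs by surface similarity
        to the query, ignoring the label information entirely."""
        q = Counter(query.lower().split())
        ranked = sorted(pool,
                        key=lambda ex: cosine(q, Counter(ex[0].lower().split())),
                        reverse=True)
        return ranked[:k]

    pool = [
        ("Bai works at Johns Hopkins University", "PER: Bai; ORG: Johns Hopkins University"),
        ("The model was trained on arXiv data", "ORG: arXiv"),
        ("Fan Bai joined Georgia Tech", "PER: Fan Bai; ORG: Georgia Tech"),
    ]
    demos = retrieve_demos("Where does Bai work", pool)
    ```

    Because only surface similarity is used, the retrieved demonstrations may share words with the query yet carry entity types irrelevant to it, which is the weakness label-guided retrieval aims to fix.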

  19. arXiv:2505.19558  [pdf, ps, other]

    cs.CY cs.LG

    PoliCon: Evaluating LLMs on Achieving Diverse Political Consensus Objectives

    Authors: Zhaowei Zhang, Xiaobo Wang, Minghua Yi, Mengmeng Wang, Fengshuo Bai, Zilong Zheng, Yipeng Kang, Yaodong Yang

    Abstract: Achieving political consensus is crucial yet challenging for the effective functioning of social governance. However, although frontier AI systems represented by large language models (LLMs) have developed rapidly in recent years, their capabilities in this scope are still understudied. In this paper, we introduce PoliCon, a novel benchmark constructed from 2,225 high-quality deliberation records…

    Submitted 13 February, 2026; v1 submitted 26 May, 2025; originally announced May 2025.

    Comments: Accepted by ICLR 2026

    ACM Class: K.4.1; K.4.3; I.2.7

  20. arXiv:2505.16834  [pdf, ps, other]

    cs.CL cs.AI cs.IR

    SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis

    Authors: Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, Lei Fang, Zhongyuan Wang, Ji-Rong Wen

    Abstract: Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios requiring multi-step reasoning and iterative information retrieval. However, existing approaches face critical limitations: they either lack high-quality training trajectories or suffer from distributional mismatches in simulated environments and prohibitive computational costs for…

    Submitted 8 October, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

  21. arXiv:2505.15859  [pdf, ps, other]

    cs.IR cs.AI

    AutoData: A Multi-Agent System for Open Web Data Collection

    Authors: Tianyi Ma, Yiyue Qian, Zheyuan Zhang, Zehong Wang, Xiaoye Qian, Feifan Bai, Yifan Ding, Xuwei Luo, Shinan Zhang, Keerthiram Murugesan, Chuxu Zhang, Yanfang Ye

    Abstract: The exponential growth of data-driven systems and AI technologies has intensified the demand for high-quality web-sourced datasets. While existing datasets have proven valuable, conventional web data collection approaches face significant limitations in terms of human effort and scalability. Current data-collecting solutions fall into two categories: wrapper-based methods that struggle with adapta…

    Submitted 21 May, 2025; originally announced May 2025.

  22. arXiv:2505.13403  [pdf, ps, other]

    cs.CL

    MR. Judge: Multimodal Reasoner as a Judge

    Authors: Renjie Pi, Felix Bai, Qibin Chen, Simon Wang, Jiulong Shan, Kieran Liu, Meng Cao

    Abstract: The paradigm of using Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) as evaluative judges has emerged as an effective approach in RLHF and inference-time scaling. In this work, we propose Multimodal Reasoner as a Judge (MR. Judge), a paradigm for empowering general-purpose MLLMs judges with strong reasoning capabilities. Instead of directly assigning scores for each resp…

    Submitted 19 May, 2025; originally announced May 2025.

  23. arXiv:2502.19148  [pdf, ps, other]

    cs.CL cs.LG

    Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs

    Authors: Zhaowei Zhang, Fengshuo Bai, Qizhi Chen, Chengdong Ma, Mingzhi Wang, Haoran Sun, Zilong Zheng, Yaodong Yang

    Abstract: How to align large language models (LLMs) with user preferences from a static general dataset has been frequently studied. However, user preferences are usually personalized, changing, and diverse regarding culture, values, or time. This leads to the problem that the actual user preferences often do not coincide with those trained by the model developers in the practical use of LLMs. Since we cann…

    Submitted 12 June, 2025; v1 submitted 26 February, 2025; originally announced February 2025.

    Comments: Accepted by ICLR 2025, Project page: https://zowiezhang.github.io/projects/Amulet

    ACM Class: I.2.7

  24. arXiv:2502.18423  [pdf, other]

    cs.RO

    Retrieval Dexterity: Efficient Object Retrieval in Clutters with Dexterous Hand

    Authors: Fengshuo Bai, Yu Li, Jie Chu, Tawei Chou, Runchuan Zhu, Ying Wen, Yaodong Yang, Yuanpei Chen

    Abstract: Retrieving objects buried beneath multiple objects is not only challenging but also time-consuming. Performing manipulation in such environments presents significant difficulty due to complex contact relationships. Existing methods typically address this task by sequentially grasping and removing each occluding object, resulting in lengthy execution times and requiring impractical grasping capabil…

    Submitted 26 February, 2025; v1 submitted 25 February, 2025; originally announced February 2025.

  25. arXiv:2502.05911  [pdf, other]

    cs.CL

    GRAIT: Gradient-Driven Refusal-Aware Instruction Tuning for Effective Hallucination Mitigation

    Authors: Runchuan Zhu, Zinco Jiang, Jiang Wu, Zhipeng Ma, Jiahe Song, Fengshuo Bai, Dahua Lin, Lijun Wu, Conghui He

    Abstract: Refusal-Aware Instruction Tuning (RAIT) aims to enhance Large Language Models (LLMs) by improving their ability to refuse responses to questions beyond their knowledge, thereby reducing hallucinations and improving reliability. Effective RAIT must address two key challenges: first, effectively rejecting unknown questions to minimize hallucinations; second, avoiding over-refusal to ensure questions t…

    Submitted 9 February, 2025; originally announced February 2025.

    Comments: Equal contribution: Runchuan Zhu, Zinco Jiang, Jiang Wu; Corresponding author: Conghui He

  26. arXiv:2501.00913  [pdf, ps, other]

    cs.LG cs.AI

    $β$-DQN: Improving Deep Q-Learning By Evolving the Behavior

    Authors: Hongming Zhang, Fengshuo Bai, Chenjun Xiao, Chao Gao, Bo Xu, Martin Müller

    Abstract: While many sophisticated exploration methods have been proposed, their lack of generality and high computational cost often lead researchers to favor simpler methods like $ε$-greedy. Motivated by this, we introduce $β$-DQN, a simple and efficient exploration method that augments the standard DQN with a behavior function $β$. This function estimates the probability that each action has been taken a…

    Submitted 28 October, 2025; v1 submitted 1 January, 2025; originally announced January 2025.

    Comments: AAMAS 2025
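    The abstract describes a behavior function $β$ that estimates how often each action has been taken. As a hedged sketch of that general idea only (the class name, count-based estimate, and least-taken-action rule below are illustrative assumptions, not the paper's actual algorithm), exploration can be biased toward under-tried actions instead of uniform $ε$-greedy noise:

    ```python
    import random
    from collections import defaultdict

    class BehaviorGuidedExplorer:
        """Illustrative sketch: beta(s, a) is the empirical fraction of
        visits to state s in which action a was taken; exploration steps
        prefer the least-taken action rather than a uniform random one."""

        def __init__(self, n_actions: int, epsilon: float = 0.1):
            self.n_actions = n_actions
            self.epsilon = epsilon
            self.counts = defaultdict(lambda: [0] * n_actions)

        def beta(self, state):
            c = self.counts[state]
            total = sum(c)
            if total == 0:
                return [1.0 / self.n_actions] * self.n_actions
            return [x / total for x in c]

        def act(self, state, q_values):
            if random.random() < self.epsilon:
                b = self.beta(state)
                action = min(range(self.n_actions), key=lambda a: b[a])
            else:
                action = max(range(self.n_actions), key=lambda a: q_values[a])
            self.counts[state][action] += 1
            return action
    ```

    With `epsilon=0` this reduces to plain greedy action selection; raising `epsilon` spends a fraction of steps on the actions least explored so far.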

  27. arXiv:2412.10713  [pdf, other]

    cs.LG cs.AI cs.CR cs.RO

    RAT: Adversarial Attacks on Deep Reinforcement Agents for Targeted Behaviors

    Authors: Fengshuo Bai, Runze Liu, Yali Du, Ying Wen, Yaodong Yang

    Abstract: Evaluating deep reinforcement learning (DRL) agents against targeted behavior attacks is critical for assessing their robustness. These attacks aim to manipulate the victim into specific behaviors that align with the attacker's objectives, often bypassing traditional reward-based defenses. Prior methods have primarily focused on reducing cumulative rewards; however, rewards are typically too gener…

    Submitted 14 December, 2024; originally announced December 2024.

    Comments: Accepted by AAAI 2025

  28. arXiv:2412.04573  [pdf, other]

    cs.CL

    Give me Some Hard Questions: Synthetic Data Generation for Clinical QA

    Authors: Fan Bai, Keith Harrigian, Joel Stremmel, Hamid Hassanzadeh, Ardavan Saeedi, Mark Dredze

    Abstract: Clinical Question Answering (QA) systems enable doctors to quickly access patient information from electronic health records (EHRs). However, training these systems requires significant annotated data, which is limited due to the expertise needed and the privacy concerns associated with clinical data. This paper explores generating Clinical QA data using large language models (LLMs) in a zero-shot…

    Submitted 5 December, 2024; originally announced December 2024.

    Comments: Accepted to ML4H 2024 Findings

  29. arXiv:2411.03670  [pdf, other]

    cs.CV cs.AI

    Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?

    Authors: Pedro R. A. S. Bassi, Wenxuan Li, Yucheng Tang, Fabian Isensee, Zifu Wang, Jieneng Chen, Yu-Cheng Chou, Yannick Kirchhoff, Maximilian Rokuss, Ziyan Huang, Jin Ye, Junjun He, Tassilo Wald, Constantin Ulrich, Michael Baumgartner, Saikat Roy, Klaus H. Maier-Hein, Paul Jaeger, Yiwen Ye, Yutong Xie, Jianpeng Zhang, Ziyang Chen, Yong Xia, Zhaohu Xing, Lei Zhu , et al. (28 additional authors not shown)

    Abstract: How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone…

    Submitted 19 January, 2025; v1 submitted 6 November, 2024; originally announced November 2024.

    Comments: Accepted to NeurIPS 2024

  30. arXiv:2408.04682  [pdf, other]

    cs.CL cs.AI cs.LG

    ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

    Authors: Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, Ruoming Pang

    Abstract: Recent advancements in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs solving real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused on either evaluating over stateless web services (RESTful API), based on a single-turn user prompt, or an off-policy dialog trajectory, ToolSandbox includes stateful…

    Submitted 16 April, 2025; v1 submitted 8 August, 2024; originally announced August 2024.
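    The distinction this abstract draws, stateful tools versus stateless REST calls, can be illustrated with a toy sandbox. The class, tool names, and state schema below are invented for illustration and do not reflect ToolSandbox's actual API:

    ```python
    class StatefulSandbox:
        """Toy stateful tool environment: tool calls mutate shared world
        state, so a benchmark can grade the end state of a multi-turn
        trajectory rather than each response in isolation."""

        def __init__(self):
            self.state = {"wifi": False, "contacts": {}}

        def call(self, tool: str, **args):
            # Each call reads and writes the persistent state.
            if tool == "toggle_wifi":
                self.state["wifi"] = not self.state["wifi"]
                return self.state["wifi"]
            if tool == "add_contact":
                self.state["contacts"][args["name"]] = args["phone"]
                return f"saved {args['name']}"
            raise ValueError(f"unknown tool: {tool}")

    # A multi-turn trajectory whose later steps depend on earlier ones.
    sandbox = StatefulSandbox()
    sandbox.call("toggle_wifi")
    sandbox.call("add_contact", name="Ada", phone="555-0100")
    ```

    A stateless RESTful evaluation would score each call independently; here, correctness can only be judged by inspecting `sandbox.state` after the whole conversation.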

  31. arXiv:2407.21075  [pdf, other]

    cs.AI cs.CL cs.LG

    Apple Intelligence Foundation Language Models

    Authors: Tom Gunter, Zirui Wang, Chong Wang, Ruoming Pang, Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chung-Cheng Chiu, David Qiu, Deepak Gopinath, Dian Ang Yap, Dong Yin, Feng Nan, Floris Weers, Guoli Yin, Haoshuo Huang, Jianyu Wang, Jiarui Lu, John Peebles, Ke Ye, Mark Lee, Nan Du, Qibin Chen, Quentin Keunebroek , et al. (130 additional authors not shown)

    Abstract: We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used…

    Submitted 29 July, 2024; originally announced July 2024.

  32. arXiv:2407.13292  [pdf, other]

    cs.SD cs.CL eess.AS

    Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training

    Authors: Lukuan Dong, Donghong Qin, Fengbo Bai, Fanhua Song, Yan Liu, Chen Xu, Zhijian Ou

    Abstract: The mainstream automatic speech recognition (ASR) technology usually requires hundreds to thousands of hours of annotated speech data. Three approaches to low-resourced ASR are phoneme-based supervised pre-training, subword-based supervised pre-training, and self-supervised pre-training over multilingual data. The Iu Mien language is the main ethnic language of the Yao ethnic group in China and is low-resourced in the sense…

    Submitted 16 September, 2024; v1 submitted 18 July, 2024; originally announced July 2024.

    Comments: Accepted into ISCSLP 2024

  33. arXiv:2405.18718  [pdf, other]

    cs.CL

    Efficient Model-agnostic Alignment via Bayesian Persuasion

    Authors: Fengshuo Bai, Mingzhi Wang, Zhaowei Zhang, Boyuan Chen, Yinda Xu, Ying Wen, Yaodong Yang

    Abstract: With recent advancements in large language models (LLMs), alignment has emerged as an effective technique for keeping LLMs in consensus with human intent. Current methods primarily involve direct training through Supervised Fine-tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), both of which require substantial computational resources and extensive ground truth data. This paper explo…

    Submitted 28 May, 2024; originally announced May 2024.

  34. arXiv:2405.18688  [pdf, other]

    cs.LG cs.AI cs.CL

    Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation

    Authors: Fengshuo Bai, Rui Zhao, Hongming Zhang, Sijia Cui, Ying Wen, Yaodong Yang, Bo Xu, Lei Han

    Abstract: Preference-based reinforcement learning (PbRL) has shown impressive capabilities in training agents without reward engineering. However, a notable limitation of PbRL is its dependency on substantial human feedback. This dependency stems from the learning loop, which entails accurate reward learning compounded with value/policy learning, necessitating a considerable number of samples. To boost the…

    Submitted 28 May, 2024; originally announced May 2024.

  35. arXiv:2404.00578  [pdf, other]

    cs.CV

    M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models

    Authors: Fan Bai, Yuxin Du, Tiejun Huang, Max Q. -H. Meng, Bo Zhao

    Abstract: Medical image analysis is essential to clinical diagnosis and treatment, which is increasingly supported by multi-modal large language models (MLLMs). However, previous research has primarily focused on 2D medical images, leaving 3D images under-explored, despite their richer spatial information. This paper aims to advance 3D medical image analysis with MLLMs. To this end, we present a large-scale…

    Submitted 31 March, 2024; originally announced April 2024.

    Comments: MLLM, 3D medical image analysis

  36. arXiv:2402.12907  [pdf, ps, other]

    cs.AI cs.CY cs.GT cs.HC

    Roadmap on Incentive Compatibility for AI Alignment and Governance in Sociotechnical Systems

    Authors: Zhaowei Zhang, Fengshuo Bai, Mingzhi Wang, Haoyang Ye, Chengdong Ma, Yaodong Yang

    Abstract: The burgeoning integration of artificial intelligence (AI) into human society brings forth significant implications for societal governance and safety. While considerable strides have been made in addressing AI alignment challenges, existing methodologies primarily focus on technical facets, often neglecting the intricate sociotechnical nature of AI systems, which can lead to a misalignment betwee…

    Submitted 16 June, 2025; v1 submitted 20 February, 2024; originally announced February 2024.

    ACM Class: I.2.m; K.4.m

  37. arXiv:2311.15111  [pdf, other]

    cs.CV

    UAE: Universal Anatomical Embedding on Multi-modality Medical Images

    Authors: Xiaoyu Bai, Fan Bai, Xiaofei Huo, Jia Ge, Jingjing Lu, Xianghua Ye, Ke Yan, Yong Xia

    Abstract: Identifying specific anatomical structures (e.g., lesions or landmarks) in medical images plays a fundamental role in medical image analysis. Exemplar-based landmark detection methods are receiving increasing attention since they can detect arbitrary anatomical points in inference while requiring no landmark annotations in training. They use self-supervised learning to acquire a discrimina…

    Submitted 18 January, 2024; v1 submitted 25 November, 2023; originally announced November 2023.

  38. arXiv:2311.13385  [pdf, other]

    cs.CV

    SegVol: Universal and Interactive Volumetric Medical Image Segmentation

    Authors: Yuxin Du, Fan Bai, Tiejun Huang, Bo Zhao

    Abstract: Precise image segmentation provides clinical study with instructive information. Despite the remarkable progress achieved in medical image segmentation, there is still an absence of a 3D foundation segmentation model that can segment a wide range of anatomical categories with easy user interaction. In this paper, we propose a 3D foundation segmentation model, named SegVol, supporting universal and…

    Submitted 11 February, 2025; v1 submitted 22 November, 2023; originally announced November 2023.

    Comments: NeurIPS 2024 Spotlight

  39. KernelGPA: A Globally Optimal Solution to Deformable SLAM in Closed-form

    Authors: Fang Bai, Kanzhi Wu, Adrien Bartoli

    Abstract: We study the generalized Procrustes analysis (GPA), as a minimal formulation to the simultaneous localization and mapping (SLAM) problem. We propose KernelGPA, a novel global registration technique to solve SLAM in the deformable environment. We propose the concept of deformable transformation which encodes the entangled pose and deformation. We define deformable transformations using a kernel met…

    Submitted 28 October, 2023; originally announced October 2023.

    Comments: This paper has been accepted for publication in the International Journal of Robotics Research, 2023. https://doi.org/10.1177/02783649231195380

    MSC Class: 68U05 ACM Class: G.1; I.3; I.4

    Journal ref: International Journal of Robotics Research, 2023

  40. arXiv:2310.16427  [pdf, other]

    cs.CL

    PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization

    Authors: Xinyuan Wang, Chenxi Li, Zhen Wang, Fan Bai, Haotian Luo, Jiayou Zhang, Nebojsa Jojic, Eric P. Xing, Zhiting Hu

    Abstract: Highly effective, task-specific prompts are often heavily engineered by experts to integrate detailed instructions and domain insights based on a deep understanding of both instincts of large language models (LLMs) and the intricacies of the target task. However, automating the generation of such expert-level prompts remains elusive. Existing prompt optimization methods tend to overlook the depth… ▽ More

    Submitted 7 December, 2023; v1 submitted 25 October, 2023; originally announced October 2023.

    Comments: 34 pages, 10 figures

  41. arXiv:2310.00378  [pdf, other

    cs.CL cs.AI cs.CY

    ValueDCG: Measuring Comprehensive Human Value Understanding Ability of Language Models

    Authors: Zhaowei Zhang, Fengshuo Bai, Jun Gao, Yaodong Yang

    Abstract: Personal values are a crucial factor behind human decision-making. Considering that Large Language Models (LLMs) have been shown to impact human decisions significantly, it is essential to make sure they accurately understand human values to ensure their safety. However, evaluating their grasp of these values is complex due to the intricate and adaptable nature of values. We argue that truly underst… ▽ More

    Submitted 17 June, 2024; v1 submitted 30 September, 2023; originally announced October 2023.

    ACM Class: I.2.m; K.4.m

  42. arXiv:2308.15056  [pdf

    cs.CV

    A Consumer-tier based Visual-Brain Machine Interface for Augmented Reality Glasses Interactions

    Authors: Yuying Jiang, Fan Bai, Zicheng Zhang, Xiaochen Ye, Zheng Liu, Zhiping Shi, Jianwei Yao, Xiaojun Liu, Fangkun Zhu, Junling Li, Qian Guo, Xiaoan Wang, Junwen Luo

    Abstract: Objective. Visual-Brain Machine Interface (V-BMI) has provided a novel interaction technique for Augmented Reality (AR) industries. Several state-of-the-art works have demonstrated its high accuracy and real-time interaction capabilities. However, most of the studies employ EEG devices that are rigid and difficult to apply in real-life AR glasses application scenarios. Here we develop a consumer-tier Vi… ▽ More

    Submitted 29 August, 2023; originally announced August 2023.

    Comments: 15 pages, 10 figures

  43. arXiv:2308.05137  [pdf, other

    cs.CV

    Discrepancy-based Active Learning for Weakly Supervised Bleeding Segmentation in Wireless Capsule Endoscopy Images

    Authors: Fan Bai, Xiaohan Xing, Yutian Shen, Han Ma, Max Q. -H. Meng

    Abstract: Weakly supervised methods, such as those based on class activation maps (CAM), have been applied to achieve bleeding segmentation with low annotation effort in Wireless Capsule Endoscopy (WCE) images. However, the CAM labels tend to be extremely noisy, and there is an irreparable gap between CAM labels and ground truths for medical images. This paper proposes a new Discrepancy-basEd Active Learning (DEAL)… ▽ More

    Submitted 9 August, 2023; originally announced August 2023.

    Comments: accepted by MICCAI 2022

  44. arXiv:2308.04911  [pdf, other

    cs.CV cs.AI

    SLPT: Selective Labeling Meets Prompt Tuning on Label-Limited Lesion Segmentation

    Authors: Fan Bai, Ke Yan, Xiaoyu Bai, Xinyu Mao, Xiaoli Yin, Jingren Zhou, Yu Shi, Le Lu, Max Q. -H. Meng

    Abstract: Medical image analysis using deep learning is often challenged by limited labeled data and high annotation costs. Fine-tuning the entire network in label-limited scenarios can lead to overfitting and suboptimal performance. Recently, prompt tuning has emerged as a more promising technique that introduces a few additional tunable parameters as prompts to a task-agnostic pre-trained model, and updat… ▽ More

    Submitted 9 August, 2023; originally announced August 2023.

    Comments: accepted by MICCAI 2023

  45. arXiv:2307.03535  [pdf, other

    cs.CV

    Matching in the Wild: Learning Anatomical Embeddings for Multi-Modality Images

    Authors: Xiaoyu Bai, Fan Bai, Xiaofei Huo, Jia Ge, Tony C. W. Mok, Zi Li, Minfeng Xu, Jingren Zhou, Le Lu, Dakai Jin, Xianghua Ye, Jingjing Lu, Ke Yan

    Abstract: Radiotherapists require accurate registration of MR/CT images to effectively use information from both modalities. In a typical registration pipeline, rigid or affine transformations are applied to roughly align the fixed and moving images before proceeding with the deformation step. While recent learning-based methods have shown promising results in the rigid/affine step, these methods often requ… ▽ More

    Submitted 7 July, 2023; originally announced July 2023.

  46. arXiv:2306.03615  [pdf, other

    cs.LG

    PEARL: Zero-shot Cross-task Preference Alignment and Robust Reward Learning for Robotic Manipulation

    Authors: Runze Liu, Yali Du, Fengshuo Bai, Jiafei Lyu, Xiu Li

    Abstract: In preference-based Reinforcement Learning (RL), obtaining a large number of preference labels is both time-consuming and costly. Furthermore, the queried human preferences cannot be reused for new tasks. In this paper, we propose Zero-shot Cross-task Preference Alignment and Robust Reward Learning (PEARL), which learns policies from cross-task preference transfer without any human labels o… ▽ More

    Submitted 5 June, 2024; v1 submitted 6 June, 2023; originally announced June 2023.

    Comments: Accepted to ICML 2024

  47. arXiv:2305.14336  [pdf, other

    cs.CL

    Schema-Driven Information Extraction from Heterogeneous Tables

    Authors: Fan Bai, Junmo Kang, Gabriel Stanovsky, Dayne Freitag, Mark Dredze, Alan Ritter

    Abstract: In this paper, we explore the question of whether large language models can support cost-efficient information extraction from tables. We introduce schema-driven information extraction, a new task that transforms tabular data into structured records following a human-authored schema. To assess various LLMs' capabilities on this task, we present a benchmark comprising tables from four diverse dom… ▽ More

    Submitted 20 November, 2024; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Accepted to EMNLP 2024 Findings

  48. arXiv:2211.03885  [pdf, other

    cs.CV eess.IV

    Learned Smartphone ISP on Mobile GPUs with Deep Learning, Mobile AI & AIM 2022 Challenge: Report

    Authors: Andrey Ignatov, Radu Timofte, Shuai Liu, Chaoyu Feng, Furui Bai, Xiaotao Wang, Lei Lei, Ziyao Yi, Yan Xiang, Zibin Liu, Shaoqing Li, Keming Shi, Dehui Kong, Ke Xu, Minsu Kwon, Yaqi Wu, Jiesi Zheng, Zhihao Fan, Xun Wu, Feng Zhang, Albert No, Minhyeok Cho, Zewen Chen, Xiaze Zhang, Ran Li , et al. (13 additional authors not shown)

    Abstract: The role of mobile cameras has increased dramatically over the past few years, leading to more and more research in automatic image quality enhancement and RAW photo processing. In this Mobile AI challenge, the target was to develop an efficient end-to-end AI-based image signal processing (ISP) pipeline that can run on modern smartphone GPUs using TensorFlow Lite, replacing the standard mobile ISPs. Th… ▽ More

    Submitted 7 November, 2022; originally announced November 2022.

  49. arXiv:2210.11153  [pdf, other

    eess.IV cs.CV

    Reversed Image Signal Processing and RAW Reconstruction. AIM 2022 Challenge Report

    Authors: Marcos V. Conde, Radu Timofte, Yibin Huang, Jingyang Peng, Chang Chen, Cheng Li, Eduardo Pérez-Pellitero, Fenglong Song, Furui Bai, Shuai Liu, Chaoyu Feng, Xiaotao Wang, Lei Lei, Yu Zhu, Chenghua Li, Yingying Jiang, Yong A, Peisong Wang, Cong Leng, Jian Cheng, Xiaoyu Liu, Zhicun Yin, Zhilu Zhang, Junyi Li, Ming Liu , et al. (18 additional authors not shown)

    Abstract: Cameras capture sensor RAW images and transform them into pleasant RGB images, suitable for the human eyes, using their integrated Image Signal Processor (ISP). Numerous low-level vision tasks operate in the RAW domain (e.g. image denoising, white balance) due to its linear relationship with the scene irradiance, wide range of information at 12 bits, and sensor designs. Despite this, RAW image data… ▽ More

    Submitted 20 October, 2022; originally announced October 2022.

    Comments: ECCV 2022 Advances in Image Manipulation (AIM) workshop

  50. arXiv:2210.08997  [pdf, other

    cs.CV cs.LG eess.IV

    AIM 2022 Challenge on Instagram Filter Removal: Methods and Results

    Authors: Furkan Kınlı, Sami Menteş, Barış Özcan, Furkan Kıraç, Radu Timofte, Yi Zuo, Zitao Wang, Xiaowen Zhang, Yu Zhu, Chenghua Li, Cong Leng, Jian Cheng, Shuai Liu, Chaoyu Feng, Furui Bai, Xiaotao Wang, Lei Lei, Tianzhi Ma, Zihan Gao, Wenxin He, Woon-Ha Yeo, Wang-Taek Oh, Young-Il Kim, Han-Cheol Ryu, Gang He , et al. (8 additional authors not shown)

    Abstract: This paper introduces the methods and the results of AIM 2022 challenge on Instagram Filter Removal. Social media filters transform the images by consecutive non-linear operations, and the feature maps of the original content may be interpolated into a different domain. This reduces the overall performance of the recent deep learning strategies. The main goal of this challenge is to produce realis… ▽ More

    Submitted 17 October, 2022; originally announced October 2022.

    Comments: 14 pages, 9 figures, Challenge report of AIM 2022 Instagram Filter Removal Challenge in conjunction with ECCV 2022