
Showing 1–50 of 55 results for author: Ouyang, L

Searching in archive cs.
  1. arXiv:2604.04771  [pdf, ps, other]

    cs.CV cs.CL

    MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

    Authors: Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Tao Chu, Yuan Qu, Zhenjiang Jin, Weijun Zeng, Ziyang Miao, Bangrui Xu, Junbo Niu, Mengzhang Cai, Jiantao Qiu, Qintong Zhang, Dongsheng Ma, Yuefeng Sun, Hejun Dong, Wenzheng Zhang, Jutao Xiao, Jiayong Shi, Pengyu Liao, Xiaomeng Zhao, Huaping Zhong, Liqun Wei , et al. (18 additional authors not shown)

    Abstract: Current document parsing methods advance primarily through model architecture innovation, while systematic engineering of training data remains underexplored. Yet state-of-the-art models spanning diverse architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training…

    Submitted 9 April, 2026; v1 submitted 6 April, 2026; originally announced April 2026.

    Comments: Technical Report

  2. arXiv:2603.25040  [pdf, ps, other]

    cs.LG cs.CL cs.CV

    Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

    Authors: Yicheng Zou, Dongsheng Zhu, Lin Zhu, Tong Zhu, Yunhua Zhou, Peiheng Zhou, Xinyu Zhou, Dongzhan Zhou, Zhiwang Zhou, Yuhao Zhou, Bowen Zhou, Zhanping Zhong, Zhijie Zhong, Haiteng Zhao, Penghao Zhao, Xiaomeng Zhao, Zhiyuan Zhao, Yechen Zhang, Jin Zhang, Wenwei Zhang, Hongjie Zhang, Zhuo Zhang, Wenlong Zhang, Bo Zhang, Chao Zhang , et al. (152 additional authors not shown)

    Abstract: We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancement across both general and scientific domains. Beyond stronger reasoning and image-text understanding capabilities, its intelligence is augmented with advanced agent capabilities. Simultaneously, its scientific expertis…

    Submitted 2 April, 2026; v1 submitted 26 March, 2026; originally announced March 2026.

  3. arXiv:2602.20636  [pdf, ps, other]

    cs.CV cs.AI

    SurgAtt-Tracker: Online Surgical Attention Tracking via Temporal Proposal Reranking and Motion-Aware Refinement

    Authors: Rulin Zhou, Guankun Wang, An Wang, Yujie Ma, Lixin Ouyang, Bolin Cui, Junyan Li, Chaowei Zhu, Mingyang Li, Ming Chen, Xiaopin Zhong, Peng Lu, Jiankun Wang, Xianming Liu, Hongliang Ren

    Abstract: Accurate and stable field-of-view (FoV) guidance is critical for safe and efficient minimally invasive surgery, yet existing approaches often conflate visual attention estimation with downstream camera control or rely on direct object-centric assumptions. In this work, we formulate surgical attention tracking as a spatio-temporal learning problem and model surgeon focus as a dense attention heatma…

    Submitted 24 February, 2026; originally announced February 2026.

  4. arXiv:2602.13307  [pdf, ps, other]

    cs.NI

    Cooperative Edge Caching with Large Language Model in Wireless Networks

    Authors: Ning Yang, Wentao Wang, Lingtao Ouyang, Haijun Zhang

    Abstract: Cooperative edge caching in overlapping zones couples Base Station (BS) decisions, making content replacement sensitive to spatial topology and temporal reuse. Conventional heuristics suffer from myopia, while Deep Reinforcement Learning relies on brittle numerical representations and needs prohibitive retraining under topological or traffic dynamics. This paper studies a centralized, cooperative…

    Submitted 2 April, 2026; v1 submitted 9 February, 2026; originally announced February 2026.

  5. arXiv:2602.04705  [pdf, ps, other]

    cs.CL

    ERNIE 5.0 Technical Report

    Authors: Haifeng Wang, Hua Wu, Tian Wu, Yu Sun, Jing Liu, Dianhai Yu, Yanjun Ma, Jingzhou He, Zhongjun He, Dou Hong, Qiwen Liu, Shuohuan Wang, Junyuan Shang, Zhenyu Zhang, Yuchen Ding, Jinle Zeng, Jiabin Yang, Liang Shen, Ruibiao Chen, Weichong Yin, Siyu Ding, Dai Dai, Shikun Feng, Siqi Bao, Bolei He , et al. (413 additional authors not shown)

    Abstract: In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practi…

    Submitted 4 February, 2026; originally announced February 2026.

  6. arXiv:2602.01850  [pdf, ps, other]

    cs.CV

    WS-IMUBench: Can Weakly Supervised Methods from Audio, Image, and Video Be Adapted for IMU-based Temporal Action Localization?

    Authors: Pei Li, Jiaxi Yin, Lei Ouyang, Shihan Pan, Ge Wang, Han Ding, Fei Wang

    Abstract: IMU-based Human Activity Recognition (HAR) has enabled a wide range of ubiquitous computing applications, yet its dominant clip classification paradigm cannot capture the rich temporal structure of real-world behaviors. This motivates a shift toward IMU Temporal Action Localization (IMU-TAL), which predicts both action categories and their start/end times in continuous streams. However, current pr…

    Submitted 2 February, 2026; originally announced February 2026.

    Comments: Under Review. 28 pages, 9 figures, 6 tables

  7. arXiv:2512.10619  [pdf, ps, other]

    cs.CV

    DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM

    Authors: Qintong Zhang, Junyuan Zhang, Zhifei Ren, Linke Ouyang, Zichen Wen, Junbo Niu, Yuan Qu, Bin Wang, Ka-Ho Chow, Conghui He, Wentao Zhang

    Abstract: Document parsing aims to transform unstructured PDF images into semi-structured data, facilitating the digitization and utilization of information in diverse domains. While vision language models (VLMs) have significantly advanced this task, achieving reliable, high-quality parsing in real-world scenarios remains challenging. Common practice often selects the top-performing model on standard bench…

    Submitted 11 December, 2025; originally announced December 2025.

  8. arXiv:2511.18127  [pdf, ps, other]

    cs.CV

    SFHand: A Streaming Framework for Language-guided 3D Hand Forecasting and Embodied Manipulation

    Authors: Ruicong Liu, Yifei Huang, Liangyang Ouyang, Caixin Kang, Yoichi Sato

    Abstract: Real-time 3D hand forecasting is a critical component for fluid human-computer interaction in applications like AR and assistive robotics. However, existing methods are ill-suited for these scenarios, as they typically require offline access to accumulated video sequences and cannot incorporate language guidance that conveys task intent. To overcome these limitations, we introduce SFHand, the firs…

    Submitted 22 November, 2025; originally announced November 2025.

  9. arXiv:2511.17952  [pdf, ps, other]

    cs.CV

    Multi-speaker Attention Alignment for Multimodal Social Interaction

    Authors: Liangyang Ouyang, Yifei Huang, Mingfang Zhang, Caixin Kang, Ryosuke Furuta, Yoichi Sato

    Abstract: Understanding social interaction in video requires reasoning over a dynamic interplay of verbal and non-verbal cues: who is speaking, to whom, and with what gaze or gestures. While Multimodal Large Language Models (MLLMs) are natural candidates, simply adding visual inputs yields surprisingly inconsistent gains on social tasks. Our quantitative analysis of cross-modal attention inside state-of-the…

    Submitted 22 November, 2025; originally announced November 2025.

  10. arXiv:2511.16221  [pdf, ps, other]

    cs.CV cs.CL

    Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions

    Authors: Caixin Kang, Yifei Huang, Liangyang Ouyang, Mingfang Zhang, Ruicong Liu, Yoichi Sato

    Abstract: Despite their advanced reasoning capabilities, state-of-the-art Multimodal Large Language Models (MLLMs) demonstrably lack a core component of human intelligence: the ability to 'read the room' and assess deception in complex social interactions. To rigorously quantify this failure, we introduce a new task, Multimodal Interactive Deception Assessment (MIDA), and present a novel multimodal dataset…

    Submitted 20 November, 2025; originally announced November 2025.

  11. arXiv:2510.27195  [pdf, ps, other]

    cs.CV cs.CL cs.SI

    Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions

    Authors: Caixin Kang, Yifei Huang, Liangyang Ouyang, Mingfang Zhang, Yoichi Sato

    Abstract: As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has emerged as a critical frontier. A key aspect of this intelligence is discerning truth from deception, a ubiquitous element of human interaction that is conveyed through a complex interplay of verbal language and non-verbal visual cues. However, automatic deception detection in dynamic,…

    Submitted 4 November, 2025; v1 submitted 31 October, 2025; originally announced October 2025.

    Comments: ICCV2025 Workshop

  12. arXiv:2509.22186  [pdf, ps, other]

    cs.CV cs.CL

    MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

    Authors: Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, Zhenjiang Jin, Guang Liang, Rui Zhang, Wenzheng Zhang, Yuan Qu, Zhifei Ren, Yuefeng Sun, Yuanhong Zheng, Dongsheng Ma, Zirui Tang, Boyu Niu, Ziyang Miao, Hejun Dong, Siyi Qian, Junyuan Zhang , et al. (36 additional authors not shown)

    Abstract: We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsamp…

    Submitted 29 September, 2025; v1 submitted 26 September, 2025; originally announced September 2025.

    Comments: Technical Report; GitHub Repo: https://github.com/opendatalab/MinerU Hugging Face Model: https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B Hugging Face Demo: https://huggingface.co/spaces/opendatalab/MinerU

  13. arXiv:2509.20367  [pdf, ps, other]

    cs.CL cs.AI cs.CY

    Interpreting Public Sentiment in Diplomacy Events: A Counterfactual Analysis Framework Using Large Language Models

    Authors: Leyi Ouyang

    Abstract: Diplomatic events consistently prompt widespread public discussion and debate. Public sentiment plays a critical role in diplomacy, as a good sentiment provides vital support for policy implementation, helps resolve international issues, and shapes a nation's international image. Traditional methods for gauging public sentiment, such as large-scale surveys or manual content analysis of media, are…

    Submitted 15 September, 2025; originally announced September 2025.

    Comments: 2 Figures, 7 Tables, 1 Algorithm

  14. arXiv:2508.03144  [pdf, ps, other]

    cs.CV

    LORE: Latent Optimization for Precise Semantic Control in Rectified Flow-based Image Editing

    Authors: Liangyang Ouyang, Jiafeng Mao

    Abstract: Text-driven image editing enables users to flexibly modify visual content through natural language instructions, and is widely applied to tasks such as semantic object replacement, insertion, and removal. While recent inversion-based editing methods using rectified flow models have achieved promising results in image quality, we identify a structural limitation in their editing behavior: the seman…

    Submitted 21 August, 2025; v1 submitted 5 August, 2025; originally announced August 2025.

    Comments: Our implementation is available at https://github.com/oyly16/LORE

  15. arXiv:2507.21055  [pdf, ps, other]

    cs.CY cs.AI cs.SI

    Can Memory-Augmented LLM Agents Aid Journalism in Interpreting and Framing News for Diverse Audiences?

    Authors: Leyi Ouyang

    Abstract: Modern news is often comprehensive, weaving together information from diverse domains, including technology, finance, and agriculture. This very comprehensiveness creates a challenge for interpretation, as audiences typically possess specialized knowledge related to their expertise, age, or standpoint. Consequently, a reader might fully understand the financial implications of a story but fail to…

    Submitted 2 August, 2025; v1 submitted 30 April, 2025; originally announced July 2025.

    Comments: 9 pages, 4 figures, 5 tables

  16. arXiv:2506.08299  [pdf, ps, other]

    cs.CV

    OpenRR-1k: A Scalable Dataset for Real-World Reflection Removal

    Authors: Kangning Yang, Ling Ouyang, Huiming Sun, Jie Cai, Lan Fu, Jiaming Ding, Chiu Man Ho, Zibo Meng

    Abstract: Reflection removal technology plays a crucial role in photography and computer vision applications. However, existing techniques are hindered by the lack of high-quality in-the-wild datasets. In this paper, we propose a novel paradigm for collecting reflection datasets from a fresh perspective. Our approach is convenient, cost-effective, and scalable, while ensuring that the collected data pairs a…

    Submitted 9 June, 2025; originally announced June 2025.

  17. arXiv:2506.05489  [pdf, other]

    cs.CV

    F2T2-HiT: A U-Shaped FFT Transformer and Hierarchical Transformer for Reflection Removal

    Authors: Jie Cai, Kangning Yang, Ling Ouyang, Lan Fu, Jiaming Ding, Huiming Sun, Chiu Man Ho, Zibo Meng

    Abstract: Single Image Reflection Removal (SIRR) technique plays a crucial role in image processing by eliminating unwanted reflections from the background. These reflections, often caused by photographs taken through glass surfaces, can significantly degrade image quality. SIRR remains a challenging problem due to the complex and varied reflections encountered in real-world scenarios. These reflections var…

    Submitted 5 June, 2025; originally announced June 2025.

  18. arXiv:2506.05482  [pdf, other]

    cs.CV

    OpenRR-5k: A Large-Scale Benchmark for Reflection Removal in the Wild

    Authors: Jie Cai, Kangning Yang, Ling Ouyang, Lan Fu, Jiaming Ding, Jinglin Shen, Zibo Meng

    Abstract: Removing reflections is a crucial task in computer vision, with significant applications in photography and image enhancement. Nevertheless, existing methods are constrained by the absence of large-scale, high-quality, and diverse datasets. In this paper, we present a novel benchmark for Single Image Reflection Removal (SIRR). We have developed a large-scale dataset containing 5,300 high-quality,…

    Submitted 5 June, 2025; originally announced June 2025.

  19. arXiv:2506.05450  [pdf, other]

    cs.CV

    Degradation-Aware Image Enhancement via Vision-Language Classification

    Authors: Jie Cai, Kangning Yang, Jiaming Ding, Lan Fu, Ling Ouyang, Jiang Li, Jinglin Shen, Zibo Meng

    Abstract: Image degradation is a prevalent issue in various real-world applications, affecting visual quality and downstream processing tasks. In this study, we propose a novel framework that employs a Vision-Language Model (VLM) to automatically classify degraded images into predefined categories. The VLM categorizes an input image into one of four degradation types: (A) super-resolution degradation (inclu…

    Submitted 5 June, 2025; originally announced June 2025.

  20. arXiv:2505.24389  [pdf, ps, other]

    cs.CV

    Leadership Assessment in Pediatric Intensive Care Unit Team Training

    Authors: Liangyang Ouyang, Yuki Sakai, Ryosuke Furuta, Hisataka Nozawa, Hikoro Matsui, Yoichi Sato

    Abstract: This paper addresses the task of assessing PICU team's leadership skills by developing an automated analysis framework based on egocentric vision. We identify key behavioral cues, including fixation object, eye contact, and conversation patterns, as essential indicators of leadership assessment. In order to capture these multimodal signals, we employ Aria Glasses to record egocentric video, audio,…

    Submitted 28 August, 2025; v1 submitted 30 May, 2025; originally announced May 2025.

    Comments: This paper is accepted by EgoVis Workshop at CVPR 2025

  21. arXiv:2505.14346  [pdf, ps, other]

    cs.CV

    Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance

    Authors: Mingfang Zhang, Ryo Yonetani, Yifei Huang, Liangyang Ouyang, Ruicong Liu, Yoichi Sato

    Abstract: This paper presents a novel inertial localization framework named Egocentric Action-aware Inertial Localization (EAIL), which leverages egocentric action cues from head-mounted IMU signals to localize the target individual within a 3D point cloud. Human inertial localization is challenging due to IMU sensor noise that causes trajectory drift over time. The diversity of human actions further compli…

    Submitted 26 July, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

    Comments: ICCV 2025

  22. arXiv:2412.07626  [pdf, other]

    cs.CV cs.AI cs.IR

    OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

    Authors: Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, Conghui He

    Abstract: Document content extraction is a critical task in computer vision, underpinning the data needs of large language models (LLMs) and retrieval-augmented generation (RAG) systems. Despite recent progress, current document parsing methods have not been fairly and comprehensively evaluated due to the narrow coverage of document types and the simplified, unrealistic evaluation procedures in existing ben…

    Submitted 25 March, 2025; v1 submitted 10 December, 2024; originally announced December 2024.

    Comments: Accepted by CVPR2025

  23. arXiv:2412.02592  [pdf, ps, other]

    cs.CV

    OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

    Authors: Junyuan Zhang, Qintong Zhang, Bin Wang, Linke Ouyang, Zichen Wen, Ying Li, Ka-Ho Chow, Conghui He, Wentao Zhang

    Abstract: Retrieval-augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge to reduce hallucinations and incorporate up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, given the imperfect…

    Submitted 30 August, 2025; v1 submitted 3 December, 2024; originally announced December 2024.

    Comments: Accepted by ICCV 2025

  24. arXiv:2410.21276  [pdf, other]

    cs.CL cs.AI cs.CV cs.CY cs.LG cs.SD eess.AS

    GPT-4o System Card

    Authors: OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis , et al. (395 additional authors not shown)

    Abstract: GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 mil…

    Submitted 25 October, 2024; originally announced October 2024.

  25. arXiv:2409.18839  [pdf, other]

    cs.CV

    MinerU: An Open-Source Solution for Precise Document Content Extraction

    Authors: Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, Bo Zhang, Liqun Wei, Zhihao Sui, Wei Li, Botian Shi, Yu Qiao, Dahua Lin, Conghui He

    Abstract: Document content analysis has been a crucial research area in computer vision. Despite significant advancements in methods such as OCR, layout detection, and formula recognition, existing open-source solutions struggle to consistently deliver high-quality content extraction due to the diversity in document types and content. To address these challenges, we present MinerU, an open-source solution f…

    Submitted 27 September, 2024; originally announced September 2024.

    Comments: MinerU Technical Report

  26. arXiv:2409.14301  [pdf, other]

    cs.DC cs.SE

    Multi-Grained Specifications for Distributed System Model Checking and Verification

    Authors: Lingzhi Ouyang, Xudong Sun, Ruize Tang, Yu Huang, Madhav Jivrajani, Xiaoxing Ma, Tianyin Xu

    Abstract: This paper presents our experience specifying and verifying the correctness of ZooKeeper, a complex and evolving distributed coordination system. We use TLA+ to model fine-grained behaviors of ZooKeeper and use the TLC model checker to verify its correctness properties; we also check conformance between the model and code. The fundamental challenge is to balance the granularity of specifications a…

    Submitted 27 September, 2024; v1 submitted 21 September, 2024; originally announced September 2024.

  27. arXiv:2409.03643  [pdf, other]

    cs.CV cs.CL

    Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching

    Authors: Bin Wang, Fan Wu, Linke Ouyang, Zhuangcheng Gu, Rui Zhang, Renqiu Xia, Bo Zhang, Conghui He

    Abstract: Formula recognition presents significant challenges due to the complicated structure and varied notation of mathematical expressions. Despite continuous advancements in formula recognition models, the evaluation metrics employed by these models, such as BLEU and Edit Distance, still exhibit notable limitations. They overlook the fact that the same formula has diverse representations and is highly…

    Submitted 24 March, 2025; v1 submitted 5 September, 2024; originally announced September 2024.

    Comments: Accepted by CVPR2025

  28. arXiv:2407.07402  [pdf, other]

    cs.CV

    ActionVOS: Actions as Prompts for Video Object Segmentation

    Authors: Liangyang Ouyang, Ruicong Liu, Yifei Huang, Ryosuke Furuta, Yoichi Sato

    Abstract: Delving into the realm of egocentric vision, the advancement of referring video object segmentation (RVOS) stands as pivotal in understanding human activities. However, existing RVOS task primarily relies on static attributes such as object names to segment target objects, posing challenges in distinguishing target objects from background objects and in identifying objects undergoing state changes…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: This paper is accepted by ECCV2024. Code will be released at https://github.com/ut-vision/ActionVOS

  29. arXiv:2407.03320  [pdf, other]

    cs.CV cs.CL

    InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

    Authors: Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, Songyang Zhang, Wenwei Zhang, Yining Li, Yang Gao, Peng Sun, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Hang Yan, Conghui He, Xingcheng Zhang, Kai Chen, Jifeng Dai, Yu Qiao , et al. (2 additional authors not shown)

    Abstract: We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. Th…

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: Technical Report. https://github.com/InternLM/InternLM-XComposer

  30. arXiv:2406.06068  [pdf, other]

    cs.NI

    Instability of Self-Driving Satellite Mega-Constellation: From Theory to Practical Impacts on Network Lifetime and Capacity

    Authors: Yimei Chen, Yuanjie Li, Hewu Li, Lixin Liu, Li Ouyang, Jiabo Yang, Junyi Li, Jianping Wu, Qian Wu, Jun Liu, Zeqi Lai

    Abstract: Low Earth Orbit (LEO) satellite mega-constellations aim to enable high-speed Internet for numerous users anywhere on Earth. To safeguard their network infrastructure in congested outer space, they perform automatic orbital maneuvers to avoid collisions with external debris and satellites. However, our control-theoretic analysis and empirical validation using Starlink's space situational awareness…

    Submitted 10 June, 2024; originally announced June 2024.

  31. arXiv:2405.18315  [pdf, other]

    cs.AI cs.PL

    DSDL: Data Set Description Language for Bridging Modalities and Tasks in AI Data

    Authors: Bin Wang, Linke Ouyang, Fan Wu, Wenchang Ning, Xiao Han, Zhiyuan Zhao, Jiahui Peng, Yiying Jiang, Dahua Lin, Conghui He

    Abstract: In the era of artificial intelligence, the diversity of data modalities and annotation formats often renders data unusable directly, requiring understanding and format conversion before it can be used by researchers or developers with different needs. To tackle this problem, this article introduces a framework called Dataset Description Language (DSDL) that aims to simplify dataset processing by p…

    Submitted 28 May, 2024; originally announced May 2024.

  32. arXiv:2404.06512  [pdf, other]

    cs.CV cs.CL

    InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

    Authors: Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang

    Abstract: The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 x 1500 pixels and constrained to a relatively narrow reso…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: Code and models are publicly available at https://github.com/InternLM/InternLM-XComposer

  33. arXiv:2403.17297  [pdf, other]

    cs.CL cs.AI

    InternLM2 Technical Report

    Authors: Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang , et al. (75 additional authors not shown)

    Abstract: The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context m…

    Submitted 25 March, 2024; originally announced March 2024.

  34. arXiv:2401.16669  [pdf]

    cs.LG cs.AI physics.ao-ph physics.geo-ph

    Improving Global Weather and Ocean Wave Forecast with Large Artificial Intelligence Models

    Authors: Fenghua Ling, Lin Ouyang, Boufeniza Redouane Larbi, Jing-Jia Luo, Tao Han, Xiaohui Zhong, Lei Bai

    Abstract: The rapid advancement of artificial intelligence technologies, particularly in recent years, has led to the emergence of several large parameter artificial intelligence weather forecast models. These models represent a significant breakthrough, overcoming the limitations of traditional numerical weather prediction models and indicating the emergence of profound potential tools for atmosphere-ocean…

    Submitted 18 April, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

  35. arXiv:2401.16420  [pdf, other]

    cs.CV cs.CL

    InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    Authors: Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang

    Abstract: We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XCo…

    Submitted 29 January, 2024; originally announced January 2024.

    Comments: Code and models are available at https://github.com/InternLM/InternLM-XComposer

  36. arXiv:2311.16839  [pdf, other]

    cs.CV cs.CL

    Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization

    Authors: Zhiyuan Zhao, Bin Wang, Linke Ouyang, Xiaoyi Dong, Jiaqi Wang, Conghui He

    Abstract: Multimodal large language models have made significant advancements in recent years, yet they still suffer from a common issue known as the "hallucination problem", in which the models generate textual descriptions that inaccurately depict or entirely fabricate content from associated images. This paper introduces a novel solution, Hallucination-Aware Direct Preference Optimization (HA-DPO), which…

    Submitted 6 February, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

    Comments: Project Website: https://opendatalab.github.io/HA-DPO, Code: https://github.com/opendatalab/HA-DPO

  37. arXiv:2309.15112  [pdf, other]

    cs.CV

    InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

    Authors: Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Haodong Duan, Songyang Zhang, Shuangrui Ding, Wenwei Zhang, Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang

    Abstract: We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition. The innovative nature of our model is highlighted by three appealing properties: 1) Interleaved Text-Image Composition: InternLM-XComposer can effortlessly generate coherent and contextual articles that seamlessly integrate images, providing a more engaging and immersive rea…

    Submitted 14 December, 2023; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Code and models are available at https://github.com/InternLM/InternLM-XComposer

  38. arXiv:2308.13566  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    MLLM-DataEngine: An Iterative Refinement Approach for MLLM

    Authors: Zhiyuan Zhao, Linke Ouyang, Bin Wang, Siyuan Huang, Pan Zhang, Xiaoyi Dong, Jiaqi Wang, Conghui He

    Abstract: Despite the great advances of Multimodal Large Language Models (MLLMs) in both instruction dataset building and benchmarking, the separation of training and evaluation makes it hard for current MLLMs to further improve their capabilities under the guidance of evaluation results at a relatively low human cost. In this paper, we propose MLLM-DataEngine, a novel closed-loop system that bridges data generat…

    Submitted 11 September, 2023; v1 submitted 24 August, 2023; originally announced August 2023.

    Comments: Code and models are available at https://github.com/opendatalab/MLLM-DataEngine

  39. arXiv:2307.14450  [pdf, other]

    cs.IR

    Integrating Offline Reinforcement Learning with Transformers for Sequential Recommendation

    Authors: Xumei Xi, Yuke Zhao, Quan Liu, Liwen Ouyang, Yang Wu

    Abstract: We consider the problem of sequential recommendation, where the current recommendation is made based on past interactions. This recommendation task requires efficient processing of the sequential data and aims to provide recommendations that maximize the long-term reward. To this end, we train a farsighted recommender by using an offline RL algorithm with the policy network in our model architectu…

    Submitted 26 July, 2023; originally announced July 2023.

  40. arXiv:2303.08774  [pdf, other]

    cs.CL cs.AI

    GPT-4 Technical Report

    Authors: OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko , et al. (256 additional authors not shown)

    Abstract: We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based mo…

    Submitted 4 March, 2024; v1 submitted 15 March, 2023; originally announced March 2023.

    Comments: 100 pages; updated authors list; fixed author names and added citation

  41. arXiv:2302.02703  [pdf, other]

    cs.DC

    Leveraging TLA+ Specifications to Improve the Reliability of the ZooKeeper Coordination Service

    Authors: Lingzhi Ouyang, Yu Huang, Binyu Huang, Xiaoxing Ma

    Abstract: ZooKeeper is a coordination service, widely used as a backbone of various distributed systems. Though its reliability is of critical importance, testing is insufficient for an industrial-strength system of the size and complexity of ZooKeeper, and deep bugs can still be found. To this end, we resort to formal TLA+ specifications to further improve the reliability of ZooKeeper. Our primary objectiv…

    Submitted 16 October, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

  42. arXiv:2206.05802  [pdf, other]

    cs.CL cs.LG

    Self-critiquing models for assisting human evaluators

    Authors: William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, Jan Leike

    Abstract: We fine-tune large language models to write natural language critiques (natural language critical comments) using behavioral cloning. On a topic-based summarization task, critiques written by our models help humans find flaws in summaries that they would have otherwise missed. Our models help find naturally occurring flaws in both model and human written summaries, and intentional flaws in summari…

    Submitted 13 June, 2022; v1 submitted 12 June, 2022; originally announced June 2022.

  43. arXiv:2203.04472  [pdf, other]

    cs.SE

    BinMLM: Binary Authorship Verification with Flow-aware Mixture-of-Shared Language Model

    Authors: Qige Song, Yongzheng Zhang, Linshu Ouyang, Yige Chen

    Abstract: Binary authorship analysis is a significant problem in many software engineering applications. In this paper, we formulate a binary authorship verification task to accurately reflect the real-world working process of software forensic experts. It aims to determine whether an anonymous binary is developed by a specific programmer with a small set of support samples, and the actual developer may not…

    Submitted 8 March, 2022; originally announced March 2022.

    Comments: 12 pages, 8 figures, 5 tables, accepted by Research Track of 29th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2022), the camera-ready version

  44. arXiv:2203.02155  [pdf, other]

    cs.CL cs.AI cs.LG

    Training language models to follow instructions with human feedback

    Authors: Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe

    Abstract: Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning wi…

    Submitted 4 March, 2022; originally announced March 2022.

  45. arXiv:2112.09332  [pdf, other]

    cs.CL cs.AI cs.LG

    WebGPT: Browser-assisted question-answering with human feedback

    Authors: Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, John Schulman

    Abstract: We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must coll…

    Submitted 1 June, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

    Comments: 32 pages

  46. arXiv:2111.10344  [pdf, other]

    cs.LG

    Maximum Mean Discrepancy for Generalization in the Presence of Distribution and Missingness Shift

    Authors: Liwen Ouyang, Aaron Key

    Abstract: Covariate shifts are a common problem in predictive modeling on real-world problems. This paper proposes addressing the covariate shift problem by minimizing Maximum Mean Discrepancy (MMD) statistics between the training and test sets in either feature input space, feature representation space, or both. We designed three techniques that we call MMD Representation, MMD Mask, and MMD Hybrid to deal…

    Submitted 1 March, 2022; v1 submitted 19 November, 2021; originally announced November 2021.

    Comments: a short version accepted by NeurIPS DistShift Workshop 2021
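    For reference, the MMD statistic that the paper above minimizes can be estimated from two samples with a kernel two-sample test. The sketch below is a minimal, illustrative NumPy implementation of the standard (biased) squared-MMD estimate with an RBF kernel; the function names and the `gamma` bandwidth are assumptions for illustration, not the paper's code.

    ```python
    import numpy as np

    def rbf_kernel(x, y, gamma=1.0):
        # Pairwise RBF kernel matrix: k(a, b) = exp(-gamma * ||a - b||^2).
        sq_dists = (np.sum(x**2, axis=1)[:, None]
                    + np.sum(y**2, axis=1)[None, :]
                    - 2.0 * x @ y.T)
        return np.exp(-gamma * sq_dists)

    def mmd2(x, y, gamma=1.0):
        # Biased squared-MMD estimate between samples x and y:
        # mean k(x, x') + mean k(y, y') - 2 * mean k(x, y).
        kxx = rbf_kernel(x, x, gamma)
        kyy = rbf_kernel(y, y, gamma)
        kxy = rbf_kernel(x, y, gamma)
        return kxx.mean() + kyy.mean() - 2.0 * kxy.mean()
    ```

    Identical samples give a squared MMD of zero, and the statistic grows as the two distributions drift apart, which is what makes it usable as a training penalty against covariate shift.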

  47. arXiv:2110.13799  [pdf, other]

    cs.LG

    Neural PPO-Clip Attains Global Optimality: A Hinge Loss Perspective

    Authors: Nai-Chieh Huang, Ping-Chun Hsieh, Kuo-Hao Ho, Hsuan-Yu Yao, Kai-Chun Hu, Liang-Chun Ouyang, I-Chen Wu

    Abstract: Policy optimization is a fundamental principle for designing reinforcement learning algorithms, and one example is the proximal policy optimization algorithm with a clipped surrogate objective (PPO-Clip), which has been popularly used in deep reinforcement learning due to its simplicity and effectiveness. Despite its superior empirical performance, PPO-Clip has not been justified via theoretical p…

    Submitted 31 August, 2022; v1 submitted 26 October, 2021; originally announced October 2021.

    Comments: 33 pages, 1 figure
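    For context, the clipped surrogate objective that the paper above analyzes is short enough to state in code. This is a minimal sketch of the standard PPO-Clip objective (not the paper's hinge-loss reformulation); `ratio` is the probability ratio between new and old policies and `advantage` the estimated advantages, both assumed given.

    ```python
    import numpy as np

    def ppo_clip_objective(ratio, advantage, eps=0.2):
        # Standard clipped surrogate: E[min(r * A, clip(r, 1-eps, 1+eps) * A)].
        clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
        return np.minimum(ratio * advantage, clipped * advantage).mean()
    ```

    The clip keeps large policy-ratio updates from being rewarded, which is the mechanism whose convergence the paper studies through a hinge-loss perspective.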

  48. arXiv:2109.10862  [pdf, other]

    cs.CL cs.AI cs.LG

    Recursively Summarizing Books with Human Feedback

    Authors: Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, Paul Christiano

    Abstract: A major challenge for scaling machine learning is training models to perform tasks that are very difficult or time-consuming for humans to evaluate. We present progress on this problem on the task of abstractive summarization of entire fiction novels. Our method combines learning from human feedback with recursive task decomposition: we use models trained on smaller parts of the task to assist hum…

    Submitted 27 September, 2021; v1 submitted 22 September, 2021; originally announced September 2021.

  49. arXiv:2109.04318  [pdf, other]

    cs.LG stat.ML

    Estimation of Corporate Greenhouse Gas Emissions via Machine Learning

    Authors: You Han, Achintya Gopal, Liwen Ouyang, Aaron Key

    Abstract: As an important step to fulfill the Paris Agreement and achieve net-zero emissions by 2050, the European Commission adopted the most ambitious package of climate impact measures in April 2021 to improve the flow of capital towards sustainable activities. For these and other international measures to be successful, reliable data is key. The ability to see the carbon footprint of companies around th…

    Submitted 9 September, 2021; originally announced September 2021.

    Comments: Accepted for the Tackling Climate Change with Machine Learning Workshop at ICML 2021

  50. arXiv:2009.01325  [pdf, other]

    cs.CL cs.AI cs.LG

    Learning to summarize from human feedback

    Authors: Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano

    Abstract: As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about -- summary quality. In this work, we show that it is possible t…

    Submitted 15 February, 2022; v1 submitted 2 September, 2020; originally announced September 2020.

    Comments: NeurIPS 2020