Starred repositories
A Comprehensive Survey on Evaluating Reasoning Capabilities in Multimodal Large Language Models.
Official code and dataset for our ICCV 2025 paper: MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models
This repository contains the official implementation of "FastVLM: Efficient Vision Encoding for Vision Language Models" - CVPR 2025
The official implementation of “ReGuLaR: Variational Latent Reasoning Guided by Rendered Chain-of-Thought”
The official implementation of "DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation". (arXiv 2601.22153)
VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning.
[ICLR 2026] Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation
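GRPO's central step, which the difficulty-aware variant above builds on, is normalizing each sampled response's reward against its own group. A minimal sketch of that baseline step, assuming scalar correctness rewards (the function name is illustrative, not from this repo, and the paper's difficulty-aware weighting is not reproduced here):

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages as in vanilla GRPO: each sampled
    response's reward is normalized by the mean/std of its own group."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 responses sampled for one question, scored 0/1 for correctness.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # approx. [ 1. -1. -1.  1.]
```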
moojink / openvla-oft
Forked from openvla/openvla
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
RLinf: Reinforcement Learning Infrastructure for Embodied and Agentic AI
Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge
Towards Efficient Multimodal Large Language Models: A Survey on Token Compression
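For context on what such surveys cover: one widely used compression baseline keeps only the visual tokens that receive the most attention from the [CLS] token (EViT-style pruning). A minimal sketch, assuming precomputed attention weights (names and shapes are illustrative, not from this survey):

```python
import numpy as np

def prune_visual_tokens(tokens, cls_attn, keep_ratio=0.25):
    """Keep the visual tokens most attended by [CLS], a common
    token-compression baseline for multimodal LLMs.

    tokens:   (N, D) visual token embeddings
    cls_attn: (N,)   attention weights from [CLS] to each token
    """
    k = max(1, int(keep_ratio * len(tokens)))
    keep = np.argsort(cls_attn)[-k:]  # indices of the top-k tokens
    return tokens[np.sort(keep)]      # re-sort to preserve spatial order

# Example: 576 tokens (a 24x24 patch grid) compressed to 144.
toks = np.random.randn(576, 1024)
attn = np.random.rand(576)
print(prune_visual_tokens(toks, attn).shape)  # (144, 1024)
```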
Convert any PDF into a podcast episode!
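Tools like this typically chain three stages: extract the PDF text, turn it into a conversational script with an LLM, and synthesize speech. A minimal sketch of the first and last stages using pypdf and pyttsx3, assuming that generic pipeline rather than this repo's actual stack (the LLM scripting step is omitted):

```python
from pypdf import PdfReader  # pip install pypdf
import pyttsx3               # pip install pyttsx3 (offline TTS)

def pdf_to_podcast(pdf_path, out_path="episode.wav"):
    """Naive PDF-to-audio pipeline: extract text, then synthesize speech.
    Real tools insert an LLM 'scripting' step between these two stages."""
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    engine = pyttsx3.init()
    engine.save_to_file(text, out_path)
    engine.runAndWait()

pdf_to_podcast("paper.pdf")
```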
[CVPR 2026] Official codes of "Monet: Reasoning in Latent Visual Space Beyond Image and Language"
Agent0 Series: Self-Evolving Agents from Zero Data
Official repo of "Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens"
[CVPR 2026 (Findings) 🔥🔥] Self Evolving Large Multimodal Models with Continuous Rewards
Official implementation of "Continuous Autoregressive Language Models"
Official codebase for the paper Latent Visual Reasoning
Qwen3-VL is the multimodal large language model series developed by the Qwen team, Alibaba Cloud.
Official repo for "PAPO: Perception-Aware Policy Optimization for Multimodal Reasoning"
✨ [ICLR'26] WithAnyone generates high-quality, controllable, ID-consistent images
Official implementation of "Spatial-Forcing: Implicit Spatial Representation Alignment for Vision-Language-Action Model" [ICLR 2026]
Official repository for OmniVLA training and inference code