
Showing 1–17 of 17 results for author: Bardes, A

Searching in archive cs.
  1. arXiv:2604.03208 [pdf, ps, other]

    cs.LG

    Hierarchical Planning with Latent World Models

    Authors: Wancong Zhang, Basile Terver, Artem Zholus, Soham Chitnis, Harsh Sutaria, Mido Assran, Randall Balestriero, Amir Bar, Adrien Bardes, Yann LeCun, Nicolas Ballas

    Abstract: Model predictive control (MPC) with learned world models has emerged as a promising paradigm for embodied control, particularly for its ability to generalize zero-shot when deployed in new environments. However, learned world models often struggle with long-horizon control due to the accumulation of prediction errors and the exponentially growing search space. In this work, we address these challe…

    Submitted 3 April, 2026; originally announced April 2026.
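    The planning paradigm this abstract builds on, model predictive control with a learned world model, can be sketched in a few lines. This is a toy illustration only: the scalar linear `dynamics` function stands in for a learned latent world model, and the cross-entropy-method planner, population sizes, and quadratic cost are invented for the example, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamics(state, action):
    """Hypothetical stand-in for a learned world model: a known scalar
    linear system, used here only to make the planning loop concrete."""
    return 0.9 * state + action

def cost(state, goal):
    """Quadratic distance to the goal state."""
    return (state - goal) ** 2

def plan_cem(state, goal, horizon=5, pop=64, elites=8, iters=10):
    """MPC inner loop via the cross-entropy method: sample action
    sequences, roll them out in the model, refit a Gaussian to the best."""
    mu = np.zeros(horizon)
    sigma = np.ones(horizon)
    for _ in range(iters):
        actions = rng.normal(mu, sigma, size=(pop, horizon))
        costs = np.empty(pop)
        for i in range(pop):
            s, c = state, 0.0
            for a in actions[i]:
                s = dynamics(s, a)
                c += cost(s, goal)
            costs[i] = c
        best = actions[np.argsort(costs)[:elites]]
        mu, sigma = best.mean(axis=0), best.std(axis=0) + 1e-3
    return mu[0]  # execute the first action, then replan (receding horizon)

a0 = plan_cem(state=0.0, goal=1.0)
```

    Note how the planner only ever queries the model: this is what lets such controllers generalize zero-shot, and also why model prediction errors compound over long horizons, the failure mode the paper targets.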

  2. arXiv:2603.14482 [pdf, ps, other]

    cs.CV

    V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

    Authors: Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, Adrien Bardes

    Abstract: We present V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for both images and videos while retaining strong global scene understanding. The approach combines four key components. First, a dense predictive loss uses a masking-based objective in which both visible and masked tokens contribute to the training signal, encouraging explicit spatial a…

    Submitted 17 March, 2026; v1 submitted 15 March, 2026; originally announced March 2026.

  3. arXiv:2512.24497 [pdf, ps, other]

    cs.AI cs.LG cs.RO stat.ML

    What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?

    Authors: Basile Terver, Tsung-Yen Yang, Jean Ponce, Adrien Bardes, Yann LeCun

    Abstract: A long-standing challenge in AI is to develop agents capable of solving a wide range of physical tasks and generalizing to new, unseen tasks and environments. A popular recent approach involves training a world model from state-action trajectories and subsequently using it with a planning algorithm to solve new tasks. Planning is commonly performed in the input space, but a recent family of methods…

    Submitted 8 January, 2026; v1 submitted 30 December, 2025; originally announced December 2025.

    Comments: V2 of the article: added AdaLN-zero; added a table comparing JEPA-WMs with baselines, with std reflecting per-seed variability only (no variability across epochs); reordered figures in the main body of the paper

  4. arXiv:2506.09985 [pdf, ps, other]

    cs.AI cs.CV cs.LG cs.RO

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Authors: Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma, et al. (5 additional authors not shown)

    Abstract: A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: 48 pages, 19 figures

  5. arXiv:2502.11831 [pdf, other]

    cs.CV cs.AI

    Intuitive physics understanding emerges from self-supervised pretraining on natural videos

    Authors: Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, Yann LeCun

    Abstract: We investigate the emergence of intuitive physics understanding in general-purpose deep neural network models trained to predict masked regions in natural videos. Leveraging the violation-of-expectation framework, we find that video prediction models trained to predict outcomes in a learned representation space demonstrate an understanding of various intuitive physics properties, such as object pe…

    Submitted 17 February, 2025; originally announced February 2025.

    Comments: 24 pages, 14 figures, 5 tables
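    The violation-of-expectation framework this abstract leverages can be illustrated with a minimal sketch: treat the model's prediction error on an observed outcome as its "surprise", and check that a physically implausible outcome scores higher than the plausible continuation. The representations and numbers below are invented stand-ins, not the paper's measurements.

```python
import numpy as np

def surprise(predicted_repr, observed_repr):
    """Surprise as the prediction error between what the model expected
    to see and what was actually observed, measured in feature space."""
    return float(np.mean((predicted_repr - observed_repr) ** 2))

predicted = np.array([1.0, 1.0])     # where the model expects the object
plausible = np.array([1.05, 0.98])   # object roughly where expected
implausible = np.array([3.0, -2.0])  # object "teleports" somewhere else

# A model with intuitive physics should be more surprised by the
# implausible outcome than by the plausible one.
gap = surprise(predicted, implausible) - surprise(predicted, plausible)
```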

  6. arXiv:2405.17247 [pdf, other]

    cs.LG

    An Introduction to Vision-Language Modeling

    Authors: Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C. Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, Mark Ibrahim, Melissa Hall, Yunyang Xiong, Jonathan Lebensold, Candace Ross, Srihari Jayakumar, Chuan Guo, Diane Bouchacourt, Haider Al-Tahan, Karthik Padthe, Vasu Sharma, Hu Xu, Xiaoqing Ellen Tan, Megan Richards, Samuel Lavoie, et al. (16 additional authors not shown)

    Abstract: Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technol…

    Submitted 27 May, 2024; originally announced May 2024.

  7. arXiv:2404.08471 [pdf, other]

    cs.CV cs.AI cs.LG

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Authors: Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, Nicolas Ballas

    Abstract: This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datase…

    Submitted 15 February, 2024; originally announced April 2024.
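    The feature-prediction objective described in this abstract can be sketched in a few lines. This is a hypothetical stand-in: linear maps replace V-JEPA's transformer encoders and predictor, and all names and shapes are invented, so only the structure of the training signal is illustrated (predict the target encoder's features of masked tokens, no pixel reconstruction, target encoder updated by EMA rather than gradients).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video": T patch tokens with D-dimensional pixel features.
T, D, H = 16, 8, 8
x = rng.normal(size=(T, D))

W_ctx = rng.normal(scale=0.1, size=(D, H))   # context encoder (trained)
W_tgt = rng.normal(scale=0.1, size=(D, H))   # target encoder (EMA-updated)
W_pred = rng.normal(scale=0.1, size=(H, H))  # predictor (trained)

mask = np.zeros(T, dtype=bool)
mask[T // 2:] = True                         # mask out the second half

# Feature prediction: from visible tokens only, predict the target
# encoder's *features* of the masked tokens -- no pixels, no text,
# no negative examples.
context = x[~mask] @ W_ctx
pooled = context.mean(axis=0)                # crude summary of the context
predictions = np.tile(pooled @ W_pred, (int(mask.sum()), 1))
targets = x[mask] @ W_tgt                    # treated as constants (stop-grad)
loss = float(np.mean((predictions - targets) ** 2))

# The target encoder follows the context encoder as an exponential
# moving average instead of receiving gradients.
momentum = 0.99
W_tgt = momentum * W_tgt + (1 - momentum) * W_ctx
```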

  8. arXiv:2403.00504 [pdf, other]

    cs.CV cs.AI cs.LG

    Learning and Leveraging World Models in Visual Representation Learning

    Authors: Quentin Garrido, Mahmoud Assran, Nicolas Ballas, Adrien Bardes, Laurent Najman, Yann LeCun

    Abstract: Joint-Embedding Predictive Architecture (JEPA) has emerged as a promising self-supervised approach that learns by leveraging a world model. While previously limited to predicting missing parts of an input, we explore how to generalize the JEPA prediction task to a broader set of corruptions. We introduce Image World Models, an approach that goes beyond masked image modeling and learns to predict t…

    Submitted 1 March, 2024; originally announced March 2024.

    Comments: 23 pages, 16 figures

  9. arXiv:2310.19909 [pdf, other]

    cs.CV cs.LG

    Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks

    Authors: Micah Goldblum, Hossein Souri, Renkun Ni, Manli Shu, Viraj Prabhu, Gowthami Somepalli, Prithvijit Chattopadhyay, Mark Ibrahim, Adrien Bardes, Judy Hoffman, Rama Chellappa, Andrew Gordon Wilson, Tom Goldstein

    Abstract: Neural network based computer vision systems are typically built on a backbone, a pretrained or randomly initialized feature extractor. Several years ago, the default option was an ImageNet-trained convolutional neural network. However, the recent past has seen the emergence of countless backbones pretrained using various algorithms and datasets. While this abundance of choice has led to performan…

    Submitted 19 November, 2023; v1 submitted 30 October, 2023; originally announced October 2023.

    Comments: Accepted to NeurIPS 2023

  10. arXiv:2307.12698 [pdf, other]

    cs.CV cs.AI cs.LG

    MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features

    Authors: Adrien Bardes, Jean Ponce, Yann LeCun

    Abstract: Self-supervised learning of visual representations has focused on learning content features, which do not capture object motion or location but focus on identifying and differentiating objects in images and videos. On the other hand, optical flow estimation is a task that does not involve understanding the content of the images on which it is estimated. We unify the two approaches and intro…

    Submitted 24 July, 2023; originally announced July 2023.

  11. arXiv:2304.12210 [pdf, other]

    cs.LG cs.CV

    A Cookbook of Self-Supervised Learning

    Authors: Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Tian, Avi Schwarzschild, Andrew Gordon Wilson, Jonas Geiping, Quentin Garrido, Pierre Fernandez, Amir Bar, Hamed Pirsiavash, Yann LeCun, Micah Goldblum

    Abstract: Self-supervised learning, dubbed the dark matter of intelligence, is a promising path to advance machine learning. Yet, much like cooking, training SSL methods is a delicate art with a high barrier to entry. While many components are familiar, successfully training a SSL method involves a dizzying set of choices from the pretext tasks to training hyper-parameters. Our goal is to lower the barrier…

    Submitted 28 June, 2023; v1 submitted 24 April, 2023; originally announced April 2023.

  12. arXiv:2304.11718 [pdf, other]

    cs.CV cs.AI

    No Free Lunch in Self Supervised Representation Learning

    Authors: Ihab Bendidi, Adrien Bardes, Ethan Cohen, Alexis Lamiable, Guillaume Bollot, Auguste Genovesio

    Abstract: Self-supervised representation learning in computer vision relies heavily on hand-crafted image transformations to learn meaningful and invariant features. However, few extensive explorations of the impact of transformation design have been conducted in the literature. In particular, the dependence of downstream performance on transformation design has been established, but not studied in depth. I…

    Submitted 23 April, 2023; originally announced April 2023.

    MSC Class: I.5.1; I.4.10

  13. arXiv:2210.01571 [pdf, other]

    cs.CV cs.AI cs.LG

    VICRegL: Self-Supervised Learning of Local Visual Features

    Authors: Adrien Bardes, Jean Ponce, Yann LeCun

    Abstract: Most recent self-supervised methods for learning image representations focus on either producing a global feature with invariance properties, or producing a set of local features. The former works best for classification tasks while the latter is best for detection and segmentation tasks. This paper explores the fundamental trade-off between learning local and global features. A new method called…

    Submitted 4 October, 2022; originally announced October 2022.

    Comments: Accepted at NeurIPS 2022

  14. arXiv:2206.13378 [pdf, other]

    cs.LG

    Guillotine Regularization: Why removing layers is needed to improve generalization in Self-Supervised Learning

    Authors: Florian Bordes, Randall Balestriero, Quentin Garrido, Adrien Bardes, Pascal Vincent

    Abstract: One unexpected technique that emerged in recent years consists in training a Deep Network (DN) with a Self-Supervised Learning (SSL) method, and using this network on downstream tasks but with its last few projector layers entirely removed. This trick of throwing away the projector is actually critical for SSL methods to display competitive performances on ImageNet for which more than 30 percentag…

    Submitted 9 June, 2023; v1 submitted 27 June, 2022; originally announced June 2022.

    Comments: Accepted at TMLR 2023

  15. arXiv:2206.08954 [pdf, other]

    cs.CV cs.LG

    Bag of Image Patch Embedding Behind the Success of Self-Supervised Learning

    Authors: Yubei Chen, Adrien Bardes, Zengyi Li, Yann LeCun

    Abstract: Self-supervised learning (SSL) has recently achieved tremendous empirical advancements in learning image representation. However, our understanding of the principle behind learning such a representation is still limited. This work shows that joint-embedding SSL approaches primarily learn a representation of image patches, which reflects their co-occurrence. Such a connection to co-occurrence model…

    Submitted 12 June, 2023; v1 submitted 17 June, 2022; originally announced June 2022.

  16. On the duality between contrastive and non-contrastive self-supervised learning

    Authors: Quentin Garrido, Yubei Chen, Adrien Bardes, Laurent Najman, Yann LeCun

    Abstract: Recent approaches in self-supervised learning of image representations can be categorized into different families of methods and, in particular, can be divided into contrastive and non-contrastive approaches. While differences between the two families have been thoroughly discussed to motivate new approaches, we focus more on the theoretical similarities between them. By designing contrastive and…

    Submitted 26 June, 2023; v1 submitted 3 June, 2022; originally announced June 2022.

    Comments: The Eleventh International Conference on Learning Representations, 2023, Kigali, Rwanda

  17. arXiv:2105.04906 [pdf, other]

    cs.CV cs.AI cs.LG

    VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning

    Authors: Adrien Bardes, Jean Ponce, Yann LeCun

    Abstract: Recent self-supervised methods for image representation learning are based on maximizing the agreement between embedding vectors from different views of the same image. A trivial solution is obtained when the encoder outputs constant vectors. This collapse problem is often avoided through implicit biases in the learning architecture, which often lack a clear justification or interpretation. In this…

    Submitted 28 January, 2022; v1 submitted 11 May, 2021; originally announced May 2021.

    Comments: Accepted at ICLR 2022
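    The three regularization terms that give VICReg its name can be sketched directly from this description. This is a NumPy illustration under stated assumptions: the term weights and batch sizes are illustrative values chosen for the example, not necessarily the paper's tuned hyper-parameters.

```python
import numpy as np

def vicreg_loss(z_a, z_b, sim=25.0, std=25.0, cov=1.0, eps=1e-4):
    """Sketch of a VICReg-style objective on two embedding batches of
    shape (N, D) coming from two augmented views of the same images."""
    n, d = z_a.shape
    # Invariance: two views of the same image should embed close together.
    inv_loss = np.mean((z_a - z_b) ** 2)
    # Variance: hinge keeping each dimension's std above 1, which rules
    # out the trivial collapse to constant output vectors.
    std_a = np.sqrt(z_a.var(axis=0) + eps)
    std_b = np.sqrt(z_b.var(axis=0) + eps)
    var_loss = (np.mean(np.maximum(0.0, 1.0 - std_a))
                + np.mean(np.maximum(0.0, 1.0 - std_b)))
    # Covariance: push off-diagonal covariances toward zero so that the
    # embedding dimensions carry decorrelated information.
    def off_diag_sq(z):
        zc = z - z.mean(axis=0)
        c = (zc.T @ zc) / (n - 1)
        return (c[~np.eye(d, dtype=bool)] ** 2).sum() / d
    cov_loss = off_diag_sq(z_a) + off_diag_sq(z_b)
    return sim * inv_loss + std * var_loss + cov * cov_loss

rng = np.random.default_rng(0)
z_spread = rng.normal(size=(32, 4))     # well-spread embeddings
z_collapsed = np.ones((32, 4))          # collapsed to a constant vector
loss_spread = vicreg_loss(z_spread, z_spread)
loss_collapsed = vicreg_loss(z_collapsed, z_collapsed)
```

    Unlike the implicit collapse-avoidance biases the abstract criticizes, the variance term makes the mechanism explicit: a collapsed batch is penalized directly, with no asymmetric networks or negative pairs required.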