Image World Models (IWM): Learning and Leveraging World Models in Visual Representation Learning
Date: 2024-03-01
Modality: image
Authors: Quentin Garrido, Mahmoud Assran, Nicolas Ballas, Adrien Bardes + 2 more
Tags: world-model, photometric-transformations, representation-learning, predictor-reuse
Source: Full text

Image World Models (IWM)

Extends JEPA beyond masked image modeling to predict the effect of global photometric transformations in latent space. Key insight: the JEPA predictor (world model) should not be discarded after pretraining — it can be fine-tuned for downstream tasks.

[Figure: I-JEPA architecture]

Core idea

Standard I-JEPA predicts the representations of missing (masked) patches. IWM generalizes the prediction task: given an image and a description of a transformation (e.g., color jitter, blur, crop), predict the transformed image's representation in latent space. The transformation parameters serve as "actions", making this a world model for image transformations.
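The prediction task can be sketched as follows. This is an illustrative stand-in, not the paper's code: the encoder and predictor are toy linear modules, and in the actual method the target is produced by an EMA target encoder over a ViT.

```python
import torch
import torch.nn as nn

DIM, IN_DIM, ACTION_DIM = 64, 16, 8

encoder = nn.Linear(IN_DIM, DIM)             # stand-in for a ViT encoder
predictor = nn.Sequential(                   # world model, conditioned on the action
    nn.Linear(DIM + ACTION_DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM)
)

x = torch.randn(4, IN_DIM)                   # source image (flattened stand-in)
x_aug = torch.randn(4, IN_DIM)               # photometrically transformed view
action = torch.randn(4, ACTION_DIM)          # transformation parameters ("action")

z = encoder(x)
with torch.no_grad():
    target = encoder(x_aug)                  # in practice an EMA target encoder
pred = predictor(torch.cat([z, action], dim=-1))
loss = nn.functional.mse_loss(pred, target)  # predict the transformed representation
loss.backward()
```

The key structural point is the concatenation: the predictor sees both the source representation and the action, so it must model how the transformation moves points in latent space.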

Three key aspects of learning performant IWMs

  1. Conditioning: the predictor must be conditioned on the transformation parameters (the "action")
  2. Prediction difficulty: the transformations must be challenging enough to learn from — too easy and the model doesn't learn useful structure
  3. Capacity: the predictor's capacity controls the abstraction level of learned representations
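For aspect 1, the transformation parameters have to be turned into a conditioning vector. A minimal sketch, where the parameter names and the Fourier-feature embedding are illustrative assumptions rather than the paper's exact scheme:

```python
import torch

def encode_action(params: dict, n_freqs: int = 4) -> torch.Tensor:
    """Embed scalar transformation parameters into an 'action' vector."""
    p = torch.tensor([params[k] for k in sorted(params)])  # fixed key order
    freqs = 2.0 ** torch.arange(n_freqs)                   # 1, 2, 4, 8
    ang = p[:, None] * freqs[None, :]                      # (n_params, n_freqs)
    return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1).flatten()

action = encode_action({"brightness": 0.4, "contrast": 0.4,
                        "saturation": 0.2, "hue": 0.1, "blur_sigma": 1.5})
# action can then be concatenated to the predictor's input
```

Any injective encoding would do in principle; the point is that the predictor cannot model the transformation (aspects 2 and 3) unless it is told which transformation was applied.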

The predictor should not be discarded

The standard practice in SSL is to discard the predictor after pretraining and only use the encoder. IWM shows this is wasteful:

  • Fine-tuning the world model on top of a frozen encoder outperforms encoder-only fine-tuning, at a fraction of the compute cost and number of fine-tuned parameters
  • Only the learned world model shows this behavior — a randomly initialized network of the same architecture does not
  • The world model can be fine-tuned to solve multiple tasks simultaneously (inspired by instruction tuning)
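The reuse recipe above amounts to the following training loop. All modules here are toy stand-ins (the real encoder and predictor are pretrained transformers), but the freezing pattern is the point: gradients flow only into the world model and a small task head.

```python
import torch
import torch.nn as nn

DIM, N_CLASSES = 64, 10
encoder = nn.Linear(32, DIM)      # stand-in for the pretrained, frozen encoder
predictor = nn.Linear(DIM, DIM)   # stand-in for the pretrained world model
head = nn.Linear(DIM, N_CLASSES)  # new task head

for p in encoder.parameters():    # frozen: only predictor + head are tuned
    p.requires_grad_(False)

opt = torch.optim.AdamW(
    list(predictor.parameters()) + list(head.parameters()), lr=1e-4
)

x, y = torch.randn(8, 32), torch.randint(0, N_CLASSES, (8,))
with torch.no_grad():
    z = encoder(x)                # frozen features
logits = head(predictor(z))       # fine-tune the world model on top
loss = nn.functional.cross_entropy(logits, y)
loss.backward()
opt.step()
```

For multi-task use, one could condition the predictor on a learned per-task token in place of the transformation action, in the spirit of instruction tuning, though the exact conditioning is a design choice.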

Invariance vs. equivariance trade-off

Predictor capacity controls abstraction:

  • Identity predictor (no predictor): learns invariant representations — encodes only what's shared between input and its transformation. Best for linear evaluation.
  • High-capacity predictor: learns equivariant representations — the encoder retains more information, and the predictor models the transformation. Best for world model fine-tuning.
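The two regimes can be probed with simple latent-space metrics. This sketch uses random tensors and illustrative metric names; an invariant encoder scores high on the similarity probe, while an equivariant encoder-predictor pair scores low on the prediction error instead.

```python
import torch
import torch.nn.functional as F

def invariance(z: torch.Tensor, z_aug: torch.Tensor) -> torch.Tensor:
    # high => the encoder maps an image and its transform to similar codes
    return F.cosine_similarity(z, z_aug, dim=-1).mean()

def equivariance_error(pred: torch.Tensor, z_aug: torch.Tensor) -> torch.Tensor:
    # low => the predictor successfully models the transformation in latent space
    return F.mse_loss(pred, z_aug)

z, z_aug, pred = torch.randn(4, 64), torch.randn(4, 64), torch.randn(4, 64)
inv = invariance(z, z_aug)
err = equivariance_error(pred, z_aug)
```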

This trade-off is unique to the JEPA framework and gives practitioners control over representation properties.

Significance in the JEPA timeline

IWM bridges the gap between SSL (where world models are discarded) and RL (where world models are leveraged for planning). It showed that JEPA's predictor is a genuine world model worth keeping, foreshadowing V-JEPA 2's use of the world model for robot planning.
