JEPAwiki
I-JEPA: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
Date: 2023-01-19
Modality: image
Authors: Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski + 4 more
Tags: foundational, masking, vision-transformer, self-supervised-learning
Source: Abstract only

I-JEPA

The first major concrete instantiation of JEPA for images. I-JEPA demonstrated that JEPA could learn highly semantic image representations without hand-crafted data augmentations — a key differentiator from augmentation-reliant methods such as SimCLR (contrastive) and DINO (self-distillation).

(Figure: I-JEPA architecture)

Core idea

From a single context block, predict the representations of various target blocks in the same image. The prediction happens entirely in latent space — no pixel reconstruction.
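The core idea can be sketched as a toy training objective. All module names and shapes here are illustrative stand-ins (the real encoders are ViTs and the predictor is conditioned on positional embeddings), not the paper's code:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim = 64                      # embedding dimension (illustrative)
n_ctx, n_tgt = 12, 4          # number of context / target patches

# Linear layers stand in for the ViT encoders and the narrow ViT predictor.
context_encoder = torch.nn.Linear(dim, dim)
target_encoder  = torch.nn.Linear(dim, dim)
predictor       = torch.nn.Linear(dim, dim)

context_patches = torch.randn(n_ctx, dim)   # visible patches only
target_patches  = torch.randn(n_tgt, dim)   # masked-out region

# Encode. Targets come from the EMA encoder, so no gradient flows through them.
ctx = context_encoder(context_patches)
with torch.no_grad():
    tgt = target_encoder(target_patches)

# Predict each target embedding from the pooled context (the real predictor
# attends over context tokens plus target positional embeddings).
pred = predictor(ctx.mean(dim=0, keepdim=True)).expand(n_tgt, -1)

# L2 distance in latent space — no decoder, no pixel reconstruction.
loss = F.mse_loss(pred, tgt)
```

The `no_grad` around the target encoder is the essential asymmetry: only the context branch and predictor receive gradients, which (together with EMA target updates) prevents representational collapse.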

Key design choices

  • Masking strategy is critical: target blocks must be sampled at sufficiently large scale (semantic level), and the context block must be spatially distributed enough to be informative.
  • Non-generative: unlike MAE, I-JEPA does not reconstruct pixels. This forces the model to learn higher-level, more semantic features.
  • Vision Transformer backbone: scales well — ViT-Huge/14 trained on ImageNet with 16 A100 GPUs in under 72 hours.
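The multi-block masking strategy described above can be sketched as a block sampler over the patch grid. The target scale and aspect-ratio ranges follow this page; the context block's scale range and unit aspect ratio are assumptions for illustration:

```python
import math
import random

def sample_block(grid_h, grid_w, scale=(0.15, 0.7), aspect=(0.75, 1.5), rng=random):
    """Sample one rectangular block of (row, col) patch indices whose area
    is a random fraction of the grid and whose aspect ratio lies in `aspect`."""
    area = rng.uniform(*scale) * grid_h * grid_w
    ar = rng.uniform(*aspect)
    h = max(1, min(grid_h, round(math.sqrt(area / ar))))
    w = max(1, min(grid_w, round(math.sqrt(area * ar))))
    top = rng.randint(0, grid_h - h)
    left = rng.randint(0, grid_w - w)
    return {(top + i, left + j) for i in range(h) for j in range(w)}

# For a 14x14 patch grid: sample several target blocks, then a large context
# block with the target patches removed, so context never overlaps targets.
grid = 14
targets = [sample_block(grid, grid) for _ in range(4)]
context = sample_block(grid, grid, scale=(0.85, 1.0), aspect=(1.0, 1.0))
visible = context - set().union(*targets)
```

Removing target patches from the context is what makes the prediction task non-trivial: the model must infer the masked regions' semantics from spatially distributed surrounding evidence.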

Architecture details

  • Context encoder: ViT-Huge/14 (632M parameters), processes only visible patches
  • Target encoder: same architecture, updated via exponential moving average (EMA) of context encoder weights
  • Predictor: narrow ViT, takes context embeddings + target positional embeddings, outputs predicted target representations
  • Target sampling: large contiguous blocks (scale range ~0.15-0.7 of image), aspect ratio 0.75-1.5
  • Loss: L2 distance between predicted and target embeddings — no decoder, no pixel loss
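The EMA update of the target encoder can be sketched as follows; the momentum value is illustrative (the paper schedules it toward 1.0 over training), and the `Linear` modules are stand-ins for the two identical-architecture ViTs:

```python
import torch

torch.manual_seed(0)
# Toy stand-ins for the context and target encoders (same architecture).
context_encoder = torch.nn.Linear(8, 8)
target_encoder  = torch.nn.Linear(8, 8)

@torch.no_grad()
def ema_update(target, context, momentum=0.996):
    """Move each target-encoder parameter toward the corresponding
    context-encoder parameter: p_t <- m * p_t + (1 - m) * p_c."""
    for p_t, p_c in zip(target.parameters(), context.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1.0 - momentum)

before = (target_encoder.weight - context_encoder.weight).abs().sum().item()
ema_update(target_encoder, context_encoder)
after = (target_encoder.weight - context_encoder.weight).abs().sum().item()
# `after` is exactly momentum * `before`: the gap shrinks geometrically.
```

Only the context encoder is updated by gradient descent; the target encoder trails it via this EMA, providing stable regression targets.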

Results

| Task | Metric | I-JEPA (ViT-H/14) | MAE (ViT-H/16) | DINO (ViT-B/16) |
|---|---|---|---|---|
| ImageNet linear | top-1 acc | 77.3% | 76.6% | 78.2% |
| ImageNet 1% labels | top-1 acc | 67.0% | 54.1% | — |
| Object counting | MAE | 60.0 | 64.7 | — |

I-JEPA is particularly strong in low-label regimes: with only 1% of ImageNet labels, it outperforms MAE by roughly 13 points, demonstrating that latent prediction learns more transferable features than pixel reconstruction.

Significance in the JEPA timeline

This is the proof-of-concept that turned JEPA from a theoretical framework (LeCun's 2022 position paper) into a practical, scalable recipe. It established the three-component architecture (context encoder, target encoder with EMA, predictor) that every subsequent JEPA variant adopts. The key empirical discovery — that masking strategy matters more than model size for learning semantic features — shaped the entire field's approach.
