I-JEPA
The first major concrete instantiation of JEPA for images. I-JEPA demonstrated that JEPA could learn highly semantic image representations without hand-crafted data augmentations, a key differentiator from augmentation-reliant self-supervised methods such as SimCLR and DINO.
Core idea
From a single context block, predict the representations of various target blocks in the same image. The prediction happens entirely in latent space — no pixel reconstruction.
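This core loop can be illustrated with a toy, pure-Python sketch. The `encode` and `predict` functions below are trivial stand-ins for the real ViT encoder and predictor (my own illustrative names, not from the paper); what the sketch shows is the data flow: the context encoder sees only visible patches, the target encoder sees the full image, and the loss is computed between representations, never pixels.

```python
# Toy I-JEPA step over a "1-D image" given as a list of patch features.
# encode/predict are trivial stand-ins for the ViT encoder and predictor.

def encode(patches, w=1.0):
    """Stand-in encoder: scales each patch feature by a weight."""
    return [w * p for p in patches]

def predict(context_repr, target_pos):
    """Stand-in predictor: predicts each target from the context mean."""
    mean = sum(context_repr) / len(context_repr)
    return [mean for _ in target_pos]

def ijepa_loss(patches, ctx_idx, tgt_idx, w_ctx=1.0, w_tgt=1.0):
    # Context encoder processes only the visible (context) patches.
    z_ctx = encode([patches[i] for i in ctx_idx], w_ctx)
    # Target encoder (EMA copy in the real model) processes the full image.
    h = encode(patches, w_tgt)
    pred = predict(z_ctx, tgt_idx)
    # L2 loss in latent space -- no pixel reconstruction anywhere.
    return sum((p - h[i]) ** 2 for p, i in zip(pred, tgt_idx)) / len(tgt_idx)
```

For example, with patches `[0.0, 0.0, 1.0, 1.0]`, context indices `[0, 1]`, and target indices `[2, 3]`, the stand-in predictor outputs 0.0 for both targets while the target representations are 1.0, giving a loss of 1.0.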
Key design choices
- Masking strategy is critical: target blocks must be sampled at sufficiently large scale (semantic level), and the context block must be spatially distributed enough to be informative.
- Non-generative: unlike MAE, I-JEPA does not reconstruct pixels. This forces the model to learn higher-level, more semantic features.
- Vision Transformer backbone: scales well — ViT-Huge/14 trained on ImageNet with 16 A100 GPUs in under 72 hours.
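The block-sampling behind this masking strategy can be sketched as follows. This is a minimal illustration using the scale and aspect-ratio ranges quoted later in this note; the function name and the exact rounding/clamping choices are my own assumptions, not the paper's reference implementation.

```python
import random

def sample_block(grid_h, grid_w, scale_range=(0.15, 0.7), ar_range=(0.75, 1.5)):
    """Sample one rectangular block of patches on a grid_h x grid_w patch grid.

    scale  = fraction of total image area covered by the block
    ar     = block aspect ratio (height / width)
    Returns (top, left, height, width) in patch units.
    """
    scale = random.uniform(*scale_range)
    ar = random.uniform(*ar_range)
    area = scale * grid_h * grid_w
    # Solve h*w = area with h/w = ar, then clamp to the grid.
    h = max(1, min(grid_h, round((area * ar) ** 0.5)))
    w = max(1, min(grid_w, round((area / ar) ** 0.5)))
    top = random.randint(0, grid_h - h)
    left = random.randint(0, grid_w - w)
    return top, left, h, w
```

In the real recipe, several such target blocks are sampled per image, and the context block is what remains after the target regions are removed from it, which is what forces the predictor to work from partial, spatially distributed evidence.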
Architecture details
- Context encoder: ViT-Huge/14 (632M parameters), processes only visible patches
- Target encoder: same architecture, updated via exponential moving average (EMA) of context encoder weights
- Predictor: narrow ViT, takes context embeddings + target positional embeddings, outputs predicted target representations
- Target sampling: large contiguous blocks (scale range ~0.15-0.7 of image), aspect ratio 0.75-1.5
- Loss: L2 distance between predicted and target embeddings — no decoder, no pixel loss
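The two moving parts listed above, the EMA update of the target encoder and the latent L2 loss, can be sketched in a few lines of pure Python. The momentum value `m=0.996` and the function names are illustrative assumptions (the paper uses a momentum schedule), and real parameters would be tensors rather than flat lists of floats.

```python
def ema_update(target_w, context_w, m=0.996):
    """Exponential moving average: target <- m*target + (1-m)*context.

    The target encoder receives no gradients; it only tracks the
    context encoder's weights. m=0.996 is an illustrative momentum.
    """
    return [m * t + (1 - m) * c for t, c in zip(target_w, context_w)]

def latent_l2_loss(pred, target):
    """Mean squared L2 distance between predicted and target embeddings.

    pred/target: lists of per-patch embedding vectors. No decoder,
    no pixel loss -- the comparison happens entirely in latent space.
    """
    n = sum(len(p) for p in pred)
    return sum((pi - ti) ** 2
               for p, t in zip(pred, target)
               for pi, ti in zip(p, t)) / n
```

Because the target branch is a slowly moving average with no gradient flow, the model cannot trivially collapse both encoders to a constant output, which is the standard motivation for the EMA design.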
Results
| Task | Metric | I-JEPA (ViT-H/14) | MAE (ViT-H/16) | DINO (ViT-B/16) |
|---|---|---|---|---|
| ImageNet linear | top-1 acc | 77.3% | 76.6% | 78.2% |
| ImageNet 1% labels | top-1 acc | 67.0% | 54.1% | — |
| Object counting | mean abs. error (lower is better) | 60.0 | 64.7 | — |
I-JEPA is particularly strong in low-label regimes: with only 1% of ImageNet labels it outperforms MAE by roughly 13 points, suggesting that latent prediction learns more transferable features than pixel reconstruction.
Significance in the JEPA timeline
This is the proof-of-concept that turned JEPA from a theoretical framework (LeCun's 2022 position paper) into a practical, scalable recipe. It established the three-component architecture (context encoder, EMA-updated target encoder, predictor) that subsequent JEPA variants adopt. Its key empirical finding, that masking strategy matters more than model size for learning semantic features, shaped the field's subsequent approach.
See also
- 2307.12698 (MC-JEPA) — extends to motion+content
- 2506.09985 (V-JEPA 2) — extends to video
- masking-strategies — core concept page