I-JEPA
The first major concrete instantiation of JEPA for images. I-JEPA demonstrated that JEPA could learn highly semantic image representations without hand-crafted data augmentations, a key differentiator from augmentation-reliant self-supervised methods such as SimCLR and DINO.
Core idea
From a single context block, predict the representations of various target blocks in the same image. The prediction happens entirely in latent space — no pixel reconstruction.
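This core loop can be illustrated with a toy, pure-Python sketch. The `encode` and `predict` functions below are trivial stand-ins for the real ViT encoder and predictor (my own illustrative names, not from the paper); what the sketch shows is the data flow: the context encoder sees only visible patches, the target encoder sees the full image, and the loss is computed between representations, never pixels.

```python
# Toy I-JEPA step over a "1-D image" given as a list of patch features.
# encode/predict are trivial stand-ins for the ViT encoder and predictor.

def encode(patches, w=1.0):
    """Stand-in encoder: scales each patch feature by a weight."""
    return [w * p for p in patches]

def predict(context_repr, target_pos):
    """Stand-in predictor: predicts each target from the context mean."""
    mean = sum(context_repr) / len(context_repr)
    return [mean for _ in target_pos]

def ijepa_loss(patches, ctx_idx, tgt_idx, w_ctx=1.0, w_tgt=1.0):
    # Context encoder processes only the visible (context) patches.
    z_ctx = encode([patches[i] for i in ctx_idx], w_ctx)
    # Target encoder (EMA copy in the real model) processes the full image.
    h = encode(patches, w_tgt)
    pred = predict(z_ctx, tgt_idx)
    # L2 loss in latent space -- no pixel reconstruction anywhere.
    return sum((p - h[i]) ** 2 for p, i in zip(pred, tgt_idx)) / len(tgt_idx)
```

For example, with patches `[0.0, 0.0, 1.0, 1.0]`, context indices `[0, 1]`, and target indices `[2, 3]`, the stand-in predictor outputs 0.0 for both targets while the target representations are 1.0, giving a loss of 1.0.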
Key design choices
- Masking strategy is critical: target blocks must be sampled at sufficiently large scale (semantic level), and the context block must be spatially distributed enough to be informative.
- Non-generative: unlike MAE, I-JEPA does not reconstruct pixels. This forces the model to learn higher-level, more semantic features.
- Vision Transformer backbone: scales well — ViT-Huge/14 trained on ImageNet with 16 A100 GPUs in under 72 hours.
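The block-sampling behind this masking strategy can be sketched as follows. This is a minimal illustration using the scale and aspect-ratio ranges quoted later in this note; the function name and the exact rounding/clamping choices are my own assumptions, not the paper's reference implementation.

```python
import random

def sample_block(grid_h, grid_w, scale_range=(0.15, 0.7), ar_range=(0.75, 1.5)):
    """Sample one rectangular block of patches on a grid_h x grid_w patch grid.

    scale  = fraction of total image area covered by the block
    ar     = block aspect ratio (height / width)
    Returns (top, left, height, width) in patch units.
    """
    scale = random.uniform(*scale_range)
    ar = random.uniform(*ar_range)
    area = scale * grid_h * grid_w
    # Solve h*w = area with h/w = ar, then clamp to the grid.
    h = max(1, min(grid_h, round((area * ar) ** 0.5)))
    w = max(1, min(grid_w, round((area / ar) ** 0.5)))
    top = random.randint(0, grid_h - h)
    left = random.randint(0, grid_w - w)
    return top, left, h, w
```

In the real recipe, several such target blocks are sampled per image, and the context block is what remains after the target regions are removed from it, which is what forces the predictor to work from partial, spatially distributed evidence.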
Architecture details
- Context encoder: ViT-Huge/14 (632M parameters), processes only visible patches
- Target encoder: same architecture, updated via exponential moving average (EMA) of context encoder weights
- Predictor: narrow ViT, takes context embeddings + target positional embeddings, outputs predicted target representations
- Target sampling: large contiguous blocks (scale range ~0.15-0.7 of image), aspect ratio 0.75-1.5
- Loss: L2 distance between predicted and target embeddings — no decoder, no pixel loss
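The two moving parts listed above, the EMA update of the target encoder and the latent L2 loss, can be sketched in a few lines of pure Python. The momentum value `m=0.996` and the function names are illustrative assumptions (the paper uses a momentum schedule), and real parameters would be tensors rather than flat lists of floats.

```python
def ema_update(target_w, context_w, m=0.996):
    """Exponential moving average: target <- m*target + (1-m)*context.

    The target encoder receives no gradients; it only tracks the
    context encoder's weights. m=0.996 is an illustrative momentum.
    """
    return [m * t + (1 - m) * c for t, c in zip(target_w, context_w)]

def latent_l2_loss(pred, target):
    """Mean squared L2 distance between predicted and target embeddings.

    pred/target: lists of per-patch embedding vectors. No decoder,
    no pixel loss -- the comparison happens entirely in latent space.
    """
    n = sum(len(p) for p in pred)
    return sum((pi - ti) ** 2
               for p, t in zip(pred, target)
               for pi, ti in zip(p, t)) / n
```

Because the target branch is a slowly moving average with no gradient flow, the model cannot trivially collapse both encoders to a constant output, which is the standard motivation for the EMA design.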
Results
| Task | Metric | I-JEPA (ViT-H/14) | MAE (ViT-H/16) | DINO (ViT-B/16) |
|---|---|---|---|---|
| ImageNet linear | top-1 acc | 77.3% | 76.6% | 78.2% |
| ImageNet 1% labels | top-1 acc | 67.0% | 54.1% | — |
| Object counting | mean abs. error (lower is better) | 60.0 | 64.7 | — |
I-JEPA is particularly strong in low-label regimes: with only 1% of ImageNet labels it outperforms MAE by roughly 13 points, suggesting that latent prediction learns more transferable features than pixel reconstruction.
Significance in the JEPA timeline
This is the proof-of-concept that turned JEPA from a theoretical framework (LeCun's 2022 position paper) into a practical, scalable recipe. It established the three-component architecture (context encoder, EMA-updated target encoder, predictor) that subsequent JEPA variants adopt. Its key empirical finding, that masking strategy matters more than model size for learning semantic features, shaped the field's subsequent approach.
See also
- 2307.12698 (MC-JEPA) — extends to motion+content
- 2506.09985 (V-JEPA 2) — extends to video
- masking-strategies — core concept page