DINOv2: Learning Robust Visual Features without Supervision
arXiv: 2304.07193
Date: 2023-04-14
Modality: image
Authors: Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec
Tags: self-supervised-learning, foundation-model, vision-transformer, related-work
Source: Full text
DINOv2
The scaled-up successor to DINO. DINOv2 is the primary comparison baseline for JEPA models on image tasks and is directly used as a frozen backbone in several JEPA variants.
Core idea
Combines self-distillation (from DINO) with masked image modeling (from iBOT) at large scale. Trains on LVD-142M, a curated dataset of 142M images, with ViT models up to ViT-g (~1B parameters). Produces general-purpose visual features that transfer across tasks with the backbone kept frozen, i.e. without fine-tuning.
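The combined objective can be sketched as follows. This is a minimal numpy illustration of the two loss terms, not DINOv2's implementation: it omits centering/sharpening of the teacher, the Sinkhorn-Knopp variant, the KoLeo regularizer, and the EMA teacher update, and all function names and shapes are assumptions for this sketch.

```python
import numpy as np

def softmax(x, temp):
    # Temperature-scaled softmax over the last axis.
    z = x / temp
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p_teacher, p_student):
    # Mean over examples of -sum_k p_teacher * log p_student.
    return float(-(p_teacher * np.log(p_student + 1e-9)).sum(axis=-1).mean())

def dinov2_style_loss(student_cls, teacher_cls,
                      student_patches, teacher_patches, mask,
                      t_student=0.1, t_teacher=0.04):
    """Sketch of the combined objective: a DINO self-distillation term on
    the [CLS] token plus an iBOT-style masked-patch prediction term.

    student_cls / teacher_cls: (B, D) projected [CLS] head outputs
    student_patches / teacher_patches: (B, N, D) projected patch tokens
    mask: (B, N) boolean, True where a patch was masked for the student
    """
    # DINO term: student matches the sharper (lower-temperature) teacher.
    dino = cross_entropy(softmax(teacher_cls, t_teacher),
                         softmax(student_cls, t_student))
    # iBOT term: same cross-entropy, restricted to masked patch positions.
    p_t = softmax(teacher_patches, t_teacher)[mask]
    p_s = softmax(student_patches, t_student)[mask]
    ibot = cross_entropy(p_t, p_s)
    return dino + ibot

rng = np.random.default_rng(0)
B, N, D = 4, 16, 8
loss = dinov2_style_loss(rng.normal(size=(B, D)), rng.normal(size=(B, D)),
                         rng.normal(size=(B, N, D)), rng.normal(size=(B, N, D)),
                         rng.random((B, N)) < 0.4)
```

In the real recipe both heads share the same backbone, and the teacher is an exponential moving average of the student rather than a separate network.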
Key results
- 86.1% ImageNet linear evaluation (ViT-g) — slightly above JEPA models on this benchmark
- Strong on dense tasks: depth estimation, segmentation, instance retrieval
- Features transfer across domains without adaptation
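The "linear evaluation" numbers above come from the standard linear-probe protocol: the backbone is frozen and only a softmax classifier is trained on its features. A toy numpy version, with random cluster features standing in for real DINOv2 embeddings (all sizes and names here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for frozen backbone features: in practice these would be
# DINOv2 ViT embeddings computed with no gradient into the backbone.
n, d, k = 300, 32, 3
centers = rng.normal(scale=3.0, size=(k, d))
labels = rng.integers(0, k, size=n)
feats = centers[labels] + rng.normal(size=(n, d))

# Linear probe: a single softmax layer trained on top of the features;
# the backbone parameters are never updated.
W = np.zeros((d, k))
b = np.zeros(k)
onehot = np.eye(k)[labels]
for _ in range(200):
    logits = feats @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    grad = (p - onehot) / n          # softmax cross-entropy gradient
    W -= 1.0 * feats.T @ grad
    b -= 1.0 * grad.sum(axis=0)

acc = ((feats @ W + b).argmax(axis=1) == labels).mean()
```

The protocol isolates feature quality: a high linear-probe accuracy means the classes are already (near-)linearly separable in the frozen feature space.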
Relationship to JEPA family
DINOv2 and JEPA are complementary approaches to the same goal (learning general visual features):
DINOv2 advantages:
- Slightly better on ImageNet classification (86.1% vs 85.5% for V-JEPA 2.1)
- Mature, widely-adopted, strong ecosystem
JEPA advantages:
- No augmentation dependence
- Naturally extends to video, audio, 3D, language
- Enables world models and planning natively (DINOv2 features need a separately trained dynamics model on top, as in DINO-WM)
- V-JEPA 2.1 surpasses DINOv2 on dense video tasks
DINOv2 as infrastructure for JEPA:
- C-JEPA uses frozen DINOv2 ViT-S/14 features for its VideoSAUR object encoder
- V-JEPA 2.1 uses LVD-142M (DINOv2's dataset) for image training
- DINO-WM (baseline in C-JEPA paper) builds a world model on DINOv2 features
Links
See also
- 2104.14294 (DINO) — the predecessor
- 2301.08243 (I-JEPA) — augmentation-free alternative
- 2602.11389 (C-JEPA) — uses DINOv2 as frozen backbone
- latent-prediction — JEPA vs contrastive paradigm comparison