JEPAwiki
DINOv2: Learning Robust Visual Features without Supervision
Date: 2023-04-14
Modality: image
Authors: Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec
Tags: self-supervised-learning, foundation-model, vision-transformer, related-work
Source: Full text

DINOv2

The scaled-up successor to DINO. DINOv2 is the primary comparison baseline for JEPA models on image tasks and is directly used as a frozen backbone in several JEPA variants.

Core idea

Combines self-distillation (from DINO) with masked image modeling (from iBOT) at large scale. Trained on LVD-142M, a curated dataset of 142M images, with ViT models up to ViT-g (1B parameters). Produces general-purpose visual features that work across tasks without fine-tuning.
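The combined objective can be sketched with toy tensors: a DINO-style term distills the teacher's [CLS] distribution over prototypes into the student, and an iBOT-style term applies the same distillation to masked patch tokens. A minimal numpy sketch; the shapes, temperatures, and random "projections" are illustrative stand-ins, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, temp):
    z = x / temp
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(teacher_probs, student_logits, temp):
    # H(teacher, student): teacher outputs are soft targets over prototypes.
    log_p = np.log(softmax(student_logits, temp) + 1e-9)
    return -(teacher_probs * log_p).sum(axis=-1).mean()

# Toy projections onto K prototype dimensions (stand-ins for ViT head outputs).
K = 8
cls_student, cls_teacher = rng.normal(size=(2, 4, K))          # [CLS] token heads
patch_student, patch_teacher = rng.normal(size=(2, 4, 16, K))  # patch token heads
mask = rng.random((4, 16)) < 0.4                               # masked patch positions

# DINO term: distill the sharpened teacher [CLS] distribution into the student.
dino_loss = cross_entropy(softmax(cls_teacher, 0.04), cls_student, 0.1)

# iBOT term: same distillation, restricted to patches the student saw masked.
ibot_loss = cross_entropy(
    softmax(patch_teacher[mask], 0.04), patch_student[mask], 0.1)

total_loss = dino_loss + ibot_loss
```

The lower teacher temperature (0.04 vs 0.1) sharpens its distribution, the same asymmetry DINO uses to avoid collapse.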

Key results

  • 86.1% ImageNet linear evaluation (ViT-g) — slightly above JEPA models on this benchmark
  • Strong on dense tasks: depth estimation, segmentation, instance retrieval
  • Features transfer across domains without adaptation
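Linear evaluation, the protocol behind the 86.1% figure, trains only a linear classifier on frozen backbone features. A minimal sketch with random vectors standing in for DINOv2 embeddings (dimensions and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen-backbone linear evaluation: the encoder is never updated; only one
# linear layer is trained on top of its (here synthetic) features.
n, d, c = 512, 64, 5
features = rng.normal(size=(n, d))   # stand-in for frozen DINOv2 embeddings
labels = rng.integers(0, c, size=n)
onehot = np.eye(c)[labels]

W = np.zeros((d, c))
b = np.zeros(c)

for _ in range(200):  # plain softmax regression by gradient descent
    logits = features @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    grad = (probs - onehot) / n
    W -= 0.5 * features.T @ grad
    b -= 0.5 * grad.sum(axis=0)

accuracy = (np.argmax(features @ W + b, axis=1) == labels).mean()
```

Because the backbone is frozen, this score measures the quality of the features themselves rather than of any task-specific fine-tuning.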

Relationship to JEPA family

DINOv2 and JEPA are complementary approaches to the same goal (learning general visual features):

DINOv2 advantages:

  • Slightly better on ImageNet classification (86.1% vs 85.5% for V-JEPA 2.1)
  • Mature, widely-adopted, strong ecosystem

JEPA advantages:

  • No augmentation dependence
  • Naturally extends to video, audio, 3D, language
  • Enables world models and planning directly in representation space (DINOv2 features require a separately trained dynamics model, as in DINO-WM)
  • V-JEPA 2.1 surpasses DINOv2 on dense video tasks

DINOv2 as infrastructure for JEPA:

  • C-JEPA uses frozen DINOv2 ViT-S/14 features for its VideoSAUR object encoder
  • V-JEPA 2.1 uses LVD-142M (DINOv2's dataset) for image training
  • DINO-WM (baseline in C-JEPA paper) builds a world model on DINOv2 features
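Using DINOv2 as frozen infrastructure typically follows the pattern below: load a pretrained backbone, disable gradients, and treat its outputs as fixed features. The `torch.hub` entry point is the one published in the DINOv2 repository; preprocessing (resizing to multiples of 14, ImageNet normalization) is assumed done by the caller:

```python
import torch

def extract_frozen_features(images: torch.Tensor) -> torch.Tensor:
    """Embed a batch of images with a frozen DINOv2 ViT-S/14 backbone.

    images: (B, 3, H, W), H and W multiples of 14, ImageNet-normalized.
    Returns (B, 384) [CLS] embeddings. Downloads weights on first call.
    """
    model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)  # frozen: features only, no fine-tuning
    with torch.no_grad():
        return model(images)
```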
