DINOv2: Learning Robust Visual Features without Supervision
arXiv: 2304.07193
Date: 2023-04-14
Modality: image
Authors: Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec
Tags: self-supervised-learning, foundation-model, vision-transformer, related-work
Source: Full text
DINOv2
The scaled-up successor to DINO. DINOv2 is the primary comparison baseline for JEPA models on image tasks and is directly used as a frozen backbone in several JEPA variants.
Core idea
Combines self-distillation (from DINO) with masked image modeling (from iBOT) at large scale. Trains on LVD-142M, a curated dataset of 142M images, with ViT models up to ViT-g (~1B parameters). Produces general-purpose visual features that transfer across tasks with the backbone kept frozen, i.e. without fine-tuning.
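The combined objective can be sketched as follows. This is a minimal numpy illustration of the two loss terms, not DINOv2's implementation: it omits centering/sharpening of the teacher, the Sinkhorn-Knopp variant, the KoLeo regularizer, and the EMA teacher update, and all function names and shapes are assumptions for this sketch.

```python
import numpy as np

def softmax(x, temp):
    # Temperature-scaled softmax over the last axis.
    z = x / temp
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p_teacher, p_student):
    # Mean over examples of -sum_k p_teacher * log p_student.
    return float(-(p_teacher * np.log(p_student + 1e-9)).sum(axis=-1).mean())

def dinov2_style_loss(student_cls, teacher_cls,
                      student_patches, teacher_patches, mask,
                      t_student=0.1, t_teacher=0.04):
    """Sketch of the combined objective: a DINO self-distillation term on
    the [CLS] token plus an iBOT-style masked-patch prediction term.

    student_cls / teacher_cls: (B, D) projected [CLS] head outputs
    student_patches / teacher_patches: (B, N, D) projected patch tokens
    mask: (B, N) boolean, True where a patch was masked for the student
    """
    # DINO term: student matches the sharper (lower-temperature) teacher.
    dino = cross_entropy(softmax(teacher_cls, t_teacher),
                         softmax(student_cls, t_student))
    # iBOT term: same cross-entropy, restricted to masked patch positions.
    p_t = softmax(teacher_patches, t_teacher)[mask]
    p_s = softmax(student_patches, t_student)[mask]
    ibot = cross_entropy(p_t, p_s)
    return dino + ibot

rng = np.random.default_rng(0)
B, N, D = 4, 16, 8
loss = dinov2_style_loss(rng.normal(size=(B, D)), rng.normal(size=(B, D)),
                         rng.normal(size=(B, N, D)), rng.normal(size=(B, N, D)),
                         rng.random((B, N)) < 0.4)
```

In the real recipe both heads share the same backbone, and the teacher is an exponential moving average of the student rather than a separate network.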
Key results
- 86.1% ImageNet linear evaluation (ViT-g) — slightly above JEPA models on this benchmark
- Strong on dense tasks: depth estimation, segmentation, instance retrieval
- Features transfer across domains without adaptation
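The "linear evaluation" numbers above come from the standard linear-probe protocol: the backbone is frozen and only a softmax classifier is trained on its features. A toy numpy version, with random cluster features standing in for real DINOv2 embeddings (all sizes and names here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for frozen backbone features: in practice these would be
# DINOv2 ViT embeddings computed with no gradient into the backbone.
n, d, k = 300, 32, 3
centers = rng.normal(scale=3.0, size=(k, d))
labels = rng.integers(0, k, size=n)
feats = centers[labels] + rng.normal(size=(n, d))

# Linear probe: a single softmax layer trained on top of the features;
# the backbone parameters are never updated.
W = np.zeros((d, k))
b = np.zeros(k)
onehot = np.eye(k)[labels]
for _ in range(200):
    logits = feats @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    grad = (p - onehot) / n          # softmax cross-entropy gradient
    W -= 1.0 * feats.T @ grad
    b -= 1.0 * grad.sum(axis=0)

acc = ((feats @ W + b).argmax(axis=1) == labels).mean()
```

The protocol isolates feature quality: a high linear-probe accuracy means the classes are already (near-)linearly separable in the frozen feature space.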
Relationship to JEPA family
DINOv2 and JEPA are complementary approaches to the same goal (learning general visual features):
DINOv2 advantages:
- Slightly better on ImageNet classification (86.1% vs 85.5% for V-JEPA 2.1)
- Mature, widely-adopted, strong ecosystem
JEPA advantages:
- No augmentation dependence
- Naturally extends to video, audio, 3D, language
- Enables world models and planning natively (DINOv2 features need a separately trained dynamics model on top, as in DINO-WM)
- V-JEPA 2.1 surpasses DINOv2 on dense video tasks
DINOv2 as infrastructure for JEPA:
- C-JEPA uses frozen DINOv2 ViT-S/14 features for its VideoSAUR object encoder
- V-JEPA 2.1 uses LVD-142M (DINOv2's dataset) for image training
- DINO-WM (baseline in C-JEPA paper) builds a world model on DINOv2 features
Links
See also
- 2104.14294 (DINO) — the predecessor
- 2301.08243 (I-JEPA) — augmentation-free alternative
- 2602.11389 (C-JEPA) — uses DINOv2 as frozen backbone
- latent-prediction — JEPA vs contrastive paradigm comparison