DINO
A self-supervised method based on self-distillation with no labels. DINO is a key predecessor and comparison point for the JEPA family — it demonstrated that self-supervised ViTs produce features with emergent semantic segmentation properties.
Core idea
Train a student network to match the output of a teacher network (a momentum-updated EMA copy of the student) across different augmented views of the same image. Collapse is avoided by centering and sharpening the teacher outputs, rather than by negative samples.
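The objective above can be sketched in a few lines. This is a minimal NumPy illustration, not the official implementation; the temperatures (tau_s=0.1, tau_t=0.04) and center momentum (m=0.9) are typical DINO-style settings, assumed here for concreteness:

```python
import numpy as np

def softmax(x, tau):
    # Temperature-scaled softmax, numerically stabilized per row.
    z = np.exp((x - x.max(axis=-1, keepdims=True)) / tau)
    return z / z.sum(axis=-1, keepdims=True)

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    # Teacher targets: centered (prevents one dimension from dominating),
    # then sharpened with a low temperature (prevents the uniform solution).
    targets = softmax(teacher_out - center, tau_t)
    # Student predictions at a higher temperature.
    log_preds = np.log(softmax(student_out, tau_s) + 1e-12)
    # Cross-entropy H(teacher, student), averaged over the batch.
    return -(targets * log_preds).sum(axis=-1).mean()

def update_center(center, teacher_out, m=0.9):
    # The center is an EMA of teacher outputs across batches.
    return m * center + (1 - m) * teacher_out.mean(axis=0)
```

Centering and sharpening pull in opposite directions: centering alone would drive the teacher toward the uniform distribution, sharpening alone toward a one-hot collapse; their combination keeps the outputs informative.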
Key differences from JEPA
| | DINO | JEPA |
|---|---|---|
| Paradigm | Joint-embedding (contrastive-like) | Joint-embedding predictive |
| Views | Augmented crops of same image | Masked regions of same image |
| Prediction | Match global features across views | Predict missing region features |
| Augmentations | Required (multi-crop, color jitter) | Not required (masking only) |
| Collapse prevention | Centering + sharpening | EMA / SIGReg |
Key results
- 80.1% top-1 ImageNet linear evaluation (ViT-Base)
- 78.3% top-1 ImageNet k-NN classification (ViT-Small)
- Self-supervised ViT features contain explicit semantic segmentation information
- Momentum encoder critical for training stability
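The momentum encoder noted above is the EMA teacher that the JEPA family later adopted. A minimal sketch of the update, with the momentum value 0.996 assumed as a typical setting:

```python
def ema_update(teacher_w, student_w, momentum=0.996):
    # Teacher weights track the student as an exponential moving average;
    # gradients never flow through the teacher, only this update touches it.
    return [momentum * t + (1 - momentum) * s
            for t, s in zip(teacher_w, student_w)]
```

In practice the momentum is ramped toward 1.0 over training, so the teacher changes more and more slowly.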
Relationship to JEPA
DINO proved that self-supervised ViTs learn remarkable features, but relies on hand-crafted augmentations. I-JEPA was explicitly designed to achieve similar quality without augmentations, using masking-based prediction instead. DINO's EMA teacher mechanism was adopted by the JEPA family. DINOv2 features are used as the frozen backbone in C-JEPA's VideoSAUR encoder.
Links
See also
- 2304.07193 (DINOv2) — the scaled-up successor
- 2301.08243 (I-JEPA) — the augmentation-free alternative
- collapse-prevention — DINO's centering vs JEPA's EMA