JEPAwiki
DINO: Emerging Properties in Self-Supervised Vision Transformers
Date: 2021-04-29
Modality: image
Authors: Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal
Tags: self-supervised-learning, vision-transformer, self-distillation, related-work
Source: Abstract only

DINO

A self-supervised method based on self-distillation with no labels (hence the name). DINO is a key predecessor and comparison point for the JEPA family: it demonstrated that self-supervised ViTs produce features with emergent semantic segmentation properties.

Core idea

Train a student network to match the output of a teacher network (a momentum-updated EMA copy of the student) across different augmented views of the same image. Collapse is avoided by centering and sharpening the teacher outputs rather than by using negative samples.
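The teacher-side centering and sharpening can be sketched as follows. This is a minimal illustration in numpy, not the paper's implementation; the temperature and momentum values are assumptions chosen for illustration (with the teacher temperature lower than the student's, which is what sharpens the teacher distribution).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between sharpened teacher and student distributions.

    student_out, teacher_out: (batch, dim) projection-head outputs for two
    views of the same images; center: running mean of teacher outputs.
    """
    # Teacher: center (subtract running mean), then sharpen (low temperature).
    t = softmax((teacher_out - center) / tau_t)
    # Student distribution at a higher temperature.
    s = softmax(student_out / tau_s)
    # Cross-entropy H(teacher, student), averaged over the batch.
    return -(t * np.log(s + 1e-12)).sum(axis=-1).mean()

def update_center(center, teacher_out, momentum=0.9):
    # EMA of teacher outputs; subtracting this center discourages collapse
    # to a single dominant output dimension, while sharpening discourages
    # collapse to a uniform distribution.
    return momentum * center + (1 - momentum) * teacher_out.mean(axis=0)
```

Centering and sharpening push in opposite directions, which is why the method needs both: either one alone admits a trivial collapsed solution.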

Key differences from JEPA

Aspect               DINO                                 JEPA
Paradigm             Joint-embedding (contrastive-like)   Joint-embedding predictive
Views                Augmented crops of the same image    Masked regions of the same image
Prediction           Match global features across views   Predict missing region features
Augmentations        Required (multi-crop, color jitter)  Not required (masking only)
Collapse prevention  Centering + sharpening               EMA / SIGReg

Key results

  • 80.1% top-1 on ImageNet in linear evaluation (ViT-Base)
  • 78.3% top-1 on ImageNet with k-NN classification (ViT-Small)
  • Self-supervised ViT features contain explicit semantic segmentation information
  • Momentum encoder critical for training stability

Relationship to JEPA

DINO showed that self-supervised ViTs learn strong semantic features, but it relies on hand-crafted augmentations. I-JEPA was explicitly designed to reach similar feature quality without augmentations, using masking-based prediction instead. DINO's EMA teacher mechanism was adopted by the JEPA family, and DINOv2 features serve as the frozen backbone in C-JEPA's VideoSAUR encoder.

Links

See also