JEPAwiki
DINO: Emerging Properties in Self-Supervised Vision Transformers
Date: 2021-04-29
Modality: image
Authors: Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal
Tags: self-supervised-learning, vision-transformer, self-distillation, related-work
Source: Abstract only

DINO

A self-supervised method based on self-distillation with no labels (hence the name). DINO is a key predecessor and comparison point for the JEPA family: it demonstrated that self-supervised ViTs produce features with emergent semantic segmentation properties.

Core idea

Train a student network to match the output of a teacher network (a momentum-updated EMA copy of the student) across different augmented views of the same image. Collapse is avoided by centering and sharpening the teacher outputs rather than by using negative samples.
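The teacher-side centering and sharpening can be sketched as follows. This is a minimal illustration in numpy, not the paper's implementation; the temperature and momentum values are assumptions chosen for illustration (with the teacher temperature lower than the student's, which is what sharpens the teacher distribution).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between sharpened teacher and student distributions.

    student_out, teacher_out: (batch, dim) projection-head outputs for two
    views of the same images; center: running mean of teacher outputs.
    """
    # Teacher: center (subtract running mean), then sharpen (low temperature).
    t = softmax((teacher_out - center) / tau_t)
    # Student distribution at a higher temperature.
    s = softmax(student_out / tau_s)
    # Cross-entropy H(teacher, student), averaged over the batch.
    return -(t * np.log(s + 1e-12)).sum(axis=-1).mean()

def update_center(center, teacher_out, momentum=0.9):
    # EMA of teacher outputs; subtracting this center discourages collapse
    # to a single dominant output dimension, while sharpening discourages
    # collapse to a uniform distribution.
    return momentum * center + (1 - momentum) * teacher_out.mean(axis=0)
```

Centering and sharpening push in opposite directions, which is why the method needs both: either one alone admits a trivial collapsed solution.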

Key differences from JEPA

Aspect               DINO                                 JEPA
Paradigm             Joint-embedding (contrastive-like)   Joint-embedding predictive
Views                Augmented crops of the same image    Masked regions of the same image
Prediction           Match global features across views   Predict missing region features
Augmentations        Required (multi-crop, color jitter)  Not required (masking only)
Collapse prevention  Centering + sharpening               EMA / SIGReg

Key results

  • 80.1% top-1 on ImageNet in linear evaluation (ViT-Base)
  • 78.3% top-1 on ImageNet with k-NN classification (ViT-Small)
  • Self-supervised ViT features contain explicit semantic segmentation information
  • Momentum encoder critical for training stability

Relationship to JEPA

DINO showed that self-supervised ViTs learn strong semantic features, but it relies on hand-crafted augmentations. I-JEPA was explicitly designed to reach similar feature quality without augmentations, using masking-based prediction instead. DINO's EMA teacher mechanism was adopted by the JEPA family, and DINOv2 features serve as the frozen backbone in C-JEPA's VideoSAUR encoder.

Links

See also