JEPAwiki
DINOv3
Date: 2025-08-13
Modality: image
Authors: Oriane Simeoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre + 10 more
Tags: self-supervised-learning, foundation-model, vision-transformer, dense-features, related-work
Source: Full text

The third generation of the DINO self-supervised vision foundation model from Meta AI. DINOv3 represents the current state of the art in self-supervised visual representation learning, directly competing with and complementing the JEPA family.

Core idea

DINOv3 scales self-supervised vision transformers through three strategies:

  1. Scale data and model size: careful data preparation and optimization to train larger models on more diverse data (natural images, aerial/satellite imagery, medical images)
  2. Gram anchoring: a new regularization that solves a longstanding problem — dense feature quality degrades during long training schedules in large models (ViT-L and above). Gram anchoring prevents this degradation.
  3. Post-training enhancements: strategies for flexible resolution handling, model distillation across sizes, and optional text alignment

The dense feature degradation problem

A key finding: when training DINOv2 models beyond ViT-Large (~300M params) for extended schedules, dense feature quality gradually degrades — the very property that makes DINO features valuable erodes with more training. DINOv3's Gram anchoring directly addresses this, maintaining dense feature quality throughout training. This is relevant because V-JEPA 2.1 also focused on dense features via its Dense Predictive Loss — both approaches recognize dense feature quality as a critical frontier.
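The idea behind Gram anchoring can be sketched in a few lines: regularize the student's patch-to-patch similarity structure (its Gram matrix) toward that of an earlier checkpoint, so long schedules cannot erode local feature structure. The sketch below is a toy numpy illustration under assumed shapes; the function names and the ridge of detail are illustrative, not the paper's implementation.

```python
import numpy as np

def gram(feats):
    # L2-normalize patch features, then take pairwise similarities (Gram matrix)
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return f @ f.T

def gram_anchoring_loss(student_feats, anchor_feats):
    # Penalize drift of the student's patch-similarity structure away from
    # that of an earlier "Gram teacher" checkpoint (Frobenius-style distance)
    return np.mean((gram(student_feats) - gram(anchor_feats)) ** 2)

rng = np.random.default_rng(0)
anchor = rng.standard_normal((16, 8))              # 16 patch tokens, 8-dim (toy sizes)
student = anchor + 0.1 * rng.standard_normal((16, 8))
loss = gram_anchoring_loss(student, anchor)        # small but nonzero
```

Note that the loss constrains only relative similarities between patches, not the features themselves, so global representation quality can keep improving while dense structure is preserved.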

Key results

DINOv3 significantly outperforms DINOv2 and weakly-supervised models:

  • Outperforms specialized SOTA across a broad range of vision tasks without fine-tuning
  • High-quality dense features: outstanding performance on segmentation, depth estimation, object detection
  • Domain generality: works on natural images, aerial/satellite imagery, medical imaging, histopathology
  • Surpasses weakly-supervised foundation models (CLIP, SigLIP, Perception Encoder) on dense tasks
  • ImageNet linear probing reaches the accuracy plateau of recent years — SSL has caught up with supervised and weakly-supervised methods
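Linear probing, the protocol behind the last bullet, trains only a linear classifier on frozen backbone features. A minimal numpy sketch, using synthetic stand-in features (the data and ridge solver here are hypothetical, not DINOv3's evaluation code):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for frozen backbone features: two well-separated Gaussian blobs
X = np.vstack([rng.standard_normal((50, 16)) + 2,
               rng.standard_normal((50, 16)) - 2])
y = np.array([0] * 50 + [1] * 50)

# Linear probe = one linear layer on frozen features; fit here in closed
# form as ridge regression onto one-hot targets (no backbone gradients)
Y = np.eye(2)[y]
W = np.linalg.solve(X.T @ X + 1e-3 * np.eye(16), X.T @ Y)
acc = (np.argmax(X @ W, axis=1) == y).mean()
```

Because the backbone stays frozen, probe accuracy measures the quality of the pretrained features themselves, which is why it is the standard yardstick for comparing SSL to supervised pretraining.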

DINOv3 vs JEPA: complementary or competitive?

DINOv3 and the JEPA family represent two parallel paths within Meta AI's self-supervised learning research:

|                     | DINOv3                         | JEPA family                                     |
| ------------------- | ------------------------------ | ----------------------------------------------- |
| Paradigm            | Self-distillation (invariance) | Predictive (latent masking)                     |
| Modalities          | Images only                    | Images, video, audio, 3D, language              |
| Dense features      | Excellent (Gram anchoring)     | Excellent (Dense Predictive Loss in V-JEPA 2.1) |
| World models        | No                             | Yes (V-JEPA 2-AC, LeWorldModel)                 |
| Planning            | No                             | Yes (CEM in latent space)                       |
| Video understanding | Not native                     | Native (V-JEPA, V-JEPA 2)                       |

Where DINOv3 is used inside JEPA

DINOv2/v3 features are not just competitors to JEPA — they are also infrastructure for it:

  • C-JEPA uses frozen DINOv2 features for its VideoSAUR object encoder
  • V-JEPA 2.1 trains on LVD-142M, DINOv2's curated image dataset
  • REPA uses DINOv2 as the external alignment target for diffusion models

DINOv3's improved dense features could directly benefit these JEPA pipelines.

What JEPA does that DINO cannot

DINO (all versions) learns invariance — what stays the same across augmented views. JEPA learns prediction — what must be true about unseen parts. Only prediction supports:

  • Temporal dynamics (what happens next?)
  • Action conditioning (what happens if I act?)
  • Planning (which future is best?)
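The invariance-versus-prediction distinction can be made concrete with two toy losses. This is a conceptual numpy sketch with made-up vectors and a made-up linear predictor, not either method's actual objective:

```python
import numpy as np

rng = np.random.default_rng(0)
z_a, z_b = rng.standard_normal(8), rng.standard_normal(8)

def invariance_loss(z1, z2):
    # DINO-style: pull embeddings of two augmented views of the SAME
    # content together (cosine-agreement form, for illustration)
    return 1 - z1 @ z2 / (np.linalg.norm(z1) * np.linalg.norm(z2))

def prediction_loss(z_context, z_target, W_pred):
    # JEPA-style: from a visible context, predict the latent of an
    # unseen region through a predictor (toy linear predictor here)
    return np.mean((W_pred @ z_context - z_target) ** 2)

W = 0.1 * rng.standard_normal((8, 8))
l_inv = invariance_loss(z_a, z_b)       # symmetric: no notion of "next"
l_pred = prediction_loss(z_a, z_b, W)   # directed: context -> target
```

The asymmetry is the point: a predictor can be conditioned on time offsets or actions, which is what opens the door to dynamics and planning; an invariance loss has no such hook.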

DINOv3 is the best static feature extractor. JEPA is the path to world models. They serve different purposes and are likely to coexist.