V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video
Date: 2024-04-12
Modality: video
Authors: Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen + 4 more
Tags: video, self-supervised-learning, feature-prediction, foundational
Source: Full text

V-JEPA (Original)

The first video JEPA — a critical link between I-JEPA (images) and V-JEPA 2 (world models). V-JEPA demonstrated that feature prediction is an effective stand-alone objective for unsupervised learning from video.

Core idea

Train vision models solely using feature prediction — no pretrained image encoders, no text supervision, no negative examples, no pixel reconstruction. V-JEPA combines masked modeling with a joint-embedding predictive architecture, trained on a dataset of 2 million videos.
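The objective above can be sketched in a few lines. This is a minimal, hypothetical numpy sketch of the JEPA-style feature-prediction loop — linear maps stand in for the ViT context/target encoders and the narrow transformer predictor, and all sizes and names are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy video clip flattened into 16 patch tokens with 64-dim features (hypothetical sizes).
tokens = rng.normal(size=(16, 64))

# Stand-ins for the encoders/predictor: simple linear maps instead of transformers.
W_ctx = rng.normal(size=(64, 32)) * 0.1   # context encoder (trained)
W_tgt = W_ctx.copy()                      # target encoder starts as a copy
W_pred = rng.normal(size=(32, 32)) * 0.1  # predictor (trained)

# Mask a contiguous block of tokens, as in masked modeling.
mask = np.zeros(16, dtype=bool)
mask[4:12] = True

# Context encoder sees only unmasked tokens; target encoder embeds the masked
# ones (no gradient flows through the target branch in the real method).
ctx = tokens[~mask] @ W_ctx
tgt = tokens[mask] @ W_tgt

# Predictor maps pooled context to a prediction for each masked token's
# *features* — never its pixels (the real predictor is conditioned on positions).
pred = np.repeat(ctx.mean(axis=0, keepdims=True), mask.sum(), axis=0) @ W_pred

# The loss is a distance between predicted and target features (L1 here).
loss = np.abs(pred - tgt).mean()

# The target encoder is an exponential moving average of the context encoder,
# which is one of the mechanisms that prevents representation collapse.
momentum = 0.998
W_tgt = momentum * W_tgt + (1 - momentum) * W_ctx
```

Because prediction happens in feature space, the model is free to discard unpredictable pixel-level detail — the property the paper credits for its versatile representations.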

Key question answered

How effective is feature prediction as a stand-alone objective for unsupervised learning from video with modern tools?

Answer: very effective. Feature prediction leads to versatile representations that work well on both motion-based and appearance-based tasks, using a frozen backbone.

Key results

  • 81.9% on Kinetics-400 (ViT-H/16, frozen backbone)
  • 72.2% on Something-Something-v2 (motion understanding, +6% over competing methods)
  • 77.9% on ImageNet-1K
  • Superior to pixel-prediction approaches under frozen evaluation
  • Significantly shorter training schedules than pixel prediction methods
  • More label-efficient than pixel reconstruction approaches
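The "frozen backbone" protocol behind these numbers means the pretrained encoder is never updated: features are extracted once and only a lightweight probe is trained on top (V-JEPA uses an attentive probe; the sketch below uses a plain logistic-regression probe for brevity, with a random projection as a hypothetical stand-in for the frozen encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen encoder: a fixed random projection standing in for a
# pretrained ViT backbone whose weights are never updated during evaluation.
W_frozen = rng.normal(size=(64, 32)) * 0.1
def encode(x):
    return np.maximum(x @ W_frozen, 0.0)   # frozen features

# Toy binary-classification probe data.
X = rng.normal(size=(200, 64))
y = (X[:, 0] > 0).astype(float)
feats = encode(X)                          # extract once; backbone stays fixed

# Train only the probe (logistic regression via gradient descent).
w = np.zeros(32)
b = 0.0
lr = 0.5
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    w -= lr * feats.T @ (p - y) / len(y)
    b -= lr * (p - y).mean()

p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
acc = ((p > 0.5) == (y > 0.5)).mean()      # probe accuracy on frozen features
```

Frozen evaluation is a stricter test than fine-tuning: the probe can only exploit whatever structure the pretraining objective already put into the features, which is why the paper highlights V-JEPA's advantage under this protocol.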

Key findings

  1. Feature prediction produces versatile representations for both motion and appearance tasks without parameter adaptation
  2. Feature prediction is superior to pixel prediction under frozen evaluation, competitive under fine-tuning
  3. Feature prediction models are more label-efficient — the gap widens as labeled examples decrease

Significance in the JEPA timeline

The leap from images to video. V-JEPA showed that JEPA's latent prediction principle scales naturally to temporal data, learning strong motion representations without ever reconstructing pixels. This directly enabled V-JEPA 2's scaling to 1M+ hours and world modeling.
