V-JEPA (Original)
The first video JEPA — a critical link between I-JEPA (images) and V-JEPA 2 (world models). V-JEPA demonstrated that feature prediction is an effective stand-alone objective for unsupervised learning from video.
Core idea
Train vision models solely through feature prediction — no pretrained image encoders, no text supervision, no negative examples, no pixel reconstruction. Combines masked modeling with a joint-embedding predictive architecture, trained on 2 million videos.
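The objective can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the encoders are plain linear maps, the loss is L2 rather than the paper's L1, only the predictor is updated by (hand-derived) gradient descent, and all dimensions are made-up. The essential structure is the paper's, though: predict the *features* of masked tokens (computed by an EMA target encoder, with no gradient flowing through it) from the features of visible tokens — never pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen for illustration only
D_IN, D_LAT, N_TOKENS = 16, 8, 10

W_ctx = rng.normal(size=(D_IN, D_LAT)) * 0.1  # context encoder (fixed in this sketch)
W_tgt = W_ctx.copy()                          # target encoder, tracked by EMA
W_pred = np.eye(D_LAT)                        # predictor: context latents -> target latents
EMA = 0.999                                   # target-encoder momentum

def step(video_tokens, mask, lr=0.1):
    """One V-JEPA-style update: predict masked tokens' target features
    from visible-token features. No pixel reconstruction anywhere."""
    global W_pred, W_tgt
    ctx = video_tokens[~mask] @ W_ctx               # features of visible tokens
    tgt = video_tokens[mask] @ W_tgt                # target features of masked tokens (stop-grad)
    pred = ctx.mean(axis=0, keepdims=True) @ W_pred # crude pooled prediction
    err = pred - tgt.mean(axis=0, keepdims=True)
    loss = float((err ** 2).mean())                 # L2 here; the paper uses an L1 loss
    # Manual gradient step on the predictor only (real training also
    # backprops into the context encoder)
    W_pred -= lr * (ctx.mean(axis=0, keepdims=True).T @ err) * (2 / err.size)
    W_tgt = EMA * W_tgt + (1 - EMA) * W_ctx         # EMA update of the target encoder
    return loss

tokens = rng.normal(size=(N_TOKENS, D_IN))
mask = np.zeros(N_TOKENS, dtype=bool)
mask[:4] = True                                     # mask a contiguous block of tokens
losses = [step(tokens, mask) for _ in range(100)]
```

The EMA target encoder and the stop-gradient on `tgt` are what prevent the trivial collapse that a naive "predict your own features" objective would suffer.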
Key question answered
How effective is feature prediction as a stand-alone objective for unsupervised learning from video with modern tools?
Answer: very effective. Feature prediction leads to versatile representations that work well on both motion-based and appearance-based tasks, using a frozen backbone.
Key results
- 81.9% on Kinetics-400 (ViT-H/16, frozen backbone)
- 72.2% on Something-Something-v2 (motion understanding, +6% over competing methods)
- 77.9% on ImageNet-1K
- Superior to pixel-prediction approaches under frozen evaluation
- Significantly shorter training schedules than pixel prediction methods
- More label-efficient than pixel reconstruction approaches
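The "frozen backbone" numbers above come from training only a small probe on top of fixed pretrained features. A minimal sketch of that protocol, with assumptions: the frozen encoder is stood in for by a fixed random linear map, the probe is plain softmax regression (the paper uses an attentive probe), and the labeled data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sizes and a stand-in "pretrained" backbone that is never updated
D_IN, D_FEAT, N_CLASSES, N = 32, 16, 4, 400
W_frozen = rng.normal(size=(D_IN, D_FEAT)) / np.sqrt(D_IN)

# Synthetic labeled data: well-separated class means plus noise
means = rng.normal(size=(N_CLASSES, D_IN)) * 2.0
y = rng.integers(0, N_CLASSES, size=N)
X = means[y] + rng.normal(size=(N, D_IN))

feats = X @ W_frozen  # extract features once; the backbone stays frozen

# Train a linear probe (softmax regression) on the frozen features
W_probe = np.zeros((D_FEAT, N_CLASSES))
for _ in range(300):
    logits = feats @ W_probe
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(N), y] -= 1.0            # gradient of cross-entropy wrt logits
    W_probe -= 0.1 * feats.T @ p / N

acc = float((np.argmax(feats @ W_probe, axis=1) == y).mean())
```

Because only `W_probe` is trained, the protocol measures what the pretrained representation already encodes — which is why frozen evaluation is the setting where feature prediction's advantage over pixel reconstruction shows most clearly.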
Key findings
- Feature prediction produces versatile representations for both motion and appearance tasks without any adaptation of the backbone's parameters
- Feature prediction is superior to pixel prediction under frozen evaluation, competitive under fine-tuning
- Feature prediction models are more label-efficient — the gap widens as labeled examples decrease
Significance in the JEPA timeline
The leap from images to video. V-JEPA showed that JEPA's latent prediction principle scales naturally to temporal data, learning strong motion representations without ever reconstructing pixels. This directly enabled V-JEPA 2's scaling to 1M+ hours and world modeling.
See also
- 2301.08243 (I-JEPA) — the image foundation
- 2506.09985 (V-JEPA 2) — the scaled-up successor with world modeling
- 2603.14482 (V-JEPA 2.1) — the dense feature upgrade
- latent-prediction — the core principle