V-JEPA (Original)
The first video JEPA — a critical link between I-JEPA (images) and V-JEPA 2 (world models). V-JEPA demonstrated that feature prediction is an effective stand-alone objective for unsupervised learning from video.
Core idea
Train vision models solely through feature prediction — no pretrained image encoders, no text supervision, no negative examples, no pixel reconstruction. Combines masked modeling with a joint-embedding predictive architecture, trained on 2 million videos.
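The objective can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the encoders are plain linear maps, the loss is L2 rather than the paper's L1, only the predictor is updated by (hand-derived) gradient descent, and all dimensions are made-up. The essential structure is the paper's, though: predict the *features* of masked tokens (computed by an EMA target encoder, with no gradient flowing through it) from the features of visible tokens — never pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen for illustration only
D_IN, D_LAT, N_TOKENS = 16, 8, 10

W_ctx = rng.normal(size=(D_IN, D_LAT)) * 0.1  # context encoder (fixed in this sketch)
W_tgt = W_ctx.copy()                          # target encoder, tracked by EMA
W_pred = np.eye(D_LAT)                        # predictor: context latents -> target latents
EMA = 0.999                                   # target-encoder momentum

def step(video_tokens, mask, lr=0.1):
    """One V-JEPA-style update: predict masked tokens' target features
    from visible-token features. No pixel reconstruction anywhere."""
    global W_pred, W_tgt
    ctx = video_tokens[~mask] @ W_ctx               # features of visible tokens
    tgt = video_tokens[mask] @ W_tgt                # target features of masked tokens (stop-grad)
    pred = ctx.mean(axis=0, keepdims=True) @ W_pred # crude pooled prediction
    err = pred - tgt.mean(axis=0, keepdims=True)
    loss = float((err ** 2).mean())                 # L2 here; the paper uses an L1 loss
    # Manual gradient step on the predictor only (real training also
    # backprops into the context encoder)
    W_pred -= lr * (ctx.mean(axis=0, keepdims=True).T @ err) * (2 / err.size)
    W_tgt = EMA * W_tgt + (1 - EMA) * W_ctx         # EMA update of the target encoder
    return loss

tokens = rng.normal(size=(N_TOKENS, D_IN))
mask = np.zeros(N_TOKENS, dtype=bool)
mask[:4] = True                                     # mask a contiguous block of tokens
losses = [step(tokens, mask) for _ in range(100)]
```

The EMA target encoder and the stop-gradient on `tgt` are what prevent the trivial collapse that a naive "predict your own features" objective would suffer.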
Key question answered
How effective is feature prediction as a stand-alone objective for unsupervised learning from video with modern tools?
Answer: very effective. Feature prediction leads to versatile representations that work well on both motion-based and appearance-based tasks, using a frozen backbone.
Key results
- 81.9% on Kinetics-400 (ViT-H/16, frozen backbone)
- 72.2% on Something-Something-v2 (motion understanding, +6% over competing methods)
- 77.9% on ImageNet-1K
- Superior to pixel-prediction approaches under frozen evaluation
- Significantly shorter training schedules than pixel prediction methods
- More label-efficient than pixel reconstruction approaches
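The "frozen backbone" numbers above come from training only a small probe on top of fixed pretrained features. A minimal sketch of that protocol, with assumptions: the frozen encoder is stood in for by a fixed random linear map, the probe is plain softmax regression (the paper uses an attentive probe), and the labeled data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sizes and a stand-in "pretrained" backbone that is never updated
D_IN, D_FEAT, N_CLASSES, N = 32, 16, 4, 400
W_frozen = rng.normal(size=(D_IN, D_FEAT)) / np.sqrt(D_IN)

# Synthetic labeled data: well-separated class means plus noise
means = rng.normal(size=(N_CLASSES, D_IN)) * 2.0
y = rng.integers(0, N_CLASSES, size=N)
X = means[y] + rng.normal(size=(N, D_IN))

feats = X @ W_frozen  # extract features once; the backbone stays frozen

# Train a linear probe (softmax regression) on the frozen features
W_probe = np.zeros((D_FEAT, N_CLASSES))
for _ in range(300):
    logits = feats @ W_probe
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(N), y] -= 1.0            # gradient of cross-entropy wrt logits
    W_probe -= 0.1 * feats.T @ p / N

acc = float((np.argmax(feats @ W_probe, axis=1) == y).mean())
```

Because only `W_probe` is trained, the protocol measures what the pretrained representation already encodes — which is why frozen evaluation is the setting where feature prediction's advantage over pixel reconstruction shows most clearly.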
Key findings
- Feature prediction produces versatile representations for both motion and appearance tasks without any adaptation of the backbone's parameters
- Feature prediction is superior to pixel prediction under frozen evaluation, competitive under fine-tuning
- Feature prediction models are more label-efficient — the gap widens as labeled examples decrease
Significance in the JEPA timeline
The leap from images to video. V-JEPA showed that JEPA's latent prediction principle scales naturally to temporal data, learning strong motion representations without ever reconstructing pixels. This directly enabled V-JEPA 2's scaling to 1M+ hours and world modeling.
See also
- 2301.08243 (I-JEPA) — the image foundation
- 2506.09985 (V-JEPA 2) — the scaled-up successor with world modeling
- 2603.14482 (V-JEPA 2.1) — the dense feature upgrade
- latent-prediction — the core principle