V-JEPA 2
A major milestone: the point where JEPA becomes an explicit world model for understanding, prediction, and planning, demonstrating zero-shot robotic planning in unseen environments.
Core idea
A two-stage approach:
- V-JEPA 2 (pretraining): action-free joint-embedding predictive architecture pretrained on 1M+ hours of internet video and images
- V-JEPA 2-AC (post-training): action-conditioned world model post-trained on <62 hours of unlabeled robot video (Droid dataset)
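The two stages compose at inference time: the frozen V-JEPA 2 encoder embeds observations, and the action-conditioned predictor rolls embeddings forward under candidate actions. A minimal numpy sketch of that interface — all dimensions, weights, and function names here are invented stand-ins, not the real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the real system uses a ViT encoder and a
# 300M-parameter transformer predictor; these stand-ins are tanh-linear maps.
D_EMB, D_STATE, D_ACT = 32, 7, 7   # embedding, end-effector state, action dims

W_enc = rng.standard_normal((D_EMB, D_EMB)) * 0.1                    # "encoder"
W_pred = rng.standard_normal((D_EMB, D_EMB + D_STATE + D_ACT)) * 0.1  # "predictor"

def encode(frame_feats):
    """Stage 1 (V-JEPA 2): action-free encoder, frozen after pretraining."""
    return np.tanh(W_enc @ frame_feats)

def predict(z, state, action):
    """Stage 2 (V-JEPA 2-AC): action-conditioned next-embedding prediction."""
    return np.tanh(W_pred @ np.concatenate([z, state, action]))

def rollout(z0, state0, actions):
    """Roll the world model forward through a sequence of candidate actions."""
    z, state = z0, state0
    traj = []
    for a in actions:
        z = predict(z, state, a)
        state = state + a            # actions are end-effector state deltas
        traj.append(z)
    return np.stack(traj)

z0 = encode(rng.standard_normal(D_EMB))
plan = rng.standard_normal((5, D_ACT)) * 0.01
traj = rollout(z0, np.zeros(D_STATE), plan)   # (5, D_EMB) latent trajectory
```

The point of the split is that stage 1 never sees actions; only the small stage-2 predictor needs robot data.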
Key results
- Motion understanding: 77.3% top-1 on Something-Something v2
- Action anticipation: 39.7 recall@5 on Epic-Kitchens-100 (SOTA, surpasses task-specific models)
- Video QA: SOTA at 8B scale (84.0 PerceptionTest, 76.9 TempCompass) after LLM alignment
- Zero-shot robot planning: pick-and-place on Franka arms in two different labs, no environment-specific data or task-specific training
Architecture
V-JEPA 2 encoder (pretraining):
- ViT family: ViT-L (300M), ViT-H (600M), ViT-g (1B parameters)
- Video tokenization: 3D tubelets (2x16x16), 3D RoPE positional embeddings
- Trained on VideoMix22M: 22M samples spanning SSv2, Kinetics, HowTo100M, YT-Temporal-1B, ImageNet
- Loss: L1 between predictor output and EMA target encoder output on masked patches
- Progressive resolution training: 16 frames at 256x256, then a cooldown phase at 64 frames, 384x384 (8.4x speedup)
- Total: 252K iterations
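The pretraining objective above — L1 between predictor output and the EMA target encoder's output on masked patches — can be sketched in a few lines of numpy. This is a toy illustration; the shapes, pooling, and "predictor" are invented stand-ins for the real ViT components:

```python
import numpy as np

rng = np.random.default_rng(0)
N_PATCH, D = 16, 8                  # toy token count / embedding dimension

# Toy stand-ins for the context encoder, EMA target encoder, and predictor.
theta = rng.standard_normal((D, D)) * 0.1   # context-encoder weights
theta_ema = theta.copy()                    # EMA target-encoder weights
W_pred = np.eye(D)                          # "predictor" (identity here)

def masked_l1_loss(tokens, mask):
    """L1 between predictor output (computed from visible-token context) and
    the EMA target encoder's output on the masked tokens."""
    ctx = np.tanh(tokens[~mask] @ theta).mean(axis=0)   # pooled visible context
    pred = np.tile(W_pred @ ctx, (mask.sum(), 1))       # predict each masked token
    target = np.tanh(tokens[mask] @ theta_ema)          # no gradient flows here
    return np.abs(pred - target).mean()

def ema_update(theta, theta_ema, momentum=0.999):
    """Target encoder slowly tracks the context encoder; the EMA target plus
    the predictor is what keeps the objective from collapsing."""
    return momentum * theta_ema + (1.0 - momentum) * theta

tokens = rng.standard_normal((N_PATCH, D))
mask = np.zeros(N_PATCH, dtype=bool)
mask[: N_PATCH // 2] = True                 # masked tubelet positions
loss = masked_l1_loss(tokens, mask)
theta_ema = ema_update(theta, theta_ema)
```

Gradients update only the context encoder and predictor; the target branch is updated purely by EMA.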
V-JEPA 2-AC (post-training):
- 300M-parameter transformer predictor with block-causal attention — each patch at time t attends to patches, actions, and states from t and earlier
- Inputs: encoded video frames (from frozen V-JEPA 2 encoder) + 7D end-effector state (3D position + 3D orientation + 1D gripper) + 7D action vectors (state deltas)
- Loss: teacher-forcing L1 + rollout loss (single autoregressive step). The rollout loss is critical for stable multi-step prediction.
- Training data: only 62 hours of Droid robot video — teleoperated Franka Panda demonstrations, no task labels, no rewards, no success indicators
- Planning: cross-entropy method (CEM) with 800 samples, 10 refinement iterations, L1-ball action constraint (13cm max displacement), receding horizon
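The planning loop in the last bullet can be sketched as a plain CEM optimizer over action sequences, scored by latent distance to a goal embedding. Everything below is a hypothetical stand-in (linear toy predictor, crude rescaling in place of exact L1 projection), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
H, D_ACT = 3, 7                        # planning horizon, action dimension
N_SAMPLES, N_ELITE, N_ITERS = 800, 80, 10
MAX_DISP = 0.13                        # L1-ball constraint on position delta (13 cm)

def project_l1(a, radius=MAX_DISP):
    """Keep each action's 3D positional delta inside an L1 ball
    (crude projection: rescale when the L1 norm exceeds the radius)."""
    pos = a[..., :3]
    norm = np.abs(pos).sum(axis=-1, keepdims=True)
    scale = np.minimum(1.0, radius / np.maximum(norm, 1e-9))
    out = a.copy()
    out[..., :3] = pos * scale
    return out

def energy(actions, z0, z_goal, predict):
    """L1 distance between the rolled-out final embedding and the goal."""
    z = np.broadcast_to(z0, (actions.shape[0], z0.shape[0])).copy()
    for t in range(actions.shape[1]):
        z = predict(z, actions[:, t])
    return np.abs(z - z_goal).sum(axis=-1)

def cem_plan(z0, z_goal, predict):
    mu = np.zeros((H, D_ACT))
    sigma = np.full((H, D_ACT), 0.05)
    for _ in range(N_ITERS):
        cand = project_l1(mu + sigma * rng.standard_normal((N_SAMPLES, H, D_ACT)))
        e = energy(cand, z0, z_goal, predict)
        elite = cand[np.argsort(e)[:N_ELITE]]          # keep lowest-energy samples
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu[0]   # receding horizon: execute the first action, then replan

# Toy world model: the latent drifts by a linear function of the action.
D_EMB = 8
A = rng.standard_normal((D_EMB, D_ACT)) * 0.1
predict = lambda z, a: z + a @ A.T
z0, z_goal = np.zeros(D_EMB), np.full(D_EMB, 0.1)
a0 = cem_plan(z0, z_goal, predict)
```

Because no reward model is learned, the goal is specified purely as a target image embedding, and "energy" is just distance in representation space.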
Robot results (10 trials per task per lab)
| Task | Lab 1 | Lab 2 | Average |
|---|---|---|---|
| Reaching | 100% | 100% | 100% |
| Grasp (cup) | 70% | 60% | 65% |
| Grasp (box) | 30% | 20% | 25% |
| Reach with object (cup) | 90% | 60% | 75% |
| Pick-and-place (cup) | 80% | 80% | 80% |
| Pick-and-place (box) | 80% | 50% | 65% |
All results are zero-shot: no data from these specific robots, environments, or tasks. Planning takes ~16s per action (vs 4 minutes for Cosmos video generation–based planning, a 15x speedup).
Significance in the JEPA timeline
This is THE world-model milestone. It shows that self-supervised learning from web-scale data (1M+ hours of internet video) plus a small amount of unlabeled robot interaction data (62 hours) yields a world model capable of zero-shot planning in the physical world. The result is notable because: (1) the encoder sees only internet video during pretraining — never robot data, (2) the action-conditioned predictor trains on just 62 hours of unlabeled manipulation, (3) the system deploys zero-shot on Franka arms in two separate labs with no environment-specific adaptation.
See also
- 2301.08243 (I-JEPA) — the image foundation
- 2307.12698 (MC-JEPA) — early motion+content work
- 2603.14482 (V-JEPA 2.1) — the dense feature upgrade
- 2501.14622 (ACT-JEPA) — earlier action-conditioned work
- collapse-prevention — how V-JEPA 2 avoids representation collapse