JEPAwiki
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Date: 2025-06-11
Modality: video
Authors: Mahmoud Assran, Adrien Bardes, David Fan, Quentin Garrido + 25 more
Tags: video, world-model, robotics, planning, action-conditioned, large-scale
Source: Full text

V-JEPA 2

A major milestone: the point where JEPA becomes an explicit world model for understanding, prediction, and planning. Demonstrates zero-shot robotic planning in unseen environments.

(Figure: V-JEPA 2 architecture)

Core idea

A two-stage approach:

  1. V-JEPA 2 (pretraining): action-free joint-embedding predictive architecture pretrained on 1M+ hours of internet video and images
  2. V-JEPA 2-AC (post-training): action-conditioned world model post-trained on under 62 hours of unlabeled robot video (DROID dataset)

Key results

  • Motion understanding: 77.3% top-1 on Something-Something v2
  • Action anticipation: 39.7 recall@5 on Epic-Kitchens-100 (SOTA, surpasses task-specific models)
  • Video QA: SOTA at 8B scale (84.0 PerceptionTest, 76.9 TempCompass) after LLM alignment
  • Zero-shot robot planning: pick-and-place on Franka arms in two different labs, no environment-specific data or task-specific training

Architecture

V-JEPA 2 encoder (pretraining):

  • ViT family: ViT-L (300M), ViT-H (600M), ViT-g (1B parameters)
  • Video tokenization: 3D tubelets (2x16x16), 3D RoPE positional embeddings
  • Trained on VideoMix22M: 22M samples spanning SSv2, Kinetics, HowTo100M, YT-1B, ImageNet
  • Loss: L1 between predictor output and EMA target encoder output on masked patches
  • Progressive resolution training: 16 frames at 256x256, then cooldown at 64 frames 384x384 (8.4x speedup)
  • Total: 252K iterations
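The masked-prediction objective above can be sketched in a few lines. This is a toy numpy sketch, not the paper's implementation: the encoders and predictor are stand-in linear maps with hypothetical shapes, but the structure matches the description — an L1 loss between the predictor's output and an EMA target encoder's output on masked patches, with the target encoder updated as an exponential moving average of the context encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8                             # hypothetical embedding dim
W_ctx = rng.normal(size=(D, D))   # context encoder (toy linear map)
W_tgt = W_ctx.copy()              # EMA target encoder starts as a copy
W_pred = rng.normal(size=(D, D))  # predictor (toy linear map)

def l1_masked_loss(patches, mask):
    """L1 between predictor output and EMA target output on masked patches."""
    ctx = patches[~mask] @ W_ctx          # encode only visible patches
    # A real predictor attends over the context; here a linear map on the
    # mean context embedding stands in for it, one output per masked patch.
    pred = np.tile(ctx.mean(axis=0) @ W_pred, (mask.sum(), 1))
    tgt = patches[mask] @ W_tgt           # target encodes the masked patches
    return np.abs(pred - tgt).mean()

def ema_update(momentum=0.999):
    """Target encoder tracks the context encoder via EMA (no gradients)."""
    global W_tgt
    W_tgt = momentum * W_tgt + (1 - momentum) * W_ctx

patches = rng.normal(size=(16, D))                 # 16 tubelet tokens
mask = np.zeros(16, dtype=bool); mask[10:] = True  # mask the last 6
loss = l1_masked_loss(patches, mask)
ema_update()
```

The EMA target is what makes the objective non-trivial: because the target weights change slowly and receive no gradients, the predictor cannot collapse to a constant by dragging the target with it.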

V-JEPA 2-AC (post-training):

  • 300M-parameter transformer predictor with block-causal attention — each patch at time t attends to patches, actions, and states from t and earlier
  • Inputs: encoded video frames (from frozen V-JEPA 2 encoder) + 7D end-effector state (3D position + 3D orientation + 1D gripper) + 7D action vectors (state deltas)
  • Loss: teacher-forcing L1 + rollout loss (single autoregressive step). The rollout loss is critical for stable multi-step prediction.
  • Training data: only 62 hours of DROID robot video — teleoperated Franka Panda demonstrations, no task labels, no rewards, no success indicators
  • Planning: CEM with 800 samples, 10 refinement iterations, L1-ball action constraint (13cm max displacement), receding horizon
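The planning loop above can be sketched as follows. This is a minimal numpy sketch under toy assumptions: `rollout_cost` is a hypothetical stand-in for rolling out the action-conditioned world model (the real system scores candidate action sequences by L1 distance between predicted and goal representations), and the elite count is a hypothetical choice; the 800 samples, 10 refinement iterations, and 13 cm L1-ball constraint come from the section above.

```python
import numpy as np

rng = np.random.default_rng(0)

A = 7            # 7D action: end-effector state delta (pos + orient + gripper)
H = 2            # short planning horizon (receding-horizon control)
N = 800          # samples per CEM iteration (as in the paper)
ITERS = 10       # refinement iterations (as in the paper)
ELITE = 80       # elites kept per iteration (hypothetical choice)
MAX_DISP = 0.13  # L1-ball constraint on displacement, 13 cm

goal = rng.normal(size=A)  # hypothetical goal in a toy latent space

def rollout_cost(actions):
    """Toy stand-in for the world model rollout: cost is the L1 distance
    of the cumulative action to the goal (the real model rolls out latents
    and compares them to an encoded goal image)."""
    return np.abs(actions.sum(axis=0) - goal).sum()

def clip_l1(a, radius):
    """Scale each action's positional part into an L1 ball of given radius."""
    norm = np.abs(a[..., :3]).sum(axis=-1, keepdims=True)
    scale = np.minimum(1.0, radius / np.maximum(norm, 1e-8))
    a = a.copy(); a[..., :3] *= scale
    return a

def cem_plan():
    """Cross-entropy method: sample, score, refit a Gaussian to the elites."""
    mu, sigma = np.zeros((H, A)), np.ones((H, A))
    for _ in range(ITERS):
        samples = clip_l1(mu + sigma * rng.normal(size=(N, H, A)), MAX_DISP)
        costs = np.array([rollout_cost(s) for s in samples])
        elites = samples[np.argsort(costs)[:ELITE]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu[0]  # receding horizon: execute only the first action, replan

action = cem_plan()
```

Returning only the first action of the optimized sequence and replanning at every step is the receding-horizon scheme the section describes; it is what makes the ~16 s-per-action planning time the relevant latency.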

Robot results (10 trials per task per lab)

| Task                    | Lab 1 | Lab 2 | Average |
|-------------------------|-------|-------|---------|
| Reaching                | 100%  | 100%  | 100%    |
| Grasp (cup)             | 70%   | 60%   | 65%     |
| Grasp (box)             | 30%   | 20%   | 25%     |
| Reach with object (cup) | 90%   | 60%   | 75%     |
| Pick-and-place (cup)    | 80%   | 80%   | 80%     |
| Pick-and-place (box)    | 80%   | 50%   | 65%     |

All results are zero-shot: no data from these specific robots, environments, or tasks. Planning takes ~16s per action (vs 4 minutes for Cosmos video generation–based planning, a 15x speedup).

Significance in the JEPA timeline

This is THE world-model milestone. It shows that self-supervised learning from web-scale data (1M+ hours of internet video) plus a small amount of unlabeled robot interaction data (62 hours) yields a world model capable of zero-shot planning in the physical world. The result is notable because: (1) the encoder sees only internet video during pretraining — never robot data, (2) the action-conditioned predictor trains on just 62 hours of unlabeled manipulation, (3) the system deploys zero-shot on Franka arms in two separate labs with no environment-specific adaptation.
