JEPAwiki
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Date: 2025-06-11
Modality: video
Authors: Mahmoud Assran, Adrien Bardes, David Fan, Quentin Garrido + 25 more
Tags: video, world-model, robotics, planning, action-conditioned, large-scale
Source: Full text

V-JEPA 2

A major milestone: the point where JEPA becomes an explicit world model for understanding, prediction, and planning. Demonstrates zero-shot robotic planning in unseen environments.

(Figure: V-JEPA 2 architecture)

Core idea

A two-stage approach:

  1. V-JEPA 2 (pretraining): action-free joint-embedding predictive architecture pretrained on 1M+ hours of internet video and images
  2. V-JEPA 2-AC (post-training): action-conditioned world model post-trained on under 62 hours of unlabeled robot video (DROID dataset)

Key results

  • Motion understanding: 77.3% top-1 on Something-Something v2
  • Action anticipation: 39.7 recall@5 on Epic-Kitchens-100 (SOTA, surpasses task-specific models)
  • Video QA: SOTA at 8B scale (84.0 PerceptionTest, 76.9 TempCompass) after LLM alignment
  • Zero-shot robot planning: pick-and-place on Franka arms in two different labs, no environment-specific data or task-specific training

Architecture

V-JEPA 2 encoder (pretraining):

  • ViT family: ViT-L (300M), ViT-H (600M), ViT-g (1B parameters)
  • Video tokenization: 3D tubelets (2x16x16), 3D RoPE positional embeddings
  • Trained on VideoMix22M: 22M samples spanning SSv2, Kinetics, HowTo100M, YT-1B, ImageNet
  • Loss: L1 between predictor output and EMA target encoder output on masked patches
  • Progressive resolution training: 16 frames at 256x256, then cooldown at 64 frames 384x384 (8.4x speedup)
  • Total: 252K iterations
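The masked-prediction objective above can be sketched in a few lines. This is a toy numpy sketch, not the paper's implementation: the encoders and predictor are stand-in linear maps with hypothetical shapes, but the structure matches the description — an L1 loss between the predictor's output and an EMA target encoder's output on masked patches, with the target encoder updated as an exponential moving average of the context encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8                             # hypothetical embedding dim
W_ctx = rng.normal(size=(D, D))   # context encoder (toy linear map)
W_tgt = W_ctx.copy()              # EMA target encoder starts as a copy
W_pred = rng.normal(size=(D, D))  # predictor (toy linear map)

def l1_masked_loss(patches, mask):
    """L1 between predictor output and EMA target output on masked patches."""
    ctx = patches[~mask] @ W_ctx          # encode only visible patches
    # A real predictor attends over the context; here a linear map on the
    # mean context embedding stands in for it, one output per masked patch.
    pred = np.tile(ctx.mean(axis=0) @ W_pred, (mask.sum(), 1))
    tgt = patches[mask] @ W_tgt           # target encodes the masked patches
    return np.abs(pred - tgt).mean()

def ema_update(momentum=0.999):
    """Target encoder tracks the context encoder via EMA (no gradients)."""
    global W_tgt
    W_tgt = momentum * W_tgt + (1 - momentum) * W_ctx

patches = rng.normal(size=(16, D))                 # 16 tubelet tokens
mask = np.zeros(16, dtype=bool); mask[10:] = True  # mask the last 6
loss = l1_masked_loss(patches, mask)
ema_update()
```

The EMA target is what makes the objective non-trivial: because the target weights change slowly and receive no gradients, the predictor cannot collapse to a constant by dragging the target with it.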

V-JEPA 2-AC (post-training):

  • 300M-parameter transformer predictor with block-causal attention — each patch at time t attends to patches, actions, and states from t and earlier
  • Inputs: encoded video frames (from frozen V-JEPA 2 encoder) + 7D end-effector state (3D position + 3D orientation + 1D gripper) + 7D action vectors (state deltas)
  • Loss: teacher-forcing L1 + rollout loss (single autoregressive step). The rollout loss is critical for stable multi-step prediction.
  • Training data: only 62 hours of DROID robot video — teleoperated Franka Panda demonstrations, no task labels, no rewards, no success indicators
  • Planning: CEM with 800 samples, 10 refinement iterations, L1-ball action constraint (13cm max displacement), receding horizon
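The planning loop above can be sketched as follows. This is a minimal numpy sketch under toy assumptions: `rollout_cost` is a hypothetical stand-in for rolling out the action-conditioned world model (the real system scores candidate action sequences by L1 distance between predicted and goal representations), and the elite count is a hypothetical choice; the 800 samples, 10 refinement iterations, and 13 cm L1-ball constraint come from the section above.

```python
import numpy as np

rng = np.random.default_rng(0)

A = 7            # 7D action: end-effector state delta (pos + orient + gripper)
H = 2            # short planning horizon (receding-horizon control)
N = 800          # samples per CEM iteration (as in the paper)
ITERS = 10       # refinement iterations (as in the paper)
ELITE = 80       # elites kept per iteration (hypothetical choice)
MAX_DISP = 0.13  # L1-ball constraint on displacement, 13 cm

goal = rng.normal(size=A)  # hypothetical goal in a toy latent space

def rollout_cost(actions):
    """Toy stand-in for the world model rollout: cost is the L1 distance
    of the cumulative action to the goal (the real model rolls out latents
    and compares them to an encoded goal image)."""
    return np.abs(actions.sum(axis=0) - goal).sum()

def clip_l1(a, radius):
    """Scale each action's positional part into an L1 ball of given radius."""
    norm = np.abs(a[..., :3]).sum(axis=-1, keepdims=True)
    scale = np.minimum(1.0, radius / np.maximum(norm, 1e-8))
    a = a.copy(); a[..., :3] *= scale
    return a

def cem_plan():
    """Cross-entropy method: sample, score, refit a Gaussian to the elites."""
    mu, sigma = np.zeros((H, A)), np.ones((H, A))
    for _ in range(ITERS):
        samples = clip_l1(mu + sigma * rng.normal(size=(N, H, A)), MAX_DISP)
        costs = np.array([rollout_cost(s) for s in samples])
        elites = samples[np.argsort(costs)[:ELITE]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu[0]  # receding horizon: execute only the first action, replan

action = cem_plan()
```

Returning only the first action of the optimized sequence and replanning at every step is the receding-horizon scheme the section describes; it is what makes the ~16 s-per-action planning time the relevant latency.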

Robot results (10 trials per task per lab)

| Task                    | Lab 1 | Lab 2 | Average |
|-------------------------|-------|-------|---------|
| Reaching                | 100%  | 100%  | 100%    |
| Grasp (cup)             | 70%   | 60%   | 65%     |
| Grasp (box)             | 30%   | 20%   | 25%     |
| Reach with object (cup) | 90%   | 60%   | 75%     |
| Pick-and-place (cup)    | 80%   | 80%   | 80%     |
| Pick-and-place (box)    | 80%   | 50%   | 65%     |

All results are zero-shot: no data from these specific robots, environments, or tasks. Planning takes ~16s per action (vs 4 minutes for Cosmos video generation–based planning, a 15x speedup).

Significance in the JEPA timeline

This is THE world-model milestone. It shows that self-supervised learning from web-scale data (1M+ hours of internet video) plus a small amount of unlabeled robot interaction data (62 hours) yields a world model capable of zero-shot planning in the physical world. The result is notable because: (1) the encoder sees only internet video during pretraining — never robot data, (2) the action-conditioned predictor trains on just 62 hours of unlabeled manipulation, (3) the system deploys zero-shot on Franka arms in two separate labs with no environment-specific adaptation.
