World Models and Planning
A central ambition of the JEPA family is to learn world models — internal representations of environment dynamics that support prediction, reasoning, and planning. The concept traces back to Kenneth Craik (1943), a Scottish psychologist who proposed that humans maintain internal models predicting consequences of actions. LeCun's position paper formalized this as the core of his cognitive architecture: a differentiable world model that simulates relevant aspects of the world, enabling an agent to imagine courses of action and predict outcomes without dangerous real-world trial-and-error.
Saining Xie defines it precisely: "You have a state S_t and an action a_t. You want to learn a transition function F that takes your action together with your current state to predict the next state." But he insists this is a goal, not an algorithm: "All of us — whether working on LLMs, Video Diffusion Models, or Gaussian Splatting — are on the path toward the world model."
What makes JEPA a world model?
A world model must do three things: (1) encode the current state, (2) predict how the state evolves given actions, and (3) enable planning by searching over predicted futures. JEPA's latent prediction objective naturally addresses (1) and (2). The key insight is that prediction in representation space — rather than pixel space — yields compact, semantically meaningful state representations that are efficient to plan over.
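The three requirements can be condensed into a minimal interface. The sketch below is illustrative only: the class name, the random linear stand-in dynamics, and the brute-force planner are assumptions, not code from any JEPA system.

```python
import numpy as np

class LatentWorldModel:
    """Illustrative interface for a JEPA-style world model (not real code).

    encode  -- requirement (1): map a raw observation to a latent state
    predict -- requirement (2): latent transition F(s_t, a_t) -> s_{t+1}
    plan    -- requirement (3): search over predicted futures
    """

    def __init__(self, obs_dim, latent_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Random linear stand-ins; a real model uses learned networks.
        self.W_enc = rng.normal(size=(latent_dim, obs_dim)) / np.sqrt(obs_dim)
        self.W_z = rng.normal(size=(latent_dim, latent_dim)) / np.sqrt(latent_dim)
        self.W_a = rng.normal(size=(latent_dim, action_dim)) / np.sqrt(action_dim)

    def encode(self, obs):
        return np.tanh(self.W_enc @ obs)

    def predict(self, z, action):
        # The transition runs entirely in representation space.
        return np.tanh(self.W_z @ z + self.W_a @ action)

    def plan(self, z0, z_goal, candidate_action_seqs):
        # Exhaustive search for the sketch; CEM replaces this in practice.
        def rollout_cost(actions):
            z = z0
            for a in actions:
                z = self.predict(z, a)
            return float(np.linalg.norm(z - z_goal))
        return min(candidate_action_seqs, key=rollout_cost)
```

Because both prediction and cost evaluation operate on compact latents, the planner never touches pixels after the initial encoding.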
Action conditioning
To go from a perceptual model to a world model, JEPA must be conditioned on actions. The family has explored several approaches:
V-JEPA 2-AC (2506.09985)
The most complete implementation. A 300M-parameter transformer predictor with block-causal attention, where each patch at time t attends to patches, actions, and states from t and earlier.
- Inputs: video frames (encoded by frozen V-JEPA 2 encoder), 7D end-effector states (position + orientation + gripper), 7D action vectors (state deltas)
- Action integration: actions and states are projected to the same dimension as visual tokens and attend via block-causal masking
- Training loss: teacher-forcing L1 loss plus a rollout loss computed over a short recurrent rollout (T=2 steps). The rollout loss is critical for stable autoregressive prediction.
- Training data: only 62 hours of unlabeled DROID robot video — no rewards, task labels, or success indicators
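The two-term training loss described above can be sketched as follows. Here `predict` stands in for the 300M-parameter predictor, and the equal loss weighting and all variable names are assumptions rather than the paper's implementation.

```python
import numpy as np

def l1(a, b):
    return float(np.abs(a - b).mean())

def predictor_loss(predict, z_true, actions, T_rollout=2):
    """Teacher-forcing L1 loss plus a short recurrent rollout loss (sketch).

    predict(z_t, a_t) -> predicted z_{t+1}, computed in latent space.
    z_true: (T+1, D) ground-truth latents from the frozen encoder.
    actions: (T, A) action vectors.
    """
    # Teacher forcing: every step starts from the ground-truth latent.
    tf_loss = np.mean([l1(predict(z_true[t], actions[t]), z_true[t + 1])
                       for t in range(len(actions))])

    # Rollout: feed the model its own predictions for T_rollout steps;
    # this is what stabilizes autoregressive prediction at planning time.
    ro_losses = []
    for t in range(len(actions) - T_rollout + 1):
        z_hat = z_true[t]
        for k in range(T_rollout):
            z_hat = predict(z_hat, actions[t + k])
        ro_losses.append(l1(z_hat, z_true[t + T_rollout]))
    return tf_loss + float(np.mean(ro_losses))
```

Without the rollout term, small one-step errors compound unchecked when the predictor is unrolled recurrently during planning.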
LeWorldModel (2603.19312)
Uses Adaptive Layer Normalization (AdaLN) to inject actions. At each transformer layer, action embeddings modulate the hidden state through learned scale and bias parameters, initialized to zero for stable progressive conditioning.
- 15M parameters total, trainable on a single GPU
- History length of N=3 observations for temporal context
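A minimal sketch of the AdaLN conditioning step, assuming the scale and bias come from zero-initialized linear projections of the action embedding (all names here are illustrative):

```python
import numpy as np

def adaln_condition(h, action_emb, W_scale, W_bias, eps=1e-5):
    """Adaptive LayerNorm action conditioning (sketch).

    h: (N, D) hidden states at one transformer layer.
    action_emb: (A,) embedded action.
    W_scale, W_bias: (D, A) learned projections. Zero-initializing them
    makes the layer start as plain LayerNorm, so conditioning is
    introduced progressively during training.
    """
    # Standard LayerNorm over the feature dimension.
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    h_norm = (h - mu) / np.sqrt(var + eps)
    # Action-dependent scale and shift.
    scale = W_scale @ action_emb  # (D,)
    bias = W_bias @ action_emb    # (D,)
    return h_norm * (1.0 + scale) + bias
```

With zero weights the function reduces exactly to LayerNorm, which is what makes the zero initialization stable.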
ACT-JEPA (2501.14622)
Dual prediction: simultaneously predicts action sequences (via action chunking) and abstract observation sequences. The JEPA objective on observations provides a richer training signal than action prediction alone.
C-JEPA (2602.11389)
Actions and proprioception are treated as separate entities in the object-centric representation, rather than concatenated with visual features. This preserves the structural independence of control inputs.
Planning via Cross-Entropy Method (CEM)
All planning-capable JEPA variants use the same core approach: Model Predictive Control (MPC) with the Cross-Entropy Method as the optimizer.
How it works
- Define energy: E(a_{1:H}) = ||predict(a_{1:H}, z_current) - z_goal||, where z_goal is the encoded goal image
- Sample: draw candidate action sequences from a Gaussian distribution (initially N(0, I))
- Evaluate: roll out each sequence through the world model in latent space
- Select: keep the top-k lowest-energy sequences
- Update: fit a new Gaussian to the elite set
- Repeat: iterate refinement (typically 10 iterations)
- Execute: apply the first action, observe new state, replan (receding horizon)
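The loop above can be sketched directly. `rollout_energy` wraps the latent rollout and goal distance; the sample and iteration counts mirror V-JEPA 2-AC's reported settings, while the elite-set size is an assumption:

```python
import numpy as np

def cem_plan(rollout_energy, horizon, action_dim,
             n_samples=800, n_elite=50, n_iters=10, seed=0):
    """Cross-Entropy Method over action sequences (sketch).

    rollout_energy(actions) -> scalar: roll the (horizon, action_dim)
    sequence through the world model in latent space and return the
    distance between the predicted final latent and the goal latent.
    n_samples and n_iters mirror V-JEPA 2-AC's reported 800 samples and
    10 iterations; n_elite is an assumption.
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))  # start from N(0, I)
    for _ in range(n_iters):
        # Sample candidate sequences from the current Gaussian.
        samples = rng.normal(mu, sigma, size=(n_samples, horizon, action_dim))
        energies = np.array([rollout_energy(a) for a in samples])
        # Keep the top-k lowest-energy sequences and refit the Gaussian.
        elite = samples[np.argsort(energies)[:n_elite]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    # Receding horizon: the caller executes mu[0], observes, and replans.
    return mu
```

Because CEM is gradient-free, it only needs forward rollouts through the world model, and the candidate rollouts batch naturally on a GPU.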
Speed comparison
| World Model | Planning time per action | Relative speed (vs. each paper's reported baseline) |
|---|---|---|
| [V-JEPA 2-AC](/wiki/papers/2506.09985) | 16 seconds | 1x |
| [LeWorldModel](/wiki/papers/2603.19312) | <1 second | 48x faster |
| [C-JEPA](/wiki/papers/2602.11389) | ~13 seconds | 8x faster than patch-based |
| Cosmos (video generation) | 4 minutes | 15x slower |
The speed advantage comes from planning in latent space rather than pixel space. LeWorldModel is fastest due to its tiny model size (15M params).
V-JEPA 2-AC planning details
- 800 candidate samples, 10 refinement iterations
- L1-ball action constraint: radius 0.075 (13 cm max end-effector displacement)
- Sub-goal hierarchies for multi-step tasks: e.g., pick-and-place uses 3 sequential goals (grasp -> move -> place) with automatic switching (4, 10, 4 timesteps)
- Zero-shot deployment: no environment-specific data, no task-specific training, no reward
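The sub-goal mechanism can be sketched as an outer loop around the planner. All function names here are illustrative; `plan_action` stands in for one CEM planning call that returns the first action of the optimized sequence:

```python
def mpc_with_subgoals(encode, plan_action, step_env, obs0, goal_images, budgets):
    """Receding-horizon control with sequential sub-goals (sketch).

    goal_images: one goal image per sub-goal, e.g. [grasp, move, place].
    budgets: timesteps allotted to each sub-goal, e.g. [4, 10, 4].
    plan_action(z, z_goal) -> one action (e.g. the first action of a
    CEM plan); step_env(action) -> the next observation after execution.
    """
    obs = obs0
    for goal_img, steps in zip(goal_images, budgets):
        z_goal = encode(goal_img)      # switch to the next sub-goal
        for _ in range(steps):         # fixed-budget automatic switching
            action = plan_action(encode(obs), z_goal)
            obs = step_env(action)     # execute, observe, replan
    return obs
```

Fixed per-sub-goal budgets keep the switching logic trivial: no success detector is needed, consistent with the reward-free training setup.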
C-JEPA planning efficiency
Uses only about 1% of the latent features (6 object tokens of dimension 128 vs. 196 patch tokens of dimension 384) while achieving an 88.67% success rate on Push-T, within 3% of the full patch-based model.
Robot results
V-JEPA 2-AC zero-shot manipulation
| Task | Lab 1 | Lab 2 | Average |
|---|---|---|---|
| Single-goal reaching | 100% | 100% | 100% |
| Grasp (cup) | 70% | 60% | 65% |
| Grasp (box) | 30% | 20% | 25% |
| Pick-and-place (cup) | 80% | 80% | 80% |
| Pick-and-place (box) | 80% | 50% | 65% |
Trained on 62 hours of DROID data, deployed zero-shot on Franka arms in two different labs.
LeWorldModel control
Competitive with DINO-WM on Push-T and OGBench tasks despite using only pixel inputs (no proprioception), 15M parameters, and single-GPU training. Plans 48x faster.
The trajectory: perception to planning
- I-JEPA (2023): learns static representations — no action, no planning
- V-JEPA / MC-JEPA (2023): learns temporal dynamics — still no action conditioning
- ACT-JEPA (2025): bridges to policy learning — action chunking + latent observation prediction
- V-JEPA 2-AC (2025): full world model — zero-shot robot planning from web video + 62h robot data
- C-JEPA (2026): efficient planning via object-centric representations — 1% of tokens, 8x faster
- LeWorldModel (2026): minimal world model — 2 loss terms, 15M params, 48x faster planning
- ThinkJEPA (2026): adds semantic reasoning — VLM-guided long-horizon prediction
Mode-1 and Mode-2: System 1 and System 2
LeCun's position paper describes two operating modes, inspired by Kahneman's dual-process theory:
- Mode-1 (reactive): perception → policy → action. No world model involved. Fast, habitual, like catching a ball. Analogous to System 1.
- Mode-2 (deliberative): perception → world model rollout → cost evaluation → action optimization. Slow, effortful, like planning a driving route. Analogous to System 2. This is Model-Predictive Control (MPC) with learned models.
A key insight: Mode-2 results can train Mode-1. After deliberative planning finds an optimal action, the reactive policy is trained to approximate it. This mirrors how humans automate initially deliberate skills — a new driver thinks through every action (Mode-2), but an experienced driver reacts instinctively (Mode-1).
V-JEPA 2-AC implements Mode-2 planning with CEM. No JEPA system has yet demonstrated the Mode-1/Mode-2 interaction where planning results train a reactive policy — this remains an open challenge from the position paper.
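Although no JEPA system has demonstrated it yet, the Mode-2-trains-Mode-1 loop amounts to amortizing the planner into a cheap policy. The sketch below is a toy illustration of that distillation idea; the linear policy and every name in it are assumptions:

```python
import numpy as np

class ReactivePolicy:
    """Toy linear Mode-1 policy distilled from Mode-2 planning results."""

    def __init__(self, latent_dim, action_dim, lr=0.1):
        self.W = np.zeros((action_dim, latent_dim))
        self.lr = lr

    def act(self, z):
        # Mode-1: one cheap forward pass, no world-model rollout.
        return self.W @ z

    def distill(self, z, planned_action):
        # Mode-2 supplies planned_action via deliberate search (e.g. CEM);
        # the reactive policy regresses toward it, amortizing the planner.
        err = self.act(z) - planned_action
        self.W -= self.lr * np.outer(err, z)
```

Each deliberative planning episode thus doubles as a training example, so the agent gradually shifts from slow Mode-2 search to fast Mode-1 reaction on familiar situations.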
See also
- masking-strategies — masking drives what the world model learns
- collapse-prevention — keeping the latent space informative
- latent-prediction — the core principle underlying all world models
- vision-transformers — the backbone architecture