Latent Prediction
The defining principle of JEPA: predict target representations in latent space, never in pixel/input space. This single design choice separates JEPA from both generative models (MAE, video generation) and contrastive methods (SimCLR, DINO), and shapes every downstream property of the architecture.
The intuition (in plain terms)
LeCun's core insight, as described in his position paper: "Don't predict every detail. Predict the essence."
Consider predicting what happens when a car approaches a fork in the road. A generative model must predict every pixel — the texture of the asphalt, the exact position of every leaf on every tree, the precise pattern of shadows. An autoregressive model must assign probability to every possible next token. But the world-relevant prediction is simple: the car goes left or right.
JEPA predicts in a representation space where "car goes left" and "car goes right" are two points — ignoring the trees, the shadows, and the asphalt texture entirely. The representation captures what matters for understanding and planning, discarding what doesn't.
As Saining Xie puts it: "You can't build a general model by memorizing everything and reconstructing it all. You need to work in an abstract representation space, to make predictions in an abstract representation space."
Or more bluntly: "In my definition, representation is a world model. The most important part."
The core problem with prediction in input space
Nearly every mainstream AI system — GPT, LLaMA, Sora, MAE — predicts in input space: the next text token, the next pixel, the missing patch. This seems natural but has a deep flaw for learning about the world.
The unpredictability problem
Consider predicting the next frame of a video. Most of the pixels are predictable (the wall stays the same color, objects continue on their trajectories). But some are fundamentally unpredictable: the exact texture of a leaf rustling, the precise speckle pattern of noise, which word a person will say next. An input-space predictor must assign probability to all of these details — spending most of its capacity modeling uncertainty that reveals nothing about the structure of the world.
This is LeCun's central critique of autoregressive LLMs for world modeling: next-token prediction forces the model to explain everything, including what's unexplainable. A language model predicting "The cat sat on the ___" must consider "mat," "chair," "roof," "table" — but the specific choice may be arbitrary. The world-model-relevant fact is that cats sit on surfaces. The specific surface is noise.
How autoregressive models handle this
LLMs deal with this by modeling a full probability distribution over next tokens. This works for language generation but is wasteful for learning world models:
- The model spends capacity on surface-level variation (synonyms, style, formatting)
- Discrete tokenization (BPE) compresses away fine-grained information
- Generation is sequential — each token depends on all previous, so prediction is inherently slow
- The "world model" is implicit in the weights, not a separate manipulable module you can use for planning
What JEPA does instead
JEPA predicts the representation of the masked region, as encoded by a target encoder. The target encoder has already abstracted away unpredictable details, so the predictor only needs to capture the predictable structure — objects, spatial relationships, motion patterns, semantics.
Loss = ||predictor(context_embedding) - stopgrad(target_encoder(masked_region))||²
The target encoder (typically an EMA copy, or regularized via SIGReg) acts as an information bottleneck: it decides what's worth representing. Unpredictable noise gets compressed away. The predictor then only needs to predict what's actually predictable about the world — which is exactly what a world model should capture.
Important caveat: this abstraction is not free. Because there's no reconstruction anchor (no pixel loss), the target encoder could learn to map everything to the same constant vector — making prediction trivially perfect but representations useless. This is representation collapse, the central failure mode of JEPA, and the reason every JEPA variant needs an explicit collapse-prevention mechanism (EMA, SIGReg, or deep self-supervision). The abstraction is only beneficial when collapse is prevented.
This trade-off is a defining feature of the paradigm. Generative models (MAE, diffusion) don't collapse because reconstruction loss forces diverse outputs. JEPA sacrifices that safety net in exchange for abstract, planning-compatible representations — and must solve collapse as a separate engineering problem.
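The mechanics above can be sketched in a few lines. This is a toy numpy illustration: linear maps stand in for the ViT encoders and predictor, and all names and dimensions are invented for illustration, not taken from the papers.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_LATENT = 32, 8  # toy dimensions, illustrative only

# Linear stand-ins for the context encoder, target encoder, and predictor.
W_ctx = rng.normal(scale=0.1, size=(D_IN, D_LATENT))       # context encoder
W_tgt = W_ctx.copy()                                        # target encoder (EMA copy)
W_pred = rng.normal(scale=0.1, size=(D_LATENT, D_LATENT))  # predictor

def jepa_loss(context_patch, masked_patch):
    """L2 loss in latent space: predict the target encoder's output,
    never the masked pixels themselves."""
    z_pred = (context_patch @ W_ctx) @ W_pred  # predict latent of masked region
    z_tgt = masked_patch @ W_tgt               # target repr.; no gradient flows here
    return np.mean((z_pred - z_tgt) ** 2)

def ema_update(momentum=0.996):
    """Target encoder trails the context encoder (the EMA collapse guard)."""
    global W_tgt
    W_tgt = momentum * W_tgt + (1 - momentum) * W_ctx

x_context = rng.normal(size=D_IN)
x_masked = rng.normal(size=D_IN)
loss = jepa_loss(x_context, x_masked)  # a real step backprops through W_ctx
ema_update()                           # and W_pred only, then updates the EMA
```

Note the asymmetry: gradients would flow only through the context encoder and predictor, while the target encoder is updated by EMA. That asymmetry is what stands between this objective and the trivial constant solution.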
Properties that follow from latent prediction
1. Semantic by default
Because the target encoder compresses away noise, the prediction objective naturally favors semantic features. I-JEPA demonstrated this: without any data augmentation, latent prediction produces representations competitive with augmentation-heavy contrastive methods.
2. No augmentation dependence
Contrastive methods like SimCLR require hand-crafted augmentations (random crop, color jitter, etc.) to define what should be invariant. These augmentations bake in assumptions that may not hold for all downstream tasks. JEPA's masking-based approach avoids this — the only inductive bias is what gets masked.
3. Modality-general
Because JEPA operates on abstract representations rather than raw inputs, the same framework applies to any modality that can be tokenized and masked:
| Modality | Tokenization | Paper |
|---|---|---|
| Images | 2D patches (ViT) | [I-JEPA](/wiki/papers/2301.08243) |
| Video | 3D tubelets (2x16x16) | [V-JEPA 2](/wiki/papers/2506.09985) |
| Point clouds | FPS + k-NN patches | [Point-JEPA](/wiki/papers/2404.16432) |
| 3D scenes | FPS + k-NN blocks | [3D-JEPA](/wiki/papers/2409.15803) |
| Audio | Time-frequency patches | Audio-JEPA |
| Objects | Slot attention slots | [C-JEPA](/wiki/papers/2602.11389) |
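For images, the ViT-style tokenization in the table amounts to cutting the input into non-overlapping patches. A minimal numpy sketch (patch size and image shape chosen for illustration):

```python
import numpy as np

def patchify(img, p=16):
    """Split an (H, W, C) image into non-overlapping p x p patch tokens (ViT-style)."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "image must tile evenly into patches"
    # (H, W, C) -> (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (tokens, token_dim)
    patches = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p * p * C)

tokens = patchify(np.zeros((224, 224, 3)), p=16)
# tokens.shape == (196, 768): 14 x 14 patch grid, 16*16*3 values per token
```

The other modalities in the table follow the same recipe with a different notion of "patch": 3D tubelets for video, FPS + k-NN neighborhoods for point clouds, time-frequency tiles for audio.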
4. Efficient planning
Predicting in latent space is far cheaper than generating pixels. This directly enables the planning capabilities of JEPA world models: V-JEPA 2-AC plans in about 16 seconds versus roughly 4 minutes for video generation models.
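Latent planning can be sketched as random-shooting search: roll candidate action sequences forward with the predictor, entirely in representation space, and keep the sequence whose final latent lands closest to a goal embedding. This is a simplified stand-in for the sampling-based planner used by V-JEPA 2-AC; the toy dynamics, dimensions, and cost are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8
W_dyn = 0.9 * np.eye(D)  # toy action-conditioned predictor (placeholder dynamics)

def rollout(z, actions):
    """Roll the latent state forward under a candidate action sequence."""
    for a in actions:
        z = z @ W_dyn + a  # toy rule: each action shifts the latent state
    return z

def plan(z0, z_goal, horizon=4, n_samples=64):
    """Random-shooting planning: no pixels are ever generated."""
    best_cost, best_seq = np.inf, None
    for _ in range(n_samples):
        seq = rng.normal(scale=0.1, size=(horizon, D))
        cost = np.linalg.norm(rollout(z0, seq) - z_goal)  # distance to goal latent
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq, best_cost
```

Each candidate evaluation is a handful of small matrix multiplies, which is why latent planners can search many sequences in the time a video generator needs for a single rollout.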
5. Natural abstraction hierarchy
Latent predictions can be made at multiple levels of abstraction simultaneously. V-JEPA 2.1's deep self-supervision applies prediction objectives at 4 intermediate encoder layers, capturing both low-level spatial structure and high-level semantics.
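Deep self-supervision can be written as a weighted sum of latent prediction losses over intermediate layers. The uniform weighting below is an assumption for illustration, not the paper's recipe:

```python
import numpy as np

def deep_supervision_loss(layer_preds, layer_targets, weights=None):
    """Sum latent L2 losses over several intermediate encoder layers.

    layer_preds / layer_targets: lists of same-shaped arrays, one per
    supervised layer (e.g. 4 layers in V-JEPA 2.1-style training).
    """
    weights = weights or [1.0] * len(layer_preds)  # uniform weights (assumption)
    return sum(w * np.mean((p - t) ** 2)
               for w, p, t in zip(weights, layer_preds, layer_targets))
```

Early layers anchor low-level spatial structure; later layers anchor semantics, so one objective supervises several abstraction levels at once.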
Loss functions across the family
The prediction loss varies across JEPA variants, reflecting different trade-offs:
| Method | Loss | Rationale |
|---|---|---|
| [I-JEPA](/wiki/papers/2301.08243) | L2 (MSE) | Standard, symmetric |
| [V-JEPA 2](/wiki/papers/2506.09985) | L1 | More robust to outliers, sharper predictions |
| [Point-JEPA](/wiki/papers/2404.16432) | Smooth L1 | Combines L1 robustness with L2 smoothness near zero |
| [3D-JEPA](/wiki/papers/2409.15803) | Cosine similarity | Normalizes magnitude, focuses on direction |
| [LeWorldModel](/wiki/papers/2603.19312) | L2 (MSE) | Paired with SIGReg for stable training |
| [C-JEPA](/wiki/papers/2602.11389) | L2 (MSE) | On object-centric slots |
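The four losses in the table differ only in how they penalize the residual between predicted and target embeddings. A self-contained numpy sketch of each:

```python
import numpy as np

def l2_loss(pred, tgt):
    """MSE: quadratic everywhere; sensitive to outlier dimensions."""
    return np.mean((pred - tgt) ** 2)

def l1_loss(pred, tgt):
    """Mean absolute error: linear penalty, robust to outliers."""
    return np.mean(np.abs(pred - tgt))

def smooth_l1_loss(pred, tgt, beta=1.0):
    """Quadratic within |d| < beta, linear beyond: L2 smoothness near
    zero with L1 robustness in the tails."""
    d = np.abs(pred - tgt)
    return np.mean(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta))

def cosine_loss(pred, tgt):
    """1 - cosine similarity: ignores magnitude, penalizes direction only."""
    return 1.0 - np.dot(pred, tgt) / (np.linalg.norm(pred) * np.linalg.norm(tgt))
```

The trade-offs in the table fall out directly: `cosine_loss(v, 2 * v)` is zero because only direction matters, while an outlier dimension inflates `l2_loss` quadratically but `l1_loss` only linearly.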
The five paradigms compared
| | Autoregressive LLM | Generative (MAE/Diffusion) | Contrastive (DINO/CLIP) | JEPA | NEPA |
|---|---|---|---|---|---|
| Predicts | Next token | Pixels | Nothing (matches views) | Latent repr. | Next embedding |
| Prediction space | Input (discrete) | Input (continuous) | N/A | Latent | Latent |
| Direction | Causal (L→R) | Bidirectional | N/A | Bidirectional | Causal (L→R) |
| Augmentations | N/A | Sometimes | Required | None | None |
| Low-level details | Must predict all | Must reconstruct all | Discards | Only if predictable | Only if predictable |
| Can generate | Yes (text, code) | Yes (images, video) | No | No | No |
| Can plan | Via CoT (slow) | Via generation (slow) | No | Yes (fast) | Not demonstrated |
| World model | Implicit | Implicit | None | Explicit (predictor) | Implicit |
| Collapse risk | None | None | Managed (negatives) | Yes (central) | Managed (stop-grad) |
| Modalities | Primarily language | Images, video, audio | Images, text | Any | Images |
The key insight: autoregressive and generative models pay a tax for operating in input space — they must model every detail, predictable or not. JEPA and NEPA avoid this tax by operating in latent space, at the cost of not being able to generate outputs. This is the right trade-off for world models (where you need fast, abstract prediction) but the wrong trade-off for content generation (where you need the output).
NEPA is particularly interesting because it applies the autoregressive prediction pattern (next-token, causal masking) but in embedding space — evidence that the key ingredient is latent prediction, not the specific masking pattern. It hints that the autoregressive and JEPA paradigms may ultimately converge.
The convergence question
The JEPA family is moving toward the autoregressive world, and vice versa:
- JEPA → language: LLM-JEPA adds a JEPA objective to standard LLM training, improving finetuning on multiple model families
- JEPA → generation: VL-JEPA predicts text embeddings and uses a lightweight decoder only when text output is needed — a hybrid approach
- Autoregressive → JEPA: NEPA applies GPT-style causal prediction but in embedding space
- VLMs → JEPA: ThinkJEPA uses a VLM (Qwen3-VL) as a "thinker" to guide a JEPA world model with semantic reasoning
The open question: will these paradigms fully converge into a single framework, or will latent prediction and token prediction remain complementary tools for different aspects of intelligence?
See also
- collapse-prevention — the central challenge of latent prediction (the price of not reconstructing)
- masking-strategies — what gets masked determines what gets learned
- world-models-and-planning — latent prediction enables planning that autoregressive models can't match
- vision-transformers — the encoder architecture