JEPAwiki

Latent Prediction

The defining principle of JEPA: predict target representations in latent space, never in pixel/input space. This single design choice separates JEPA from both generative models (MAE, video generation) and contrastive methods (SimCLR, DINO), and shapes every downstream property of the architecture.

The intuition (in plain terms)

LeCun's core insight, as described in his position paper: "Don't predict every detail. Predict the essence."

Consider predicting what happens when a car approaches a fork in the road. A generative model must predict every pixel — the texture of the asphalt, the exact position of every leaf on every tree, the precise pattern of shadows. An autoregressive model must assign probability to every possible next token. But the world-relevant prediction is simple: the car goes left or right.

JEPA predicts in a representation space where "car goes left" and "car goes right" are two points — ignoring the trees, the shadows, and the asphalt texture entirely. The representation captures what matters for understanding and planning, discarding what doesn't.

As Saining Xie puts it: "You can't build a general model by memorizing everything and reconstructing it all. You need to work in an abstract representation space, to make predictions in an abstract representation space."

Or more bluntly: "In my definition, representation is a world model. The most important part."

The core problem with prediction in input space

Every mainstream AI system — GPT, LLaMA, Sora, MAE — predicts in input space: the next text token, the next pixel, the missing patch. This seems natural but has a deep flaw for learning about the world.

The unpredictability problem

Consider predicting the next frame of a video. Most of the pixels are predictable (the wall stays the same color, objects continue on their trajectories). But some are fundamentally unpredictable: the exact texture of a leaf rustling, the precise speckle pattern of noise, which word a person will say next. An input-space predictor must assign probability to all of these details — spending most of its capacity modeling uncertainty that reveals nothing about the structure of the world.

This is LeCun's central critique of autoregressive LLMs for world modeling: next-token prediction forces the model to explain everything, including what's unexplainable. A language model predicting "The cat sat on the ___" must consider "mat," "chair," "roof," "table" — but the specific choice may be arbitrary. The world-model-relevant fact is that cats sit on surfaces. The specific surface is noise.

How autoregressive models handle this

LLMs deal with this by modeling a full probability distribution over next tokens. This works for language generation but is wasteful for learning world models:

  • The model spends capacity on surface-level variation (synonyms, style, formatting)
  • Discrete tokenization (BPE) compresses away fine-grained information
  • Generation is sequential — each token depends on all previous, so prediction is inherently slow
  • The "world model" is implicit in the weights, not a separate manipulable module you can use for planning

What JEPA does instead

JEPA predicts the representation of the masked region, as encoded by a target encoder. The target encoder has already abstracted away unpredictable details, so the predictor only needs to capture the predictable structure — objects, spatial relationships, motion patterns, semantics.

Loss = ||Predictor(context_embedding) - TargetEncoder(masked_region)||

The target encoder (typically an EMA copy, or regularized via SIGReg) acts as an information bottleneck: it decides what's worth representing. Unpredictable noise gets compressed away. The predictor then only needs to predict what's actually predictable about the world — which is exactly what a world model should capture.
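The mechanics can be sketched in a few lines. This is a toy illustration with numpy, not any paper's implementation: `context_encoder`, `target_encoder`, and `predictor` stand in for real networks (ViTs/MLPs), and the inputs are random stand-ins for visible and masked patches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear stand-ins for the three modules.
W_ctx = rng.normal(size=(32, 16))   # context encoder weights
W_tgt = W_ctx.copy()                # target encoder starts as a copy (updated via EMA)
W_pred = rng.normal(size=(16, 16))  # predictor weights

def context_encoder(x): return x @ W_ctx
def target_encoder(x):  return x @ W_tgt   # no gradient flows here (stop-grad)
def predictor(z):       return z @ W_pred

x_context = rng.normal(size=(4, 32))  # visible patches (toy features)
x_target  = rng.normal(size=(4, 32))  # masked patches

# Loss = ||Predictor(context_embedding) - TargetEncoder(masked_region)||^2
pred = predictor(context_encoder(x_context))
tgt = target_encoder(x_target)        # treated as a constant target
loss = np.mean((pred - tgt) ** 2)
```

The key structural point is the last three lines: the loss compares two vectors in representation space, and the target branch is held constant (stop-gradient), so only the context encoder and predictor are trained on it.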

Important caveat: this abstraction is not free. Because there's no reconstruction anchor (no pixel loss), the target encoder could learn to map everything to the same constant vector — making prediction trivially perfect but representations useless. This is representation collapse, the central failure mode of JEPA, and the reason every JEPA variant needs an explicit collapse-prevention mechanism (EMA, SIGReg, or deep self-supervision). The abstraction is only beneficial when collapse is prevented.

This trade-off is a defining feature of the paradigm. Generative models (MAE, diffusion) don't collapse because reconstruction loss forces diverse outputs. JEPA sacrifices that safety net in exchange for abstract, planning-compatible representations — and must solve collapse as a separate engineering problem.
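The EMA mechanism mentioned above is simple to state. A minimal sketch, treating parameters as plain arrays; the momentum value is a typical choice, not taken from any specific paper:

```python
import numpy as np

def ema_update(target_params, online_params, momentum=0.996):
    """Move target-encoder params slowly toward the online encoder.

    The target encoder never receives gradients; it only tracks the
    online encoder through this exponential moving average. The slowly
    moving target keeps prediction targets stable, which is one of the
    standard defenses against representation collapse.
    """
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]

online = [np.ones((2, 2)), np.zeros(3)]
target = [np.zeros((2, 2)), np.ones(3)]
target = ema_update(target, online)
# target[0] is now ~0.004 everywhere, target[1] is ~0.996 everywhere
```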

Properties that follow from latent prediction

1. Semantic by default

Because the target encoder compresses away noise, the prediction objective naturally favors semantic features. I-JEPA demonstrated this: without any data augmentation, latent prediction produces representations competitive with augmentation-heavy contrastive methods.

2. No augmentation dependence

Contrastive methods like SimCLR require hand-crafted augmentations (random crop, color jitter, etc.) to define what should be invariant. These augmentations bake in assumptions that may not hold for all downstream tasks. JEPA's masking-based approach avoids this — the only inductive bias is what gets masked.

3. Modality-general

Because JEPA operates on abstract representations rather than raw inputs, the same framework applies to any modality that can be tokenized and masked:

| Modality | Tokenization | Paper |
| --- | --- | --- |
| Images | 2D patches (ViT) | [I-JEPA](/wiki/papers/2301.08243) |
| Video | 3D tubelets (2×16×16) | [V-JEPA 2](/wiki/papers/2506.09985) |
| Point clouds | FPS + k-NN patches | [Point-JEPA](/wiki/papers/2404.16432) |
| 3D scenes | FPS + k-NN blocks | [3D-JEPA](/wiki/papers/2409.15803) |
| Audio | Time-frequency patches | Audio-JEPA |
| Objects | Slot attention slots | [C-JEPA](/wiki/papers/2602.11389) |
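For images, for instance, tokenization is just splitting into non-overlapping 2D patches. A minimal sketch (patch size and image shape are illustrative defaults for a ViT, not values from the source):

```python
import numpy as np

def patchify(img, p=16):
    """Split an (H, W, C) image into a sequence of flattened p x p patches."""
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0, "image dims must be divisible by p"
    x = img.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)        # (h/p, w/p, p, p, c)
    return x.reshape(-1, p * p * c)       # (num_patches, patch_dim)

tokens = patchify(np.zeros((224, 224, 3)))
# tokens.shape == (196, 768): a 14 x 14 grid of 16 x 16 x 3 patches
```

Once any modality is reduced to such a token sequence, the same mask-and-predict recipe applies unchanged.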

4. Efficient planning

Predicting in latent space is far cheaper than generating pixels. This directly enables the planning capabilities of JEPA world models — V-JEPA 2-AC plans in 16 seconds vs. 4 minutes for video generation models.

5. Natural abstraction hierarchy

Latent predictions can be made at multiple levels of abstraction simultaneously. V-JEPA 2.1's deep self-supervision applies prediction objectives at 4 intermediate encoder layers, capturing both low-level spatial structure and high-level semantics.
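Schematically, deep self-supervision just averages prediction losses taken at several encoder depths. A sketch (illustrative: the layer count matches the text, but the uniform weighting is an assumption):

```python
import numpy as np

def deep_supervision_loss(preds_by_layer, targets_by_layer):
    """Average prediction losses computed at several intermediate layers,
    so the objective covers both low-level and high-level features."""
    losses = [np.mean((p - t) ** 2)
              for p, t in zip(preds_by_layer, targets_by_layer)]
    return sum(losses) / len(losses)

rng = np.random.default_rng(0)
preds = [rng.normal(size=(4, 8)) for _ in range(4)]    # 4 supervised layers
targets = [rng.normal(size=(4, 8)) for _ in range(4)]
loss = deep_supervision_loss(preds, targets)
```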

Loss functions across the family

The prediction loss varies across JEPA variants, reflecting different trade-offs:

| Method | Loss | Rationale |
| --- | --- | --- |
| [I-JEPA](/wiki/papers/2301.08243) | L2 (MSE) | Standard, symmetric |
| [V-JEPA 2](/wiki/papers/2506.09985) | L1 | More robust to outliers, sharper predictions |
| [Point-JEPA](/wiki/papers/2404.16432) | Smooth L1 | Combines L1 robustness with L2 smoothness near zero |
| [3D-JEPA](/wiki/papers/2409.15803) | Cosine similarity | Normalizes magnitude, focuses on direction |
| [LeWorldModel](/wiki/papers/2603.19312) | L2 (MSE) | Paired with SIGReg for stable training |
| [C-JEPA](/wiki/papers/2602.11389) | L2 (MSE) | On object-centric slots |
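These losses differ only in how the residual `pred - target` is penalized. A side-by-side sketch in numpy (the `beta` threshold for smooth L1 is a common default, not a value from the papers):

```python
import numpy as np

def l2_loss(pred, tgt):
    return np.mean((pred - tgt) ** 2)

def l1_loss(pred, tgt):
    return np.mean(np.abs(pred - tgt))

def smooth_l1_loss(pred, tgt, beta=1.0):
    d = np.abs(pred - tgt)
    # quadratic near zero (like L2), linear for large residuals (like L1)
    return np.mean(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta))

def cosine_loss(pred, tgt, eps=1e-8):
    # 1 - cosine similarity, averaged over the batch: ignores magnitude,
    # compares only the direction of the embeddings
    num = np.sum(pred * tgt, axis=-1)
    den = np.linalg.norm(pred, axis=-1) * np.linalg.norm(tgt, axis=-1) + eps
    return np.mean(1.0 - num / den)

p = np.array([[1.0, 2.0]])
t = np.array([[1.5, 1.0]])
# l2_loss(p, t) == 0.625, l1_loss(p, t) == 0.75
```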

The five paradigms compared

| | Autoregressive LLM | Generative (MAE/Diffusion) | Contrastive (DINO/CLIP) | JEPA | NEPA |
| --- | --- | --- | --- | --- | --- |
| Predicts | Next token | Pixels | Nothing (matches views) | Latent repr. | Next embedding |
| Prediction space | Input (discrete) | Input (continuous) | N/A | Latent | Latent |
| Direction | Causal (L→R) | Bidirectional | N/A | Bidirectional | Causal (L→R) |
| Augmentations | N/A | Sometimes | Required | None | None |
| Low-level details | Must predict all | Must reconstruct all | Discards | Only if predictable | Only if predictable |
| Can generate | Yes (text, code) | Yes (images, video) | No | No | No |
| Can plan | Via CoT (slow) | Via generation (slow) | No | Yes (fast) | Not demonstrated |
| World model | Implicit | Implicit | None | Explicit (predictor) | Implicit |
| Collapse risk | None | None | Managed (negatives) | Yes (central) | Managed (stop-grad) |
| Modalities | Primarily language | Images, video, audio | Images, text | Any | Images |

The key insight: autoregressive and generative models pay a tax for operating in input space — they must model every detail, predictable or not. JEPA and NEPA avoid this tax by operating in latent space, at the cost of not being able to generate outputs. This is the right trade-off for world models (where you need fast, abstract prediction) but the wrong trade-off for content generation (where you need the output).

NEPA is particularly interesting because it applies the autoregressive prediction pattern (next-token, causal masking) but in embedding space — proving that the key ingredient is latent prediction, not the specific masking pattern. This suggests the autoregressive and JEPA paradigms may ultimately converge.

The convergence question

The JEPA and autoregressive families are converging from both directions:

  • JEPA → language: LLM-JEPA adds a JEPA objective to standard LLM training, improving finetuning on multiple model families
  • JEPA → generation: VL-JEPA predicts text embeddings and uses a lightweight decoder only when text output is needed — a hybrid approach
  • Autoregressive → JEPA: NEPA applies GPT-style causal prediction but in embedding space
  • VLMs → JEPA: ThinkJEPA uses a VLM (Qwen3-VL) as a "thinker" to guide a JEPA world model with semantic reasoning

The open question: will these paradigms fully converge into a single framework, or will latent prediction and token prediction remain complementary tools for different aspects of intelligence?

See also