JEPAwiki

JEPA: The Evolution from Perception to World Modeling

Joint-Embedding Predictive Architecture (JEPA) is a self-supervised learning framework proposed by Yann LeCun in his 2022 position paper A Path Towards Autonomous Machine Intelligence. As Saining Xie (co-founder of AMI Labs, the company building JEPA into a product) puts it: "JEPA is not a model. JEPA is not a specific algorithm. JEPA is a complete cognitive architecture." It represents a fundamentally different bet on how AI systems should learn about the world — one that breaks from the dominant autoregressive paradigm that powers today's LLMs.

What JEPA is

JEPA learns by predicting representations of missing or future inputs in a learned latent space. Given a partial observation (e.g., some visible patches of an image, or the first few frames of a video), a predictor network estimates the representation of the missing part — not the raw pixels, not discrete tokens, but a continuous embedding produced by a target encoder.

This is a form of self-supervised learning: the model creates its own training signal from unlabeled data by masking parts of the input and predicting their latent representations.
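Stripped to its skeleton, one training step can be sketched in a few lines. This is an illustrative toy, not any released implementation: the linear matrices `W_ctx`, `W_tgt`, and `W_pred` are stand-ins for the ViT context encoder, the EMA target encoder, and the predictor network.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_lat, n_patches = 16, 8, 10

W_ctx = rng.standard_normal((D_in, D_lat)) * 0.1  # context encoder (toy)
W_tgt = W_ctx.copy()                              # target encoder (EMA copy)
W_pred = np.eye(D_lat)                            # predictor (toy)

patches = rng.standard_normal((n_patches, D_in))
mask = np.zeros(n_patches, dtype=bool)
mask[6:] = True                                   # hide the last 4 patches

# Target: embeddings of the masked patches from the target encoder
# (treated as fixed -- no gradient flows through it).
targets = patches[mask] @ W_tgt

# Prediction: pool the visible context, then predict each masked embedding.
context = (patches[~mask] @ W_ctx).mean(axis=0)
preds = np.tile(context @ W_pred, (mask.sum(), 1))

# The loss lives entirely in latent space; no pixels are reconstructed.
loss = np.mean((preds - targets) ** 2)

# The target encoder tracks the context encoder via EMA, one common
# defense against representation collapse.
momentum = 0.99
W_tgt = momentum * W_tgt + (1 - momentum) * W_ctx
```

The essential asymmetry is that `targets` come from a slowly-moving encoder while `preds` come from the trained one; without some such mechanism, both sides could collapse to a constant.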

What JEPA is not

JEPA is not an autoregressive model. Unlike GPT, LLaMA, or any standard LLM, JEPA does not predict the next token in a sequence. It does not operate in discrete token space. It does not generate outputs one element at a time. This is a deliberate design choice, not a limitation.

JEPA is not a generative model. Unlike MAE, diffusion models, or video generators, JEPA never reconstructs the raw input. It never produces pixels. It does not need a decoder. This means JEPA cannot "show you" what it predicts — its predictions exist only in representation space.

JEPA is not a contrastive model. Unlike SimCLR, CLIP, or DINO, JEPA does not compare positive and negative pairs. It does not require data augmentations to create different "views." It does not push apart representations of different samples.

Why not just use autoregressive prediction?

LeCun's core argument against autoregressive token prediction for world modeling:

1. Token prediction wastes capacity on the unpredictable. When an LLM predicts the next token, it must assign probability to every possible continuation — including irrelevant surface variation (word choice, formatting, pixel noise). Most of the prediction budget is spent modeling uncertainty that has nothing to do with understanding the world. JEPA sidesteps this by predicting in a space where the target encoder has already abstracted away the unpredictable.

2. Discrete tokens are a lossy bottleneck. Tokenization (BPE for text, VQ-VAE for images) compresses continuous reality into a small discrete vocabulary. This discards fine-grained information that may matter for downstream tasks. JEPA predicts continuous embeddings, preserving the full richness of the representation.

3. Autoregressive generation is fundamentally sequential. Each token depends on all previous tokens, making generation inherently slow. JEPA's masked prediction is parallel — all masked regions are predicted simultaneously from the visible context.

4. Next-token prediction doesn't naturally give you a world model. An LLM can predict what text comes next, but it doesn't learn a compact, manipulable model of the world that supports planning. V-JEPA 2 demonstrated that JEPA naturally yields a world model: freeze the encoder, add action conditioning, and you get a robot controller that plans in latent space — something no autoregressive LLM has achieved from self-supervised pretraining alone.

5. Scaling laws may plateau differently. Autoregressive models improve by predicting more tokens from more data. JEPA improves by predicting better representations — the quality of what's predicted matters more than the quantity. LeJEPA showed that principled regularization (SIGReg) achieves 79% on ImageNet with ViT-H/14 using just 2 loss terms and ~50 lines of code, versus complex multi-term recipes. This doesn't match V-JEPA 2.1's 85.5% (which uses EMA + deep self-supervision at 2B scale), but it demonstrates that most of the gap can be closed with far less complexity — and the scaling behavior may differ at larger scales.
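Point 5's "two loss terms" can be made concrete. The sketch below is a simplification, not LeJEPA's actual SIGReg (which tests random 1-D projections of the embeddings for Gaussianity); a plain moment-matching penalty toward zero mean and identity covariance stands in for it, but the two-term structure of prediction loss plus isotropy regularizer is the point.

```python
import numpy as np

def two_term_jepa_loss(preds, targets, embeddings, reg_weight=1.0):
    """Prediction term plus isotropy regularizer (simplified SIGReg stand-in)."""
    pred_loss = np.mean((preds - targets) ** 2)
    # Penalize deviation from an isotropic Gaussian: zero mean, identity cov.
    mu = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False)
    reg = np.sum(mu ** 2) + np.sum((cov - np.eye(cov.shape[0])) ** 2)
    return pred_loss + reg_weight * reg

rng = np.random.default_rng(0)
healthy = rng.standard_normal((512, 8))   # spread-out embeddings
collapsed = np.ones((512, 8))             # every embedding identical

# Prediction error is zero in both cases, so only the regularizer differs:
# collapse is heavily penalized even though a pure prediction loss is blind to it.
loss_healthy = two_term_jepa_loss(healthy, healthy, healthy)
loss_collapsed = two_term_jepa_loss(collapsed, collapsed, collapsed)
```

This illustrates why a distributional regularizer can replace heuristic recipes: it directly rules out the degenerate solution instead of discouraging it indirectly.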

The counterargument is clear too: autoregressive models have scaled spectacularly and can be prompted in natural language. JEPA cannot generate text or images. The bet is that latent prediction will prove more efficient for learning world models, while autoregressive generation may remain the right tool for language production. LLM-JEPA and VL-JEPA show early signs that the two paradigms can be combined.

Complete timeline (18 papers)

Phase 1: Foundations (2022-2023)

  1. JEPA / H-JEPA (Jun 2022) — LeCun's position paper. Defines prediction in representation space; H-JEPA adds hierarchical, multi-timescale world modeling. Not on arXiv (OpenReview only).
  2. I-JEPA (Jan 2023) — First concrete success. Semantic image representations without augmentations. Proved JEPA is practical and scalable with ViT-Huge/14.
  3. MC-JEPA (Jul 2023) — Jointly learns optical flow (motion) and content features in a shared encoder. Early step toward dynamic understanding.

Phase 2: Modality expansion & world models (2024)

  1. IWM (Mar 2024) — Image World Models. Extends JEPA to predict photometric transformations in latent space. Key insight: the predictor (world model) should be reused, not discarded.
  2. V-JEPA (Apr 2024) — The leap from images to video. Feature prediction as a stand-alone objective on 2M videos. 81.9% K400, 72.2% SSv2 with frozen backbone.
  3. Audio-JEPA (arXiv 2507.02915) — Proves JEPA is modality-general. Time-frequency masking on audio spectrograms. (Not on HF Papers)
  4. Point-JEPA (Apr 2024) — Adapts JEPA to point clouds with a sequencer for spatial ordering. 93.7% on ModelNet40.
  5. 3D-JEPA (Sep 2024) — Broader 3D representation learning with context-aware decoder. Superior efficiency (150 vs 300 epochs).

Phase 3: Action, language, and theory (2025)

  1. ACT-JEPA (Jan 2025) — Bridge to policy learning. Dual prediction of action sequences and latent observations via action chunking.
  2. V-JEPA 2 (Jun 2025) — THE world-model milestone. 1M+ hours pretraining, zero-shot robot planning on Franka arms. Understanding, prediction, and planning unified.
  3. LLM-JEPA (Sep 2025) — JEPA for large language models. Outperforms standard LLM training across Llama3, Gemma2, OpenELM, Olmo on multiple datasets.
  4. LeJEPA (Nov 2025) — The theory paper. Proves isotropic Gaussian is optimal for JEPA embeddings. Introduces SIGReg. Heuristic-free, ~50 lines of code. 79% ImageNet with ViT-H/14.
  5. VL-JEPA (Dec 2025) — Vision-language JEPA. Predicts text embeddings instead of tokens. 50% fewer parameters, 2.85x fewer decoding ops, 1.6B params.

Phase 4: Causal reasoning, dense features, and scaling (2026)

  1. EB-JEPA (Feb 2026) — Open-source library making JEPA accessible. Image SSL to video to planning, single-GPU training.
  2. C-JEPA (Feb 2026) — Object-centric causal reasoning. Object-level masking induces causal inductive bias. +21% on counterfactual reasoning, 1% token budget.
  3. V-JEPA 2.1 (Mar 2026) — Dense feature upgrade. All tokens contribute to loss. SOTA on robotics, depth, navigation. +23 mIoU on segmentation.
  4. LeWorldModel (Mar 2026) — Minimal stable JEPA from pixels. 2 loss terms, 15M params, single GPU. Plans 48x faster than foundation models.
  5. ThinkJEPA (Mar 2026) — JEPA + VLM reasoning. Dual-temporal pathway for long-horizon semantics.

Related work

  • NEPA (Dec 2025, Saining Xie) — Next-Embedding Predictive Autoregression. Causal (GPT-style) instead of masked (BERT-style) prediction in embedding space. 85.3% ImageNet ViT-L. Validates that embedding-space prediction is the key, not the masking pattern.
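The masked-vs-causal distinction is easy to state in code. A hedged toy follows: a running prefix mean stands in for a causal transformer and a random linear map `W` for the predictor; neither is NEPA's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 6, 8
emb = rng.standard_normal((T, D))       # patch embeddings in raster order
W = rng.standard_normal((D, D)) * 0.1   # toy causal predictor

# NEPA-style objective: predict embedding t+1 from the causal prefix 1..t,
# rather than predicting masked slots from bidirectional context.
prefix_mean = np.cumsum(emb[:-1], axis=0) / np.arange(1, T)[:, None]
preds = np.tanh(prefix_mean @ W)
loss = np.mean((preds - emb[1:]) ** 2)
```

The loss is still measured between embeddings, which is the shared ingredient; only the conditioning pattern (causal prefix vs. bidirectional context) changes.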

JEPA vs the autoregressive world

The AI landscape in 2023-2026 is dominated by autoregressive transformers — GPT-4, LLaMA, Gemini — that predict the next token. JEPA represents an alternative path:

                 Autoregressive LLMs              JEPA
Predicts         Next discrete token              Continuous latent embedding
Space            Input/token space                Learned representation space
Generation       Yes (text, pixels)               No (representations only)
Planning         Via chain-of-thought (text)      Via latent rollout (fast)
World model      Implicit in weights              Explicit (predictor network)
Collapse risk    None                             Central challenge
Augmentations    N/A                              Not needed
Modalities       Primarily language               Any encodable input
Waste            Predicts unpredictable details   Abstracts away noise
Speed            Sequential generation            Parallel prediction
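The speed contrast is worth making concrete. In this toy sketch (a one-layer map, not a real transformer), autoregressive decoding needs T dependent steps, while masked latent prediction fills every slot in one batched pass:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 4
W = rng.standard_normal((D, D)) * 0.1
ctx = rng.standard_normal(D)            # pooled visible-context embedding

# Autoregressive: T sequential steps; step t consumes step t-1's output,
# so the steps cannot be parallelized.
x, steps = ctx, []
for _ in range(T):
    x = np.tanh(x @ W)
    steps.append(x)
ar_out = np.stack(steps)

# Masked latent prediction: all T masked slots predicted simultaneously
# from the same context, as a single batched matrix multiply.
queries = rng.standard_normal((T, D))   # one learnable query per masked slot
parallel_out = np.tanh((queries + ctx) @ W)
```

Both produce T predictions, but only the second is a single parallel operation; on real hardware that difference compounds with sequence length.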

Neither paradigm is strictly superior. Autoregressive models excel at language generation and in-context learning. JEPA excels at learning representations, building world models, and planning. The frontier — LLM-JEPA, VL-JEPA, ThinkJEPA — is where the two paradigms meet.
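What "planning in latent space" means can be sketched with a random-shooting planner. Everything here is illustrative: `predictor` is a hypothetical stand-in for a frozen action-conditioned world model in the spirit of V-JEPA 2, and the encodings are random vectors rather than real image embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, K = 6, 4, 64        # latent dim, planning horizon, candidate sequences
A = rng.standard_normal((2, D)) * 0.3

def predictor(z, a):
    """Toy frozen world model: next latent from current latent + 2-D action."""
    return np.tanh(z + a @ A)

z0 = rng.standard_normal(D)      # encoder output for the current observation
z_goal = rng.standard_normal(D)  # encoder output for a goal image

# Random-shooting planner: roll each candidate action sequence forward in
# latent space and keep the one whose final latent lands nearest the goal.
candidates = rng.standard_normal((K, H, 2))
scores = np.empty(K)
for k, seq in enumerate(candidates):
    z = z0
    for a in seq:
        z = predictor(z, a)
    scores[k] = np.sum((z - z_goal) ** 2)
best_plan = candidates[int(np.argmin(scores))]   # action sequence to execute
```

No pixels are generated at any point: both the goal and every imagined future exist only as embeddings, which is what makes the rollout cheap.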

Key themes

  • Prediction in latent space: the unifying principle that separates JEPA from autoregressive and generative approaches
  • Masking strategy drives what you learn: from patches to objects to all-token losses to causal ordering
  • Collapse prevention is the central challenge: the price of abandoning reconstruction — addressed by EMA, SIGReg, and deep self-supervision
  • Modality generality: images -> video -> audio -> 3D -> point clouds -> language -> vision-language
  • From perception to planning: the path autoregressive models haven't taken — I-JEPA (static) -> V-JEPA 2 (zero-shot robot control)
  • The predictor is a world model: IWM showed it shouldn't be discarded; V-JEPA 2 showed it enables planning; VL-JEPA showed it can replace autoregressive generation

The trajectory

The JEPA family traces an arc that autoregressive models have not: static perception -> dynamic understanding -> world modeling -> causal reasoning -> language-guided planning. Each step adds a capability while preserving the core principle of latent prediction. The open question is whether this arc converges with the autoregressive path — or renders it unnecessary for embodied intelligence.