JEPAwiki

JEPA: The Evolution from Perception to World Modeling

Joint-Embedding Predictive Architecture (JEPA) is a self-supervised learning framework proposed by Yann LeCun in his 2022 position paper A Path Towards Autonomous Machine Intelligence. As Saining Xie (co-founder of AMI Labs, the company building JEPA into a product) puts it: "JEPA is not a model. JEPA is not a specific algorithm. JEPA is a complete cognitive architecture." It represents a fundamentally different bet on how AI systems should learn about the world — one that breaks from the dominant autoregressive paradigm that powers today's LLMs.

What JEPA is

JEPA learns by predicting representations of missing or future inputs in a learned latent space. Given a partial observation (e.g., some visible patches of an image, or the first few frames of a video), a predictor network estimates the representation of the missing part — not the raw pixels, not discrete tokens, but a continuous embedding produced by a target encoder.

This is a form of self-supervised learning: the model creates its own training signal from unlabeled data by masking parts of the input and predicting their latent representations.
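Stripped to its skeleton, one training step can be sketched in a few lines. This is an illustrative toy, not any released implementation: the linear matrices `W_ctx`, `W_tgt`, and `W_pred` are stand-ins for the ViT context encoder, the EMA target encoder, and the predictor network.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_lat, n_patches = 16, 8, 10

W_ctx = rng.standard_normal((D_in, D_lat)) * 0.1  # context encoder (toy)
W_tgt = W_ctx.copy()                              # target encoder (EMA copy)
W_pred = np.eye(D_lat)                            # predictor (toy)

patches = rng.standard_normal((n_patches, D_in))
mask = np.zeros(n_patches, dtype=bool)
mask[6:] = True                                   # hide the last 4 patches

# Target: embeddings of the masked patches from the target encoder
# (treated as fixed -- no gradient flows through it).
targets = patches[mask] @ W_tgt

# Prediction: pool the visible context, then predict each masked embedding.
context = (patches[~mask] @ W_ctx).mean(axis=0)
preds = np.tile(context @ W_pred, (mask.sum(), 1))

# The loss lives entirely in latent space; no pixels are reconstructed.
loss = np.mean((preds - targets) ** 2)

# The target encoder tracks the context encoder via EMA, one common
# defense against representation collapse.
momentum = 0.99
W_tgt = momentum * W_tgt + (1 - momentum) * W_ctx
```

The essential asymmetry is that `targets` come from a slowly-moving encoder while `preds` come from the trained one; without some such mechanism, both sides could collapse to a constant.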

What JEPA is not

JEPA is not an autoregressive model. Unlike GPT, LLaMA, or any standard LLM, JEPA does not predict the next token in a sequence. It does not operate in discrete token space. It does not generate outputs one element at a time. This is a deliberate design choice, not a limitation.

JEPA is not a generative model. Unlike MAE, diffusion models, or video generators, JEPA never reconstructs the raw input. It never produces pixels. It does not need a decoder. This means JEPA cannot "show you" what it predicts — its predictions exist only in representation space.

JEPA is not a contrastive model. Unlike SimCLR, CLIP, or DINO, JEPA does not compare positive and negative pairs. It does not require data augmentations to create different "views." It does not push apart representations of different samples.

Why not just use autoregressive prediction?

LeCun's core argument against autoregressive token prediction for world modeling:

1. Token prediction wastes capacity on the unpredictable. When an LLM predicts the next token, it must assign probability to every possible continuation — including irrelevant surface variation (word choice, formatting, pixel noise). Most of the prediction budget is spent modeling uncertainty that has nothing to do with understanding the world. JEPA sidesteps this by predicting in a space where the target encoder has already abstracted away the unpredictable.

2. Discrete tokens are a lossy bottleneck. Tokenization (BPE for text, VQ-VAE for images) compresses continuous reality into a small discrete vocabulary. This discards fine-grained information that may matter for downstream tasks. JEPA predicts continuous embeddings, preserving the full richness of the representation.

3. Autoregressive generation is fundamentally sequential. Each token depends on all previous tokens, making generation inherently slow. JEPA's masked prediction is parallel — all masked regions are predicted simultaneously from the visible context.

4. Next-token prediction doesn't naturally give you a world model. An LLM can predict what text comes next, but it doesn't learn a compact, manipulable model of the world that supports planning. V-JEPA 2 demonstrated that JEPA naturally yields a world model: freeze the encoder, add action conditioning, and you get a robot controller that plans in latent space — something no autoregressive LLM has achieved from self-supervised pretraining alone.

5. Scaling laws may plateau differently. Autoregressive models improve by predicting more tokens from more data. JEPA improves by predicting better representations — the quality of what's predicted matters more than the quantity. LeJEPA showed that principled regularization (SIGReg) achieves 79% on ImageNet with ViT-H/14 using just 2 loss terms and ~50 lines of code, versus complex multi-term recipes. This doesn't match V-JEPA 2.1's 85.5% (which uses EMA + deep self-supervision at 2B scale), but it demonstrates that most of the gap can be closed with far less complexity — and the scaling behavior may differ at larger scales.
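Point 5's "two loss terms" can be made concrete. The sketch below is a simplification, not LeJEPA's actual SIGReg (which tests random 1-D projections of the embeddings for Gaussianity); a plain moment-matching penalty toward zero mean and identity covariance stands in for it, but the two-term structure of prediction loss plus isotropy regularizer is the point.

```python
import numpy as np

def two_term_jepa_loss(preds, targets, embeddings, reg_weight=1.0):
    """Prediction term plus isotropy regularizer (simplified SIGReg stand-in)."""
    pred_loss = np.mean((preds - targets) ** 2)
    # Penalize deviation from an isotropic Gaussian: zero mean, identity cov.
    mu = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False)
    reg = np.sum(mu ** 2) + np.sum((cov - np.eye(cov.shape[0])) ** 2)
    return pred_loss + reg_weight * reg

rng = np.random.default_rng(0)
healthy = rng.standard_normal((512, 8))   # spread-out embeddings
collapsed = np.ones((512, 8))             # every embedding identical

# Prediction error is zero in both cases, so only the regularizer differs:
# collapse is heavily penalized even though a pure prediction loss is blind to it.
loss_healthy = two_term_jepa_loss(healthy, healthy, healthy)
loss_collapsed = two_term_jepa_loss(collapsed, collapsed, collapsed)
```

This illustrates why a distributional regularizer can replace heuristic recipes: it directly rules out the degenerate solution instead of discouraging it indirectly.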

The counterargument is clear too: autoregressive models have scaled spectacularly and can be prompted in natural language. JEPA cannot generate text or images. The bet is that latent prediction will prove more efficient for learning world models, while autoregressive generation may remain the right tool for language production. LLM-JEPA and VL-JEPA show early signs that the two paradigms can be combined.

Complete timeline (18 papers)

Phase 1: Foundations (2022-2023)

  1. JEPA / H-JEPA (Jun 2022) — LeCun's position paper. Defines prediction in representation space; H-JEPA adds hierarchical, multi-timescale world modeling. Not on arXiv (OpenReview only).
  2. I-JEPA (Jan 2023) — First concrete success. Semantic image representations without augmentations. Proved JEPA is practical and scalable with ViT-Huge/14.
  3. MC-JEPA (Jul 2023) — Jointly learns optical flow (motion) and content features in a shared encoder. Early step toward dynamic understanding.

Phase 2: Modality expansion & world models (2024)

  1. IWM (Mar 2024) — Image World Models. Extends JEPA to predict photometric transformations in latent space. Key insight: the predictor (world model) should be reused, not discarded.
  2. V-JEPA (Apr 2024) — The leap from images to video. Feature prediction as a stand-alone objective on 2M videos. 81.9% K400, 72.2% SSv2 with frozen backbone.
  3. Audio-JEPA (arXiv 2507.02915) — Proves JEPA is modality-general. Time-frequency masking on audio spectrograms. (Not on HF Papers)
  4. Point-JEPA (Apr 2024) — Adapts JEPA to point clouds with a sequencer for spatial ordering. 93.7% on ModelNet40.
  5. 3D-JEPA (Sep 2024) — Broader 3D representation learning with context-aware decoder. Superior efficiency (150 vs 300 epochs).

Phase 3: Action, language, and theory (2025)

  1. ACT-JEPA (Jan 2025) — Bridge to policy learning. Dual prediction of action sequences and latent observations via action chunking.
  2. V-JEPA 2 (Jun 2025) — THE world-model milestone. 1M+ hours pretraining, zero-shot robot planning on Franka arms. Understanding, prediction, and planning unified.
  3. LLM-JEPA (Sep 2025) — JEPA for large language models. Outperforms standard LLM training across Llama3, Gemma2, OpenELM, Olmo on multiple datasets.
  4. LeJEPA (Nov 2025) — The theory paper. Proves isotropic Gaussian is optimal for JEPA embeddings. Introduces SIGReg. Heuristic-free, ~50 lines of code. 79% ImageNet with ViT-H/14.
  5. VL-JEPA (Dec 2025) — Vision-language JEPA. Predicts text embeddings instead of tokens. 50% fewer parameters, 2.85x fewer decoding ops, 1.6B params.

Phase 4: Causal reasoning, dense features, and scaling (2026)

  1. EB-JEPA (Feb 2026) — Open-source library making JEPA accessible. Image SSL to video to planning, single-GPU training.
  2. C-JEPA (Feb 2026) — Object-centric causal reasoning. Object-level masking induces causal inductive bias. +21% on counterfactual reasoning, 1% token budget.
  3. V-JEPA 2.1 (Mar 2026) — Dense feature upgrade. All tokens contribute to loss. SOTA on robotics, depth, navigation. +23 mIoU on segmentation.
  4. LeWorldModel (Mar 2026) — Minimal stable JEPA from pixels. 2 loss terms, 15M params, single GPU. Plans 48x faster than foundation models.
  5. ThinkJEPA (Mar 2026) — JEPA + VLM reasoning. Dual-temporal pathway for long-horizon semantics.

Related work

  • NEPA (Dec 2025, Saining Xie) — Next-Embedding Predictive Autoregression. Causal (GPT-style) instead of masked (BERT-style) prediction in embedding space. 85.3% ImageNet ViT-L. Validates that embedding-space prediction is the key, not the masking pattern.
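The masked-vs-causal distinction is easy to state in code. A hedged toy follows: a running prefix mean stands in for a causal transformer and a random linear map `W` for the predictor; neither is NEPA's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 6, 8
emb = rng.standard_normal((T, D))       # patch embeddings in raster order
W = rng.standard_normal((D, D)) * 0.1   # toy causal predictor

# NEPA-style objective: predict embedding t+1 from the causal prefix 1..t,
# rather than predicting masked slots from bidirectional context.
prefix_mean = np.cumsum(emb[:-1], axis=0) / np.arange(1, T)[:, None]
preds = np.tanh(prefix_mean @ W)
loss = np.mean((preds - emb[1:]) ** 2)
```

The loss is still measured between embeddings, which is the shared ingredient; only the conditioning pattern (causal prefix vs. bidirectional context) changes.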

JEPA vs the autoregressive world

The AI landscape in 2023-2026 is dominated by autoregressive transformers — GPT-4, LLaMA, Gemini — that predict the next token. JEPA represents an alternative path:

                 Autoregressive LLMs              JEPA
Predicts         Next discrete token              Continuous latent embedding
Space            Input/token space                Learned representation space
Generation       Yes (text, pixels)               No (representations only)
Planning         Via chain-of-thought (text)      Via latent rollout (fast)
World model      Implicit in weights              Explicit (predictor network)
Collapse risk    None                             Central challenge
Augmentations    N/A                              Not needed
Modalities       Primarily language               Any encodable input
Waste            Predicts unpredictable details   Abstracts away noise
Speed            Sequential generation            Parallel prediction
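The speed contrast is worth making concrete. In this toy sketch (a one-layer map, not a real transformer), autoregressive decoding needs T dependent steps, while masked latent prediction fills every slot in one batched pass:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 4
W = rng.standard_normal((D, D)) * 0.1
ctx = rng.standard_normal(D)            # pooled visible-context embedding

# Autoregressive: T sequential steps; step t consumes step t-1's output,
# so the steps cannot be parallelized.
x, steps = ctx, []
for _ in range(T):
    x = np.tanh(x @ W)
    steps.append(x)
ar_out = np.stack(steps)

# Masked latent prediction: all T masked slots predicted simultaneously
# from the same context, as a single batched matrix multiply.
queries = rng.standard_normal((T, D))   # one learnable query per masked slot
parallel_out = np.tanh((queries + ctx) @ W)
```

Both produce T predictions, but only the second is a single parallel operation; on real hardware that difference compounds with sequence length.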

Neither paradigm is strictly superior. Autoregressive models excel at language generation and in-context learning. JEPA excels at learning representations, building world models, and planning. The frontier — LLM-JEPA, VL-JEPA, ThinkJEPA — is where the two paradigms meet.
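What "planning in latent space" means can be sketched with a random-shooting planner. Everything here is illustrative: `predictor` is a hypothetical stand-in for a frozen action-conditioned world model in the spirit of V-JEPA 2, and the encodings are random vectors rather than real image embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, K = 6, 4, 64        # latent dim, planning horizon, candidate sequences
A = rng.standard_normal((2, D)) * 0.3

def predictor(z, a):
    """Toy frozen world model: next latent from current latent + 2-D action."""
    return np.tanh(z + a @ A)

z0 = rng.standard_normal(D)      # encoder output for the current observation
z_goal = rng.standard_normal(D)  # encoder output for a goal image

# Random-shooting planner: roll each candidate action sequence forward in
# latent space and keep the one whose final latent lands nearest the goal.
candidates = rng.standard_normal((K, H, 2))
scores = np.empty(K)
for k, seq in enumerate(candidates):
    z = z0
    for a in seq:
        z = predictor(z, a)
    scores[k] = np.sum((z - z_goal) ** 2)
best_plan = candidates[int(np.argmin(scores))]   # action sequence to execute
```

No pixels are generated at any point: both the goal and every imagined future exist only as embeddings, which is what makes the rollout cheap.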

Key themes

  • Prediction in latent space: the unifying principle that separates JEPA from autoregressive and generative approaches
  • Masking strategy drives what you learn: from patches to objects to all-token losses to causal ordering
  • Collapse prevention is the central challenge: the price of abandoning reconstruction — addressed by EMA, SIGReg, and deep self-supervision
  • Modality generality: images -> video -> audio -> 3D -> point clouds -> language -> vision-language
  • From perception to planning: the path autoregressive models haven't taken — I-JEPA (static) -> V-JEPA 2 (zero-shot robot control)
  • The predictor is a world model: IWM showed it shouldn't be discarded; V-JEPA 2 showed it enables planning; VL-JEPA showed it can replace autoregressive generation

The trajectory

The JEPA family traces an arc that autoregressive models have not: static perception -> dynamic understanding -> world modeling -> causal reasoning -> language-guided planning. Each step adds a capability while preserving the core principle of latent prediction. The open question is whether this arc converges with the autoregressive path — or renders it unnecessary for embodied intelligence.