A Path Towards Autonomous Machine Intelligence

Yann LeCun's 2022 position paper — the conceptual root of the entire JEPA family. This 62-page document is not a traditional research paper but a blueprint for autonomous intelligence, proposing how to build machines that learn like animals: through observation, world models, and intrinsic motivation, rather than through massive labeled datasets or reward engineering.

The three challenges

LeCun identifies three open problems that AI must solve:

  1. How can machines learn to represent the world largely by observation? Real-world interaction is expensive and dangerous. Agents should learn as much as possible passively, minimizing the need for trial-and-error.

  2. How can machines reason and plan in ways compatible with gradient-based learning? Our best learning methods rely on gradients, which require differentiable architectures, and differentiability is hard to reconcile with symbolic reasoning.

  3. How can machines learn hierarchical representations at multiple levels of abstraction and time scales? Long-term planning requires decomposing complex actions into sequences of simpler ones at different granularities.

The cognitive architecture

The paper proposes a modular, fully differentiable architecture with six components:

| Module | Function | Trainable? |
|---|---|---|
| Configurator | Executive control: configures all other modules for the current task | Yes |
| Perception | Estimates current world state from sensors | Yes |
| World Model | Predicts future world states given actions (the "simulator") | Yes |
| Cost | Computes scalar "energy" measuring agent discomfort | Partially |
| Actor | Proposes action sequences, optimizes via world model | Yes |
| Short-term Memory | Stores states and costs for planning and critic training | No (storage) |

The cost module has two sub-components:

  • Intrinsic Cost: hard-wired, immutable — basic drives (pain, hunger, curiosity, social bonding)
  • Trainable Critic: learns to predict future intrinsic costs — the learned value function

Mode-1 vs Mode-2 (System 1 vs System 2)

The architecture supports two operating modes, explicitly inspired by Kahneman:

  • Mode-1 (reactive): perception → policy → action. No world model involved. Fast, habitual, like reflexes. Analogous to System 1.
  • Mode-2 (deliberative): perception → world model rollout → cost evaluation → action optimization. Slow, effortful, like planning. Analogous to System 2. This is Model-Predictive Control (MPC) with learned models.

A key insight: Mode-2 results can train Mode-1. After deliberative planning finds an optimal action sequence, the reactive policy module is trained to approximate it — amortized inference. This mirrors how humans automate initially deliberate skills.
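The Mode-2 loop and its amortization into Mode-1 can be sketched in a few lines. This is a deliberately minimal 1-D toy, not the paper's architecture: the world model, cost, candidate grid, and linear policy are all illustrative stand-ins.

```python
import numpy as np

def world_model(state, action):
    return state + action            # toy predicted next state

def cost(state, goal=5.0):
    return (state - goal) ** 2       # toy intrinsic "discomfort"

def mode2_plan(state, candidates=np.linspace(-1, 1, 201)):
    """Deliberative: roll candidate actions through the world model,
    pick the one whose predicted next state has the lowest cost."""
    costs = [cost(world_model(state, a)) for a in candidates]
    return candidates[int(np.argmin(costs))]

# Amortized inference: fit a cheap reactive policy a = w*s + b
# to the planner's outputs, so Mode-1 approximates Mode-2.
states = np.linspace(-10.0, 10.0, 50)
actions = np.array([mode2_plan(s) for s in states])
w, b = np.polyfit(states, actions, 1)

def mode1_policy(state):
    """Reactive: one cheap forward pass, no world-model rollout."""
    return w * state + b
```

Once trained, `mode1_policy` answers in constant time where `mode2_plan` had to search, which is the sense in which deliberate skills become habitual.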

JEPA: The Joint Embedding Predictive Architecture

The centerpiece of the paper. JEPA is defined formally:

E(x, y, z) = D(s_y, Pred(s_x, z))

Where:

  • s_x = Enc_x(x) — encoding of input x
  • s_y = Enc_y(y) — encoding of target y
  • Pred(s_x, z) — predictor mapping from x-representation to y-representation, conditioned on latent z
  • D — distance function (the energy)
  • z — latent variable capturing information needed to predict s_y that is not in s_x
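The definition above can be made concrete with toy linear maps. Everything below is an illustrative stand-in (in the paper, the encoders and predictor are trained networks); D is taken to be squared Euclidean distance, and inference minimizes the energy over a grid of candidate latents.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed random linear maps standing in for trained networks.
W_x = rng.standard_normal((4, 8))   # Enc_x: 8-dim input  -> 4-dim s_x
W_y = rng.standard_normal((4, 8))   # Enc_y: 8-dim target -> 4-dim s_y
W_p = rng.standard_normal((4, 5))   # Pred: [s_x; z] (4+1) -> 4-dim prediction

def energy(x, y, z):
    """E(x, y, z) = D(s_y, Pred(s_x, z)) with D = squared distance."""
    s_x = W_x @ x
    s_y = W_y @ y
    pred = W_p @ np.concatenate([s_x, [z]])
    return float(np.sum((s_y - pred) ** 2))

x = rng.standard_normal(8)
y = rng.standard_normal(8)
# Inference minimizes over z: search a grid of candidate latents.
zs = np.linspace(-2.0, 2.0, 41)
E_min = min(energy(x, y, z) for z in zs)
```

The key structural point survives even in this toy: the energy is computed between representations `s_y` and `Pred(s_x, z)`, never between raw `y` and a generated output.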

Why non-generative?

JEPA does not predict y itself — only the representation of y. This means:

  1. The y-encoder can discard irrelevant details (texture, noise, lighting), making the representation more abstract and predictable
  2. Multi-modal dependencies are handled two ways: encoder invariance (many y's map to the same s_y) and latent variable (z captures which of several outcomes occurs)
  3. The model cannot generate outputs — but it gains a powerful way to represent uncertainty without explicit probability distributions

The car-at-a-fork example

LeCun's canonical example: a video of a car approaching a fork in the road.

  • s_x and s_y represent the car's position, velocity, orientation — ignoring irrelevant details like trees and sidewalk texture
  • z is a binary latent: does the car go left (z=0) or right (z=1)?
  • The energy landscape has two low-energy states (left and right), with high energy between them
  • No probability distribution is needed — just an energy function with multiple minima
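The fork example fits in a few lines. The predictor and positions below are illustrative (left = -1, right = +1); what matters is that minimizing over the binary latent yields two zero-energy outcomes with higher energy in between, and no probabilities anywhere.

```python
# Toy fork-in-the-road energy: s_y is the car's observed future position,
# the predictor proposes "left" (-1) for z=0 and "right" (+1) for z=1.

def pred(s_x, z):
    return s_x + (-1.0 if z == 0 else 1.0)   # branch chosen by the latent

def energy(s_x, s_y, z):
    return (s_y - pred(s_x, z)) ** 2

s_x = 0.0
# Inference minimizes over the latent: both branches are low-energy states.
E_left  = min(energy(s_x, -1.0, z) for z in (0, 1))   # car went left
E_right = min(energy(s_x, +1.0, z) for z in (0, 1))   # car went right
E_mid   = min(energy(s_x,  0.0, z) for z in (0, 1))   # "between" branches
print(E_left, E_right, E_mid)   # 0.0 0.0 1.0
```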

Training JEPA (the four criteria)

Non-contrastive training requires four simultaneous objectives:

  1. Maximize information in s_x — prevent the x-encoder from collapsing to constants
  2. Maximize information in s_y — prevent the y-encoder from collapsing
  3. Make s_y predictable from s_x — the prediction loss D(s_y, Pred(s_x, z))
  4. Minimize information in z — prevent the latent from trivially copying s_y (which would make energy zero everywhere = flat energy surface = collapse)

Criteria 1+2 prevent informational collapse. Criterion 4 prevents latent bypass collapse. Together, they force the model to learn representations where the world is predictable — abstracting away unpredictable details into encoder invariances rather than latent variables.

The paper discusses VICReg as a concrete instantiation: variance (prevent constant embeddings), invariance (prediction loss), covariance (decorrelate dimensions).
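A VICReg-style loss with those three terms can be sketched as follows. The term weights, hinge target, and batch shapes are illustrative choices, not the published hyperparameters, and the variance/covariance terms are applied to one branch only for brevity.

```python
import numpy as np

def vicreg_loss(z_a, z_b, eps=1e-4):
    # Invariance: match the two branches (the prediction loss).
    inv = np.mean((z_a - z_b) ** 2)
    # Variance: hinge keeping each dimension's std above 1 (anti-collapse).
    std = np.sqrt(z_a.var(axis=0) + eps)
    var = np.mean(np.maximum(0.0, 1.0 - std))
    # Covariance: push off-diagonal covariance to zero (decorrelation).
    zc = z_a - z_a.mean(axis=0)
    cov = (zc.T @ zc) / (len(z_a) - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_term = np.sum(off_diag ** 2) / z_a.shape[1]
    return inv + var + cov_term

rng = np.random.default_rng(0)
healthy = rng.standard_normal((256, 8))   # spread-out embeddings
collapsed = np.zeros((256, 8))            # constant embeddings
```

The variance term is what makes the method non-contrastive: collapsed (constant) embeddings are penalized directly, with no negative samples needed.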

Hierarchical JEPA (H-JEPA)

The most ambitious part of the proposal. Stack multiple JEPAs hierarchically:

  • JEPA-1: low-level representations, short-term predictions (milliseconds to seconds)
  • JEPA-2: higher-level representations from JEPA-1 outputs, longer-term predictions (seconds to minutes)
  • JEPA-N: increasingly abstract, increasingly long-horizon

Each level:

  • Has its own encoder, predictor, and latent variables
  • Operates at a coarser time scale (temporal pooling between levels)
  • Discards more details — higher levels are more abstract
  • Enables prediction over longer horizons because abstraction makes the future more predictable

The driving example

  • Low level: predict car trajectory over the next few seconds (requires detailed position/velocity)
  • Mid level: predict approximate route to destination (ignores exact trajectory, models traffic lights/other cars)
  • High level: predict arrival time (ignores route details, just "I'll arrive in 30 minutes")

Each level trades detail for prediction horizon. H-JEPA formalizes this as a learnable hierarchy.
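The time-scale structure of the hierarchy can be sketched with temporal pooling between levels. The pooling factor, sequence length, and feature size are illustrative; real H-JEPA levels would also have their own encoders, predictors, and latents.

```python
import numpy as np

def temporal_pool(seq, factor=4):
    """Average consecutive groups of `factor` steps into one step."""
    t = (len(seq) // factor) * factor
    return seq[:t].reshape(-1, factor, seq.shape[1]).mean(axis=1)

T, d = 64, 16
level0 = np.random.default_rng(0).standard_normal((T, d))  # low-level states
level1 = temporal_pool(level0)   # 16 coarser steps (seconds-scale)
level2 = temporal_pool(level1)   # 4 coarser steps (minutes-scale)
print(level0.shape, level1.shape, level2.shape)  # (64, 16) (16, 16) (4, 16)
```

Each pooling step shortens the sequence a level must predict over, which is the mechanical reason longer horizons become tractable higher up.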

Why not autoregressive? Why not generative?

The paper is explicitly opposed to:

Autoregressive token prediction: forces the model to predict every detail of the next token, including unpredictable surface variation. Wastes capacity. Cannot naturally represent multi-modal futures (only via probability distributions over all possible tokens).

Generative models (VAE, GAN, diffusion): must produce a complete output y, requiring all details. Cannot eliminate irrelevant information. "Generative latent-variable models are not capable of eliminating irrelevant details, other than by pushing them into a latent variable. This is because they do not produce abstract (and invariant) representations of y."

Contrastive methods: work but become inefficient in high dimensions because they require negative samples whose number must grow exponentially with the dimension of the representation.

JEPA avoids all three problems by operating in representation space with non-contrastive training.

The energy-based perspective

The entire framework is cast as an Energy-Based Model (EBM):

  • Low energy = compatible, plausible state
  • High energy = incompatible, implausible
  • No normalization constant needed (unlike probabilistic models)
  • Planning = finding action sequences that minimize energy
  • Learning = shaping the energy landscape so plausible states have low energy

LeCun argues this is preferable to probabilistic modeling because: "Probabilistic models are intractable in high-dimensional continuous domains." EBMs avoid the need for partition functions, sampling, and marginalization.
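"Planning = finding action sequences that minimize energy" can be shown directly: gradient descent on a differentiable energy over a two-step action sequence, with no sampling, normalization, or partition function. The world model, goal, and regularizer below are toy assumptions.

```python
import numpy as np

def energy(actions, start=0.0, goal=3.0):
    """Roll actions through a toy world model; energy = terminal cost
    plus a small action-magnitude penalty."""
    state = start
    for a in actions:
        state = state + a
    return (state - goal) ** 2 + 0.1 * np.sum(actions ** 2)

def grad(actions, h=1e-5):
    """Central finite differences (keeps the sketch dependency-free)."""
    g = np.zeros_like(actions)
    for i in range(len(actions)):
        d = np.zeros_like(actions); d[i] = h
        g[i] = (energy(actions + d) - energy(actions - d)) / (2 * h)
    return g

actions = np.zeros(2)
for _ in range(200):
    actions -= 0.1 * grad(actions)   # descend the energy landscape
```

The analytic minimizer here is symmetric, `a1 = a2 = 10/7`, and the loop converges to it; swapping in a learned world model changes nothing about the procedure.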

What this paper got right (validated by subsequent work)

| Prediction | Validated by |
|---|---|
| Non-generative latent prediction works | [I-JEPA](/wiki/papers/2301.08243) (2023) |
| Extends naturally to video | [V-JEPA](/wiki/papers/2404.08471) (2024) |
| Enables robot planning | [V-JEPA 2](/wiki/papers/2506.09985) (2025) |
| Heuristic-free training is possible | [LeJEPA](/wiki/papers/2511.08544) (2025) |
| Works across modalities | Point-JEPA, Audio-JEPA, LLM-JEPA |
| Dense features matter | [V-JEPA 2.1](/wiki/papers/2603.14482) (2026) |
| Minimal objectives suffice | [LeWorldModel](/wiki/papers/2603.19312) (2026) |

What remains open

  • H-JEPA is not yet implemented: no paper has demonstrated the full hierarchical multi-timescale architecture. ThinkJEPA's dual-temporal pathway is the closest attempt.
  • Configurator module: no implementation exists. Current JEPA models are not dynamically reconfigurable for different tasks.
  • Intrinsic motivation: no JEPA model uses intrinsic cost for self-motivated exploration.
  • Mode-1/Mode-2 interaction: no system has demonstrated amortized inference from deliberative planning to reactive policy.
  • Language integration: how natural language interfaces with the world model remains unclear. LLM-JEPA and VL-JEPA are early steps.
