JEPAwiki

Yann LeCun's Billion Dollar Bet (Welch Labs, Part 1)

A polished, ~37-minute video explainer by Welch Labs (Sam Baskin, Pranav Gundu, Stephen Welch) tracing the historical arc of joint embedding architectures up to JEPA. It is Part 1 of a planned two-part series; Part 2 will cover the V-JEPA 2, VL-JEPA, and LeWorldModel implementations and how they compare with LLM- and VLA-driven approaches.

The video is unusual in that it includes fresh on-camera commentary from Yann LeCun and Stéphane Deny, plus archival footage going back to LeCun's 1989 work on convolutional nets. As such, it is one of the better single sources for the origin story of joint-embedding self-supervised learning and how that lineage led to JEPA.

Source: YouTube

What it covers

Timestamp Topic
0:00 Intro — LeCun's $1B bet on JEPA
2:28 The problem with deep learning (label dependence)
4:17 "Intelligence is a cake" — LeCun's slide
5:15 The rise of generative AI (GPT lineage)
8:00 The blurry-images problem in generative video
11:16 Why so blurry? (the unpredictability tax)
13:30 Do our models need to be generative?
15:16 Siamese networks (1990s, Bell Labs)
17:53 Representation collapse
19:54 Yann's epiphany & Barlow Twins
27:22 DINO
34:09 But is JEPA good?

The cake metaphor

LeCun's now-iconic 2015 slide, repeated almost verbatim on camera:

"If intelligence is a cake, the bulk of the cake is self-supervised learning, the icing on the cake is supervised learning, and the cherry on the cake is reinforcement learning."

The talk uses this as the throughline: self-supervised learning won in language (next-token prediction) but not in vision — hence the search for an alternative.

The 17-year-old driving argument

LeCun's data-efficiency complaint, on camera:

"How is it that we have those millions of hours of training data where we can train a level-2 system with it — which is what Tesla is doing — but [we are] nowhere near level 3, 4, 5… yet a 17-year-old can learn to drive in a few hours of practice. Like, how does that happen? Shouldn't we figure out what the secret is there? My guess about it is the secret is world models."

This is the cleanest statement of the motivation for JEPA's world-model thesis.

Why generative video is blurry (the talk's clearest explanation)

The video gives the most accessible explanation of the unpredictability problem I've seen anywhere:

  • An LLM completing "the ball bounced to the ___" can independently update probabilities for "left", "right", "wall" — it has discrete output slots for each token.
  • A pixel-prediction model trained on videos of a ball bouncing in different directions has no such slots. When the model is forced to predict a single output frame for a given input, its best strategy under a squared-error pixel loss is to predict the average of the possible outcomes, which is a blurry, washed-out mess.
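The averaging effect can be checked on a toy case. The sketch below is an illustration of the argument, not code from the video: it assumes a 9-pixel one-dimensional "frame", two equally likely futures (ball on the left or on the right), and a squared-error loss. The names make_frame and expected_mse are made up for the example.

```python
import numpy as np

def make_frame(ball_col, width=9):
    """Return a 1-D toy 'frame' with a single bright pixel at ball_col."""
    frame = np.zeros(width)
    frame[ball_col] = 1.0
    return frame

# Two equally likely futures for the same input: the ball ends up left or right.
left, right = make_frame(1), make_frame(7)
targets = np.stack([left, right])

def expected_mse(prediction, targets):
    """Squared error of one fixed prediction, averaged over all futures."""
    return float(np.mean((targets - prediction) ** 2))

blurry = targets.mean(axis=0)  # the average of the possible outcomes

print("sharp guess :", expected_mse(left, targets))
print("blurry guess:", expected_mse(blurry, targets))
```

The blurry prediction (two half-bright pixels) scores a lower expected error than either sharp guess, which is why a model trained this way drifts toward washed-out frames.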

The talk uses this to motivate non-generative joint embedding: you cannot enumerate the possible next frames of a video the way you can enumerate the possible next tokens of text, because the space is roughly 10^(15M) for HD video — more configurations than atoms in the observable universe.
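The 10^(15M) figure can be reproduced with a back-of-the-envelope count. The assumptions here are mine, not stated in the video: a single 1920x1080 frame, 3 color channels, and 256 intensity levels per channel. The atom count of about 10^80 is the commonly quoted estimate.

```python
import math

# Assumed frame format (not stated in the video): 1080p, RGB, 8 bits per channel.
width, height, channels, levels = 1920, 1080, 3, 256

values_per_frame = width * height * channels            # independent pixel values
log10_frames = values_per_frame * math.log10(levels)    # exponent of 10

atoms_log10 = 80  # commonly quoted order of magnitude for atoms in the universe

print(f"pixel values per frame: {values_per_frame:,}")
print(f"possible frames: about 10^({log10_frames / 1e6:.1f}M)")
print(f"exponent is {log10_frames / atoms_log10:,.0f}x larger than the atom count's")
```

Under these assumptions the exponent alone is about 15 million, so the count of possible frames is a number with roughly 15 million digits, for a single frame rather than a whole video.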

The Barlow Twins origin story

The most historically significant section. Around 19:54, LeCun and Stéphane Deny (Meta postdoc, 2020) describe how Barlow Twins came together:

  1. By the late 2010s, joint-embedding methods kept hitting representation collapse, and existing fixes (contrastive negatives) didn't scale well to high dimensions.
  2. Stéphane Deny suggested applying a 1961 hypothesis from theoretical neuroscientist Horace Barlow: animal vision systems operate by reducing redundancy between neurons.
  3. Translated to neural nets: instead of just maximizing similarity between embeddings of two views (which is collapse-prone), also penalize redundancy between dimensions of the embedding.

The mathematical move (in their own words on camera): compute the cross-correlation matrix between the two encoders' output activations across a batch, and push it toward the identity matrix. Diagonal entries (the same neuron across the two views) are pushed toward 1, and off-diagonal entries (different neurons) are pushed toward 0.
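A minimal NumPy sketch of that objective follows. It is an illustration under assumptions, not the authors' implementation: the function name barlow_twins_loss, the off-diagonal weight lambda_offdiag, and the toy inputs are all chosen for the example, and no encoder or gradient step is included.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3, eps=1e-6):
    """z_a, z_b: (batch, dim) embeddings of two views of the same batch."""
    n = z_a.shape[0]
    # Standardize each embedding dimension across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    c = z_a.T @ z_b / n                                   # (dim, dim) cross-correlation
    on_diag = np.sum((np.diag(c) - 1.0) ** 2)             # push diagonal toward 1
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)   # push the rest toward 0
    return float(on_diag + lambda_offdiag * off_diag)

rng = np.random.default_rng(0)
views = rng.normal(size=(256, 4))   # informative, roughly decorrelated embedding
collapsed = np.ones((256, 4))       # every input mapped to the same vector

good_loss = barlow_twins_loss(views, views)
collapsed_loss = barlow_twins_loss(collapsed, collapsed)
print(good_loss, collapsed_loss)
```

The collapsed embedding has no variance across the batch, so its cross-correlation matrix is all zeros and the diagonal term penalizes it heavily. That is how the objective discourages collapse without needing contrastive negatives.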

Result on ImageNet linear probe (2021):

| Method | ImageNet top-1 accuracy | Year | Notes |
| --- | --- | --- | --- |
| AlexNet (fully supervised) | 59.3% | 2012 | The 2012 baseline |
| Barlow Twins (frozen + linear probe) | 73.2% | 2021 | Self-supervised, +13.9 points over supervised AlexNet |
| ViT (fully supervised) | 88.6% | 2020 | Google's ViT paper |
| DINOv3 | 88.4% | 2025 | First self-supervised model to match weakly-supervised SOTA |

Barlow Twins led to VICReg (a simpler reformulation by the same group), which the 2022 position paper cites as the canonical instantiation of the four-criteria training recipe, and which is still used as a building block in current JEPA variants (e.g. EB-JEPA).

The DINO lineage as the parallel branch

The video makes the parallel-tracks structure of self-supervised vision explicit. There were two contemporary efforts at FAIR:

  • NYC group (LeCun, Deny, Bardes, Ballas...) — Barlow Twins → VICReg → I-JEPA → V-JEPA → JEPA family
  • Paris group (Caron, Bojanowski, Joulin...) — DINO v1 → DINOv2 → DINOv3

Both are joint-embedding architectures. DINO uses self-distillation with a momentum teacher; the JEPA branch uses explicit predictors and (later) information-theoretic regularizers. Both lines of work support the video's claim that "joint embedding was better for representation learning", but only the JEPA branch keeps an explicit predictor that can be reused as a world model for planning.
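For readers unfamiliar with the term, a "momentum teacher" is a copy of the student network whose weights are an exponential moving average of the student's. The sketch below shows only that update rule on a toy parameter dictionary; the name ema_update and the momentum value are assumptions for illustration, and the rest of the DINO recipe is omitted.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.996):
    """Move each teacher parameter a small step toward the student's value."""
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}

teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}   # held fixed here; in training it changes every step

for _ in range(100):
    teacher = ema_update(teacher, student)

print(teacher["w"])  # has drifted part of the way from 0 toward 1
```

Because the teacher moves slowly, it provides a stable target for the student, and no gradients flow through it.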

The world-model definition (LeCun, on camera)

"JEPA means joint embedding predictive architecture. You take an observation in the world and then the next observation in the world, you run them through encoders (so this is like a joint-embedding type architecture), and then you have a predictor that tries to predict the state at time t+1 from the state at time t, and you might condition this on an action. And now you have a world model."

He frames the lineage in three layers, in increasing novelty:

  1. Classical optimal control (Soviet Union late 50s, US early 60s): build a model s_{t+1} = f(s_t, a_t), optimize action sequences.
  2. Modern twist: learn the model from data via machine learning.
  3. JEPA twist: also learn a representation of the input — an abstract state — and learn the transition model in that representation space, not in pixel space.
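The three layers can be put together in a toy planner. This is a sketch under heavy assumptions, not any JEPA implementation: encode and predict are hand-written stand-ins for what would be learned networks, the 4-dimensional observations are invented, and the search is plain random shooting over action sequences.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(obs):
    """Toy encoder: keep the first two coordinates as the abstract state."""
    return obs[:2]

def predict(state, action):
    """Toy latent transition model s_{t+1} = f(s_t, a_t)."""
    return state + action

def plan(obs, goal_obs, horizon=5, n_candidates=256):
    """Random-shooting search over action sequences, scored in latent space."""
    s0, s_goal = encode(obs), encode(goal_obs)
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, 2))
    best_cost, best_actions = np.inf, None
    for actions in candidates:
        s = s0
        for a in actions:            # roll the model forward in latent space
            s = predict(s, a)
        cost = float(np.sum((s - s_goal) ** 2))
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions, best_cost

obs = np.array([0.0, 0.0, 0.3, 0.7])        # last two entries are distractors
goal_obs = np.array([2.0, -1.0, 0.9, 0.1])
actions, cost = plan(obs, goal_obs)
print(actions.shape, cost)
```

The planner never predicts the distractor coordinates, which mirrors the point of layer 3: the transition model only has to be accurate in the abstract state space, not in the raw observation space.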

The closing position (LeCun, on camera)

The video ends with LeCun's "controversial" agentic-systems argument:

"I do not understand how you can even think of building an agentic system without that agentic system having the ability of predicting the consequences of its actions. And a VLA doesn't do that… If you really want to build reliable agentic systems, they absolutely have to be able to predict the consequences of their actions so that they can plan a sequence of actions… The inference process now becomes a search as opposed to just an autoregressive prediction."

This is the most compact statement of why he is willing to bet a billion dollars on JEPA over LLM-style autoregressive agents.

Significance for this wiki

The talk is the best public explanation I know of for two pieces of the JEPA story that the academic papers leave underspecified:

  1. The Barlow Twins origin story, including Stéphane Deny's role and the connection to Horace Barlow's 1961 neuroscience work. The Barlow Twins paper itself is dry; this video makes the conceptual move readable.
  2. The historical continuity with 1990s Siamese networks (Bell Labs, signature verification). The video makes it clear that JEPA is not a 2022 invention but a 30-year research program where LeCun has consistently bet on joint embedding over generation.

See also

  • lecun-position-paper — the 62-page blueprint the talk culminates in
  • saining-xie-interview — the parallel "world model" perspective from the AMI Labs co-founder
  • collapse-prevention — Barlow Twins' role in solving the central JEPA training problem
  • DINO — the parallel branch of joint-embedding self-supervised vision
  • DINOv3 — the August 2025 result, described as the "first time a self-supervised model has reached comparable results to weakly and supervised models on image classification"