Saining Xie Interview (7 hours, 2025)
A 7-hour interview with Saining Xie — co-founder and CTO of AMI Labs, former FAIR researcher, co-creator of DiT (the architecture behind Sora), and co-author of NEPA, REPA, and MAE. This interview is the most comprehensive public articulation of the JEPA vision from the perspective of someone building a company around it.
Source: YouTube
Saining's three stages with JEPA
The most important passage for understanding how serious researchers think about JEPA:
- Doubting JEPA: "I questioned JEPA... JEPA seemed like yet another self-supervised learning algorithm."
- Understanding JEPA: "JEPA actually goes deeper than we imagined. There's a lot of underlying logic inside it, many mathematical principles."
- Becoming JEPA: "JEPA is not a model. JEPA is not a specific algorithm. JEPA is a complete cognitive architecture. It's a cognitive system."
He explicitly frames JEPA not as I-JEPA or V-JEPA (specific papers) but as the full 2022 position paper vision: world understanding + prediction + planning. "JEPA is a very, very vast ocean. In this ocean there can be many, many ships sailing on it."
What a world model actually is
Saining gives a precise definition:
"You have a state S_t and an action a_t. You want to learn a transition function F that takes your action together with your current state to predict the next state."
But he insists this is a goal, not an algorithm: "All of us — whether you're working on LLMs, Video Diffusion Models, or Gaussian Splatting — all of us are on the path toward the world model."
He traces the idea to Kenneth Craik (1943), a Scottish philosopher who first proposed that humans maintain internal models of the world that predict the consequences of their actions.
Five requirements for a world model
Crediting Yann LeCun:
- Understand the physical world
- Have sufficiently large associative memory
- Be able to reason and plan
- Do counterfactual reasoning / causal inference
- Be sufficiently controllable and safe
Representation IS the world model
"In my definition, representation is a world model. The most important part. It's not all of it. It's the most important part."
This is a strong claim: the quality of the representation determines the quality of the world model. Everything else (planning, control, language) is built on top.
The case against LLMs as world models
Saining gives the sharpest critique of LLMs-as-intelligence in the interview:
"LLMs will never die, but will eventually fade. Old soldiers never die, they just fade away... It's a very good tool. I use LLMs every day. But it's not the foundation for building a general intelligence system."
Specific arguments
Language is a communication tool, not a thinking map: "Language is not a thinking map. Language is not even a decision-making tool. It's a form of communication."
Language is actually supervised learning: "A language model is actually not self-supervised learning. It's actually strongly supervised... language is what humans over thousands of years of civilization processed and stored in tokenized form." This is anti-Bitter Lesson — language is the most cleverly engineered human knowledge representation, not raw unsupervised data.
Serializing video into tokens is wasteful: "The modeling technique of language models cannot resolve the cognition of continuous spatial signals."
LLMs are a crutch: "Language is actually an opiate. You add more language, you'll always feel happier... but it's a shortcut. If you keep using it, you can't train your leg muscles."
CoT is stage-specific: "Chain-of-thought is a product of this stage... everything about LLMs is a fairly stage-specific product."
Silicon Valley is hypnotized: "People are already deeply mired in, already hypnotized by Large Language Models."
The origin of DiT and REPA
Saining reveals that REPA came from a fundamental observation:
"We wanted to look at the representation a diffusion model learns, how it compares to what a self-supervised learning model learns... a generative model can learn a decent representation, but this representation was much, much worse than the representation from self-supervised learning."
This finding — that generation produces inferior representations compared to self-supervised methods — directly supports the JEPA thesis. REPA is the practical fix: align the generative model's internal states with external SSL representations.
On the extension to representation autoencoders: "Why do we need to use this indirect way to do alignment? What if we can directly use this powerful representation as an encoder for your generative model?"
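A minimal PyTorch sketch of the alignment idea as described; the shapes, the projector MLP, and the DINOv2 reference are illustrative assumptions, not details from the interview:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def repa_alignment_loss(diffusion_hidden: torch.Tensor,
                        ssl_features: torch.Tensor,
                        projector: nn.Module) -> torch.Tensor:
    """REPA-style regularizer: pull a diffusion model's intermediate
    states toward frozen self-supervised (SSL) patch features.

    diffusion_hidden: (B, N, D_diff) states from an intermediate DiT block
    ssl_features:     (B, N, D_ssl) frozen patch features, e.g. from DINOv2
    projector:        small trainable MLP mapping D_diff -> D_ssl
    """
    projected = projector(diffusion_hidden)
    # Negative cosine similarity per patch token, averaged over the batch
    return -F.cosine_similarity(projected, ssl_features, dim=-1).mean()

# Added on top of the usual denoising objective:
#   loss = denoising_loss + lam * repa_alignment_loss(h, ssl_feats, proj)
```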
DiT: born from representation research
The Diffusion Transformer (DiT, the architecture behind Sora) was born from representation learning research, not generation research:
"We needed to compare at the representation level against ViT-based systems... which is why we used ViT for this Diffusion Model."
The paper was rejected at CVPR ("not enough novelty"), later accepted as an Oral elsewhere, and eventually adopted by OpenAI's Sora team. Saining notes the irony of a "not novel" architecture ending up behind Sora, adding: "All these generative models, 90% is still a data problem."
Pixels are wrong too
A radical claim:
"Pixels themselves might also be wrong. Pixels are also not Bitter Lesson enough... Pixels are a human-defined regular grid... The real Bitter Lesson says I don't need to make it for humans to see."
This aligns with JEPA's core principle: don't predict pixels, predict abstract representations. But Saining goes further — even inputting pixels may be suboptimal.
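A minimal sketch of what "predict abstract representations" means in practice, assuming an I-JEPA-style setup with a stop-gradient target encoder; all names here are illustrative:

```python
import torch
import torch.nn as nn

def latent_prediction_loss(context: torch.Tensor,
                           target: torch.Tensor,
                           encoder: nn.Module,
                           target_encoder: nn.Module,
                           predictor: nn.Module) -> torch.Tensor:
    """Predict the target's representation, never its pixels.

    encoder and predictor are trained; target_encoder is a frozen or
    EMA-updated copy, so no gradient flows through the target branch.
    """
    with torch.no_grad():
        target_repr = target_encoder(target)    # abstract target, not pixels
    pred_repr = predictor(encoder(context))     # prediction in latent space
    return torch.mean((pred_repr - target_repr) ** 2)
```

Saining's further point is that even `context` and `target` arriving as pixel grids may be an artifact of human-centric sensors, not a requirement of the method.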
AMI Labs: building the JEPA vision
Origin
Yann LeCun told Saining: "I've already decided... what I want to do now should be done outside. I want to start and build a company." Saining realized it was "completely aligned with what I'd imagined."
Mission
"Building the predictive brain... working at the most foundational layer."
The bet
"My current bet is there's only one thing in this world that is important: how to learn this representation. When you have a good enough representation, handling other problems on top of it is simple. Your Language Model will gradually degrade to a simple communication interface."
Structure
Six co-founders, with Yann LeCun as Executive Chairman. VP of World Models is Mike (from the original JEPA team at Meta). Target fundraising is ~$1B, with a starting team of ~25.
Product vision
Two product outlets:
- AI glasses/wearables: always-on personal assistants that require world models for real-time scene understanding
- Robotics: "the brain is the missing piece" — robots have bodies but no world model to control them
On Yann LeCun
"Very principled... when he says something is right, he truly believes in what he says."
"Yann is more of a visionary, and I'm more grounded, someone who can actually execute."
"Yann really practices what he preaches. He himself is pretty JEPA as a person. He consistently holds fast to his logical principles."
At Meta, people told Yann to stop publicly criticizing LLMs. "Yann couldn't accept this at all. He said my integrity as a scientist cannot accept this."
Yann's analogy about their work: "There's always a small group of people who can clearly see the trajectory of the world's development... Back then with deep learning, people were doing other things... and now, what you're doing is..."
Forward-looking predictions
World model scaling will be different: "Will have a very different Scaling Law... the model won't be that large, doesn't need many training parameters, because you don't need to remember everything."
Video is the data source: "Video is still the best hope we have right now."
The second half of pre-training: The world model is the "second half of pre-training" nobody is building. Inputs will be "continuous-space signals, high-dimensional, potentially noisy signals."
Squirrel intelligence is the hard problem: Cites Rich Sutton — "Building the intelligence of a squirrel is the hard problem. Once you have a squirrel's intelligence... writing code, going to Mars — those would be the easy ones."
LeJEPA validation: "Recently LeJEPA showed with rigorous proof that if you want a good representation agnostic to downstream tasks, it must be an isotropic Gaussian distribution." He reads this as theoretical backing for the representation-first bet.
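A loose illustration of the isotropic-Gaussian target, assuming embeddings z of shape (B, D). This moment-matching penalty is an illustrative stand-in, not LeJEPA's actual regularizer:

```python
import torch

def isotropy_penalty(z: torch.Tensor, num_directions: int = 64) -> torch.Tensor:
    """Push embeddings z (B, D) toward an isotropic Gaussian.

    Illustrative stand-in (not LeJEPA's method): project onto random
    unit directions and match the first two moments of N(0, 1) in
    each 1-D projection.
    """
    d = z.shape[1]
    dirs = torch.randn(d, num_directions, device=z.device, dtype=z.dtype)
    dirs = dirs / dirs.norm(dim=0, keepdim=True)       # unit-norm directions
    proj = z @ dirs                                    # (B, num_directions)
    mean_pen = proj.mean(dim=0).pow(2).mean()          # 1-D means should be 0
    var_pen = (proj.var(dim=0) - 1.0).pow(2).mean()    # 1-D variances should be 1
    return mean_pen + var_pen
```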
The Wittgenstein critique
Saining objects to people citing Wittgenstein's "the limits of my language mean the limits of my world" to endorse LLMs:
"Later Wittgenstein proposed language games — language itself has no inherent meaning. The reason they acquire meaning is because they are connected to real-world practice."
Language without grounding is empty — exactly the JEPA argument for learning from observation rather than from text.
See also
- papers/lecun-position-paper — the 2022 blueprint that Saining describes as "the path to intelligence"
- overview — the JEPA timeline validating many of these predictions
- 2410.06940 (REPA) — Saining's paper connecting SSL representations to generative models
- 2512.16922 (NEPA) — Saining's paper on autoregressive embedding prediction
- jepa-vs-alternatives — the systematic comparison Saining's arguments support
- latent-prediction — the core principle: predict in abstract space, not pixel space