JEPAwiki
REPA: Representation Alignment for Generation — Training Diffusion Transformers Is Easier Than You Think
Date: 2024-10-09
Modality: image
Authors: Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong + 3 more
Tags: diffusion, representation-alignment, training-efficiency, generation, related-work
Source: Full text

REPA (Representation Alignment)

A paper from Sihyun Yu, Saining Xie, and collaborators showing that aligning diffusion transformer internal representations with pretrained vision encoder features (like DINOv2) dramatically speeds up training and improves generation quality. REPA is important for the JEPA story because it validates the core JEPA thesis from the generative side: good representations are the bottleneck, and they don't come for free from pixel-level objectives.

Core idea

Diffusion transformers must learn both (1) good internal representations and (2) how to denoise. The denoising objective alone is a weak learning signal for representations — the model spends most of training slowly discovering semantic structure that pretrained vision encoders already have.

REPA adds a simple regularization: align the diffusion model's noisy hidden states with clean representations from a frozen external encoder (typically DINOv2).

Loss = L_flow (denoising) + λ · L_align (match DINOv2 features)

The model has two learning signals:

  • Standard flow/diffusion loss (reconstruct from noise)
  • Representation alignment loss (match pretrained features)
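The combined objective is simple enough to sketch in a few lines. A minimal NumPy version is below, assuming patch-wise features and a negative-cosine-similarity form of the alignment term; the weight `lam=0.5` is an assumed default, and in the actual method the alignment is computed from a small MLP projection of an early layer's hidden states:

```python
import numpy as np

def align_loss(h_proj, enc_feats):
    """Patch-wise negative cosine similarity between the diffusion model's
    projected hidden states h_proj and frozen encoder (e.g. DINOv2)
    features enc_feats, both shaped [num_patches, dim]."""
    h = h_proj / np.linalg.norm(h_proj, axis=-1, keepdims=True)
    y = enc_feats / np.linalg.norm(enc_feats, axis=-1, keepdims=True)
    return -float(np.mean(np.sum(h * y, axis=-1)))

def repa_loss(flow_loss, h_proj, enc_feats, lam=0.5):
    """Total objective: standard flow/denoising loss plus a
    lambda-weighted alignment term (lam=0.5 is an assumption here)."""
    return flow_loss + lam * align_loss(h_proj, enc_feats)
```

Note that `h_proj` is computed from the noisy input while `enc_feats` comes from a frozen forward pass on the clean image, which is what makes the alignment a denoising-independent learning signal.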

Key results

  • 17.5x faster convergence: SiT-XL matches its 7M-step performance in under 400K steps
  • SOTA generation quality: FID=1.42 on ImageNet (with classifier-free guidance)
  • Works across DiT and SiT architectures
  • Only early transformer layers need alignment — later layers focus on high-frequency generation details

The empirical finding that motivated REPA

The paper shows three key observations about diffusion transformers:

  1. They do learn meaningful discriminative representations (linear probing works), but these representations are significantly weaker than DINOv2's
  2. There is already some natural alignment between diffusion model features and DINOv2, but it's weak
  3. This alignment improves with longer training and larger models — but very slowly
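Observations 2 and 3 can be checked directly by measuring feature-space alignment between the two models. The paper uses CKNNA, a nearest-neighbor variant of CKA; plain linear CKA, shown here as a simpler stand-in, captures the same idea of comparing representation geometries:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between feature matrices X [N, D1] and Y [N, D2]
    extracted for the same N images. Returns a score in [0, 1];
    1 means identical geometry up to rotation and scale."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(hsic / (norm_x * norm_y))
```

Evaluating this between diffusion-transformer hidden states and DINOv2 features across training checkpoints would trace out the trend in observation 3: alignment rises with training and scale, but slowly.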

REPA accelerates what would eventually happen naturally.

Connection to JEPA

REPA validates LeCun's argument from the opposite direction:

LeCun's claim: pixel-level prediction (reconstruction, generation) is a weak objective for learning representations. Models waste capacity on unpredictable details.

REPA's evidence: diffusion models trained on pixel-level denoising learn poor internal representations. Injecting externally-learned representations (from self-supervised methods like DINO that don't reconstruct pixels) dramatically helps.

This is exactly what JEPA predicts: the bottleneck in generative models is representation learning, and representation learning is better done by non-generative methods.

However, REPA still depends on external encoders — which creates problems:

  • Stronger encoders don't always help more (scaling issues)
  • Doesn't generalize across modalities (hurts video/audio)
  • Requires training and maintaining a separate model

Self-Flow addresses these limitations by integrating self-supervised representation learning inside the flow model.

Limitations

  • Requires a pretrained external encoder (DINOv2 is the strongest choice, but this dependency is fragile)
  • Scaling behavior is unpredictable: stronger encoders sometimes give diminishing or negative returns
  • Only demonstrated on image generation — external alignment hurts video and audio generation
  • Doesn't make diffusion models into world models — still generative, still slow for planning

See also

  • 2603.06507 (Self-Flow) — removes the external encoder dependency
  • 2104.14294 (DINO), 2304.07193 (DINOv2) — the encoders REPA aligns to
  • 2512.16922 (NEPA) — by the same author (Saining Xie), applying embedding prediction to vision
  • jepa-vs-alternatives — why JEPA argues pixel-level objectives are insufficient
  • latent-prediction — the core principle REPA validates from the generative side