JEPAwiki
REPA: Representation Alignment for Generation — Training Diffusion Transformers Is Easier Than You Think
Date: 2024-10-09
Modality: image
Authors: Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong + 3 more
Tags: diffusion, representation-alignment, training-efficiency, generation, related-work
Source: Full text

REPA (Representation Alignment)

A paper from Sihyun Yu, Saining Xie, and collaborators showing that aligning diffusion transformer internal representations with pretrained vision encoder features (like DINOv2) dramatically speeds up training and improves generation quality. REPA is important for the JEPA story because it validates the core JEPA thesis from the generative side: good representations are the bottleneck, and they don't come for free from pixel-level objectives.

Core idea

Diffusion transformers must learn both (1) good internal representations and (2) how to denoise. The denoising objective alone is a weak learning signal for representations — the model spends most of training slowly discovering semantic structure that pretrained vision encoders already have.

REPA adds a simple regularization: align the diffusion model's noisy hidden states with clean representations from a frozen external encoder (typically DINOv2).

Loss = L_flow (denoising) + λ · L_align (match DINOv2 features)

The model has two learning signals:

  • Standard flow/diffusion loss (reconstruct from noise)
  • Representation alignment loss (match pretrained features)
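The combined objective is simple enough to sketch in a few lines. A minimal NumPy version is below, assuming patch-wise features and a negative-cosine-similarity form of the alignment term; the weight `lam=0.5` is an assumed default, and in the actual method the alignment is computed from a small MLP projection of an early layer's hidden states:

```python
import numpy as np

def align_loss(h_proj, enc_feats):
    """Patch-wise negative cosine similarity between the diffusion model's
    projected hidden states h_proj and frozen encoder (e.g. DINOv2)
    features enc_feats, both shaped [num_patches, dim]."""
    h = h_proj / np.linalg.norm(h_proj, axis=-1, keepdims=True)
    y = enc_feats / np.linalg.norm(enc_feats, axis=-1, keepdims=True)
    return -float(np.mean(np.sum(h * y, axis=-1)))

def repa_loss(flow_loss, h_proj, enc_feats, lam=0.5):
    """Total objective: standard flow/denoising loss plus a
    lambda-weighted alignment term (lam=0.5 is an assumption here)."""
    return flow_loss + lam * align_loss(h_proj, enc_feats)
```

Note that `h_proj` is computed from the noisy input while `enc_feats` comes from a frozen forward pass on the clean image, which is what makes the alignment a denoising-independent learning signal.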

Key results

  • 17.5x faster convergence: SiT-XL matches its 7M-step performance in under 400K steps
  • SOTA generation quality: FID=1.42 on ImageNet (with classifier-free guidance)
  • Works across DiT and SiT architectures
  • Only early transformer layers need alignment — later layers focus on high-frequency generation details

The empirical finding that motivated REPA

The paper shows three key observations about diffusion transformers:

  1. They do learn meaningful discriminative representations (linear probing works), but these representations are significantly weaker than DINOv2's
  2. There is already some natural alignment between diffusion model features and DINOv2, but it's weak
  3. This alignment improves with longer training and larger models — but very slowly
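Observations 2 and 3 can be checked directly by measuring feature-space alignment between the two models. The paper uses CKNNA, a nearest-neighbor variant of CKA; plain linear CKA, shown here as a simpler stand-in, captures the same idea of comparing representation geometries:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between feature matrices X [N, D1] and Y [N, D2]
    extracted for the same N images. Returns a score in [0, 1];
    1 means identical geometry up to rotation and scale."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(hsic / (norm_x * norm_y))
```

Evaluating this between diffusion-transformer hidden states and DINOv2 features across training checkpoints would trace out the trend in observation 3: alignment rises with training and scale, but slowly.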

REPA accelerates what would eventually happen naturally.

Connection to JEPA

REPA validates LeCun's argument from the opposite direction:

LeCun's claim: pixel-level prediction (reconstruction, generation) is a weak objective for learning representations. Models waste capacity on unpredictable details.

REPA's evidence: diffusion models trained on pixel-level denoising learn poor internal representations. Injecting externally-learned representations (from self-supervised methods like DINO that don't reconstruct pixels) dramatically helps.

This is exactly what JEPA predicts: the bottleneck in generative models is representation learning, and representation learning is better done by non-generative methods.

However, REPA still depends on external encoders — which creates problems:

  • Stronger encoders don't always help more (scaling issues)
  • Doesn't generalize across modalities (hurts video/audio)
  • Requires training and maintaining a separate model

Self-Flow addresses these limitations by integrating self-supervised representation learning inside the flow model.

Limitations

  • Requires a pretrained external encoder (DINOv2 is the strongest choice, but this dependency is fragile)
  • Scaling behavior is unpredictable: stronger encoders sometimes give diminishing or negative returns
  • Only demonstrated on image generation — external alignment hurts video and audio generation
  • Doesn't make diffusion models into world models — still generative, still slow for planning

See also

  • 2603.06507 (Self-Flow) — removes the external encoder dependency
  • 2104.14294 (DINO), 2304.07193 (DINOv2) — the encoders REPA aligns to
  • 2512.16922 (NEPA) — by the same author (Saining Xie), applying embedding prediction to vision
  • jepa-vs-alternatives — why JEPA argues pixel-level objectives are insufficient
  • latent-prediction — the core principle REPA validates from the generative side