JEPAwiki
Self-Flow: Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis
Date: 2026-03-09
Modality: image/video/audio
Authors: Hila Chefer, Patrick Esser, Dominik Lorenz, Dustin Podell + 4 more
Tags: flow-matching, self-supervised, representation-learning, generation, multi-modal, related-work
Source: Full text

Self-Flow

A Black Forest Labs paper that integrates self-supervised representation learning inside flow matching models, eliminating the need for external encoders like DINO or CLIP. Self-Flow directly addresses a limitation of REPA: external alignment doesn't scale predictably and fails across modalities.

Core idea

Flow matching models, like diffusion models, don't learn strong semantic representations on their own — their denoising objective focuses on low-level reconstruction rather than high-level structure. REPA showed that borrowing representations from external encoders helps enormously. Self-Flow asks: can we make the flow model learn these representations itself?

Dual-Timestep Scheduling

The key mechanism. Apply two different noise levels to different subsets of input tokens in the same image:

  • Some tokens get heavy noise (high timestep)
  • Other tokens get light noise (low timestep)
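
In code, per-token noise assignment might look like the following sketch. It assumes the rectified-flow interpolation x_t = (1 - t)·x0 + t·ε over latent patch tokens; the function name, noise levels, and heavy-token fraction are illustrative, not taken from the paper:

```python
import torch

def dual_timestep_corrupt(tokens, t_high=0.9, t_low=0.1, frac_heavy=0.5):
    """Apply heavy noise to a random subset of tokens, light noise to the rest.

    tokens: (batch, num_tokens, dim) latent patch tokens.
    Assumes the rectified-flow interpolation x_t = (1 - t) * x0 + t * eps;
    the exact schedule in the paper may differ.
    """
    b, n, _ = tokens.shape
    eps = torch.randn_like(tokens)
    # Choose which tokens receive the high timestep (heavy noise).
    heavy = torch.rand(b, n, 1) < frac_heavy
    t = torch.where(heavy, torch.tensor(t_high), torch.tensor(t_low))
    # Per-token noise level: each token is interpolated at its own timestep.
    x_t = (1.0 - t) * tokens + t * eps
    return x_t, t, eps
```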

This creates an information asymmetry: the model sees some parts clearly and others barely at all. Two forward passes:

  1. Mixed input (heterogeneous noise) → produces representations
  2. Clean input (all tokens at lower noise) → produces target representations

The self-supervised loss: predict the clean representations from the noisy ones. Combined with the standard flow matching loss, the model learns both generation and semantic representations simultaneously.
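Putting the two passes together, the combined objective can be sketched as follows. This is a minimal sketch under assumptions: `model(x_t, t)` returning a (velocity, representation) pair, the `lambda_rep` weighting, and the specific noise levels are all illustrative, not the paper's actual interface:

```python
import torch
import torch.nn.functional as F

def interpolate(x0, t, eps):
    # Rectified-flow interpolation with a per-token timestep t (assumption).
    return (1.0 - t) * x0 + t * eps

def self_flow_loss(model, x0, t_high=0.9, t_low=0.1, lambda_rep=0.5):
    """Combined flow-matching + representation-prediction loss (sketch).

    Assumes `model(x_t, t)` returns (predicted velocity, token representations);
    names and weighting are illustrative.
    """
    b, n, _ = x0.shape
    eps = torch.randn_like(x0)
    # Pass 1: heterogeneous noise -- a random half of tokens heavy, half light.
    heavy = torch.rand(b, n, 1) < 0.5
    t_mix = torch.where(heavy, torch.tensor(t_high), torch.tensor(t_low))
    v_pred, rep_online = model(interpolate(x0, t_mix, eps), t_mix)
    # Flow-matching loss: for x_t = (1 - t) x0 + t eps, target velocity is eps - x0.
    loss_fm = F.mse_loss(v_pred, eps - x0)
    # Pass 2: uniformly low noise -> "clean" target representations (stop-gradient).
    t_all_low = torch.full((b, n, 1), t_low)
    with torch.no_grad():
        _, rep_target = model(interpolate(x0, t_all_low, eps), t_all_low)
    # Self-supervised loss: predict the clean representations from the noisy pass.
    loss_rep = F.mse_loss(rep_online, rep_target)
    return loss_fm + lambda_rep * loss_rep
```

Note the stop-gradient on the second pass: the target representations act as a fixed teacher signal, so only the mixed-noise pass is trained to close the gap.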

Why this matters for JEPA

Self-Flow is significant for the JEPA story because it shows the JEPA principle — learn by predicting representations, not by reconstructing pixels — infiltrating the generative modeling world:

  1. The self-supervised loss in Self-Flow is JEPA-like: predict a clean representation from a corrupted input, without pixel reconstruction. The information asymmetry from dual-timestep scheduling is analogous to JEPA's masking.

  2. External encoders have scaling problems: REPA showed that stronger external encoders sometimes give diminishing or negative returns — more powerful DINO doesn't always help more. Self-Flow avoids this by making the model its own teacher (echoing JEPA's EMA teacher principle).

  3. Modality generality: external alignment (REPA) works for images but hurts video and audio generation. Self-Flow works across all three in a single model — the same modality generality that JEPA achieves naturally.
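
For reference, the JEPA-style EMA teacher mentioned above keeps a slow-moving copy of the student's weights. This is a generic sketch of that principle, not Self-Flow's mechanism (Self-Flow instead uses a second forward pass at lower noise as its teacher):

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.996):
    """Move each teacher parameter toward the student:
    p_t <- decay * p_t + (1 - decay) * p_s."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)
```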

Key results

  • 2.8x faster convergence than REPA on text-to-image generation
  • REPA plateaus; Self-Flow continues to improve
  • Works across image, video, and audio in a single jointly-trained model
  • Agnostic to autoencoder choice (SD, FLUX.2, Wan2.2, Songbloom)
  • Improves structural coherence (faces, hands), text rendering accuracy, and temporal consistency

How Self-Flow relates to REPA

|                       | REPA                                       | Self-Flow                     |
|-----------------------|--------------------------------------------|-------------------------------|
| Representation source | External encoder (DINOv2)                  | Self-supervised (internal)    |
| Scaling behavior      | Diminishing returns with stronger encoders | Follows expected scaling laws |
| Modality support      | Images only (hurts video/audio)            | Image, video, audio           |
| Extra models needed   | Yes (frozen encoder)                       | None                          |
| Convergence           | 17.5x faster than vanilla                  | 2.8x faster than REPA         |
