Self-Flow
A Black Forest Labs paper that integrates self-supervised representation learning inside flow matching models, eliminating the need for external encoders like DINO or CLIP. Self-Flow directly addresses a limitation of REPA: external alignment doesn't scale predictably and fails to generalize across modalities.
Core idea
Flow matching models (and diffusion models generally) don't learn strong semantic representations on their own: the denoising objective rewards low-level reconstruction rather than high-level structure. REPA showed that borrowing representations from external encoders helps enormously. Self-Flow asks: can the flow model learn these representations itself?
Dual-Timestep Scheduling
The key mechanism. Apply two different noise levels to different subsets of input tokens in the same image:
- Some tokens get heavy noise (high timestep)
- Other tokens get light noise (low timestep)
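Assuming the standard rectified-flow interpolation x_t = (1 - t)·x0 + t·ε, the per-token split might look like the sketch below. The timestep values, split fraction, and toy sizes are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def dual_timestep_noise(tokens, t_high=0.9, t_low=0.1, frac_high=0.5, rng=None):
    """Noise disjoint token subsets of one sample at two different levels.

    tokens: (N, D) clean latent tokens. Uses the rectified-flow
    interpolation x_t = (1 - t) * x0 + t * eps.
    """
    rng = rng or np.random.default_rng(0)
    N, D = tokens.shape
    eps = rng.standard_normal((N, D))
    heavy = rng.random(N) < frac_high            # which tokens get heavy noise
    t = np.where(heavy, t_high, t_low)[:, None]  # per-token timestep, shape (N, 1)
    return (1 - t) * tokens + t * eps, t[:, 0]

x0 = np.zeros((16, 8))                 # toy sample: 16 tokens, dim 8
xt, t = dual_timestep_noise(x0)
print(xt.shape, t.shape)               # (16, 8) (16,)
```

The single `t` array is what makes the scheduling "dual-timestep": every token carries its own noise level, so one forward pass sees a mix of nearly clean and nearly destroyed tokens.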
This creates an information asymmetry: the model sees some parts clearly and others barely at all. Two forward passes:
- Mixed input (heterogeneous noise) → produces representations
- Lightly noised input (all tokens at a uniformly low timestep) → produces target representations
The self-supervised loss: predict the clean representations from the noisy ones. Combined with the standard flow matching loss, the model learns both generation and semantic representations simultaneously.
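A toy sketch of the two passes and the combined objective, using a one-layer stand-in for the backbone. All component names, sizes, the 0.95/0.05 light-noise level, and the unit loss weighting are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                   # token dim (toy size)
W_enc = rng.standard_normal((D, D))     # stand-in "backbone": one linear map
W_pred = rng.standard_normal((D, D))    # head predicting clean reps from noisy ones

def encode(x):
    return np.tanh(x @ W_enc)           # hypothetical shared features

x0 = rng.standard_normal((16, D))       # clean tokens
eps = rng.standard_normal((16, D))
t = np.where(rng.random(16) < 0.5, 0.9, 0.1)[:, None]

x_mixed = (1 - t) * x0 + t * eps        # heterogeneous noise (dual timesteps)
x_clean = 0.95 * x0 + 0.05 * eps        # uniformly light noise

h_mixed = encode(x_mixed)               # pass 1: produces representations
h_clean = encode(x_clean)               # pass 2: targets (stop-gradient in practice)

# Self-supervised loss: predict the clean representations from the noisy pass.
repr_loss = np.mean((h_mixed @ W_pred - h_clean) ** 2)
# Flow matching loss: regress the velocity eps - x0 (rectified-flow convention).
v_pred = h_mixed @ W_enc.T              # stand-in velocity head
fm_loss = np.mean((v_pred - (eps - x0)) ** 2)
total = fm_loss + repr_loss             # joint objective, weighting omitted
print(total > 0)                        # True
```

The point of the sketch: both losses are driven by the same backbone features `h_mixed`, so the model is pushed to learn representations that serve generation and semantics at once.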
Why this matters for JEPA
Self-Flow is significant for the JEPA story because it shows the JEPA principle — learn by predicting representations, not by reconstructing pixels — infiltrating the generative modeling world:
- The self-supervised loss in Self-Flow is JEPA-like: predict a clean representation from a corrupted input, without pixel reconstruction. The information asymmetry from dual-timestep scheduling is analogous to JEPA's masking.
- External encoders have scaling problems: REPA showed that stronger external encoders sometimes give diminishing or negative returns — more powerful DINO doesn't always help more. Self-Flow avoids this by making the model its own teacher (echoing JEPA's EMA teacher principle).
- Modality generality: external alignment (REPA) works for images but hurts video and audio generation. Self-Flow works across all three in a single model — the same modality generality that JEPA achieves naturally.
Key results
- 2.8x faster convergence than REPA on text-to-image generation
- REPA plateaus; Self-Flow continues to improve
- Works across image, video, and audio in a single jointly-trained model
- Agnostic to autoencoder choice (SD, FLUX.2, Wan2.2, Songbloom)
- Improves structural coherence (faces, hands), text rendering accuracy, and temporal consistency
How Self-Flow relates to REPA
| | REPA | Self-Flow |
|---|---|---|
| Representation source | External encoder (DINOv2) | Self-supervised (internal) |
| Scaling behavior | Diminishing returns with stronger encoders | Follows expected scaling laws |
| Modality support | Images only (hurts video/audio) | Image, video, audio |
| Extra models needed | Yes (frozen encoder) | None |
| Convergence | 17.5x faster than vanilla | 2.8x faster than REPA |
See also
- 2410.06940 (REPA) — the external alignment approach Self-Flow improves upon
- latent-prediction — the JEPA principle that Self-Flow adopts for generation
- 2104.14294 (DINO), 2304.07193 (DINOv2) — the external encoders REPA relies on
- jepa-vs-alternatives — JEPA vs diffusion comparison