JEPAwiki
Self-Flow: Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis
Date: 2026-03-09
Modality: image/video/audio
Authors: Hila Chefer, Patrick Esser, Dominik Lorenz, Dustin Podell + 4 more
Tags: flow-matching, self-supervised, representation-learning, generation, multi-modal, related-work
Source: Full text

Self-Flow

A Black Forest Labs paper that integrates self-supervised representation learning inside flow matching models, eliminating the need for external encoders like DINO or CLIP. Self-Flow directly addresses a limitation of REPA: external alignment doesn't scale predictably and fails across modalities.

Core idea

Flow matching models, like diffusion models, don't learn strong semantic representations on their own — their denoising objective focuses on low-level reconstruction rather than high-level structure. REPA showed that borrowing representations from external encoders helps enormously. Self-Flow asks: can we make the flow model learn these representations itself?

Dual-Timestep Scheduling

The key mechanism. Apply two different noise levels to different subsets of input tokens in the same image:

  • Some tokens get heavy noise (high timestep)
  • Other tokens get light noise (low timestep)
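
In code, per-token noise assignment might look like the following sketch. It assumes the rectified-flow interpolation x_t = (1 - t)·x0 + t·ε over latent patch tokens; the function name, noise levels, and heavy-token fraction are illustrative, not taken from the paper:

```python
import torch

def dual_timestep_corrupt(tokens, t_high=0.9, t_low=0.1, frac_heavy=0.5):
    """Apply heavy noise to a random subset of tokens, light noise to the rest.

    tokens: (batch, num_tokens, dim) latent patch tokens.
    Assumes the rectified-flow interpolation x_t = (1 - t) * x0 + t * eps;
    the exact schedule in the paper may differ.
    """
    b, n, _ = tokens.shape
    eps = torch.randn_like(tokens)
    # Choose which tokens receive the high timestep (heavy noise).
    heavy = torch.rand(b, n, 1) < frac_heavy
    t = torch.where(heavy, torch.tensor(t_high), torch.tensor(t_low))
    # Per-token noise level: each token is interpolated at its own timestep.
    x_t = (1.0 - t) * tokens + t * eps
    return x_t, t, eps
```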

This creates an information asymmetry: the model sees some parts clearly and others barely at all. Two forward passes:

  1. Mixed input (heterogeneous noise) → produces representations
  2. Clean input (all tokens at lower noise) → produces target representations

The self-supervised loss: predict the clean representations from the noisy ones. Combined with the standard flow matching loss, the model learns both generation and semantic representations simultaneously.
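Putting the two passes together, the combined objective can be sketched as follows. This is a minimal sketch under assumptions: `model(x_t, t)` returning a (velocity, representation) pair, the `lambda_rep` weighting, and the specific noise levels are all illustrative, not the paper's actual interface:

```python
import torch
import torch.nn.functional as F

def interpolate(x0, t, eps):
    # Rectified-flow interpolation with a per-token timestep t (assumption).
    return (1.0 - t) * x0 + t * eps

def self_flow_loss(model, x0, t_high=0.9, t_low=0.1, lambda_rep=0.5):
    """Combined flow-matching + representation-prediction loss (sketch).

    Assumes `model(x_t, t)` returns (predicted velocity, token representations);
    names and weighting are illustrative.
    """
    b, n, _ = x0.shape
    eps = torch.randn_like(x0)
    # Pass 1: heterogeneous noise -- a random half of tokens heavy, half light.
    heavy = torch.rand(b, n, 1) < 0.5
    t_mix = torch.where(heavy, torch.tensor(t_high), torch.tensor(t_low))
    v_pred, rep_online = model(interpolate(x0, t_mix, eps), t_mix)
    # Flow-matching loss: for x_t = (1 - t) x0 + t eps, target velocity is eps - x0.
    loss_fm = F.mse_loss(v_pred, eps - x0)
    # Pass 2: uniformly low noise -> "clean" target representations (stop-gradient).
    t_all_low = torch.full((b, n, 1), t_low)
    with torch.no_grad():
        _, rep_target = model(interpolate(x0, t_all_low, eps), t_all_low)
    # Self-supervised loss: predict the clean representations from the noisy pass.
    loss_rep = F.mse_loss(rep_online, rep_target)
    return loss_fm + lambda_rep * loss_rep
```

Note the stop-gradient on the second pass: the target representations act as a fixed teacher signal, so only the mixed-noise pass is trained to close the gap.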

Why this matters for JEPA

Self-Flow is significant for the JEPA story because it shows the JEPA principle — learn by predicting representations, not by reconstructing pixels — infiltrating the generative modeling world:

  1. The self-supervised loss in Self-Flow is JEPA-like: predict a clean representation from a corrupted input, without pixel reconstruction. The information asymmetry from dual-timestep scheduling is analogous to JEPA's masking.

  2. External encoders have scaling problems: REPA showed that stronger external encoders sometimes give diminishing or negative returns — more powerful DINO doesn't always help more. Self-Flow avoids this by making the model its own teacher (echoing JEPA's EMA teacher principle).

  3. Modality generality: external alignment (REPA) works for images but hurts video and audio generation. Self-Flow works across all three in a single model — the same modality generality that JEPA achieves naturally.
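
For reference, the JEPA-style EMA teacher mentioned above keeps a slow-moving copy of the student's weights. This is a generic sketch of that principle, not Self-Flow's mechanism (Self-Flow instead uses a second forward pass at lower noise as its teacher):

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.996):
    """Move each teacher parameter toward the student:
    p_t <- decay * p_t + (1 - decay) * p_s."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)
```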

Key results

  • 2.8x faster convergence than REPA on text-to-image generation
  • REPA plateaus; Self-Flow continues to improve
  • Works across image, video, and audio in a single jointly-trained model
  • Agnostic to autoencoder choice (SD, FLUX.2, Wan2.2, Songbloom)
  • Improves structural coherence (faces, hands), text rendering accuracy, and temporal consistency

How Self-Flow relates to REPA

|                       | REPA                                       | Self-Flow                     |
|-----------------------|--------------------------------------------|-------------------------------|
| Representation source | External encoder (DINOv2)                  | Self-supervised (internal)    |
| Scaling behavior      | Diminishing returns with stronger encoders | Follows expected scaling laws |
| Modality support      | Images only (hurts video/audio)            | Image, video, audio           |
| Extra models needed   | Yes (frozen encoder)                       | None                          |
| Convergence           | 17.5x faster than vanilla                  | 2.8x faster than REPA         |
