JEPAwiki

Collapse Prevention

The central training challenge for all JEPA variants. Without explicit reconstruction (unlike MAE/generative models), JEPA must prevent the encoder from mapping all inputs to a constant representation (representation collapse). Every paper in the family addresses this problem, and the solutions have evolved significantly.

The problem

JEPA's prediction loss alone incentivizes collapse: if the encoder maps every input to the same embedding, prediction is trivially perfect (always predict the constant, always be correct). Other self-supervised families rule this out by construction:

  • Contrastive methods use negative samples to push representations apart
  • Generative methods reconstruct pixels — a constant output cannot reconstruct diverse inputs

JEPA has neither mechanism, so it needs a different one. The history of the family can be read as an evolving answer to this single question.

LeCun's position paper formalized this as four criteria that must be satisfied simultaneously during JEPA training:

  1. Maximize information content of the context representation (prevent the x-encoder from collapsing)
  2. Maximize information content of the target representation (prevent the y-encoder from collapsing)
  3. Make the target predictable from the context (the prediction loss)
  4. Minimize information content of the latent variable (prevent the latent from trivially copying the target)

Criteria 1+2 prevent informational collapse (constant representations). Criterion 4 prevents latent bypass collapse (the predictor ignoring the context and just copying the latent). Together, they force the model to learn representations where the world is predictable — abstracting away unpredictable details into encoder invariances.

Approach 1: Exponential Moving Average (EMA) target encoder

Used by: I-JEPA, V-JEPA 2, V-JEPA 2.1

The most common approach in the family. The key idea: use two copies of the encoder.

  1. Context encoder (θ): processes visible patches, trained via gradient descent
  2. Target encoder (θ̄): processes masked regions, updated as an exponential moving average of the context encoder
θ̄ ← α · θ̄ + (1 - α) · θ

This is combined with a stop-gradient on the target encoder output, sg(E_θ̄(y)), so gradients flow only through the context encoder.
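
A minimal NumPy sketch of the EMA update (the stop-gradient is implicit here, since no gradients are computed; in an autograd framework the update would run outside the graph). The parameter lists are hypothetical stand-ins for real encoder weights:

```python
import numpy as np

def ema_update(target_params, context_params, alpha=0.99925):
    """Move each target-encoder parameter a fraction (1 - alpha)
    toward the corresponding context-encoder parameter."""
    return [alpha * t + (1.0 - alpha) * c
            for t, c in zip(target_params, context_params)]

# Toy example with a single "weight" per encoder and an exaggerated
# step (alpha = 0.9) so the movement is visible.
context = [np.array([1.0, 2.0])]
target = [np.array([0.0, 0.0])]
target = ema_update(target, context, alpha=0.9)
# target[0] is now 0.9 * [0, 0] + 0.1 * [1, 2] = [0.1, 0.2]
```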

Why it prevents collapse

The target encoder moves slowly (high α, e.g., 0.99925 in V-JEPA 2.1), providing a stable, slowly-evolving prediction target. If the context encoder starts collapsing, the target encoder still produces diverse outputs (it lags behind), creating a loss signal that pushes the context encoder away from the trivial solution.

Practical details

  • V-JEPA 2 uses a fixed EMA coefficient (simplified from the original ramped schedule where α increases over training)
  • V-JEPA 2.1 uses α = 0.99925 — meaning the target encoder moves ~0.075% toward the context encoder per step
  • The α range that works in practice is roughly 0.99 to 0.9999: below that, targets are too volatile and oscillate with training noise; above it, targets barely move and training stalls. The optimal value depends on batch size and learning rate — larger batches tolerate lower α.

Limitations

  • Adds complexity: two encoder copies in memory
  • Requires stop-gradient (non-standard optimization)
  • Sensitive to the momentum parameter
  • Can still collapse with poor hyperparameter choices

Approach 2: SIGReg (Sketched-Isotropic-Gaussian Regularizer)

Used by: LeWorldModel

The most principled approach. SIGReg directly regularizes the embedding distribution to match an isotropic Gaussian N(0, I), making collapse impossible.

How SIGReg works

Given embeddings Z ∈ R^(N×B×d) collected over history length N, batch size B, and embedding dimension d:

  1. Project: sample M random unit-norm directions u^(m) ∈ S^(d-1) (default M=1024)
  2. Test normality: for each projection h^(m) = Z · u^(m), compute the Epps-Pulley univariate normality test statistic T(h^(m))
  3. Aggregate: SIGReg(Z) = (1/M) Σ T(h^(m))
  4. Optimize: add λ · SIGReg(Z) to the loss (default λ=0.1)
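
The steps above can be sketched in NumPy. Note that the statistic below is an illustrative characteristic-function distance in the spirit of Epps-Pulley, not the paper's exact weighted form, and the embeddings are flattened to shape [n, d] for simplicity:

```python
import numpy as np

def normality_stat(h, nodes=np.linspace(0.2, 4.0, 17)):
    """Illustrative Epps-Pulley-style statistic: squared distance between
    the empirical characteristic function of the projection h and the
    characteristic function of N(0, 1), averaged over quadrature nodes
    in [0.2, 4]."""
    ecf = np.exp(1j * np.outer(nodes, h)).mean(axis=1)  # phi_n(t)
    gauss_cf = np.exp(-nodes**2 / 2.0)                  # CF of N(0, 1)
    return np.mean(np.abs(ecf - gauss_cf) ** 2)

def sigreg(Z, M=1024, seed=0):
    """Average the normality statistic over M random unit-norm
    projection directions of the embeddings Z (shape [n, d])."""
    rng = np.random.default_rng(seed)
    n, d = Z.shape
    U = rng.standard_normal((M, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)       # u^(m) on S^(d-1)
    H = Z @ U.T                                         # projections [n, M]
    return float(np.mean([normality_stat(H[:, m]) for m in range(M)]))

rng = np.random.default_rng(1)
healthy = rng.standard_normal((512, 16))    # roughly N(0, I) embeddings
collapsed = np.zeros((512, 16))             # constant representation
# sigreg(collapsed) is much larger than sigreg(healthy), so minimizing
# the regularizer pushes the encoder away from the constant solution.
```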

Mathematical foundation

By the Cramér-Wold theorem, a multivariate distribution is Gaussian if and only if all its univariate projections are Gaussian. SIGReg exploits this by testing many random projections — if they all look Gaussian, the full distribution is Gaussian. Convergence: SIGReg(Z) → 0 if and only if P_Z → N(0, I).

The Epps-Pulley statistic uses a quadrature scheme with T nodes uniformly distributed in [0.2, 4] for efficient computation.

Why isotropic Gaussian prevents collapse

  • Non-constant: Gaussian distributions have positive variance in every direction
  • Feature decorrelation: isotropy means no two dimensions carry redundant information
  • Smooth: gradients are well-behaved, enabling stable optimization

Key advantages over EMA

                               EMA                          SIGReg
  Stop-gradient needed         Yes                          No
  EMA momentum tuning          Yes                          No
  Extra encoder copy           Yes                          No
  Pretrained encoder needed    Sometimes                    No
  Tunable hyperparameters      6+ (in comparable methods)   1 (just λ)
  Training stability           Good with tuning             Excellent, smooth convergence

LeWorldModel showed that SIGReg alone enables stable end-to-end JEPA training from raw pixels — the first time this was achieved without any of the standard heuristics.

Tuning λ

The single hyperparameter λ balances prediction accuracy vs. distribution regularity. LeWorldModel proposes bisection search with O(log n) cost — far cheaper than grid-searching 6+ hyperparameters.
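
A generic sketch of why bisection is cheap. The trade-off function below is a hypothetical stand-in (the paper's actual search criterion is not reproduced here): assume a scalar signal that is negative when λ under-regularizes and positive when it over-regularizes, then bisect for the crossover:

```python
def bisect_lambda(trade_off, lo=1e-3, hi=10.0, tol=1e-2):
    """Find the crossover of a monotone trade-off signal by bisection.
    Cost is O(log((hi - lo) / tol)) evaluations, versus one evaluation
    per grid point for a grid search."""
    evals = 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if trade_off(mid) < 0.0:   # under-regularized: raise lambda
            lo = mid
        else:                      # over-regularized: lower lambda
            hi = mid
        evals += 1
    return 0.5 * (lo + hi), evals

# With a toy signal whose crossover is at lambda = 0.1, the search
# localizes it within tol = 0.01 in only ~10 evaluations.
lam, evals = bisect_lambda(lambda l: l - 0.1)
```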

Approach 3: Deep self-supervision

Used by: V-JEPA 2.1

Instead of only preventing collapse at the encoder output, apply the self-supervised objective at multiple intermediate layers:

  • 4 supervision points at equally spaced encoder layers (e.g., [12, 24, 36, 48] for ViT-G with 48 layers)
  • Both prediction loss and context loss computed at each level
  • Intermediate features concatenated via channel dimension and fused with lightweight MLP
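
A shape-level sketch of the tap-and-fuse pattern, with a toy stack of random linear layers standing in for the real ViT blocks (the layer count and tap indices come from the text; everything else is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_LAYERS = 16, 48
TAPS = {12, 24, 36, 48}  # equally spaced supervision points

# Toy "encoder": 48 random linear layers with a tanh nonlinearity.
layers = [rng.standard_normal((DIM, DIM)) / np.sqrt(DIM)
          for _ in range(N_LAYERS)]

def forward_with_taps(x):
    """Run the stack, recording features at each supervision point.
    In the real recipe, prediction and context losses are computed at
    every tap, not only at the final output."""
    feats = []
    for i, W in enumerate(layers, start=1):
        x = np.tanh(x @ W)
        if i in TAPS:
            feats.append(x)
    return feats

x = rng.standard_normal((4, DIM))          # a batch of 4 token embeddings
feats = forward_with_taps(x)               # 4 feature maps, one per tap
fused = np.concatenate(feats, axis=-1)     # channel-wise concat: [4, 64]
# A lightweight MLP would then fuse `fused` back to the working width.
```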

Why it helps

Collapse can happen at intermediate layers even if the output layer is healthy. Deep self-supervision ensures every layer produces non-trivial, useful features. This is especially important for dense tasks where features from intermediate layers matter.

Approach 4: VICReg-style regularization

A related family of approaches (used in some JEPA-adjacent work):

  • Variance term: prevents collapse to a single point by maintaining minimum variance per dimension
  • Invariance term: the prediction loss itself
  • Covariance term: decorrelates embedding dimensions to prevent "dimensional collapse" (all information packed into a few dimensions)

VICReg is simpler than EMA but more heuristic than SIGReg. It requires tuning the relative weights of all three terms.
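
The variance and covariance terms can be sketched as follows (a minimal NumPy version in the spirit of VICReg; the hinge threshold gamma and the epsilon are assumptions, and the invariance term, being the prediction loss itself, is omitted):

```python
import numpy as np

def vicreg_reg_terms(z, gamma=1.0, eps=1e-4):
    """Variance and covariance regularizers for embeddings z of shape
    [batch, dim]."""
    z = z - z.mean(axis=0)
    # Variance term: hinge keeping each dimension's std above gamma,
    # which rules out collapse to a single point.
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = float(np.mean(np.maximum(0.0, gamma - std)))
    # Covariance term: penalize off-diagonal covariance entries so no
    # two dimensions carry redundant information (dimensional collapse).
    n, d = z.shape
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = float((off_diag ** 2).sum() / d)
    return var_loss, cov_loss

rng = np.random.default_rng(0)
healthy = rng.standard_normal((64, 8))   # diverse embeddings
collapsed = np.ones((64, 8))             # every input mapped to one point
# The variance term is near zero for `healthy` and large for `collapsed`.
```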

Approach 5: Distillation (V-JEPA 2.1 fine-tuning)

For the distillation/fine-tuning phase of V-JEPA 2.1, the EMA encoder is replaced by a frozen teacher — a pre-trained V-JEPA 2.1 model from the primary training phase. This eliminates the need for EMA during fine-tuning and provides a strong, fixed target.

Two-stage distillation:

  1. Primary stage: frozen low-resolution teacher (16 frames, 256x256)
  2. Cooldown stage: frozen high-resolution teacher (64 frames, 384x384)

The student is initialized with the EMA weights from stage 1.

Comparison

  Method                 Simplicity  Stability      Dense feature quality  End-to-end from pixels
  EMA                    Medium      Good (tuned)   Good                   Requires heuristics
  SIGReg                 High        Excellent      Competitive            Yes
  Deep self-supervision  Low         Good           Best                   With EMA
  VICReg                 Medium      Medium         —                      —
  Distillation           Medium      Good           Best (fine-tune)       Requires pretrained model

Practical guidance: which method to use

Starting a new JEPA project from scratch? Use SIGReg (LeJEPA recipe). It's the simplest to implement (~50 lines), has 1 hyperparameter, and trains stably without EMA or stop-gradient. You'll get competitive representations without any of the heuristic tuning.

Scaling to production / maximum performance? Use EMA + deep self-supervision (V-JEPA 2.1 recipe). The complexity is higher (6+ hyperparameters), but this produces the best features on dense tasks — the +23 mIoU segmentation improvement over V-JEPA 2 comes from deep self-supervision, not from scale alone.

Training a world model for planning? LeWorldModel showed SIGReg is sufficient for stable end-to-end world model training from pixels. The 48x planning speedup comes from the small model (15M params), not the collapse prevention method — but SIGReg enables the small model to train stably.

Fine-tuning a pretrained JEPA? Use distillation with a frozen teacher. EMA is unnecessary when you have a strong pretrained model to anchor the target encoder.

The convergence question

LeWorldModel's contribution is showing that the minimal solution (SIGReg alone) is sufficient — you can train JEPA stably from scratch without EMA, stop-gradient, or pretrained encoders. V-JEPA 2.1's contribution is showing that the maximal solution (EMA + deep self-supervision + distillation) produces the best features. The field is converging on understanding the precise trade-offs between simplicity and performance.

See also