Masking Strategies
Masking is the core mechanism that drives JEPA learning. The choice of what to mask, how much, and at what granularity fundamentally shapes what representations the model learns. This page documents every masking approach used across the family, with concrete parameters.
Patch-level masking (standard)
Used by I-JEPA, V-JEPA 2, V-JEPA 2.1.
The input is divided into a regular grid of patches. A subset is masked (hidden from the encoder) and must be predicted by the predictor from the visible context.
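The grid-and-subset mechanic can be sketched as follows. This is a minimal illustration, not any specific model's recipe: the 14×14 grid (a 224×224 image with 16×16 patches) and the 75% mask ratio are assumptions chosen for the example.

```python
import numpy as np

def sample_patch_mask(grid_h=14, grid_w=14, mask_ratio=0.75, rng=None):
    """Randomly hide a fraction of grid patches from the encoder."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = grid_h * grid_w
    n_masked = int(n * mask_ratio)
    perm = rng.permutation(n)
    masked = np.zeros(n, dtype=bool)
    masked[perm[:n_masked]] = True
    return masked  # True = hidden from the encoder

mask = sample_patch_mask()
context_idx = np.flatnonzero(~mask)  # visible patches fed to the encoder
target_idx = np.flatnonzero(mask)    # positions the predictor must fill in
```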
I-JEPA's key insight
The masking strategy is the primary design choice. Two requirements discovered empirically:
- Target blocks must be large (semantic scale) — small targets can be predicted from local texture, yielding low-level features
- Context must be spatially distributed — a single contiguous context block doesn't provide enough global information
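Both requirements can be illustrated with a block sampler on a 14×14 patch grid: targets are large contiguous rectangles, and the context is the spatially distributed remainder. The block count, scale range, and aspect-ratio range below are illustrative assumptions, not the exact I-JEPA hyperparameters.

```python
import numpy as np

def sample_block(grid=14, scale=(0.15, 0.2), aspect=(0.75, 1.5), rng=None):
    """Sample one large rectangular target block on a square patch grid."""
    rng = np.random.default_rng(0) if rng is None else rng
    area = grid * grid * rng.uniform(*scale)
    ar = rng.uniform(*aspect)
    h = min(grid, max(1, int(round(np.sqrt(area * ar)))))
    w = min(grid, max(1, int(round(np.sqrt(area / ar)))))
    top = rng.integers(0, grid - h + 1)
    left = rng.integers(0, grid - w + 1)
    block = np.zeros((grid, grid), dtype=bool)
    block[top:top + h, left:left + w] = True
    return block

rng = np.random.default_rng(0)
targets = [sample_block(rng=rng) for _ in range(4)]  # large, semantic-scale targets
target_mask = np.any(targets, axis=0)
context_mask = ~target_mask  # distributed remainder, not one contiguous blob
```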
Multiblock masking (V-JEPA 2)
Multiple target blocks are sampled per image/video; the context is everything not masked. Random patches are dropped from the input, and learnable mask tokens (Δy) mark the positions of the dropped patches; these tokens are concatenated with the encoder output before being passed to the predictor.
V-JEPA 2.1 masking parameters
Concrete parameters from the training recipe:
- Spatial mask scale: [0.15, 0.7] (fraction of spatial extent)
- Temporal mask scale: [1.0, 1.0] (full temporal span per mask)
- Mask aspect ratio: [0.75, 1.5]
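A mask sampler using the parameters above might look like the following sketch. The tubelet grid size (8×16×16) and the sampling procedure itself are assumptions for illustration; only the three parameter ranges come from the recipe.

```python
import numpy as np

def sample_st_mask(T=8, H=16, W=16, spatial_scale=(0.15, 0.7),
                   temporal_scale=(1.0, 1.0), aspect=(0.75, 1.5), rng=None):
    """Sample one spatio-temporal mask over a (T, H, W) token grid."""
    rng = np.random.default_rng(0) if rng is None else rng
    area = H * W * rng.uniform(*spatial_scale)      # spatial mask scale
    ar = rng.uniform(*aspect)                       # mask aspect ratio
    h = min(H, max(1, int(round(np.sqrt(area * ar)))))
    w = min(W, max(1, int(round(np.sqrt(area / ar)))))
    t = max(1, int(round(T * rng.uniform(*temporal_scale))))  # == T here
    top, left = rng.integers(0, H - h + 1), rng.integers(0, W - w + 1)
    t0 = rng.integers(0, T - t + 1)                 # == 0 for full-span masks
    mask = np.zeros((T, H, W), dtype=bool)
    mask[t0:t0 + t, top:top + h, left:left + w] = True
    return mask

mask = sample_st_mask()
```

With temporal scale fixed at [1.0, 1.0], every mask spans all frames, so the same spatial region is hidden across the whole clip.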
Multi-block sampling (3D-JEPA)
Used by 3D-JEPA for point clouds.
A structured sampling approach:
- 4 target blocks selected via FPS (Farthest Point Sampling), scale range [0.15, 0.2]
- 1 large context block, scale range [0.85, 1.0]
- Overlapping tokens between context and target are removed
This sampling is combined with a context-aware decoder that feeds context information in via cross-attention at every decoder layer (not just the first), which prevents the encoder from memorizing position-to-target mappings and forces semantic feature learning.
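The sampling procedure can be sketched on a synthetic point cloud. The FPS routine and neighborhood-based block construction below follow the description above, but the cloud size (512 tokens) and the nearest-neighbor block definition are assumptions for illustration.

```python
import numpy as np

def farthest_point_sampling(xyz, k, rng=None):
    """Greedily pick k well-separated points: each new pick maximizes
    its distance to all points chosen so far."""
    rng = np.random.default_rng(0) if rng is None else rng
    idx = [int(rng.integers(len(xyz)))]
    dist = np.linalg.norm(xyz - xyz[idx[0]], axis=1)
    for _ in range(k - 1):
        idx.append(int(dist.argmax()))
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[idx[-1]], axis=1))
    return np.array(idx)

rng = np.random.default_rng(0)
xyz = rng.standard_normal((512, 3))              # token center coordinates

# 4 target blocks at FPS-selected centers, scale in [0.15, 0.2]
centers = farthest_point_sampling(xyz, 4, rng)
k = int(len(xyz) * rng.uniform(0.15, 0.2))
targets = [np.argsort(np.linalg.norm(xyz - xyz[c], axis=1))[:k] for c in centers]

# 1 large context block, scale in [0.85, 1.0], minus overlapping tokens
c = int(rng.integers(len(xyz)))
context_size = int(len(xyz) * rng.uniform(0.85, 1.0))
context = np.argsort(np.linalg.norm(xyz - xyz[c], axis=1))[:context_size]
context = np.setdiff1d(context, np.concatenate(targets))  # remove overlap
```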
Sequencer-based masking (Point-JEPA)
Used by Point-JEPA for point clouds.
Point clouds lack the natural 2D grid structure of images, so spatial proximity must be computed explicitly. Point-JEPA introduces a greedy sequencer that:
- Orders patch embeddings by spatial proximity (greedy algorithm based on coordinate sums)
- Uses the ordering to efficiently compute proximity-based context/target selection via indices
- Shares computation between context and target selection
Optimal parameters discovered via ablation:
- Target ratio: [0.15, 0.2] (fraction of patches as targets)
- Context ratio: [0.4, 0.75] (fraction of patches as context)
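A minimal sketch of the sequencer idea: start from the patch center with the smallest coordinate sum, then repeatedly append the nearest unvisited center; contiguous index ranges of the resulting order then give spatially proximate target and context selections. The exact tie-breaking, data layout, and the slice positions below are assumptions; only the two ratio ranges come from the ablation.

```python
import numpy as np

def greedy_sequence(centers):
    """Order patch centers by spatial proximity, starting from the
    center with the minimum coordinate sum."""
    order = [int(centers.sum(axis=1).argmin())]
    remaining = set(range(len(centers))) - {order[0]}
    while remaining:
        last = centers[order[-1]]
        nxt = min(remaining,
                  key=lambda i: float(np.linalg.norm(centers[i] - last)))
        order.append(nxt)
        remaining.remove(nxt)
    return np.array(order)

rng = np.random.default_rng(0)
centers = rng.standard_normal((64, 3))
order = greedy_sequence(centers)

# Context/target selection reuses the same ordering via index slices,
# so the proximity computation is shared between the two selections.
n = len(order)
target = order[: int(n * 0.15)]                              # target ratio
context = order[int(n * 0.25): int(n * 0.25) + int(n * 0.4)]  # context ratio
```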
Point-JEPA benefits from deeper predictors: additional layers help with the complex spatial reasoning required.
Object-level masking (C-JEPA)
Used by C-JEPA. See object-centric-representations for full details.
Instead of masking spatial patches, mask entire objects. This is qualitatively different:
- Forces the model to reason about inter-object interactions (not spatial interpolation)
- Prevents shortcut solutions (can't copy from nearby patches when the whole object is gone)
- Induces causal inductive bias via latent interventions
- Uses only 1% of the features needed by patch-based methods for comparable planning
Masking protocol
- Selected objects masked across the full history window T
- Exception: one "identity anchor" at t₀ preserves object identity
- Masked tokens: φ(z_{t₀}^i) + e_τ (linear projection of anchor + temporal encoding)
- Future tokens always masked for prediction
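The protocol above can be sketched as follows: a masked object's token at each later timestep is replaced by φ(z_{t₀}^i) + e_τ, a linear projection of its identity anchor plus a temporal encoding. The feature dimension, the sinusoidal form of e_τ, and the tensor layout are assumptions for illustration.

```python
import numpy as np

d = 16                                            # slot feature dim (assumed)
rng = np.random.default_rng(0)
phi = rng.standard_normal((d, d)) / np.sqrt(d)    # linear projection phi

def temporal_encoding(t, dim=d):
    """Sinusoidal temporal encoding e_tau (assumed form)."""
    i = np.arange(dim)
    angles = t / (10000 ** (2 * (i // 2) / dim))
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

T, n_obj = 6, 4
z = rng.standard_normal((T, n_obj, d))            # object slots over time
masked_objects = [1, 3]                           # objects hidden across history

tokens = z.copy()
for i in masked_objects:
    anchor = z[0, i]                              # identity anchor at t0 is kept
    for t in range(1, T):                         # rest of the window is masked
        tokens[t, i] = phi @ anchor + temporal_encoding(t)
# Future timesteps beyond T would be masked the same way for prediction.
```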
Masking budget matters
| Budget (of 4 objects) | Effect |
|---|---|
| 1/4 masked | Mild regularization, some improvement |
| 2/4 masked | Good balance for weaker encoders (SAVi) |
| 3/4 masked | Strong causal signal |
| 4/4 masked | Best with strong encoder (VideoSAUR), too aggressive for weak ones |
Object-level masking consistently outperforms token-level and tube-level masking of equivalent budget.
Time-frequency masking (Audio-JEPA)
Used by Audio-JEPA (2507.02915, not on HF Papers).
Adapted for audio spectrograms. Time-frequency-aware masking patterns account for the different structure of audio vs. visual data — temporal patterns (rhythm, speech) and frequency patterns (pitch, timbre) require different masking scales.
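One way to realize a time-frequency-aware pattern is to sample the mask extent independently along each axis of the spectrogram patch grid. The grid size and the separate time/frequency scale ranges below are illustrative assumptions, not Audio-JEPA's published parameters.

```python
import numpy as np

def sample_tf_mask(T=64, F=16, time_scale=(0.2, 0.5), freq_scale=(0.3, 0.8),
                   rng=None):
    """Sample one rectangular mask on a (time, frequency) patch grid,
    with independent extents along the time and frequency axes."""
    rng = np.random.default_rng(0) if rng is None else rng
    t_len = max(1, int(T * rng.uniform(*time_scale)))   # temporal extent
    f_len = max(1, int(F * rng.uniform(*freq_scale)))   # frequency extent
    t0 = rng.integers(0, T - t_len + 1)
    f0 = rng.integers(0, F - f_len + 1)
    mask = np.zeros((T, F), dtype=bool)
    mask[t0:t0 + t_len, f0:f0 + f_len] = True
    return mask

mask = sample_tf_mask()
```

Decoupling the two axes lets a single sampler hide long temporal spans (rhythm, speech) or wide frequency bands (pitch, timbre) at different scales.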
Dense predictive loss (V-JEPA 2.1)
Used by V-JEPA 2.1. A departure from standard masking.
In standard JEPA, only masked tokens contribute to the loss. In V-JEPA 2.1, all tokens contribute — both visible context and masked tokens:
L = L_predict (masked tokens) + L_ctx (context tokens)
Context loss with distance weighting
The context loss uses a dynamic weighting scheme:
λ_i = λ / √(d_min(i, M))
Where d_min(i, M) is the spatio-temporal distance from context token i to the closest masked token. Tokens near masked regions get higher weight, enforcing local continuity between visible and predicted representations.
Fixed λ values: 0.5 for video, 0.7 for images.
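The weighting and the two-term loss can be sketched numerically. A 1D token layout and an L2 regression loss are simplifying assumptions; the formula λᵢ = λ / √(d_min(i, M)) and λ = 0.5 for video come from the text above.

```python
import numpy as np

def context_weights(positions, masked_positions, lam=0.5):
    """lambda_i = lam / sqrt(d_min(i, M)): context tokens closer to a
    masked token get a larger weight."""
    d_min = np.abs(positions[:, None] - masked_positions[None, :]).min(axis=1)
    return lam / np.sqrt(np.maximum(d_min, 1))  # guard against d_min = 0

pos = np.arange(10, dtype=float)                # 1D token positions (assumed)
masked = np.array([4.0, 5.0])                   # masked token positions
ctx = np.setdiff1d(pos, masked)                 # visible context tokens

rng = np.random.default_rng(0)
pred = rng.standard_normal((10, 8))             # predictor outputs
target = np.zeros((10, 8))                      # target-encoder outputs (dummy)

w = context_weights(ctx, masked)
per_tok = ((pred - target) ** 2).mean(axis=1)
L_predict = per_tok[masked.astype(int)].mean()      # masked-token loss
L_ctx = (w * per_tok[ctx.astype(int)]).mean()       # weighted context loss
L = L_predict + L_ctx
```

Note how the context token at position 3 (adjacent to the mask) receives weight 0.5, while the token at position 0 (distance 4) receives only 0.25.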
Impact
The Dense Predictive Loss is the primary innovation behind V-JEPA 2.1's dramatic improvements on dense prediction tasks:
- +23.4 mIoU on ADE20K segmentation (vs V-JEPA 2)
- +27.6 mIoU on Cityscapes
- +20.7 mIoU on VOC12
By making visible tokens participate in the loss, the model is forced to produce spatially grounded representations everywhere, not just at masked locations.
Deep self-supervision (V-JEPA 2.1)
While not strictly a masking strategy, deep self-supervision extends the masking-based objective to multiple encoder layers. V-JEPA 2.1 applies the prediction objective at 4 supervision levels: 3 equally spaced intermediate layers plus the final layer (e.g., layers [12, 24, 36, 48] for ViT-G):
- Outputs from 3 intermediate layers + final layer are concatenated along the channel dimension
- A lightweight MLP fuses multi-level representations before the predictor
- The predictor produces 4 separate outputs, one per supervision level
- Both prediction loss and context loss are applied at each level
This prevents feature collapse at intermediate layers and produces representations useful at multiple abstraction levels simultaneously.
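The multi-level fusion step can be sketched as follows. The feature dimension, token count, and the exact MLP shape are assumptions; the structure (concatenate 4 levels along channels, fuse with a lightweight MLP before the predictor) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 196, 32                             # token count / width (assumed)

# Outputs from 3 intermediate layers plus the final layer
levels = [rng.standard_normal((n_tokens, d)) for _ in range(4)]
fused_in = np.concatenate(levels, axis=-1)        # (n_tokens, 4*d)

# Lightweight 2-layer MLP fusing multi-level features back to width d
W1 = rng.standard_normal((4 * d, 2 * d)) / np.sqrt(4 * d)
W2 = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
fused = np.maximum(fused_in @ W1, 0) @ W2         # ReLU MLP -> (n_tokens, d)

# The predictor would then emit one output per supervision level, with both
# the prediction loss and the context loss applied at each of the 4 levels.
```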
Evolution: what masking teaches us
The progression across the family reveals a key insight: masking strategy is not a training detail — it's the primary lever for controlling what the model learns.
Patch masking → Learns spatial structure and appearance
Object masking → Learns causal interactions
Dense loss → Learns spatially grounded features
Deep supervision → Learns multi-scale representations
Each innovation in masking unlocked a qualitatively new capability that wasn't achievable by simply scaling the previous approach.
See also
- latent-prediction — masking determines what gets predicted
- collapse-prevention — masking interacts with collapse prevention
- object-centric-representations — the full object masking story
- benchmarks-and-results — quantitative impact of masking choices