Vision Transformers in JEPA
Every JEPA variant uses a Vision Transformer (ViT) as its encoder backbone. The specific architecture choices — model size, patch size, positional encoding, and how inputs are tokenized — vary across the family and have a significant impact on performance.
Encoder scales
| Model | Params | Width | Depth | Heads | MLP | Used by |
|---|---|---|---|---|---|---|
| ViT-Tiny | ~5M | 192 | 12 | 3 | — | [LeWorldModel](/wiki/papers/2603.19312) |
| ViT-Small | 22M | 384 | 12 | 6 | 1536 | Predictors (V-JEPA 2) |
| ViT-L | 300M | 1024 | 24 | 16 | 4096 | [V-JEPA 2](/wiki/papers/2506.09985) |
| ViT-H | 600M | 1280 | 32 | 16 | 5120 | [V-JEPA 2](/wiki/papers/2506.09985) |
| ViT-g | 1B | 1408 | 40 | 16 | 6144 | [V-JEPA 2](/wiki/papers/2506.09985), [V-JEPA 2.1](/wiki/papers/2603.14482) |
| ViT-G | 2B | — | — | — | — | [V-JEPA 2.1](/wiki/papers/2603.14482) |
The scaling from 300M to 2B parameters is a major part of the V-JEPA story. V-JEPA 2 showed consistent gains: data scaling (2M to 22M samples) added +1.0 point, model scaling (300M to 1B) added +1.5 points, longer training added +0.8 points, higher resolution added +0.7 points.
Input tokenization
Images
Standard 2D patch embedding. A 2D convolution with kernel and stride equal to the patch size (typically 16x16) projects each patch to an embedding vector.
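Because kernel size equals stride, this convolution is equivalent to cutting the image into non-overlapping patches and applying one shared linear projection. A minimal numpy sketch of that equivalence (the random projection matrix here is a stand-in for learned conv weights; the 384-dim output is an illustrative choice):

```python
import numpy as np

def patch_embed_2d(img, patch=16, dim=384, seed=0):
    """2D patch embedding as reshape + matmul, equivalent to a conv
    with kernel == stride == patch size."""
    H, W, C = img.shape
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((patch * patch * C, dim)) * 0.02  # stand-in for conv weights
    # split into (H/p) x (W/p) non-overlapping patches, flatten each
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return x @ proj  # (num_patches, dim)
```

A 224x224 RGB image with 16x16 patches yields 14x14 = 196 tokens.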
Video
3D tubelets: a 3D convolution with kernel 2x16x16 (temporal x height x width) extracts spatiotemporal tokens. This captures 2 consecutive frames per token, giving the encoder temporal information from the start.
Used by V-JEPA 2 and V-JEPA 2.1. V-JEPA 2.1 adds learnable modality tokens — extra embeddings that tell the model whether the input is an image or a video, enabling unified image+video training with the same encoder.
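Like the 2D case, a stride-matched 3D convolution reduces to reshaping into 2x16x16 tubelets and projecting each one. A numpy sketch under the same stand-in-weights assumption (the 1024-dim output is illustrative):

```python
import numpy as np

def tubelet_embed(video, t=2, patch=16, dim=1024, seed=0):
    """3D tubelet tokenization: kernel 2x16x16 with matching stride,
    written as reshape + matmul."""
    T, H, W, C = video.shape
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((t * patch * patch * C, dim)) * 0.02  # stand-in weights
    # split into (T/t) x (H/p) x (W/p) tubelets, flatten each
    x = video.reshape(T // t, t, H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, t * patch * patch * C)
    return x @ proj  # (num_tubelets, dim)
```

A 16-frame 256x256 clip produces 8 x 16 x 16 = 2048 tokens, each summarizing two consecutive frames.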
Point clouds
FPS + k-NN + mini-PointNet: Farthest Point Sampling selects center points, k-Nearest Neighbors groups local neighborhoods, and a small PointNet encodes each group into a patch embedding.
Used by Point-JEPA and 3D-JEPA.
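The FPS and k-NN stages can be sketched in a few lines of numpy (the mini-PointNet that would encode each group is omitted; function names are mine, not from the papers):

```python
import numpy as np

def farthest_point_sample(pts, k):
    """Greedy FPS: start from point 0, repeatedly pick the point
    farthest from all centers chosen so far."""
    idx = np.zeros(k, dtype=int)
    dist = np.full(len(pts), np.inf)
    for i in range(1, k):
        dist = np.minimum(dist, ((pts - pts[idx[i - 1]]) ** 2).sum(axis=1))
        idx[i] = int(np.argmax(dist))
    return idx

def knn_group(pts, center_idx, kn):
    """For each center, gather its kn nearest neighbors; each group
    would then be fed to a mini-PointNet to produce one token."""
    centers = pts[center_idx]
    d = ((centers[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(d, axis=1)[:, :kn]  # (k, kn) neighbor indices
```

Each of the k groups plays the role that a 16x16 patch plays for images: a local region mapped to a single embedding.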
Object-centric
Slot attention: aggregates patch-level features (from a frozen DINOv2 backbone) into a fixed number of object slots. Each slot captures one object's representation.
Used by C-JEPA (via VideoSAUR or SAVi encoders, typically 4-7 slots with 128-dimensional embeddings).
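A stripped-down sketch of the slot-attention update: the key point is that the softmax is taken over the slot axis, so slots compete for patches, and each slot is updated toward the mean of the features it wins. (Assumption-heavy sketch: the real module adds learned query/key/value projections, a GRU update, and an MLP.)

```python
import numpy as np

def slot_attention(feats, num_slots=6, iters=3, seed=0):
    """Simplified slot attention over patch features (n, dim)."""
    n, dim = feats.shape
    rng = np.random.default_rng(seed)
    slots = rng.standard_normal((num_slots, dim))  # random init, stand-in for learned init
    for _ in range(iters):
        logits = feats @ slots.T / np.sqrt(dim)            # (n, num_slots)
        attn = np.exp(logits - logits.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)            # softmax over SLOTS: competition
        w = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = w.T @ feats                                # weighted mean of won features
    return slots  # (num_slots, dim)
```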
Positional encoding
3D Rotary Position Embeddings (3D RoPE)
Used by V-JEPA 2 and V-JEPA 2.1. Applied separately to temporal, height, and width axes. More stable than absolute sinusoidal embeddings for large models and enables flexible resolution/frame count at inference.
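A common way to implement the per-axis application (assumed here, not taken from the V-JEPA code) is to split each head's feature dimension into three equal groups and rotate each group by its own axis position:

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by angles pos * freq_i."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, t, h, w):
    """3D RoPE sketch: one third of the head dim per axis
    (temporal, height, width)."""
    d = x.shape[-1] // 3
    return np.concatenate([rope_1d(x[:d], t),
                           rope_1d(x[d:2 * d], h),
                           rope_1d(x[2 * d:], w)])
```

Because each step is a pure rotation, vector norms are preserved, and attention scores between rotated queries and keys depend only on relative positions — which is what makes extrapolating to more frames or higher resolution at inference well-behaved.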
Learnable temporal positional encoding
Used by C-JEPA for temporal positions of object tokens.
Predictor architectures
The predictor is always smaller than the encoder — it should be expressive enough to model relationships but not so powerful that it bypasses the encoder.
| Method | Predictor | Size | Key feature |
|---|---|---|---|
| [I-JEPA](/wiki/papers/2301.08243) | Shallow ViT | Small | Standard masked prediction |
| [V-JEPA 2](/wiki/papers/2506.09985) | ViT-S | 22M (12 blocks) | Block-causal attention for action conditioning |
| [V-JEPA 2.1](/wiki/papers/2603.14482) | ViT | 24 blocks | Multi-level outputs for deep self-supervision |
| [LeWorldModel](/wiki/papers/2603.19312) | Transformer | 6 layers, 10M | AdaLN for action injection, 10% dropout |
| [C-JEPA](/wiki/papers/2602.11389) | Masked ViT | 6 layers | Bidirectional attention over object tokens |
| [ThinkJEPA](/wiki/papers/2603.22281) | ViT + FiLM | — | FiLM conditioning from VLM guidance |
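The block-causal attention used by the V-JEPA 2 predictor can be sketched as a boolean mask: tokens attend freely within their own block (e.g. one timestep's tokens plus its action token) and to all earlier blocks, but never to future blocks. The block layout below is an illustrative assumption:

```python
import numpy as np

def block_causal_mask(num_blocks, block_size):
    """(n, n) boolean mask, True = query may attend to key.
    Full attention within a block, causal across blocks."""
    n = num_blocks * block_size
    blk = np.arange(n) // block_size  # block index of each token
    return blk[None, :] <= blk[:, None]
```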
Context-aware decoder (3D-JEPA)
3D-JEPA uses a novel decoder where context information is fed via cross-attention at every decoder layer (not just the first). This prevents the encoder from memorizing position information and forces it to learn semantic features.
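The structural idea — fresh context injection at every layer rather than only at the input — can be sketched as follows (random weights stand in for learned projections; this is my illustration, not the 3D-JEPA code):

```python
import numpy as np

def context_decoder(queries, ctx, num_layers=4, seed=0):
    """Decoder sketch: each layer cross-attends from the mask queries
    into the context tokens, so no layer must carry context forward."""
    dim = queries.shape[-1]
    rng = np.random.default_rng(seed)
    x = queries
    for _ in range(num_layers):
        Wq, Wk, Wv = (rng.standard_normal((dim, dim)) * 0.02 for _ in range(3))
        q, k, v = x @ Wq, ctx @ Wk, ctx @ Wv
        logits = q @ k.T / np.sqrt(dim)
        a = np.exp(logits - logits.max(axis=1, keepdims=True))
        a /= a.sum(axis=1, keepdims=True)
        x = x + a @ v  # context injected at EVERY layer, not just the first
    return x
```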
Progressive resolution training
A key practical innovation from V-JEPA 2, also used in V-JEPA 2.1:
- Primary phase: train at low resolution (16 frames, 256x256) for most iterations
- Cooldown phase: increase to high resolution (64 frames, 384x384) for final iterations
This provides an 8.4x training speedup — the model develops most of its representational capacity cheaply at low resolution, then sharpens at high resolution. V-JEPA 2.1 extends this with a distillation protocol where a frozen low-resolution teacher supervises a high-resolution student.
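The two-phase schedule above amounts to a simple step-dependent switch; a minimal sketch (the 10% cooldown fraction is a hypothetical choice, not a number from the papers):

```python
def resolution_schedule(step, total_steps, cooldown_frac=0.1):
    """Return the input configuration for a training step:
    low-res primary phase, then a high-res cooldown."""
    if step < (1 - cooldown_frac) * total_steps:
        return {"frames": 16, "resolution": 256}  # primary phase
    return {"frames": 64, "resolution": 384}      # cooldown phase
```

The speedup comes from token count: 64 frames at 384x384 produces 9x more tubelets than 16 frames at 256x256, and attention cost grows superlinearly in token count.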
See also
- latent-prediction — what the encoder is trained to support
- masking-strategies — how input tokens are selected for prediction
- collapse-prevention — keeping encoder representations non-trivial