JEPAwiki

Vision Transformers in JEPA

Every JEPA variant uses a Vision Transformer (ViT) as its encoder backbone. The specific architecture choices — model size, patch size, positional encoding, and how inputs are tokenized — vary across the family and have significant impact on performance.

ViT Scaling

Encoder scales

| Model | Params | Width | Depth | Heads | MLP | Used by |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-Tiny | ~5M | 192 | 12 | 3 | | [LeWorldModel](/wiki/papers/2603.19312) |
| ViT-Small | 22M | 384 | 12 | 12 | 1536 | Predictors (V-JEPA 2) |
| ViT-L | 300M | 1024 | 24 | 16 | 4096 | [V-JEPA 2](/wiki/papers/2506.09985) |
| ViT-H | 600M | 1280 | 32 | 16 | 5120 | [V-JEPA 2](/wiki/papers/2506.09985) |
| ViT-g | 1B | 1408 | 40 | 22 | 6144 | [V-JEPA 2](/wiki/papers/2506.09985), [V-JEPA 2.1](/wiki/papers/2603.14482) |
| ViT-G | 2B | | | | | [V-JEPA 2.1](/wiki/papers/2603.14482) |

The scaling from 300M to 2B parameters is a major part of the V-JEPA story. V-JEPA 2 showed consistent gains: data scaling (2M to 22M samples) added +1.0 point, model scaling (300M to 1B) added +1.5 points, longer training added +0.8 points, higher resolution added +0.7 points.

Tokenization

Input tokenization

Images

Standard 2D patch embedding. A 2D convolution with kernel and stride equal to the patch size (typically 16x16) projects each patch to an embedding vector.
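The patch-embedding convolution is equivalent to slicing the image into non-overlapping blocks and linearly projecting each flattened block. A pure-Python sketch of the slicing step (the learned projection is omitted, and the toy 4x4 input is illustrative):

```python
def patchify(img, patch=2):
    # Split an H x W grid into non-overlapping patch x patch blocks,
    # flattening each block into one vector. In a real ViT, each vector
    # is then linearly projected to the embedding dimension, which is
    # exactly what a conv with kernel == stride == patch computes.
    h, w = len(img), len(img[0])
    tokens = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            tokens.append([img[r + i][c + j]
                           for i in range(patch) for j in range(patch)])
    return tokens
```

With a 16x16 patch, a 256x256 image yields 16 x 16 = 256 tokens.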

Video

3D tubelets: a 3D convolution with kernel 2x16x16 (temporal x height x width) extracts spatiotemporal tokens. This captures 2 consecutive frames per token, giving the encoder temporal information from the start.
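The resulting token count follows directly from the tubelet arithmetic; a minimal sketch (function name and defaults are illustrative):

```python
def tubelet_token_count(frames, height, width, t_patch=2, patch=16):
    # Each 2x16x16 tubelet becomes one token: frames are consumed in
    # pairs, and space is tiled in non-overlapping 16x16 patches.
    assert frames % t_patch == 0 and height % patch == 0 and width % patch == 0
    return (frames // t_patch) * (height // patch) * (width // patch)
```

A 16-frame 256x256 clip produces 2048 tokens, while a 64-frame 384x384 clip produces 18432, which is why high-resolution video training is so much more expensive.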

Used by V-JEPA 2 and V-JEPA 2.1. V-JEPA 2.1 adds learnable modality tokens — extra embeddings that tell the model whether the input is an image or a video, enabling unified image+video training with a single encoder.

Point clouds

FPS + k-NN + mini-PointNet: Farthest Point Sampling selects center points, k-Nearest Neighbors groups local neighborhoods, and a small PointNet encodes each group into a patch embedding.
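The sampling and grouping steps can be sketched in pure Python (the mini-PointNet projection is omitted, and starting FPS from index 0 is an assumption — implementations often pick the seed randomly):

```python
import math

def farthest_point_sampling(points, k):
    # Greedily pick k center points, each maximizing its minimum
    # distance to the points already chosen.
    chosen = [0]  # seed with an arbitrary point (assumption: index 0)
    dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < k:
        idx = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(idx)
        for i, p in enumerate(points):
            dist[i] = min(dist[i], math.dist(p, points[idx]))
    return chosen

def knn_group(points, center_idx, n_neighbors):
    # Gather the local neighborhood around one FPS center; a mini-PointNet
    # would then encode this group into a single patch embedding.
    c = points[center_idx]
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], c))
    return order[:n_neighbors]
```

FPS spreads centers across the cloud so no region is left uncovered, while k-NN keeps each token local.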

Used by Point-JEPA and 3D-JEPA.

Object-centric

Slot attention: aggregates patch-level features (from a frozen DINOv2 backbone) into a fixed number of object slots. Each slot captures one object's representation.

Used by C-JEPA (via VideoSAUR or SAVi encoders, typically 4-7 slots with 128-dimensional embeddings).
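The core mechanism can be sketched in a few lines: unlike standard attention, the softmax is taken over slots (not features), so slots compete for each patch. This is a simplified single iteration — real slot attention adds learned query/key/value projections, a GRU update, and LayerNorm, all omitted here:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def slot_attention_step(slots, feats):
    # attn[n][k]: how strongly feature n is claimed by slot k;
    # normalizing over slots makes the slots compete for features.
    d = len(feats[0])
    attn = [softmax([sum(fi * si for fi, si in zip(f, s)) / math.sqrt(d)
                     for s in slots])
            for f in feats]
    new_slots = []
    for k in range(len(slots)):
        weights = [attn[n][k] for n in range(len(feats))]
        total = sum(weights) or 1.0
        # Each slot becomes the weighted mean of the features it claimed.
        new_slots.append([sum(w * f[j] for w, f in zip(weights, feats)) / total
                          for j in range(d)])
    return new_slots
```

Iterating this update drives each slot toward one cluster of patch features, i.e. one object.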

Positional encoding

3D Rotary Position Embeddings (3D RoPE)

Used by V-JEPA 2 and V-JEPA 2.1. Applied separately to temporal, height, and width axes. More stable than absolute sinusoidal embeddings for large models and enables flexible resolution/frame count at inference.
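A simplified sketch of the per-axis rotation: the head dimension is split into three chunks, and each chunk is rotated by angles derived from one axis position. The even three-way split and the frequency base are assumptions; real implementations apply this to query/key tensors inside each attention head:

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    # Rotate consecutive pairs of dimensions by angles that depend on
    # the position and a per-pair frequency (standard RoPE).
    out = []
    d = len(vec)
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def rope_3d(vec, t, h, w):
    # Split the head dimension into three chunks, one per axis
    # (temporal, height, width), and rotate each chunk independently.
    third = len(vec) // 3
    return (rope_rotate(vec[:third], t)
            + rope_rotate(vec[third:2 * third], h)
            + rope_rotate(vec[2 * third:], w))
```

Because positions enter as rotation angles rather than learned per-index embeddings, the same weights handle longer clips or larger grids at inference time.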

Learnable temporal positional encoding

Used by C-JEPA for temporal positions of object tokens.

Predictor architectures

The predictor is always smaller than the encoder — it should be expressive enough to model relationships but not so powerful that it bypasses the encoder.

| Method | Predictor | Size | Key feature |
| --- | --- | --- | --- |
| [I-JEPA](/wiki/papers/2301.08243) | Shallow ViT | Small | Standard masked prediction |
| [V-JEPA 2](/wiki/papers/2506.09985) | ViT-S | 22M (12 blocks) | Block-causal attention for action conditioning |
| [V-JEPA 2.1](/wiki/papers/2603.14482) | ViT | 24 blocks | Multi-level outputs for deep self-supervision |
| [LeWorldModel](/wiki/papers/2603.19312) | Transformer | 6 layers, 10M | AdaLN for action injection, 10% dropout |
| [C-JEPA](/wiki/papers/2602.11389) | Masked ViT | 6 layers | Bidirectional attention over object tokens |
| [ThinkJEPA](/wiki/papers/2603.22281) | ViT + FiLM | | FiLM conditioning from VLM guidance |
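V-JEPA 2's block-causal attention can be illustrated with a toy mask: tokens within the same temporal block attend to each other bidirectionally, while attention across blocks only flows forward in time (block sizes here are illustrative):

```python
def block_causal_mask(num_blocks, block_size):
    # mask[i][j] is True iff token i may attend to token j:
    # allowed when j's temporal block is at or before i's block.
    n = num_blocks * block_size
    return [[(j // block_size) <= (i // block_size) for j in range(n)]
            for i in range(n)]
```

This is what makes action conditioning work: a predicted future step can condition on all past observations and actions but never on later ones.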

Context-aware decoder (3D-JEPA)

3D-JEPA uses a novel decoder where context information is fed via cross-attention at every decoder layer (not just the first). This prevents the encoder from memorizing position information and forces it to learn semantic features.
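Structurally, the difference from a standard decoder is where the context enters; a skeletal sketch (layer internals are stand-in callables, not 3D-JEPA's actual modules):

```python
def decoder_forward(tokens, context, layers):
    # Context is re-injected via cross-attention at EVERY layer,
    # rather than only being concatenated at the input.
    for self_attn, cross_attn, mlp in layers:
        tokens = self_attn(tokens)
        tokens = cross_attn(tokens, context)  # fresh context each layer
        tokens = mlp(tokens)
    return tokens
```

Because every layer can re-read the context, the encoder's output no longer needs to carry positional bookkeeping through the whole decoder, which pushes it toward semantic features.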

Progressive resolution training

A key practical innovation from V-JEPA 2, also used in V-JEPA 2.1:

  1. Primary phase: train at low resolution (16 frames, 256x256) for most iterations
  2. Cooldown phase: increase to high resolution (64 frames, 384x384) for final iterations

This provides an 8.4x training speedup — the model develops most of its representational capacity cheaply at low resolution, then sharpens at high resolution. V-JEPA 2.1 extends this with a distillation protocol where a frozen low-resolution teacher supervises a high-resolution student.
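The two phases reduce to a simple switch on the training step; a minimal sketch (the cooldown fraction is a hypothetical knob, not a value reported by the papers):

```python
def resolution_schedule(step, total_steps, cooldown_frac=0.1):
    # Primary phase: cheap low-resolution training for most iterations.
    if step < total_steps * (1 - cooldown_frac):
        return {"frames": 16, "res": 256}
    # Cooldown phase: high resolution for the final stretch.
    return {"frames": 64, "res": 384}
```

Since attention cost grows superlinearly in token count, spending only the tail of training at 64 frames and 384x384 is where the bulk of the claimed speedup comes from.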

See also