Vision Transformers in JEPA
Every JEPA variant uses a Vision Transformer (ViT) as its encoder backbone. The specific architecture choices — model size, patch size, positional encoding, and how inputs are tokenized — vary across the family and have a significant impact on performance.
Encoder scales
| Model | Params | Width | Depth | Heads | MLP | Used by |
|---|---|---|---|---|---|---|
| ViT-Tiny | ~5M | 192 | 12 | 3 | — | [LeWorldModel](/wiki/papers/2603.19312) |
| ViT-Small | 22M | 384 | 12 | 6 | 1536 | Predictors (V-JEPA 2) |
| ViT-L | 300M | 1024 | 24 | 16 | 4096 | [V-JEPA 2](/wiki/papers/2506.09985) |
| ViT-H | 600M | 1280 | 32 | 16 | 5120 | [V-JEPA 2](/wiki/papers/2506.09985) |
| ViT-g | 1B | 1408 | 40 | 16 | 6144 | [V-JEPA 2](/wiki/papers/2506.09985), [V-JEPA 2.1](/wiki/papers/2603.14482) |
| ViT-G | 2B | — | — | — | — | [V-JEPA 2.1](/wiki/papers/2603.14482) |
The scaling from 300M to 2B parameters is a major part of the V-JEPA story. V-JEPA 2 showed consistent gains: data scaling (2M to 22M samples) added +1.0 point, model scaling (300M to 1B) added +1.5 points, longer training added +0.8 points, higher resolution added +0.7 points.
Input tokenization
Images
Standard 2D patch embedding. A 2D convolution with kernel and stride equal to the patch size (typically 16x16) projects each patch to an embedding vector.
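Because kernel size equals stride, this convolution is equivalent to cutting the image into non-overlapping patches and applying one shared linear projection. A minimal numpy sketch of that equivalence (the random projection matrix here is a stand-in for learned conv weights; the 384-dim output is an illustrative choice):

```python
import numpy as np

def patch_embed_2d(img, patch=16, dim=384, seed=0):
    """2D patch embedding as reshape + matmul, equivalent to a conv
    with kernel == stride == patch size."""
    H, W, C = img.shape
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((patch * patch * C, dim)) * 0.02  # stand-in for conv weights
    # split into (H/p) x (W/p) non-overlapping patches, flatten each
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return x @ proj  # (num_patches, dim)
```

A 224x224 RGB image with 16x16 patches yields 14x14 = 196 tokens.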
Video
3D tubelets: a 3D convolution with kernel 2x16x16 (temporal x height x width) extracts spatiotemporal tokens. This captures 2 consecutive frames per token, giving the encoder temporal information from the start.
Used by V-JEPA 2 and V-JEPA 2.1. V-JEPA 2.1 adds learnable modality tokens — extra embeddings that tell the model whether the input is an image or a video, enabling unified image+video training with the same encoder.
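Like the 2D case, a stride-matched 3D convolution reduces to reshaping into 2x16x16 tubelets and projecting each one. A numpy sketch under the same stand-in-weights assumption (the 1024-dim output is illustrative):

```python
import numpy as np

def tubelet_embed(video, t=2, patch=16, dim=1024, seed=0):
    """3D tubelet tokenization: kernel 2x16x16 with matching stride,
    written as reshape + matmul."""
    T, H, W, C = video.shape
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((t * patch * patch * C, dim)) * 0.02  # stand-in weights
    # split into (T/t) x (H/p) x (W/p) tubelets, flatten each
    x = video.reshape(T // t, t, H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, t * patch * patch * C)
    return x @ proj  # (num_tubelets, dim)
```

A 16-frame 256x256 clip produces 8 x 16 x 16 = 2048 tokens, each summarizing two consecutive frames.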
Point clouds
FPS + k-NN + mini-PointNet: Farthest Point Sampling selects center points, k-Nearest Neighbors groups local neighborhoods, and a small PointNet encodes each group into a patch embedding.
Used by Point-JEPA and 3D-JEPA.
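The FPS and k-NN stages can be sketched in a few lines of numpy (the mini-PointNet that would encode each group is omitted; function names are mine, not from the papers):

```python
import numpy as np

def farthest_point_sample(pts, k):
    """Greedy FPS: start from point 0, repeatedly pick the point
    farthest from all centers chosen so far."""
    idx = np.zeros(k, dtype=int)
    dist = np.full(len(pts), np.inf)
    for i in range(1, k):
        dist = np.minimum(dist, ((pts - pts[idx[i - 1]]) ** 2).sum(axis=1))
        idx[i] = int(np.argmax(dist))
    return idx

def knn_group(pts, center_idx, kn):
    """For each center, gather its kn nearest neighbors; each group
    would then be fed to a mini-PointNet to produce one token."""
    centers = pts[center_idx]
    d = ((centers[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(d, axis=1)[:, :kn]  # (k, kn) neighbor indices
```

Each of the k groups plays the role that a 16x16 patch plays for images: a local region mapped to a single embedding.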
Object-centric
Slot attention: aggregates patch-level features (from a frozen DINOv2 backbone) into a fixed number of object slots. Each slot captures one object's representation.
Used by C-JEPA (via VideoSAUR or SAVi encoders, typically 4-7 slots with 128-dimensional embeddings).
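A stripped-down sketch of the slot-attention update: the key point is that the softmax is taken over the slot axis, so slots compete for patches, and each slot is updated toward the mean of the features it wins. (Assumption-heavy sketch: the real module adds learned query/key/value projections, a GRU update, and an MLP.)

```python
import numpy as np

def slot_attention(feats, num_slots=6, iters=3, seed=0):
    """Simplified slot attention over patch features (n, dim)."""
    n, dim = feats.shape
    rng = np.random.default_rng(seed)
    slots = rng.standard_normal((num_slots, dim))  # random init, stand-in for learned init
    for _ in range(iters):
        logits = feats @ slots.T / np.sqrt(dim)            # (n, num_slots)
        attn = np.exp(logits - logits.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)            # softmax over SLOTS: competition
        w = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = w.T @ feats                                # weighted mean of won features
    return slots  # (num_slots, dim)
```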
Positional encoding
3D Rotary Position Embeddings (3D RoPE)
Used by V-JEPA 2 and V-JEPA 2.1. Applied separately to temporal, height, and width axes. More stable than absolute sinusoidal embeddings for large models and enables flexible resolution/frame count at inference.
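A common way to implement the per-axis application (assumed here, not taken from the V-JEPA code) is to split each head's feature dimension into three equal groups and rotate each group by its own axis position:

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by angles pos * freq_i."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, t, h, w):
    """3D RoPE sketch: one third of the head dim per axis
    (temporal, height, width)."""
    d = x.shape[-1] // 3
    return np.concatenate([rope_1d(x[:d], t),
                           rope_1d(x[d:2 * d], h),
                           rope_1d(x[2 * d:], w)])
```

Because each step is a pure rotation, vector norms are preserved, and attention scores between rotated queries and keys depend only on relative positions — which is what makes extrapolating to more frames or higher resolution at inference well-behaved.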
Learnable temporal positional encoding
Used by C-JEPA for temporal positions of object tokens.
Predictor architectures
The predictor is always smaller than the encoder — it should be expressive enough to model relationships but not so powerful that it bypasses the encoder.
| Method | Predictor | Size | Key feature |
|---|---|---|---|
| [I-JEPA](/wiki/papers/2301.08243) | Shallow ViT | Small | Standard masked prediction |
| [V-JEPA 2](/wiki/papers/2506.09985) | ViT-S | 22M (12 blocks) | Block-causal attention for action conditioning |
| [V-JEPA 2.1](/wiki/papers/2603.14482) | ViT | 24 blocks | Multi-level outputs for deep self-supervision |
| [LeWorldModel](/wiki/papers/2603.19312) | Transformer | 6 layers, 10M | AdaLN for action injection, 10% dropout |
| [C-JEPA](/wiki/papers/2602.11389) | Masked ViT | 6 layers | Bidirectional attention over object tokens |
| [ThinkJEPA](/wiki/papers/2603.22281) | ViT + FiLM | — | FiLM conditioning from VLM guidance |
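The block-causal attention used by the V-JEPA 2 predictor can be sketched as a boolean mask: tokens attend freely within their own block (e.g. one timestep's tokens plus its action token) and to all earlier blocks, but never to future blocks. The block layout below is an illustrative assumption:

```python
import numpy as np

def block_causal_mask(num_blocks, block_size):
    """(n, n) boolean mask, True = query may attend to key.
    Full attention within a block, causal across blocks."""
    n = num_blocks * block_size
    blk = np.arange(n) // block_size  # block index of each token
    return blk[None, :] <= blk[:, None]
```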
Context-aware decoder (3D-JEPA)
3D-JEPA uses a novel decoder where context information is fed via cross-attention at every decoder layer (not just the first). This prevents the encoder from memorizing position information and forces it to learn semantic features.
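The structural idea — fresh context injection at every layer rather than only at the input — can be sketched as follows (random weights stand in for learned projections; this is my illustration, not the 3D-JEPA code):

```python
import numpy as np

def context_decoder(queries, ctx, num_layers=4, seed=0):
    """Decoder sketch: each layer cross-attends from the mask queries
    into the context tokens, so no layer must carry context forward."""
    dim = queries.shape[-1]
    rng = np.random.default_rng(seed)
    x = queries
    for _ in range(num_layers):
        Wq, Wk, Wv = (rng.standard_normal((dim, dim)) * 0.02 for _ in range(3))
        q, k, v = x @ Wq, ctx @ Wk, ctx @ Wv
        logits = q @ k.T / np.sqrt(dim)
        a = np.exp(logits - logits.max(axis=1, keepdims=True))
        a /= a.sum(axis=1, keepdims=True)
        x = x + a @ v  # context injected at EVERY layer, not just the first
    return x
```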
Progressive resolution training
A key practical innovation from V-JEPA 2, also used in V-JEPA 2.1:
- Primary phase: train at low resolution (16 frames, 256x256) for most iterations
- Cooldown phase: increase to high resolution (64 frames, 384x384) for final iterations
This provides an 8.4x training speedup — the model develops most of its representational capacity cheaply at low resolution, then sharpens at high resolution. V-JEPA 2.1 extends this with a distillation protocol where a frozen low-resolution teacher supervises a high-resolution student.
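The two-phase schedule above amounts to a simple step-dependent switch; a minimal sketch (the 10% cooldown fraction is a hypothetical choice, not a number from the papers):

```python
def resolution_schedule(step, total_steps, cooldown_frac=0.1):
    """Return the input configuration for a training step:
    low-res primary phase, then a high-res cooldown."""
    if step < (1 - cooldown_frac) * total_steps:
        return {"frames": 16, "resolution": 256}  # primary phase
    return {"frames": 64, "resolution": 384}      # cooldown phase
```

The speedup comes from token count: 64 frames at 384x384 produces 9x more tubelets than 16 frames at 256x256, and attention cost grows superlinearly in token count.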
See also
- latent-prediction — what the encoder is trained to support
- masking-strategies — how input tokens are selected for prediction
- collapse-prevention — keeping encoder representations non-trivial