Training Recipes and Datasets
JEPA models span a wide range of training scales — from LeWorldModel (15M params, single GPU, hours) to V-JEPA 2 (1B params, 60 GPU-years). This page documents the concrete training details across the family.
Dataset inventory
Internet-scale video
VideoMix22M (VM22M) — used by V-JEPA 2:
| Dataset | Samples | Hours | Type | Weight |
|---|---|---|---|---|
| Something-Something v2 | 168K | 168 | Ego-video (actions) | 0.056 |
| Kinetics 400/600/700 | 733K | 614 | Exo-video (actions) | 0.188 |
| HowTo100M | 1.1M | 134,000 | Instructional | 0.318 |
| YT-Temporal-1B (curated) | 19M | 1,600,000 | General | 0.188 |
| ImageNet | 1M | — | Images | 0.250 |
| Total | 22M | >1,000,000 | — | 1.000 |
Data curation matters: cluster-based retrieval on YT-1B, matched against target distributions (Kinetics, SSv2, COIN, EPIC-Kitchens), improved average results by +1.4 points.
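The retrieval step can be sketched as a toy cluster-matching routine. All names here are illustrative, and the 2-D embeddings stand in for learned video embeddings; the point is only the mechanism: bucket the pool by nearest cluster, then draw from each bucket according to the target distribution.

```python
import random

def nearest_cluster(embedding, centroids):
    """Index of the centroid closest to an embedding (squared L2)."""
    dists = [sum((e - c) ** 2 for e, c in zip(embedding, cent))
             for cent in centroids]
    return dists.index(min(dists))

def curate(pool, centroids, target_counts, seed=0):
    """Retrieve clips so the per-cluster histogram matches target_counts.

    pool: list of (clip_id, embedding); target_counts: cluster -> count.
    """
    rng = random.Random(seed)
    buckets = {i: [] for i in range(len(centroids))}
    for clip_id, emb in pool:
        buckets[nearest_cluster(emb, centroids)].append(clip_id)
    curated = []
    for cluster, count in target_counts.items():
        ids = buckets[cluster]
        rng.shuffle(ids)
        curated.extend(ids[:count])  # keep only as many as the target asks for
    return curated
```

In the real pipeline the target distribution comes from the downstream datasets (Kinetics, SSv2, etc.) mapped into the same cluster space.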
VisionMix163M — used by V-JEPA 2.1:
| Dataset | Samples | Weight | Change from VM22M |
|---|---|---|---|
| SSv2 | 168K | 0.170 | +3x weight |
| Kinetics | 733K | 0.010 | Reduced |
| HowTo100M | 1.1M | 0.100 | Reduced |
| YT-1B | 19M | 0.720 | +3.8x weight |
| LVD-142M | 142M | — | Replaces ImageNet |
Key change: V-JEPA 2.1 shifted weight heavily toward YT-1B and away from curated action datasets, and replaced ImageNet with the much larger LVD-142M image dataset.
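Mixture weights like those above are typically applied by choosing the source dataset per example in proportion to its weight. A minimal sketch, assuming each dataset is just a list of samples (names are illustrative):

```python
import random

def sample_mixture(datasets, weights, n, seed=0):
    """Draw n samples, picking the source dataset by mixture weight.

    datasets: name -> list of samples; weights need not sum to 1,
    random.choices normalizes them.
    """
    rng = random.Random(seed)
    names = list(datasets)
    w = [weights[name] for name in names]
    out = []
    for _ in range(n):
        name = rng.choices(names, weights=w, k=1)[0]
        out.append((name, rng.choice(datasets[name])))
    return out
```

With the VisionMix163M weights, YT-1B (0.720) would dominate the stream roughly 4:1 over SSv2 (0.170), independent of the raw dataset sizes.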
Robot interaction data
Droid dataset — used by V-JEPA 2-AC:
- 62 hours of unlabeled manipulation video (after filtering clips <4 seconds)
- Franka Emika Panda 7-DoF arm with two-finger gripper
- Teleoperated demonstrations, 3-4 second clips
- No task labels, rewards, or success indicators
- Only left extrinsic camera views used
3D/Point cloud data
- ShapeNet: used by Point-JEPA and 3D-JEPA for pretraining
- ModelNet40: classification benchmark (40 object categories)
- ScanObjectNN: real-world 3D object recognition
Simulation environments
- CLEVRER: physics reasoning with simple objects (C-JEPA)
- Push-T: 2D manipulation (C-JEPA, LeWorldModel)
- OGBench: 3D control tasks (LeWorldModel)
Training schedules
V-JEPA 2 (large-scale)
- Phase 1 (warmup): 12K iters, LR 0 → peak
- Phase 1 (constant): 228K iters, constant LR
- Phase 2 (cooldown): 12K iters, LR → 0, at higher resolution
- Total: 252K iterations
- Progressive resolution: 16 frames 256x256 → 64 frames 384x384
- Optimizer: AdamW
- Training compute: ~60 GPU-years at full resolution; progressive training reduces to ~7 GPU-years (8.4x speedup)
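The warmup → constant → cooldown schedule can be written as a small piecewise function. The iteration counts match the V-JEPA 2 recipe above; the peak LR default is a placeholder, not a value reported for V-JEPA 2:

```python
def lr_at(step, peak_lr=6.25e-4, warmup=12_000, constant=228_000,
          cooldown=12_000):
    """Piecewise LR: linear warmup to peak, hold, linear decay to 0."""
    if step < warmup:                    # phase 1: linear warmup
        return peak_lr * step / warmup
    if step < warmup + constant:         # phase 1: constant plateau
        return peak_lr
    end = warmup + constant + cooldown   # phase 2: linear cooldown
    remaining = max(end - step, 0)
    return peak_lr * remaining / cooldown
```

In the actual recipe the cooldown phase also switches to higher resolution, which the LR function does not need to know about.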
V-JEPA 2.1 (two-phase)
- Primary phase: 135K iters, 16 frames 256x256
- Batch sizes: 128 (video), 2304 (image)
- LR: warmup 1e-4 → 5.25e-4 over 12K iters
- Cooldown phase: 12K iters, 64 frames 384x384 / 512x512
- LR: 6e-4 → 1e-6
- EMA coefficient: 0.99925
- Weight decay: 0.04
- Separate image/video data workers with gradient aggregation
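The fixed EMA coefficient above drives the target-encoder update: after each optimizer step, the target weights are nudged toward the online weights. A minimal sketch over flat lists of weights:

```python
def ema_update(target, online, coeff=0.99925):
    """One EMA step: target <- coeff * target + (1 - coeff) * online.

    coeff is the fixed EMA coefficient from the V-JEPA 2.1 recipe;
    flat float lists stand in for real parameter tensors.
    """
    return [coeff * t + (1.0 - coeff) * o for t, o in zip(target, online)]
```

With coeff = 0.99925 the target encoder moves about 0.075% of the way toward the online encoder per step, i.e. it averages over roughly the last ~1,300 steps.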
LeWorldModel (minimal)
- ViT-Tiny encoder (~5M params) + 6-layer predictor (~10M params)
- Total: 15M parameters, single GPU, few hours
- AdamW optimizer, LR = 5e-4, batch size 256, 30 epochs
- Only 1 hyperparameter to tune (SIGReg weight λ=0.1)
- Bisection search for λ: O(log n) tuning cost
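The bisection search over λ can be sketched as follows, under the assumption that some scalar validation signal is monotone in λ and changes sign at the desired operating point. The `score` callback and the bracket values are hypothetical, not part of the LeWorldModel recipe:

```python
def bisect_lambda(score, lo=1e-3, hi=10.0, iters=20):
    """Bisection over the SIGReg weight.

    Assumes score(lo) < 0 < score(hi) and monotonicity, so each
    iteration halves the bracket: O(log(1/eps)) evaluations total.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2
```

Each evaluation of `score` is one (short) training run, so the logarithmic iteration count is what makes the tuning cost manageable.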
C-JEPA
- Adam optimizer, LR = 5e-4, batch size 256, 30 epochs
- Frozen VideoSAUR encoder (pretrained on DINOv2 features)
- Only the predictor is trained
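Training only the predictor amounts to handing the optimizer just the predictor's parameters and leaving the frozen encoder out entirely. A framework-agnostic sketch, assuming parameters are exposed as (name, value) pairs with a `predictor.` naming prefix (a hypothetical convention):

```python
def optimizer_params(named_params, trainable_prefixes=("predictor.",)):
    """Select only predictor parameters; everything else (the pretrained
    encoder) is simply never handed to the optimizer, so it stays frozen."""
    return [p for name, p in named_params
            if name.startswith(trainable_prefixes)]
```

In PyTorch-style frameworks one would additionally disable gradient tracking on the frozen encoder to save memory, but excluding its parameters from the optimizer is what guarantees they never change.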
Key training insights
What scales well
- Data: 2M → 22M samples = +1.0 point average (V-JEPA 2)
- Model: 300M → 1B params = +1.5 points (V-JEPA 2)
- Training length: 90K → 252K iters = +0.8 points (V-JEPA 2)
- Resolution: 256 → 384, 16 → 64 frames = +0.7 points (V-JEPA 2)
What reduces cost
- Progressive resolution training: 8.4x speedup (V-JEPA 2)
- Distillation from low-res teacher to high-res student (V-JEPA 2.1)
- Frozen encoder, train only predictor: V-JEPA 2-AC, C-JEPA
- Minimal architecture: LeWorldModel achieves competitive results with 15M params
Training stability tricks
- Fixed (not ramped) EMA coefficient: simplified recipe in V-JEPA 2
- Fixed weight decay: simplified from ramp-up schedule
- SIGReg: eliminates need for EMA entirely (LeWorldModel)
- Zero initialization of action conditioning (AdaLN) for stable progressive integration
- 10% predictor dropout in LeWorldModel
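Zero-initialized AdaLN conditioning reduces to an identity map at the start of training, which is why it lets action inputs be integrated into a pretrained predictor without destabilizing it. A scalar sketch (the `(1 + scale) * x + shift` form is the standard AdaLN modulation):

```python
def adaln_modulate(x, scale, shift):
    """AdaLN conditioning: y = x * (1 + scale) + shift.

    When the layers producing scale and shift are zero-initialized,
    the modulation is a no-op, so the new conditioning path starts
    as an identity and grows in gradually during training.
    """
    return [(1.0 + scale) * v + shift for v in x]
```

At initialization (scale = shift = 0) the predictor therefore behaves exactly as it did before action conditioning was added.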
See also
- collapse-prevention — training stability mechanisms
- vision-transformers — encoder architectures and scales
- world-models-and-planning — what training enables