Training Recipes and Datasets
JEPA models span a wide range of training scales — from LeWorldModel (15M params, single GPU, hours) to V-JEPA 2 (1B params, 60 GPU-years). This page documents the concrete training details across the family.
Dataset inventory
Internet-scale video
VideoMix22M (VM22M) — used by V-JEPA 2:
| Dataset | Samples | Hours | Type | Weight |
|---|---|---|---|---|
| Something-Something v2 | 168K | 168 | Ego-video (actions) | 0.056 |
| Kinetics 400/600/700 | 733K | 614 | Exo-video (actions) | 0.188 |
| HowTo100M | 1.1M | 134,000 | Instructional | 0.318 |
| YT-Temporal-1B (curated) | 19M | 1,600,000 | General | 0.188 |
| ImageNet | 1M | — | Images | 0.250 |
| Total | 22M | >1,000,000 | — | 1.000 |
Data curation matters: cluster-based retrieval on YT-1B, matched against target distributions (Kinetics, SSv2, COIN, EPIC-Kitchens), improved average results by +1.4 points.
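The retrieval step can be sketched as a toy cluster-matching routine. All names here are illustrative, and the 2-D embeddings stand in for learned video embeddings; the point is only the mechanism: bucket the pool by nearest cluster, then draw from each bucket according to the target distribution.

```python
import random

def nearest_cluster(embedding, centroids):
    """Index of the centroid closest to an embedding (squared L2)."""
    dists = [sum((e - c) ** 2 for e, c in zip(embedding, cent))
             for cent in centroids]
    return dists.index(min(dists))

def curate(pool, centroids, target_counts, seed=0):
    """Retrieve clips so the per-cluster histogram matches target_counts.

    pool: list of (clip_id, embedding); target_counts: cluster -> count.
    """
    rng = random.Random(seed)
    buckets = {i: [] for i in range(len(centroids))}
    for clip_id, emb in pool:
        buckets[nearest_cluster(emb, centroids)].append(clip_id)
    curated = []
    for cluster, count in target_counts.items():
        ids = buckets[cluster]
        rng.shuffle(ids)
        curated.extend(ids[:count])  # keep only as many as the target asks for
    return curated
```

In the real pipeline the target distribution comes from the downstream datasets (Kinetics, SSv2, etc.) mapped into the same cluster space.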
VisionMix163M — used by V-JEPA 2.1:
| Dataset | Samples | Weight | Change from VM22M |
|---|---|---|---|
| SSv2 | 168K | 0.170 | +3x weight |
| Kinetics | 733K | 0.010 | Reduced |
| HowTo100M | 1.1M | 0.100 | Reduced |
| YT-1B | 19M | 0.720 | +3.8x weight |
| LVD-142M | 142M | — | Replaces ImageNet |
Key change: V-JEPA 2.1 shifted weight heavily toward YT-1B and away from curated action datasets, and replaced ImageNet with the much larger LVD-142M image dataset.
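Mixture weights like those above are typically applied by choosing the source dataset per example in proportion to its weight. A minimal sketch, assuming each dataset is just a list of samples (names are illustrative):

```python
import random

def sample_mixture(datasets, weights, n, seed=0):
    """Draw n samples, picking the source dataset by mixture weight.

    datasets: name -> list of samples; weights need not sum to 1,
    random.choices normalizes them.
    """
    rng = random.Random(seed)
    names = list(datasets)
    w = [weights[name] for name in names]
    out = []
    for _ in range(n):
        name = rng.choices(names, weights=w, k=1)[0]
        out.append((name, rng.choice(datasets[name])))
    return out
```

With the VisionMix163M weights, YT-1B (0.720) would dominate the stream roughly 4:1 over SSv2 (0.170), independent of the raw dataset sizes.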
Robot interaction data
Droid dataset — used by V-JEPA 2-AC:
- 62 hours of unlabeled manipulation video (after filtering clips <4 seconds)
- Franka Emika Panda 7-DoF arm with two-finger gripper
- Teleoperated demonstrations, 3-4 second clips
- No task labels, rewards, or success indicators
- Only left extrinsic camera views used
3D/Point cloud data
- ShapeNet: used by Point-JEPA and 3D-JEPA for pretraining
- ModelNet40: classification benchmark (40 object categories)
- ScanObjectNN: real-world 3D object recognition
Simulation environments
- CLEVRER: physics reasoning with simple objects (C-JEPA)
- Push-T: 2D manipulation (C-JEPA, LeWorldModel)
- OGBench: 3D control tasks (LeWorldModel)
Training schedules
V-JEPA 2 (large-scale)
- Phase 1 (warmup): 12K iters, LR 0 → peak
- Phase 1 (constant): 228K iters, constant LR
- Phase 2 (cooldown): 12K iters, LR → 0, at higher resolution
- Total: 252K iterations
- Progressive resolution: 16 frames 256x256 → 64 frames 384x384
- Optimizer: AdamW
- Training compute: ~60 GPU-years at full resolution; progressive training reduces to ~7 GPU-years (8.4x speedup)
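The warmup → constant → cooldown schedule can be written as a small piecewise function. The iteration counts match the V-JEPA 2 recipe above; the peak LR default is a placeholder, not a value reported for V-JEPA 2:

```python
def lr_at(step, peak_lr=6.25e-4, warmup=12_000, constant=228_000,
          cooldown=12_000):
    """Piecewise LR: linear warmup to peak, hold, linear decay to 0."""
    if step < warmup:                    # phase 1: linear warmup
        return peak_lr * step / warmup
    if step < warmup + constant:         # phase 1: constant plateau
        return peak_lr
    end = warmup + constant + cooldown   # phase 2: linear cooldown
    remaining = max(end - step, 0)
    return peak_lr * remaining / cooldown
```

In the actual recipe the cooldown phase also switches to higher resolution, which the LR function does not need to know about.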
V-JEPA 2.1 (two-phase)
- Primary phase: 135K iters, 16 frames 256x256
- Batch sizes: 128 (video), 2304 (image)
- LR: warmup 1e-4 → 5.25e-4 over 12K iters
- Cooldown phase: 12K iters, 64 frames 384x384 / 512x512
- LR: 6e-4 → 1e-6
- EMA coefficient: 0.99925
- Weight decay: 0.04
- Separate image/video data workers with gradient aggregation
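The fixed EMA coefficient above drives the target-encoder update: after each optimizer step, the target weights are nudged toward the online weights. A minimal sketch over flat lists of weights:

```python
def ema_update(target, online, coeff=0.99925):
    """One EMA step: target <- coeff * target + (1 - coeff) * online.

    coeff is the fixed EMA coefficient from the V-JEPA 2.1 recipe;
    flat float lists stand in for real parameter tensors.
    """
    return [coeff * t + (1.0 - coeff) * o for t, o in zip(target, online)]
```

With coeff = 0.99925 the target encoder moves about 0.075% of the way toward the online encoder per step, i.e. it averages over roughly the last ~1,300 steps.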
LeWorldModel (minimal)
- ViT-Tiny encoder (~5M params) + 6-layer predictor (~10M params)
- Total: 15M parameters, single GPU, few hours
- AdamW optimizer, LR = 5e-4, batch size 256, 30 epochs
- Only 1 hyperparameter to tune (SIGReg weight λ=0.1)
- Bisection search for λ: O(log n) tuning cost
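The bisection search over λ can be sketched as follows, under the assumption that some scalar validation signal is monotone in λ and changes sign at the desired operating point. The `score` callback and the bracket values are hypothetical, not part of the LeWorldModel recipe:

```python
def bisect_lambda(score, lo=1e-3, hi=10.0, iters=20):
    """Bisection over the SIGReg weight.

    Assumes score(lo) < 0 < score(hi) and monotonicity, so each
    iteration halves the bracket: O(log(1/eps)) evaluations total.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2
```

Each evaluation of `score` is one (short) training run, so the logarithmic iteration count is what makes the tuning cost manageable.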
C-JEPA
- Adam optimizer, LR = 5e-4, batch size 256, 30 epochs
- Frozen VideoSAUR encoder (pretrained on DINOv2 features)
- Only the predictor is trained
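Training only the predictor amounts to handing the optimizer just the predictor's parameters and leaving the frozen encoder out entirely. A framework-agnostic sketch, assuming parameters are exposed as (name, value) pairs with a `predictor.` naming prefix (a hypothetical convention):

```python
def optimizer_params(named_params, trainable_prefixes=("predictor.",)):
    """Select only predictor parameters; everything else (the pretrained
    encoder) is simply never handed to the optimizer, so it stays frozen."""
    return [p for name, p in named_params
            if name.startswith(trainable_prefixes)]
```

In PyTorch-style frameworks one would additionally disable gradient tracking on the frozen encoder to save memory, but excluding its parameters from the optimizer is what guarantees they never change.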
Key training insights
What scales well
- Data: 2M → 22M samples = +1.0 point average (V-JEPA 2)
- Model: 300M → 1B params = +1.5 points (V-JEPA 2)
- Training length: 90K → 252K iters = +0.8 points (V-JEPA 2)
- Resolution: 256 → 384, 16 → 64 frames = +0.7 points (V-JEPA 2)
What reduces cost
- Progressive resolution training: 8.4x speedup (V-JEPA 2)
- Distillation from low-res teacher to high-res student (V-JEPA 2.1)
- Frozen encoder, train only predictor: V-JEPA 2-AC, C-JEPA
- Minimal architecture: LeWorldModel achieves competitive results with 15M params
Training stability tricks
- Fixed (not ramped) EMA coefficient: simplified recipe in V-JEPA 2
- Fixed weight decay: simplified from ramp-up schedule
- SIGReg: eliminates need for EMA entirely (LeWorldModel)
- Zero initialization of action conditioning (AdaLN) for stable progressive integration
- 10% predictor dropout in LeWorldModel
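Zero-initialized AdaLN conditioning reduces to an identity map at the start of training, which is why it lets action inputs be integrated into a pretrained predictor without destabilizing it. A scalar sketch (the `(1 + scale) * x + shift` form is the standard AdaLN modulation):

```python
def adaln_modulate(x, scale, shift):
    """AdaLN conditioning: y = x * (1 + scale) + shift.

    When the layers producing scale and shift are zero-initialized,
    the modulation is a no-op, so the new conditioning path starts
    as an identity and grows in gradually during training.
    """
    return [(1.0 + scale) * v + shift for v in x]
```

At initialization (scale = shift = 0) the predictor therefore behaves exactly as it did before action conditioning was added.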
See also
- collapse-prevention — training stability mechanisms
- vision-transformers — encoder architectures and scales
- world-models-and-planning — what training enables