JEPAwiki

Training Recipes and Datasets

JEPA models span a wide range of training scales — from LeWorldModel (15M params, single GPU, hours) to V-JEPA 2 (1B params, 60 GPU-years). This page documents the concrete training details across the family.

Training Scale

Dataset inventory

Internet-scale video

VideoMix22M (VM22M) — used by V-JEPA 2:

Dataset                    Samples  Hours      Type                 Weight
Something-Something v2     168K     168        Ego-video (actions)  0.056
Kinetics 400/600/700       733K     614        Exo-video (actions)  0.188
HowTo100M                  1.1M     134,000    Instructional        0.318
YT-Temporal-1B (curated)   19M      1,600,000  General              0.188
ImageNet                   1M       n/a        Images               0.250
Total                      22M      >1M hours

Data curation is important: cluster-based retrieval on YT-1B using target distributions (Kinetics, SSv2, COIN, EpicKitchen) improved results by +1.4 points.
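
The cluster-based retrieval step can be sketched as follows: assign both the candidate pool and the target datasets to precomputed embedding clusters, then draw pool clips so the selected subset's cluster histogram matches the target histogram. This is a minimal illustration under assumed inputs (precomputed embeddings and centroids); the function name and interface are hypothetical, not V-JEPA 2's actual pipeline.

```python
import numpy as np

def curate_by_cluster_retrieval(pool_emb, target_emb, centroids, budget):
    """Select a subset of the pool whose cluster histogram matches the
    target datasets' cluster histogram (cluster-based retrieval sketch)."""
    def assign(x):
        # Nearest-centroid assignment by squared Euclidean distance.
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)

    pool_c, target_c = assign(pool_emb), assign(target_emb)
    k = len(centroids)
    # Target distribution over clusters, estimated from the seed datasets.
    target_hist = np.bincount(target_c, minlength=k) / len(target_c)
    selected = []
    for c in range(k):
        idx = np.flatnonzero(pool_c == c)
        n = int(round(budget * target_hist[c]))
        # Take up to n pool clips from this cluster (closest-first in practice).
        selected.extend(idx[:n].tolist())
    return selected
```

In practice the per-cluster retrieval would rank candidates by distance to the centroid rather than taking them in index order.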

VisionMix163M — used by V-JEPA 2.1:

Dataset     Samples  Weight  Change from VM22M
SSv2        168K     0.170   +3x weight
Kinetics    733K     0.010   Reduced
HowTo100M   1.1M     0.100   Reduced
YT-1B       19M      0.720   +3.8x weight
LVD-142M    142M             Replaces ImageNet

Key change: V-JEPA 2.1 shifted weight heavily toward YT-1B and away from curated action datasets, and replaced ImageNet with the much larger LVD-142M image dataset.
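
Mixture weights like those in the tables translate directly into a sampling distribution over source datasets. A minimal sketch, using the VisionMix163M video weights from the table above (the sampler itself is illustrative, not the actual data loader; the LVD-142M image stream is handled separately and omitted here):

```python
import random

# Video-mixture weights from the VisionMix163M table above.
VISIONMIX_WEIGHTS = {
    "SSv2": 0.170,
    "Kinetics": 0.010,
    "HowTo100M": 0.100,
    "YT-1B": 0.720,
}

def sample_source(weights, rng=random):
    """Pick which dataset the next training clip is drawn from,
    proportionally to its mixture weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```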

Robot interaction data

Droid dataset — used by V-JEPA 2-AC:

  • 62 hours of unlabeled manipulation video (after filtering clips <4 seconds)
  • Franka Emika Panda 7-DoF arm with two-finger gripper
  • Teleoperated demonstrations, 3-4 second clips
  • No task labels, rewards, or success indicators
  • Only the left extrinsic camera view is used

3D/Point cloud data

  • ShapeNet: used by Point-JEPA and 3D-JEPA for pretraining
  • ModelNet40: classification benchmark (40 object categories)
  • ScanObjectNN: real-world 3D object recognition

Simulation environments

Training schedules

V-JEPA 2 (large-scale)

Phase 1 (warmup):    12K iters, LR: 0 → constant
Phase 1 (constant): 228K iters, constant LR
Phase 2 (cooldown):  12K iters, LR → 0, higher resolution
Total:              252K iterations
  • Progressive resolution: 16 frames 256x256 → 64 frames 384x384
  • Optimizer: AdamW
  • Training compute: ~60 GPU-years at full resolution; progressive training reduces to ~7 GPU-years (8.4x speedup)
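
The three-phase schedule above is a simple piecewise function of the iteration count. A sketch, assuming linear warmup and linear cooldown (the peak LR value here is illustrative, borrowed from the V-JEPA 2.1 numbers below; the source does not state V-JEPA 2's peak LR):

```python
def vjepa2_lr(step, base_lr=5.25e-4, warmup=12_000, constant=228_000, cooldown=12_000):
    """Piecewise LR schedule matching the phases above: linear warmup,
    a long constant phase, then a linear decay to 0."""
    if step < warmup:                     # Phase 1 (warmup): 0 -> base_lr
        return base_lr * step / warmup
    if step < warmup + constant:          # Phase 1 (constant)
        return base_lr
    t = step - warmup - constant          # Phase 2 (cooldown): base_lr -> 0
    return base_lr * max(0.0, 1.0 - t / cooldown)
```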

V-JEPA 2.1 (two-phase)

Primary phase:  135K iters, 16 frames 256x256
                Video batch: 128, Image batch: 2304
                LR warmup 1e-4 → 5.25e-4 over 12K iters
Cooldown phase: 12K iters, 64 frames 384x384 / 512x512
                LR: 6e-4 → 1e-6
  • EMA coefficient: 0.99925
  • Weight decay: 0.04
  • Separate image/video data workers with gradient aggregation
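
The fixed EMA coefficient drives the target-encoder update: at each step the target weights move a tiny fraction (1 - 0.99925 = 0.00075) toward the online encoder. A minimal sketch over flat parameter lists, not the actual implementation:

```python
def ema_update(target, online, m=0.99925):
    """Update target-encoder weights as an exponential moving average of
    the online encoder, using the fixed coefficient quoted above."""
    return [m * t + (1.0 - m) * o for t, o in zip(target, online)]
```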

LeWorldModel (minimal)

  • ViT-Tiny encoder (~5M params) + 6-layer predictor (~10M params)
  • Total: 15M parameters, single GPU, few hours
  • AdamW optimizer, LR = 5e-4, batch size 256, 30 epochs
  • Only 1 hyperparameter to tune (SIGReg weight λ=0.1)
  • Bisection search for λ: O(log n) tuning cost

C-JEPA

  • Adam optimizer, LR = 5e-4, batch size 256, 30 epochs
  • Frozen VideoSAUR encoder (pretrained on DINOv2 features)
  • Only the predictor is trained

Key training insights

What scales well

  • Data: 2M → 22M samples = +1.0 point average (V-JEPA 2)
  • Model: 300M → 1B params = +1.5 points (V-JEPA 2)
  • Training length: 90K → 252K iters = +0.8 points (V-JEPA 2)
  • Resolution: 256 → 384, 16 → 64 frames = +0.7 points (V-JEPA 2)

What reduces cost

  • Progressive resolution training: 8.4x speedup (V-JEPA 2)
  • Distillation from low-res teacher to high-res student (V-JEPA 2.1)
  • Frozen encoder, train only predictor: V-JEPA 2-AC, C-JEPA
  • Minimal architecture: LeWorldModel achieves competitive results with 15M params

Training stability tricks

  • Fixed (not ramped) EMA coefficient: simplified recipe in V-JEPA 2
  • Fixed weight decay: simplified from ramp-up schedule
  • SIGReg: eliminates need for EMA entirely (LeWorldModel)
  • Zero initialization of action conditioning (AdaLN) for stable progressive integration
  • 10% predictor dropout in LeWorldModel
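
The zero-initialization trick for action conditioning works because an AdaLN block whose conditioning projection starts at zero produces scale = shift = 0, so at step 0 it reduces to plain LayerNorm and the action signal is blended in gradually as the projection learns. A minimal NumPy sketch (the function name and interface are illustrative):

```python
import numpy as np

def adaln_zero(x, cond, W=None, b=None):
    """Adaptive LayerNorm with zero-initialized conditioning: with W and b
    at zero, the block is exactly plain LayerNorm, giving a stable start
    for progressive integration of the conditioning signal."""
    d = x.shape[-1]
    if W is None:
        W = np.zeros((cond.shape[-1], 2 * d))  # zero init: no-op at step 0
    if b is None:
        b = np.zeros(2 * d)
    # Standard LayerNorm over the last dimension.
    h = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)
    # Conditioning produces a per-sample scale and shift.
    scale, shift = np.split(cond @ W + b, 2, axis=-1)
    return h * (1.0 + scale) + shift
```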

See also