JEPAwiki

Benchmarks and Results

A consolidated reference of quantitative results across the JEPA family, organized by task category. All numbers are from the respective papers; error bars are reported where available in the original work. Results without error bars should be interpreted as point estimates from single training runs unless noted otherwise.

Key Results

Video understanding

Action recognition (top-1 accuracy)

| Model | SSv2 | K400 | Diving-48 |
|---|---|---|---|
| [V-JEPA 2](/wiki/papers/2506.09985) ViT-g | 75.3% | 86.6% | |
| [V-JEPA 2](/wiki/papers/2506.09985) ViT-g384 | 77.3% | 87.3% | |
| [V-JEPA 2.1](/wiki/papers/2603.14482) ViT-G | 77.7% | 87.7% | 89.2% |
| InternVideo2s2-1B | 69.7% | 89.4% | |

SSv2 (Something-Something v2) is the key motion-understanding benchmark — it requires temporal reasoning, not just appearance. V-JEPA 2.1's ViT-G holds the JEPA SOTA.

Action anticipation (Epic-Kitchens-100, Recall@5)

| Model | Verb | Noun | Action |
|---|---|---|---|
| PlausiVL (prior SOTA) | 55.6 | 54.2 | 27.6 |
| [V-JEPA 2](/wiki/papers/2506.09985) ViT-g384 | 63.6 | 57.1 | 39.7 |
| [V-JEPA 2.1](/wiki/papers/2603.14482) ViT-G | 64.3 | 59.9 | 40.8 |

V-JEPA models surpass task-specific baselines by large margins (+12 points on action R@5 for V-JEPA 2 over PlausiVL).
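
Recall@5, the anticipation metric above, counts a prediction as correct when the ground-truth label appears among the model's top five candidates; Epic-Kitchens reports a class-mean variant of this. A minimal per-sample sketch (names are illustrative, not from any paper's code):

```python
import numpy as np

def recall_at_k(scores: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Fraction of samples whose true label is among the top-k scored classes.

    scores: (n_samples, n_classes) logits or probabilities
    labels: (n_samples,) integer ground-truth labels
    """
    topk = np.argsort(scores, axis=1)[:, -k:]     # top-k class indices per sample
    hits = (topk == labels[:, None]).any(axis=1)  # is the true label in the top-k?
    return float(hits.mean())
```

The class-mean version used by the benchmark first averages hits within each class, then across classes, which upweights rare verbs and nouns.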

Video question answering (with LLM alignment)

| Model | PerceptionTest | TempCompass | TemporalBench | TOMATO |
|---|---|---|---|---|
| [V-JEPA 2](/wiki/papers/2506.09985) ViT-g384 + Llama 3.1 8B | 84.0 | 76.9 | 36.7 | 40.3 |
| PerceptionLM 8B | 82.7 | 72.7 | 28.3 | 33.2 |

SOTA at the 8B parameter scale across multiple VQA benchmarks.

Short-term object anticipation (Ego4D)

| Model | mAP (all) | AP (bbox) |
|---|---|---|
| Prior SOTA | ~5.7 | |
| [V-JEPA 2.1](/wiki/papers/2603.14482) ViT-G | 7.71 | 50.7 |

+35% relative improvement over prior SOTA.
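
The headline figure is simple arithmetic on the mAP column (the prior-SOTA value in the table is approximate, so the result is too):

```python
# Relative improvement implied by the table above.
prior_sota = 5.7   # approximate prior-SOTA mAP
vjepa21 = 7.71     # V-JEPA 2.1 ViT-G mAP
rel_gain = (vjepa21 - prior_sota) / prior_sota
print(f"{rel_gain:.0%}")  # prints "35%"
```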

Image understanding

Classification (ImageNet, top-1 accuracy)

| Model | Top-1 accuracy |
|---|---|
| [V-JEPA 2](/wiki/papers/2506.09985) ViT-g384 | 85.1% |
| [V-JEPA 2.1](/wiki/papers/2603.14482) ViT-G | 85.5% |
| DINOv2 ViT-g | 86.1% |

JEPA approaches but doesn't quite match DINOv2 on ImageNet — expected since DINOv2 is trained specifically for image classification.

Dense understanding (linear probe)

| Task | V-JEPA 2 ViT-g | V-JEPA 2.1 ViT-G | Improvement |
|---|---|---|---|
| ADE20K segmentation (mIoU) | 24.5 | 47.9 | +23.4 |
| Cityscapes segmentation (mIoU) | 45.9 | 73.5 | +27.6 |
| VOC12 segmentation (mIoU) | 64.3 | 85.0 | +20.7 |
| NYUv2 depth (RMSE ↓) | | 0.307 | SOTA |
| KITTI depth (RMSE ↓) | | 2.461 | |

The Dense Predictive Loss introduced in V-JEPA 2.1 drives the largest gains on dense tasks: +20 to +28 mIoU over V-JEPA 2 across all three segmentation benchmarks.
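
For context, mIoU averages per-class intersection-over-union between predicted and ground-truth label maps. A minimal sketch (illustrative only; the papers' evaluation code typically accumulates a confusion matrix over the whole dataset first):

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, n_classes: int) -> float:
    """Mean intersection-over-union over classes present in pred or gt.

    pred, gt: integer label maps of the same shape
    """
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:              # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```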

Video object segmentation (J&F-Mean)

| Model | DAVIS-17 | YouTube-VOS |
|---|---|---|
| [V-JEPA 2.1](/wiki/papers/2603.14482) ViT-G | 69.0 | 72.7 |

3D understanding

Point cloud classification (ModelNet40)

| Model | Linear SVM | End-to-end |
|---|---|---|
| [Point-JEPA](/wiki/papers/2404.16432) | 93.7±0.2% | 93.8±0.2% |
| [3D-JEPA](/wiki/papers/2409.15803) (300 epochs) | | 94.49% |

ScanObjectNN (real-world 3D)

| Model | PB_T50_RS |
|---|---|
| [3D-JEPA](/wiki/papers/2409.15803) (150 epochs) | 88.65% |
| [3D-JEPA](/wiki/papers/2409.15803) (300 epochs) | 89.52% |

3D-JEPA achieves strong results with half the pretraining epochs of competing methods.

Few-shot learning

| Model | 5-way 10-shot | 10-way 10-shot |
|---|---|---|
| [Point-JEPA](/wiki/papers/2404.16432) | SOTA | SOTA |
| [3D-JEPA](/wiki/papers/2409.15803) (150 epochs) | 97.6±2.0% | 98.8±0.4% |

Reasoning

Visual QA (CLEVRER)

| Model | Overall | Counterfactual |
|---|---|---|
| [C-JEPA](/wiki/papers/2602.11389) (VideoSAUR, 4-mask) | 89.40% | 68.81% |
| [C-JEPA](/wiki/papers/2602.11389), no object masking | 82.79% | 47.68% |
| SlotFormer | 79.44% | 47.29% |

Object-level masking yields a +21-point absolute gain on counterfactual reasoning (68.81% vs 47.68%).

Robot control

Manipulation success rate

| Task | [V-JEPA 2-AC](/wiki/papers/2506.09985) | Octo (BC) |
|---|---|---|
| Reaching | 100% | |
| Grasp (cup) | 65% | ~15-70% |
| Pick-and-place (cup) | 80% | |
| Pick-and-place (box) | 65% | |

Planning efficiency

| Model | Params | Time per action | Push-T success |
|---|---|---|---|
| [V-JEPA 2-AC](/wiki/papers/2506.09985) | 300M | 16s | |
| [C-JEPA](/wiki/papers/2602.11389) | ~10M | ~13s | 88.67% |
| [LeWorldModel](/wiki/papers/2603.19312) | 15M | <1s | Competitive |
| Cosmos (video generation) | Large | 240s | |

Robotic navigation (Tartan Drive)

| Model | ATE ↓ | Planning time |
|---|---|---|
| NWM | ~5.7 | 103.2s |
| [V-JEPA 2.1](/wiki/papers/2603.14482) | 5.687 | 10.6s |

V-JEPA 2.1 matches NWM's trajectory error while planning roughly 10x faster.

Trajectory prediction

Hand manipulation (EgoDex)

| Model | ADE ↓ | FDE ↓ | Accuracy ↑ |
|---|---|---|---|
| V-JEPA predictor baseline | Higher | Higher | Lower |
| VLM-only baseline | Higher | Higher | Lower |
| [ThinkJEPA](/wiki/papers/2603.22281) | ~0.07 | ~0.064 | 73.878 |

ThinkJEPA outperforms both pure JEPA and pure VLM approaches, especially on long-horizon rollouts.
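
ADE (average displacement error) averages the Euclidean distance between predicted and ground-truth positions over all timesteps; FDE (final displacement error) measures it at the last timestep only. A minimal sketch (illustrative, not ThinkJEPA's evaluation code):

```python
import numpy as np

def ade_fde(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """Average and final displacement error for (timesteps, dims) trajectories."""
    dists = np.linalg.norm(pred - gt, axis=-1)   # per-step Euclidean error
    return float(dists.mean()), float(dists[-1])
```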

See also