Benchmarks and Results
A consolidated reference of quantitative results across the JEPA family, organized by task category. All numbers are from the respective papers; error bars are reported where available in the original work. Results without error bars should be interpreted as point estimates from single training runs unless noted otherwise.

Video understanding
Action recognition (top-1 accuracy)
| Model | SSv2 | K400 | Diving-48 |
|---|---|---|---|
| [V-JEPA 2](/wiki/papers/2506.09985) ViT-g | 75.3% | 86.6% | — |
| [V-JEPA 2](/wiki/papers/2506.09985) ViT-g384 | 77.3% | 87.3% | — |
| [V-JEPA 2.1](/wiki/papers/2603.14482) ViT-G | 77.7% | 87.7% | 89.2% |
| InternVideo2s2-1B | 69.7% | 89.4% | — |
SSv2 (Something-Something v2) is the key motion-understanding benchmark: it requires temporal reasoning, not just appearance. V-JEPA 2.1's ViT-G holds the state of the art within the JEPA family.
Action anticipation (Epic-Kitchens-100, Recall@5)
| Model | Verb | Noun | Action |
|---|---|---|---|
| PlausiVL (prior SOTA) | 55.6 | 54.2 | 27.6 |
| [V-JEPA 2](/wiki/papers/2506.09985) ViT-g384 | 63.6 | 57.1 | 39.7 |
| [V-JEPA 2.1](/wiki/papers/2603.14482) ViT-G | 64.3 | 59.9 | 40.8 |
V-JEPA models surpass task-specific baselines by large margins: V-JEPA 2 beats PlausiVL by +12.1 points on action Recall@5 (39.7 vs. 27.6).
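The headline margin can be read directly off the table; a quick check with the table's values (an illustrative calculation, not an official evaluation script):

```python
# Epic-Kitchens-100 action anticipation, Recall@5 (values from the table above).
plausivl_action = 27.6    # PlausiVL, prior SOTA
vjepa2_action = 39.7      # V-JEPA 2 ViT-g384

margin = vjepa2_action - plausivl_action
print(f"absolute margin: {margin:.1f} points")  # absolute margin: 12.1 points
```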
Video question answering (with LLM alignment)
| Model | PerceptionTest | TempCompass | TemporalBench | TOMATO |
|---|---|---|---|---|
| [V-JEPA 2](/wiki/papers/2506.09985) ViT-g384 + Llama 3.1 8B | 84.0 | 76.9 | 36.7 | 40.3 |
| PerceptionLM 8B | 82.7 | 72.7 | 28.3 | 33.2 |
V-JEPA 2 aligned with Llama 3.1 8B is state of the art at the 8B parameter scale across multiple VQA benchmarks.
Short-term object anticipation (Ego4D)
| Model | mAP All | AP (bbox) |
|---|---|---|
| Prior SOTA | ~5.7 | — |
| [V-JEPA 2.1](/wiki/papers/2603.14482) ViT-G | 7.71 | 50.7 |
+35% relative improvement over prior SOTA.
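The relative-improvement figure follows from the mAP column above; a quick sanity check (note the prior-SOTA value is only approximate, "~5.7"):

```python
# Ego4D short-term object anticipation, mAP All (values from the table above).
prior_sota_map = 5.7      # approximate prior SOTA
vjepa21_map = 7.71        # V-JEPA 2.1 ViT-G

rel_improvement = (vjepa21_map - prior_sota_map) / prior_sota_map
print(f"relative improvement: {rel_improvement:.1%}")  # relative improvement: 35.3%
```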
Image understanding
Classification (ImageNet, top-1 accuracy)
| Model | Accuracy |
|---|---|
| [V-JEPA 2](/wiki/papers/2506.09985) ViT-g384 | 85.1% |
| [V-JEPA 2.1](/wiki/papers/2603.14482) ViT-G | 85.5% |
| DINOv2 ViT-g | 86.1% |
JEPA approaches but does not quite match DINOv2 on ImageNet. This is expected, since DINOv2 is trained specifically for image classification.
Dense understanding (linear probe)
| Task | V-JEPA 2 ViT-g | V-JEPA 2.1 ViT-G | Improvement |
|---|---|---|---|
| ADE20K segmentation (mIoU) | 24.5 | 47.9 | +23.4 |
| Cityscapes segmentation (mIoU) | 45.9 | 73.5 | +27.6 |
| VOC12 segmentation (mIoU) | 64.3 | 85.0 | +20.7 |
| NYUv2 depth (RMSE, lower=better) | — | 0.307 | SOTA |
| KITTI depth (RMSE) | — | 2.461 | — |
The Dense Predictive Loss in V-JEPA 2.1 is transformative for dense tasks, adding +20 to +28 mIoU on segmentation over V-JEPA 2.
Video object segmentation (J&F-Mean)
| Model | DAVIS-17 | YouTube-VOS |
|---|---|---|
| [V-JEPA 2.1](/wiki/papers/2603.14482) ViT-G | 69.0 | 72.7 |
3D understanding
Point cloud classification (ModelNet40)
| Model | Linear SVM | End-to-end |
|---|---|---|
| [Point-JEPA](/wiki/papers/2404.16432) | 93.7±0.2% | 93.8±0.2% |
| [3D-JEPA](/wiki/papers/2409.15803) (300 epochs) | — | 94.49% |
ScanObjectNN (real-world 3D)
| Model | PB_T50_RS |
|---|---|
| [3D-JEPA](/wiki/papers/2409.15803) (150 epochs) | 88.65% |
| [3D-JEPA](/wiki/papers/2409.15803) (300 epochs) | 89.52% |
3D-JEPA achieves strong results with half the pretraining epochs of competing methods.
Few-shot learning
| Model | 5-way 10-shot | 10-way 10-shot |
|---|---|---|
| [Point-JEPA](/wiki/papers/2404.16432) | SOTA | SOTA |
| [3D-JEPA](/wiki/papers/2409.15803) (150 epochs) | 97.6±2.0% | 98.8±0.4% |
Reasoning
Visual QA (CLEVRER)
| Model | Overall | Counterfactual |
|---|---|---|
| [C-JEPA](/wiki/papers/2602.11389) (VideoSAUR, 4-mask) | 89.40% | 68.81% |
| Same, no object masking | 82.79% | 47.68% |
| SlotFormer | 79.44% | 47.29% |
Object-level masking adds +21 points absolute on counterfactual reasoning (68.81% vs. 47.68%).
Robot control
Manipulation success rate
| Task | [V-JEPA 2-AC](/wiki/papers/2506.09985) | Octo (BC) |
|---|---|---|
| Reaching | 100% | — |
| Grasp (cup) | 65% | ~15-70% |
| Pick-and-place (cup) | 80% | — |
| Pick-and-place (box) | 65% | — |
Planning efficiency
| Model | Params | Time per action | Push-T success |
|---|---|---|---|
| [V-JEPA 2-AC](/wiki/papers/2506.09985) | 300M | 16s | — |
| [C-JEPA](/wiki/papers/2602.11389) | ~10M | ~13s | 88.67% |
| [LeWorldModel](/wiki/papers/2603.19312) | 15M | <1s | Competitive |
| Cosmos (video gen) | Large | 240s | — |
Robotic navigation (Tartan Drive)
| Model | ATE | Planning time |
|---|---|---|
| NWM | ~5.7 | 103.2s |
| [V-JEPA 2.1](/wiki/papers/2603.14482) | 5.687 | 10.6s (~10x faster) |
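The speedup claim follows from the planning-time column; a quick check with the table's values:

```python
# Tartan Drive planning time per query (values from the table above).
nwm_time_s = 103.2        # NWM
vjepa21_time_s = 10.6     # V-JEPA 2.1

speedup = nwm_time_s / vjepa21_time_s
print(f"speedup: {speedup:.1f}x")  # speedup: 9.7x, i.e. roughly the quoted 10x
```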
Trajectory prediction
Hand manipulation (EgoDex)
| Model | ADE ↓ | FDE ↓ | Accuracy ↑ |
|---|---|---|---|
| V-JEPA predictor baseline | Higher | Higher | Lower |
| VLM-only baseline | Higher | Higher | Lower |
| [ThinkJEPA](/wiki/papers/2603.22281) | ~0.07 | ~0.064 | 73.878 |
ThinkJEPA outperforms both pure JEPA and pure VLM approaches, especially on long-horizon rollouts.
See also