Benchmarks and Results
A consolidated reference of quantitative results across the JEPA family, organized by task category. All numbers are from the respective papers; error bars are reported where available in the original work. Results without error bars should be interpreted as point estimates from single training runs unless noted otherwise.

Video understanding
Action recognition (top-1 accuracy)
| Model | SSv2 | K400 | Diving-48 |
|---|---|---|---|
| [V-JEPA 2](/wiki/papers/2506.09985) ViT-g | 75.3% | 86.6% | — |
| [V-JEPA 2](/wiki/papers/2506.09985) ViT-g384 | 77.3% | 87.3% | — |
| [V-JEPA 2.1](/wiki/papers/2603.14482) ViT-G | 77.7% | 87.7% | 89.2% |
| InternVideo2s2-1B | 69.7% | 89.4% | — |
SSv2 (Something-Something v2) is the key motion-understanding benchmark: it requires temporal reasoning, not just appearance. V-JEPA 2.1's ViT-G holds the state of the art within the JEPA family.
Action anticipation (Epic-Kitchens-100, Recall@5)
| Model | Verb | Noun | Action |
|---|---|---|---|
| PlausiVL (prior SOTA) | 55.6 | 54.2 | 27.6 |
| [V-JEPA 2](/wiki/papers/2506.09985) ViT-g384 | 63.6 | 57.1 | 39.7 |
| [V-JEPA 2.1](/wiki/papers/2603.14482) ViT-G | 64.3 | 59.9 | 40.8 |
V-JEPA models surpass task-specific baselines by large margins: V-JEPA 2 beats PlausiVL by +12.1 points on action Recall@5 (39.7 vs. 27.6).
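The headline margin can be read directly off the table; a quick check with the table's values (an illustrative calculation, not an official evaluation script):

```python
# Epic-Kitchens-100 action anticipation, Recall@5 (values from the table above).
plausivl_action = 27.6    # PlausiVL, prior SOTA
vjepa2_action = 39.7      # V-JEPA 2 ViT-g384

margin = vjepa2_action - plausivl_action
print(f"absolute margin: {margin:.1f} points")  # absolute margin: 12.1 points
```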
Video question answering (with LLM alignment)
| Model | PerceptionTest | TempCompass | TemporalBench | TOMATO |
|---|---|---|---|---|
| [V-JEPA 2](/wiki/papers/2506.09985) ViT-g384 + Llama 3.1 8B | 84.0 | 76.9 | 36.7 | 40.3 |
| PerceptionLM 8B | 82.7 | 72.7 | 28.3 | 33.2 |
V-JEPA 2 aligned with Llama 3.1 8B is state of the art at the 8B parameter scale across multiple VQA benchmarks.
Short-term object anticipation (Ego4D)
| Model | mAP All | AP (bbox) |
|---|---|---|
| Prior SOTA | ~5.7 | — |
| [V-JEPA 2.1](/wiki/papers/2603.14482) ViT-G | 7.71 | 50.7 |
+35% relative improvement over prior SOTA.
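The relative-improvement figure follows from the mAP column above; a quick sanity check (note the prior-SOTA value is only approximate, "~5.7"):

```python
# Ego4D short-term object anticipation, mAP All (values from the table above).
prior_sota_map = 5.7      # approximate prior SOTA
vjepa21_map = 7.71        # V-JEPA 2.1 ViT-G

rel_improvement = (vjepa21_map - prior_sota_map) / prior_sota_map
print(f"relative improvement: {rel_improvement:.1%}")  # relative improvement: 35.3%
```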
Image understanding
Classification (ImageNet, top-1 accuracy)
| Model | Accuracy |
|---|---|
| [V-JEPA 2](/wiki/papers/2506.09985) ViT-g384 | 85.1% |
| [V-JEPA 2.1](/wiki/papers/2603.14482) ViT-G | 85.5% |
| DINOv2 ViT-g | 86.1% |
JEPA approaches but does not quite match DINOv2 on ImageNet. This is expected, since DINOv2 is trained specifically for image classification.
Dense understanding (linear probe)
| Task | V-JEPA 2 ViT-g | V-JEPA 2.1 ViT-G | Improvement |
|---|---|---|---|
| ADE20K segmentation (mIoU) | 24.5 | 47.9 | +23.4 |
| Cityscapes segmentation (mIoU) | 45.9 | 73.5 | +27.6 |
| VOC12 segmentation (mIoU) | 64.3 | 85.0 | +20.7 |
| NYUv2 depth (RMSE, lower=better) | — | 0.307 | SOTA |
| KITTI depth (RMSE) | — | 2.461 | — |
The Dense Predictive Loss in V-JEPA 2.1 is transformative for dense tasks, adding +20 to +28 mIoU on segmentation over V-JEPA 2.
Video object segmentation (J&F-Mean)
| Model | DAVIS-17 | YouTube-VOS |
|---|---|---|
| [V-JEPA 2.1](/wiki/papers/2603.14482) ViT-G | 69.0 | 72.7 |
3D understanding
Point cloud classification (ModelNet40)
| Model | Linear SVM | End-to-end |
|---|---|---|
| [Point-JEPA](/wiki/papers/2404.16432) | 93.7±0.2% | 93.8±0.2% |
| [3D-JEPA](/wiki/papers/2409.15803) (300 epochs) | — | 94.49% |
ScanObjectNN (real-world 3D)
| Model | PB_T50_RS |
|---|---|
| [3D-JEPA](/wiki/papers/2409.15803) (150 epochs) | 88.65% |
| [3D-JEPA](/wiki/papers/2409.15803) (300 epochs) | 89.52% |
3D-JEPA achieves strong results with half the pretraining epochs of competing methods.
Few-shot learning
| Model | 5-way 10-shot | 10-way 10-shot |
|---|---|---|
| [Point-JEPA](/wiki/papers/2404.16432) | SOTA | SOTA |
| [3D-JEPA](/wiki/papers/2409.15803) (150 epochs) | 97.6±2.0% | 98.8±0.4% |
Reasoning
Visual QA (CLEVRER)
| Model | Overall | Counterfactual |
|---|---|---|
| [C-JEPA](/wiki/papers/2602.11389) (VideoSAUR, 4-mask) | 89.40% | 68.81% |
| Same, no object masking | 82.79% | 47.68% |
| SlotFormer | 79.44% | 47.29% |
Object-level masking adds +21 points absolute on counterfactual reasoning (68.81% vs. 47.68%).
Robot control
Manipulation success rate
| Task | [V-JEPA 2-AC](/wiki/papers/2506.09985) | Octo (BC) |
|---|---|---|
| Reaching | 100% | — |
| Grasp (cup) | 65% | ~15-70% |
| Pick-and-place (cup) | 80% | — |
| Pick-and-place (box) | 65% | — |
Planning efficiency
| Model | Params | Time per action | Push-T success |
|---|---|---|---|
| [V-JEPA 2-AC](/wiki/papers/2506.09985) | 300M | 16s | — |
| [C-JEPA](/wiki/papers/2602.11389) | ~10M | ~13s | 88.67% |
| [LeWorldModel](/wiki/papers/2603.19312) | 15M | <1s | Competitive |
| Cosmos (video gen) | Large | 240s | — |
Robotic navigation (Tartan Drive)
| Model | ATE | Planning time |
|---|---|---|
| NWM | ~5.7 | 103.2s |
| [V-JEPA 2.1](/wiki/papers/2603.14482) | 5.687 | 10.6s (~10x faster) |
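The speedup claim follows from the planning-time column; a quick check with the table's values:

```python
# Tartan Drive planning time per query (values from the table above).
nwm_time_s = 103.2        # NWM
vjepa21_time_s = 10.6     # V-JEPA 2.1

speedup = nwm_time_s / vjepa21_time_s
print(f"speedup: {speedup:.1f}x")  # speedup: 9.7x, i.e. roughly the quoted 10x
```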
Trajectory prediction
Hand manipulation (EgoDex)
| Model | ADE ↓ | FDE ↓ | Accuracy ↑ |
|---|---|---|---|
| V-JEPA predictor baseline | Higher | Higher | Lower |
| VLM-only baseline | Higher | Higher | Lower |
| [ThinkJEPA](/wiki/papers/2603.22281) | ~0.07 | ~0.064 | 73.878 |
ThinkJEPA outperforms both pure JEPA and pure VLM approaches, especially on long-horizon rollouts.
See also