NEPA (Next-Embedding Predictive Autoregression)
A JEPA-adjacent approach by Saining Xie et al. that applies autoregressive next-embedding prediction to vision. Instead of JEPA's bidirectional masked prediction, NEPA uses causal (left-to-right) prediction — mirroring language model pretraining but operating in embedding space rather than token space.
Core idea
An image is split into patches and embedded into a sequence. An autoregressive transformer predicts the next patch embedding from all previous ones, using causal masking and stop-gradient on targets. No pixel reconstruction, no discrete tokens, no contrastive loss, no task-specific heads.
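The training objective described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the encoder, patch count, and embedding dimension are made up, attention is single-head, and the loss is plain MSE. Stop-gradient is a no-op in numpy; in a deep learning framework it would be `detach()`/`stop_gradient` on the targets.

```python
import numpy as np

rng = np.random.default_rng(0)

def causal_attention(x, mask):
    """Single-head scaled dot-product self-attention with a causal mask."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)          # (N, N) attention logits
    scores = np.where(mask, scores, -1e9)  # block attention to future patches
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

# Toy setup: 8 patch embeddings of dimension 16 (a real model embeds patches
# with a learned encoder; these are random stand-ins).
num_patches, dim = 8, 16
embeddings = rng.normal(size=(num_patches, dim))

# Causal mask: patch i may attend only to patches 0..i.
mask = np.tril(np.ones((num_patches, num_patches), dtype=bool))

# Predictor output at position i is read as the prediction for patch i+1.
predicted = causal_attention(embeddings, mask)

# Targets are the same embeddings shifted by one. Stop-gradient means no loss
# gradient flows into the targets (implicit here, `detach()` in a framework).
targets = embeddings[1:]
loss = np.mean((predicted[:-1] - targets) ** 2)
```

Because the mask is causal, the output at position 0 can only be a function of patch 0, which is what makes the setup autoregressive rather than bidirectional.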
Key difference from JEPA
| | JEPA (I-JEPA) | NEPA |
|---|---|---|
| Masking | Bidirectional (random blocks) | Causal (left-to-right) |
| Prediction | Multiple targets from context | Next embedding from all previous |
| Attention | Full attention on visible tokens | Causal attention |
| Paradigm | Masked prediction | Autoregressive prediction |
| Collapse prevention | EMA target encoder | Stop-gradient on targets |
Both predict in embedding space (not pixels), but NEPA's causal structure mirrors GPT-style language models while JEPA's random masking mirrors BERT-style models.
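The masking contrast in the table can be made concrete. The sketch below is illustrative only (patch count and block size are arbitrary, and real I-JEPA samples multiple target blocks): the NEPA-style mask is a fixed lower-triangular pattern, while the JEPA-style mask hides a random block and allows full attention among the visible patches.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches = 8

# NEPA-style causal mask: patch i attends only to patches 0..i
# (lower triangular, fixed for every image).
causal_mask = np.tril(np.ones((num_patches, num_patches), dtype=bool))

# I-JEPA-style mask (simplified): hold out a random contiguous block of
# 3 patches as prediction targets; the remaining visible patches attend
# to each other bidirectionally.
block_start = rng.integers(0, num_patches - 3)
target_idx = np.arange(block_start, block_start + 3)
visible = np.setdiff1d(np.arange(num_patches), target_idx)
bidirectional_mask = np.zeros((num_patches, num_patches), dtype=bool)
bidirectional_mask[np.ix_(visible, visible)] = True  # full attention among visible

# The causal mask is asymmetric (left-to-right only); the JEPA-style mask
# is symmetric, mirroring the GPT-vs-BERT distinction in the text.
```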
Results
- 83.8% top-1 accuracy on ImageNet-1K (ViT-B, fine-tuned)
- 85.3% top-1 accuracy on ImageNet-1K (ViT-L, fine-tuned)
- Strong transfer to ADE20K semantic segmentation
Significance
NEPA validates that the next-token prediction paradigm from language works for vision when operating in embedding space. This is complementary to JEPA: JEPA shows masked prediction works in embedding space; NEPA shows autoregressive prediction works too. Together, they suggest that embedding-space prediction is the key ingredient, not the specific masking pattern.
See also
- 2301.08243 (I-JEPA) — bidirectional masked prediction (the contrast)
- latent-prediction — the shared principle of predicting in embedding space
- masking-strategies — causal vs bidirectional masking