NEPA (Next-Embedding Predictive Autoregression)
A JEPA-adjacent approach by Saining Xie et al. that applies autoregressive next-embedding prediction to vision. Instead of JEPA's bidirectional masked prediction, NEPA uses causal (left-to-right) prediction — mirroring language model pretraining but operating in embedding space rather than token space.
Core idea
An image is split into patches and embedded into a sequence. An autoregressive transformer predicts the next patch embedding from all previous ones, using causal masking and stop-gradient on targets. No pixel reconstruction, no discrete tokens, no contrastive loss, no task-specific heads.
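The training objective described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the encoder, patch count, and embedding dimension are made up, attention is single-head, and the loss is plain MSE. Stop-gradient is a no-op in numpy; in a deep learning framework it would be `detach()`/`stop_gradient` on the targets.

```python
import numpy as np

rng = np.random.default_rng(0)

def causal_attention(x, mask):
    """Single-head scaled dot-product self-attention with a causal mask."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)          # (N, N) attention logits
    scores = np.where(mask, scores, -1e9)  # block attention to future patches
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

# Toy setup: 8 patch embeddings of dimension 16 (a real model embeds patches
# with a learned encoder; these are random stand-ins).
num_patches, dim = 8, 16
embeddings = rng.normal(size=(num_patches, dim))

# Causal mask: patch i may attend only to patches 0..i.
mask = np.tril(np.ones((num_patches, num_patches), dtype=bool))

# Predictor output at position i is read as the prediction for patch i+1.
predicted = causal_attention(embeddings, mask)

# Targets are the same embeddings shifted by one. Stop-gradient means no loss
# gradient flows into the targets (implicit here, `detach()` in a framework).
targets = embeddings[1:]
loss = np.mean((predicted[:-1] - targets) ** 2)
```

Because the mask is causal, the output at position 0 can only be a function of patch 0, which is what makes the setup autoregressive rather than bidirectional.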
Key difference from JEPA
| | JEPA (I-JEPA) | NEPA |
|---|---|---|
| Masking | Bidirectional (random blocks) | Causal (left-to-right) |
| Prediction | Multiple targets from context | Next embedding from all previous |
| Attention | Full attention on visible tokens | Causal attention |
| Paradigm | Masked prediction | Autoregressive prediction |
| Collapse prevention | EMA target encoder | Stop-gradient on targets |
Both predict in embedding space (not pixels), but NEPA's causal structure mirrors GPT-style language models while JEPA's random masking mirrors BERT-style models.
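The masking contrast in the table can be made concrete. The sketch below is illustrative only (patch count and block size are arbitrary, and real I-JEPA samples multiple target blocks): the NEPA-style mask is a fixed lower-triangular pattern, while the JEPA-style mask hides a random block and allows full attention among the visible patches.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches = 8

# NEPA-style causal mask: patch i attends only to patches 0..i
# (lower triangular, fixed for every image).
causal_mask = np.tril(np.ones((num_patches, num_patches), dtype=bool))

# I-JEPA-style mask (simplified): hold out a random contiguous block of
# 3 patches as prediction targets; the remaining visible patches attend
# to each other bidirectionally.
block_start = rng.integers(0, num_patches - 3)
target_idx = np.arange(block_start, block_start + 3)
visible = np.setdiff1d(np.arange(num_patches), target_idx)
bidirectional_mask = np.zeros((num_patches, num_patches), dtype=bool)
bidirectional_mask[np.ix_(visible, visible)] = True  # full attention among visible

# The causal mask is asymmetric (left-to-right only); the JEPA-style mask
# is symmetric, mirroring the GPT-vs-BERT distinction in the text.
```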
Results
- 83.8% top-1 accuracy on ImageNet-1K (ViT-B, fine-tuned)
- 85.3% top-1 accuracy on ImageNet-1K (ViT-L, fine-tuned)
- Strong transfer to ADE20K semantic segmentation
Significance
NEPA validates that the next-token prediction paradigm from language works for vision when operating in embedding space. This is complementary to JEPA: JEPA shows masked prediction works in embedding space; NEPA shows autoregressive prediction works too. Together, they suggest that embedding-space prediction is the key ingredient, not the specific masking pattern.
See also
- 2301.08243 (I-JEPA) — bidirectional masked prediction (the contrast)
- latent-prediction — the shared principle of predicting in embedding space
- masking-strategies — causal vs bidirectional masking