JEPAwiki
NEPA: Next-Embedding Prediction Makes Strong Vision Learners
Date: 2025-12-22
Modality: image
Authors: Sihan Xu, Ziqiao Ma, Wenhao Chai, Xuweiyi Chen + 4 more
Tags: autoregressive, next-embedding, causal, self-supervised-learning, related-work
Source: Full text

NEPA (Next-Embedding Predictive Autoregression)

A JEPA-adjacent approach by Saining Xie et al. that applies autoregressive next-embedding prediction to vision. Instead of JEPA's bidirectional masked prediction, NEPA uses causal (left-to-right) prediction — mirroring language model pretraining but operating in embedding space rather than token space.

NEPA Architecture

Core idea

An image is split into patches and embedded into a sequence. An autoregressive transformer predicts the next patch embedding from all previous ones, using causal masking and stop-gradient on targets. No pixel reconstruction, no discrete tokens, no contrastive loss, no task-specific heads.
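The training step described above can be sketched in a few lines of PyTorch. This is a toy illustration, not the paper's code: the shapes, the linear patch embedder, and the small transformer are all placeholder assumptions; the point is the causal mask and the stop-gradient (`detach`) on the next-embedding targets.

```python
import torch
import torch.nn as nn

# Toy dimensions (assumed, not from the paper).
B, N, D = 2, 16, 64          # batch, patches per image, embedding dim
PATCH_PIXELS = 768           # e.g. a flattened 16x16x3 patch

patch_embed = nn.Linear(PATCH_PIXELS, D)  # stand-in patch embedder
layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
predictor = nn.TransformerEncoder(layer, num_layers=2)

patches = torch.randn(B, N, PATCH_PIXELS)
z = patch_embed(patches)                  # (B, N, D) patch embeddings

# Causal (left-to-right) attention: position t sees only positions <= t.
causal_mask = nn.Transformer.generate_square_subsequent_mask(N)
h = predictor(z, mask=causal_mask, is_causal=True)

# Predict embedding t+1 from the prefix 1..t.
pred = h[:, :-1]                          # predictions for positions 2..N
target = z[:, 1:].detach()                # stop-gradient on targets
loss = nn.functional.mse_loss(pred, target)
loss.backward()
```

No pixel decoder, tokenizer, or contrastive pair appears anywhere: the loss is a plain regression in embedding space, and the `detach` on `target` is what stands in for JEPA's EMA target encoder as the collapse-prevention mechanism.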

Key difference from JEPA

|                     | JEPA (I-JEPA)                 | NEPA                             |
|---------------------|-------------------------------|----------------------------------|
| Masking             | Bidirectional (random blocks) | Causal (left-to-right)           |
| Prediction          | Multiple targets from context | Next embedding from all previous |
| Attention           | Full attention on visible tokens | Causal attention              |
| Paradigm            | Masked prediction             | Autoregressive prediction        |
| Collapse prevention | EMA target encoder            | Stop-gradient on targets         |

Both predict in embedding space (not pixels), but NEPA's causal structure mirrors GPT-style language models while JEPA's random masking mirrors BERT-style models.
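The two attention patterns can be made concrete with a small NumPy sketch. This is an illustration of the contrast, not code from either paper; the particular masked block is an arbitrary choice.

```python
import numpy as np

N = 6  # a sequence of 6 patch embeddings

# NEPA-style causal mask: position t attends only to positions <= t
# (lower-triangular), as in GPT-style language models.
causal = np.tril(np.ones((N, N), dtype=bool))

# I-JEPA-style masking: a random block of target patches is hidden
# (here positions 2-3, chosen arbitrarily); the visible context
# attends to itself with full bidirectional attention, as in BERT.
visible = np.array([True, True, False, False, True, True])
bidirectional = np.outer(visible, visible)

print(causal.astype(int))
print(bidirectional.astype(int))
```

The causal matrix is asymmetric (earlier patches never see later ones), while the bidirectional one is symmetric over the visible context, which is exactly the GPT-vs-BERT distinction the paragraph above draws.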

Results

  • 83.8% top-1 accuracy on ImageNet-1K (ViT-B, fine-tuned)
  • 85.3% top-1 accuracy on ImageNet-1K (ViT-L, fine-tuned)
  • Strong transfer to ADE20K semantic segmentation

Significance

NEPA validates that the next-token prediction paradigm from language works for vision when operating in embedding space. This is complementary to JEPA: JEPA shows masked prediction works in embedding space; NEPA shows autoregressive prediction works too. Together, they suggest that embedding-space prediction is the key ingredient, not the specific masking pattern.
