VL-JEPA
A vision-language model that predicts continuous text embeddings instead of autoregressively generating tokens. It is the first non-generative model to perform general-domain vision-language tasks.
Core idea
Classical VLMs generate text token-by-token: (image, query) -> tokens. VL-JEPA instead predicts the embedding of the target text: (image_embedding, query) -> text_embedding. A lightweight decoder translates embeddings to text only when needed.
Architecture
- x-encoder: maps vision inputs to embeddings
- y-encoder: maps target text to embeddings (training target)
- Predictor: learns (S_V, X_Q) -> S_Y in embedding space
- y-decoder: lightweight, invoked only at inference when text output is needed
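The training setup above can be sketched with linear stand-ins for each component (a minimal illustration: the real encoders are deep networks, the dimensions are invented, and the L2 objective is an assumption rather than the paper's exact loss):

```python
import numpy as np

rng = np.random.default_rng(0)
D_VIS, D_TXT, D_QRY = 64, 32, 16  # hypothetical embedding sizes

# Linear stand-ins for the three trainable components.
W_x = rng.normal(size=(128, D_VIS)) * 0.1        # x-encoder: vision input -> S_V
W_y = rng.normal(size=(48, D_TXT)) * 0.1         # y-encoder: target text -> S_Y
W_p = rng.normal(size=(D_VIS + D_QRY, D_TXT)) * 0.1  # predictor: (S_V, X_Q) -> S_Y

def jepa_loss(vision, query, text):
    """One training step's loss, computed entirely in embedding space.
    Note the y-decoder plays no role here: it is only needed at inference."""
    s_v = vision @ W_x                            # vision embedding S_V
    s_y = text @ W_y                              # target text embedding S_Y
    s_y_hat = np.concatenate([s_v, query], axis=-1) @ W_p  # predicted S_Y
    return float(np.mean((s_y_hat - s_y) ** 2))   # assumed L2 regression objective

loss = jepa_loss(rng.normal(size=(8, 128)),       # batch of vision features
                 rng.normal(size=(8, D_QRY)),     # batch of query features
                 rng.normal(size=(8, 48)))        # batch of target text features
```

Because the target is a single continuous vector rather than a token sequence, one forward pass yields the whole prediction; there is no autoregressive loop.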
Why embedding space is better
In token space, two valid answers to the same question can appear nearly orthogonal simply because they use different words. In embedding space, semantically equivalent answers map to nearby points. This simplifies the learning problem: the model can focus on task-relevant semantics rather than surface-level linguistic variability.
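A toy example makes the contrast concrete (the 2-d "embeddings" here are hand-made for illustration, not produced by any real encoder):

```python
import numpy as np

# Toy vocabulary; "dog" and "puppy" are both valid answers to the same question.
vocab = ["dog", "puppy", "piano"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Token space: distinct words are exactly orthogonal, so an equivalent answer
# ("puppy") looks just as wrong as an unrelated one ("piano").
assert cos(one_hot("dog"), one_hot("puppy")) == 0.0
assert cos(one_hot("dog"), one_hot("piano")) == 0.0

# Embedding space (illustrative vectors): semantic neighbors land close together.
emb = {"dog":   np.array([0.90, 0.10]),
       "puppy": np.array([0.85, 0.20]),
       "piano": np.array([-0.10, 0.95])}
near = cos(emb["dog"], emb["puppy"])   # close to 1
far = cos(emb["dog"], emb["piano"])    # near 0
```

An embedding-space loss therefore barely penalizes the model for predicting "puppy" instead of "dog", whereas a token-space cross-entropy loss penalizes them as entirely different outputs.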
Key results
- 50% fewer trainable parameters than an equivalent token-space VLM (same encoder, same data)
- 2.85x fewer decoding operations via selective decoding (decode only when embeddings change significantly)
- Outperforms CLIP, SigLIP2, and Perception Encoder on average across 8 video classification and 8 video retrieval datasets
- Matches InstructBLIP and Qwen-VL on VQA tasks (GQA, TallyQA, POPE) with only 1.6B parameters
- Natively supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA, all without architecture modification
Selective decoding
A capability unique to non-autoregressive VL-JEPA: during live video streaming, the model produces a continuous stream of predicted embeddings and invokes the text decoder only when the predicted embedding changes significantly. This enables real-time applications (action tracking, scene recognition) that autoregressive VLMs cannot support efficiently.
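A minimal sketch of the selective-decoding loop, assuming a cosine-distance change test (the threshold value and the exact change criterion are assumptions, not values from the paper):

```python
import numpy as np

def selective_decode(embedding_stream, decode_fn, threshold=0.2):
    """Invoke the (relatively expensive) y-decoder only when the predicted
    embedding drifts far enough from the last decoded one.
    `threshold` is an assumed cosine-distance cutoff."""
    last = None
    outputs = []
    for t, e in enumerate(embedding_stream):
        e = e / np.linalg.norm(e)                 # compare directions, not magnitudes
        if last is None or 1.0 - float(e @ last) > threshold:
            outputs.append((t, decode_fn(e)))     # decode only on significant change
            last = e
    return outputs

# Example: a 100-frame stream with one scene change at t=50.
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
stream = [a] * 50 + [b] * 50
calls = selective_decode(stream, decode_fn=lambda e: "<decoded text>")
# The decoder runs twice (at t=0 and t=50) instead of once per frame.
```

An autoregressive VLM would have to regenerate a full token sequence per query, whereas here the embedding stream itself acts as a cheap change detector in front of the decoder.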
Significance in the JEPA timeline
Shows that JEPA's non-generative principle extends to vision-language tasks, challenging the assumption that VLMs must be autoregressive token generators. The efficiency gains (50% fewer parameters, 2.85x fewer decoding ops) demonstrate practical advantages of latent prediction for multimodal AI.
Links
See also
- 2506.09985 (V-JEPA 2) — the vision encoder VL-JEPA builds on
- 2509.14252 (LLM-JEPA) — JEPA for pure language
- 2603.22281 (ThinkJEPA) — VLM as a guide for JEPA (the reverse direction)
- latent-prediction — why embedding-space prediction is more efficient