VL-JEPA
A vision-language model that predicts continuous text embeddings instead of autoregressively generating tokens. It is the first non-generative model to perform general-domain vision-language tasks.
Core idea
Classical VLMs generate text token-by-token: (image, query) -> tokens. VL-JEPA instead predicts the embedding of the target text: (image_embedding, query) -> text_embedding. A lightweight decoder translates embeddings to text only when needed.
Architecture
- x-encoder: maps vision inputs to embeddings
- y-encoder: maps target text to embeddings (training target)
- Predictor: learns (S_V, X_Q) -> S_Y in embedding space
- y-decoder: lightweight, invoked only at inference when text output is needed
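The training setup above can be sketched with linear stand-ins for each component (a minimal illustration: the real encoders are deep networks, the dimensions are invented, and the L2 objective is an assumption rather than the paper's exact loss):

```python
import numpy as np

rng = np.random.default_rng(0)
D_VIS, D_TXT, D_QRY = 64, 32, 16  # hypothetical embedding sizes

# Linear stand-ins for the three trainable components.
W_x = rng.normal(size=(128, D_VIS)) * 0.1        # x-encoder: vision input -> S_V
W_y = rng.normal(size=(48, D_TXT)) * 0.1         # y-encoder: target text -> S_Y
W_p = rng.normal(size=(D_VIS + D_QRY, D_TXT)) * 0.1  # predictor: (S_V, X_Q) -> S_Y

def jepa_loss(vision, query, text):
    """One training step's loss, computed entirely in embedding space.
    Note the y-decoder plays no role here: it is only needed at inference."""
    s_v = vision @ W_x                            # vision embedding S_V
    s_y = text @ W_y                              # target text embedding S_Y
    s_y_hat = np.concatenate([s_v, query], axis=-1) @ W_p  # predicted S_Y
    return float(np.mean((s_y_hat - s_y) ** 2))   # assumed L2 regression objective

loss = jepa_loss(rng.normal(size=(8, 128)),       # batch of vision features
                 rng.normal(size=(8, D_QRY)),     # batch of query features
                 rng.normal(size=(8, 48)))        # batch of target text features
```

Because the target is a single continuous vector rather than a token sequence, one forward pass yields the whole prediction; there is no autoregressive loop.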
Why embedding space is better
In token space, two valid answers to the same question can appear nearly orthogonal simply because they use different words. In embedding space, semantically equivalent answers map to nearby points. This simplifies the learning problem: the model can focus on task-relevant semantics rather than surface-level linguistic variability.
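A toy example makes the contrast concrete (the 2-d "embeddings" here are hand-made for illustration, not produced by any real encoder):

```python
import numpy as np

# Toy vocabulary; "dog" and "puppy" are both valid answers to the same question.
vocab = ["dog", "puppy", "piano"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Token space: distinct words are exactly orthogonal, so an equivalent answer
# ("puppy") looks just as wrong as an unrelated one ("piano").
assert cos(one_hot("dog"), one_hot("puppy")) == 0.0
assert cos(one_hot("dog"), one_hot("piano")) == 0.0

# Embedding space (illustrative vectors): semantic neighbors land close together.
emb = {"dog":   np.array([0.90, 0.10]),
       "puppy": np.array([0.85, 0.20]),
       "piano": np.array([-0.10, 0.95])}
near = cos(emb["dog"], emb["puppy"])   # close to 1
far = cos(emb["dog"], emb["piano"])    # near 0
```

An embedding-space loss therefore barely penalizes the model for predicting "puppy" instead of "dog", whereas a token-space cross-entropy loss penalizes them as entirely different outputs.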
Key results
- 50% fewer trainable parameters than an equivalent token-space VLM (same encoder, same data)
- 2.85x fewer decoding operations via selective decoding (decode only when embeddings change significantly)
- Outperforms CLIP, SigLIP2, and Perception Encoder on average across 8 video classification and 8 video retrieval datasets
- Matches InstructBLIP and Qwen-VL on VQA tasks (GQA, TallyQA, POPE) with only 1.6B parameters
- Natively supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA, all without architecture modification
Selective decoding
A capability unique to non-autoregressive VL-JEPA: during live video streaming, the model produces a continuous stream of predicted embeddings and invokes the text decoder only when the predicted embedding changes significantly. This enables real-time applications (action tracking, scene recognition) that autoregressive VLMs cannot support efficiently.
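A minimal sketch of the selective-decoding loop, assuming a cosine-distance change test (the threshold value and the exact change criterion are assumptions, not values from the paper):

```python
import numpy as np

def selective_decode(embedding_stream, decode_fn, threshold=0.2):
    """Invoke the (relatively expensive) y-decoder only when the predicted
    embedding drifts far enough from the last decoded one.
    `threshold` is an assumed cosine-distance cutoff."""
    last = None
    outputs = []
    for t, e in enumerate(embedding_stream):
        e = e / np.linalg.norm(e)                 # compare directions, not magnitudes
        if last is None or 1.0 - float(e @ last) > threshold:
            outputs.append((t, decode_fn(e)))     # decode only on significant change
            last = e
    return outputs

# Example: a 100-frame stream with one scene change at t=50.
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
stream = [a] * 50 + [b] * 50
calls = selective_decode(stream, decode_fn=lambda e: "<decoded text>")
# The decoder runs twice (at t=0 and t=50) instead of once per frame.
```

An autoregressive VLM would have to regenerate a full token sequence per query, whereas here the embedding stream itself acts as a cheap change detector in front of the decoder.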
Significance in the JEPA timeline
Shows that JEPA's non-generative principle extends to vision-language tasks, challenging the assumption that VLMs must be autoregressive token generators. The efficiency gains (50% fewer parameters, 2.85x fewer decoding ops) demonstrate practical advantages of latent prediction for multimodal AI.
Links
See also
- 2506.09985 (V-JEPA 2) — the vision encoder VL-JEPA builds on
- 2509.14252 (LLM-JEPA) — JEPA for pure language
- 2603.22281 (ThinkJEPA) — VLM as a guide for JEPA (the reverse direction)
- latent-prediction — why embedding-space prediction is more efficient