LLM-JEPA
The first application of JEPA to large language models. Shows that embedding-space training objectives — proven superior in vision — also benefit LLMs for both pretraining and finetuning.
Core idea
Add a JEPA objective to LLM training by treating paired text data as different "views" of the same underlying knowledge. For example:
- A GitHub issue (natural language) and its corresponding code diff (code) are two views of the same functionality
- A natural language description and its regular expression are two views of the same pattern
The JEPA loss operates in embedding space: predict one view's embedding from the other's, complementing the standard text generation loss.
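The combined objective can be sketched as follows. This is a hypothetical illustration, not the paper's exact implementation: the predictor architecture, the choice of cosine distance, the stop-gradient on the target embedding, and the weighting `lam` are all assumptions here.

```python
import torch
import torch.nn.functional as F

def jepa_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
              predictor: torch.nn.Module) -> torch.Tensor:
    """Predict view B's embedding from view A's and score the match.

    emb_a / emb_b: pooled hidden-state embeddings of the two views
    (e.g. an NL description and its regex). Cosine distance is one
    plausible choice of embedding-space loss; the target is detached
    so gradients flow only through the predictor branch (assumption).
    """
    pred = predictor(emb_a)
    return 1.0 - F.cosine_similarity(pred, emb_b.detach(), dim=-1).mean()

def combined_loss(lm_loss: torch.Tensor, emb_a: torch.Tensor,
                  emb_b: torch.Tensor, predictor: torch.nn.Module,
                  lam: float = 1.0) -> torch.Tensor:
    # standard text-generation loss plus the embedding-space JEPA term
    return lm_loss + lam * jepa_loss(emb_a, emb_b, predictor)

# toy usage: random tensors stand in for real model embeddings
torch.manual_seed(0)
d = 16
predictor = torch.nn.Linear(d, d)
emb_a, emb_b = torch.randn(4, d), torch.randn(4, d)
lm = torch.tensor(2.3)  # placeholder generation loss
total = combined_loss(lm, emb_a, emb_b, predictor)
```

The key point the sketch captures is that the JEPA term never decodes text: it compares embeddings directly, so the generation loss and the embedding-prediction loss train the model jointly.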
Why this is hard for language
Vision JEPA benefits from natural multi-view data (masked patches, augmented views). Language lacks this — you need explicit paired views. LLM-JEPA currently requires datasets with non-trivial view pairs. Developing a general "data augmentation" mechanism for text (analogous to image augmentations) remains an open problem.
Results
Outperforms standard LLM training objectives across:
- Multiple datasets: NL-RX (regex), GSM8K (math), Spider (SQL), RottenTomatoes (sentiment)
- Multiple model families: Llama3, OpenELM, Gemma2, Olmo
It is also robust to overfitting, a key practical advantage.
Significance in the JEPA timeline
Proves JEPA is not vision-specific. Combined with Audio-JEPA (audio), Point-JEPA (3D), and this paper (language), JEPA has been demonstrated across all major modalities.
Links
See also
- 2512.10942 (VL-JEPA) — JEPA for vision-language
- 2511.08544 (LeJEPA) — by the same authors, provides the theory
- latent-prediction — the core principle applied to language