LLM-JEPA
The first application of JEPA to large language models. Shows that embedding-space training objectives — proven superior in vision — also benefit LLMs for both pretraining and finetuning.
Core idea
Add a JEPA objective to LLM training by treating paired text data as different "views" of the same underlying knowledge. For example:
- A GitHub issue (natural language) and its corresponding code diff (code) are two views of the same functionality
- A natural language description and its regular expression are two views of the same pattern
The JEPA loss operates in embedding space: predict one view's embedding from the other's, complementing the standard text generation loss.
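The combined objective can be sketched as follows. This is a hypothetical illustration, not the paper's exact implementation: the predictor architecture, the choice of cosine distance, the stop-gradient on the target embedding, and the weighting `lam` are all assumptions here.

```python
import torch
import torch.nn.functional as F

def jepa_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
              predictor: torch.nn.Module) -> torch.Tensor:
    """Predict view B's embedding from view A's and score the match.

    emb_a / emb_b: pooled hidden-state embeddings of the two views
    (e.g. an NL description and its regex). Cosine distance is one
    plausible choice of embedding-space loss; the target is detached
    so gradients flow only through the predictor branch (assumption).
    """
    pred = predictor(emb_a)
    return 1.0 - F.cosine_similarity(pred, emb_b.detach(), dim=-1).mean()

def combined_loss(lm_loss: torch.Tensor, emb_a: torch.Tensor,
                  emb_b: torch.Tensor, predictor: torch.nn.Module,
                  lam: float = 1.0) -> torch.Tensor:
    # standard text-generation loss plus the embedding-space JEPA term
    return lm_loss + lam * jepa_loss(emb_a, emb_b, predictor)

# toy usage: random tensors stand in for real model embeddings
torch.manual_seed(0)
d = 16
predictor = torch.nn.Linear(d, d)
emb_a, emb_b = torch.randn(4, d), torch.randn(4, d)
lm = torch.tensor(2.3)  # placeholder generation loss
total = combined_loss(lm, emb_a, emb_b, predictor)
```

The key point the sketch captures is that the JEPA term never decodes text: it compares embeddings directly, so the generation loss and the embedding-prediction loss train the model jointly.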
Why this is hard for language
Vision JEPA benefits from natural multi-view data (masked patches, augmented views). Language lacks this — you need explicit paired views. LLM-JEPA currently requires datasets with non-trivial view pairs. Developing a general "data augmentation" mechanism for text (analogous to image augmentations) remains an open problem.
Results
Outperforms standard LLM training objectives across:
- Multiple datasets: NL-RX (regex), GSM8K (math), Spider (SQL), RottenTomatoes (sentiment)
- Multiple model families: Llama3, OpenELM, Gemma2, Olmo
It is also robust to overfitting, a key practical advantage.
Significance in the JEPA timeline
Proves JEPA is not vision-specific. Combined with Audio-JEPA (audio), Point-JEPA (3D), and this paper (language), JEPA has been demonstrated across all major modalities.
Links
See also
- 2512.10942 (VL-JEPA) — JEPA for vision-language
- 2511.08544 (LeJEPA) — by the same authors, provides the theory
- latent-prediction — the core principle applied to language