JEPAwiki
LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures
Date: 2025-09-19
Modality: language
Authors: Hai Huang, Yann LeCun, Randall Balestriero
Tags: language, LLM, finetuning, pretraining, multi-view
Source: Full text

LLM-JEPA

LLM-JEPA is the first application of JEPA to large language models. It shows that embedding-space training objectives, already proven effective in vision, also benefit LLMs in both pretraining and finetuning.

Core idea

Add a JEPA objective to LLM training by treating paired text data as different "views" of the same underlying knowledge. For example:

  • A GitHub issue (natural language) and its corresponding code diff (code) are two views of the same functionality
  • A natural language description and its regular expression are two views of the same pattern

The JEPA loss operates in embedding space: predict one view's embedding from the other's, complementing the standard text generation loss.
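The combined objective can be sketched as a generation loss plus an embedding-prediction term. The sketch below is illustrative only: the toy mean-pooled encoder, the identity predictor, and the cosine distance are assumptions for demonstration, not the paper's actual architecture or loss.

```python
import numpy as np

def embed(tokens, W):
    # Toy "encoder": mean-pool token vectors. Stands in for the LLM's
    # hidden-state embedding of a view (an assumption, not the paper's choice).
    return W[tokens].mean(axis=0)

def cosine_distance(a, b):
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def llm_jepa_loss(gen_loss, view_a, view_b, W, P, lam=1.0):
    """Combined objective: standard text-generation loss plus a JEPA term
    that predicts view B's embedding from view A's (names are illustrative)."""
    z_a = embed(view_a, W)
    z_b = embed(view_b, W)
    z_pred = P @ z_a          # linear predictor, purely for the sketch
    return gen_loss + lam * cosine_distance(z_pred, z_b)

rng = np.random.default_rng(0)
W = rng.normal(size=(100, 16))   # toy token-embedding table
P = np.eye(16)                   # identity predictor for the sketch
loss = llm_jepa_loss(2.3, [1, 5, 7], [2, 8], W, P)
```

Since the cosine distance lies in [0, 2], the JEPA term only ever adds a bounded penalty on top of the generation loss; the weight `lam` trades off the two objectives.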

Why this is hard for language

Vision JEPA benefits from natural multi-view data (masked patches, augmented views). Language lacks this — you need explicit paired views. LLM-JEPA currently requires datasets with non-trivial view pairs. Developing a general "data augmentation" mechanism for text (analogous to image augmentations) remains an open problem.
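Concretely, such a dataset is just a collection of aligned view pairs. A toy example in the style of NL-RX (the pairs below are invented for illustration, not drawn from the dataset):

```python
import re

# Each natural-language description and its regex are two views
# of the same pattern. Invented examples, NL-RX style.
view_pairs = [
    ("lines containing the word 'dog'", r".*dog.*"),
    ("strings made of one or more digits", r"[0-9]+"),
]

# Sanity check: each regex view matches what its text view describes.
for _, pattern in view_pairs:
    re.compile(pattern)  # raises if a pattern is malformed
```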

Results

LLM-JEPA outperforms the standard LLM training objective across:

  • Datasets: NL-RX (regex), GSM8K (math), Spider (SQL), RottenTomatoes (sentiment)
  • Model families: Llama3, OpenELM, Gemma2, Olmo

It is also robust to overfitting, a key practical advantage.

Significance in the JEPA timeline

This paper shows that JEPA is not vision-specific. Together with Audio-JEPA (audio) and Point-JEPA (3D), it means JEPA has now been demonstrated across all major modalities.
