JEPAwiki
ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model
Date: 2026-03-24
Modality: video/VLM
Authors: Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan + 4 more
Tags: world-model, VLM, reasoning, long-horizon, dual-temporal, planning
Source: Full text

ThinkJEPA

ThinkJEPA combines a JEPA latent world model with a semantic "thinking" pathway from a vision-language model (VLM). It targets long-horizon reasoning and planning.

[Figure: ThinkJEPA architecture]

Core idea

A dual-temporal pathway architecture:

  1. Dense JEPA branch: densely sampled frames for fine-grained motion and interaction cues
  2. VLM thinker branch: uniformly sampled frames with a larger temporal stride for knowledge-rich semantic guidance (uses Qwen3-VL Thinking)
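
As a rough illustration of the two sampling schedules, the sketch below builds frame indices for both branches. The function name `dual_temporal_indices`, the parameters `dense_window` and `sparse_stride`, and the choice to place the dense window at the end of the clip are assumptions made for this example, not details taken from the paper.

```python
def dual_temporal_indices(num_frames, dense_window, sparse_stride):
    """Return (dense, sparse) frame indices for a clip of num_frames frames.

    dense:  consecutive frames (stride 1) from the most recent window,
            intended for fine-grained motion cues (JEPA branch).
    sparse: frames at a uniform, larger stride over the whole clip,
            intended for long-horizon semantic context (VLM branch).
    Placing the dense window at the end of the clip is an assumption.
    """
    if num_frames <= 0 or dense_window <= 0 or sparse_stride <= 0:
        raise ValueError("all arguments must be positive")
    start = max(0, num_frames - dense_window)
    dense = list(range(start, num_frames))
    sparse = list(range(0, num_frames, sparse_stride))
    return dense, sparse


dense, sparse = dual_temporal_indices(num_frames=64, dense_window=8, sparse_stride=16)
print(dense)   # [56, 57, 58, 59, 60, 61, 62, 63]
print(sparse)  # [0, 16, 32, 48]
```

With these hypothetical settings the VLM branch sees only 4 frames spanning the whole clip, while the JEPA branch sees 8 consecutive frames.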

The VLM provides long-horizon context, entity recognition, and general world knowledge that purely visual JEPA predictors lack.

Key contributions

  • Dual-temporal perception field sampling: dense sampling for dynamics + sparse sampling for semantics
  • Hierarchical pyramid representation extraction: aggregates multi-layer VLM representations into guidance features compatible with latent prediction
  • Layer-wise guidance injection: VLM representations are injected into the JEPA predictor at multiple layers
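
The last two contributions can be pictured with a toy example. The sketch below is not the paper's mechanism: it assumes a weighted average over VLM layers as the aggregation and a gated additive update as the injection, purely to show the data flow. All names (`aggregate_pyramid`, `inject`, `run_predictor`, `gates`, `layer_fn`) are hypothetical.

```python
def aggregate_pyramid(layer_feats, weights):
    """Combine per-layer VLM feature vectors into one guidance vector.

    Assumption: a normalized weighted average across layers stands in
    for whatever aggregation the paper actually uses.
    """
    if len(layer_feats) != len(weights):
        raise ValueError("need one weight per layer")
    total = sum(weights)
    dim = len(layer_feats[0])
    return [
        sum(w * feats[i] for w, feats in zip(weights, layer_feats)) / total
        for i in range(dim)
    ]


def inject(hidden, guidance, gate):
    """Gated additive injection: hidden + gate * guidance (an assumption)."""
    return [h + gate * g for h, g in zip(hidden, guidance)]


def run_predictor(hidden, guidance, gates, layer_fn):
    """Apply the same guidance before every predictor layer, one gate per layer."""
    for gate in gates:
        hidden = layer_fn(inject(hidden, guidance, gate))
    return hidden


# Three toy VLM layers with 2-dimensional features; deeper layer weighted more.
vlm_layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
guidance = aggregate_pyramid(vlm_layers, weights=[1.0, 1.0, 2.0])
print(guidance)  # [0.75, 0.75]

# Identity layers isolate the effect of the injection itself.
out = run_predictor([0.0, 0.0], guidance, gates=[1.0, 1.0], layer_fn=lambda h: h)
print(out)  # [1.5, 1.5]
```

Setting every gate to zero recovers the unguided predictor, which is one reason a gated form is convenient for a sketch like this.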

Why not just use VLMs directly?

Three limitations of standalone VLMs for dense prediction:

  1. Compute-driven sparsity: quadratic attention cost limits frame count
  2. Language-output bottleneck: continuous interaction states compressed into text-oriented representations
  3. Data regime mismatch: fine-tuning VLMs on small domain datasets causes catastrophic forgetting

ThinkJEPA avoids these by using the VLM as a guide for the JEPA predictor, not as a replacement for it.
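
To make the first limitation concrete, the sketch below counts pairwise token interactions in full self-attention as the frame count grows. The figure of 256 tokens per frame is an arbitrary assumption for illustration, and `attention_pairs` is a hypothetical helper, not part of any library.

```python
def attention_pairs(num_frames, tokens_per_frame):
    """Number of query-key pairs in full self-attention over all frame tokens."""
    n = num_frames * tokens_per_frame
    return n * n


TOKENS_PER_FRAME = 256  # assumed value, for illustration only

few = attention_pairs(8, TOKENS_PER_FRAME)
many = attention_pairs(64, TOKENS_PER_FRAME)
print(few)          # 4194304
print(many)         # 268435456
print(many // few)  # 64
```

Increasing the frame count 8x raises the attention cost 64x, which is why a VLM branch with a large temporal stride is far cheaper than feeding it densely sampled frames.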

Results

ThinkJEPA outperforms both VLM-only and JEPA-predictor baselines on hand-manipulation trajectory prediction, and shows more robust long-horizon rollout behavior.

Significance in the JEPA timeline

ThinkJEPA represents the convergence of two major paradigms: JEPA-style latent world models and large vision-language models. It points toward world models that combine fine-grained physical dynamics with high-level semantic understanding.

See also

  • 2506.09985 (V-JEPA 2) — the base world model it extends
  • 2603.14482 (V-JEPA 2.1) — the dense feature upgrade
  • 2501.14622 (ACT-JEPA) — earlier action-conditioned approach