ThinkJEPA
A forward-looking direction that combines JEPA latent world models with a semantic "thinking" pathway from vision-language models, targeting long-horizon reasoning and planning.
Core idea
A dual-temporal pathway architecture:
- Dense JEPA branch: densely sampled frames for fine-grained motion and interaction cues
- VLM thinker branch: uniformly sampled frames with larger temporal stride for knowledge-rich semantic guidance (uses Qwen3-VL Thinking)
The VLM provides long-horizon context, entity recognition, and general world knowledge that purely visual JEPA predictors lack.
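The dual-temporal sampling described above can be sketched as index selection over a clip. This is a minimal illustration, not the paper's actual sampler; the function name, window lengths, and the choice of a trailing dense window are assumptions.

```python
import numpy as np

def dual_temporal_sample(num_frames: int, dense_len: int = 16,
                         sparse_len: int = 8) -> tuple[np.ndarray, np.ndarray]:
    """Return frame indices for the two branches (hypothetical sampler).

    Dense branch: a contiguous stride-1 window of `dense_len` frames ending
    at the most recent frame, for fine-grained motion and interaction cues.
    Sparse branch: `sparse_len` frames sampled uniformly across the whole
    clip (large temporal stride), for long-horizon semantic context.
    """
    dense = np.arange(max(0, num_frames - dense_len), num_frames)
    sparse = np.linspace(0, num_frames - 1, sparse_len).round().astype(int)
    return dense, sparse

# For a 128-frame clip: dense covers the last 16 frames densely,
# while sparse spans the entire clip at a coarse stride.
dense_idx, sparse_idx = dual_temporal_sample(128)
```

The key design point is that the two branches see the same clip at different temporal resolutions, so dynamics and semantics are extracted from complementary views.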
Key contributions
- Dual-temporal perception field sampling: dense sampling for dynamics + sparse sampling for semantics
- Hierarchical pyramid representation extraction: aggregates multi-layer VLM representations into guidance features compatible with latent prediction
- Layer-wise guidance injection: VLM representations are injected into the JEPA predictor at multiple layers
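The second and third contributions can be sketched together: pool multi-layer VLM hidden states into compact guidance vectors, then add them to the JEPA predictor's hidden states at matching layers. This is a minimal numpy sketch under stated assumptions; the mean-pooling, the per-layer projection (random here, standing in for a learned matrix), and additive injection are all simplifications of whatever the actual method does.

```python
import numpy as np

rng = np.random.default_rng(0)

def pyramid_guidance(vlm_layers: list[np.ndarray], d_pred: int) -> list[np.ndarray]:
    """Aggregate multi-layer VLM features into per-layer guidance vectors.

    Each entry of `vlm_layers` is a (tokens, d_vlm) hidden-state matrix from
    one VLM layer. We mean-pool over tokens and project into the predictor
    width `d_pred`; the projection is random here as a stand-in for a
    learned matrix.
    """
    guidance = []
    for h in vlm_layers:
        pooled = h.mean(axis=0)                                   # (d_vlm,)
        W = rng.standard_normal((h.shape[1], d_pred)) / np.sqrt(h.shape[1])
        guidance.append(pooled @ W)                               # (d_pred,)
    return guidance

def inject(predictor_states: list[np.ndarray],
           guidance: list[np.ndarray]) -> list[np.ndarray]:
    """Add one guidance vector per predictor layer, broadcast over tokens."""
    return [h + g[None, :] for h, g in zip(predictor_states, guidance)]
```

Additive injection keeps the predictor's token layout untouched, which is one simple way to make VLM features "compatible with latent prediction" as the contribution list puts it.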
Why not just use VLMs directly?
Three limitations of standalone VLMs for dense prediction:
- Compute-driven sparsity: quadratic attention cost limits frame count
- Language-output bottleneck: continuous interaction states compressed into text-oriented representations
- Data regime mismatch: fine-tuning VLMs on small domain datasets causes catastrophic forgetting
ThinkJEPA avoids these by using the VLM as a guide, not a replacement.
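The compute-driven sparsity point can be made concrete with a little arithmetic: self-attention cost scales with the square of the token count, and tokens scale linearly with frames, so doubling the frame budget quadruples attention cost. The tokens-per-frame figure below is an illustrative assumption, not a number from the source.

```python
def attn_pair_count(frames: int, tokens_per_frame: int = 256) -> int:
    """Pairwise attention interactions: quadratic in total token count.

    tokens_per_frame = 256 is an assumed patch count for illustration.
    """
    tokens = frames * tokens_per_frame
    return tokens * tokens

# Doubling the frame count quadruples the attention cost:
ratio = attn_pair_count(64) / attn_pair_count(32)
# ratio == 4.0
```

This is why a standalone VLM must sample frames sparsely, and why ThinkJEPA leaves dense temporal coverage to the cheaper JEPA branch.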
Results
Outperforms both VLM-only and JEPA-predictor baselines on hand-manipulation trajectory prediction, and shows more robust long-horizon rollout behavior.
Significance in the JEPA timeline
Represents the convergence of two major paradigms: JEPA-style latent world models and large vision-language models. Points toward a future where world models have both fine-grained physical dynamics AND high-level semantic understanding.
See also
- 2506.09985 (V-JEPA 2) — the base world model it extends
- 2603.14482 (V-JEPA 2.1) — the dense feature upgrade
- 2501.14622 (ACT-JEPA) — earlier action-conditioned approach