ThinkJEPA
A forward-looking direction that combines JEPA latent world models with a semantic "thinking" pathway from vision-language models, targeting long-horizon reasoning and planning.
Core idea
A dual-temporal pathway architecture:
- Dense JEPA branch: densely sampled frames for fine-grained motion and interaction cues
- VLM thinker branch: uniformly sampled frames with larger temporal stride for knowledge-rich semantic guidance (uses Qwen3-VL Thinking)
The VLM provides long-horizon context, entity recognition, and general world knowledge that purely visual JEPA predictors lack.
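The dual-temporal sampling described above can be sketched as index selection over a clip. This is a minimal illustration, not the paper's actual sampler; the function name, window lengths, and the choice of a trailing dense window are assumptions.

```python
import numpy as np

def dual_temporal_sample(num_frames: int, dense_len: int = 16,
                         sparse_len: int = 8) -> tuple[np.ndarray, np.ndarray]:
    """Return frame indices for the two branches (hypothetical sampler).

    Dense branch: a contiguous stride-1 window of `dense_len` frames ending
    at the most recent frame, for fine-grained motion and interaction cues.
    Sparse branch: `sparse_len` frames sampled uniformly across the whole
    clip (large temporal stride), for long-horizon semantic context.
    """
    dense = np.arange(max(0, num_frames - dense_len), num_frames)
    sparse = np.linspace(0, num_frames - 1, sparse_len).round().astype(int)
    return dense, sparse

# For a 128-frame clip: dense covers the last 16 frames densely,
# while sparse spans the entire clip at a coarse stride.
dense_idx, sparse_idx = dual_temporal_sample(128)
```

The key design point is that the two branches see the same clip at different temporal resolutions, so dynamics and semantics are extracted from complementary views.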
Key contributions
- Dual-temporal perception field sampling: dense sampling for dynamics + sparse sampling for semantics
- Hierarchical pyramid representation extraction: aggregates multi-layer VLM representations into guidance features compatible with latent prediction
- Layer-wise guidance injection: VLM representations are injected into the JEPA predictor at multiple layers
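The second and third contributions can be sketched together: pool multi-layer VLM hidden states into compact guidance vectors, then add them to the JEPA predictor's hidden states at matching layers. This is a minimal numpy sketch under stated assumptions; the mean-pooling, the per-layer projection (random here, standing in for a learned matrix), and additive injection are all simplifications of whatever the actual method does.

```python
import numpy as np

rng = np.random.default_rng(0)

def pyramid_guidance(vlm_layers: list[np.ndarray], d_pred: int) -> list[np.ndarray]:
    """Aggregate multi-layer VLM features into per-layer guidance vectors.

    Each entry of `vlm_layers` is a (tokens, d_vlm) hidden-state matrix from
    one VLM layer. We mean-pool over tokens and project into the predictor
    width `d_pred`; the projection is random here as a stand-in for a
    learned matrix.
    """
    guidance = []
    for h in vlm_layers:
        pooled = h.mean(axis=0)                                   # (d_vlm,)
        W = rng.standard_normal((h.shape[1], d_pred)) / np.sqrt(h.shape[1])
        guidance.append(pooled @ W)                               # (d_pred,)
    return guidance

def inject(predictor_states: list[np.ndarray],
           guidance: list[np.ndarray]) -> list[np.ndarray]:
    """Add one guidance vector per predictor layer, broadcast over tokens."""
    return [h + g[None, :] for h, g in zip(predictor_states, guidance)]
```

Additive injection keeps the predictor's token layout untouched, which is one simple way to make VLM features "compatible with latent prediction" as the contribution list puts it.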
Why not just use VLMs directly?
Three limitations of standalone VLMs for dense prediction:
- Compute-driven sparsity: quadratic attention cost limits frame count
- Language-output bottleneck: continuous interaction states compressed into text-oriented representations
- Data regime mismatch: fine-tuning VLMs on small domain datasets causes catastrophic forgetting
ThinkJEPA avoids these by using the VLM as a guide, not a replacement.
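The compute-driven sparsity point can be made concrete with a little arithmetic: self-attention cost scales with the square of the token count, and tokens scale linearly with frames, so doubling the frame budget quadruples attention cost. The tokens-per-frame figure below is an illustrative assumption, not a number from the source.

```python
def attn_pair_count(frames: int, tokens_per_frame: int = 256) -> int:
    """Pairwise attention interactions: quadratic in total token count.

    tokens_per_frame = 256 is an assumed patch count for illustration.
    """
    tokens = frames * tokens_per_frame
    return tokens * tokens

# Doubling the frame count quadruples the attention cost:
ratio = attn_pair_count(64) / attn_pair_count(32)
# ratio == 4.0
```

This is why a standalone VLM must sample frames sparsely, and why ThinkJEPA leaves dense temporal coverage to the cheaper JEPA branch.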
Results
Outperforms both VLM-only and JEPA-predictor baselines on hand-manipulation trajectory prediction, and shows more robust long-horizon rollout behavior.
Significance in the JEPA timeline
Represents the convergence of two major paradigms: JEPA-style latent world models and large vision-language models. Points toward a future where world models have both fine-grained physical dynamics AND high-level semantic understanding.
See also
- 2506.09985 (V-JEPA 2) — the base world model it extends
- 2603.14482 (V-JEPA 2.1) — the dense feature upgrade
- 2501.14622 (ACT-JEPA) — earlier action-conditioned approach