C-JEPA: Learning World Models through Object-Level Latent Interventions
arXiv: 2602.11389
Date: 2026-02-13
Modality: video/causal
Authors: Mirko Brankovic, Vaishnav Tadiparthi, Fei Deng, Jiahui Lei + 2 more
Tags: object-centric, causal-reasoning, world-model, counterfactual, planning
Source: Full text
C-JEPA (Causal-JEPA)
C-JEPA extends JEPA toward object-centric and causal reasoning. By masking at the level of whole objects rather than patches, it encourages the model to learn structured, causally meaningful representations.
Core idea
C-JEPA extends masked joint-embedding prediction from image patches to object-centric representations. Instead of masking spatial patches, it masks entire objects, so the state of a masked object must be inferred from the other objects in the scene. This acts as a latent intervention with counterfactual-like effects.
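The masking step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `(T, N, D)` slot layout, the function names `mask_objects` and `masked_latent_loss`, the choice to mask an object across all frames, and the squared-error loss are all assumptions made for the example.

```python
import numpy as np


def mask_objects(slots, mask_token, num_masked, rng):
    """Replace whole object slots with a mask token.

    slots:      (T, N, D) array, T frames, N object slots, D latent dims
                (hypothetical layout, assumed for this sketch).
    mask_token: (D,) vector standing in for a learned mask embedding.
    Returns the masked copy and a boolean (N,) array marking masked objects.
    """
    _, num_objects, _ = slots.shape
    masked_ids = rng.choice(num_objects, size=num_masked, replace=False)
    is_masked = np.zeros(num_objects, dtype=bool)
    is_masked[masked_ids] = True

    masked = slots.copy()
    # Assumption: the object is hidden in every frame, so its state can
    # only be recovered from the remaining objects.
    masked[:, is_masked, :] = mask_token
    return masked, is_masked


def masked_latent_loss(pred, target, is_masked):
    """Mean squared error in latent space, on masked objects only."""
    diff = pred[:, is_masked, :] - target[:, is_masked, :]
    return float(np.mean(diff ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    slots = rng.normal(size=(4, 6, 8))   # 4 frames, 6 objects, 8 dims
    token = np.zeros(8)
    masked, is_masked = mask_objects(slots, token, num_masked=2, rng=rng)
    print("masked objects:", np.flatnonzero(is_masked))
    print("loss of a perfect predictor:",
          masked_latent_loss(slots, slots, is_masked))
```

Unlike patch masking, where a hidden patch can often be filled in from neighbouring patches of the same object, every feature of the chosen object is hidden here, so a predictor has nothing local to copy from.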
Key contributions
- Object-level masking: prevents shortcut solutions and makes interaction reasoning essential
- Causal inductive bias: formal analysis showing object-level masking induces causal structure via latent interventions
- Efficiency: uses only 1% of the total latent input features required by patch-based world models for planning
Results
- +21.13 percentage points (absolute) in counterfactual reasoning on CLEVRER VQA (68.81% vs 47.68% without object masking)
- Overall VQA accuracy: 89.40% (with VideoSAUR encoder, 4-object masking)
- Planning with about 1% of the latent input features: 88.67% success on Push-T manipulation, 2.66 points below the 91.33% of patch-based DINO-WM, which uses 196x384 patch features
- About 8.6x faster planning: 673 s vs 5,763 s for 50 trajectories on Push-T
- Code: github.com/galilai-group/cjepa
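The headline figures above can be checked with simple arithmetic. The sketch below uses only the numbers reported in this section; reading "196x384" as 196 patch tokens of 384 dimensions each is an assumption, and the variable names are illustrative.

```python
# Counterfactual accuracy on CLEVRER VQA, with and without object masking.
cf_with, cf_without = 68.81, 47.68
gain_pp = cf_with - cf_without        # absolute gain, in percentage points
gain_rel = gain_pp / cf_without       # gain relative to the baseline

# Planning time on Push-T for 50 trajectories, in seconds.
t_cjepa, t_patch = 673, 5763
speedup = t_patch / t_cjepa

# Patch-based input size, assuming 196 tokens x 384 dims per token.
patch_features = 196 * 384
one_percent_budget = 0.01 * patch_features

# Planning success trade-off on Push-T, in percentage points.
success_gap = 91.33 - 88.67

print(f"absolute gain:  {gain_pp:.2f} pp")
print(f"relative gain:  {gain_rel:.1%}")
print(f"speedup:        {speedup:.2f}x")
print(f"patch features: {patch_features}")
print(f"1% budget:      {one_percent_budget:.0f} features")
print(f"success gap:    {success_gap:.2f} pp")
```

The distinction between the absolute gain (about 21 points) and the relative gain (about 44% over the baseline) matters when comparing against other papers, which may report either.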
Prerequisites and limitations
- Requires an object extractor: C-JEPA doesn't discover objects from pixels. It relies on a pretrained object-centric encoder (VideoSAUR with frozen DINOv2 features, or SAVi). The quality of the object extractor directly limits the world model.
- "Causal" should be read carefully: C-JEPA induces time-lagged predictive dependencies that are consistent with causal structure, not provably causal in the interventionist sense. The paper provides a formal analysis of "influence neighborhoods" but these capture predictive sufficiency, not necessarily causal mechanisms.
- Evaluated on synthetic data: CLEVRER (simple geometric objects) and Push-T (2D manipulation). Real-world scenes with complex objects and occlusion remain untested.
Significance in the JEPA timeline
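C-JEPA marks a move in the JEPA line from "predicting what happens next" toward "understanding why things happen." Its causal inductive bias is a qualitative step beyond purely spatial or temporal prediction, with the caveat noted above that "causal" here means predictive dependencies consistent with causal structure.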
Links
See also
- 2506.09985 (V-JEPA 2) — the patch-based world model it improves upon conceptually
- masking-strategies — object-level vs patch-level masking