C-JEPA (Causal-JEPA)

A conceptual extension that pushes JEPA toward object-centric and causal reasoning. By introducing object-level masking, it encourages learning structured and causally meaningful representations.

Core idea

Extends masked joint embedding prediction from image patches to object-centric representations. Instead of masking spatial patches, C-JEPA masks entire objects, so a masked object's state must be inferred from the other objects. Masking an object in latent space acts like an intervention on that object, with counterfactual-like effects.
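
The masking step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. The array layout (time steps, objects, feature dimension), the zero-valued mask token, the choice to mask an object at every time step, the mean-of-visible-objects stand-in predictor, and the names mask_objects and masked_latent_loss are all assumptions made for the example.

```python
import numpy as np

def mask_objects(slots, num_masked, mask_token, rng):
    """Replace every time step of `num_masked` randomly chosen objects with `mask_token`.

    slots: array of shape (T, N, D) = (time steps, objects, feature dim).
    Returns the masked copy and a boolean vector of shape (N,) marking masked objects.
    """
    _, num_objects, _ = slots.shape
    masked_ids = rng.choice(num_objects, size=num_masked, replace=False)
    is_masked = np.zeros(num_objects, dtype=bool)
    is_masked[masked_ids] = True
    masked = slots.copy()
    masked[:, is_masked, :] = mask_token  # (D,) broadcasts over (T, num_masked, D)
    return masked, is_masked

def masked_latent_loss(pred, target, is_masked):
    """Mean squared error in latent space, computed on masked objects only."""
    diff = pred[:, is_masked, :] - target[:, is_masked, :]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
slots = rng.normal(size=(8, 6, 16))  # 8 time steps, 6 objects, 16-dim latents (toy data)
mask_token = np.zeros(16)            # assumed fixed token; a learned token is equally plausible
masked, is_masked = mask_objects(slots, num_masked=4, mask_token=mask_token, rng=rng)

# Stand-in "predictor" (hypothetical): fill each masked object with the mean of
# the visible objects at the same time step. A real model would be a learned network.
pred = masked.copy()
pred[:, is_masked, :] = masked[:, ~is_masked, :].mean(axis=1, keepdims=True)
loss = masked_latent_loss(pred, slots, is_masked)
```

Because the loss is taken only over masked objects, the only information available for reconstructing them comes from the visible objects, which is the property the object-level masking relies on.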

Key contributions

Object-level masking: prevents shortcut solutions and makes interaction reasoning essential
Causal inductive bias: formal analysis showing object-level masking induces causal structure via latent interventions
Efficiency: uses only 1% of the total latent input features required by patch-based world models for planning

Results

+21 percentage points (absolute) in counterfactual reasoning on CLEVRER VQA (68.81% vs 47.68% without object masking)
Overall VQA accuracy: 89.40% (with VideoSAUR encoder, 4-object masking)
Planning with 1% of the latent input features: 88.67% success on Push-T manipulation (vs 91.33% for patch-based DINO-WM, which uses 196 patch tokens of dimension 384)
More than 8x faster planning: 673 s vs 5,763 s for 50 trajectories on Push-T
Code: github.com/galilai-group/cjepa
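
The headline figures can be cross-checked with simple arithmetic. The short script below uses only the numbers quoted above. Reading "196x384" as 196 patch tokens of dimension 384 is an assumption, and the resulting feature budget is an estimate implied by the 1% figure, not a number taken from the paper.

```python
# Counterfactual accuracy on CLEVRER VQA, with and without object masking (%).
cf_with, cf_without = 68.81, 47.68
gain_points = cf_with - cf_without  # absolute gain in percentage points

# Planning wall-clock time for 50 Push-T trajectories (seconds).
t_cjepa, t_dino_wm = 673, 5763
speedup = t_dino_wm / t_cjepa

# Push-T success rates (%): the price paid for the smaller input.
success_gap = 91.33 - 88.67

# Assumed reading of "196x384": 196 patch tokens, each a 384-dim feature vector.
patch_features = 196 * 384
one_percent_budget = 0.01 * patch_features

print(f"counterfactual gain: {gain_points:.2f} points")
print(f"planning speedup:    {speedup:.2f}x")
print(f"success-rate gap:    {success_gap:.2f} points")
print(f"patch features:      {patch_features}")
print(f"1% feature budget:   {one_percent_budget:.0f} (approx.)")
```

Under that reading, the object-centric model trades roughly 2.7 points of Push-T success for an input about two orders of magnitude smaller.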

Prerequisites and limitations

Requires an object extractor: C-JEPA doesn't discover objects from pixels. It relies on a pretrained object-centric encoder (VideoSAUR with frozen DINOv2 features, or SAVi). The quality of the object extractor directly limits the world model.
"Causal" should be read carefully: C-JEPA induces time-lagged predictive dependencies that are consistent with causal structure but are not provably causal in the interventionist sense. The paper provides a formal analysis of "influence neighborhoods", but these capture predictive sufficiency, not necessarily causal mechanisms.
Evaluated on synthetic data: CLEVRER (simple geometric objects) and Push-T (2D manipulation). Real-world scenes with complex objects and occlusion remain untested.

Significance in the JEPA timeline

Represents the move from "predicting what happens next" to "understanding why things happen." The causal inductive bias is a qualitative step beyond spatial/temporal prediction.
