Object-Centric Representations
While most JEPA variants operate on spatial patches, C-JEPA introduced a fundamentally different approach: operating on object-level representations. This shift from patches to objects enables causal reasoning, dramatic efficiency gains, and more structured world models.
Patches vs. objects
Standard JEPA (I-JEPA, V-JEPA 2) divides inputs into a regular grid of patches. This is simple and general, but it has drawbacks:
- Patches don't align with object boundaries
- A single object spans many patches (redundant)
- Interactions between objects are diluted across patch-pair attention
- Planning requires processing all patches (e.g., 196 patches x 384 dims = 75,264 features)
Object-centric models instead extract a fixed number of slots, each representing one object:
- Slots align with semantic entities
- Each object is one compact vector (e.g., 128 dims)
- Object interactions are directly modeled
- Planning uses far fewer features (e.g., 6 slots x 128 dims = 768 features)
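The arithmetic behind these budgets is worth making explicit. A quick check using the counts quoted above:

```python
# Feature-budget comparison: patch grid vs. object slots.
patch_count, patch_dim = 196, 384   # e.g. a 14x14 ViT patch grid
slot_count, slot_dim = 6, 128       # e.g. C-JEPA's object slots

patch_features = patch_count * patch_dim  # total features per frame, patches
slot_features = slot_count * slot_dim     # total features per frame, slots

print(patch_features)                     # 75264
print(slot_features)                      # 768
print(patch_features // slot_features)    # 98 -> roughly a 100x reduction
```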
Object extraction methods
VideoSAUR (used by C-JEPA)
- Frozen DINOv2 ViT-S/14 backbone extracts patch features
- Slot attention mechanism aggregates patch features into object slots
- Temporal similarity loss ensures slots track the same objects across frames
- Slot dimensionality: 128
SAVi (Slot Attention for Video)
- Iterative slot attention from raw pixels
- Reconstructs input frames from slot representations (uses a decoder)
- Fixed number of slots: 7 for CLEVRER, 4 for Push-T
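Both extractors are built on iterative slot attention. A minimal sketch of one slot-attention step, stripped of the learned projections, layer norms, and GRU update that SAVi and VideoSAUR use (all names here are illustrative):

```python
import numpy as np

def slot_attention_step(slots, inputs, eps=1e-8):
    """One simplified slot-attention update.

    slots:  (K, D) current slot vectors (queries)
    inputs: (N, D) patch features (keys/values)
    """
    K, D = slots.shape
    # Dot-product attention logits between slots and patches.
    logits = slots @ inputs.T / np.sqrt(D)               # (K, N)
    # Softmax over the SLOT axis: slots compete for each patch,
    # which is what encourages one slot per object.
    attn = np.exp(logits - logits.max(axis=0, keepdims=True))
    attn = attn / (attn.sum(axis=0, keepdims=True) + eps)
    # Normalize per slot and take a weighted mean of the patches.
    weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
    return weights @ inputs                              # (K, D) updated slots
```

In the full methods this step is iterated a few times per frame, with the result fed through a learned update before decoding or prediction.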
Object-level masking
C-JEPA's key innovation: instead of masking spatial patches, mask entire objects. At each timestep, selected objects are hidden, and their state must be inferred from:
- Other visible objects at the same timestep
- The masked object's own "identity anchor" from an earlier frame (t₀)
- Auxiliary variables (actions, proprioception)
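The masking procedure above can be sketched roughly as follows. The function name, the zero mask token, and the exact interface are assumptions, not the paper's code (in practice the mask token would typically be learned):

```python
import numpy as np

def mask_objects(slots, masked_ids, t0=0, mask_token=None):
    """Hide selected object slots at every timestep after t0.

    slots:      (T, K, D) per-frame object slots
    masked_ids: indices of the objects to hide
    Returns the masked slot tensor plus the identity anchors from frame t0,
    which let the model know WHICH object to infer without seeing its state.
    """
    T, K, D = slots.shape
    if mask_token is None:
        mask_token = np.zeros(D)  # placeholder; a learned token in practice
    anchors = slots[t0, masked_ids].copy()   # identity anchors at t0
    out = slots.copy()
    out[t0 + 1:, masked_ids] = mask_token    # hide all later observations
    return out, anchors
```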
This creates a fundamentally different learning signal than patch masking:
| Aspect | Patch masking | Object masking |
|---|---|---|
| What's hidden | Spatial region | Semantic entity |
| Prediction requires | Spatial interpolation | Interaction reasoning |
| Shortcut solutions | Copy nearby patches | Blocked (whole object missing) |
| Causal structure | Not induced | Induced via latent interventions |
Causal inductive bias
C-JEPA provides a formal analysis showing that object-level masking induces latent interventions analogous to causal interventions:
- Masking an object removes its direct observability without changing the underlying dynamics
- The model must infer the masked object's state from how it influences and is influenced by other objects
- This creates time-lagged predictive dependencies aligned with causal influence
- The influence neighborhood N_t(i) — the minimal set of other objects needed to predict object i — captures the causal structure
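One way to probe N_t(i) empirically is leave-one-out ablation with a trained predictor: drop each other object in turn and check whether the prediction of object i degrades. The `predict` interface below is hypothetical, standing in for the learned predictor:

```python
import numpy as np

def influence_neighborhood(predict, slots_t, i, tol):
    """Leave-one-out estimate of N_t(i).

    predict(visible_slots, visible_ids, i) -> predicted next state of object i
    (hypothetical interface). An object j is placed in the neighborhood if
    removing it changes the prediction of i by more than `tol`.
    """
    K = slots_t.shape[0]
    full = predict(slots_t, list(range(K)), i)  # prediction with everything visible
    neighborhood = []
    for j in range(K):
        if j == i:
            continue
        kept = [k for k in range(K) if k != j]
        pred = predict(slots_t[kept], kept, i)
        # If hiding j noticeably degrades the prediction, j influences i.
        if np.linalg.norm(pred - full) > tol:
            neighborhood.append(j)
    return neighborhood
```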
Counterfactual reasoning results
This causal structure has measurable effects. On CLEVRER visual QA:
| Model | Overall accuracy | Counterfactual accuracy |
|---|---|---|
| C-JEPA (4-object masking) | 89.40% | 68.81% |
| Same architecture, no masking | 82.79% | 47.68% |
| SlotFormer (with reconstruction) | 79.44% | 47.29% |
The roughly 21-percentage-point improvement in counterfactual accuracy demonstrates that object-level masking genuinely induces causal understanding, not just better pattern matching.
Planning efficiency
Object-centric representations enable dramatically more efficient planning:
| Model | Token budget | Push-T success | Planning time (50 traj) |
|---|---|---|---|
| DINO-WM (patches) | 196 x 384 | 91.33% | 5,763s |
| C-JEPA (objects) | 6 x 128 (1% of above) | 88.67% | 673s |
Nearly the same task success rate with roughly 100x fewer features and about 8.5x faster planning.
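The speedup comes from rolling out trajectories directly in the compact slot space. As a sketch, here is a generic random-shooting planner over latent features; all names and interfaces are illustrative, not DINO-WM's or C-JEPA's API:

```python
import numpy as np

def plan_random_shooting(dynamics, cost, z0, horizon=5, n_traj=50,
                         action_dim=2, rng=None):
    """Random-shooting planner over compact latent features.

    dynamics(z, a) -> next latent state (stands in for the learned predictor)
    cost(z)        -> scalar distance to the goal in latent space
    z0: flattened slot features, e.g. 6 slots x 128 dims = 768 values,
        versus 75,264 values for a 196 x 384 patch grid.
    """
    rng = rng or np.random.default_rng(0)
    # Sample candidate action sequences (50 trajectories, as in the table).
    actions = rng.normal(size=(n_traj, horizon, action_dim))
    costs = np.zeros(n_traj)
    for n in range(n_traj):
        z = z0
        for t in range(horizon):
            z = dynamics(z, actions[n, t])  # roll out entirely in latent space
        costs[n] = cost(z)                  # score the final latent state
    best = int(np.argmin(costs))
    return actions[best, 0]  # execute the first action of the best sequence
```

Because every rollout step touches ~768 features instead of ~75k, the same planning budget evaluates candidates far more cheaply.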
Masking budget
The optimal number of objects to mask depends on the encoder and task:
- With VideoSAUR (high-quality slots): masking all objects (4/4) works best for reasoning
- With SAVi (lower-quality slots): masking 2/4 objects is optimal
- Over-masking with weak encoders eliminates too much information
C-JEPA also found that object-level masking outperforms token-level and tube-level masking of equivalent budget — the object boundary is the right abstraction level.
See also
- masking-strategies — object-level masking in context
- world-models-and-planning — planning with object representations
- 2602.11389 — the C-JEPA paper