JEPAwiki

Object-Centric Representations

While most JEPA variants operate on spatial patches, C-JEPA introduced a fundamentally different approach: operating on object-level representations. This shift from patches to objects enables causal reasoning, dramatic efficiency gains, and more structured world models.

Patches vs. objects

Standard JEPA (I-JEPA, V-JEPA 2) divides inputs into a regular grid of patches. This is simple and general, but:

  • Patches don't align with object boundaries
  • A single object spans many patches (redundant)
  • Interactions between objects are diluted across patch-pair attention
  • Planning requires processing all patches (e.g., 196 patches x 384 dims = 75,264 features)

Object-centric models instead extract a fixed number of slots, each representing one object:

  • Slots align with semantic entities
  • Each object is one compact vector (e.g., 128 dims)
  • Object interactions are directly modeled
  • Planning uses far fewer features (e.g., 6 slots x 128 dims = 768 features)
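The feature-budget numbers above are simple arithmetic; a quick sanity check (values taken directly from the text):

```python
# Feature-budget comparison: patch grid vs. object slots
# (numbers from the text; a back-of-envelope check, not a benchmark).
patch_features = 196 * 384   # patch grid: 196 patches x 384-dim tokens
slot_features = 6 * 128      # object slots: 6 slots x 128-dim vectors

print(patch_features)                   # 75264
print(slot_features)                    # 768
print(slot_features / patch_features)   # ~0.0102, i.e. about 1%
```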

[Figure: Patches vs. objects]

Object extraction methods

VideoSAUR (used by C-JEPA)

  • Frozen DINOv2 ViT-S/14 backbone extracts patch features
  • Slot attention mechanism aggregates patch features into object slots
  • Temporal similarity loss ensures slots track the same objects across frames
  • Slot dimensionality: 128
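The core aggregation step can be sketched as a single slot-attention iteration. This is an illustrative NumPy sketch, not the VideoSAUR implementation: the real module uses learned query/key/value projections, a GRU slot update, and several iterations, and the function name here is hypothetical.

```python
import numpy as np

def slot_attention_step(slots, feats, eps=1e-8):
    """One simplified slot-attention iteration.

    slots: (K, D) object slots; feats: (N, D) patch features.
    The defining property: softmax is taken over the SLOT axis, so
    patches compete to be explained by slots.
    """
    logits = slots @ feats.T                       # (K, N) dot-product scores
    attn = np.exp(logits - logits.max(axis=0))     # softmax over slots...
    attn = attn / attn.sum(axis=0, keepdims=True)  # ...patches pick a slot
    weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
    return weights @ feats                         # (K, D) weighted-mean update

rng = np.random.default_rng(0)
slots = rng.normal(size=(6, 128))     # 6 slots, 128-dim, matching the text
feats = rng.normal(size=(196, 128))   # patch features projected to slot dim
print(slot_attention_step(slots, feats).shape)   # (6, 128)
```

The softmax-over-slots normalization is what distinguishes slot attention from ordinary cross-attention: it forces an (approximately) exclusive assignment of patches to objects.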

SAVi (Slot Attention for Video)

  • Iterative slot attention from raw pixels
  • Reconstructs input frames from slot representations (uses a decoder)
  • Fixed number of slots: 7 for CLEVRER, 4 for Push-T

Object-level masking

C-JEPA's key innovation: instead of masking spatial patches, mask entire objects. At each timestep, selected objects are hidden, and their state must be inferred from:

  • Other visible objects at the same timestep
  • The masked object's own "identity anchor" from an earlier frame (t₀)
  • Auxiliary variables (actions, proprioception)
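The masking scheme above can be sketched on a slot tensor. The tensor layout and variable names here are hypothetical, for illustration only; the point is that entire slot vectors are hidden, while each masked object keeps an identity anchor from frame t₀.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, D = 8, 4, 128                   # timesteps, object slots, slot dim
slots = rng.normal(size=(T, K, D))    # object slots over time

masked_ids = [1, 3]                   # objects to hide after frame t0
visible = slots.copy()
visible[1:, masked_ids, :] = 0.0      # whole objects removed, not patches

anchors = slots[0, masked_ids, :]     # identity anchors from frame t0
# A predictor would have to infer slots[t, masked_ids] for t > 0 from the
# visible slots at t, the anchors, and auxiliary variables (actions,
# proprioception) -- copying nearby features is impossible by construction.
print(visible[1:, masked_ids].sum())  # 0.0: masked objects carry no signal
```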

This creates a fundamentally different learning signal than patch masking:

Aspect               Patch masking           Object masking
What's hidden        Spatial region          Semantic entity
Prediction requires  Spatial interpolation   Interaction reasoning
Shortcut solutions   Copy nearby patches     Blocked (whole object missing)
Causal structure     Not induced             Induced via latent interventions

Causal inductive bias

C-JEPA provides a formal analysis showing that object-level masking induces latent interventions analogous to causal interventions:

  1. Masking an object removes its direct observability without changing the underlying dynamics
  2. The model must infer the masked object's state from how it influences and is influenced by other objects
  3. This creates time-lagged predictive dependencies aligned with causal influence
  4. The influence neighborhood N_t(i) — the minimal set of other objects needed to predict object i — captures the causal structure
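Step 4 can be made concrete with a toy definition. This sketch is illustrative only: `predict_error` is a hypothetical stand-in for a learned predictor's error, and the brute-force search is for clarity, not efficiency.

```python
from itertools import combinations

def influence_neighborhood(i, objects, predict_error, tol):
    """Smallest set of other objects sufficient to predict object i.

    Tries subsets in order of increasing size, so the first subset whose
    prediction error falls within `tol` is minimal by construction.
    """
    others = [j for j in objects if j != i]
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            if predict_error(i, set(subset)) <= tol:
                return set(subset)
    return set(others)

# Toy dynamics in which object 0 is driven only by objects 1 and 2:
parents = {0: {1, 2}}
err = lambda i, s: 0.0 if parents[i] <= s else 1.0
print(influence_neighborhood(0, [0, 1, 2, 3], err, tol=0.5))  # {1, 2}
```

Under this toy error function the recovered neighborhood is exactly the set of causal parents, which is the alignment the formal analysis claims.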

Counterfactual reasoning results

This causal structure has measurable effects. On CLEVRER visual QA:

Model                             Overall accuracy  Counterfactual accuracy
C-JEPA (4-object masking)         89.40%            68.81%
Same architecture, no masking     82.79%            47.68%
SlotFormer (with reconstruction)  79.44%            47.29%

The improvement of roughly 21 percentage points in counterfactual accuracy suggests that object-level masking genuinely induces causal understanding, not just better pattern matching.

Planning efficiency

Object-centric representations enable dramatically more efficient planning:

Model              Token budget             Push-T success  Planning time (50 traj)
DINO-WM (patches)  196 x 384                91.33%          5,763s
C-JEPA (objects)   6 x 128 (~1% of above)   88.67%          673s

Nearly the same task performance with 100x fewer features and 8x faster planning.

Masking budget

The optimal number of objects to mask depends on the encoder and task:

  • With VideoSAUR (high-quality slots): masking all objects (4/4) works best for reasoning
  • With SAVi (lower-quality slots): masking 2/4 objects is optimal
  • Over-masking with weak encoders eliminates too much information

C-JEPA also found that object-level masking outperforms token-level and tube-level masking of equivalent budget — the object boundary is the right abstraction level.

See also