JEPAwiki

Object-Centric Representations

While most JEPA variants operate on spatial patches, C-JEPA introduced a fundamentally different approach: operating on object-level representations. This shift from patches to objects enables causal reasoning, dramatic efficiency gains, and more structured world models.

Patches vs. objects

Standard JEPA (I-JEPA, V-JEPA 2) divides inputs into a regular grid of patches. This is simple and general, but:

  • Patches don't align with object boundaries
  • A single object spans many patches (redundant)
  • Interactions between objects are diluted across patch-pair attention
  • Planning requires processing all patches (e.g., 196 patches x 384 dims = 75,264 features)

Object-centric models instead extract a fixed number of slots, each representing one object:

  • Slots align with semantic entities
  • Each object is one compact vector (e.g., 128 dims)
  • Object interactions are directly modeled
  • Planning uses far fewer features (e.g., 6 slots x 128 dims = 768 features)
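The feature-budget numbers above are simple arithmetic; a quick sanity check (values taken directly from the text):

```python
# Feature-budget comparison: patch grid vs. object slots
# (numbers from the text; a back-of-envelope check, not a benchmark).
patch_features = 196 * 384   # patch grid: 196 patches x 384-dim tokens
slot_features = 6 * 128      # object slots: 6 slots x 128-dim vectors

print(patch_features)                   # 75264
print(slot_features)                    # 768
print(slot_features / patch_features)   # ~0.0102, i.e. about 1%
```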

[Figure: Patches vs. objects]

Object extraction methods

VideoSAUR (used by C-JEPA)

  • Frozen DINOv2 ViT-S/14 backbone extracts patch features
  • Slot attention mechanism aggregates patch features into object slots
  • Temporal similarity loss ensures slots track the same objects across frames
  • Slot dimensionality: 128
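The core aggregation step can be sketched as a single slot-attention iteration. This is an illustrative NumPy sketch, not the VideoSAUR implementation: the real module uses learned query/key/value projections, a GRU slot update, and several iterations, and the function name here is hypothetical.

```python
import numpy as np

def slot_attention_step(slots, feats, eps=1e-8):
    """One simplified slot-attention iteration.

    slots: (K, D) object slots; feats: (N, D) patch features.
    The defining property: softmax is taken over the SLOT axis, so
    patches compete to be explained by slots.
    """
    logits = slots @ feats.T                       # (K, N) dot-product scores
    attn = np.exp(logits - logits.max(axis=0))     # softmax over slots...
    attn = attn / attn.sum(axis=0, keepdims=True)  # ...patches pick a slot
    weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
    return weights @ feats                         # (K, D) weighted-mean update

rng = np.random.default_rng(0)
slots = rng.normal(size=(6, 128))     # 6 slots, 128-dim, matching the text
feats = rng.normal(size=(196, 128))   # patch features projected to slot dim
print(slot_attention_step(slots, feats).shape)   # (6, 128)
```

The softmax-over-slots normalization is what distinguishes slot attention from ordinary cross-attention: it forces an (approximately) exclusive assignment of patches to objects.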

SAVi (Slot Attention for Video)

  • Iterative slot attention from raw pixels
  • Reconstructs input frames from slot representations (uses a decoder)
  • Fixed number of slots: 7 for CLEVRER, 4 for Push-T

Object-level masking

C-JEPA's key innovation: instead of masking spatial patches, mask entire objects. At each timestep, selected objects are hidden, and their state must be inferred from:

  • Other visible objects at the same timestep
  • The masked object's own "identity anchor" from an earlier frame (t₀)
  • Auxiliary variables (actions, proprioception)
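The masking scheme above can be sketched on a slot tensor. The tensor layout and variable names here are hypothetical, for illustration only; the point is that entire slot vectors are hidden, while each masked object keeps an identity anchor from frame t₀.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, D = 8, 4, 128                   # timesteps, object slots, slot dim
slots = rng.normal(size=(T, K, D))    # object slots over time

masked_ids = [1, 3]                   # objects to hide after frame t0
visible = slots.copy()
visible[1:, masked_ids, :] = 0.0      # whole objects removed, not patches

anchors = slots[0, masked_ids, :]     # identity anchors from frame t0
# A predictor would have to infer slots[t, masked_ids] for t > 0 from the
# visible slots at t, the anchors, and auxiliary variables (actions,
# proprioception) -- copying nearby features is impossible by construction.
print(visible[1:, masked_ids].sum())  # 0.0: masked objects carry no signal
```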

This creates a fundamentally different learning signal than patch masking:

Aspect               Patch masking           Object masking
What's hidden        Spatial region          Semantic entity
Prediction requires  Spatial interpolation   Interaction reasoning
Shortcut solutions   Copy nearby patches     Blocked (whole object missing)
Causal structure     Not induced             Induced via latent interventions

Causal inductive bias

C-JEPA provides a formal analysis showing that object-level masking induces latent interventions analogous to causal interventions:

  1. Masking an object removes its direct observability without changing the underlying dynamics
  2. The model must infer the masked object's state from how it influences and is influenced by other objects
  3. This creates time-lagged predictive dependencies aligned with causal influence
  4. The influence neighborhood N_t(i) — the minimal set of other objects needed to predict object i — captures the causal structure
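Step 4 can be made concrete with a toy definition. This sketch is illustrative only: `predict_error` is a hypothetical stand-in for a learned predictor's error, and the brute-force search is for clarity, not efficiency.

```python
from itertools import combinations

def influence_neighborhood(i, objects, predict_error, tol):
    """Smallest set of other objects sufficient to predict object i.

    Tries subsets in order of increasing size, so the first subset whose
    prediction error falls within `tol` is minimal by construction.
    """
    others = [j for j in objects if j != i]
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            if predict_error(i, set(subset)) <= tol:
                return set(subset)
    return set(others)

# Toy dynamics in which object 0 is driven only by objects 1 and 2:
parents = {0: {1, 2}}
err = lambda i, s: 0.0 if parents[i] <= s else 1.0
print(influence_neighborhood(0, [0, 1, 2, 3], err, tol=0.5))  # {1, 2}
```

Under this toy error function the recovered neighborhood is exactly the set of causal parents, which is the alignment the formal analysis claims.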

Counterfactual reasoning results

This causal structure has measurable effects. On CLEVRER visual QA:

Model                             Overall accuracy  Counterfactual accuracy
C-JEPA (4-object masking)         89.40%            68.81%
Same architecture, no masking     82.79%            47.68%
SlotFormer (with reconstruction)  79.44%            47.29%

The improvement of roughly 21 percentage points in counterfactual accuracy suggests that object-level masking genuinely induces causal understanding, not just better pattern matching.

Planning efficiency

Object-centric representations enable dramatically more efficient planning:

Model              Token budget             Push-T success  Planning time (50 traj)
DINO-WM (patches)  196 x 384                91.33%          5,763s
C-JEPA (objects)   6 x 128 (~1% of above)   88.67%          673s

Nearly the same task performance with 100x fewer features and 8x faster planning.

Masking budget

The optimal number of objects to mask depends on the encoder and task:

  • With VideoSAUR (high-quality slots): masking all objects (4/4) works best for reasoning
  • With SAVi (lower-quality slots): masking 2/4 objects is optimal
  • Over-masking with weak encoders eliminates too much information

C-JEPA also found that object-level masking outperforms token-level and tube-level masking of equivalent budget — the object boundary is the right abstraction level.

See also