MC-JEPA
An exploratory extension of JEPA that jointly learns optical flow (motion) and content features within a shared encoder. Bridges the gap between motion estimation and semantic understanding.
Core idea
Traditional self-supervised methods learn either motion features (optical flow) or content features (object identity) — rarely both at once. MC-JEPA unifies them in a single architecture with a shared encoder that is trained on two objectives simultaneously:
- Optical flow estimation — a reconstruction objective that predicts pixel displacements between frames
- Content feature learning — a JEPA-style embedding prediction objective that captures semantic structure
The hypothesis: motion estimation forces the encoder to track objects across time, while content learning forces it to understand what those objects are. The shared encoder benefits from both signals.
Architecture
- Shared ViT encoder processes video frames
- Two prediction heads: one for optical flow (reconstruction-based), one for content embeddings (JEPA-style)
- Joint loss: weighted combination of flow loss and embedding prediction loss
- EMA target encoder for the content prediction branch (same as I-JEPA)
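The training loop above can be sketched in a few lines. This is a toy numpy illustration, not the paper's implementation: the encoder and heads are stand-in linear maps, and the loss weights, dimensions, and function names (`joint_loss`, `ema_update`, `lam_flow`, `lam_content`) are illustrative assumptions. It only shows the shape of the joint objective (pixel-space flow loss + latent content-prediction loss) and the EMA target update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real networks: the shared encoder and the two
# prediction heads are modeled as random linear maps. All dimensions and
# weight values here are illustrative assumptions, not from the paper.
D_IN, D_FEAT = 64, 32
W_enc = rng.normal(0, 0.1, (D_IN, D_FEAT))        # shared encoder
W_flow = rng.normal(0, 0.1, (D_FEAT, 2))          # flow head -> (dx, dy) per patch
W_content = rng.normal(0, 0.1, (D_FEAT, D_FEAT))  # content prediction head
W_target = W_enc.copy()                           # EMA target encoder

def joint_loss(frame_t, frame_t1, flow_gt, lam_flow=1.0, lam_content=1.0):
    """Weighted sum of a reconstruction-style flow loss and a JEPA-style
    embedding prediction loss, both driven through the shared encoder."""
    z_t = frame_t @ W_enc                  # online features for frame t
    flow_pred = z_t @ W_flow               # flow branch: predict pixel displacements
    loss_flow = np.mean((flow_pred - flow_gt) ** 2)

    z_target = frame_t1 @ W_target         # target features (no gradient in practice)
    z_pred = z_t @ W_content               # content branch: predict next-frame embedding
    loss_content = np.mean((z_pred - z_target) ** 2)
    return lam_flow * loss_flow + lam_content * loss_content

def ema_update(W_online, W_tgt, momentum=0.99):
    """EMA target-encoder update for the content branch (same idea as I-JEPA)."""
    return momentum * W_tgt + (1 - momentum) * W_online

# One "step": 16 patch features per frame, random ground-truth flow.
frame_t = rng.normal(size=(16, D_IN))
frame_t1 = rng.normal(size=(16, D_IN))
flow_gt = rng.normal(size=(16, 2))
loss = joint_loss(frame_t, frame_t1, flow_gt)
W_target = ema_update(W_enc, W_target)
```

Because the two loss terms share `W_enc`, gradients from both objectives shape the same features — which is the mechanism behind the synergy claim in the Results section.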
Results
- Optical flow: competitive with unsupervised flow methods (RAFT-based baselines) on Sintel and KITTI benchmarks
- Semantic segmentation: competitive with DINO and MAE on DAVIS video segmentation and ADE20K image segmentation
- Key finding: the two objectives are synergistic — content features trained jointly with flow outperform content-only training, and vice versa
Limitations
MC-JEPA is a hybrid: the optical flow head is reconstruction-based (predicts pixels), which breaks the pure JEPA principle of non-generative latent prediction. Later work (V-JEPA) achieved strong motion understanding without any pixel-level objective, making MC-JEPA's hybrid approach less relevant.
Significance in the JEPA timeline
An exploratory step that asked the right question ("can JEPA learn dynamics?") but didn't find the cleanest answer. The insight that motion and content should be learned together carried forward; the specific mechanism (shared flow+content heads) was superseded by V-JEPA's pure feature prediction approach. MC-JEPA demonstrated that JEPA-style objectives could work on video data, establishing the direction that V-JEPA and V-JEPA 2 would take to its conclusion.
Links
See also
- 2301.08243 (I-JEPA) — the image foundation it builds on
- 2506.09985 (V-JEPA 2) — the full video scaling that followed