MC-JEPA
An exploratory extension of JEPA that jointly learns optical flow (motion) and content features within a shared encoder. Bridges the gap between motion estimation and semantic understanding.
Core idea
Traditional self-supervised methods learn either motion features (optical flow) or content features (object identity) — rarely both at once. MC-JEPA unifies them in a single architecture with a shared encoder that is trained on two objectives simultaneously:
- Optical flow estimation — a reconstruction objective that predicts pixel displacements between frames
- Content feature learning — a JEPA-style embedding prediction objective that captures semantic structure
The hypothesis: motion estimation forces the encoder to track objects across time, while content learning forces it to understand what those objects are. The shared encoder benefits from both signals.
Architecture
- Shared ViT encoder processes video frames
- Two prediction heads: one for optical flow (reconstruction-based), one for content embeddings (JEPA-style)
- Joint loss: weighted combination of flow loss and embedding prediction loss
- EMA target encoder for the content prediction branch (same as I-JEPA)
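The training loop above can be sketched in a few lines. This is a toy numpy illustration, not the paper's implementation: the encoder and heads are stand-in linear maps, and the loss weights, dimensions, and function names (`joint_loss`, `ema_update`, `lam_flow`, `lam_content`) are illustrative assumptions. It only shows the shape of the joint objective (pixel-space flow loss + latent content-prediction loss) and the EMA target update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real networks: the shared encoder and the two
# prediction heads are modeled as random linear maps. All dimensions and
# weight values here are illustrative assumptions, not from the paper.
D_IN, D_FEAT = 64, 32
W_enc = rng.normal(0, 0.1, (D_IN, D_FEAT))        # shared encoder
W_flow = rng.normal(0, 0.1, (D_FEAT, 2))          # flow head -> (dx, dy) per patch
W_content = rng.normal(0, 0.1, (D_FEAT, D_FEAT))  # content prediction head
W_target = W_enc.copy()                           # EMA target encoder

def joint_loss(frame_t, frame_t1, flow_gt, lam_flow=1.0, lam_content=1.0):
    """Weighted sum of a reconstruction-style flow loss and a JEPA-style
    embedding prediction loss, both driven through the shared encoder."""
    z_t = frame_t @ W_enc                  # online features for frame t
    flow_pred = z_t @ W_flow               # flow branch: predict pixel displacements
    loss_flow = np.mean((flow_pred - flow_gt) ** 2)

    z_target = frame_t1 @ W_target         # target features (no gradient in practice)
    z_pred = z_t @ W_content               # content branch: predict next-frame embedding
    loss_content = np.mean((z_pred - z_target) ** 2)
    return lam_flow * loss_flow + lam_content * loss_content

def ema_update(W_online, W_tgt, momentum=0.99):
    """EMA target-encoder update for the content branch (same idea as I-JEPA)."""
    return momentum * W_tgt + (1 - momentum) * W_online

# One "step": 16 patch features per frame, random ground-truth flow.
frame_t = rng.normal(size=(16, D_IN))
frame_t1 = rng.normal(size=(16, D_IN))
flow_gt = rng.normal(size=(16, 2))
loss = joint_loss(frame_t, frame_t1, flow_gt)
W_target = ema_update(W_enc, W_target)
```

Because the two loss terms share `W_enc`, gradients from both objectives shape the same features — which is the mechanism behind the synergy claim in the Results section.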
Results
- Optical flow: competitive with unsupervised flow methods (RAFT-based baselines) on Sintel and KITTI benchmarks
- Semantic segmentation: competitive with DINO and MAE on DAVIS video segmentation and ADE20K image segmentation
- Key finding: the two objectives are synergistic — content features trained jointly with flow outperform content-only training, and vice versa
Limitations
MC-JEPA is a hybrid: the optical flow head is reconstruction-based (predicts pixels), which breaks the pure JEPA principle of non-generative latent prediction. Later work (V-JEPA) achieved strong motion understanding without any pixel-level objective, making MC-JEPA's hybrid approach less relevant.
Significance in the JEPA timeline
An exploratory step that asked the right question ("can JEPA learn dynamics?") but didn't find the cleanest answer. The insight that motion and content should be learned together carried forward; the specific mechanism (shared flow+content heads) was superseded by V-JEPA's pure feature prediction approach. MC-JEPA demonstrated that JEPA-style objectives could work on video data, establishing the direction that V-JEPA and V-JEPA 2 would take to its conclusion.
Links
See also
- 2301.08243 (I-JEPA) — the image foundation it builds on
- 2506.09985 (V-JEPA 2) — the full video scaling that followed