JEPAwiki
MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features
Date: 2023-07-24
Modality: video
Authors: Adrien Bardes, Jean Ponce, Yann LeCun
Tags: optical-flow, motion, content-features, self-supervised-learning
Source: Abstract only

MC-JEPA

An exploratory extension of JEPA that jointly learns optical flow (motion) and content features within a shared encoder, bridging the gap between motion estimation and semantic understanding.

Core idea

Traditional self-supervised methods learn either motion features (optical flow) or content features (object identity) — never both. MC-JEPA unifies them in a single architecture with a shared encoder that is trained on two objectives simultaneously:

  1. Optical flow estimation — a reconstruction objective that predicts pixel displacements between frames
  2. Content feature learning — a JEPA-style embedding prediction objective that captures semantic structure
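The two objectives above can be sketched as a single training step. This is a toy illustration, not the paper's implementation: `encode`, `predict_flow`, and the weight `lam` are hypothetical stand-ins, and the losses are plain mean-squared errors on numpy arrays rather than real photometric and latent-prediction losses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins; in the real model both branches read the shared encoder.
def encode(frame):
    # Crude pooled "content embedding" (JEPA branch).
    return frame.mean(axis=0)

def predict_flow(f1, f2):
    # Toy per-pixel "displacement" proxy (flow branch).
    return f2 - f1

frame1 = rng.standard_normal((8, 8))
frame2 = frame1 + 0.1  # second frame: a small uniform shift in intensity

# 1) Optical flow objective: reconstruction error after applying the
#    predicted displacement to the first frame.
flow = predict_flow(frame1, frame2)
flow_loss = np.mean((frame1 + flow - frame2) ** 2)

# 2) Content objective: predict the target embedding from the context
#    embedding, entirely in latent space (no pixel reconstruction).
z_context, z_target = encode(frame1), encode(frame2)
content_loss = np.mean((z_context - z_target) ** 2)

# Joint loss: weighted combination of the two terms (lam is a
# hypothetical balancing weight, not a value from the paper).
lam = 0.5
joint_loss = flow_loss + lam * content_loss
```

Both losses backpropagate into the same shared encoder, which is how the two signals interact in training.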

The hypothesis: motion estimation forces the encoder to track objects across time, while content learning forces it to understand what those objects are. The shared encoder benefits from both signals.

Architecture

  • Shared ViT encoder processes video frames
  • Two prediction heads: one for optical flow (reconstruction-based), one for content embeddings (JEPA-style)
  • Joint loss: weighted combination of flow loss and embedding prediction loss
  • EMA target encoder for the content prediction branch (same as I-JEPA)
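The EMA target encoder in the last bullet can be sketched as follows, a minimal sketch assuming the same scheme as I-JEPA: target weights are a slow exponential moving average of the online encoder's weights, and receive no gradients. The function name and momentum value are illustrative, not from the paper.

```python
import numpy as np

def ema_update(online_params, target_params, momentum=0.996):
    """Move each target parameter a small step toward its online counterpart."""
    return [momentum * t + (1.0 - momentum) * o
            for o, t in zip(online_params, target_params)]

# Toy parameter lists standing in for encoder weights.
online = [np.ones((2, 2)), np.full(3, 2.0)]
target = [np.zeros((2, 2)), np.zeros(3)]

# One update: target drifts slowly toward the online encoder.
target = ema_update(online, target)
```

The content-prediction branch compares online-encoder outputs against this slowly moving target, which stabilizes the latent targets and helps avoid representational collapse.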

Results

  • Optical flow: competitive with unsupervised flow methods (RAFT-based baselines) on Sintel and KITTI benchmarks
  • Semantic segmentation: competitive with DINO and MAE on DAVIS video segmentation and ADE20K image segmentation
  • Key finding: the two objectives are synergistic — content features trained jointly with flow outperform content-only training, and vice versa

Limitations

MC-JEPA is a hybrid: the optical flow head is reconstruction-based (predicts pixels), which breaks the pure JEPA principle of non-generative latent prediction. Later work (V-JEPA) achieved strong motion understanding without any pixel-level objective, making MC-JEPA's hybrid approach less relevant.

Significance in the JEPA timeline

An exploratory step that asked the right question ("can JEPA learn dynamics?") but didn't find the cleanest answer. The insight that motion and content should be learned together carried forward; the specific mechanism (shared flow+content heads) was superseded by V-JEPA's pure feature prediction approach. MC-JEPA demonstrated that JEPA-style objectives could work on video data, establishing the direction that V-JEPA and V-JEPA 2 would take to its conclusion.

Links

See also

  • 2301.08243 (I-JEPA) — the image foundation it builds on
  • 2506.09985 (V-JEPA 2) — the full video scaling that followed