3D-JEPA: A Joint Embedding Predictive Architecture for 3D Self-Supervised Representation Learning
arXiv: 2409.15803
Date: 2024-09-24
Modality: 3D
Authors: Naiwen Hu, Haozhe Cheng, Yifan Xie, Shiqi Li, Jihua Zhu
Tags: 3D, point-cloud, self-supervised-learning, context-aware-decoder
Source: Full text
3D-JEPA
Broadens the 3D JEPA story beyond point clouds into more general 3D representation learning, with a novel context-aware decoder architecture.
Core idea
A non-generative 3D self-supervised framework that predicts target block representations from context blocks using an encoder and context-aware decoder. Key innovations:
- Multi-block sampling strategy: produces informative context blocks and representative target blocks
- Context-aware decoder: continuously feeds context information to the decoder, pushing the encoder to learn semantic modeling rather than memorizing context-target relationships
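The multi-block idea can be sketched at the index level. This is a hedged illustration, not the paper's procedure: the function name, the ratios, and the contiguous-block simplification are assumptions; 3D-JEPA actually samples blocks over 3D patch neighborhoods.

```python
import numpy as np

def multi_block_sample(num_patches, num_targets=4, target_ratio=0.15,
                       context_ratio=0.6, rng=None):
    """Hypothetical multi-block sampling sketch: draw several contiguous
    target blocks, then one larger context block with all target indices
    removed, so the predictor cannot see the patches it must predict."""
    rng = rng if rng is not None else np.random.default_rng(0)
    target_len = max(1, int(num_patches * target_ratio))
    targets = []
    for _ in range(num_targets):
        start = rng.integers(0, num_patches - target_len + 1)
        targets.append(np.arange(start, start + target_len))
    target_set = set(np.concatenate(targets).tolist())
    # Context block: a large contiguous span minus any target patch.
    ctx_len = int(num_patches * context_ratio)
    start = rng.integers(0, num_patches - ctx_len + 1)
    context = np.array([i for i in range(start, start + ctx_len)
                        if i not in target_set])
    return context, targets
```

The disjointness between context and targets is what forces prediction in representation space rather than copying.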
Key difference from Point-JEPA
While Point-JEPA focuses on sequencing and proximity for point clouds, 3D-JEPA emphasizes the decoder architecture and how context information flows during reconstruction. The context-aware decoder is the main contribution.
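One way to read "continuously feeds context information" is a decoder that attends to the context representations at every layer, instead of receiving them only at the input. A minimal NumPy sketch under that assumption (single-head attention, no normalization or MLP; all names are illustrative, not the paper's implementation):

```python
import numpy as np

def context_aware_decode(queries, context, weights):
    """Hypothetical context-aware decoder: each layer re-derives keys and
    values from the frozen context features, so context conditions every
    update rather than only the first one."""
    h = queries
    for (Wq, Wk, Wv) in weights:          # one (Wq, Wk, Wv) triple per layer
        q = h @ Wq                        # queries from the current state
        k = context @ Wk                  # keys/values always from context
        v = context @ Wv
        attn = q @ k.T / np.sqrt(k.shape[-1])
        attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)
        h = h + attn @ v                  # residual update conditioned on context
    return h
```

Re-injecting context at every layer means the decoder can always recover contextual detail on its own, so the encoder is not rewarded for memorizing context-target pairings and must carry the semantics instead.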
Results
- 88.65% accuracy on PB_T50_RS (the hardest ScanObjectNN variant) with only 150 pretraining epochs
- Higher accuracy with fewer pretraining epochs than competing methods
- Effective across various downstream tasks and datasets
Significance in the JEPA timeline
Together with Point-JEPA, establishes JEPA as a framework for full 3D semantics, not just images or video.
Links
See also
- 2404.16432 (Point-JEPA) — the other 3D branch
- masking-strategies — multi-block sampling