3D-JEPA: A Joint Embedding Predictive Architecture for 3D Self-Supervised Representation Learning
Date: 2024-09-24
Modality: 3D
Authors: Naiwen Hu, Haozhe Cheng, Yifan Xie, Shiqi Li, Jihua Zhu
Tags: 3D, point-cloud, self-supervised-learning, context-aware-decoder
Source: Full text

3D-JEPA

Broadens the 3D JEPA story beyond point clouds into more general 3D representation learning, with a novel context-aware decoder architecture.

Core idea

A non-generative 3D self-supervised framework that predicts target block representations from context blocks using an encoder and context-aware decoder. Key innovations:

  1. Multi-block sampling strategy: produces informative context blocks and representative target blocks
  2. Context-aware decoder: continuously feeds context information to the decoder, pushing the encoder to learn semantic modeling rather than memorizing context-target relationships
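The multi-block sampling idea can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: contiguous index runs stand in for spatially grouped point patches, and the function names and parameters (`sample_blocks`, `num_targets`, `target_size`) are assumptions for the sketch.

```python
import numpy as np

def sample_blocks(num_patches, num_targets=4, target_size=12, rng=None):
    """Sketch of a multi-block sampling strategy (hypothetical parameters).

    Patches are indexed 0..num_patches-1. A real implementation would group
    spatially neighboring point patches; here contiguous index runs stand in
    for spatial target blocks.
    """
    rng = rng or np.random.default_rng(0)
    taken = np.zeros(num_patches, dtype=bool)
    targets = []
    for _ in range(num_targets):
        start = rng.integers(0, num_patches - target_size + 1)
        block = np.arange(start, start + target_size)
        targets.append(block)
        taken[block] = True
    # The context block is everything the targets did not claim, so the
    # predictor never sees the representations it must predict.
    context = np.flatnonzero(~taken)
    return context, targets

context, targets = sample_blocks(128)
```

Excluding target indices from the context is what makes the prediction task non-trivial: the encoder must infer the targets' semantics rather than copy them.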

Key difference from Point-JEPA

While Point-JEPA focuses on sequencing and proximity for point clouds, 3D-JEPA emphasizes the decoder architecture and how context information flows during reconstruction. The context-aware decoder is the main contribution.
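One way to picture "continuously feeding context" is cross-attention that re-injects context features at every decoder layer, rather than only at the input. The following is a minimal single-head sketch under that assumption; `context_aware_decode` and its layer count are illustrative names, and projection weights are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Single-head scaled dot-product attention; projection matrices omitted.
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ context

def context_aware_decode(target_queries, context_feats, num_layers=3):
    """Hypothetical sketch of a context-aware decoder: context features are
    re-injected at every layer, so the decoder cannot rely on a fixed
    context-target mapping baked into its input alone."""
    h = target_queries
    for _ in range(num_layers):
        h = h + cross_attention(h, context_feats)  # fresh context each layer
    return h
```

Because the decoder can always look up the context, it has little incentive to memorize context-target pairings, which pushes the semantic burden onto the encoder.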

Results

  • 88.65% accuracy on PB_T50_RS (the hardest ScanObjectNN variant) with only 150 pretraining epochs
  • Higher accuracy with fewer pretraining epochs than competing methods
  • Effective across various downstream tasks and datasets

Significance in the JEPA timeline

Together with Point-JEPA, establishes JEPA as a framework for full 3D semantics, not just images or video.
