3D-JEPA: A Joint Embedding Predictive Architecture for 3D Self-Supervised Representation Learning
arXiv: 2409.15803
Date: 2024-09-24
Modality: 3D
Authors: Naiwen Hu, Haozhe Cheng, Yifan Xie, Shiqi Li, Jihua Zhu
Tags: 3D, point-cloud, self-supervised-learning, context-aware-decoder
Source: Full text
3D-JEPA
Broadens the 3D JEPA story beyond point clouds into more general 3D representation learning, with a novel context-aware decoder architecture.
Core idea
A non-generative 3D self-supervised framework that predicts target block representations from context blocks using an encoder and context-aware decoder. Key innovations:
- Multi-block sampling strategy: produces informative context blocks and representative target blocks
- Context-aware decoder: continuously feeds context information to the decoder, pushing the encoder to learn semantic modeling rather than memorizing context-target relationships
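The multi-block idea can be sketched at the index level. This is a hedged illustration, not the paper's procedure: the function name, the ratios, and the contiguous-block simplification are assumptions; 3D-JEPA actually samples blocks over 3D patch neighborhoods.

```python
import numpy as np

def multi_block_sample(num_patches, num_targets=4, target_ratio=0.15,
                       context_ratio=0.6, rng=None):
    """Hypothetical multi-block sampling sketch: draw several contiguous
    target blocks, then one larger context block with all target indices
    removed, so the predictor cannot see the patches it must predict."""
    rng = rng if rng is not None else np.random.default_rng(0)
    target_len = max(1, int(num_patches * target_ratio))
    targets = []
    for _ in range(num_targets):
        start = rng.integers(0, num_patches - target_len + 1)
        targets.append(np.arange(start, start + target_len))
    target_set = set(np.concatenate(targets).tolist())
    # Context block: a large contiguous span minus any target patch.
    ctx_len = int(num_patches * context_ratio)
    start = rng.integers(0, num_patches - ctx_len + 1)
    context = np.array([i for i in range(start, start + ctx_len)
                        if i not in target_set])
    return context, targets
```

The disjointness between context and targets is what forces prediction in representation space rather than copying.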
Key difference from Point-JEPA
While Point-JEPA focuses on sequencing and proximity for point clouds, 3D-JEPA emphasizes the decoder architecture and how context information flows during reconstruction. The context-aware decoder is the main contribution.
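One way to read "continuously feeds context information" is a decoder that attends to the context representations at every layer, instead of receiving them only at the input. A minimal NumPy sketch under that assumption (single-head attention, no normalization or MLP; all names are illustrative, not the paper's implementation):

```python
import numpy as np

def context_aware_decode(queries, context, weights):
    """Hypothetical context-aware decoder: each layer re-derives keys and
    values from the frozen context features, so context conditions every
    update rather than only the first one."""
    h = queries
    for (Wq, Wk, Wv) in weights:          # one (Wq, Wk, Wv) triple per layer
        q = h @ Wq                        # queries from the current state
        k = context @ Wk                  # keys/values always from context
        v = context @ Wv
        attn = q @ k.T / np.sqrt(k.shape[-1])
        attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)
        h = h + attn @ v                  # residual update conditioned on context
    return h
```

Re-injecting context at every layer means the decoder can always recover contextual detail on its own, so the encoder is not rewarded for memorizing context-target pairings and must carry the semantics instead.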
Results
- 88.65% accuracy on PB_T50_RS (the hardest ScanObjectNN variant) with only 150 pretraining epochs
- Higher accuracy with fewer pretraining epochs than competing methods
- Effective across various downstream tasks and datasets
Significance in the JEPA timeline
Together with Point-JEPA, establishes JEPA as a framework for full 3D semantics, not just images or video.
Links
See also
- 2404.16432 (Point-JEPA) — the other 3D branch
- masking-strategies — multi-block sampling