Point-JEPA: A Joint Embedding Predictive Architecture for Self-Supervised Learning on Point Cloud
arXiv: 2404.16432
Date: 2024-04-25
Modality: point-cloud
Authors: Ayumu Saito, Prachi Kudeshia, Jiju Poovvancheri
Tags: 3D, point-cloud, self-supervised-learning, few-shot
Source: Full text
Point-JEPA
Adapts JEPA specifically to point cloud data. Avoids raw-space reconstruction and shows that JEPA can work efficiently on geometric representations.
Core idea
Introduces a sequencer that orders point-cloud patch embeddings so that spatial proximity can be computed and exploited directly from sequence indices during target and context selection. Because both selections operate on the same sequenced order, their computation is shared, improving efficiency.
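A minimal sketch of what such a sequencer could look like: a greedy nearest-neighbor ordering of patch center points, so that adjacent sequence indices correspond to spatially close patches. The starting rule (nearest to the centroid) and the function name are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def sequence_patch_centers(centers: np.ndarray) -> np.ndarray:
    """Greedy nearest-neighbor ordering of patch centers (N, 3).

    Starts from the center nearest the point-cloud centroid (an
    assumption; the paper's exact starting rule may differ) and
    repeatedly appends the nearest unvisited center, so neighboring
    indices end up spatially close.
    """
    n = centers.shape[0]
    order = np.empty(n, dtype=np.int64)
    visited = np.zeros(n, dtype=bool)
    # start from the center closest to the centroid
    cur = int(np.argmin(np.linalg.norm(centers - centers.mean(0), axis=1)))
    for i in range(n):
        order[i] = cur
        visited[cur] = True
        d = np.linalg.norm(centers - centers[cur], axis=1)
        d[visited] = np.inf  # exclude already-sequenced patches
        if i < n - 1:
            cur = int(np.argmin(d))
    return order
```

With patches laid out along a line, e.g. centers at x = 0, 1, 2, 3, the returned order walks outward from the middle, keeping consecutive indices adjacent in space.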
Key design choices
- No reconstruction in input space: predictions happen entirely in latent space, following the JEPA principle
- No additional modalities required: unlike some 3D SSL methods that need images or text
- Proximity-based masking: the sequencer enables spatial-aware context/target selection for point clouds
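Given a sequenced order, proximity-based context/target selection reduces to picking contiguous index runs. The sketch below illustrates this idea; the block count, block length, and function name are placeholders, not the paper's settings.

```python
import numpy as np

def select_targets_and_context(order, num_targets=2, target_len=4, rng=None):
    """Pick target blocks as contiguous runs in the sequenced order
    (each run is therefore spatially local) and use the remaining
    indices as context. Hyperparameters here are illustrative only.
    """
    rng = rng or np.random.default_rng(0)
    n = len(order)
    target_mask = np.zeros(n, dtype=bool)
    targets = []
    for _ in range(num_targets):
        start = int(rng.integers(0, n - target_len + 1))
        block = order[start:start + target_len]  # contiguous run = local patch group
        target_mask[start:start + target_len] = True
        targets.append(block)
    context = order[~target_mask]  # everything not covered by a target block
    return targets, context
```

Because both targets and context are carved out of the same precomputed order, no per-selection neighbor search is needed, which is the shared-computation benefit the sequencer provides.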
Results
- 93.7% classification accuracy with a linear SVM on ModelNet40, surpassing other self-supervised models
- State-of-the-art across all four few-shot learning evaluation frameworks
- Code: github.com/Ayumu-J-S/Point-JEPA
Significance in the JEPA timeline
One of the key 3D branches. Shows that JEPA is not vision-only: it generalizes to geometric representations. Together with 3D-JEPA, it establishes JEPA in the 3D domain.
Links
See also
- 2409.15803 (3D-JEPA) — broader 3D representation learning
- 2301.08243 (I-JEPA) — the image architecture it adapts
- masking-strategies — sequencer-based masking for point clouds