ACT-JEPA
The clearest bridge from JEPA to action and policy learning. It integrates imitation learning (IL) with self-supervised learning (SSL) to enhance policy representations.
Core idea
Train a policy to predict two things simultaneously:
- Action sequences — using action chunking to improve prediction and reduce compounding errors
- Abstract observation sequences — extending the chunking idea to predict future observations in JEPA's abstract representation space
By predicting in abstract representation space, the model filters out irrelevant details, improves efficiency, and develops a robust world model.
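The dual objective can be sketched as a sum of two losses: a behavior-cloning loss on an action chunk, and a JEPA-style latent-prediction loss whose targets come from a separate (frozen or EMA) target encoder rather than raw pixels. This is a toy, dependency-free sketch, not the paper's architecture; `encode`, the copy "predictor", and `act_jepa_loss` are hypothetical stand-ins for its transformer modules.

```python
def mse(a, b):
    # mean squared error between two equal-length vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def encode(obs, w):
    # toy encoder: elementwise scaling stands in for a learned network
    return [w * x for x in obs]

def act_jepa_loss(obs_t, future_obs, expert_chunk, w_online=1.0, w_target=1.0):
    """Dual objective: predict an action chunk AND future observations
    in abstract latent space, from the current observation."""
    z_t = encode(obs_t, w_online)

    # 1) action chunking: predict a whole sequence of future actions at once;
    #    here the "policy head" trivially copies the latent
    pred_chunk = [z_t for _ in expert_chunk]
    action_loss = sum(mse(p, a) for p, a in zip(pred_chunk, expert_chunk)) / len(expert_chunk)

    # 2) latent observation prediction: targets come from a target encoder,
    #    so the model predicts abstract features, not pixels
    targets = [encode(o, w_target) for o in future_obs]
    pred_latents = [z_t for _ in future_obs]  # toy predictor
    obs_loss = sum(mse(p, t) for p, t in zip(pred_latents, targets)) / len(future_obs)

    return action_loss + obs_loss
```

The point of the structure is that both losses share the same online encoder, so the observation-prediction term shapes the representation the action head consumes.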
Key contributions
- Action chunking + observation chunking: dual prediction objectives that reinforce each other
- World model quality: the abstract observation prediction objective produces representations that capture temporal environment dynamics
- Generalization: representations from observation prediction transfer effectively to action prediction
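Why chunking reduces compounding errors can be shown with a toy inference loop (hypothetical helper names, not from the paper): the policy is queried once per chunk of k actions, so prediction errors can only compound at 1/k as many decision points over the same horizon.

```python
def rollout(policy, horizon, chunk_size):
    """Execute `horizon` steps, querying `policy` once per chunk of actions."""
    actions, queries = [], 0
    while len(actions) < horizon:
        chunk = policy(chunk_size)  # predict chunk_size actions at once
        queries += 1
        actions.extend(chunk[: horizon - len(actions)])
    return actions, queries

dummy_policy = lambda k: [0.0] * k  # placeholder policy

_, q_single = rollout(dummy_policy, horizon=100, chunk_size=1)
_, q_chunked = rollout(dummy_policy, horizon=100, chunk_size=10)
# chunking cuts policy queries over the same horizon from 100 to 10
```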
Results
ACT-JEPA performs on par with established baselines (ACT, Diffusion Policy) across decision-making tasks in simulation. The paper's contribution is conceptual rather than SOTA-setting: it demonstrates that adding a JEPA-style observation prediction objective to an imitation learning pipeline improves representation quality without hurting action prediction performance.
Honest assessment: "on par" means ACT-JEPA does not beat its baselines. The value proposition is architectural — showing that JEPA objectives and policy learning can coexist — not empirical superiority. This is a proof-of-concept, not a production system.
Limitations
- Only evaluated in simulation, not on real robots
- Requires expert demonstrations (inherits IL limitations)
- Does not demonstrate the JEPA world model can be used for planning (only for representation learning)
- The "views" for JEPA (action vs observation) are specific to this setup and don't generalize to arbitrary tasks
Significance in the JEPA timeline
ACT-JEPA is where JEPA first encounters action and policy learning. It asks the right question ("can JEPA objectives improve policy representations?") and shows the answer is yes, even if the improvements are modest. The architectural pattern — dual prediction of actions and latent observations — carried forward into V-JEPA 2's more ambitious action-conditioned world model, where the JEPA predictor is used directly for planning rather than just representation learning.
Links
See also
- 2506.09985 (V-JEPA 2) — scales this to real robot planning
- 2603.22281 (ThinkJEPA) — adds VLM reasoning to the action loop