ACT-JEPA
The clearest bridge from JEPA to action and policy learning. It integrates imitation learning (IL) with self-supervised learning (SSL) to enhance policy representations.
Core idea
Train a policy to predict two things simultaneously:
- Action sequences — using action chunking to improve prediction and reduce compounding errors
- Abstract observation sequences — extending the chunking idea to predict future observations in JEPA's abstract representation space
By predicting in abstract representation space, the model filters out irrelevant details, improves efficiency, and develops a robust world model.
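The dual objective can be sketched as a sum of two losses: a behavior-cloning loss on an action chunk, and a JEPA-style latent-prediction loss whose targets come from a separate (frozen or EMA) target encoder rather than raw pixels. This is a toy, dependency-free sketch, not the paper's architecture; `encode`, the copy "predictor", and `act_jepa_loss` are hypothetical stand-ins for its transformer modules.

```python
def mse(a, b):
    # mean squared error between two equal-length vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def encode(obs, w):
    # toy encoder: elementwise scaling stands in for a learned network
    return [w * x for x in obs]

def act_jepa_loss(obs_t, future_obs, expert_chunk, w_online=1.0, w_target=1.0):
    """Dual objective: predict an action chunk AND future observations
    in abstract latent space, from the current observation."""
    z_t = encode(obs_t, w_online)

    # 1) action chunking: predict a whole sequence of future actions at once;
    #    here the "policy head" trivially copies the latent
    pred_chunk = [z_t for _ in expert_chunk]
    action_loss = sum(mse(p, a) for p, a in zip(pred_chunk, expert_chunk)) / len(expert_chunk)

    # 2) latent observation prediction: targets come from a target encoder,
    #    so the model predicts abstract features, not pixels
    targets = [encode(o, w_target) for o in future_obs]
    pred_latents = [z_t for _ in future_obs]  # toy predictor
    obs_loss = sum(mse(p, t) for p, t in zip(pred_latents, targets)) / len(future_obs)

    return action_loss + obs_loss
```

The point of the structure is that both losses share the same online encoder, so the observation-prediction term shapes the representation the action head consumes.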
Key contributions
- Action chunking + observation chunking: dual prediction objectives that reinforce each other
- World model quality: the abstract observation prediction objective produces representations that capture temporal environment dynamics
- Generalization: representations from observation prediction transfer effectively to action prediction
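Why chunking reduces compounding errors can be shown with a toy inference loop (hypothetical helper names, not from the paper): the policy is queried once per chunk of k actions, so prediction errors can only compound at 1/k as many decision points over the same horizon.

```python
def rollout(policy, horizon, chunk_size):
    """Execute `horizon` steps, querying `policy` once per chunk of actions."""
    actions, queries = [], 0
    while len(actions) < horizon:
        chunk = policy(chunk_size)  # predict chunk_size actions at once
        queries += 1
        actions.extend(chunk[: horizon - len(actions)])
    return actions, queries

dummy_policy = lambda k: [0.0] * k  # placeholder policy

_, q_single = rollout(dummy_policy, horizon=100, chunk_size=1)
_, q_chunked = rollout(dummy_policy, horizon=100, chunk_size=10)
# chunking cuts policy queries over the same horizon from 100 to 10
```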
Results
ACT-JEPA performs on par with established baselines (ACT, Diffusion Policy) across decision-making tasks in simulation. The paper's contribution is conceptual rather than SOTA-setting: it demonstrates that adding a JEPA-style observation prediction objective to an imitation learning pipeline improves representation quality without hurting action prediction performance.
Honest assessment: "on par" means ACT-JEPA does not beat its baselines. The value proposition is architectural — showing that JEPA objectives and policy learning can coexist — not empirical superiority. This is a proof-of-concept, not a production system.
Limitations
- Only evaluated in simulation, not on real robots
- Requires expert demonstrations (inherits IL limitations)
- Does not demonstrate the JEPA world model can be used for planning (only for representation learning)
- The "views" for JEPA (action vs observation) are specific to this setup and don't generalize to arbitrary tasks
Significance in the JEPA timeline
ACT-JEPA is where JEPA first encounters action and policy learning. It asks the right question ("can JEPA objectives improve policy representations?") and shows the answer is yes, even if the improvements are modest. The architectural pattern — dual prediction of actions and latent observations — carried forward into V-JEPA 2's more ambitious action-conditioned world model, where the JEPA predictor is used directly for planning rather than just representation learning.
Links
See also
- 2506.09985 (V-JEPA 2) — scales this to real robot planning
- 2603.22281 (ThinkJEPA) — adds VLM reasoning to the action loop