Simple Self-Distillation (SSD)
An Apple research paper showing that LLMs can substantially improve at code generation by fine-tuning on their own sampled outputs — no verifier, no teacher model, no RL. The simplicity of the method and the quality of the gains illuminate a deep connection to self-supervised representation learning.
Core idea
Simple Self-Distillation (SSD): sample solutions from the model at a specified temperature and truncation, then fine-tune on those raw, unverified samples via standard cross-entropy loss. That's it.
- No human-labeled solutions
- No reference answers
- No teacher model
- No reward model or verifier
- No execution environment
- No reinforcement learning
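The recipe's two mechanical pieces — temperature-and-truncation sampling, then plain cross-entropy on the sampled tokens — can be sketched in a few lines. This is an illustrative numpy sketch, not the paper's code; the nucleus (top-p) truncation and the specific settings are assumptions for illustration:

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_p=0.9, rng=None):
    """Sample one token id with temperature scaling and nucleus (top-p)
    truncation -- a stand-in for the paper's 'specified temperature and
    truncation' (exact sampler settings are assumed, not published here)."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Keep the smallest set of top tokens whose cumulative mass >= top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=kept_probs))

def cross_entropy(logits, target_id):
    """Standard next-token cross-entropy, applied to the model's own
    sampled token -- no verifier or reward enters the loss."""
    scaled = logits - logits.max()
    log_probs = scaled - np.log(np.exp(scaled).sum())
    return -log_probs[target_id]
```

The loop over a real model would be: sample full solutions with `sample_token`, then fine-tune by minimizing `cross_entropy` on those samples as if they were ground-truth data.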
Key results
| Model | Before SSD | After SSD | Improvement |
|---|---|---|---|
| Qwen3-30B-Instruct | 42.4% pass@1 | 55.3% pass@1 | +30% relative |
| Qwen3-4B-Instruct | lower | improved | consistent gains |
| Llama models (4B, 8B) | baseline | improved | generalizes across families |
- Gains concentrate on harder problems (hard pass@5: 31.1% → 54.1%)
- Works across 5 models, 2 families (Qwen, Llama), 3 scales (4B, 8B, 30B)
- Works on both instruct and thinking variants
The precision-exploration conflict
The paper's most interesting contribution is explaining why SSD works. Code generation has two types of positions:
- Fork positions: multiple genuinely plausible continuations (different solution approaches). Need high temperature for diversity.
- Lock positions: syntax/semantics leave little ambiguity, but distractors exist. Need low temperature for precision.
Any single decoding temperature is a compromise between these two needs. SSD reshapes the model's distributions in a context-dependent way: it suppresses distractor tails at lock positions while preserving diversity at fork positions. Changing decoding temperature alone cannot achieve this — it's a global knob, while the improvement is local.
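A toy numeric illustration of why the global knob fails (the logits below are invented for illustration, not taken from the paper): cooling the distribution sharpens the lock position but also collapses fork diversity, while heating does the reverse.

```python
import numpy as np

def softmax(logits, temperature):
    """Temperature-scaled softmax -- the single global knob SSD avoids."""
    z = logits / temperature
    p = np.exp(z - z.max())
    return p / p.sum()

def entropy(p):
    """Shannon entropy in nats; higher = more diverse sampling."""
    return float(-(p * np.log(p + 1e-12)).sum())

# Invented logits: a lock position (one correct token, three distractors)
# and a fork position (three genuinely plausible solution approaches).
lock_logits = np.array([4.0, 2.0, 2.0, 2.0])
fork_logits = np.array([3.0, 2.5, 2.0])

for T in (0.5, 1.0, 1.5):
    distractor_mass = 1.0 - softmax(lock_logits, T)[0]
    fork_diversity = entropy(softmax(fork_logits, T))
    print(f"T={T}: lock distractor mass={distractor_mass:.2f}, "
          f"fork entropy={fork_diversity:.2f} nats")
# Cooling (T=0.5) suppresses the lock distractors but also collapses fork
# diversity; heating (T=1.5) preserves the fork but fattens the lock tail.
# SSD instead reshapes each position's distribution in context.
```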
Connection to JEPA and self-supervised learning
SSD is not a JEPA method, but it connects to several JEPA themes:
Self-distillation ↔ EMA teacher
JEPA's EMA target encoder is itself a form of self-distillation: the model produces its own training targets via a slowly updated copy of itself. SSD does something analogous: the model generates its own training data. Both exploit the principle that a model's own outputs, fed back as training targets, carry signal the model has not yet internalized through its standard training objective.
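For concreteness, the EMA target update used by JEPA-style methods is just a slow moving average of the student's parameters (a minimal sketch; the decay value is illustrative, and JEPA variants typically schedule it toward 1.0 over training):

```python
def ema_update(teacher_params, student_params, decay=0.996):
    """JEPA-style exponential-moving-average target update: the teacher
    (target encoder) is a slow copy of the student, so the model supplies
    its own prediction targets. decay=0.996 is an illustrative default."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]

# After each student step, the teacher drifts a small fraction toward it:
teacher = ema_update(teacher_params=[1.0], student_params=[0.0], decay=0.9)
# teacher is now [0.9]
```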
Unrealized capability
SSD's key finding — that "existing code models contain capability not realized under fixed decoding alone" — echoes a core JEPA insight. JEPA argues that autoregressive LLMs waste capacity modeling unpredictable surface variation (the capacity waste argument). SSD provides evidence from the other direction: the capability is there, but the decoding interface (next-token sampling) fails to extract it. JEPA's response is to bypass the decoding interface entirely by predicting in latent space.
Representation reshaping without external signal
Both SSD and JEPA improve representations without external supervision. JEPA uses masking + latent prediction. SSD uses temperature-shifted self-sampling + fine-tuning. The common thread: the structure of the data provides its own supervision — you don't need human labels, rewards, or a stronger teacher.
Context-dependent compression
SSD's "support compression" at lock positions and "diversity preservation" at fork positions mirrors JEPA's core promise: abstract away the unpredictable (noise, irrelevant detail) while preserving what's predictable and useful. SSD achieves this within token space; JEPA achieves it by moving to latent space entirely.
Limitations
- Only evaluated on code generation — unclear if SSD transfers to other LLM tasks
- Requires careful temperature and truncation tuning (T_train matters)
- Still operates within the autoregressive paradigm — doesn't address the fundamental sequential/generative limitations JEPA critiques
- Fine-tuning on own outputs risks mode collapse over multiple iterations
See also
- latent-prediction — JEPA's argument for why token-space prediction wastes capacity
- collapse-prevention — EMA as self-distillation in the JEPA context
- 2509.14252 (LLM-JEPA) — applying JEPA objectives to LLMs directly
- jepa-vs-alternatives — the broader autoregressive vs JEPA comparison