Simple Self-Distillation (SSD)
An Apple research paper showing that LLMs can substantially improve at code generation by fine-tuning on their own sampled outputs — no verifier, no teacher model, no RL. The simplicity of the method and the quality of the gains illuminate a deep connection to self-supervised representation learning.
Core idea
Simple Self-Distillation (SSD): sample solutions from the model at a specified temperature and truncation, then fine-tune on those raw, unverified samples via standard cross-entropy loss. That's it.
- No human-labeled solutions
- No reference answers
- No teacher model
- No reward model or verifier
- No execution environment
- No reinforcement learning
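The recipe's two mechanical pieces — temperature-and-truncation sampling, then plain cross-entropy on the sampled tokens — can be sketched in a few lines. This is an illustrative numpy sketch, not the paper's code; the nucleus (top-p) truncation and the specific settings are assumptions for illustration:

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_p=0.9, rng=None):
    """Sample one token id with temperature scaling and nucleus (top-p)
    truncation -- a stand-in for the paper's 'specified temperature and
    truncation' (exact sampler settings are assumed, not published here)."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Keep the smallest set of top tokens whose cumulative mass >= top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=kept_probs))

def cross_entropy(logits, target_id):
    """Standard next-token cross-entropy, applied to the model's own
    sampled token -- no verifier or reward enters the loss."""
    scaled = logits - logits.max()
    log_probs = scaled - np.log(np.exp(scaled).sum())
    return -log_probs[target_id]
```

The loop over a real model would be: sample full solutions with `sample_token`, then fine-tune by minimizing `cross_entropy` on those samples as if they were ground-truth data.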
Key results
| Model | Before SSD | After SSD | Improvement |
|---|---|---|---|
| Qwen3-30B-Instruct | 42.4% pass@1 | 55.3% pass@1 | +30% relative |
| Qwen3-4B-Instruct | lower | improved | consistent gains |
| Llama models (4B, 8B) | baseline | improved | generalizes across families |
- Gains concentrate on harder problems (hard pass@5: 31.1% → 54.1%)
- Works across 5 models, 2 families (Qwen, Llama), 3 scales (4B, 8B, 30B)
- Works on both instruct and thinking variants
The precision-exploration conflict
The paper's most interesting contribution is explaining why SSD works. Code generation has two types of positions:
- Fork positions: multiple genuinely plausible continuations (different solution approaches). Need high temperature for diversity.
- Lock positions: syntax/semantics leave little ambiguity, but distractors exist. Need low temperature for precision.
Any single decoding temperature is a compromise between these two needs. SSD reshapes the model's distributions in a context-dependent way: it suppresses distractor tails at lock positions while preserving diversity at fork positions. Changing decoding temperature alone cannot achieve this — it's a global knob, while the improvement is local.
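A toy numeric illustration of why the global knob fails (the logits below are invented for illustration, not taken from the paper): cooling the distribution sharpens the lock position but also collapses fork diversity, while heating does the reverse.

```python
import numpy as np

def softmax(logits, temperature):
    """Temperature-scaled softmax -- the single global knob SSD avoids."""
    z = logits / temperature
    p = np.exp(z - z.max())
    return p / p.sum()

def entropy(p):
    """Shannon entropy in nats; higher = more diverse sampling."""
    return float(-(p * np.log(p + 1e-12)).sum())

# Invented logits: a lock position (one correct token, three distractors)
# and a fork position (three genuinely plausible solution approaches).
lock_logits = np.array([4.0, 2.0, 2.0, 2.0])
fork_logits = np.array([3.0, 2.5, 2.0])

for T in (0.5, 1.0, 1.5):
    distractor_mass = 1.0 - softmax(lock_logits, T)[0]
    fork_diversity = entropy(softmax(fork_logits, T))
    print(f"T={T}: lock distractor mass={distractor_mass:.2f}, "
          f"fork entropy={fork_diversity:.2f} nats")
# Cooling (T=0.5) suppresses the lock distractors but also collapses fork
# diversity; heating (T=1.5) preserves the fork but fattens the lock tail.
# SSD instead reshapes each position's distribution in context.
```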
Connection to JEPA and self-supervised learning
SSD is not a JEPA method, but it connects to several JEPA themes:
Self-distillation ↔ EMA teacher
JEPA's EMA target encoder is itself a form of self-distillation: the model produces its own training targets via a slowly updated copy of itself. SSD does something analogous: the model generates its own training data. Both exploit the principle that a model's own outputs, fed back as training targets, carry signal the model has not yet internalized through its standard training objective.
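For concreteness, the EMA target update used by JEPA-style methods is just a slow moving average of the student's parameters (a minimal sketch; the decay value is illustrative, and JEPA variants typically schedule it toward 1.0 over training):

```python
def ema_update(teacher_params, student_params, decay=0.996):
    """JEPA-style exponential-moving-average target update: the teacher
    (target encoder) is a slow copy of the student, so the model supplies
    its own prediction targets. decay=0.996 is an illustrative default."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]

# After each student step, the teacher drifts a small fraction toward it:
teacher = ema_update(teacher_params=[1.0], student_params=[0.0], decay=0.9)
# teacher is now [0.9]
```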
Unrealized capability
SSD's key finding — that "existing code models contain capability not realized under fixed decoding alone" — echoes a core JEPA insight. JEPA argues that autoregressive LLMs waste capacity modeling unpredictable surface variation (the capacity waste argument). SSD provides evidence from the other direction: the capability is there, but the decoding interface (next-token sampling) fails to extract it. JEPA's response is to bypass the decoding interface entirely by predicting in latent space.
Representation reshaping without external signal
Both SSD and JEPA improve representations without external supervision. JEPA uses masking + latent prediction. SSD uses temperature-shifted self-sampling + fine-tuning. The common thread: the structure of the data provides its own supervision — you don't need human labels, rewards, or a stronger teacher.
Context-dependent compression
SSD's "support compression" at lock positions and "diversity preservation" at fork positions mirrors JEPA's core promise: abstract away the unpredictable (noise, irrelevant detail) while preserving what's predictable and useful. SSD achieves this within token space; JEPA achieves it by moving to latent space entirely.
Limitations
- Only evaluated on code generation — unclear if SSD transfers to other LLM tasks
- Requires careful temperature and truncation tuning (T_train matters)
- Still operates within the autoregressive paradigm — doesn't address the fundamental sequential/generative limitations JEPA critiques
- Fine-tuning on own outputs risks mode collapse over multiple iterations
See also
- latent-prediction — JEPA's argument for why token-space prediction wastes capacity
- collapse-prevention — EMA as self-distillation in the JEPA context
- 2509.14252 (LLM-JEPA) — applying JEPA objectives to LLMs directly
- jepa-vs-alternatives — the broader autoregressive vs JEPA comparison