Understanding LLM Distillation Techniques
Modern large language models are not trained solely on raw web text. Increasingly, companies are using powerful “teacher” models to help train smaller or more efficient “student” models. This process, broadly referred to as LLM distillation or model-to-model training, has become a key technique for building high-performing models at lower computational cost. Meta used its massive Llama 4 Behemoth model to help train Llama 4 Scout and Maverick, while Google leveraged Gemini models during the development of Gemma 2 and Gemma 3. Similarly, DeepSeek distilled reasoning capabilities from DeepSeek-R1 into smaller Qwen and Llama-based models.
The core idea is simple: instead of learning only from human-written text, a student model can also learn from the outputs, probabilities, reasoning traces, or behaviors of another LLM. This allows smaller models to inherit capabilities such as reasoning, instruction following, and structured generation from much larger systems. Distillation can happen during pre-training, where teacher and student models are trained together, or during post-training, where a fully trained teacher transfers knowledge to a separate student model.
In this article, we will explore three main approaches to training one LLM with another: soft-label distillation, where the student learns from the teacher’s probability distributions; hard-label distillation, where the student imitates the teacher’s generated outputs; and co-distillation, where multiple models learn collaboratively by sharing predictions and behaviors during training.

Soft-Label Distillation
Soft-label distillation is a training technique in which a smaller student LLM learns by imitating the output probability distribution of a larger teacher LLM. Instead of training only on the correct next token, the student is trained to match the teacher’s softmax probabilities across the entire vocabulary. For example, if the teacher predicts the next token with probabilities like “cat” = 70%, “dog” = 20%, and “animal” = 10%, the student learns not just the final answer but also the relationships and uncertainty between different tokens. This richer signal is often called the teacher’s “dark knowledge” because it contains hidden information about reasoning patterns and semantic understanding.
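To make the difference concrete, here is a tiny PyTorch sketch (not from the original article) that reuses the toy “cat” / “dog” / “animal” distribution above. It compares the loss a student would see from a plain one-hot target with the KL-divergence loss it would see from the teacher’s soft target; the student probabilities are invented purely for illustration.

```python
import torch
import torch.nn.functional as F

# Toy teacher distribution from the example above: "cat", "dog", "animal"
teacher_probs = torch.tensor([0.70, 0.20, 0.10])

# A hypothetical, not-yet-trained student distribution over the same tokens
student_probs = torch.tensor([0.40, 0.35, 0.25])

# One-hot target: only "cat" is marked correct
one_hot_target = torch.tensor([1.0, 0.0, 0.0])

# Hard-label loss ignores everything except the probability assigned to "cat"
hard_loss = -(one_hot_target * student_probs.log()).sum()

# Soft-label loss (KL divergence) also penalizes getting the relative
# plausibility of "dog" vs. "animal" wrong
soft_loss = F.kl_div(student_probs.log(), teacher_probs, reduction="sum")

print(f"hard-label loss: {hard_loss.item():.4f}")
print(f"soft-label KL  : {soft_loss.item():.4f}")
```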
The biggest advantage of soft-label distillation is that it allows smaller models to inherit capabilities from much larger models while remaining faster and cheaper to deploy. Because the student learns from the teacher’s full probability distribution, training is more stable and informative than learning from hard one-hot targets alone. However, the method also comes with practical challenges. Generating soft labels requires access to the teacher model’s logits or weights, which is often not possible with closed-source models. In addition, storing probability distributions over vocabularies of 100k+ tokens for every training position becomes extremely memory-intensive at LLM scale, making pure soft-label distillation expensive for trillion-token datasets.
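When the teacher’s logits are available, the soft-label objective is usually implemented as a temperature-scaled KL divergence blended with ordinary cross-entropy. The snippet below is a minimal PyTorch sketch under that assumption; the temperature, the mixing weight `alpha`, and the T² scaling (from Hinton et al.’s original distillation formulation) are illustrative choices, not values taken from the article.

```python
import torch
import torch.nn.functional as F

def soft_label_distillation_loss(student_logits, teacher_logits, labels,
                                 temperature=2.0, alpha=0.5):
    """Minimal sketch of a soft-label distillation loss.

    student_logits, teacher_logits: [batch, seq_len, vocab_size]
    labels: [batch, seq_len] ground-truth token ids
    temperature and alpha are illustrative hyperparameters.
    """
    vocab = student_logits.size(-1)

    # Soften both distributions with a temperature before comparing them
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)

    # Per-token KL divergence between teacher and student, scaled by T^2
    kd_loss = F.kl_div(
        s_log_probs.reshape(-1, vocab),
        t_probs.reshape(-1, vocab),
        reduction="batchmean",
    ) * temperature ** 2

    # Standard next-token cross-entropy on the ground-truth labels
    ce_loss = F.cross_entropy(student_logits.reshape(-1, vocab), labels.reshape(-1))

    # Blend the two signals; alpha trades off imitation vs. ground truth
    return alpha * kd_loss + (1 - alpha) * ce_loss
```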

Hard-Label Distillation
Hard-label distillation is a simpler approach in which the student LLM learns only from the teacher model’s final predicted output instead of its full probability distribution. In this setup, a pre-trained teacher model generates the most likely next token or response, and the student model is trained with standard supervised learning to reproduce that output. The teacher essentially acts as a high-quality annotator that creates synthetic training data for the student. DeepSeek used this approach to distill reasoning capabilities from DeepSeek-R1 into smaller Qwen and Llama 3.1 models.
Unlike soft-label distillation, the student never sees the teacher’s internal confidence scores or token relationships; it only learns the final answer. This makes hard-label distillation computationally much cheaper and easier to implement, since there is no need to store huge probability distributions for every token. It is also especially useful when working with proprietary “black-box” models such as GPT-4 APIs, where developers only have access to generated text and not the underlying logits. While hard labels carry less information than soft labels, they remain highly effective for instruction tuning, reasoning datasets, synthetic data generation, and domain-specific fine-tuning tasks.
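In practice, hard-label distillation is a two-step pipeline: the teacher generates responses, and the student is fine-tuned on the resulting (prompt, response) pairs with ordinary supervised learning. The sketch below assumes the Hugging Face transformers library and uses placeholder model names and prompts; it only shows the data-generation half, with the student fine-tuning left to any standard SFT loop.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint names; substitute any real teacher/student pair.
teacher_name = "org/large-teacher-model"
student_name = "org/small-student-model"

teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name)

prompts = [
    "Explain why the sky is blue in two sentences.",
    "Solve: if 3x + 5 = 20, what is x?",
]

# Step 1: the teacher acts as an annotator, producing hard labels
# (final text only, no logits are ever stored)
synthetic_data = []
for prompt in prompts:
    inputs = teacher_tok(prompt, return_tensors="pt")
    output_ids = teacher.generate(**inputs, max_new_tokens=256)
    answer = teacher_tok.decode(output_ids[0], skip_special_tokens=True)
    synthetic_data.append({"prompt": prompt, "response": answer})

# Step 2: fine-tune the student on the (prompt, response) pairs with
# standard next-token cross-entropy, e.g. in any ordinary SFT training loop.
```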

Co-distillation
Co-distillation is a training approach in which the teacher and student models are trained together rather than using a fixed, pre-trained teacher. In this setup, the teacher LLM and student LLM process the same training data simultaneously and each produce their own softmax probability distributions. The teacher is trained normally on the ground-truth hard labels, while the student learns by matching the teacher’s soft labels alongside the actual correct answers. Meta used a form of this approach when training Llama 4 Scout and Maverick alongside the larger Llama 4 Behemoth model.
One challenge with co-distillation is that the teacher model is not fully trained during the early stages, so its predictions may initially be noisy or inaccurate. To compensate, the student is usually trained with a combination of soft-label distillation loss and standard hard-label cross-entropy loss, which creates a more stable learning signal while still allowing knowledge transfer between models. Unlike traditional one-way distillation, co-distillation lets both models improve together during training, often leading to better performance, stronger reasoning transfer, and a smaller performance gap between teacher and student.
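A single co-distillation training step might look like the following minimal sketch. It assumes both models share a tokenizer (so their vocabularies align), return Hugging Face-style outputs with a `.logits` field, and that labels are already aligned with the logits (shifting omitted for brevity); the temperature and mixing weight are again illustrative.

```python
import torch
import torch.nn.functional as F

def co_distillation_step(teacher, student, input_ids, labels,
                         temperature=2.0, alpha=0.5):
    """One co-distillation step: both models are still training."""
    teacher_logits = teacher(input_ids).logits
    student_logits = student(input_ids).logits
    vocab = student_logits.size(-1)

    # Teacher is trained normally on the hard ground-truth labels
    teacher_loss = F.cross_entropy(
        teacher_logits.reshape(-1, teacher_logits.size(-1)), labels.reshape(-1)
    )

    # Student: KL against the teacher's softened, detached distribution
    # (detach so the student's loss does not backpropagate into the teacher)
    t_probs = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kd_loss = F.kl_div(
        s_log_probs.reshape(-1, vocab),
        t_probs.reshape(-1, vocab),
        reduction="batchmean",
    ) * temperature ** 2

    # ...blended with hard-label cross-entropy to stabilize early training,
    # when the teacher's predictions are still unreliable
    ce_loss = F.cross_entropy(student_logits.reshape(-1, vocab), labels.reshape(-1))
    student_loss = alpha * kd_loss + (1 - alpha) * ce_loss

    return teacher_loss, student_loss
```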

Comparing the Three Distillation Techniques
Soft-label distillation transfers the richest form of knowledge because the student learns from the teacher’s full probability distribution rather than only the final answer. This helps smaller models capture reasoning patterns, uncertainty, and relationships between tokens, often leading to stronger overall performance. However, it is computationally expensive, requires access to the teacher’s logits or weights, and is difficult to scale because storing probability distributions over huge vocabularies consumes vast amounts of memory.
Hard-label distillation is simpler and more practical. The student learns only from the teacher’s final generated outputs, making it much cheaper and easier to implement. It works especially well with proprietary black-box models such as GPT-4 APIs, where internal probabilities are unavailable. While this approach loses some of the deeper “dark knowledge” present in soft labels, it remains highly effective for instruction tuning, synthetic data generation, and task-specific fine-tuning.
Co-distillation takes a collaborative approach in which teacher and student models learn together during training. The teacher improves while simultaneously guiding the student, allowing both models to benefit from shared learning signals. This can narrow the performance gap seen in traditional one-way distillation, but it also makes training more complex, since the teacher’s predictions are initially unstable. In practice, soft-label distillation is preferred for maximum knowledge transfer, hard-label distillation for scalability and practicality, and co-distillation for large-scale joint training setups.

