A New MIT Study Shows Reinforcement Learning Minimizes Catastrophic Forgetting Compared to Supervised Fine-Tuning

Table of contents
- What is catastrophic forgetting in foundation models?
- Why does online reinforcement learning forget less than supervised fine-tuning?
- How can forgetting be measured?
- What do experiments on large language models reveal?
- How does RL compare to SFT in robotics tasks?
- What insights come from the ParityMNIST study?
- Why do on-policy updates matter?
- Are other explanations sufficient?
- What are the broader implications?
- Conclusion
- Key Takeaways
What is catastrophic forgetting in foundation models?
Foundation models excel across diverse domains but are largely static once deployed. Fine-tuning on new tasks often introduces catastrophic forgetting: the loss of previously learned capabilities. This limitation is a barrier to building long-lived, continually improving AI agents.
Why does online reinforcement learning forget less than supervised fine-tuning?
A new MIT study compares reinforcement learning (RL) and supervised fine-tuning (SFT). Both can reach high performance on a new task, but SFT tends to overwrite prior abilities, whereas RL preserves them. The key lies in how each method shifts the model's output distribution relative to the base policy.

How can forgetting be measured?
The research team proposes an empirical forgetting law:

Forgetting ∝ KL(π0 ∥ π), measured on the new task,

where π0 is the base model and π is the fine-tuned model. This forward KL divergence, evaluated on new-task data, strongly predicts the extent of forgetting, which makes forgetting quantifiable without needing any data from prior tasks.
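In practice, this quantity can be estimated without any old-task data: sample completions from the base model on new-task prompts and compare their log-probabilities under both models. Below is a minimal sketch of such an estimator (not the paper's code), assuming Hugging Face-style causal LMs; the fine-tuned checkpoint path and the prompt list are placeholders.

```python
# Hedged sketch: one-sample Monte Carlo estimate of the forward KL, KL(pi0 || pi),
# between a base model and a fine-tuned model, evaluated on new-task prompts.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-3B-Instruct"        # pi0, the base policy
TUNED = "path/to/finetuned-checkpoint"   # pi, the fine-tuned policy (placeholder path)

tok = AutoTokenizer.from_pretrained(BASE)
pi0 = AutoModelForCausalLM.from_pretrained(BASE).eval()
pi = AutoModelForCausalLM.from_pretrained(TUNED).eval()

@torch.no_grad()
def forward_kl_on_prompt(prompt: str, max_new_tokens: int = 64) -> float:
    """Estimate KL(pi0 || pi) for one new-task prompt by sampling from pi0."""
    x_ids = tok(prompt, return_tensors="pt").input_ids
    # Sample a completion y ~ pi0(. | x), then score it under both models.
    y = pi0.generate(x_ids, do_sample=True, max_new_tokens=max_new_tokens)

    def continuation_logprob(model) -> float:
        logits = model(y).logits[:, :-1, :]                   # position t predicts token t+1
        logp = F.log_softmax(logits, dim=-1)
        token_lp = logp.gather(-1, y[:, 1:].unsqueeze(-1)).squeeze(-1)
        return token_lp[:, x_ids.shape[1] - 1:].sum().item()  # score generated tokens only

    return continuation_logprob(pi0) - continuation_logprob(pi)

prompts = ["Solve step by step: what is 17 * 24?"]            # new-task inputs (illustrative)
kl_estimate = sum(forward_kl_on_prompt(p) for p in prompts) / len(prompts)
print(f"Estimated KL(pi0 || pi) on the new task: {kl_estimate:.3f} nats")
```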
What do experiments on large language models reveal?
Using Qwen 2.5 3B-Instruct as the base model, fine-tuning was performed on:
- Math reasoning (Open-Reasoner-Zero),
- Science Q&A (SciKnowEval subset),
- Tool use (ToolAlpaca).
Performance was evaluated on prior benchmarks such as HellaSwag, MMLU, TruthfulQA, and HumanEval. Results showed that RL improved new-task accuracy while keeping prior-task accuracy stable, whereas SFT consistently sacrificed prior knowledge.
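Measuring forgetting this way reduces to scoring the same checkpoints on the same benchmark suite before and after fine-tuning. A hedged sketch, where `evaluate_accuracy` is a placeholder for whatever evaluation harness is in use:

```python
# Hedged sketch: quantify forgetting as the accuracy drop on prior benchmarks,
# alongside the gain on the new task. `evaluate_accuracy` is a placeholder.
from typing import Callable, Dict

def forgetting_report(evaluate_accuracy: Callable[[str, str], float],
                      base_ckpt: str, tuned_ckpt: str,
                      prior_benchmarks=("hellaswag", "mmlu", "truthfulqa", "humaneval"),
                      new_task: str = "open_reasoner_zero") -> Dict[str, float]:
    report = {}
    for bench in prior_benchmarks:
        before = evaluate_accuracy(base_ckpt, bench)
        after = evaluate_accuracy(tuned_ckpt, bench)
        report[bench] = before - after          # positive value = forgetting
    report["new_task_gain"] = (evaluate_accuracy(tuned_ckpt, new_task)
                               - evaluate_accuracy(base_ckpt, new_task))
    return report
```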
How does RL compare to SFT in robotics tasks?
In robot-control experiments with OpenVLA-7B fine-tuned on SimplerEnv pick-and-place scenarios, RL adaptation maintained general manipulation skills across tasks. SFT, while successful on the new task, degraded prior manipulation abilities, again illustrating RL's conservatism in preserving knowledge.
What insights come from the ParityMNIST study?
To isolate mechanisms, the research team introduced a toy problem, ParityMNIST. Here, RL and SFT both reached high new-task accuracy, but SFT caused sharper declines on the FashionMNIST auxiliary benchmark. Crucially, plotting forgetting against KL divergence revealed a single predictive curve, validating KL as the governing factor.
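The exact ParityMNIST construction is described in the paper; a plausible minimal version, assuming the new-task label is simply the parity (even vs. odd) of each MNIST digit, might look like this:

```python
# Hedged sketch of a ParityMNIST-style task: relabel MNIST digits by parity.
# The exact construction used in the paper may differ.
import torch
from torchvision import datasets, transforms

mnist = datasets.MNIST(root="data", train=True, download=True,
                       transform=transforms.ToTensor())

class ParityMNIST(torch.utils.data.Dataset):
    """Wrap MNIST so the target is the digit's parity: 0 = even, 1 = odd."""
    def __init__(self, base):
        self.base = base
    def __len__(self):
        return len(self.base)
    def __getitem__(self, i):
        img, digit = self.base[i]
        return img, digit % 2

loader = torch.utils.data.DataLoader(ParityMNIST(mnist), batch_size=128, shuffle=True)
# The auxiliary benchmark (FashionMNIST) would then be evaluated with the same
# backbone to track how much prior capability is lost after fine-tuning on parity.
```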
Why do on-policy updates matter?
On-policy RL samples from the model's own outputs and incrementally reweights them by reward. This process constrains learning to distributions already close to the base model. SFT, in contrast, optimizes toward fixed labels that may be arbitrarily far away. Theoretical analysis shows that policy gradients converge to KL-minimal optimal solutions, formalizing RL's advantage.
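A toy illustration of this difference (not the paper's analysis): starting from a uniform categorical policy, an on-policy REINFORCE-style update that reweights the model's own samples by reward tends to stay much closer in KL to the base policy than an SFT update that pushes all probability onto a single fixed label, even though both "solve" the task.

```python
# Toy contrast between an on-policy policy-gradient step (REINFORCE-style)
# and an SFT step on a small categorical policy. Illustrative only.
import torch
import torch.nn.functional as F

logits = torch.zeros(5, requires_grad=True)           # current policy, starts at the base
base_logits = logits.detach().clone()                 # pi0
reward = torch.tensor([0., 1., 1., 0., 0.])           # two actions solve the new task
opt = torch.optim.SGD([logits], lr=0.5)

for _ in range(200):
    opt.zero_grad()
    probs = F.softmax(logits, dim=-1)
    # On-policy RL: sample from the *current* policy and reweight by reward.
    a = torch.multinomial(probs, num_samples=1).item()
    loss_rl = -reward[a] * torch.log(probs[a])
    loss_rl.backward()
    opt.step()

# SFT analogue: push all mass onto one fixed "labeled" action (action 1 only),
# which can move the policy further from pi0 in KL than RL needs to go.
sft_logits = base_logits.clone().requires_grad_(True)
opt2 = torch.optim.SGD([sft_logits], lr=0.5)
for _ in range(200):
    opt2.zero_grad()
    F.cross_entropy(sft_logits.unsqueeze(0), torch.tensor([1])).backward()
    opt2.step()

def kl_from_base(l: torch.Tensor) -> float:
    """KL(pi0 || pi) for a categorical policy given by logits l."""
    p0 = F.softmax(base_logits, dim=-1)
    return (p0 * (F.log_softmax(base_logits, dim=-1) - F.log_softmax(l, dim=-1))).sum().item()

print("KL(pi0 || pi) after RL :", kl_from_base(logits.detach()))
print("KL(pi0 || pi) after SFT:", kl_from_base(sft_logits.detach()))
```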
Are other explanations sufficient?
The research team tested alternatives: weight-space changes, hidden-representation drift, sparsity of updates, and other distributional metrics (reverse KL, total variation, L2 distance). None matched the predictive power of the forward KL divergence, reinforcing that distributional closeness is the critical factor.
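For reference, the distributional candidates are straightforward to compute side by side for any pair of per-example distributions (for example, next-token distributions from the base and fine-tuned models). A small illustrative helper:

```python
# Hedged sketch: compute the candidate distance metrics for a pair of
# probability distributions (base model p0 vs. fine-tuned model p).
import torch

def distances(p0: torch.Tensor, p: torch.Tensor, eps: float = 1e-12) -> dict:
    """p0 and p are 1-D probability vectors over the same support."""
    p0, p = p0.clamp_min(eps), p.clamp_min(eps)
    return {
        "forward_kl": (p0 * (p0.log() - p.log())).sum().item(),   # KL(pi0 || pi)
        "reverse_kl": (p * (p.log() - p0.log())).sum().item(),    # KL(pi || pi0)
        "total_variation": 0.5 * (p0 - p).abs().sum().item(),
        "l2": (p0 - p).norm(p=2).item(),
    }

# Example: a fine-tuned model that sharpens one token's probability.
base = torch.tensor([0.25, 0.25, 0.25, 0.25])
tuned = torch.tensor([0.70, 0.10, 0.10, 0.10])
print(distances(base, tuned))
```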
What are the broader implications?
- Evaluation: Post-training should consider KL-conservatism, not just task accuracy.
- Hybrid methods: Combining SFT's efficiency with explicit KL minimization could yield optimal trade-offs (a minimal sketch of such a loss follows this list).
- Continual learning: RL's Razor offers a measurable criterion for designing adaptive agents that learn new skills without erasing old ones.
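As referenced above, one plausible hybrid is ordinary SFT cross-entropy plus an explicit penalty on the forward KL to a frozen copy of the base model. A minimal sketch (the weighting and exact formulation are assumptions, not the paper's recipe):

```python
# Hedged sketch of a hybrid objective: SFT cross-entropy plus an explicit
# forward-KL penalty toward a frozen base model. `beta` is illustrative.
import torch
import torch.nn.functional as F

def kl_regularized_sft_loss(student_logits: torch.Tensor,
                            base_logits: torch.Tensor,
                            labels: torch.Tensor,
                            beta: float = 0.1) -> torch.Tensor:
    """student_logits/base_logits: [batch, vocab]; labels: [batch] token ids."""
    ce = F.cross_entropy(student_logits, labels)        # fit the new-task labels
    log_p = F.log_softmax(student_logits, dim=-1)
    log_p0 = F.log_softmax(base_logits, dim=-1)
    # Forward KL(pi0 || pi), averaged over the batch, keeps the model near pi0.
    kl = (log_p0.exp() * (log_p0 - log_p)).sum(dim=-1).mean()
    return ce + beta * kl
```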
Conclusion
The MIT research reframes catastrophic forgetting as a distributional problem governed by forward KL divergence. Reinforcement learning forgets less because its on-policy updates naturally bias toward KL-minimal solutions. This principle, RL's Razor, provides both an explanation for RL's robustness and a roadmap for developing post-training methods that support lifelong learning in foundation models.
Key Takeaways
- Reinforcement learning (RL) preserves prior knowledge better than supervised fine-tuning (SFT): Even when both achieve the same accuracy on new tasks, RL retains prior capabilities while SFT erases them.
- Forgetting is predictable from KL divergence: The degree of catastrophic forgetting is strongly correlated with the forward KL divergence between the fine-tuned and base policies, measured on the new task.
- RL's Razor principle: On-policy RL converges to KL-minimal solutions, keeping updates close to the base model and reducing forgetting.
- Empirical validation across domains: Experiments on LLMs (math, science Q&A, tool use) and robotics tasks confirm RL's robustness against forgetting, while SFT consistently trades old knowledge for new-task performance.
- Controlled experiments confirm generality: In the ParityMNIST toy setting, both RL and SFT showed forgetting aligned with KL divergence, demonstrating that the principle holds beyond large-scale models.
- A new design axis for post-training: Algorithms should be evaluated not only by new-task accuracy but also by how conservatively they shift distributions in KL space, opening avenues for hybrid RL–SFT methods.
Check out the PAPER and PROJECT PAGE. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.