RA3: Mid-Training with Temporal Action Abstractions for Faster Reinforcement Learning (RL) Post-Training in Code LLMs
TL;DR: New research from Apple formalizes what “mid-training” should do before reinforcement learning (RL) post-training and introduces RA3 (Reasoning as Action Abstractions), an EM-style procedure that learns temporally consistent latent actions from expert traces, then fine-tunes on those bootstrapped traces. It shows mid-training should (1) prune to a compact near-optimal action…
