Prefix-RFT: A Unified Machine Learning Framework to blend Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT)

Large language models are typically refined after pretraining using either supervised fine-tuning (SFT) or reinforcement fine-tuning (RFT), each with distinct strengths and limitations. SFT is effective at teaching instruction following through example-based learning, but it can lead to rigid behavior and poor generalization. RFT, on the other hand, optimizes models for task success using reward signals, which can improve performance but also introduces instability and a reliance on a strong starting policy. While these methods are often used sequentially, their interaction remains poorly understood. This raises an important question: how can we design a unified framework that combines SFT's structure with RFT's goal-driven learning?
Research at the intersection of RL and LLM post-training has gained momentum, particularly for training reasoning-capable models. Offline RL, which learns from fixed datasets, often yields suboptimal policies because of the limited diversity of the data. This has sparked interest in combining offline and online RL approaches to improve performance. In LLMs, the dominant strategy is to first apply SFT to teach desirable behaviors, then use RFT to optimize outcomes. However, the dynamics between SFT and RFT are still not well understood, and finding effective ways to integrate them remains an open research problem.
Researchers from the University of Edinburgh, Fudan University, Alibaba Group, Stepfun, and the University of Amsterdam propose a unified framework that combines supervised and reinforcement fine-tuning in an approach called Prefix-RFT. The method guides exploration using partial demonstrations, allowing the model to complete solutions with flexibility and adaptability. Tested on math reasoning tasks, Prefix-RFT consistently outperforms standalone SFT, RFT, and mixed-policy methods. It integrates easily into existing frameworks and proves robust to changes in demonstration quality and quantity. Blending demonstration-based learning with exploration can lead to more effective and adaptive training of large language models.

The study presents Prefix Reinforcement Fine-Tuning (Prefix-RFT) as a way to combine the strengths of SFT and RFT. While SFT provides stability by mimicking expert demonstrations, RFT encourages exploration through the use of reward signals. Prefix-RFT bridges the two by using a partial demonstration (a prefix) and letting the model generate the rest. This approach guides learning without relying too heavily on full supervision. It incorporates techniques such as entropy-based clipping and a cosine decay scheduler to ensure stable training and efficient learning. Compared with prior methods, Prefix-RFT offers a more balanced and adaptive fine-tuning strategy.
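To make the core idea concrete, here is a minimal, hypothetical sketch of a prefix-guided rollout: a fraction of an expert demonstration is kept as a prefix, and the policy generates the remainder on-policy. The function and variable names (`policy_generate`, `demo_tokens`) are illustrative assumptions, not the authors' implementation.

```python
import random
from typing import Callable, List, Tuple


def prefix_guided_rollout(
    policy_generate: Callable[[List[int]], List[int]],
    demo_tokens: List[int],
    prefix_ratio: float,
) -> Tuple[List[int], List[int]]:
    """Keep `prefix_ratio` of the demonstration as a prompt prefix and
    let the policy generate the remaining solution on-policy."""
    cut = max(1, int(len(demo_tokens) * prefix_ratio))
    prefix = demo_tokens[:cut]
    continuation = policy_generate(prefix)  # on-policy completion of the solution
    return prefix, continuation


if __name__ == "__main__":
    # Toy usage with a dummy "policy" that emits random tokens.
    demo = list(range(100))  # stand-in for a tokenized expert demonstration
    dummy_policy = lambda p: [random.randint(0, 9) for _ in range(20)]
    prefix, continuation = prefix_guided_rollout(dummy_policy, demo, prefix_ratio=0.5)
    print(len(prefix), len(continuation))  # 50 20
```

In practice the prefix anchors the rollout near a known-good solution, while the continuation is rewarded and updated as in standard RFT, which is what lets the method balance imitation and exploration.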
Prefix-RFT is a reward fine-tuning method that improves performance using high-quality offline math datasets such as OpenR1-Math-220K (46k filtered problems). Applied to Qwen2.5-Math-7B, 1.5B, and LLaMA-3.1-8B, it was evaluated on benchmarks including AIME 2024/25, AMC, MATH500, Minerva, and OlympiadBench. Prefix-RFT achieved the highest avg@32 and pass@1 scores across tasks, outperforming RFT, SFT, ReLIFT, and LUFFY. Using Dr. GRPO, it updated only the top 20% highest-entropy prefix tokens, with the prefix length decaying from 95% to 5% over training. It maintained an intermediate SFT loss, indicating a strong balance between imitation and exploration, especially on difficult problems (Trainhard).
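The two training details mentioned above can be sketched as follows: a cosine schedule that decays the prefix ratio from 95% to 5% over training, and an entropy filter that keeps only the top 20% highest-entropy prefix positions for the imitation update. This is a hedged illustration under assumed tensor shapes and helper names; it is not the paper's released Dr. GRPO code.

```python
import math
import torch


def cosine_prefix_ratio(step: int, total_steps: int,
                        start: float = 0.95, end: float = 0.05) -> float:
    """Cosine-decay the fraction of the demonstration used as a prefix."""
    progress = min(step / max(total_steps, 1), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


def top_entropy_mask(logits: torch.Tensor, keep_frac: float = 0.2) -> torch.Tensor:
    """Mask selecting the highest-entropy prefix positions for the loss.

    logits: (seq_len, vocab_size) policy logits over the prefix tokens.
    Returns a boolean mask of shape (seq_len,) that is True for the
    top `keep_frac` positions by predictive entropy.
    """
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)
    k = max(1, int(keep_frac * entropy.numel()))
    threshold = torch.topk(entropy, k).values.min()
    return entropy >= threshold


# Example: ratio shrinks from 0.95 toward 0.05 as training progresses.
print(cosine_prefix_ratio(0, 1000), cosine_prefix_ratio(1000, 1000))
```

Restricting the imitation loss to high-entropy tokens focuses the supervised signal on positions where the policy is genuinely uncertain, which is consistent with the reported finding that the top-20% setting gives the best scores with shorter outputs.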

In conclusion, Prefix-RFT combines the strengths of SFT and RFT by using sampled demonstration prefixes to guide learning. Despite its simplicity, it consistently outperforms SFT, RFT, and hybrid baselines across various models and datasets. Even with just 1% of the training data (450 prompts), it maintains strong performance (avg@32 drops only from 40.8 to 37.6), showing efficiency and robustness. Its top-20% entropy-based token update strategy proves most effective, achieving the highest benchmark scores with shorter outputs. Moreover, using a cosine decay scheduler for the prefix length improves stability and learning dynamics compared with a uniform strategy, particularly on complex tasks such as AIME.
Check out the Paper here. Feel free to visit our GitHub Page for Tutorials, Codes and Notebooks. Also, follow us on Twitter and don't forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.