6 things to fix before RLHF turns your biases into features
Here is a sentence that ought to give any ML team pause:
The model you are trying to align can also be the model producing the data you are using to align it.
Congratulations, you have built an ouroboros.
A paper accepted at ICML 2026 by Dongyoon Hahm, Dylan Hadfield-Menell, and Kimin Lee puts a name to what can go wrong inside that loop:
1. Separate quality from ideology in your annotation schema
The fix is to decompose your rubric. Score fluency, accuracy, and task completion separately from tonal or ideological dimensions.
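As a minimal sketch of what a decomposed rubric could look like, the Python below keeps the three quality dimensions apart from a single tonal/ideological field. The class name `RubricScore`, the field names, and the 1 to 5 scale are all illustrative assumptions, not a schema taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RubricScore:
    """One annotator's rating of one response (hypothetical 1-5 scale)."""
    fluency: int
    accuracy: int
    task_completion: int
    tone: int  # tonal / ideological dimension, recorded but kept out of quality

    def quality(self) -> float:
        """Aggregate quality score that deliberately ignores tone."""
        return mean([self.fluency, self.accuracy, self.task_completion])


def quality_preference(a: RubricScore, b: RubricScore) -> str:
    """Return which response wins on quality alone: 'a', 'b', or 'tie'."""
    if a.quality() > b.quality():
        return "a"
    if b.quality() > a.quality():
        return "b"
    return "tie"


# Two responses with identical quality scores that differ only in tone.
resp_a = RubricScore(fluency=4, accuracy=5, task_completion=4, tone=2)
resp_b = RubricScore(fluency=4, accuracy=5, task_completion=4, tone=5)
print(quality_preference(resp_a, resp_b))  # tie
```

Because tone is stored in its own field, a pair like this one surfaces as a quality tie instead of silently becoming a preference label driven by tone.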
4. Consider DPO to reduce reward model compounding
Each RLHF iteration builds on the reward model from the last one. If that reward model has already absorbed a quality-bias conflation, the compounding across iterations is exactly what the paper’s experiments show happening.
Direct Preference Optimization sidesteps the explicit reward model entirely, which removes at least one amplification pathway from the loop.
DPO is well-supported in
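To make the "no explicit reward model" point concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It takes summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model. The function name, argument names, and the default `beta=0.1` are illustrative assumptions; a real training run would compute this over batches in a tensor library.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio)))."""
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Policy favors the chosen response more than the reference does: lower loss.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The preference pairs feed the loss directly, so there is no separately trained reward model whose errors carry into the next iteration. Biases present in the preference data itself still pass through.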
6. Stop treating your annotation team as a fixed variable
Annotator demographic composition, domain expertise, and fatigue all shape what the reward model learns. A preference dataset collected entirely from annotators in a single region or professional background will encode the biases of that group with impressive fidelity.
RLHF will then do what it does best and optimize on them.
A few concrete steps worth building into your process:
- Review the demographic and professional diversity of your annotation team at least once per major training cycle, because a homogeneous annotator pool produces a homogeneous reward model
- Flag tasks where inter-annotator agreement is low; disagreement often signals that a bias dimension is active, and averaging over it doesn’t make it go away
- Weight annotations from domain experts more heavily on technical tasks rather than averaging across a general pool, particularly in regulated industries where specific language carries compliance weight
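The second and third steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a recommended pipeline: the function names, the 0.7 agreement threshold, and the 2x expert weight are all hypothetical, and simple majority agreement stands in for whatever agreement statistic your team already uses.

```python
from collections import Counter


def agreement_rate(labels):
    """Fraction of annotators who chose the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def flag_low_agreement(task_labels, threshold=0.7):
    """Return ids of tasks whose agreement rate falls below the threshold."""
    return [task_id for task_id, labels in task_labels.items()
            if agreement_rate(labels) < threshold]


def weighted_preference(votes, expert_weight=2.0):
    """votes: list of (label, is_expert). Return the label with the largest weighted total."""
    totals = {}
    for label, is_expert in votes:
        totals[label] = totals.get(label, 0.0) + (expert_weight if is_expert else 1.0)
    return max(totals, key=totals.get)


task_labels = {
    "t1": ["a", "a", "a", "a"],  # full agreement
    "t2": ["a", "b", "a", "b"],  # split vote
    "t3": ["a", "a", "a", "b"],  # mild disagreement
}
print(flag_low_agreement(task_labels))  # ['t2']

# Three generalists prefer 'a', two domain experts prefer 'b'.
votes = [("a", False), ("a", False), ("a", False), ("b", True), ("b", True)]
print(weighted_preference(votes))  # b
```

Flagged tasks are better routed to manual review than aggregated, since a split vote like `t2` is the kind of case where averaging hides the disagreement.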
None of this eliminates the alignment tampering vulnerability the paper describes. It does reduce the strength of the biases that get encoded in the first place, which gives RLHF less to amplify.
Final thoughts
The research community doesn’t yet have a consensus fix, but the map of where the vulnerabilities sit is now considerably clearer.
For teams running training cycles in 2026, reading that map before the next run is the move.
