6 things to fix before RLHF turns your biases into features

By Ricardo, June 2, 2026

Here is a sentence that ought to give any ML team pause:

The model you are trying to align is also the model producing the data you are using to align it.

Congratulations, you have built an ouroboros.

A paper accepted at ICML 2026 by Dongyoon Hahm, Dylan Hadfield-Menell, and Kimin Lee puts a name to what can go wrong inside that loop:

1. Separate quality from ideology in your annotation schema

Asking annotators to pick the “better” response is a bit like asking someone which of two meals they preferred and then concluding the winning chef has superior ethics. Quality and values are being bundled into a single label, and your reward model has absolutely no way to pull them apart.

The fix is to decompose your rubric. Score fluency, accuracy, and task completion separately from tonal or ideological dimensions.
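As a rough illustration of a decomposed rubric, the sketch below keeps quality scores and value-laden scores in separate fields and emits two independent preference labels instead of one. The field names, the 1-5 scale, and the `preference_label` helper are assumptions made for this example, not a schema taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical dimension names; adapt them to your own rubric.
QUALITY_DIMS = ("fluency", "accuracy", "task_completion")
VALUE_DIMS = ("tone", "ideological_lean")


@dataclass(frozen=True)
class RubricScore:
    """One annotator's scores for one response, each on an assumed 1-5 scale."""
    fluency: int
    accuracy: int
    task_completion: int
    tone: int
    ideological_lean: int

    def quality(self) -> float:
        # Average over the quality dimensions only.
        return mean(getattr(self, d) for d in QUALITY_DIMS)

    def values(self) -> float:
        # Average over the tonal / ideological dimensions only.
        return mean(getattr(self, d) for d in VALUE_DIMS)


def _winner(x: float, y: float) -> str:
    return "a" if x > y else "b" if y > x else "tie"


def preference_label(a: RubricScore, b: RubricScore) -> dict:
    """Emit two separate labels so quality and values are never merged."""
    return {
        "quality_winner": _winner(a.quality(), b.quality()),
        "values_winner": _winner(a.values(), b.values()),
    }


a = RubricScore(fluency=5, accuracy=5, task_completion=5, tone=2, ideological_lean=2)
b = RubricScore(fluency=3, accuracy=3, task_completion=3, tone=4, ideological_lean=4)
print(preference_label(a, b))
```

With labels split this way, a reward model can be trained on the quality label alone, and the value label stays available for auditing rather than being silently folded into “better.”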

4. Consider DPO to reduce reward model compounding

Each RLHF iteration builds on the reward model from the last one. If that reward model has already absorbed a quality-bias conflation, the compounding across iterations is precisely what the paper’s experiments show happening.

Direct Preference Optimization sidesteps the explicit reward model entirely, which removes at least one amplification pathway from the loop.
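To make the “no explicit reward model” point concrete, here is a minimal sketch of the standard per-pair DPO loss in plain Python. It assumes you already have summed log-probabilities for the chosen and rejected responses under both the policy and a frozen reference model; the function name, argument names, and the `beta=0.1` default are illustrative choices, not a particular library's API.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin)))."""
    # How far the policy has moved from the reference on each response.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(x)) == log(1 + exp(-x)); branch for numerical stability.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy now favors the chosen response more than the reference does: lower loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The preference pairs feed the policy update directly, so there is no intermediate reward model to absorb and re-emit a bias. The pairs themselves can still carry that bias, which is why the annotation-side fixes remain relevant.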

DPO is well-supported in

6. Stop treating your annotation team as a fixed variable

Annotator demographic composition, domain expertise, and fatigue all shape what the reward model learns. A preference dataset collected entirely from annotators in a single region or professional background will encode the biases of that group with impressive fidelity.

RLHF will then do what it does best and optimize on them.

A few concrete steps worth building into your process:

Review the demographic and professional diversity of your annotation team at least once per major training cycle, because a homogeneous annotator pool produces a homogeneous reward model
Flag tasks where inter-annotator agreement is low; disagreement often signals that a bias dimension is active, and averaging over it doesn’t make it go away
Weight annotations from domain experts more heavily on technical tasks rather than averaging across a general pool, particularly in regulated industries where specific language carries compliance weight
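The last two steps above can be sketched in a few lines. This is only an illustration under stated assumptions: agreement is measured as the share of annotators choosing the most common label (a simpler stand-in for statistics such as Fleiss' kappa), and the `threshold=0.7` and `expert_weight=2.0` defaults are arbitrary placeholders you would tune for your own data.

```python
from collections import Counter


def agreement_rate(labels: list) -> float:
    """Fraction of annotators who chose the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def flag_low_agreement(task_labels: dict, threshold: float = 0.7) -> list:
    """Return ids of tasks whose agreement rate falls below the threshold."""
    return [task_id for task_id, labels in task_labels.items()
            if agreement_rate(labels) < threshold]


def weighted_vote(votes: list, expert_weight: float = 2.0) -> str:
    """Aggregate (label, is_expert) votes; each expert counts expert_weight times."""
    totals = Counter()
    for label, is_expert in votes:
        totals[label] += expert_weight if is_expert else 1.0
    return totals.most_common(1)[0][0]


task_labels = {
    "t1": ["a", "a", "a", "a"],   # unanimous
    "t2": ["a", "b", "a", "b"],   # split: route to review, do not average
}
print(flag_low_agreement(task_labels))

votes = [("a", False), ("a", False), ("b", True), ("b", True)]
print(weighted_vote(votes))
```

Flagged tasks are returned for review rather than resolved automatically, in keeping with the point that averaging over a disagreement does not remove the bias behind it.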

None of this eliminates the alignment tampering vulnerability the paper describes. It does reduce the strength of the biases that get encoded in the first place, which gives RLHF less to amplify.

Final thoughts

The paper’s most useful contribution is reframing alignment as a two-sided process. Your team shapes the model. The model, through its outputs, shapes the data that shapes it back. Treating RLHF as a one-way correction mechanism is a bit like editing a document while the document is also editing you.

The research community doesn’t yet have a consensus fix, but the map of where the vulnerabilities sit is now considerably clearer.

For teams running training cycles in 2026, reading that map before the next run is the move.
