6 things to fix before RLHF turns your biases into features

By Ricardo, June 2, 2026

Here is a sentence that ought to give any ML team pause:

The model you are trying to align is also the model producing the data you are using to align it.

Congratulations, you have built an ouroboros.

A paper accepted at ICML 2026 by Dongyoon Hahm, Dylan Hadfield-Menell, and Kimin Lee puts a name to what can go wrong inside that loop: alignment tampering.

1. Separate quality from ideology in your annotation schema

💡

Asking annotators to pick the “better” response is a bit like asking someone which of two meals they preferred and then concluding the winning chef has superior ethics. Quality and values are being bundled into a single label, and your reward model has absolutely no way to pull them apart.

The fix is to decompose your rubric. Score fluency, accuracy, and task completion separately from tonal or ideological dimensions.
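
One way to picture a decomposed rubric is the minimal sketch below. The field names, the 1 to 5 scale, and the `margin` threshold are illustrative assumptions, not a schema taken from the paper. The point is only that the preference label fed to the reward model is derived from the quality dimensions, while the tonal and ideological scores are stored separately for auditing.

```python
from dataclasses import dataclass

# Hypothetical dimension names; adapt them to your own rubric.
QUALITY_DIMS = ("fluency", "accuracy", "task_completion")
VALUE_DIMS = ("tone", "ideological_lean")


@dataclass(frozen=True)
class RubricScore:
    """One annotator's scores for one response, each on an assumed 1-5 scale."""
    fluency: int
    accuracy: int
    task_completion: int
    tone: int
    ideological_lean: int

    def quality(self) -> float:
        # Average over quality dimensions only.
        return sum(getattr(self, d) for d in QUALITY_DIMS) / len(QUALITY_DIMS)

    def values(self) -> float:
        # Average over tonal/ideological dimensions, kept apart for auditing.
        return sum(getattr(self, d) for d in VALUE_DIMS) / len(VALUE_DIMS)


def quality_preference(a: RubricScore, b: RubricScore, margin: float = 0.5) -> str:
    """Return 'a', 'b', or 'tie' using quality scores alone."""
    diff = a.quality() - b.quality()
    if abs(diff) < margin:
        return "tie"
    return "a" if diff > 0 else "b"


if __name__ == "__main__":
    resp_a = RubricScore(fluency=5, accuracy=5, task_completion=5, tone=1, ideological_lean=1)
    resp_b = RubricScore(fluency=3, accuracy=3, task_completion=3, tone=5, ideological_lean=5)
    print(quality_preference(resp_a, resp_b))
```

With this shape, two responses of equal quality but different tone come out as a tie instead of quietly teaching the reward model a tonal preference.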

4. Consider DPO to reduce reward model compounding

Each RLHF iteration builds on the reward model from the last one. If that reward model has already absorbed a quality-bias conflation, the compounding across iterations is exactly what the paper’s experiments show happening.

Direct Preference Optimization sidesteps the explicit reward model entirely, which removes at least one amplification pathway from the loop.
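
To make "no explicit reward model" concrete, here is a minimal sketch of the standard per-pair DPO loss in plain Python. It assumes you already have summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model. The function and argument names and the default `beta` are illustrative, and a real training run would compute this over batches in a tensor library.

```python
import math


def _softplus(z: float) -> float:
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) equals softplus(-x).
    return _softplus(-logits)


if __name__ == "__main__":
    # Policy identical to the reference: loss is log(2).
    print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
    # Policy has shifted probability toward the chosen response: loss drops.
    print(dpo_loss(-0.5, -3.0, -1.0, -2.0))
```

The preference pairs feed the policy update directly, so there is no separately trained reward model to carry a conflated signal from one iteration to the next. Any bias already present in the pairs themselves is still learned.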

DPO is well-supported in

6. Stop treating your annotation team as a fixed variable

Annotator demographic composition, domain expertise, and fatigue all shape what the reward model learns. A preference dataset collected entirely from annotators in a single region or professional background will encode the biases of that group with impressive fidelity.

RLHF will then do what it does best and optimize on them.

A few concrete steps worth building into your process:

Review the demographic and professional diversity of your annotation team at least once per major training cycle, because a homogeneous annotator pool produces a homogeneous reward model
Flag tasks where inter-annotator agreement is low; disagreement often signals that a bias dimension is active, and averaging over it doesn’t make it go away
Weight annotations from domain experts more heavily on technical tasks rather than averaging across a general pool, particularly in regulated industries where specific language carries compliance weight
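
The last two steps can be sketched in a few lines. This is a rough illustration under stated assumptions: agreement is measured as the fraction of annotator pairs that gave the same label, the `0.6` threshold and the `expert_weight` of `2.0` are arbitrary placeholders, and the function names are hypothetical. A production pipeline would more likely use a chance-corrected statistic such as Krippendorff's alpha.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_agreement(labels: list[str]) -> float:
    """Fraction of annotator pairs that gave the same label for one task."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 1.0  # zero or one annotator: nothing to disagree with
    return sum(a == b for a, b in pairs) / len(pairs)


def flag_low_agreement(task_labels: dict[str, list[str]], threshold: float = 0.6) -> list[str]:
    """Return task ids whose agreement falls below the (assumed) threshold."""
    return [
        task_id
        for task_id, labels in task_labels.items()
        if pairwise_agreement(labels) < threshold
    ]


def weighted_vote(votes: list[tuple[str, bool]], expert_weight: float = 2.0) -> str:
    """Aggregate (label, is_expert) votes, counting expert votes more heavily.

    On an exact tie, the label that appeared first in `votes` wins.
    """
    totals: dict[str, float] = defaultdict(float)
    for label, is_expert in votes:
        totals[label] += expert_weight if is_expert else 1.0
    return max(totals, key=totals.get)


if __name__ == "__main__":
    tasks = {"t1": ["a", "a", "a"], "t2": ["a", "b", "a", "b"]}
    print(flag_low_agreement(tasks))
    print(weighted_vote([("a", False), ("a", False), ("b", True), ("b", True)]))
```

Flagged tasks are better routed to review than aggregated, since the disagreement is itself the signal worth inspecting.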

None of this eliminates the alignment tampering vulnerability the paper describes. It does reduce the strength of the biases that get encoded in the first place, which gives RLHF less to amplify.

Final thoughts

💡

The paper’s most useful contribution is reframing alignment as a two-sided process. Your team shapes the model. The model, through its outputs, shapes the data that shapes it back. Treating RLHF as a one-way correction mechanism is a bit like editing a document while the document is also editing you.

The research community doesn’t yet have a consensus fix, but the map of where the vulnerabilities sit is now considerably clearer.

For teams running training cycles in 2026, reading that map before the next run is the move.
