Crome: Google DeepMind’s Causal Framework for Robust Reward Modeling in LLM Alignment
Reward models are fundamental to aligning LLMs with human feedback, yet they are prone to reward hacking: they often latch onto superficial attributes such as response length or formatting rather than true quality indicators like factuality and relevance. The problem arises because standard training objectives cannot differentiate spurious correlates present in the training data from the genuine causal drivers of response quality.
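For context, reward models in RLHF pipelines are typically trained with the Bradley-Terry pairwise objective. The minimal sketch below (PyTorch, with toy scores invented for illustration) shows why that loss is agnostic to which attributes produce the reward margin: nothing in the objective separates a causal quality signal from a spurious one like length.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    """Standard pairwise preference loss for reward model training.

    Maximizes P(chosen > rejected) = sigmoid(r_chosen - r_rejected).
    The loss only sees scalar reward differences, so any feature that
    correlates with preference labels in the data (e.g. response
    length) can drive the margin: the objective itself cannot tell
    spurious correlates apart from causal quality attributes.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores standing in for a reward model head's outputs
r_chosen = torch.tensor([1.2, 0.8, 2.0])
r_rejected = torch.tensor([0.3, 1.1, 0.5])
print(bradley_terry_loss(r_chosen, r_rejected))  # scalar loss value
```

Because the gradient rewards whatever widens the chosen/rejected margin, a model can minimize this loss just as well by tracking formatting or verbosity as by tracking factuality; this is the gap Crome's causal framing targets.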