Google AI Introduces Consistency Training for Safer Language Models Under Sycophantic and Jailbreak Style Prompts
How can consistency training help language models resist sycophantic prompts and jailbreak-style attacks while keeping their capabilities intact? Large language models often respond safely to a plain prompt, then change behavior when the same task is wrapped with flattery or role play. DeepMind researchers propose consistency training as a simple training-time lens on this brittleness: treat it as an invariance problem and enforce the same behavior when irrelevant prompt text changes. The research team studies two concrete methods, Bias-augmented Consistency Training (BCT) and Activation Consistency Training (ACT), and evaluates them on Gemma 2, Gemma 3, and Gemini 2.5 Flash.

Understanding the Approach
Consistency training is self-supervised. The model supervises itself by providing targets from its own responses to clean prompts, then learns to behave identically on wrapped prompts that add sycophancy cues or jailbreak wrappers. This avoids two failure modes of static supervised fine-tuning: specification staleness when policies change, and capability staleness when targets come from weaker models.
Two training routes
BCT, token-level consistency: Generate a response to the clean prompt with the current checkpoint, then fine-tune so the wrapped prompt yields the same tokens. This is standard cross-entropy supervised fine-tuning, with the constraint that targets are always generated by the same model being updated. That is what makes it consistency training rather than stale SFT.
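A minimal sketch of one BCT step in PyTorch with Hugging Face transformers may help make this concrete. The model name, `max_new_tokens`, and the helper structure are illustrative assumptions, not the paper's exact pipeline:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it")

def bct_loss(clean_prompt: str, wrapped_prompt: str) -> torch.Tensor:
    # 1) Self-generate the target on the CLEAN prompt with the current
    #    checkpoint, frozen under no_grad -- this is what keeps targets
    #    fresh rather than stale.
    clean_ids = tok(clean_prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        target_ids = model.generate(
            clean_ids, max_new_tokens=64
        )[:, clean_ids.shape[1]:]

    # 2) Standard cross-entropy SFT on the WRAPPED prompt, with the loss
    #    masked to response tokens only, so the wrapped prompt must yield
    #    the same tokens as the clean one.
    wrapped_ids = tok(wrapped_prompt, return_tensors="pt").input_ids
    input_ids = torch.cat([wrapped_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : wrapped_ids.shape[1]] = -100  # ignore prompt positions
    return model(input_ids=input_ids, labels=labels).loss
```

Because the targets are regenerated from the checkpoint being trained, the supervision signal tracks the current policy and capabilities rather than a frozen dataset.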

ACT, activation-level consistency: Enforce an L2 loss between residual-stream activations on the wrapped prompt and a stop-gradient copy of activations from the clean prompt. The loss is applied over prompt tokens, not responses. The aim is to make the internal state just before generation match the clean run.
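A minimal sketch of the ACT loss, reusing `tok`, `model`, and `torch` from the BCT sketch above. Aligning the trailing prompt tokens is an assumption made here for illustration; the paper's exact token-alignment scheme may differ:

```python
import torch.nn.functional as F

def act_loss(clean_prompt: str, wrapped_prompt: str) -> torch.Tensor:
    clean_ids = tok(clean_prompt, return_tensors="pt").input_ids
    wrapped_ids = tok(wrapped_prompt, return_tensors="pt").input_ids

    # Stop-gradient activations from the clean run: targets, not trainables.
    with torch.no_grad():
        clean_h = model(clean_ids, output_hidden_states=True).hidden_states

    wrapped_h = model(wrapped_ids, output_hidden_states=True).hidden_states

    # L2 loss over prompt tokens only -- no response tokens are involved.
    # Align the last k positions, assuming the wrapper mostly prepends
    # text so the trailing tokens of both prompts correspond.
    k = min(clean_ids.shape[1], wrapped_ids.shape[1])
    return sum(F.mse_loss(w[:, -k:, :], c[:, -k:, :])
               for w, c in zip(wrapped_h, clean_h))
```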
Before training, the research team demonstrates activation patching at inference time: swap clean-prompt activations into the wrapped run. On Gemma 2 2B, patching increases the "not sycophantic" rate from 49 percent to 86 percent when patching all layers and prompt tokens.
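A sketch of that patching probe using PyTorch forward hooks, again reusing `tok` and `model` from above. The `model.model.layers` path matches Hugging Face Gemma-style decoders but is an assumption here, as is the trailing-token alignment:

```python
def patched_generate(clean_prompt: str, wrapped_prompt: str) -> str:
    clean_ids = tok(clean_prompt, return_tensors="pt").input_ids
    wrapped_ids = tok(wrapped_prompt, return_tensors="pt").input_ids
    k = min(clean_ids.shape[1], wrapped_ids.shape[1])

    # Cache residual-stream activations from the clean run.
    with torch.no_grad():
        clean_h = model(clean_ids, output_hidden_states=True).hidden_states

    # Hooks that overwrite the wrapped run's activations at prompt
    # positions; hidden_states[i + 1] is the output of decoder layer i.
    def make_hook(i):
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            if hidden.shape[1] >= k:  # prefill pass only, not decode steps
                hidden[:, -k:, :] = clean_h[i + 1][:, -k:, :]
            return output
        return hook

    handles = [layer.register_forward_hook(make_hook(i))
               for i, layer in enumerate(model.model.layers)]
    try:
        with torch.no_grad():
            out = model.generate(wrapped_ids, max_new_tokens=64)
    finally:
        for h in handles:
            h.remove()
    return tok.decode(out[0, wrapped_ids.shape[1]:], skip_special_tokens=True)
```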

Setup and baselines
Models include Gemma-2 2B and 27B, Gemma-3 4B and 27B, and Gemini-2.5 Flash.
Sycophancy data: Training pairs are built by augmenting ARC, OpenBookQA, and BigBench Hard with user-preferred wrong answers. Evaluation uses MMLU both for sycophancy measurement and for capability measurement. A stale SFT baseline uses GPT-3.5 Turbo generated targets to probe capability staleness.
Jailbreak data: Training pairs come from HarmBench harmful instructions, wrapped with role play and other jailbreak transforms. The set keeps only cases where the model refuses the clean instruction and complies with the wrapped one, which yields about 830 to 1,330 examples depending on refusal tendency (as sketched after these notes). Evaluation uses ClearHarm and the human-annotated jailbreak split of WildGuardTest for attack success rate, and XSTest plus WildJailbreak to study benign prompts that look harmful.
Baselines include Direct Preference Optimization and a stale SFT ablation that uses responses from older models in the same family.
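A sketch of the jailbreak pair-filtering rule described above. Here `generate` and `refuses` are hypothetical stand-ins for the model's sampling call and whatever refusal judge the paper uses:

```python
from typing import Callable, List, Tuple

def build_jailbreak_pairs(
    instructions: List[str],
    wrappers: List[Callable[[str], str]],
    generate: Callable[[str], str],
    refuses: Callable[[str], bool],
) -> List[Tuple[str, str]]:
    """Keep (clean, wrapped) pairs where the model refuses the clean
    instruction but complies with the wrapped version."""
    pairs = []
    for inst in instructions:
        if not refuses(generate(inst)):
            continue  # model already complies on the clean prompt; skip
        for wrap in wrappers:
            wrapped = wrap(inst)
            if not refuses(generate(wrapped)):
                pairs.append((inst, wrapped))
    return pairs
```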

Understanding the Results
Sycophancy: BCT and ACT both reduce sycophancy while maintaining model capability. Across models, stale SFT is strictly worse than BCT on the combined "not sycophantic" and MMLU trade-off, with exact numbers given in Appendix Table 5 of the paper. On the larger Gemma models, BCT raises MMLU by about two standard errors while reducing sycophancy. ACT often matches BCT on sycophancy but shows smaller MMLU gains, which is notable since ACT never trains on response tokens. (arXiv)

Jailbreak robustness: All interventions improve safety over the control. On Gemini 2.5 Flash, BCT cuts the ClearHarm attack success rate from 67.8 percent to 2.9 percent. ACT also reduces jailbreak success but tends to preserve benign answer rates better than BCT. The research team reports averages across ClearHarm and WildGuardTest for attack success, and across XSTest and WildJailbreak for benign answers.

Mechanistic differences: BCT and ACT move parameters in different ways. Under BCT, the activation distance between clean and wrapped representations rises during training. Under ACT, cross-entropy on responses does not meaningfully drop, while the activation loss falls. This divergence supports the claim that behavior-level and activation-level consistency optimize different internal features.
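A sketch of the kind of diagnostic behind these observations, reusing `tok` and `model` from the earlier sketches: track the mean L2 distance between clean and wrapped residual-stream activations across training checkpoints. The choice of the final prompt position is an assumption for illustration:

```python
def activation_distance(clean_prompt: str, wrapped_prompt: str) -> float:
    """Mean L2 distance between clean and wrapped residual-stream
    activations at the final prompt position, averaged over layers."""
    clean_ids = tok(clean_prompt, return_tensors="pt").input_ids
    wrapped_ids = tok(wrapped_prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        ch = model(clean_ids, output_hidden_states=True).hidden_states
        wh = model(wrapped_ids, output_hidden_states=True).hidden_states
    dists = [torch.linalg.norm(c[:, -1, :] - w[:, -1, :]).item()
             for c, w in zip(ch, wh)]
    return sum(dists) / len(dists)
```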
Key Takeaways
- Consistency training treats sycophancy and jailbreaks as invariance problems: the model should behave the same when irrelevant prompt text changes.
- Bias-augmented Consistency Training aligns token outputs on wrapped prompts with responses to clean prompts using self-generated targets, which avoids the specification and capability staleness of outdated safety datasets or weaker teacher models.
- Activation Consistency Training aligns residual-stream activations between clean and wrapped prompts on prompt tokens, building on activation patching, and improves robustness while barely changing standard supervised losses.
- On the Gemma and Gemini model families, both methods reduce sycophancy without hurting benchmark accuracy, and outperform stale supervised fine-tuning that relies on responses from earlier-generation models.
- For jailbreaks, consistency training reduces attack success while keeping many benign answers, and the research team argues that alignment pipelines should emphasize consistency across prompt transformations as much as per-prompt correctness.
Editorial Comments
Consistency training is a practical addition to existing alignment pipelines because it directly addresses specification staleness and capability staleness using self-generated targets from the current model. Bias-augmented Consistency Training delivers strong gains in sycophancy and jailbreak robustness, while Activation Consistency Training offers a lower-impact regularizer on residual-stream activations that preserves helpfulness. Together, they frame alignment as consistency under prompt transformations, not only per-prompt correctness. Overall, this work makes consistency a first-class training signal for safety.
