Google AI Introduces Consistency Training for Safer Language Models Under Sycophantic and Jailbreak-Style Prompts
How can consistency training help language models resist sycophantic prompts and jailbreak-style attacks while keeping their capabilities intact? Large language models often respond safely to a plain prompt, then change behavior when the same task is wrapped in flattery or role play. DeepMind researchers propose consistency training as a simple training lens for this…
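The core idea described above can be sketched as a consistency objective: penalize the model when its output distribution on a wrapped (flattering or role-play) prompt drifts from its distribution on the plain prompt. The sketch below is a minimal illustration under assumed toy logits, not DeepMind's implementation; the KL divergence here stands in for whatever consistency loss the actual method uses.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far the wrapped-prompt distribution q
    drifts from the clean-prompt target p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token logits over the same small token set.
clean_logits = [2.0, 0.5, -1.0]    # model's behavior on the plain prompt (target)
wrapped_logits = [0.8, 1.2, -0.2]  # same task wrapped in flattery / role play

p = softmax(clean_logits)
q = softmax(wrapped_logits)

# A training loop would minimize this so behavior on the wrapped
# prompt matches behavior on the plain prompt.
consistency_loss = kl_divergence(p, q)
print(round(consistency_loss, 4))
```

When the two distributions match, the loss is zero; any behavioral drift under the wrapper is penalized, which is the sense in which the training is "consistent."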
