Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification
Instruction-tuned language models refuse dangerous requests. But which part of the model is actually responsible, and how does that mechanism get installed during training? New research from the Nous Research team takes a neuron-level look at this question. The team developed contrastive neuron attribution (CNA), a…
