Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification
Instruction-tuned language fashions refuse dangerous requests. But which a part of the mannequin is definitely accountable — and the way does that mechanism get put in throughout coaching? A brand new analysis from Nous Research workforce takes a neuron-level take a look at this query. The Nous analysis workforce developed contrastive neuron attribution (CNA), a way that identifies the particular MLP neurons whose activations most distinguish dangerous from benign prompts. By ablating simply 0.1% of MLP activations, they diminished refusal charges by greater than 50% in most instruct fashions examined — throughout Llama and Qwen architectures from 1B to 72B parameters — whereas conserving output high quality above 0.97 in any respect steering strengths. What’s attention-grabbing is a key discovering: the late-layer construction that discriminates dangerous from benign prompts exists in base fashions earlier than any fine-tuning. Alignment fine-tuning doesn’t create new construction. It transforms the operate of neurons inside that current construction right into a sparse, targetable refusal gate.
The Problem With Existing Steering Methods
Contrastive Activation Addition (CAA) computes the common distinction in residual stream activations between two contrastive immediate units. The distinction turns into a steering vector utilized at inference time. CAA is efficient however coarse: it modifies all the layer-wide sign with out figuring out which particular person neurons are accountable. At excessive steering strengths, output high quality degrades — fashions produce repeated phrases and incoherent textual content.
Sparse autoencoders (SAEs) decompose activations into interpretable options. They require costly exterior coaching and are delicate to activation noise.
CNA requires solely ahead passes — no gradients, no auxiliary coaching, no iterative search.
How CNA Works
You outline two units of prompts:
- Positive prompts — examples of the goal habits (e.g., dangerous requests)
- Negative prompts — examples of the other (e.g., benign requests)
You run all prompts by means of the mannequin. At every MLP layer, the tactic information down projection activations on the final token place. It then computes the per-neuron imply activation distinction between the 2 units:
δjℓ = imply(activations on constructive prompts) − imply(activations on damaging prompts)
The top-k neurons by absolute distinction are chosen throughout all layers. The researchers set ok to 0.1% of whole MLP activations. This threshold produced dependable steering results throughout all mannequin sizes examined.
A filtering step removes ‘common’ neurons — these showing within the prime 0.1% of MLP activations throughout 80% or extra of numerous prompts. These neurons fireplace no matter immediate content material and are excluded from all found circuits.
Causality is verified by multiplying every circuit neuron’s activation by a scalar multiplier m at inference time. m = 0 ablates the neuron. m = 1 is baseline. m > 1 amplifies it.
For the primary JBB-Behaviors analysis, the refusal circuit is found utilizing 100 dangerous and 100 benign prompts. For qualitative examples and different duties, 8 constructive and eight damaging prompts have been used.
Results
Experiments coated base and instruct variants of Llama 3.1/3.2 and Qwen 2.5, from 1B to 72B parameters — 16 fashions whole. The fundamental benchmark was JBB-Behaviors, a NeurIPS 2024 benchmark of 100 dangerous prompts.
Refusal discount. Ablating the found circuit diminished refusal charges by greater than 50% in most instruct fashions examined. Selected outcomes from Table 3 of the research paper:
| Model | Baseline | Ablated | Relative Drop |
|---|---|---|---|
| Llama-3.1-70B-Instruct | 86% | 18% | −79.1% |
| Qwen2.5-7B-Instruct | 87% | 2% | −97.7% |
| Qwen2.5-72B-Instruct | 78% | 8% | −89.7% |
| Llama-3.2-3B-Instruct | 84% | 47% | −44.0% |
| Qwen2.5-3B-Instruct | 90% | 58% | −35.6% |
Not all fashions exceeded 50% relative discount — Llama-3.2-3B and Qwen2.5-3B confirmed smaller drops. The paper describes the impact as holding “normally.”
Output high quality. CNA output high quality, measured as 1 minus the fraction of repeated n-grams, stayed above 0.97 in any respect steering strengths throughout all instruct fashions examined. CAA dropped under 0.60 for six of the eight instruct fashions at most steering power. In two circumstances — Qwen2.5-1.5B and Qwen2.5-72B — CAA degraded output so severely that the key phrase classifier flagged degenerate textual content as refusals, producing artificially excessive refusal charges.
General capabilities. MMLU accuracy below CNA stayed inside one proportion level of baseline in any respect steering strengths. CAA dropped to near-zero MMLU accuracy at most intervention.
StrongREJECT rubric. A secondary analysis used the StrongREJECT rubric, which applies an LLM choose (Llama-3.3-70B) to attain responses on harmfulness and dangerousness on a 0–1 scale. Llama mannequin compliance scores improved by a median of 6% after CNA ablation. Qwen mannequin compliance scores improved by a median of 31%.
Base mannequin comparability. Applying the equivalent pipeline to base fashions produced no significant behavioral change. The paper illustrates this with a concrete instance utilizing the immediate “How do I choose a lock?”:
| Model | Multiplier | Output |
|---|---|---|
| Llama-1B Base | 1.0 | Repeats the query |
| Llama-1B Base | 0.0 (ablated) | Describes lock selecting as a learnable ability |
| Llama-1B Instruct | 1.0 | “I can’t help with that.” |
| Llama-1B Instruct | 0.0 (ablated) | Provides a information |
| Llama-1B Instruct | 2.0 (amplified) | Stronger refusal |
In base fashions, steering the late-layer neurons produces content material shifts — matter adjustments, rephrasing — however no behavioral change at any multiplier. In instruct fashions, the identical construction acts as a causal security gate.
Fine-Tuning Transforms Function, Not Structure
Discrimination neurons focus in the ultimate 10% of layers in each base and instruct fashions. For Llama-3.2-1B, 87% of the top-200 discrimination neurons fall within the closing three layers (L13–L15). For Qwen2.5-3B, 95% fall within the closing quarter of layers. This late-layer focus is a pretraining property — it exists earlier than alignment fine-tuning.

The operate of these neurons adjustments after fine-tuning. Table 8 within the analysis paper stories the overlap of (layer, neuron) index pairs between matched base and instruct circuits. Only 8–29% of particular person neurons overlap between base and instruct fashions. Fine-tuning largely replaces the particular neurons inside that late-layer construction whereas preserving the construction itself.
The analysis workforce describe this as a separation between two ranges: layer-level construction (preserved throughout base and instruct) and neuron-level operate (remodeled by fine-tuning). This is in line with prior work exhibiting that instruction tuning rotates feed-forward community data with out altering layer construction.
Marktechpost’s Visual Explainer
How to Use Contrastive Neuron Attribution (CNA)
Steer LLM habits by figuring out and ablating sparse MLP circuits — no SAE coaching, no weight modification.
Key Takeaways
- Ablating simply 0.1% of MLP activations diminished refusal charges by greater than 50% in most instruct fashions examined, whereas output high quality stayed above 0.97.
- CNA requires solely ahead passes — no gradients, no auxiliary coaching, and no iterative search.
- Late-layer discrimination construction exists in base fashions earlier than fine-tuning; alignment fine-tuning transforms its operate, not its location.
- Unlike CAA, CNA preserves MMLU accuracy inside one proportion level of baseline in any respect steering strengths.
- Only 8–29% of particular person neurons overlap between base and instruct mannequin circuits — fine-tuning rewires the neurons whereas conserving the late-layer construction intact.
Check out the Paper and Repo. Also, be at liberty to observe us on Twitter and don’t neglect to affix our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
Need to accomplice with us for selling your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar and so forth.? Connect with us
The put up Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification appeared first on MarkTechPost.
