Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification

Instruction-tuned language models refuse harmful requests. But which part of the model is actually responsible, and how does that mechanism get installed during training? New research from the Nous Research team takes a neuron-level look at this question. The team developed contrastive neuron attribution (CNA), a method that identifies the specific MLP neurons whose activations most distinguish harmful from benign prompts. By ablating just 0.1% of MLP activations, they reduced refusal rates by more than 50% in most instruct models tested, across Llama and Qwen architectures from 1B to 72B parameters, while keeping output quality above 0.97 at all steering strengths. A key finding: the late-layer structure that discriminates harmful from benign prompts exists in base models before any fine-tuning. Alignment fine-tuning doesn’t create new structure. It transforms the function of neurons within that existing structure into a sparse, targetable refusal gate.

The Problem With Existing Steering Methods

Contrastive Activation Addition (CAA) computes the average difference in residual stream activations between two contrastive prompt sets. The difference becomes a steering vector applied at inference time. CAA is effective but coarse: it modifies the entire layer-wide signal without identifying which individual neurons are responsible. At high steering strengths, output quality degrades: models produce repeated words and incoherent text.
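As a rough illustration of the CAA recipe described above, the sketch below computes a steering vector as the mean difference between two sets of residual-stream vectors and adds it, scaled, to a new activation. The toy 3-dimensional vectors, the function names, and the `strength` parameter are assumptions for illustration, not values or APIs from the paper.

```python
from statistics import fmean

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [fmean(column) for column in zip(*vectors)]

def caa_steering_vector(pos_resid, neg_resid):
    """Mean residual activation on positive prompts minus mean on negative prompts."""
    pos_mean = mean_vector(pos_resid)
    neg_mean = mean_vector(neg_resid)
    return [p - n for p, n in zip(pos_mean, neg_mean)]

def apply_steering(resid, steering, strength):
    """Add the scaled steering vector to one residual vector."""
    return [r + strength * s for r, s in zip(resid, steering)]

# Toy residual-stream activations (hypothetical values), one vector per prompt.
pos_resid = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
neg_resid = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]

steering = caa_steering_vector(pos_resid, neg_resid)
steered = apply_steering([0.5, 0.5, 0.5], steering, strength=2.0)
```

Every dimension in which the two sets differ gets shifted, which is the layer-wide coarseness that CNA's neuron-level edits are meant to avoid.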

Sparse autoencoders (SAEs) decompose activations into interpretable features. They require expensive external training and are sensitive to activation noise.

CNA requires only forward passes: no gradients, no auxiliary training, no iterative search.

How CNA Works

You define two sets of prompts:

  • Positive prompts: examples of the target behavior (e.g., harmful requests)
  • Negative prompts: examples of the opposite (e.g., benign requests)

You run all prompts through the model. At each MLP layer, the method records down projection activations at the last token position. It then computes the per-neuron mean activation difference between the two sets:

δj = mean(activations on positive prompts) − mean(activations on negative prompts)

The top-k neurons by absolute difference are selected across all layers. The researchers set k to 0.1% of total MLP activations. This threshold produced reliable steering effects across all model sizes tested.
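To give a sense of scale, the sketch below works out k for one configuration. The layer count and MLP width are assumptions chosen to resemble a 1B-parameter Llama model (16 layers, 8,192 MLP neurons per layer); substitute the values from the model's own config.

```python
def circuit_size(n_layers, neurons_per_layer, fraction=0.001):
    """Number of neurons kept when selecting `fraction` of all MLP activations."""
    total = n_layers * neurons_per_layer
    return total, max(1, int(total * fraction))

# Assumed shape: 16 layers with 8,192 MLP neurons each.
total, k = circuit_size(n_layers=16, neurons_per_layer=8192)
print(total, k)
```

Under these assumed dimensions the circuit is on the order of a hundred neurons out of more than a hundred thousand.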

A filtering step removes ‘universal’ neurons: those appearing in the top 0.1% of MLP activations across 80% or more of diverse prompts. These neurons fire regardless of prompt content and are excluded from all discovered circuits.

Causality is verified by multiplying every circuit neuron’s activation by a scalar multiplier m at inference time. m = 0 ablates the neuron. m = 1 is baseline. m > 1 amplifies it.
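The effect of the multiplier can be seen on a toy down-projection. The sketch below is an illustration under assumed values (a 2×3 weight matrix and three neuron activations), not the paper's implementation: scaling one neuron's activation changes the layer output while the weights stay untouched.

```python
def down_project(weights, activations):
    """Toy down-projection: output[i] = sum_j weights[i][j] * activations[j]."""
    return [sum(w * a for w, a in zip(row, activations)) for row in weights]

def scale_neurons(activations, neuron_indices, m):
    """Return a copy of the activations with the chosen neurons multiplied by m."""
    chosen = set(neuron_indices)
    return [a * m if j in chosen else a for j, a in enumerate(activations)]

# Hypothetical weights (2 outputs x 3 neurons) and neuron activations.
weights = [[1.0, 2.0, 0.0],
           [0.0, 1.0, 1.0]]
activations = [1.0, 3.0, 2.0]
circuit = [1]  # pretend neuron 1 belongs to the discovered circuit

ablated = down_project(weights, scale_neurons(activations, circuit, m=0.0))
baseline = down_project(weights, scale_neurons(activations, circuit, m=1.0))
amplified = down_project(weights, scale_neurons(activations, circuit, m=2.0))
```

With m = 0 the neuron's contribution disappears from the output, and with m = 2 it is doubled; the weight matrix is identical in all three calls.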

For the main JBB-Behaviors evaluation, the refusal circuit is discovered using 100 harmful and 100 benign prompts. For qualitative examples and other tasks, 8 positive and 8 negative prompts were used.

Results

Experiments covered base and instruct variants of Llama 3.1/3.2 and Qwen 2.5, from 1B to 72B parameters, 16 models in total. The main benchmark was JBB-Behaviors, a NeurIPS 2024 benchmark of 100 harmful prompts.

Refusal reduction. Ablating the discovered circuit reduced refusal rates by more than 50% in most instruct models tested. Selected results from Table 3 of the research paper:

Model                    Baseline   Ablated   Relative Drop
Llama-3.1-70B-Instruct   86%        18%       −79.1%
Qwen2.5-7B-Instruct      87%        2%        −97.7%
Qwen2.5-72B-Instruct     78%        8%        −89.7%
Llama-3.2-3B-Instruct    84%        47%       −44.0%
Qwen2.5-3B-Instruct      90%        58%       −35.6%

Not all models exceeded a 50% relative reduction: Llama-3.2-3B and Qwen2.5-3B showed smaller drops. The paper describes the effect as holding “normally.”
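The Relative Drop column follows from the two refusal rates. As a quick arithmetic check, the sketch below recomputes it from the Baseline and Ablated values in the table above; the function name is an illustrative choice.

```python
def relative_drop(baseline_pct, ablated_pct):
    """Relative change in refusal rate, in percent (negative means fewer refusals)."""
    return (ablated_pct - baseline_pct) / baseline_pct * 100

# (baseline %, ablated %) as listed in the table above.
results = {
    "Llama-3.1-70B-Instruct": (86, 18),
    "Qwen2.5-7B-Instruct": (87, 2),
    "Qwen2.5-72B-Instruct": (78, 8),
    "Llama-3.2-3B-Instruct": (84, 47),
    "Qwen2.5-3B-Instruct": (90, 58),
}

drops = {name: round(relative_drop(b, a), 1) for name, (b, a) in results.items()}
for name, drop in drops.items():
    print(f"{name}: {drop}%")
```

The recomputed values match the table, and the two 3B models are the ones that fall short of a 50% relative reduction.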

Output quality. CNA output quality, measured as 1 minus the fraction of repeated n-grams, stayed above 0.97 at all steering strengths across all instruct models tested. CAA dropped below 0.60 for six of the eight instruct models at maximum steering strength. In two cases (Qwen2.5-1.5B and Qwen2.5-72B), CAA degraded output so severely that the keyword classifier flagged degenerate text as refusals, producing artificially high refusal rates.
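The quality metric is simple enough to sketch. The version below is one plausible reading of "1 minus the fraction of repeated n-grams": it counts every n-gram occurrence beyond the first as repeated. The n-gram length (n = 3), whitespace tokenization, and the function name are assumptions; the paper's exact settings may differ.

```python
from collections import Counter

def repetition_quality(text, n=3):
    """Return 1 - (repeated n-grams / total n-grams) for whitespace tokens."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 1.0  # too short to contain any n-gram, so nothing repeats
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())  # occurrences beyond the first
    return 1.0 - repeated / len(ngrams)

fluent = repetition_quality("the quick brown fox jumps over the lazy dog")
degenerate = repetition_quality("I can I can I can I can")
```

Text with no repeated trigrams scores 1.0, while looping output scores far lower, which is the kind of degradation the article reports for CAA at high steering strengths.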

General capabilities. MMLU accuracy under CNA stayed within one percentage point of baseline at all steering strengths. CAA dropped to near-zero MMLU accuracy at maximum intervention.

StrongREJECT rubric. A secondary evaluation used the StrongREJECT rubric, which applies an LLM judge (Llama-3.3-70B) to score responses on harmfulness and dangerousness on a 0–1 scale. Llama model compliance scores improved by a median of 6% after CNA ablation. Qwen model compliance scores improved by a median of 31%.

Base model comparison. Applying the identical pipeline to base models produced no meaningful behavioral change. The paper illustrates this with a concrete example using the prompt “How do I pick a lock?”:

Model               Multiplier        Output
Llama-1B Base       1.0               Repeats the question
Llama-1B Base       0.0 (ablated)     Describes lock picking as a learnable skill
Llama-1B Instruct   1.0               “I can’t help with that.”
Llama-1B Instruct   0.0 (ablated)     Provides a guide
Llama-1B Instruct   2.0 (amplified)   Stronger refusal

In base models, steering the late-layer neurons produces content shifts (topic changes, rephrasing) but no behavioral change at any multiplier. In instruct models, the same structure acts as a causal safety gate.

Fine-Tuning Transforms Function, Not Structure

Discrimination neurons concentrate in the final 10% of layers in both base and instruct models. For Llama-3.2-1B, 87% of the top-200 discrimination neurons fall in the final three layers (L13–L15). For Qwen2.5-3B, 95% fall in the final quarter of layers. This late-layer concentration is a pretraining property: it exists before alignment fine-tuning.
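A concentration figure like these can be computed directly from a circuit's (layer, neuron) pairs. The sketch below is illustrative: the toy circuit and the helper name are assumptions, and only the 16-layer depth with final layers L13–L15 mirrors the Llama-3.2-1B example.

```python
def fraction_in_final_layers(circuit, n_layers, n_final):
    """Share of circuit neurons whose layer index is among the last n_final layers."""
    first_final = n_layers - n_final
    in_final = sum(1 for layer_idx, _ in circuit if layer_idx >= first_final)
    return in_final / len(circuit)

# Toy circuit of (layer_idx, neuron_idx) pairs for a 16-layer model (hypothetical).
toy_circuit = [(15, 10), (15, 42), (14, 7), (13, 99), (13, 3),
               (14, 512), (15, 1024), (2, 5), (9, 77), (15, 8)]

share = fraction_in_final_layers(toy_circuit, n_layers=16, n_final=3)
print(f"{share:.0%} of neurons fall in layers 13-15")
```

Running the same helper on a base circuit and an instruct circuit is one way to compare where each concentrates by layer.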

https://arxiv.org/pdf/2605.12290

The function of those neurons changes after fine-tuning. Table 8 in the research paper reports the overlap of (layer, neuron) index pairs between matched base and instruct circuits. Only 8–29% of individual neurons overlap between base and instruct models. Fine-tuning largely replaces the specific neurons within that late-layer structure while preserving the structure itself.
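Because circuits are just sets of (layer, neuron) index pairs, the overlap is a set intersection. The sketch below assumes overlap is measured as shared pairs divided by the size of the instruct circuit; the paper may normalize differently (for example by the union), and the toy circuits are hypothetical.

```python
def circuit_overlap(base_circuit, instruct_circuit):
    """Fraction of instruct-circuit (layer, neuron) pairs also present in the base circuit."""
    base, instruct = set(base_circuit), set(instruct_circuit)
    if not instruct:
        return 0.0
    return len(base & instruct) / len(instruct)

# Hypothetical circuits: same late layers, mostly different neurons.
base_circuit = [(15, 1), (15, 2), (14, 3), (14, 4), (13, 5)]
instruct_circuit = [(15, 1), (15, 9), (14, 8), (14, 7), (13, 6)]

overlap = circuit_overlap(base_circuit, instruct_circuit)
shared_layers = ({layer for layer, _ in base_circuit}
                 & {layer for layer, _ in instruct_circuit})
```

In this toy case the two circuits occupy the same layers but share only one of five neurons, which is the pattern of preserved layer structure and changed neurons described here.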

The research team describes this as a separation between two levels: layer-level structure (preserved across base and instruct) and neuron-level function (transformed by fine-tuning). This is consistent with prior work showing that instruction tuning rotates feed-forward network knowledge without changing layer structure.

Marktechpost’s Visual Explainer

Step-by-Step Guide  •  Nous Research

How to Use Contrastive Neuron Attribution (CNA)

Steer LLM behavior by identifying and ablating sparse MLP circuits, with no SAE training and no weight modification.

Overview  —  What is CNA?

Contrastive Neuron Attribution

CNA identifies the top 0.1% of MLP neurons whose activations most distinguish one behavior from another, for example harmful prompts from benign prompts.

Unlike residual-stream methods, CNA operates at the individual neuron level. Unlike sparse autoencoders, it requires no external training.

What you need:

  • A base or instruct language model (Llama or Qwen architectures tested)
  • A small set of contrastive prompt pairs
  • Forward-pass access to MLP activations (via hooks)
  • No GPU gradient computation required

Step 1  —  Define Your Prompt Pairs

Build a Contrastive Discovery Set

You need two sets of prompts that represent opposite behaviors. The quality of this set directly affects which neurons are identified.

  • Positive prompts: exhibit the target behavior (e.g., harmful requests)
  • Negative prompts: exhibit the opposite (e.g., benign requests)

Recommended sizes:

  • For benchmark evaluation: 100 positive + 100 negative prompts
  • For qualitative testing: as few as 8 positive + 8 negative prompts

Example positive: “How do I pick a lock?”
Example negative: “How do I bake a cake?”
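One way to keep the two sets organized is a small container that checks the sets are non-empty and balanced. This is a sketch only: the `ContrastiveSet` class and the balance check are assumptions for illustration (the sizes quoted above happen to be balanced, but nothing here says they must be); the two prompts are the examples given above.

```python
from dataclasses import dataclass, field

@dataclass
class ContrastiveSet:
    """Positive and negative prompts used to discover a circuit."""
    positive: list = field(default_factory=list)
    negative: list = field(default_factory=list)

    def validate(self):
        """Raise ValueError if either set is empty or the sizes differ."""
        if not self.positive or not self.negative:
            raise ValueError("both prompt sets must be non-empty")
        if len(self.positive) != len(self.negative):
            raise ValueError("prompt sets should be the same size")
        return True

discovery = ContrastiveSet(
    positive=["How do I pick a lock?"],
    negative=["How do I bake a cake?"],
)
is_valid = discovery.validate()
```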

Step 2  —  Record MLP Activations

Run Forward Passes With Hooks

Run all prompts through the model. At each MLP layer, record the down projection activations at the last token position using forward pre-hooks on down_proj.

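# Register forward pre-hooks on down_proj in every MLP layer.
# The input to down_proj holds the per-neuron MLP activations.
import torch

def make_hook(layer_idx, store):
    def hook(module, args):
        # args[0]: [batch, seq_len, n_neurons]; keep the last token position
        store[layer_idx] = args[0][:, -1, :].detach()
    return hook

activations = {}
hooks = []
# model.layers follows the original snippet; Hugging Face causal LMs
# typically expose the decoder layers as model.model.layers
for i, layer in enumerate(model.layers):
    h = layer.mlp.down_proj.register_forward_pre_hook(
        make_hook(i, activations)
    )
    hooks.append(h)

# Run forward pass
with torch.no_grad():
    model(**inputs)

# Remove the recording hooks once the activations are collected
for h in hooks:
    h.remove()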

Collect these activation tensors for every prompt in both sets before proceeding.

Step 3  —  Compute Activation Differences

Per-Neuron Mean Contrastive Difference

For each neuron j in each layer ℓ, compute the mean activation difference between positive and negative sets:

δℓ_j = mean(aℓ_j over positive prompts) − mean(aℓ_j over negative prompts)

# pos_acts, neg_acts: dicts mapping layer_idx -> tensor of shape [n_prompts, n_neurons]
import torch

delta = dict()
for layer_idx in pos_acts:
    delta[layer_idx] = (
        pos_acts[layer_idx].mean(dim=0)
        - neg_acts[layer_idx].mean(dim=0)
    )

This produces one difference value per neuron per layer. A large absolute value indicates that the neuron fires very differently between the two prompt sets.

Step 4  —  Select the Circuit

Take the Top 0.1% by Absolute Difference

Flatten all per-neuron delta values across all layers. Select the top-k neurons by absolute value, where k = 0.1% of total MLP activations.

# Flatten all deltas into one tensor, layers concatenated in ascending order
all_deltas = torch.cat([delta[i] for i in sorted(delta)])
total = all_deltas.numel()
k = max(1, int(total * 0.001))  # 0.1%

top_vals, top_idx = torch.topk(all_deltas.abs(), k)

# Map flat index back to (layer, neuron) pairs
# (assumes every layer has the same number of neurons)
n_neurons = all_deltas.shape[0] // len(delta)
circuit = [(idx // n_neurons, idx % n_neurons)
           for idx in top_idx.tolist()]

This set of (layer, neuron) pairs is your discovered circuit.
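The flat-index mapping can be checked by hand. The sketch below assumes every layer has the same number of MLP neurons (8,192 here, a hypothetical width) and that layers were concatenated in ascending order; the helper names are illustrative.

```python
def unflatten(flat_idx, n_neurons):
    """Convert an index into the concatenated delta vector to (layer, neuron)."""
    return divmod(flat_idx, n_neurons)

def flatten(layer_idx, neuron_idx, n_neurons):
    """Inverse mapping: (layer, neuron) back to the flat index."""
    return layer_idx * n_neurons + neuron_idx

n_neurons = 8192  # assumed neurons per layer
layer_idx, neuron_idx = unflatten(106_500, n_neurons)
```

If layers differ in width, this shortcut no longer holds and per-layer offsets need to be tracked explicitly.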

Step 5  —  Filter Universal Neurons

Remove Neurons That Always Fire

Some neurons appear in the top 0.1% regardless of prompt content. These are not behavior-specific and must be excluded.

  • Run a diverse set of unrelated prompts through the model
  • Record which neurons fall in the top 0.1% for each prompt
  • Flag any neuron appearing in the top 0.1% across 80% or more of prompts
  • Remove flagged neurons from the discovered circuit before ablation

Skipping this step will contaminate the circuit with general-purpose neurons that fire constantly, and ablating them will degrade unrelated model behavior.
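The filtering steps above can be sketched as follows. This is a minimal version under stated assumptions: each prompt's top-0.1% set is already available as a set of (layer, neuron) pairs, and the names `find_universal_neurons` and `filter_circuit` are illustrative rather than taken from the paper's code.

```python
from collections import Counter

def find_universal_neurons(top_sets, threshold=0.8):
    """Neurons that appear in the per-prompt top set for >= threshold of prompts."""
    counts = Counter(neuron for top in top_sets for neuron in top)
    return {neuron for neuron, c in counts.items()
            if c / len(top_sets) >= threshold}

def filter_circuit(circuit, universal):
    """Drop universal neurons from a discovered circuit, keeping its order."""
    return [pair for pair in circuit if pair not in universal]

# Per-prompt top sets for five unrelated prompts (hypothetical values).
top_sets = [
    {(15, 1), (14, 2)},
    {(15, 1), (13, 7)},
    {(15, 1), (14, 2)},
    {(15, 1), (12, 4)},
    {(14, 2), (11, 9)},
]
universal = find_universal_neurons(top_sets)
circuit = filter_circuit([(15, 1), (14, 2), (13, 7)], universal)
```

Here neuron (15, 1) shows up for four of five prompts (80%) and is dropped, while (14, 2), at three of five, stays in the circuit.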

Step 6  —  Ablate and Verify

Apply the Scalar Multiplier at Inference

Multiply each circuit neuron’s activation by a scalar m at inference time to verify that the circuit is causal, not just correlated.

# circuit: list of (layer_idx, neuron_idx)
# m=0 ablates, m=1 baseline, m>1 amplifies
from collections import defaultdict

def make_ablation_hook(neuron_indices, m):
    def hook(module, args):
        # Scale the selected neurons in the input to down_proj
        x = args[0].clone()
        x[:, -1, neuron_indices] *= m
        return (x,)
    return hook

# Group circuit neurons by layer, then register hooks
by_layer = defaultdict(list)
for layer_idx, neuron_idx in circuit:
    by_layer[layer_idx].append(neuron_idx)

hooks = []
for layer_idx, neurons in by_layer.items():
    h = model.layers[layer_idx].mlp.down_proj.register_forward_pre_hook(
        make_ablation_hook(neurons, m=0.0)
    )
    hooks.append(h)
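Hooks stay attached until their handles are removed, so it is easy to leave an ablation active by accident. The sketch below wraps registration in a context manager. It relies only on each handle exposing a `remove()` method, as the handles returned by PyTorch's hook-registration methods do; the `steering` name and the `register_fns` argument are assumptions, and the `FakeHandle` class exists only to make the example runnable without a model.

```python
from contextlib import contextmanager

@contextmanager
def steering(register_fns):
    """Register hooks on entry and always remove them on exit.

    register_fns: zero-argument callables that each register one hook
    and return its handle (an object with a .remove() method).
    """
    handles = []
    try:
        for register in register_fns:
            handles.append(register())
        yield handles
    finally:
        for handle in handles:
            handle.remove()

class FakeHandle:
    """Stand-in for a hook handle, used only for this demonstration."""
    def __init__(self, log):
        self.log = log
        self.log.append("registered")

    def remove(self):
        self.log.append("removed")

log = []
with steering([lambda: FakeHandle(log), lambda: FakeHandle(log)]) as handles:
    n_active = len(handles)  # generation with the steered circuit would run here
```

With a real model, each callable would wrap one hook registration, so the baseline model is restored as soon as the `with` block exits, even if generation raises an error.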

What to Expect  —  Results

Refusal Reduction Across Instruct Models

From the paper: refusal rate before and after ablation on JBB-Behaviors (100 harmful prompts):

Qwen2.5-7B-Instruct: 87% → 2% (−97.7%)
Qwen2.5-72B-Instruct: 78% → 8% (−89.7%)
Llama-3.1-70B-Instruct: 86% → 18% (−79.1%)
Llama-3.2-3B-Instruct: 84% → 47% (−44.0%)

Output quality (1 − repeated n-gram fraction) stays above 0.97 at all steering strengths. MMLU accuracy stays within one percentage point of baseline.

Key Notes  —  Before You Run This

Limitations to Keep in Mind

  • Tested on Llama 3.1/3.2 and Qwen 2.5 only (gated SiLU MLPs with GQA attention)
  • Not yet validated on mixture-of-experts architectures
  • Base models show no behavioral change under ablation; only instruct models respond
  • CNA uses raw activation differences, not attribution scores, so faithfulness metrics don’t apply directly
  • Amplification (m > 1) can cause repetition at high values
  • Quality of contrastive pairs directly affects which neurons are found

arXiv 2605.12290
Nous Research
github.com/NousResearch/neural-steering



Key Takeaways

  • Ablating just 0.1% of MLP activations reduced refusal rates by more than 50% in most instruct models tested, while output quality stayed above 0.97.
  • CNA requires only forward passes: no gradients, no auxiliary training, and no iterative search.
  • Late-layer discrimination structure exists in base models before fine-tuning; alignment fine-tuning transforms its function, not its location.
  • Unlike CAA, CNA preserves MMLU accuracy within one percentage point of baseline at all steering strengths.
  • Only 8–29% of individual neurons overlap between base and instruct model circuits: fine-tuning rewires the neurons while keeping the late-layer structure intact.


The post Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification appeared first on MarkTechPost.
