Refusal in LLMs is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib111, Wes Gurnee, Neel Nanda

This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee.

This post is a preview for our upcoming paper, which will provide more detail into our current understanding of refusal.

We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review.

Executive summary

Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."

We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model...

(Continue Reading – 2445 more words)

Nina Rimsky1h35

We do weight editing in the RepE paper (that's why it's called RepE instead of ActE)

I looked at the paper again and couldn't find anywhere where you do the type of weight-editing this post describes (extracting a representation and then changing the weights without optimization such that they cannot write to that direction).

The LoRRA approach mentioned in RepE finetunes the model to change representations which is different.

2Neel Nanda1h

There's been a fair amount of work on activation steering and similar techniques,, with bearing in eg sycophancy and truthfulness, where you find the vector and inject it eg Rimsky et al and Zou et al. It seems to work decently well. We found it hard to bypass refusal by steering and instead got it to work by ablation, which I haven't seen much elsewhere, but I could easily be missing references

1Nina Rimsky5h

I agree you investigate a bunch of the stuff I mentioned generally somewhere in the paper, but did you do this for refusal-removal in particular? I spent some time on this problem before and noticed that full refusal ablation is hard unless you get the technique/vector right, even though it’s easy to reduce refusal or add in a bunch of extra refusal. That’s why investigating all the technique parameters in the context of refusal in particular is valuable.

7Nina Rimsky13h

FWIW I published this Alignment Forum post on activation steering to bypass refusal (albeit an early variant that reduces coherence too much to be useful) which from what I can tell is the earliest work on linear residual-stream perturbations to modulate refusal in RLHF LLMs. I think this post is novel compared to both my work and RepE because they: * Demonstrate full ablation of the refusal behavior with much less effect on coherence / other capabilities compared to normal steering * Investigate projection thoroughly as an alternative to sweeping over vector magnitudes (rather than just stating that this is possible) * Find that using harmful/harmless instructions (rather than harmful vs. harmless/refusal responses) to generate a contrast vector is the most effective (whereas other works try one or the other), and also investigate which token position at which to extract the representation * Find that projecting away the (same, linear) feature at all layers improves upon steering at a single layer, which is different from standard activation steering * Test on many different models * Describe a way of turning this into a weight-edit Edit: (Want to flag that I strong-disagree-voted with your comment, and am not in the research group—it is not them "dogpiling") I do agree that RepE should be included in a "related work" section of a paper but generally people should be free to post research updates on LW/AF that don't have a complete thorough lit review / related work section. There are really very many activation-steering-esque papers/blogposts now, including refusal-bypassing-related ones, that all came out around the same time.

Superposition is not "just" neuron polysemanticity

Lawrence Chan

TL;DR: In this post, I distinguish between two related concepts in neural network interpretability: polysemanticity and superposition. Neuron polysemanticity is the observed phenomena that many neurons seem to fire (have large, positive activations) on multiple unrelated concepts. Superposition is a specific explanation for neuron (or attention head) polysemanticity, where a neural network represents more sparse features than there are neurons (or number of/dimension of attention heads) in near-orthogonal directions. I provide three ways neurons/attention heads can be polysemantic without superposition: non-neuron aligned orthogonal features, non-linear feature representations, and compositional representation without features. I conclude by listing a few reasons why it might be important to distinguish the two concepts.

Epistemic status: I wrote this “quickly” in about 12 hours, as otherwise it wouldn’t have come out at all. Think of...

(Continue Reading – 3632 more words)

5Lucius Bushnaq21h

Thank you, I've been hoping someone would write this disclaimer post. I'd add on another possible explanation for polysemanticity, which is that the model might be thinking in a limited number of linearly represented concepts, but those concepts need not match onto concepts humans are already familiar with. At least not all of them. Just because the simple meaning of a direction doesn't jump out at an interp researcher when they look at a couple of activating dataset examples doesn't mean it doesn't have one. Humans probably wouldn't even always recognise the concepts other humans think in on sight. Imagine a researcher who hasn't studied thermodynamics much looking at a direction in a model that tracks the estimated entropy of a thermodynamic system it's monitoring: 'It seems to sort of activate more when the system is warmer. But that's not all it's doing. Sometimes it also goes up when two separated pockets of different gases mix together, for example. Must be polysemantic.'

Lawrence Chan21h20

Thanks!

I was grouping that with “the computation may require mixing together ‘natural’ concepts” in my head. After all, entropy isn’t an observable in the environment, it’s something you derive to better model the environment. But I agree that “the concept may not be one you understand” seems more central.

A proposed method for forecasting transformative AI

Matthew Barnett

In 2021, I proposed measuring progress in the perplexity of language models and extrapolating past results to determine when language models were expected to reach roughly "human-level" performance. Here, I build on that approach by introducing a more systematic and precise method of forecasting progress in language modeling that employs scaling laws to make predictions.

The full report for this forecasting method can be found in this document. In this blog post I'll try to explain all the essential elements of the approach without providing excessive detail regarding the technical derivations.

This approach can be contrasted with Ajeya Cotra's Bio Anchors model, providing a new method for forecasting the arrival of transformative AI (TAI). I will tentatively call it the "Direct Approach", since it makes use of scaling laws...

(Continue Reading – 2931 more words)

Wei Dai1d40

I'm confused about how heterogeneity in data quality interacts with scaling. Surely training a LM on scientific papers would give different results from training it on web spam, but data quality is not an input to the scaling law... This makes me wonder whether your proposed forecasting method might have some kind of blind spot in this regard, for example failing to take into account that AI labs have probably already fed all the scientific papers they can into their training processes. If future LMs train on additional data that have little to do with science, could that keep reducing overall cross-entropy loss (as scientific papers become a smaller fraction of the overall corpus) but fail to increase scientific ability?

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey

25d

Over the last couple of years, mechanistic interpretability has seen substantial progress. Part of this progress has been enabled by the identification of superposition as a key barrier to understanding neural networks (Elhage et al., 2022) and the identification of sparse autoencoders as a solution to superposition (Sharkey et al., 2022; Cunningham et al., 2023; Bricken et al., 2023).

From our current vantage point, I think there’s a relatively clear roadmap toward a world where mechanistic interpretability is useful for safety. This post outlines my views on what progress in mechanistic interpretability looks like and what I think is achievable by the field in the next 2+ years. It represents a rough outline of what I plan to work on in the near future.

My thinking and work is, of course,...

(Continue Reading – 6507 more words)

Jason Gross1d10

We propose a simple fix: Use $L_{0 < p < 1}$ instead of $L_{1}$ , which seems to be a Pareto improvement over $L_{1}$ (at least in some real models, though results might be mixed) in terms of the number of features required to achieve a given reconstruction error.

When I was discussing better sparsity penalties with Lawrence, and the fact that I observed some instability in $L_{0 < p < 1}$ in toy models of super-position, he pointed out that the gradient of $L_{0 < p < 1}$ norm explodes near zero, meaning that features with "small errors" that cause them to h... (read more)

Improving Dictionary Learning with Gated Sparse Autoencoders

Senthooran Rajamanoharan, Arthur Conmy, lsgos, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

This is a linkpost for https://arxiv.org/abs/2404.16014

Authors: Senthooran Rajamanoharan*, Arthur Conmy*, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

A new paper from the Google DeepMind mech interp team: Improving Dictionary Learning with Gated Sparse Autoencoders!

Gated SAEs are a new Sparse Autoencoder architecture that seems to be a significant Pareto-improvement over normal SAEs, verified on models up to Gemma 7B. They are now our team's preferred way to train sparse autoencoders, and we'd love to see them adopted by the community! (Or to be convinced that it would be a bad idea for them to be adopted by the community!)

They achieve similar reconstruction with about half as many firing features, and while being either comparably or more interpretable (confidence interval for the increase is 0%-13%).

See Sen's Twitter summary, my Twitter summary, and the paper!

2Rohin Shah2d

This suggestion seems less expressive than (but similar in spirit to) the "rescale & shift" baseline we compare to in Figure 9. The rescale & shift baseline is sufficient to resolve shrinkage, but it doesn't capture all the benefits of Gated SAEs. The core point is that L1 regularization adds lots of biases, of which shrinkage is just one example, so you want to localize the effect of L1 as much as possible. In our setup L1 applies to ReLU(πgate(x)), so you might think of πgate as "tainted", and want to use it as little as possible. The only thing you really need L1 for is to deter the model from setting too many features active, i.e. you need it to apply to one bit per feature (whether that feature is on / off). The Heaviside step function makes sure we are extracting just that one bit, and relying on fmag for everything else.

3leogao2d

Great paper! The gating approach is an interesting way to learn the JumpReLU threshold and it's exciting that it works well. We've been working on some related directions at OpenAI based on similar intuitions about feature shrinking. Some questions: * Is b_mag still necessary in the gated autoencoder? * Did you sweep learning rates for the baseline and your approach? * How large is the dictionary of the autoencoder?

Arthur Conmy2d30

We use learning rate 0.0003 for all Gated SAE experiments, and also the GELU-1L baseline experiment. We swept for optimal baseline learning rates on GELU-1L for the baseline SAE to generate this value.

For the Pythia-2.8B and Gemma-7B baseline SAE experiments, we divided the L2 loss by $E | | x | |_{2}$ , motivated by wanting better hyperparameter transfer, and so changed learning rate to 0.001 or 0.00075 for all the runs (currently in Figure 1, only attention output pre-linear uses 0.00075. In the rerelease we'll state all the values used). We didn't see n... (read more)

3Neel Nanda2d

Re dictionary width, 2**17 (~131K) for most Gated SAEs, 3*(2**16) for baseline SAEs, except for the (Pythia-2.8B, Residual Stream) sites we used 2**15 for Gated and 3*(2**14) for baseline since early runs of these had lots of feature death. (This'll be added to the paper soon, sorry!). I'll leave the other Qs for my co-authors

Preventing model exfiltration with upload limits

Ryan Greenblatt

3mo

At some point in the future, AI developers will need to ensure that when they train sufficiently capable models, the weights of these models do not leave the developer’s control. Ensuring that weights are not exfiltrated seems crucial for preventing threat models related to both misalignment and misuse. The challenge of defending model weights has previously been discussed in a RAND report.

In this post, I’ll discuss a point related to preventing weight exfiltration that I think is important and under-discussed: unlike most other cases where a defender wants to secure data (e.g. emails of dissidents or source code), model weights are very large files. At the most extreme, it might be possible to set a limit on the total amount of data uploaded from your inference servers so that...

(Continue Reading – 4018 more words)

Beth Barnes2d74

If anyone wants to work on this or knows people who might, I'd be interested in funding work on this (or helping secure funding - I expect that to be pretty easy to do).

AI ALIGNMENT FORUM
AF

Recommended Sequences

AI Alignment Posts

Popular Comments

Recent Discussion

Executive summary