This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee.
This post is a preview of our upcoming paper, which will go into more detail on our current understanding of refusal.
We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review.
Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."
We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model...
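For concreteness, here is a minimal sketch of what the two interventions can look like, written as PyTorch forward hooks. This is illustrative only, not our actual implementation: `model`, `refusal_dir`, and the `model.transformer.h` block layout are all assumptions, and the direction itself would be estimated elsewhere (e.g. as a difference of mean activations on harmful vs. harmless instructions).

```python
import torch

def make_direction_hook(direction: torch.Tensor, mode: str = "ablate", scale: float = 1.0):
    """Hook that edits the residual stream along a single direction."""
    d = direction / direction.norm()

    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        if mode == "ablate":
            # Project the direction out of every position: x <- x - (x . d) d
            resid = resid - (resid @ d).unsqueeze(-1) * d
        else:
            # Add the direction in, pushing activations toward refusal
            resid = resid + scale * d
        return (resid,) + output[1:] if isinstance(output, tuple) else resid

    return hook

# Usage sketch: register on every block before generating.
# handles = [blk.register_forward_hook(make_direction_hook(refusal_dir, "ablate"))
#            for blk in model.transformer.h]
# ...generate...
# for h in handles: h.remove()
```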
I started a dialogue with @Alex_Altair a few months ago about the tractability of certain agent foundations problems, especially the agent-like structure problem. I saw it as insufficiently well-defined to make progress on anytime soon. I thought the lack of similar results in easy settings, the fuzziness of the "agent"/"robustly optimizes" concept, and the difficulty of proving things about a program's internals given its behavior all pointed against working on this. But it turned out that we maybe didn't disagree much on tractability; it's just that Alex...
This is a write-up of the Google DeepMind mechanistic interpretability team’s investigation of how language models represent facts. It is a sequence of 5 posts: we recommend prioritising post 1 and thinking of it as the “main body” of our paper, with posts 2 to 5 as a series of appendices to be skimmed or dipped into in any order.
Reverse-engineering circuits with superposition is a major unsolved problem in mechanistic interpretability: models use clever compression schemes to represent more features than they have dimensions/neurons, in a highly distributed and polysemantic way. Most existing mech interp work exploits the fact that certain circuits involve sparse sets of model components (e.g. a sparse set of heads), and we don’t know how to deal with distributed circuits,...
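As a toy picture of what "more features than dimensions" can look like (an illustrative numerical sketch, not anything from our investigation): if features are sparse, a model can assign each one a random, nearly-orthogonal direction in a smaller space and still read them out approximately.

```python
import numpy as np

# Toy superposition: 256 sparse "features" stored in a 64-dimensional space.
# All numbers here are illustrative.
rng = np.random.default_rng(0)
n_features, d_model, n_active = 256, 64, 4

# One random unit-norm direction per feature (rows of W).
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse feature vector: only a handful of features are active at once.
f = np.zeros(n_features)
active = rng.choice(n_features, size=n_active, replace=False)
f[active] = 1.0

x = f @ W          # compressed 64-dim representation
f_hat = x @ W.T    # naive linear readout of all 256 features

# Active features read out near 1, while the typical inactive feature only
# picks up small interference terms from the near-orthogonal directions.
print("active readouts:", np.round(f_hat[active], 2))
print("mean |inactive readout|:", round(float(np.abs(np.delete(f_hat, active)).mean()), 3))
```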
I hadn't seen the latter, thanks for sharing!
We develop a technique to try to detect whether an NN is doing planning internally. We apply the decoder to the network's intermediate representations to see whether it's representing the states it's planning through. We successfully reveal intermediate states in a simple Game of Life model, but find no evidence of planning in an AlphaZero chess model. We think the idea won't work in its current state for real-world NNs, because they use higher-level, abstract representations for planning that our current technique cannot decode. Please comment if you have ideas that might work for detecting more abstract ways the NN could be planning.
To make ML safe, it's important to know whether the network is performing mesa-optimization, and if so, what...
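For readers who want the mechanics, here is a minimal sketch of the probing loop described in the summary above (illustrative PyTorch; `model.blocks`, `decoder`, and `board` are stand-in names rather than our actual interfaces): cache intermediate activations with forward hooks, then apply the state decoder to each layer and check whether future states show up before the output.

```python
import torch

def decode_intermediate_states(model, decoder, board, layer_indices):
    """Apply a state decoder to the activations of selected layers."""
    cached = {}

    def save(idx):
        def hook(module, inputs, output):
            cached[idx] = output.detach()
        return hook

    handles = [model.blocks[i].register_forward_hook(save(i)) for i in layer_indices]
    with torch.no_grad():
        model(board)
    for h in handles:
        h.remove()

    # If the network plans by stepping through states, the decoded outputs at
    # intermediate layers should look like future states, not just the input.
    return {i: decoder(act) for i, act in cached.items()}
```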
Rationality-related writings that are more comment-shaped than post-shaped. Please don't leave top-level comments here unless they're indistinguishable to me from something I would say here.
Frankfurt-style counterexamples for definitions of optimization
In "Bottle Caps Aren't Optimizers", I wrote about a type of definition of optimization that says system S is optimizing for goal G iff G has a higher value than it would if S didn't exist or were randomly scrambled. I argued against these definitions by providing a examples of systems that satisfy the criterion but are not optimizers. But today, I realized that I could repurpose Frankfurt cases to get examples of optimizers that don't satisfy this criterion.
A Frankfurt case is a thought experim...
Thanks! I'd be very curious to hear whether this meets your bar for being impressed, or what else it would take! Further evidence: