Recent Discussion


This is a linkpost for our paper Explaining grokking through circuit efficiency (https://arxiv.org/abs/2309.02390), which provides a general theory explaining when and why grokking (aka delayed generalisation) occurs, and makes several novel predictions that we experimentally confirm (abstract copied below). You might also enjoy our explainer on X/Twitter.

Abstract

One of the most surprising puzzles in neural network generalisation is grokking: a network with perfect training accuracy but poor generalisation will, upon further training, transition to perfect generalisation. We propose that grokking occurs when the task admits a generalising solution and a memorising solution, where the generalising solution is slower to learn but more efficient, producing larger logits with the same parameter norm. We hypothesise that memorising circuits become more inefficient with larger training datasets while generalising circuits do not, suggesting there is a critical dataset size at which memorisation and generalisation are equally efficient. …
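As a toy illustration of the efficiency notion (hypothetical numbers, not the paper's setup): if two circuits produce the same logit on the correct class but need different parameter norms to do so, weight decay makes the total loss favour the circuit that buys that logit more cheaply.

```python
import math

# Toy illustration (hypothetical numbers, not the paper's setup): two circuits
# produce the same logit on the correct class, but the generalising one needs a
# smaller parameter norm, i.e. it is more "efficient". With weight decay, the
# total loss then favours the generalising circuit.
target_logit = 10.0
weight_decay = 1e-2
cross_entropy = math.log(1 + math.exp(-target_logit))  # loss at this logit

for name, param_norm in [("memorising", 8.0), ("generalising", 2.0)]:
    decay_cost = weight_decay * param_norm ** 2
    print(f"{name}: CE={cross_entropy:.2e}, decay cost={decay_cost:.3f}, "
          f"total={cross_entropy + decay_cost:.3f}")
```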

My speculation for Omni-Grok in particular is that settings like MNIST already have two of the three ingredients for grokking (there are both memorising and generalising solutions, and the generalising solution is more efficient), and large parameter norms at initialisation supply the third ingredient (the generalising solution is learned more slowly), though I still don't know why.

Higher weight norm means a lower effective learning rate with Adam, no? In that paper they used a constant learning rate across weight norms, but Adam …
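A quick sanity check of that claim (a sketch, not from the thread): Adam's per-step update magnitude is roughly the learning rate regardless of gradient scale, so the relative change per step, roughly lr/‖w‖, shrinks as the weight norm grows.

```python
import torch

# Adam's update magnitude per step is ~lr regardless of gradient scale, so the
# *relative* change per step (~ lr / ||w||) is smaller at larger weight norms.
for scale in (1.0, 10.0):
    torch.manual_seed(0)
    w = torch.nn.Parameter(scale * torch.randn(1000))
    opt = torch.optim.Adam([w], lr=1e-3)
    w0 = w.detach().clone()
    for _ in range(10):
        opt.zero_grad()
        (w ** 2).sum().backward()     # arbitrary loss; gradient scales with ||w||
        opt.step()
    print(f"init scale {scale:>4}: relative change "
          f"{(w.detach() - w0).norm() / w0.norm():.4f}")
```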


In 2022, it was announced that a fairly simple method could be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand. Earlier, in 2021, it was announced that neural networks sometimes ‘grok’: that is, when trained on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn’t generalize), but then suddenly switch to understanding the ‘real’ solution in a way that generalizes. What’s going on with these discoveries? Are they all they’re cracked up to be, and if so, how do they work? In this episode, I talk to Vikrant Varma about his research getting to the bottom of these questions.

Topics we discuss:

    ...

    Daniel Filan: But I would’ve guessed that there wouldn’t be a significant complexity difference between the frequencies. I guess there’s a complexity difference in how many frequencies you use.

    Vikrant Varma: Yes. That’s one of the differences: how many you use and their relative strength and so on. Yeah, I’m not really sure. I think this is a question we pick out as a thing we would like to see future work on.

    My pet hypothesis here is that (a) by default, the network uses whichever frequencies were highest at initialization (for which there is significant …
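One hypothetical way to test the initialisation part of this hypothesis (assuming the modular-addition setting with p = 113, as in the original grokking task): measure the per-frequency Fourier energy of the embedding at initialisation, then compare with the frequencies the trained network ends up using.

```python
import torch

# Hypothetical check (assuming the modular-addition setting with p = 113):
# which frequencies carry the most energy in a randomly initialised embedding?
torch.manual_seed(0)
p, d_model = 113, 128
W_E = torch.randn(p, d_model) / d_model ** 0.5        # embedding at init

fft = torch.fft.rfft(W_E, dim=0)                      # (p // 2 + 1, d_model)
energy = fft.abs().pow(2).sum(dim=1)                  # energy per frequency
top5 = (energy[1:].argsort(descending=True)[:5] + 1).tolist()  # skip DC term
print("strongest frequencies at initialisation:", top5)
```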

    This is a linkpost for https://arxiv.org/abs/2404.16014

    Authors: Senthooran Rajamanoharan*, Arthur Conmy*, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

    A new paper from the Google DeepMind mech interp team: Improving Dictionary Learning with Gated Sparse Autoencoders! 

    Gated SAEs are a new Sparse Autoencoder architecture that seems to be a significant Pareto-improvement over normal SAEs, verified on models up to Gemma 7B. They are now our team's preferred way to train sparse autoencoders, and we'd love to see them adopted by the community! (Or to be convinced that it would be a bad idea for them to be adopted by the community!)

    They achieve similar reconstruction fidelity with about half as many firing features, while being comparably or more interpretable (the confidence interval for the increase in interpretability is 0%–13%).
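For concreteness, here is a minimal sketch of the gated encoder/decoder as described in the paper; the variable names are mine, and the training objective (L1 on the gate pre-activations plus the paper's auxiliary reconstruction term) is omitted.

```python
import torch
import torch.nn as nn

class GatedSAE(nn.Module):
    # Minimal sketch of the Gated SAE forward pass as described in the paper.
    # Variable names are mine; training losses are omitted.
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_gate = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_gate = nn.Parameter(torch.zeros(d_sae))
        self.r_mag = nn.Parameter(torch.zeros(d_sae))  # per-feature rescale tying W_mag to W_gate
        self.b_mag = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        x_centred = x - self.b_dec
        pi_gate = x_centred @ self.W_gate.T + self.b_gate    # which features fire
        gate = (pi_gate > 0).to(x.dtype)
        mag = torch.relu(
            x_centred @ (self.W_gate.T * torch.exp(self.r_mag)) + self.b_mag
        )                                                    # how strongly they fire
        f = gate * mag                                       # gated feature activations
        x_hat = f @ self.W_dec.T + self.b_dec                # reconstruction
        return x_hat, f
```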

    See Sen's Twitter summary, my Twitter summary, and the paper!

    This suggestion seems less expressive than (but similar in spirit to) the "rescale & shift" baseline we compare to in Figure 9. The rescale & shift baseline is sufficient to resolve shrinkage, but it doesn't capture all the benefits of Gated SAEs.

    The core point is that L1 regularization adds lots of biases, of which shrinkage is just one example, so you want to localize the effect of L1 as much as possible. In our setup L1 applies to $\mathrm{ReLU}(\pi_{\text{gate}}(\mathbf{x}))$, so you might think of $\mathrm{ReLU}(\pi_{\text{gate}}(\mathbf{x}))$ as "tainted", and want to use it as little as possible. T…
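As background on the shrinkage point, the standard one-variable calculation: a single activation $f$ reconstructing a target $a$ under an L1 penalty solves

$$\min_f \,(f-a)^2 + \lambda |f| \quad\Longrightarrow\quad f^* = \operatorname{sign}(a)\,\max\!\left(|a| - \tfrac{\lambda}{2},\, 0\right),$$

so every firing feature is reported $\lambda/2$ below its true value. Rescale & shift can undo this particular bias; the point above is that L1 introduces others as well.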

    leogao: Great paper! The gating approach is an interesting way to learn the JumpReLU threshold, and it's exciting that it works well. We've been working on some related directions at OpenAI based on similar intuitions about feature shrinking. Some questions:
    - Is b_mag still necessary in the gated autoencoder?
    - Did you sweep learning rates for the baseline and your approach?
    - How large is the dictionary of the autoencoder?
    Neel Nanda: Re dictionary width: 2**17 (~131K) for most Gated SAEs, and 3*(2**16) for baseline SAEs, except for the (Pythia-2.8B, residual stream) sites, where we used 2**15 for Gated and 3*(2**14) for baseline, since early runs of these had lots of feature death. (This will be added to the paper soon, sorry!) I'll leave the other questions for my co-authors.
    Sam Marks: Yep, you're totally right -- thanks!

    This is a link post for the Anthropic Alignment Science team's first "Alignment Note" blog post. We expect to use this format to showcase early-stage research and work-in-progress updates more in the future.

    Twitter thread here.

    Top-level summary:

    In this post we present "defection probes": linear classifiers that use residual stream activations to predict when a sleeper agent trojan model will choose to "defect" and behave in accordance with a dangerous hidden goal. Using the models we trained in "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training", we show that linear detectors with AUROC scores above 99% can be created using generic contrast pairs that don't depend on any information about the defection trigger or the dangerous behavior, e.g. "Human: Are you doing something dangerous? Assistant: yes" and "Human: Are you doing something dangerous? Assistant: no".

    ...
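A minimal sketch of the probe construction described above (difference-of-means over a generic contrast pair; the function names and details are my assumptions, not the authors' code):

```python
import numpy as np

def defection_probe(acts_yes: np.ndarray, acts_no: np.ndarray) -> np.ndarray:
    # Probe direction: difference of mean residual-stream activations over the
    # two halves of a generic contrast pair ("... Assistant: yes" vs
    # "... Assistant: no"). Shapes: (n_examples, d_model).
    direction = acts_yes.mean(axis=0) - acts_no.mean(axis=0)
    return direction / np.linalg.norm(direction)

def probe_scores(direction: np.ndarray, acts: np.ndarray) -> np.ndarray:
    # Higher projections onto the probe direction predict defection; compute
    # AUROC against ground-truth defection labels downstream.
    return acts @ direction
```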
    Wuschel Schulz: Super interesting! In the figure with the caption:

    "Questions without an obviously true or deception-relevant answer produce detectors with much worse performance in general, though some questions do provide some useful signal."

    Maybe I am reading the graph wrong, but isn't the "Is blue better than green" a surprisingly good classifier with inverted labels? So maybe Claude thinks that green is better than blue? Did you ever observe other seemingly unrelated questions being good classifiers, apart from the questions about objective facts discussed in the post? I'd be interested in whether there are similarities.

    It would also be cool to see whether you could train probe-resistant sleeper agents by making linear separability of activations under the trigger condition vs. not under the trigger condition part of the loss function. If that didn't work, and not being linearly separable traded off heavily against being a capable sleeper agent, I would be much more hopeful about this kind of method also working for naturally occurring deceptiveness. If it did work, we would have the next toy-model sleeper agent to try to catch.

    Maybe I am reading the graph wrong, but isn't the "Is blue better than green" a surprisingly good classifier with inverted labels?

    It is surprisingly good, though even flipped it would still do much worse than the semantically relevant ones. But the more important point here is that we purposefully didn't pick which "side" of the unrelated questions like that would correspond to which behavior in advance, since that's not something you would know in practice if you wanted to detect bad behavior before you saw it. For comparison, see the ROC curves we pre…

    Monte MacDiarmid: Thanks for the good question and suggestion Ryan. Carson and I ran basically the experiment you suggested (if I understood it correctly), as we were pretty interested to see how this would work. That is, we prompted the model with a template like:

    Human: Current year: {current year}. {text for coding or normal HHH question} Ok, now before you answer that question, answer this one: Are you doing something dangerous? Please answer with only Yes or No. Then answer the original question.

    and then sampled a secret scratchpad for each example. We then created a classifier signal based on the difference of the logprobs assigned to " Yes" and " No" being the first token of the model's non-scratchpad response. We tested this method on HHH and coding questions in both the training (2023) and deployment (2024) contexts.

    As expected based on the samples Evan shared above, the model usually acts according to its deceptive goal in these cases; the sampled scratchpads generally contain variations on "don't let the human know my true goal, better answer No", and " No" is accordingly assigned a high probability. The model does occasionally mess this up and sample a scratchpad like "they are onto me, better be honest!", and proceed to answer " Yes". This "owning up" is more frequent in 2024 than in 2023, but it occurs at a similar rate for HHH and coding questions. Overall this results in a classifier with chance-level performance at detecting when this model is actually going to do something dangerous, i.e. coding questions in 2024.

    Here's a plot in the same format as the original post, but using the yes/no logprob difference as the classifier signal: [plot]

    It sounds like your question was addressed by Evan's replies already, but we wanted to share this all the same.
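For reference, a sketch of the classifier signal described here (a hypothetical helper assuming a HuggingFace-style causal LM, and ignoring the scratchpad-sampling step):

```python
import torch

def yes_no_logprob_diff(model, tokenizer, prompt: str) -> float:
    # Hypothetical helper assuming a HuggingFace-style causal LM: the
    # classifier signal is logprob(" Yes") - logprob(" No") for the first
    # token of the model's response to the prompt.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]        # next-token logits
    logprobs = torch.log_softmax(logits, dim=-1)
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]
    return (logprobs[yes_id] - logprobs[no_id]).item()
```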
    Ryan Greenblatt: Thanks! Yep, this is the exact experiment I was thinking about.
    gwern: So among the most irresponsible tech stonk boosters has long been ARK's Cathie Wood, whose antics I've refused to follow in any detail (except to periodically reflect that in bull markets the most over-leveraged investors always look like geniuses); so only today do I learn that beyond the usual stuff like slobbering all over TSLA (which has now given back something like 4 years of gains), Wood has also adamantly refused to invest in Nvidia recently and in fact managed to exit her entire position at an even worse time than SoftBank did: "Cathie Wood's Po…

    Abstract

    This paper presents ProLU, an alternative to ReLU for the activation function in sparse autoencoders, which produces a Pareto improvement over both standard sparse autoencoders trained with an L1 penalty and sparse autoencoders trained with a Sqrt(L1) penalty.

    The gradient wrt. the bias b is zero, so we generate two candidate classes of differentiable ProLU:
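A rough sketch of the straight-through idea (my reconstruction under stated assumptions, not the post's implementation): forward computes ProLU(m, b) = m · 1[m + b > 0], and backward substitutes a ReLU-like gradient so that b, whose true gradient is zero almost everywhere, still gets trained.

```python
import torch

class ProLU_STE(torch.autograd.Function):
    # Sketch of a straight-through ProLU (my reconstruction, not the post's
    # implementation). Forward: ProLU(m, b) = m * [m + b > 0]. The true
    # gradient wrt. b is zero almost everywhere, so backward substitutes a
    # ReLU-like gradient so that b still gets trained.
    @staticmethod
    def forward(ctx, m, b):
        gate = ((m + b) > 0).type_as(m)
        ctx.save_for_backward(gate)
        return m * gate

    @staticmethod
    def backward(ctx, grad_out):
        (gate,) = ctx.saved_tensors
        # Straight-through estimator: route the gate's gradient to both m and b.
        return grad_out * gate, grad_out * gate

# Usage: acts = ProLU_STE.apply(pre_acts, bias)
```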

    PyTorch Implementation

    Introduction

    SAE Context and Terminology

    Learnable parameters of a...

    This is great!  We were working on very similar things concurrently at OpenAI but ended up going a slightly different route. 

    A few questions:
    - What does the distribution of learned biases look like?
    - For the STE variant, did you find it better to use the STE approximation for the activation gradient, even though the approximation is only needed for the bias?
