This is a very reasonable criticism. I don’t know, I’ll think about it. Thanks.

When I said "AI control is easy", I meant "AI control mitigates most risk arising from human-ish-level schemers directly causing catastrophes"; I wasn't trying to comment more generally. I agree with your concern.

[epistemic status: I think I’m mostly right about the main thrust here, but probably some of the specific arguments below are wrong. In the following, I'm much more stating conclusions than providing full arguments. This claim isn’t particularly original to me.]

I’m interested in the following subset of risk from AI:

  • Early: risk that comes from AIs that are just powerful enough to be extremely useful and dangerous-by-default (i.e. these AIs aren’t wildly superhuman).
  • Scheming: risk associated with loss of control to AIs that arises from AIs scheming.
    • So e.g. I exclude state actors stealing weights in ways that aren’t enabled by the AIs scheming, and I also exclude non-scheming failure modes. IMO, state actors stealing weights is a serious threat, but non-scheming failure modes aren’t (at this level of capability and dignity).
  • Medium dignity: that is, developers of these AIs are putting a reasonable amount of effort into preventing catastrophic outcomes from their AIs (perhaps they’re spending the equivalent of 10% of their budget on cost-effective measures to prevent catastrophes).
  • Nearcasted: no substantial fundamental progress on AI safety techniques, no substantial changes in how AI works.

This subset of risk is interesting because I think it’s a natural scenario at which to target technical work on AI safety. (E.g. it’s the main scenario we’re targeting with our AI control agenda.)

I claim that the majority of this risk comes from the lab using AIs internally to do AI development (by which I mean both research and engineering). This is because the AIs doing AI development naturally require access to compute and model weights that they can potentially leverage into causing catastrophic outcomes—in particular, those resources can be abused to run AIs unmonitored.

Using AIs for AI development looks uniquely risky to me among applications of early-transformative AIs, because unlike all other applications I know about:

  • It’s very expensive to refrain from using AIs for this application.
  • There’s no simple way to remove affordances from the AI such that it’s very hard for the AI to take a small sequence of actions which plausibly lead quickly to loss of control. In contrast, most other applications of AI probably can be controlled just by restricting their affordances.

If I’m right that the risk from scheming early-transformative models is concentrated onto this pretty specific scenario, it implies a bunch of things:

  • It implies that work on mitigating these risks should focus on this very specific setting.
  • It implies that AI control is organizationally simpler, because most applications can be made trivially controlled.
  • It is pretty plausible to me that AI control is quite easy, because you actually can remove affordances from the AIs that are doing AI R&D such that it’s hard for them to cause problems.

And I feel like it's pretty obvious that addressing issues with current harmlessness training, if they improve on state of the art, is "more grounded" than "we found a cool SAE feature that correlates with X and Y!"?

Yeah definitely I agree with the implication, I was confused because I don't think that these techniques do improve on state of the art.

I'm pretty skeptical that this technique is what you end up using if you approach the problem of removing refusal behavior technique-agnostically, e.g. trying to carefully tune your fine-tuning setup, and then pick the best technique.

??? Come on, there's clearly a difference between "we can find an Arabic feature when we go looking for anything interpretable" vs "we chose from the relatively small set of practically important things and succeeded in doing something interesting in that domain".

Oh okay, you're saying the core point is that this project was less streetlighty because the topic you investigated was determined by the field's interest rather than cherrypicking. I actually hadn't understood that this is what you were saying. I agree that this makes the results slightly better.

Lawrence, how are these results any more grounded than any other interp work?

I don't see how this is a success at doing something useful on a real task. (Edit: I see how this is a real task, I just don't see how it's a useful improvement on baselines.)

Because I don't think this is realistically useful, I don't think this at all reduces my probability that your techniques are fake and your models of interpretability are wrong.

Maybe the groundedness you're talking about comes from the fact that you're doing interp on a domain of practical importance? I agree that doing things on a domain of practical importance might make it easier to be grounded. But it mostly seems like it would be helpful because it gives you well-tuned baselines to compare your results to. I don't think you have results that can cleanly be compared to well-established baselines?

(Tbc I don't think this work is particularly more ungrounded/sloppy than other interp, having not engaged with it much, I'm just not sure why you're referring to groundedness as a particular strength of this compared to other work. I could very well be wrong here.)

I like this post and this research direction, I agree with almost everything you say, and I think you’re doing an unusually good job of explaining why you think your work is useful.

A nitpick: I think you’re using the term “scalable oversight” in a nonstandard and confusing way.

You say that scalable oversight is a more general version of “given a good model and a bad model, determine which one is good.” I imagine the more general sense you wanted is something like: you can implement some metric that tells you how “good” a model is, which can be applied not only to distinguish good from bad models (by comparing their metric values) but also can hopefully be used to train the models.

I think that your definition of scalable oversight here is broader than people normally use. In particular, I usually think of scalable oversight as the problem of constructing a procedure that tells us how good a model’s actions are on a particular trajectory; I think of it as excluding the problem of determining whether a model’s behaviors would be bad on some other trajectory that we aren’t considering. (This is how I use the term here, how Ansh uses it here, and how I interpret the usage in Concrete Problems and in Measuring Progress on Scalable Oversight for Large Language Models.)

I think that it’s good to have a word for the problem of assessing model actions on particular trajectories, and I think it’s probably good to distinguish between problems associated with that assessment and other problems; scalable oversight is the current standard choice for that.

Under your usage, scalable oversight suffices to solve the whole safety problem. Your usage also doesn’t play nicely with the low-stakes/high-stakes decomposition.

I’d prefer that you phrased this all by saying:
 

It might be the case that we aren’t able to behaviorally determine whether our model is bad or not. This could be because of a failure of scalable oversight (that is, it’s currently doing actions that we can’t tell are good), or because of concerns about failures that we can’t solve by training (that is, we know that it isn’t taking bad actions now, but we’re worried that it might do so in the future, either because of distribution shift or rare failures). Let’s just talk about the special case where we want to distinguish between two models and we don’t have examples where the two models behaviorally differ. We think that it is good to research strategies that allow us to distinguish models in this case.

  • If you are very good at cyber and extremely smart, you can hide vulnerabilities in 10k-line programs in a way that less smart specialists will have trouble discovering even after days of examination; code generation/analysis is not really defense-favored.

Do you have concrete examples?
