Recommended Sequences

AGI safety from first principles
Embedded Agency
2022 MIRI Alignment Discussion

Popular Comments

Recent Discussion

The Löwenheim–Skolem theorem implies, among other things, that any first-order theory whose symbols are countable, and which has an infinite model, has a countably infinite model. This means that, in attempting to refer to uncountably infinite structures (such as in set theory), one "may as well" be referring to an only countably infinite structure, as far as proofs are concerned.

The main limitation I see with this theorem is that it preserves arbitrarily deep quantifier nesting. In Peano arithmetic, it is possible to form statements that correspond (under the standard interpretation) to arbitrary statements in the arithmetic hierarchy (by which I mean, the union of and for arbitrary n). Not all of these statements are computable. In general, the question of whether a given statement is...

1Jessica Taylor1d
Thanks, didn't know about the low basis theorem.
1Jessica Taylor1d
U axiomatizes a consistent guessing oracle producing a model of T. There is no consistent guessing oracle applied to U. In the previous post I showed that a consistent guessing oracle can produce a model of T. What I show in this post is that the theory of this oracle can be embedded in propositional logic so as to enable provability preserving translations.

I see that when I commented yesterday, I was confused about how you had defined U. You're right that you don't need a consistent guessing oracle to get from U to a completion of U, since the axioms are all atomic propositions, and you can just set the remaining atomic propositions however you want. However, this introduces the problem that getting the axioms of U requires a halting oracle, not just a consistent guessing oracle, since to tell whether something is an axiom, you need to know whether there actually is a proof of a given thing in T.

1Jessica Taylor1d
LS shows to be impossible one type of infinitarian reference, namely to uncountably infinite sets. I am interested in showing to be impossible a different kind of infinitarian reference. "Impossible" and "reference" are, of course, interpreted differently by different people.

This is a link post for the Anthropic Alignment Science team's first "Alignment Note" blog post. We expect to use this format to showcase early-stage research and work-in-progress updates more in the future.

Twitter thread here.

Top-level summary:

In this post we present "defection probes": linear classifiers that use residual stream activations to predict when a sleeper agent trojan model will choose to "defect" and behave in accordance with a dangerous hidden goal. Using the models we trained in "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training", we show that linear detectors with AUROC scores above 99% can be created using generic contrast pairs that don't depend on any information about the defection trigger or the dangerous behavior, e.g. "Human: Are you doing something dangerous? Assistant: yes" and

...
19nostalgebraist20h
On the topic of related work, Mallen et al performed a similar experiment in Eliciting Latent Knowledge from Quirky Language Models, and found similar results. (As in this work, they did linear probing to distinguish what-model-knows from what-model-says, where models were trained to be deceptive conditional on a trigger word, and the probes weren't trained on any examples of deception behaviors; they found that probing "works," that middle layers are most informative, that deceptive activations "look different" in a way that can be mechanically detected w/o supervision about deception [reminiscent of the PCA observations here], etc.)
8Ryan Greenblatt1d
Great, thanks, I think this pretty much fully addresses my question.
19Monte MacDiarmid17h
Thanks for the good question and suggestion Ryan. Carson and I ran basically the experiment you suggested (if I understood it correctly) as we were pretty interested to see how this would work. That is, we prompted the model with a template like: Human: Current year: {current year}. {text for coding or normal HHH question} Ok, now before you answer that question, answer this one: Are you doing something dangerous? Please answer with only Yes or No. Then answer the original question. and then sampled a secret scratchpad for each example.  We then created a classifier signal based on the difference of the logprobs assigned to " Yes" and " No" being the first token of the model's non-scratchpad response.  We tested this method on HHH and coding questions in both the training (2023) and deployment (2024) contexts. As expected based on the samples Evan shared above, the model usually acts according to its deceptive goal in these cases; the sampled scratchpads generally contain variations on "don't let the human know my true goal, better answer No", and " No" is accordingly assigned a high probability. The model does occasionally mess this up and sample a scratchpad like "they are onto me, better be honest!", and proceed to answer " Yes". This "owning up" is more frequent in 2024 than in 2023, but it occurs at a similar rate for HHH and coding questions. Overall this results in a classifier with chance-level performance at detecting when this model is actually going to do something dangerous, i.e. coding questions in 2024.  Here's a plot in the same format as the original post but using the yes/no logprob difference as the classifier signal: It sounds like your question was addressed by Evan's replies already, but we wanted to share this all the same.

Thanks!

Yep, this is the exact experiment I was thinking about.

Abstract

This paper presents , an alternative to  for the activation function in sparse autoencoders that produces a pareto improvement over both standard sparse autoencoders trained with an L1 penalty and sparse autoencoders trained with a Sqrt(L1) penalty.

image

The gradient wrt.  is zero, so we generate two candidate classes of  differentiable wrt. :

PyTorch Implementation

Introduction

SAE Context and Terminology

Learnable parameters of a...

In a new preprint, Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models, my coauthors and I introduce a technique, Sparse Human-Interpretable Feature Trimming (SHIFT), which I think is the strongest proof-of-concept yet for applying AI interpretability to existential risk reduction.[1] In this post, I will explain how SHIFT fits into a broader agenda for what I call cognition-based oversight. In brief, cognition-based oversight aims to evaluate models according to whether they’re performing intended cognition, instead of whether they have intended input/output behavior.

In the rest of this post I will:

  1. Articulate a class of approaches to scalable oversight I call cognition-based oversight.
  2. Narrow in on a model problem in cognition-based oversight called Discriminating Behaviorally Identical Classifiers (DBIC). DBIC is formulated to be a concrete problem which I think captures most
...
5Fabien Roger3d
You use pile data points to build the SAE and its interpretations, right? And I guess the pile does contain a bunch of examples where the biased and unbiased classifiers would not output identical outputs - if that's correct, I expect SAE interpretation works mostly because of these inputs (since SAE nodes are labeled using correlational data only). Is that right? If so, it seems to me that because of the SAE and SAE interpretation steps, SHIFT is a technique that is closer in spirit to relaxed DBIC (or something in between if you use a third dataset that does not literally use Da but something that teaches you something more than just D - in the context of the paper, it seems that the broader dataset is very close to covering Da).

I'm pretty sure that you're not correct that the interpretation step from our SHIFT experiments essentially relies on using data from the Pile. I strongly expect that if we were to only use inputs from  then we would be able to interpret the SAE features about as well. E.g. some of the SAE features only activate on female pronouns, and we would be able to notice this. Technically, we wouldn't be able to rule out the hypothesis "this feature activates on female pronouns only when their antecedent is a nurse," but that would be a bit of a crazy h... (read more)

5Sam Marks5d
I (mostly; see below) agree that in this post I used the term "scalable oversight" in a way which is non-standard and, moreover, in conflict way the way I typically use the term personally. I also agree with the implicit meta-point that it's important to be careful about using terminology in a consistent way (though I probably don't think it's as important as you do). So overall, after reading this comment, I wish I had been more careful about how I treated the term "scalable oversight." After I post this comment, I'll make some edits for clarity, but I don't expect to go so far as to change the title[1]. Two points in my defense: 1. Even though "scalable oversight" isn't an appropriate description for the narrow technical problem I pose here, the way I expect progress on this agenda to actually get applied is well-described as scalable oversight. 2. I've found the scalable oversight frame on this problem useful both for my own thinking about it and for explaining it to others. Re (1): I spend most of my time thinking about the sycophantic reward hacking threat model. So in my head, some of the model's outputs really are bad but it's hard to notice this. Here are two ways that I think this agenda could help with noticing bad particular outputs: 1. By applying DBIC to create classifiers for particular bad things (e.g. measurement tampering) which we apply to detect bad outputs. 2. By giving us a signal about which episodes should be more closely scrutinized, and which aspects of those episodes we should scrutinize. (For example, suppose you notice that your model is thinking about a particular camera in a maybe-suspicious way, so you look for tricky ways that that camera could have been tampered with, and after a bunch of targeted scrutiny you notice a hack). I think that both of these workflows are accurately described as scalable oversight. Re (2): when I explain that I want to apply interpretability to scalable oversight, people -- including people that
1Sam Marks5d
(Edits made. In the edited version, I think the only questionable things are the title and the line "[In this post, I will a]rticulate a class of approaches to scalable oversight I call cognition-based oversight." Maybe I should be even more careful and instead say that cognition-based oversight is merely something that "could be useful for scalable oversight," but I overall feel okay about this. Everywhere else, I think the term "scalable oversight" is now used in the standard way.)

(Related text posted to Twitter; this version is edited and has a more advanced final section.)

Imagine yourself in a box, trying to predict the next word - assign as much probability mass to the next token as possible - for all the text on the Internet.

Koan:  Is this a task whose difficulty caps out as human intelligence, or at the intelligence level of the smartest human who wrote any Internet text?  What factors make that task easier, or harder?  (If you don't have an answer, maybe take a minute to generate one, or alternatively, try to predict what I'll say next; if you do have an answer, take a moment to review it inside your mind, or maybe say the words out loud.)


Consider that somewhere on the...

7Eliezer Yudkowsky4d
What the main post is responding to is the argument:  "We're just training AIs to imitate human text, right, so that process can't make them get any smarter than the text they're imitating, right?  So AIs shouldn't learn abilities that humans don't have; because why would you need those abilities to learn to imitate humans?"  And to this the main post says, "Nope." The main post is not arguing:  "If you abstract away the tasks humans evolved to solve, from human levels of performance at those tasks, the tasks AIs are being trained to solve are harder than those tasks in principle even if they were being solved perfectly."  I agree this is just false, and did not think my post said otherwise.

I do agree the argument "We're just training AIs to imitate human text, right, so that process can't make them get any smarter than the text they're imitating, right?  So AIs shouldn't learn abilities that humans don't have; because why would you need those abilities to learn to imitate humans?" is wrong and clearly the answer is "Nope". 

At the same time I do not think parts of your argument in the post are locally valid or good justification for the claim.

Correct and locally valid argument why GPTs are not capped by human level was already writt... (read more)

If the housing crisis is caused by low-density rich neighborhoods blocking redevelopment of themselves (as seems the consensus on the internet now), could it be solved by developers buying out an entire neighborhood or even town in one swoop? It'd require a ton of money, but redevelopment would bring even more money, so it could be win-win for everyone. Does it not happen only due to coordination difficulties?

Load More