AI ALIGNMENT FORUM
AF

Quick Takes

I worked at OpenAI for three years, from 2021-2024 on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language model to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to t... (read more)

Showing 3 of 7 replies (Click to show all)

James Payor9d44

By "gag order" do you mean just as a matter of private agreement, or something heavier-handed, with e.g. potential criminal consequences?

I have trouble understanding the absolute silence we seem to be having. There seem to be very few leaks, and all of them are very mild-mannered and are failing to build any consensus narrative that challenges OA's press in the public sphere.

Are people not able to share info over Signal or otherwise tolerate some risk here? It doesn't add up to me if the risk is just some chance of OA trying to then sue you to bankruptcy, ... (read more)

6O O9d

Daniel K seems pretty open about his opinions and reasons for leaving. Did he not sign an NDA and thus gave up whatever PPUs he had?

7Lawrence Chan9d

When I spoke to him a few weeks ago (a week after he left OAI), he had not signed an NDA at that point, so it seems likely that he hasn't.

Thomas Kwa's Shortform

Thomas Kwa3d40

I started a dialogue with @Alex_Altair a few months ago about the tractability of certain agent foundations problems, especially the agent-like structure problem. I saw it as insufficiently well-defined to make progress on anytime soon. I thought the lack of similar results in easy settings, the fuzziness of the "agent"/"robustly optimizes" concept, and the difficulty of proving things about a program's internals given its behavior all pointed against working on this. But it turned out that we maybe didn't disagree on tractability much, it's just that Alex... (read more)

William_S's Shortform

William Saunders1y30

From discussion with Logan Riggs (Eleuther) who worked on the tuned lens: the tuned lens suggests that the residual stream at different layers go through some linear transformations and so aren’t directly comparable. This would interfere with a couple of methods for trying to understand neurons based on weights: 1) the embedding space view 2) calculating virtual weights between neurons in different layers.

However, we could try correcting these using the transformations learned by the tuned lens to translate between the residual stream at different layers, ... (read more)

DanielFilan's Shortform Feed

DanielFilan4d112

Frankfurt-style counterexamples for definitions of optimization

In "Bottle Caps Aren't Optimizers", I wrote about a type of definition of optimization that says system S is optimizing for goal G iff G has a higher value than it would if S didn't exist or were randomly scrambled. I argued against these definitions by providing a examples of systems that satisfy the criterion but are not optimizers. But today, I realized that I could repurpose Frankfurt cases to get examples of optimizers that don't satisfy this criterion.

A Frankfurt case is a thought experim... (read more)

Fabien's Shortform

Fabien Roger1mo217

I listened to The Failure of Risk Management by Douglas Hubbard, a book that vigorously criticizes qualitative risk management approaches (like the use of risk matrices), and praises a rationalist-friendly quantitative approach. Here are 4 takeaways from that book:

There are very different approaches to risk estimation that are often unaware of each other: you can do risk estimations like an actuary (relying on statistics, reference class arguments, and some causal models), like an engineer (relying mostly on causal models and simulations), like a trader (r

... (read more)

Fabien Roger7d10

I also listened to How to Measure Anything in Cybersecurity Risk 2nd Edition by the same author. I had a huge amount of overlapping content with The Failure of Risk Management (and the non-overlapping parts were quite dry), but I still learned a few things:

Executives of big companies now care a lot about cybersecurity (e.g. citing it as one of the main threats they have to face), which wasn't true in ~2010.
Evaluation of cybersecurity risk is not at all synonyms with red teaming. This book is entirely about risk assessment in cyber and doesn't speak about r

... (read more)

1romeostevensit1mo

Is there a short summary on the rejecting Knightian uncertainty bit?

3Fabien Roger1mo

By Knightian uncertainty, I mean "the lack of any quantifiable knowledge about some possible occurrence" i.e. you can't put a probability on it (Wikipedia). The TL;DR is that Knightian uncertainty is not a useful concept to make decisions, while the use subjective probabilities is: if you are calibrated (which you can be trained to become), then you will be better off taking different decisions on p=1% "Knightian uncertain events" and p=10% "Knightian uncertain events". For a more in-depth defense of this position in the context of long-term predictions, where it's harder to know if calibration training obviously works, see the latest scott alexander post.

Thomas Kwa's Shortform

Thomas Kwa9d7-14

You should update by +-1% on AI doom surprisingly frequently

This is just a fact about how stochastic processes work. If your p(doom) is Brownian motion in 1% steps starting at 50% and stopping once it reaches 0 or 1, then there will be about 50^2=2500 steps of size 1%. This is a lot! If we get all the evidence for whether humanity survives or not uniformly over the next 10 years, then you should make a 1% update 4-5 times per week. In practice there won't be as many due to heavy-tailedness in the distribution concentrating the updates in fewer events, and ... (read more)

1Tsvi Benson-Tilsen8d

Probabilities on summary events like this are mostly pretty pointless. You're throwing together a bunch of different questions, about which you have very different knowledge states (including how much and how often you should update about them).

15Lawrence Chan9d

The general version of this statement is something like: if your beliefs satisfy the law of total expectation, the variance of the whole process should equal the variance of all the increments involved in the process.[1] In the case of the random walk where at each step, your beliefs go up or down by 1% starting from 50% until you hit 100% or 0% -- the variance of each increment is 0.01^2 = 0.0001, and the variance of the entire process is 0.5^2 = 0.25, hence you need 0.25/0.0001 = 2500 steps in expectation. If your beliefs have probability p of going up or down by 1% at each step, and 1-p of staying the same, the variance is reduced by a factor of p, and so you need 2500/p steps. (Indeed, something like this standard way to derive the expected steps before a random walk hits an absorbing barrier). Similarly, you get that if you start at 20% or 80%, you need 1600 steps in expectation, and if you start at 1% or 99%, you'll need 99 steps in expectation. ---------------------------------------- One problem with your reasoning above is that as the 1%/99% shows, needing 99 steps in expectation does not mean you will take 99 steps with high probability -- in this case, there's a 50% chance you need only one update before you're certain (!), there's just a tail of very long sequences. In general, the expected value of variables need not look like I also think you're underrating how much the math changes when your beliefs do not come in the form of uniform updates. In the most extreme case, suppose your current 50% doom number comes from imagining that doom is uniformly distributed over the next 10 years, and zero after -- then the median update size per week is only 0.5/520 ~= 0.096%/week, and the expected number of weeks with a >1% update is 0.5 (it only happens when you observe doom). Even if we buy a time-invariant random walk model of belief updating, as the expected size of your updates get larger, you also expect there to be quadratically fewer of them -- e.

Thomas Kwa8d32

I talked about this with Lawrence, and we both agree on the following:

There are mathematical models under which you should update >=1% in most weeks, and models under which you don't.
Brownian motion gives you 1% updates in most weeks. In many variants, like stationary processes with skew, stationary processes with moderately heavy tails, or Brownian motion interspersed with big 10%-update events that constitute <50% of your variance, you still have many weeks with 1% updates. Lawrence's model where you have no evidence until either AI takeover happen

... (read more)

Buck's Shortform

Buck Shlegeris10d3218

[epistemic status: I think I’m mostly right about the main thrust here, but probably some of the specific arguments below are wrong. In the following, I'm much more stating conclusions than providing full arguments. This claim isn’t particularly original to me.]

I’m interested in the following subset of risk from AI:

Early: That comes from AIs that are just powerful enough to be extremely useful and dangerous-by-default (i.e. these AIs aren’t wildly superhuman).
Scheming: Risk associated with loss of control to AIs that arises from AIs scheming
- So e.g. I exclu

3Matthew Barnett9d

Can you be more clearer this point? To operationalize this, I propose the following question: what is the fraction of world GDP you expect will be attributable to AI at the time we have these risky AIs that you are interested in? For example, are you worried about AIs that will arise when AI is 1-10% of the economy, or more like 50%? 90%?

Ryan Greenblatt9d62

One operationalization is "these AIs are capable of speeding up ML R&D by 30x with less than a 2x increase in marginal costs".

As in, if you have a team doing ML research, you can make them 30x faster with only <2x increase in cost by going from not using your powerful AIs to using them.

With these caveats:

The speed up is relative to the current status quo as of GPT-4.
The speed up is ignoring the "speed up" of "having better experiments to do due to access to better models" (so e.g., they would complete a fixed research task faster).
By "capable" o

... (read more)

4Buck Shlegeris10d

When I said "AI control is easy", I meant "AI control mitigates most risk arising from human-ish-level schemers directly causing catastrophes"; I wasn't trying to comment more generally. I agree with your concern.

TurnTrout's shortform feed

Alex Turner12d245

A semi-formalization of shard theory. I think that there is a surprisingly deep link between "the AIs which can be manipulated using steering vectors" and "policies which are made of shards."^[1] In particular, here is a candidate definition of a shard theoretic policy:

A policy has shards if it implements at least two "motivational circuits" (shards) which can independently activate (more precisely, the shard activation contexts are compositionally represented).

By this definition, humans have shards because they can want food at the same time as wantin... (read more)

Thomas Kwa11d11

I'm not so sure that shards should be thought of as a matter of implementation. Contextually activated circuits are a different kind of thing from utility function components. The former activate in certain states and bias you towards certain actions, whereas utility function components score outcomes. I think there are at least 3 important parts of this:

A shardful agent can be incoherent due to valuing different things from different states
A shardful agent can be incoherent due to its shards being shallow, caring about actions or proximal effects rather t

... (read more)

Richard Ngo's Shortform

Richard Ngo12d88

Hypothesis: there's a way of formalizing the notion of "empowerment" such that an AI with the goal of empowering humans would be corrigible.

This is not straightforward, because an AI that simply maximized human POWER (as defined by Turner et al.) wouldn't ever let the humans spend that power. Intuitively, though, there's a sense in which a human who can never spend their power doesn't actually have any power. Is there a way of formalizing that intuition?

The direction that seems most promising is in terms of counterfactuals (or, alternatively, Pearl's do-ca... (read more)

Richard Ngo12d52

You can think of this as a way of getting around the problem of fully updated deference, because the AI is choosing a policy based on what that policy would have done in the full range of hypothetical situations, and so it never updates away from considering any given goal. The cost, of course, is that we don't know how to actually pin down these hypotheticals.

Fabien's Shortform

Fabien Roger17d60

List sorting does not play well with few-shot mostly doesn't replicate with davinci-002.

When using length-10 lists (it crushes length-5 no matter the prompt), I get:

32-shot, no fancy prompt: ~25%
0-shot, fancy python prompt: ~60%
0-shot, no fancy prompt: ~60%

So few-shot hurts, but the fancy prompt does not seem to help. Code here.

I'm interested if anyone knows another case where a fancy prompt increases performance more than few-shot prompting, where a fancy prompt is a prompt that does not contain information that a human would use to solve the task. ... (read more)

gwern's Shortform

gwern10mo90

I have some long comments I can't refind now (weirdly) about the difficulty of investing based on AI beliefs (or forecasting in general): similar to catching falling knives, timing is all-important and yet usually impossible to nail down accurately; specific investments are usually impossible if you aren't literally founding the company, and indexing 'the entire sector' definitely impossible. Even if you had an absurd amount of money, you could try to index and just plain fail - there is no index which covers, say, OpenAI.

Apropos, Matt Levine comments on o... (read more)

gwern18d21

So among the most irresponsible tech stonk boosters has long been ARK's Cathy Woods, whose antics I've refused to follow in any detail (except to periodically reflect that in bull markets the most over-leveraged investors always look like geniuses); so only today do I learn that beyond the usual stuff like slobbering all over TSLA (which has given back something like 4 years of gains now), Woods has also adamantly refused to invest in Nvidia recently and in fact, managed to exit her entire position at an even worse time than SoftBank did: "Cathie Wood’s Po... (read more)

cousin_it's Shortform

Vladimir Slepnev21d10

If the housing crisis is caused by low-density rich neighborhoods blocking redevelopment of themselves (as seems the consensus on the internet now), could it be solved by developers buying out an entire neighborhood or even town in one swoop? It'd require a ton of money, but redevelopment would bring even more money, so it could be win-win for everyone. Does it not happen only due to coordination difficulties?

Vanessa Kosoy's Shortform

Vanessa Kosoy4y70

Master post for ideas about infra-Bayesianism.

Showing 3 of 30 replies (Click to show all)

Vanessa Kosoy22d20

Sort of obvious but good to keep in mind: Metacognitive regret bounds are not easily reducible to "plain" IBRL regret bounds when we consider the core and the envelope as the "inside" of the agent.

Assume that the action and observation sets factor as $A = A_{0} \times A_{1}$ and $O = O_{0} \times O_{1}$ , where $(A_{0}, O_{0})$ is the interface with the external environment and $(A_{1}, O_{1})$ is the interface with the envelope.

Let $Λ : Π \to □ (Γ \times (A \times O)^{ω})$ be a metalaw. Then, there are two natural ways to reduce it to an ordinary law:

Marginalizing over $Γ$ . That is, le

... (read more)

5Vanessa Kosoy1mo

Is it possible to replace the maximin decision rule in infra-Bayesianism with a different decision rule? One surprisingly strong desideratum for such decision rules is the learnability of some natural hypothesis classes. In the following, all infradistributions are crisp. Fix finite action set A and finite observation set O. For any k∈N and γ∈(0,1), let Mkγ:(A×O)ω→Δ(A×O)k be defined by Mkγ(h|d):=(1−γ)∞∑n=0γn[[h=dn:n+k]] In other words, this kernel samples a time step n out of the geometric distribution with parameter γ, and then produces the sequence of length k that appears in the destiny starting at n. For any continuous[1] function D:□(A×O)k→R, we get a decision rule. Namely, this rule says that, given infra-Bayesian law Λ and discount parameter γ, the optimal policy is π∗DΛ:=argmaxπ:O∗→AD(Mkγ∗Λ(π)) The usual maximin is recovered when we have some reward function r:(A×O)k→R and corresponding to it is Dr(Θ):=minθ∈ΘEθ[r] Given a set H of laws, it is said to be learnable w.r.t. D when there is a family of policies {πγ}γ∈(0,1) such that for any Λ∈H limγ→1(maxπD(Mkγ∗Λ(π))−D(Mkγ∗Λ(πγ))=0 For Dr we know that e.g. the set of all communicating[2] finite infra-RDPs is learnable. More generally, for any t∈[0,1] we have the learnable decision rule Dtr:=tmaxθ∈ΘEθ[r]+(1−t)minθ∈ΘEθ[r] This is the "mesomism" I taked about before. Also, any monotonically increasing D seems to be learnable, i.e. any D s.t. for Θ1⊆Θ2 we have D(Θ1)≤D(Θ2). For such decision rules, you can essentially assume that "nature" (i.e. whatever resolves the ambiguity of the infradistributions) is collaborative with the agent. These rules are not very interesting. On the other hand, decision rules of the form Dr1+Dr2 are not learnable in general, and so are decision rules of the form Dr+D′ for D′ monotonically increasing. Open Problem: Are there any learnable decision rules that are not mesomism or monotonically increasing? A positive answer to the above would provide interesting generaliz

2Vanessa Kosoy2mo

Formalizing the richness of mathematics Intuitively, it feels that there is something special about mathematical knowledge from a learning-theoretic perspective. Mathematics seems infinitely rich: no matter how much we learn, there is always more interesting structure to be discovered. Impossibility results like the halting problem and Godel incompleteness lend some credence to this intuition, but are insufficient to fully formalize it. Here is my proposal for how to formulate a theorem that would make this idea rigorous. (Wrong) First Attempt Fix some natural hypothesis class for mathematical knowledge, such as some variety of tree automata. Each such hypothesis Θ represents an infradistribution over Γ: the "space of counterpossible computational universes". We can say that Θ is a "true hypothesis" when there is some θ in the credal set Θ (a distribution over Γ) s.t. the ground truth Υ∗∈Γ "looks" as if it's sampled from θ. The latter should be formalizable via something like a computationally bounded version of Marin-Lof randomness. We can now try to say that Υ∗ is "rich" if for any true hypothesis Θ, there is a refinement Ξ⊆Θ which is also a true hypothesis and "knows" at least one bit of information that Θ doesn't, in some sense. This is clearly true, since there can be no automaton or even any computable hypothesis which fully describes Υ∗. But, it's also completely boring: the required Ξ can be constructed by "hardcoding" an additional fact into Θ. This doesn't look like "discovering interesting structure", but rather just like brute-force memorization. (Wrong) Second Attempt What if instead we require that Ξ knows infinitely many bits of information that Θ doesn't? This is already more interesting. Imagine that instead of metacognition / mathematics, we would be talking about ordinary sequence prediction. In this case it is indeed an interesting non-trivial condition that the sequence contains infinitely many regularities, s.t. each of them can be exp

Quintin Pope's Shortform

Quintin Pope22d20

Idea for using current AI to accelerate medical research: suppose you were to take a VLM and train it to verbally explain the differences between two image data distributions. E.g., you could take 100 dog images, split them into two classes, insert tiny rectangles into class 1, feed those 100 images into the VLM, and then train it to generate the text "class 1 has tiny rectangles in the images". Repeat this for a bunch of different augmented datasets where we know exactly how they differ, aiming for a VLM with a general ability to in-context learn and verb... (read more)

Thomas Kwa's Shortform

Thomas Kwa1y6-17

I'm worried that "pause all AI development" is like the "defund the police" of the alignment community. I'm not convinced it's net bad because I haven't been following governance-- my current guess is neutral-- but I do see these similarities:

It's incredibly difficult and incentive-incompatible with existing groups in power
There are less costly, more effective steps to reduce the underlying problem, like making the field of alignment 10x larger or passing regulation to require evals
There are some obvious negative effects; potential overhangs or greater inc

... (read more)

Showing 3 of 4 replies (Click to show all)

Thomas Kwa1mo1-2

I now think the majority of impact of AI pause advocacy will come from the radical flank effect, and people should study it to decide whether pause advocacy is good or bad.

2Lauro Langosco11mo

IMO making the field of alignment 10x larger or evals do not solve a big part of the problem, while indefinitely pausing AI development would. I agree it's much harder, but I think it's good to at least try, as long as it doesn't terribly hurt less ambitious efforts (which I think it doesn't).

1Thomas Kwa1y

This seems good if it could be done. But the original proposal was just a call for labs to individually pause their research, which seems really unlikely to work. Also, the level of civilizational competence required to compensate labs seems to be higher than for other solutions. I don't think it's a common regulatory practice to compensate existing labs like this, and it seems difficult to work out all the details so that labs will feel adequately compensated. Plus there might be labs that irrationally believe they're undervalued. Regulations similar to the nuclear or aviation industry feel like a more plausible way to get slowdown, and have the benefit that they actually incentivize safety work.

Fabien's Shortform

Fabien Roger2mo234

I just finished listening to The Hacker and the State by Ben Buchanan, a book about cyberattacks, and the surrounding geopolitics. It's a great book to start learning about the big state-related cyberattacks of the last two decades. Some big attacks /leaks he describes in details:

Wire-tapping/passive listening efforts from the NSA, the "Five Eyes", and other countries
The multi-layer backdoors the NSA implanted and used to get around encryption, and that other attackers eventually also used (the insecure "secure random number" trick + some stuff on top of t

... (read more)

Neel Nanda1mo20

Thanks! I read and enjoyed the book based on this recommendation

Fabien's Shortform

Fabien Roger1mo262

I listened to the book This Is How They Tell Me the World Ends by Nicole Perlroth, a book about cybersecurity and the zero-day market. It describes in detail the early days of bug discovery, the social dynamics and moral dilemma of bug hunts.

(It was recommended to me by some EA-adjacent guy very worried about cyber, but the title is mostly bait: the tone of the book is alarmist, but there is very little content about potential catastrophes.)

My main takeaways:

Vulnerabilities used to be dirt-cheap (~$100) but are still relatively cheap (~$1M even for big zer

... (read more)

3Buck Shlegeris1mo

Do you have concrete examples?

1Fabien Roger1mo

I remembered mostly this story: [Taken from this summary of this passage of the book. The book was light on technical detail, I don't remember having listened to more details than that.] I didn't realize this was so early in the story of the NSA, maybe this anecdote teaches us nothing about the current state of the attack/defense balance.

Fabien Roger1mo10

The full passage in this tweet thread (search for "3,000").

Linda Linsefors's Shortform

Linda Linsefors2mo23

Recently someone either suggested to me (or maybe told me they or someone where going to do this?) that we should train AI on legal texts, to teach it human values. Ignoring the technical problem of how to do this, I'm pretty sure legal text are not the right training data. But at the time, I could not clearly put into words why. Todays SMBC explains this for me:

Saturday Morning Breakfast Cereal - Law (smbc-comics.com)

Law is not a good representation or explanation of most of what we care about, because it's not trying to be. Law is mainly focused on the c... (read more)

LawrenceC's Shortform

Lawrence Chan2mo3018

I finally got around to reading the Mamba paper. H/t Ryan Greenblatt and Vivek Hebbar for helpful comments that got me unstuck.

TL;DR: authors propose a new deep learning architecture for sequence modeling with scaling laws that match transformers while being much more efficient to sample from.

A brief historical digression

As of ~2017, the three primary ways people had for doing sequence modeling were RNNs, Conv Nets, and Transformers, each with a unique “trick” for handling sequence data: recurrence, 1d convolutions, and self-attention.

RNNs are

Vladimir Nesov2mo40

StripedHyena, Griffin, and especially Based suggest that combining RNN-like layers with even tiny sliding window attention might be a robust way of getting a large context, where the RNN-like layers don't have to be as good as Mamba for the combination to work. There is a great variety of RNN-like blocks that haven't been evaluated for hybridization with sliding window attention specifically, as in Griffin and Based. Some of them might turn out better than Mamba on scaling laws after hybridization, so Mamba being impressive without hybridization might be l... (read more)

12Ryan Greenblatt2mo

Another key note about mamba is that despite being RNN-like it doesn't result in substantially higher effective serial reasoning depth (relative to transformers). This is because the state transition is linear[1]. However, it is architecturally closer to things that might involve effectively higher depth. See also here. ---------------------------------------- 1. And indeed, there is a fundamental tradeoff where if the state transition function is expressive (e.g. nonlinear), then it would no longer be possible to use a parallel scan because the intermediates for the scan would be too large to represent compactly or wouldn't simplify the original functions to reduce computation. You can't compactly represent f∘g (f composed with g) in a way that makes computing f(g(x)) more efficient for general choices of f and g (in the typical MLP case at least). Another simpler but less illuminating way to put this is that higher serial reasoning depth can't be parallelized (without imposing some constraints on the serial reasoning). ↩︎

2Lawrence Chan2mo

I mean, yeah, as your footnote says: Transformers do get more computation per token on longer sequences, but they also don't get more serial depth, so I'm not sure if this is actually an issue in practice? 1. ^ As an aside, I actually can't think of any class of interesting functions with this property -- when reading the paper, the closest I could think of are functions on discrete sets (lol), polynomials (but simplifying these are often more expensive than just computing the terms serially), and rational functions (ditto)

Richard Ngo's Shortform

Richard Ngo2mo6974

I feel kinda frustrated whenever "shard theory" comes up in a conversation, because it's not a theory, or even a hypothesis. In terms of its literal content, it basically seems to be a reframing of the "default" stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is "assume they're just a set of heuristics".

This is a particular pity because I think there's a version of the "shard" framing which would actually be useful, but which shard advocates go out of their way to avoid. Specifically: we should be int... (read more)

Showing 3 of 5 replies (Click to show all)

Daniel Kokotajlo2mo70

FWIW I'm potentially intrested in interviewing you (and anyone else you'd recommend) and then taking a shot at writing the 101-level content myself.

4Alex Turner2mo

Nope! I have basically always enjoyed talking with you, even when we disagree.

2Daniel Kokotajlo2mo

Ok, whew, glad to hear.