Why conditioning on "the agent takes action a" isn't enough

post by Nate Soares 1043 days ago

This post expands a bit on a point that I didn’t have enough space to make in the paper Toward Idealized Decision Theory.

Say we have a description of an agent program, a description of a universe program $$\texttt{U()}$$, a set of actions $$A$$, and a Bayesian probability distribution over propositions about the world. Say further that for each $$a \in A$$ we can form the proposition “the agent takes action $$a$$”.

Part of the problem with EDT is that we can’t, in fact, use this to evaluate $$\mathbb{E}[\texttt{U()}|\text{the agent takes action }a]$$. Why not? Because the probability that the agent takes action $$a$$ may be zero (if the agent does not in fact take action $$a$$), and so evaluating the above might require conditioning on an event of probability zero.

There are two common reflexive responses. The first is to modify the agent so that there is no action which will definitely not be taken (say, by adding code to the agent which iterates over each action, checks whether the probability of executing that action is zero, and executes the action if it is definitely not going to be executed). The second is to say “Yeah, but no Bayesian would be certain that an action won’t be taken, in reality. There’s always some chance of cosmic rays, and so on. So these events will never actually have probability zero.”

But while both of these responses work, in the sense that in most realistic universes $$v_a := \mathbb{E}[\texttt{U()}|\text{the agent takes action }a]$$ will be defined for all actions $$a$$, they do not fix the problem. You’ll be able to get a value $$v_a$$ for each action $$a$$, perhaps, but this value will not necessarily correspond to the utility that the agent would get if it did take that action. Why not?
Because conditioning on unlikely events can put you into very strange parts of the probability space.

Consider a universe where the agent first has to choose between a red box (worth $1) and a green box (worth $100), and then must decide whether or not to pay $1000 to meticulously go through its hardware and correct for bits flipped by cosmic rays. Say that this agent reasons according to EDT. It may be the case that this agent has extremely high probability mass on choosing “red” but nonzero mass on choosing “green” (because it might get hit by cosmic rays). But if it chooses green, it expects that it would notice that this only happens when it’s been hit by cosmic rays, and so would pay $1000 to get its hardware checked. That is, $$v_{\mathrm{red}}=1$$ and $$v_{\mathrm{green}}=100-1000=-900$$.

What went wrong? In brief, “green” having nonzero probability does not imply that conditioning on “the agent takes the green box” is the same as the counterfactual assumption that the agent takes the green box. The conditional probability distribution may be very different from the unconditioned probability distribution (as in the example above, where conditioned on “the agent takes the green box”, the agent would expect that it had been hit by cosmic rays). More generally, conditioning the distribution on “the agent takes the green box” may introduce spurious correlations with explanations for the action (e.g., cosmic rays), and therefore $$v_a$$ does not measure the counterfactual value that the agent would get if it did take the green box “of its own volition” / “for good reasons”.

Roughly speaking, evidential decision theory has us look at the probability distribution where the agent does in fact take a particular action, whereas (when doing decision theory) we want the probability distribution over what would happen if the agent did take the action.
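
The arithmetic in the box example can be checked with a small Python sketch. The joint distribution here is an assumption made purely for illustration: `EPS` (the cosmic-ray probability) and the rule that an unhit agent always takes red, while a hit agent always takes green and then pays for the check, are hypothetical choices. Only the dollar amounts come from the example.

```python
from fractions import Fraction

BOX_VALUE = {"red": 1, "green": 100}  # dollars, from the example
CHECK_COST = 1000                     # dollars, from the example
EPS = Fraction(1, 10**9)              # hypothetical cosmic-ray probability


def toy_worlds(eps):
    """Map (hit_by_ray, action, pays_for_check) -> probability.

    Assumed world model: an unhit agent takes red and does not pay;
    a hit agent takes green, infers the hit, and pays.
    """
    return {
        (False, "red", False): 1 - eps,
        (True, "green", True): eps,
    }


def utility(action, pays):
    """Box value minus the cost of the hardware check, if it is paid."""
    return BOX_VALUE[action] - (CHECK_COST if pays else 0)


def conditional_value(worlds, action):
    """E[U | the agent takes `action`], by ordinary Bayesian conditioning."""
    mass = sum(p for (_, a, _), p in worlds.items() if a == action)
    if mass == 0:
        raise ValueError("conditioning on a probability-zero event")
    weighted = sum(
        p * utility(a, pays)
        for (_, a, pays), p in worlds.items()
        if a == action
    )
    return weighted / mass


worlds = toy_worlds(EPS)
v_red = conditional_value(worlds, "red")
v_green = conditional_value(worlds, "green")
print(v_red, v_green)  # prints: 1 -900
```

Under these assumptions every world in which green is taken is also a world with a cosmic-ray hit, so the conditional value of green inherits the $1000 cost even though the green box itself is worth $100.
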
Forcing the event “the agent takes action $$a$$” to have positive probability does not make the former distribution look like the latter distribution. Indeed, if the event has positive probability for strange reasons (cosmic rays, a small probability that reality is a hallucination, or because you played chicken with your distribution), then it’s quite unlikely that the conditional distribution will look like the desired counterfactual distribution.

We don’t want to ask “tell me about the (potentially crazy) corner of the probability distribution where the agent actually does take action $$a$$”; we want to ask “tell me about the probability distribution that is as close as possible to the current world model, except imagining that the agent takes action $$a$$.” The latter thing is still vague and underspecified, of course; figuring out how to formalize it is pretty much our goal with studying decision theory.
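
To illustrate that the conditional value depends on why an action has probability mass, and not merely on whether it does, here is a rough Python sketch. Everything beyond the dollar amounts is a hypothetical modelling choice: the split of green's mass into `p_ray_green` and `p_deliberate_green`, and the rule that the agent pays for the check whenever its posterior probability of a cosmic-ray hit exceeds `threshold`. The sketch also ignores any damage the hit itself might cause.

```python
from fractions import Fraction

GREEN_VALUE = 100  # dollars, from the box example
CHECK_COST = 1000  # dollars, from the box example


def v_green(p_ray_green, p_deliberate_green, threshold=Fraction(1, 2)):
    """E[U | green] in a toy model.

    p_ray_green:        mass on "hit by a cosmic ray and takes green"
    p_deliberate_green: mass on "not hit, takes green for good reasons"
    The agent is assumed to pay for the hardware check whenever
    P(ray | green) exceeds `threshold` (a hypothetical decision rule).
    """
    p_green = p_ray_green + p_deliberate_green
    if p_green == 0:
        return None  # conditioning on a probability-zero event is undefined
    p_ray_given_green = Fraction(p_ray_green) / p_green
    pays = p_ray_given_green > threshold
    return GREEN_VALUE - (CHECK_COST if pays else 0)


# Green has zero mass: the conditional expectation is undefined.
undefined = v_green(Fraction(0), Fraction(0))

# Green has positive mass only via cosmic rays: the value is the same
# no matter how large or small that mass is.
ray_only = [v_green(Fraction(1, 10**k), Fraction(0)) for k in (3, 6, 12)]

# Green has mass mostly for good reasons: the conditional value changes.
mostly_deliberate = v_green(Fraction(1, 10**9), Fraction(1, 10**3))

print(undefined, ray_only, mostly_deliberate)
```

In this sketch, making the cosmic-ray mass larger or smaller never moves the conditional value of green away from -900; only changing what explains the green action does.
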
