Intelligent Agent Foundations Forum
Third-person counterfactuals
post by Benja Fallenstein 1027 days ago | Kaya Stechly, Nate Soares and Patrick LaVictoire like this | 4 comments

If you’re thinking about the counterfactual world where you do X in the process of deciding whether to do X, let’s call that a first-person counterfactual. If you’re thinking about it in the process of deciding whether another agent A should have done X instead of Y, let’s call that a third-person counterfactual. The definition of, e.g., modal UDT uses first-person counterfactuals, but when we try to prove a theorem showing that modal UDT is “optimal” in some sense, then we need to use third-person counterfactuals.

UDT’s first-person counterfactuals are logical counterfactuals, but our optimality result evaluates UDT by using physical third-person counterfactuals: it asks, would another agent have done better, not, would a different action by the same agent have led to a better outcome? The former is easier to analyze, but the latter seems to be what we really care about. Nate’s recent post on “global UDT” points towards turning UDT into a notion of third-person counterfactuals, and describes some problems. In this post, I’ll give a fuller UDT-based notion of logical third-person counterfactuals, which at least fails visibly (returns an error) in the kinds of cases Nate describes. However, in a follow-up post I’ll give an example where this definition returns a non-error value which intuitively seems wrong.

Before I start, a historical side note: When Kenny Easwaran visited us for two days and we proved the UDT optimality result, the reason we decided to use physical counterfactuals wasn’t actually that we thought these were the better kind of counterfactuals. Rather, we thought explicitly about the problem of physical vs. logical third-person counterfactuals on the first morning of Kenny’s visit, and decided to look at the physical counterfactuals case because it seemed easier to reason about. This turned out to be a great decision because, to our surprise, we very quickly ended up proving the first version of what later became the modal UDT optimality result!

But today, let’s talk about logical counterfactuals. As Nate points out in his Global UDT post, there’s a sort of duality between first-person and third-person counterfactuals—given a good third-person notion of counterfactuals, you can try to turn it into a first-person notion by writing an agent that evaluates actions according to it, and given a first-person notion you can try to turn it into a third-person notion. So is there a way to turn, say, the first-person counterfactuals of modal UDT into a way to evaluate what would have happened in a certain universe if a certain agent had taken a different action?

Nate’s post describes an algorithm, GlobalUDT(U,A), which tries to tell you what agent \(A()\) should have done in order to achieve the best outcome in universe \(U()\). Here, I want to ask a more intermediate question: What would have happened if \(A()\) had chosen a different action? Of course, we can then say that the agent should have taken the action that leads to the best possible outcome in this sense, but one advantage of my proposal is that it sometimes says, “I don’t know what would have happened in that case”; in particular, in the cases Nate discusses in his post, my proposal would say that it doesn’t have an answer, rather than giving a wrong answer. (However, in a follow-up post I’ll show that there are cases in which my proposal gives an intuitively incorrect answer.)

So here’s my proposal. Suppose that \(\vec A\) is an \(m\)-action agent, that is, a “provably mutually exclusive and exhaustive” (p.m.e.e.) sequence of \(m\) closed modal formulas \((A_1,\dotsc,A_m)\), where \(A_i\) is interpreted as “the agent takes action \(i\)”. “P.m.e.e.” means that it’s provable that exactly one of the \(m\) formulas is true. Similarly, \(\vec U\) is an \(n\)-outcome universe, i.e., a p.m.e.e. sequence \((U_1,\dotsc,U_n)\) where \(U_j\) means “the \(j\)’th-best outcome obtains”.

We say that, according to this notion of counterfactuals, action \(i\) leads to outcome \(j\) if (i) \(\mathrm{GL}\vdash A_i\to U_j\), and (ii) \(\mathrm{GL}\nvdash \neg A_i\). So for every \(i\), there are three possible cases:

  • If there’s exactly one \(j\) such that \(\mathrm{GL}\vdash A_i\to U_j\), then we say that action \(i\) leads to outcome \(j\).
  • If \(\mathrm{GL}\vdash\neg A_i\), then we don’t know what would have happened if the agent had taken action \(i\), because we “don’t have enough counterfactuals”: there is no model of PA in which \(A_i\) is true (we can think of the models of PA as the “impossible possible worlds” we use to evaluate the impact of different actions). In particular, this is the case if we have both \(\mathrm{GL}\vdash A_i\to U_j\) and \(\mathrm{GL}\vdash A_i\to U_{j'}\), for \(j\neq j'\), since this implies \(\mathrm{GL}\vdash\neg A_i\) by the assumption that \(U_j\) and \(U_{j'}\) are provably mutually exclusive.
  • If there’s no \(j\) such that \(\mathrm{GL}\vdash A_i\to U_j\), then we don’t know what would have happened if the agent had taken action \(i\), because we have “ambiguous counterfactuals”: there are some distinct \(j\) and \(j'\) such that there’s a model of PA in which \(A_i\wedge U_j\), and a different model in which \(A_i\wedge U_{j'}\). (We know that there is a model in which \(A_i\) is true, because otherwise we’d have \(\mathrm{GL}\vdash\neg A_i\), which would imply \(\mathrm{GL}\vdash A_i\to U_j\) for every \(j\).)
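The three cases can be sketched in code. The following is a minimal sketch, not a definitive implementation: the tuple encoding of formulas and the names `holds`, `gl_proves` and `classify` are hypothetical. It assumes the standard fact (not shown here) that a closed modal formula is a theorem of GL exactly when it holds at every rank of a finite linear Kripke frame, and that checking ranks 0 up to the box-nesting depth is enough. It does not check that \(\vec A\) and \(\vec U\) are p.m.e.e.; that is taken as given. The two-outcome universes below are made up for illustration.

```python
# Hypothetical encoding of closed (letterless) modal formulas as nested tuples.
BOT = ("bot",)

def Not(p):
    return ("not", p)

def Imp(p, q):
    return ("imp", p, q)

def Box(p):
    return ("box", p)

TOP = Not(BOT)

def depth(f):
    """Maximum nesting of boxes in f."""
    if f[0] == "bot":
        return 0
    if f[0] == "box":
        return 1 + depth(f[1])
    return max(depth(g) for g in f[1:])

def holds(f, n):
    """Truth of f at a world of rank n; a box quantifies over all ranks below n."""
    tag = f[0]
    if tag == "bot":
        return False
    if tag == "not":
        return not holds(f[1], n)
    if tag == "imp":
        return (not holds(f[1], n)) or holds(f[2], n)
    if tag == "box":
        return all(holds(f[1], k) for k in range(n))
    raise ValueError(f"unknown connective: {tag}")

def gl_proves(f):
    """Assumed decision procedure: f is a GL theorem iff it holds at ranks 0..depth(f)."""
    return all(holds(f, n) for n in range(depth(f) + 1))

def classify(A, U, i):
    """Classify action i (1-based) into one of the three cases from the text."""
    a = A[i - 1]
    if gl_proves(Not(a)):
        return "not enough counterfactuals"
    hits = [j for j, u in enumerate(U, start=1) if gl_proves(Imp(a, u))]
    if len(hits) == 1:
        return f"leads to outcome {hits[0]}"
    return "ambiguous counterfactuals"

# An agent that always takes action 3, in a made-up two-outcome universe.
A = (BOT, BOT, TOP)
U = (BOT, TOP)
results = [classify(A, U, i) for i in (1, 2, 3)]

# A made-up pair in which action 2 is possible but its outcome is not pinned down.
A2 = (Box(BOT), Not(Box(BOT)))
U2 = (Box(Box(BOT)), Not(Box(Box(BOT))))
results2 = [classify(A2, U2, i) for i in (1, 2)]
```

In the first pair, actions 1 and 2 are provably never taken, so they fall into the "not enough counterfactuals" case, while action 3 leads to outcome 2. In the second pair, action 1 leads to outcome 1 and action 2 is ambiguous.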

Now, if we consider Nate’s example of an agent that has three possible actions, but always takes the third one (i.e., \(\vec A \equiv (A_1,A_2,A_3) \equiv (\bot,\bot,\top)\)), then it’s clear that our third-person counterfactuals will not fail silently, but rather give the reasonable answer that it’s hard to say what outcome the agent would have achieved if it had returned a different value. For example, say that \(U_{14} \equiv \top\wedge(\bot\vee\neg\top)\); are some of the \(\top\)’s and \(\bot\)’s in the definition of this universe invocations of the agent? Which ones? We might hope that there’s a notion of third-person counterfactuals which can answer questions like this about the real world, but presumably it would need to make more use of the more complicated structure of the real universe; as posed, the question doesn’t seem to have a good answer.

But when we apply this notion to modal UDT, it returns a non-error answer sufficiently often to allow us to show an at least superficially sensible (if rather trivial!) optimality result.

Let’s say that a pair \((\vec A,\vec U)\) is “fully informative” if every \(i\) leads to some \(j\) according to our notion of counterfactuals. Then, given a fully informative pair, we can say that \(\vec A\) is optimal (according to our notion of counterfactuals!) iff the outcome that \(\vec A\)’s actual action leads to is optimal among the outcomes achievable by any of the available actions.
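As a small illustration of these two definitions, here is a sketch that takes the "leads to" relation as given. The dictionary representation and the function names are hypothetical: `leads_to` maps each action index to the outcome it leads to, or to `None` when the counterfactual is missing or ambiguous. Lower outcome indices are better, as in the text.

```python
from typing import Dict, Optional

def is_fully_informative(leads_to: Dict[int, Optional[int]]) -> bool:
    """Every action leads to some definite outcome."""
    return all(j is not None for j in leads_to.values())

def is_optimal(leads_to: Dict[int, Optional[int]], actual_action: int) -> bool:
    """The actual action's outcome is the best one reachable by any action."""
    if not is_fully_informative(leads_to):
        raise ValueError("optimality in this sense needs a fully informative pair")
    best = min(leads_to.values())  # outcome 1 is best, so smaller is better
    return leads_to[actual_action] == best

# Made-up example: action 2 reaches the best reachable outcome.
example = {1: 2, 2: 1, 3: 3}
```

With `example`, `is_optimal(example, 2)` holds and `is_optimal(example, 1)` does not. A relation such as `{1: 1, 2: None}` is not fully informative, so this notion of optimality does not apply to it.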

Now it’s rather straightforward to see that modal UDT is optimal, in this sense, on a universe \(\vec U\) whenever the pair \((\vec{\mathrm{UDT}}(\vec U),\vec U)\) is fully informative. Recall the way that modal UDT works:

  • For every outcome \(j = 1\) to \(n\) (from best to worst):
    • For every action \(i = 1\) to \(m\):
      • If \(\square(A_i\to U_j)\), then take action \(i\).
  • If you’re still here, take some default action.
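The search above can be unfolded rank by rank on a Kripke frame. The sketch below does this under a strong simplifying assumption that is not part of the definition: the universe is a plain function from the action taken to the outcome obtained, with no modal structure of its own. The names `boxed` and `modal_udt_trace` are hypothetical. At rank \(n\), \(\square(A_i\to U_j)\) is read as "at every rank below \(n\), if action \(i\) was taken then outcome \(j\) obtained".

```python
from typing import Callable, List, Tuple

def boxed(trace: List[Tuple[int, int]], i: int, j: int) -> bool:
    """Box(A_i -> U_j) at the current rank: the implication holds at all lower ranks."""
    return all(action != i or outcome == j for action, outcome in trace)

def modal_udt_trace(universe: Callable[[int], int], m: int, n: int,
                    default: int = 1, ranks: int = 4) -> List[Tuple[int, int]]:
    """Return the (action, outcome) pair at each rank 0..ranks-1."""
    trace: List[Tuple[int, int]] = []
    for _ in range(ranks):
        action = default
        found = False
        for j in range(1, n + 1):          # outcomes from best to worst
            for i in range(1, m + 1):      # actions in order
                if boxed(trace, i, j):
                    action, found = i, True
                    break
            if found:
                break
        trace.append((action, universe(action)))
    return trace

# Made-up universe: action 1 yields outcome 2, action 2 yields outcome 1.
swap = {1: 2, 2: 1}
trace = modal_udt_trace(lambda a: swap[a], m=2, n=2)
```

In this example the trace starts with action 1 at rank 0, where every boxed statement holds vacuously, and then settles on action 2, the action that yields the best outcome. The settled value at higher ranks is the one to compare with the agent's actual behaviour.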

Clearly, in the fully informative case, this algorithm will take the optimal action (in the sense we use here): Suppose that \(j\) is optimal, and \(i\) leads to \(j\). The search will not find a proof of an implication \(A_{i'}\to U_{j'}\) with \(j' \lt j\), because then \(j\) wouldn’t be optimal according to our definition; and the search will terminate when considering the pair \((j,i)\) at the latest; so modal UDT will return some action \(i^*\) for which \(\mathrm{GL}\vdash A_{i^*}\to U_j\).

I’d like to say that this covers all the cases in which we would expect modal UDT to be optimal, but unfortunately that’s not quite the case. Suppose that there are two actions, \(A_1\) and \(A_2\), and two outcomes, \(U_1\) and \(U_2\). In this case, it’s consistent that \(i = 1\) leads to \(j = 1\), but we don’t have enough counterfactuals about \(i = 2\), that is, \(\mathrm{GL}\vdash\neg A_2\) (implying that \((\vec A,\vec U)\) isn’t fully informative). This is because modal UDT doesn’t have an explicit “playing chicken” step that would make it take action \(A_2\) if it can prove that it doesn’t take this action. Now, if we did not have \(\mathrm{GL}\vdash A_1\to U_1\), then \(\mathrm{GL}\vdash\neg A_2\) would imply that the agent would take action \(2\) (because \(\neg A_2\) implies \(A_2\to U_1\)), which would lead to a contradiction (the agent takes an action that it provably doesn’t take), so we can rule out that case; but the case of \(\mathrm{GL}\vdash A_1\to U_1\) plus \(\mathrm{GL}\vdash\neg A_2\) is consistent.
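The step that rules out the first case can be laid out as a short derivation. It treats the agent’s check of \(\square(A_i\to U_j)\) as GL-provability, as elsewhere in this post, and uses the search order of modal UDT, which tries \((j,i)=(1,1)\) and then \((1,2)\).

```latex
\[
\begin{aligned}
&\text{Assume } \mathrm{GL}\vdash\neg A_2 \text{ and } \mathrm{GL}\nvdash A_1\to U_1. \\
&(1)\quad \mathrm{GL}\vdash\neg A_2 \;\Longrightarrow\; \mathrm{GL}\vdash A_2\to U_1 \\
&(2)\quad \text{the check for } (j,i)=(1,1) \text{ fails and the check for } (1,2) \text{ succeeds}
      \;\Longrightarrow\; \mathbb{N}\vDash A_2 \\
&(3)\quad \mathrm{GL}\vdash\neg A_2 \;\Longrightarrow\; \mathbb{N}\vDash\neg A_2
      \quad\text{(soundness of GL)} \\
&(2)\text{ and }(3)\text{ contradict each other, so }
      \mathrm{GL}\vdash\neg A_2 \text{ forces } \mathrm{GL}\vdash A_1\to U_1.
\end{aligned}
\]
```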

So let’s say that a pair \((\vec A,\vec U)\) is “sufficiently informative” if either it’s fully informative or if there is some action \(i\) such that \(\mathrm{GL}\vdash A_i\to U_1\). Then we can say that \(\vec A\) is optimal if either (i) \((\vec A,\vec U)\) is fully informative and \(\vec A\) is optimal in the sense discussed earlier, or (ii) \(\mathbb{N}\vDash U_1\), that is, the agent actually obtains the best possible outcome. With these definitions, we can show that modal UDT is optimal whenever \((\vec{\mathrm{UDT}}(\vec U),\vec U)\) is sufficiently informative.

The reasoning is simple. In the fully informative case, our earlier proof works. In the other case, there’s some \(i\) such that \(\mathrm{GL}\vdash A_i\to U_1\), so the agent’s search is certainly going to stop when it considers \(A_i\to U_1\) at the latest; in other words, it’s going to stop at some \(i^* \le i\) such that \(\mathrm{GL}\vdash A_{i^*}\to U_1\), and the agent is going to output that action \(i^*\); i.e., we’ll have \(\mathbb{N}\vDash A_{i^*}\). But since GL is sound, we also have \(\mathbb{N}\vDash A_{i^*}\to U_1\), and hence \(\mathbb{N}\vDash U_1\), showing optimality in the extended sense.

It’s not surprising that modal UDT is “optimal” in this sense, of course! Nevertheless, as a conceptual tool, it seems useful to have this definition of logical third-person counterfactuals, to complement the first-person notion of modal UDT.

However, my not-so-secret agenda for going through this in detail is that in a follow-up post, I’ll show that there are universes \(\vec U\) such that \((\vec{\mathrm{UDT}}(\vec U),\vec U)\) is fully informative, but UDT still does the intuitively incorrect thing—because the notion of counterfactuals (and hence the notion of optimality) I’ve defined in this post doesn’t agree with intuition as well as we’d like. This failure turns out to be clearer in the context of the third-person counterfactuals described in this post than in modal UDT’s first-person ones.

by Patrick LaVictoire 1026 days ago | Benja Fallenstein likes this | link

Typo on indices, should be:

then GL\(\vdash\neg A_2\) would imply that the agent would take action 2 (because \(\neg A_{\mathbf 2}\) implies \(A_{\mathbf 2}\to U_1\))

Also, isn’t your example also fully informative, since if GL\(\vdash \neg A_2\), then GL also proves true and spurious counterfactuals about \(A_2\)?


by Benja Fallenstein 1025 days ago | Patrick LaVictoire likes this | link

Fixed—thanks, Patrick!

Regarding the example, I earlier defined “action \(i\) leads to outcome \(j\)” to mean the conjunction of \(\mathrm{GL}\vdash A_i\to U_j\) and \(\mathrm{GL}\nvdash\neg A_i\); i.e., we check for spurious counterfactuals before believing that \(\mathrm{GL}\vdash A_i\to U_j\) tells us something about what action \(i\) leads to, and we only consider ourselves “fully informed” in this sense if we have non-spurious information for each \(i\). (Of course, my follow-up post is about how that’s still unsatisfactory; the reason to define this notion of “fully informative” so explicitly was really to be able to say more clearly in which sense we intuitively have a problem even when we’ve ruled out the problems of ambiguous counterfactuals and not enough counterfactuals.)


by Patrick LaVictoire 1025 days ago | Benja Fallenstein likes this | link

Ah! I’d failed to propagate that somehow.

Given that we’re using PA+n to defeat evil problems, the true modal definition of “action \(i\) leads to outcome \(j\)” might be something like “there exists a closed formula \(\phi\) such that \(\mathbb{N}\models\phi\), GL \(\vdash \phi\to(A_i\to U_j)\), and GL \(\not\vdash \phi\to\neg A_i\)”. But that’s an unnecessary complication for this post.


by Benja Fallenstein 1025 days ago | link

Yeah, that sounds good! Of course, by the Kripke levels argument, it’s sufficient to consider \(\phi\)’s of the form \(\neg\square^n\bot\). And we might want to have a separate notion of “\(i\) leads to \(j\) at level \(n\)”, which we can actually implement in a finite modal formula. This seems to suggest a version of modal UDT that tries to prove things in PA, then if that has ambiguous counterfactuals (i.e., it can’t prove \(A_i\to U_j\) for any \(j\)) we try PA+1 and so on up to some finite \(n\); then we can hope these versions of UDT approximate optimality according to your revised version of “\(i\) leads to \(j\)” as \(n\to\infty\). Worth working out!





