An Informal Conjecture on Proof Length and Logical Counterfactuals post by Scott Garrabrant 775 days ago | Ryan Carey, Sam Eisenstat, Jessica Taylor and Patrick LaVictoire like this | 6 comments During the May 4-6 MIRI workshop on decision theory, I made the following conjecture: If $$\phi$$ and $$\psi$$ are sentences such that the length of the shortest proof of $$\phi\rightarrow\psi$$ is much shorter than the proof of $$\phi\rightarrow\neg\psi$$, then $$\psi$$ is a legitimate counterfactual consequence of $$\phi$$. You may be thinking that this conjecture seems like it cannot be correct. This is the intuition that most of us shared until we started looking for (and failing to find) good counterexamples. After many failed attempts, we started referring to it as “The Trolljecture.” Multiple people from the workshop are currently working on writing up posts related to it. However, the conjecture is not very formal. The purpose of this post is just to give some background for those posts and to discuss the why we are keeping the conjecture in this informal state. There are a three ways in which this conjecture is not formal. I do not see a way to make any of these formal without disrupting the spirit of the conjecture. What do we mean by “legitimate counterfactual consequence”? What do we mean by “much shorter”? What do we mean by “length of the shortest proof”? I will address these issues one at a time. The term “legitimate counterfactual consequence” does not have a formal definition. If it did, many open problems in decision theory and logical uncertainty would be solved. Instead, it is a pointer to a concept that humans seem to have some intuitions about. The best I can do is give some examples of legitimate counterfactual consequence and hope that we are talking about the same cluster of things. Regardless of whether or not P=NP, “P=NP” is a legitimate counterfactual consequence of “3SAT$$\in$$P,” while “P$$\neq$$NP” is NOT a legitimate counterfactual consequence of “3SAT$$\in$$P.” In Newcomb’s problem, “The box is empty” is a legitimate counterfactual consequence of “I two-box.” This is true regardless of whether or not I am a two-boxer. Note that just because $$\phi\rightarrow \psi$$, it does not necessarily follow that $$\psi$$ is a legitimate counterfactual consequence of $$\phi$$. In this post, the notation $$\psi\in C(\phi)$$ is used for “$$\psi$$ is a legitimate counterfactual consequence of $$\phi$$.” They mean the same thing. The term “much shorter” probably could have been defined but most ways to define it seem arbitrary, and the exact definition does not matter much. There are some ways to define it which feel less arbitrary. We could just say that there exists a computable function $$f$$ for which the conjecture is true if we define much shorter by “$$n$$ is much shorter than anything greater than $$f(n)$$.” We could even put restrictions on $$f$$, like requiring it to be linear. The reason we do not try to formally define “much shorter” this way is that it would stop us from being able to in theory point at one pair of sentences to demonstrate a counterexample. For now, you can think of “much shorter” to mean “half,” and we can reevaluate if a counterexample is found. The term “length of the shortest proof” is not defined, because I have not specified a formal proof system. This is on purpose. I do not want to allow a counterexample where $$\phi\rightarrow\neg\psi$$ is not provable in PA, but has a really short proof in a stronger proof system, and that short proof is exactly why we believe $$\psi$$ is not a legitimate counterfactual consequence of $$\phi$$. It is worth pointing out that this conjecture is not a proposed definition of legitimate counterfactual consequences, it only provides sufficient conditions for something being a legitimate counterfactual consequence. Hopefully much more on this will be posted here by various people in the near future.

 by Sam Eisenstat 774 days ago | Abram Demski, Benja Fallenstein, Nate Soares and Patrick LaVictoire like this | link I (just yesterday) found a counterexample to this. The universe is a 5-and-10 variant that uses the unprovability of consistency: def U(): if A() == 2: if PA is consistent: return 10 else: return 0 else: return 5 The agent can be taken to be modal UDT, using PA as its theory. (The example will still work for other theories extending PA, we just need the universe’s theory to include the agents. Also, to simplify some later arguments we suppose that the agent uses the chicken rule, and that it checks action 1 first, then action 2.) Since the agent cannot prove the consistency of its theory, it will not be able to prove $$\tt{A() = 2} \to \tt{U() = 10}$$, so the first implication which it can prove is $$\tt{A() = 1} \to \tt{U() = 5}$$. Thus, it will end up taking action 1. Now, we work in PA and try to show $$\tt{A() = 2} \to \tt{U() = 0}$$. If PA is inconsistent (we have to account for this case since we are working in PA), then $$\tt{A() = 2} \to \tt{U() = 0}$$ follows straightforwardly. Next, we consider the case that PA is consistent and work through the agent’s decision. PA can’t prove $$\tt{A() \ne 1}$$, since we used the chicken rule, so since the sentence $$\tt{A() = 1} \to \tt{U() = 5}$$ is easily provable, the sentence $$\tt{A() = 1} \to \tt{U() = 10}$$ (ie. the first sentence that the agents checks for proofs of) must be unprovable. The next sentence we check is $$\tt{A() = 2} \to \tt{U() = 10}$$. If the agent finds a proof of this, then it takes action 2. Otherwise, it moves on to the sentence $$\tt{A() = 1} \to \tt{U() = 5}$$, which is easily provable as mentioned above, and it takes action 1. Hence, the agent takes action 2 iff it can prove $$\tt{A() = 2} \to \tt{U() = 10}$$, so $\tt{A() = 2} \leftrightarrow \square(\tt{A() = 2} \to \tt{U() = 10}).$ Löb’s theorem tells us that $\square(\tt{U = 10}) \leftrightarrow \square(\square(\tt{U = 10}) \to \tt{U() = 10}),$ so, by the uniqueness of fixed points, it follows that $$\tt{A() = 2} \leftrightarrow \square(\tt{U = 10})$$. Then, we get $$\tt{A() = 2} \rightarrow \square(\tt{U = 10})$$, so $$\tt{A() = 2} \rightarrow \square(\neg \square \bot)$$ by the definition of the universe, and so $$\tt{A() = 2} \rightarrow \square(\bot)$$ by Gödel’s second incompleteness theorem. Thus, if the agent takes action 2, then PA is inconsistent, so $$\tt{U() = 0}$$ as desired. This tells us that $$\rm{PA} \vdash \tt{A() = 2} \to \tt{U() = 0}$$. Also, $$\rm{PA} \nvdash \tt{A() \ne 2}$$ by the chicken rule, so $$\rm{PA} \nvdash \tt{A() = 2} \to \tt{U() \ne 0}$$. Since PA does not prove $$\tt{A() = 2} \to \tt{U() \ne 0}$$ at all, the shortest proof of $$\rm{PA} \vdash \tt{A() = 2} \to \tt{U() = 0}$$ is much shorter than the shortest proof of $$\tt{A() = 2} \to \tt{U() \ne 0}$$ for any definition of “much shorter”. (One can object here that there is no shortest proof, but (a) it seems natural to define the “length of the shortest proof” to be infinite if there is no proof, and (b) it is probably straightforward but tedious to modify the agent and universe so that there is a proof of $$\tt{A() = 2} \to \tt{U() \ne 0}$$, but it is very long.) However, it is clear that $$\tt{U() = 0}$$ is not a legitimate counterfactual consequence of $$\tt{A() = 2}$$. Informally, if the agent chose action 2, it would have received utility 10, since PA is consistent. Thus, we have a counterexample. One issue we discussed during the workshop is whether counterfactuals should be defined with respect to a state of knowledge. We may want to say here that we, who know a lot, are in a state of knowledge with respect to which $$\tt{A() = 2}$$ would counterfactually result in $$\tt{U() = 10}$$, but that someone who reasons in PA is in a state of knowledge w.r.t. which it would result in $$\tt{U() = 0}$$. One way to think about this is that we know that PA is obviously consistent, irrespective of how the agent acts, whereas PA does not know that it is consistent, allowing an agent using PA to think of itself as counterfactually controlling PA’s consistency. Indeed, this is roughly how the argument above proceeds. I’m not sure that this is a good way of thinking about this though. The agents goes through some weird steps, most notably a rather opaque application of the fixed point theorem, so I don’t have a good feel for why it is reasoning this way. I want to unwrap that argument before I can say whether it’s doing something that, on an intuitive level constitutes legitimate counterfactual reasoning. More worryingly, the perspective of counterfactuals as being defined w.r.t. states of knowledge seems to be at odds with PA believing a wrong counterfactual here. It would make sense for PA not to have enough information to make any statement about the counterfactual consequences of $$\tt{A() = 2}$$, but that’s not what’s happening if we think of PA’s counterfactuals as obeying this conjecture; instead, PA postulates a causal mechanism by which the agent controls the consistency of PA, which we didn’t expect to be there at all. Maybe it would all make sense if I had a deeper understanding of the proof I gave, but right now it is very odd. (This is rather long; perhaps it should be a post? Would anyone prefer that I clean up a few things and make this a post? I’ll also expand on the issue I mention at the end when I have more time to think about it.) reply
 by Benja Fallenstein 774 days ago | Patrick LaVictoire likes this | link Next, we consider the case that PA is consistent and work through the agent’s decision. PA can’t prove $$A()\neq1$$, since we used the chicken rule, so since the sentence $$A()=1\to U()=5$$ is easily provable, the sentence $$A()=1\to U()=10$$ (ie. the first sentence that the agents checks for proofs of) must be unprovable. It seems like this argument needs soundness of PA, not just consistency of PA. Do you see a way to prove in PA that if $$\mathrm{PA}\vdash A()\neq 1$$, then PA is inconsistent? [edited to add:] However, your idea reminds me of my post on the odd counterfactuals of playing chicken, and I think the example I gave there makes your idea go through: The scenario is that you get 10 if you take action 1 and it’s not provable that you don’t take action 1; you get 5 if you take action 2; and you get 0 if you take action 1 and it’s provable that you don’t. Clearly you should take action 1, but I prove that modal UDT actually takes action 2. To do so, I show that PA proves $$A() = 1 \to \neg\square\ulcorner A() = 1\urcorner$$. (Then from the outside, $$A() = 2$$ follows from the outside by soundness of PA.) This seems to make your argument go through if we can also show that PA doesn’t show $$A() \neq 1$$. But if it did, then modal UDT would take action 1 because this comes first in its proof search, contradiction. Thus, PA proves $$A() = 1 \to U() = 0$$ (because this follows from $$A() = 1 \to \neg\square\ulcorner A() = 1\urcorner$$), and also PA doesn’t prove $$A() = 1 \to U() = 10$$. As in your argument, then, the trolljecture implies that we should think “if the agent takes action 1, it gets utility 0” is a good counterfactual, and we don’t think that’s true. Still interested in whether you can make your argument go through in your case as well, especially if you can use the chicken step in a way I’m not seeing yet. Like Patrick, I’d encourage you to develop this into a post. reply
 by Sam Eisenstat 774 days ago | Benja Fallenstein and Patrick LaVictoire like this | link The argument that I had in mind was that if $$\rm{PA} \vdash \tt{A()} \ne 1$$, then $$\rm{PA} \vdash \square \ulcorner \tt{A()} \ne 1 \urcorner$$, so $$\rm{PA} \vdash \tt{A()} = 1$$ since PA knows how the chicken rule works. This gives us $$\rm{PA} \vdash \bot$$, so PA can prove that if $$\rm{PA} \vdash \tt{A()} \ne 1$$, then PA is inconsistent. I’ll include this argument in my post, since you’re right that this was too big a jump. Edit: We also need to use this argument to show that the modal UDT agent gets to the part where it iterates over utilities, rather than taking an action at the chicken rule step. I didn’t mention this explicitly, since I felt like I had seen it before often enough, but now I realize it is nontrivial enough to point out. reply
 by Patrick LaVictoire 774 days ago | link Nice! Yes, I encourage you to develop this into a post. reply
 by Eliezer Yudkowsky 774 days ago | Jessica Taylor and Patrick LaVictoire like this | link I can’t see the grandparent, so posting here: It occurs to me that maybe we could regard the agent as consistently reasoning, “If I choose of my own free will to output 2, that thereby causes Peano Arithmetic to be inconsistent, causing me to get 0 points.” I mostly don’t buy this, but it slightly defends the legitness of the counterfactual. reply

### NEW DISCUSSION POSTS

The AI defers to anything
 by Paul Christiano on Corrigibility thoughts II: the robot operator | 0 likes

Thus anything that can
 by Stuart Armstrong on Corrigibility thoughts II: the robot operator | 0 likes

Ah, thanks! That seems more
 by Stuart Armstrong on Loebian cooperation in the tiling agents problem | 0 likes

It doesn't mean computation
 by Vladimir Slepnev on Loebian cooperation in the tiling agents problem | 1 like

I'm not sure this would work,
 by Stuart Armstrong on Loebian cooperation in the tiling agents problem | 0 likes

>How can the short term
 by Stuart Armstrong on Humans are not agents: short vs long term | 0 likes

I expect a workable approach
 by Paul Christiano on Corrigibility thoughts II: the robot operator | 0 likes

 by Stuart Armstrong on Corrigibility thoughts II: the robot operator | 0 likes

It is ‘a preference for
 by Stuart Armstrong on Humans are not agents: short vs long term | 0 likes

Note that we don't need to
 by Paul Christiano on ALBA requires incremental design of good long-term... | 0 likes

If I want my boat to travel
 by Paul Christiano on Corrigibility thoughts II: the robot operator | 0 likes

I don't think it's much like
 by Abram Demski on An Approach to Logically Updateless Decisions | 0 likes

Yeah, I like tail dependence.
 by Sam Eisenstat on An Approach to Logically Updateless Decisions | 0 likes

This is basically the
 by Paul Christiano on Cooperative Oracles: Stratified Pareto Optima and ... | 1 like

I think AsDT has a limited
 by Abram Demski on An Approach to Logically Updateless Decisions | 2 likes