Proof Length and Logical Counterfactuals Revisited post by Patrick LaVictoire 711 days ago | Sam Eisenstat, Jessica Taylor and Scott Garrabrant like this | 5 comments Update: This version of the Trolljecture fails too; see the counterexample due to Sam. In An Informal Conjecture on Proof Length and Logical Counterfactuals, Scott discussed a “trolljecture” from a MIRI workshop, which attempted to justify (some) logical counterfactuals based on the lengths of proofs of various implications. Then Sam produced a counterexample, and Benja pointed to another counterexample. But at the most recent MIRI workshop, I talked with Jonathan Lee and Holger Dell about a related way of evaluating logical counterfactuals, and we came away with a revived trolljecture! Let’s say we have a recursively enumerable set of axioms $$S$$ (in the language of PA) and a statement $$\varphi$$. Then we define the “distance” $$d_S(\varphi)$$ as the length of the shortest proof of $$\varphi$$ using the axioms $$S$$. (We set $$d_S(\varphi)=\infty$$ if there is no such proof.) And now we define a valuation $\nu_S(\varphi)=\frac{d_S(\neg\varphi)}{d_S(\neg\varphi)+d_S(\varphi)}$ which we will interpret as the “probability” of the logical counterfactual that $$S$$ implies $$\varphi$$. (This is undefined when $$\varphi$$ is undecidable given $$S$$; we could define it as a limit when searching over proofs of bounded length, in which case it would equal 0.5 in such cases. But as we shall see, this is not the case we care about.) This valuation has the trivial properties that $$\nu_S(\varphi)=1$$ if $$S\vdash\varphi$$ and $$S\not\vdash\neg\varphi$$, and similarly $$\nu_S(\varphi)=0$$ if $$S\not\vdash\varphi$$ and $$S\vdash\neg\varphi$$. But what’s more interesting are the properties that this valuation has if $$S$$ is inconsistent. If there is a short proof of $$\varphi$$ from $$S$$, and a much longer proof of $$\neg\varphi$$ from $$S$$, then $$\nu_S(\varphi)\approx 1$$. Moreover, we have an approximate coherence condition $\nu_S(\varphi\wedge\psi) + \nu_S(\varphi\wedge\neg\psi) \approx \nu_S(\varphi)$ so long as either $$\varphi$$ or $$\psi$$ has a proof from $$S$$ that is much shorter than the proof of $$\bot$$ from $$S$$. (See the notes below.) And this gets some key examples “correct” if we add the axioms that we actually care about (and which make the problem truly counterfactual). For instance, let’s take Sam’s counterexample to the original Trolljecture: def U(): if A() = 1: if PA is consistent: return 10 else: return 0 else: return 5 def A(): if PA ⊢ A() = 1 → U() = 10: return 1 else: return 2 In that case, $$A()$$ actually returns 2 and $$U()$$ returns 5, but PA proves the “incorrect” counterfactual that $$A()=1\to U()=0$$ and does not prove the “correct” counterfactual that $$A()=1\to U()=10$$. (Read the linked post if this puzzles you.) But what if we consider reasoning from the inconsistent premises that PA is consistent and yet $$A()=10$$? In that case, I claim that $$\nu_{\textsf{PA+Con(PA)}+(A()=1)}(U()=10)$$ is near 1 while $$\nu_{\textsf{PA+Con(PA)}+(A()=1)}(U()=0)$$ is near 0. To see this, note that there is a short proof from these axioms that $$U()=10$$ (look at the definition of $$U$$, apply the axiom that $$A()=1$$, apply the axiom $$\textsf{Con(PA)}$$, and conclude that $$U()=10$$), but the shortest proof from those axioms that $$U()=0$$ is longer: we would need to prove Godel’s Second Incompleteness Theorem to show that $$\textsf{Con(PA)}\to A()=2$$, thus $$\neg \textsf{Con(PA)}$$, and thus $$U()=0$$. (Of course, I could be missing a much shorter proof, but I think this particular example goes through. The same reasoning also holds for the example Benja mentioned.) Therefore, I will go out on a limb and propose the following: Revived Trolljecture: If we consider a modal agent $$A()$$ inside a modal universe $$U()$$, then for sufficiently large $$N$$, the “correct” logical counterfactuals of the form “if $$A()=a$$, then $$U()=u$$” are the ones such that the valuations $$\nu_{\textsf{PA+N}+(A()=a)}(U()=u)$$ are high. In particular, if for $$N$$ sufficiently large, there is a short proof in $$\textsf{PA+N}$$ of $$A()=a\to U()=u$$ and no comparably short proof of $$A()\neq a$$, then $$u$$ is unique with this property, and this is the “correct” counterfactual utility for $$A()=a$$. I’m not very confident that this is true as stated, but I have some reasonable hope that either a counterexample will be enlightening about logical counterfactuals, or that an improved analogue of the valuation will in fact be quite useful. Notes on this valuation: This is a claim about “third person” counterfactuals, not those done by an agent itself. Obviously, an agent that makes decisions based on this valuation will not be definable in modal logic, and I don’t yet make any claims about what such an agent might do in the PA versions of such problems. The approximate coherence is of the following form: if the shortest proof of contradiction from $$S$$ has length $$N$$, and one of the statements $$\varphi$$, $$\neg\varphi$$, $$\psi$$, $$\neg\psi$$ has a proof from $$S$$ of length $$n\ll N$$, then $\nu_S(\varphi\wedge\psi) + \nu_S(\varphi\wedge\neg\psi) = \nu_S(\varphi) + O(n/N).$ I don’t know a cleverer proof than going through the cases, but each case is straightforward. For instance, if $$d_S(\varphi)=n$$, then we must have $$N-n\leq d_S(\neg\varphi) \leq N+1$$, and so it follows quickly that $$\nu_S(\varphi) = 1 + O(n/N)$$; then, considering the three cases $$d_S(\psi)\lesssim n$$, $$d_S(\neg\psi)\lesssim n$$, and neither, in each case it is clear that $$\nu_S(\varphi\wedge\psi) = \nu_S(\psi) + O(n/N)$$. It does not, however, seem to be the case that $$\nu_S(\varphi\wedge\psi) + \nu_S(\varphi\wedge\neg\psi) \approx \nu_S(\varphi)$$ whenever there is an easily provable relation between $$\varphi$$ and $$\psi$$. I am not entirely sure that I’m not missing a necessary logical relation, but it seems like we could consider two sentences $$\varphi$$ and $$\theta$$ whose proofs are “independent” of the contradiction in $$S$$ and of each other, and then let $$\psi=\varphi\wedge\theta$$ so that $$\psi\to\varphi$$ is quickly provable from $$S$$. Say that $$d_S(\bot)=N$$, $$d_S(\varphi)=N/2$$, $$d_S(\theta)=N/3$$; if these are independent in that way, we should be able to get $$d_S(\neg\varphi)\approx N$$, $$d_S(\neg\theta)\approx N$$, $$d_S(\psi)\approx d_S(\varphi) + d_S(\theta)\approx 5N/6$$, and $$d_S(\neg\psi)\approx N$$. Then $$\nu_S(\varphi)\approx 2/3$$, $$\nu_S(\varphi\wedge\psi)\approx \nu_S(\psi) \approx 6/11$$, and $$\nu_S(\varphi\wedge\neg\psi)$$ will be at least 1/4 since $$d_S(\varphi\wedge\neg\psi)\approx N$$ and $$d_S(\neg\varphi\vee\psi)\geq N/3$$, so $$\nu_S(\varphi\wedge\psi) + \nu_S(\varphi\wedge\neg\psi) - \nu_S(\varphi) > 6/11+1/4-2/3>0$$. I don’t know whether this could be used to break the revived trolljecture. If the shortest proof of contradiction from $$S$$ has length $$N$$, then for any $$\varphi$$, $$\frac{1}{N+2}\leq \nu_S(\varphi) \leq \frac{N+1}{N+2}$$; the valuations are bounded away from 0 and 1. This may be a feature and not a bug: a mathematician could reason decently confidently about what math would look like in the counterfactual where Fermat’s Last Theorem were false (presuming that its actual proof is quite long), but it is hard to say very much about what math would look like in the counterfactual where $$1+1=3$$. However, there are some examples, like naive set theory before Russell’s Paradox, where human mathematicians reasoned mostly coherently despite using axioms that (they did not notice) led quickly to contradictions; the valuation does not capture that aspect of mathematical intuition. I restricted the new trolljecture to modal universes because it is well-specified how the stronger axiom schemas $$PA+N$$ eventually decide the agent and the universe in the standard model of the natural numbers. If something like it holds up here, then there may be a more general version.

 by Patrick LaVictoire 697 days ago | link ASP has only been formalized in ways that don’t translate to modal universes, so it’s only the analogous extensions that could apply there. In the formulation here, it’s not enough that the agent one-box; it must do so provably in less than $$N$$ steps. I think that both the valuations $$\nu_{\textsf{ZFC+Sound(ZFC)}+A()=1}(U()=1M)$$ and $$\nu_{\textsf{ZFC+Sound(ZFC)}+A()=1}(U()=0)$$ will probably be about $$1/2$$, but I’m not sure. reply
 by Patrick LaVictoire 697 days ago | link I can prove the property that for each hypothesis $$A()=a$$ there is at most one $$u$$ such that $$U()=u$$ has a high valuation (for sufficiently high PA+N), with the following caveat: it can sometimes take many steps to prove that $$u\neq u'$$ in PA+N, so we’ll need to include the length of that proof in our bound. In what follows, we will take all subscripts of $$d$$ and $$\nu$$ to be $$PA+N, A()=a$$ for $$N$$ large. For any $$\phi$$, $$d(\bot) - d(\neg\phi)\leq d(\phi)\leq d(\bot)$$ and thus $1 - \frac{d(\phi)}{d(\bot)} \leq \nu(\phi) \leq \frac{d(\bot)}{d(\phi)+d(\bot)}.$ Also, $$d(U()=u)+d(U()=u')+d(u\neq u')\geq d(\bot)$$. This implies $$\max\{d(U()=u),d(U()=u')\}\geq \frac12(d(\bot)-d(u\neq u))$$, which implies $\min\{\nu(U()=u),\nu(U()=u')\}\leq \min\{\frac{d(\bot)}{d(U()=u)+d(\bot)},\frac{d(\bot)}{d(U()=u')+d(\bot)}\} \leq \frac{2d(\bot)}{3d(\bot)-d(u\neq u')}.$ So we see that $$\nu(U()=u)$$ and $$\nu(U()=u')$$ cannot both be significantly larger than 2/3 if there is a short proof that $$u\neq u'$$. reply
 by Patrick LaVictoire 657 days ago | link Sam Eisenstat produced the following counterexample to a stronger version of the revived trolljecture (one using soundness rather than consistency in the escalated theories): Take a statement $$X$$ in the language of PA such that neither $$X$$ nor $$\neg X$$ has a short proof (less than length $$L$$) in PA+Sound(PA); $$X$$ can be provably true, provably false, or undecidable in that system. Define $$U$$ to equal 10 if $$X\wedge A()=a$$, 0 if $$\neg X\wedge A()=a$$, and 5 if $$A()\neq a$$. Define $$A$$ to equal $$a$$ if there is a short proof in PA (less than length $$L$$) that $$A()=a\to U()=10$$, and $$b$$ otherwise. Clearly $$A()=b$$ and $$U()=5$$. Now I claim that in the formal system PA+Sound(PA)+$$(A()=a)$$, there is a very short proof of $$U()=10$$. And this is the case whether $$X$$ is true, false, or undecidable! In order to prove $$U()=0$$, take the axiom $$A()=a$$. It follows from the definition of $$A$$ that there is a short proof in PA (less than length $$L$$) that $$A()=a\to U()=10$$. And it follows from the soundness of PA that in fact $$A()=a\to U()=10$$, and thus $$U()=10$$. On the other hand, we cannot quickly prove $$U()=0$$, for if we could, then we could prove $$\neg X$$. Things get tricky because we’re using the extra (inconsistent) axiom $$A()=a$$, but intuitively, if $$X$$ is some arithmetical assertion that has nothing to do with the decision problem, then there shouldn’t be any short proof of $$\neg X$$ in this system either, and so the shortest proof of $$U()=0$$ should be at least comparable to length $$L$$. reply
 by Alex Appel 655 days ago | link I don’t know, that line of reasoning that U()=10 seems like a pretty clear consequence of PA+Sound(PA)+A()=a, and the lack of a counterfactual for “X is false” doesn’t violate any of my intuitions. It’s just reasoning backwards from “The agent takes action a” to the mathematical state of affairs that must have produced it (there is a short proof of X). On second thought, the thing that broke the original trolljecture was reasoning backwards from “I take action a” to the mathematical state of affairs that produced it. Making inferences about the mathematical state of affairs in your counterfactuals using knowledge of your own decision procedure does seem to be a failure mode at first glance. Maybe use the counterfactual of “find-and-replace all instances of X’s source code in the universe program U with action a, and evaluate”? But that wouldn’t work for different algorithms that depend on checking the same math facts. There needs to be some way to go from “X takes action A” to “closely related algorithm Y takes action B”. But that’s just inferring mathematical statements from the combination of actions and knowing X’s decision rule. I’ll stick with the trolljecture as the current best candidate for “objective” counterfactuals, because reasoning backwards from actions and decision rules a short way into math facts seems needed to handle “logically related” algorithms, and this counterexample looks intuitively correct. reply

### NEW DISCUSSION POSTS

I have stopped working on
 by Scott Garrabrant on Cooperative Oracles: Introduction | 0 likes

 by Vadim Kosoy on Delegative Inverse Reinforcement Learning | 0 likes

So this requires the agent's
 by Tom Everitt on Delegative Inverse Reinforcement Learning | 0 likes

If the agent always delegates
 by Vadim Kosoy on Delegative Inverse Reinforcement Learning | 0 likes

 by Tom Everitt on Delegative Inverse Reinforcement Learning | 0 likes

Hi Tom! There is a
 by Vadim Kosoy on Delegative Inverse Reinforcement Learning | 0 likes

Hi Alex! I agree that the
 by Vadim Kosoy on Cooperative Oracles: Stratified Pareto Optima and ... | 0 likes

That is a good question. I
 by Tom Everitt on CIRL Wireheading | 0 likes

 by Tom Everitt on CIRL Wireheading | 0 likes

"The use of an advisor allows
 by Tom Everitt on Delegative Inverse Reinforcement Learning | 0 likes

 by Wei Dai on Current thoughts on Paul Christano's research agen... | 0 likes

Suppose that I, Paul, use a
 by Paul Christiano on Current thoughts on Paul Christano's research agen... | 0 likes

When you wrote "suppose I use
 by Wei Dai on Current thoughts on Paul Christano's research agen... | 0 likes

> but that kind of white-box
 by Paul Christiano on Current thoughts on Paul Christano's research agen... | 0 likes

>Competence can be an
 by Wei Dai on Current thoughts on Paul Christano's research agen... | 0 likes