Intelligent Agent Foundations Forum
A Counterexample to an Informal Conjecture on Proof Length and Logical Counterfactuals
post by Sam Eisenstat 736 days ago | Jim Babcock, Abram Demski, Patrick LaVictoire and Scott Garrabrant like this | 1 comment

Previous: An Informal Conjecture on Proof Length and Logical Counterfactuals

This is a simplified and more complete presentation of my previous counterexample to Scott Garrabrant’s conjecture on logical counterfactuals. I present an example of two statements \(\phi\) and \(\psi\) such that \({\rm PA}\vdash\phi\rightarrow\psi\) and \({\rm PA}\nvdash\phi\rightarrow\neg\psi\), but \(\psi\) is not “really” a counterfactual consequence of \(\phi\), in an intuitive, informal sense. I also argue, based on the proof of \(\phi\rightarrow\psi\), that we should trust our intuition that \(\psi\) is not a real counterfactual consequence, rather than believing that our intuitions are being stretched too far and misleading us.

Consider the following universe and agent.

def U():
  if A() = 1:
    if PA is consistent:
      return 10
    else:
      return 0
  else: return 5

def A():
  if PA ⊢ A() = 1 → U() = 10:
    return 1
  else: return 2

Note that this agent reasons very similarly to modal UDT. It is simpler, but that’s because it’s clear that action 2 will lead to utility 5, so if the agent cannot get utility 10 by taking action 1, there is no reason for it to continue its proof search.
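The decision structure above can be mirrored in a short Python sketch. This is only an illustration, not an implementation: the outcome of the proof search and the consistency of PA cannot be computed, so both are passed in as external booleans (the names `proof_search_succeeds` and `pa_consistent` are mine, not from the original code).

```python
def agent(proof_search_succeeds):
    # Mirror of A(): take action 1 iff PA proves "A() = 1 -> U() = 10".
    return 1 if proof_search_succeeds else 2

def universe(action, pa_consistent):
    # Mirror of U().
    if action == 1:
        return 10 if pa_consistent else 0
    return 5

# Real world: PA is consistent and the proof search fails,
# so the agent takes action 2 and receives utility 5.
assert universe(agent(False), pa_consistent=True) == 5

# Inside a model of PA + "PA is inconsistent": the proof search
# trivially succeeds, the agent takes action 1, and receives utility 0.
assert universe(agent(True), pa_consistent=False) == 0
```

The two assertions correspond to the two scenarios analyzed in the argument below: the sound world where the agent takes action 2, and the nonstandard model where it takes action 1 and gets utility 0.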

Consider the statements \(\phi \equiv ({\tt A()} = 1)\) and \(\psi \equiv ({\tt U()} = 0)\). We first show \({\rm PA}\vdash\phi\rightarrow\psi\). Work in \(\rm PA\) and suppose for contradiction that \(\phi\wedge\neg\psi\), i.e. \({\tt A()} = 1 \wedge {\tt U()} \ne 0\). If \(\rm PA\) were inconsistent, the agent would get 0 utility, so it must be the case that \(\rm PA\) is consistent. Also, since \({\tt A()} = 1\), we know by looking at the agent’s code that the proof search succeeded, so we have \(\square \ulcorner{\tt A()} = 1 \rightarrow {\tt U()} = 10\urcorner\). By provable \(\Sigma_1\)-completeness, \(\rm PA\) knows that it can prove this, so it knows that the agent takes action 1, i.e. we have \(\square\ulcorner{\tt A()}=1\urcorner\). Combining the two boxed statements gives \(\square\ulcorner{\tt U()}=10\urcorner\), and hence \(\square\ulcorner{\tt U()}\ne0\urcorner\). Now, if we also had \(\square\ulcorner{\tt A()}=1\rightarrow{\tt U()}=0\urcorner\), then \(\square\ulcorner{\tt A()}=1\urcorner\) would yield \(\square\ulcorner{\tt U()}=0\urcorner\), contradicting consistency; so \(\neg\square\ulcorner{\tt A()}=1\rightarrow{\tt U()}=0\urcorner\). This is logically equivalent to \(\lozenge\ulcorner{\tt A()} = 1 \wedge {\tt U()} \ne 0\urcorner\). Thus, stepping back to the metalanguage, we see that the theory \({\rm PA} + ({\tt A()} = 1 \wedge {\tt U()} \ne 0)\) asserts its own consistency, so it is inconsistent. (Gödel’s second incompleteness theorem is used here, but the same argument can be carried out using Löb’s theorem.)

Since \({\rm PA}+\left({\tt A()}=1\wedge{\tt U()}\ne0\right)\) is inconsistent, \(\rm PA\) proves \(\neg\left({\tt A()}=1\wedge{\tt U()}\ne0\right)\), i.e. \({\tt A()}=1\rightarrow{\tt U()}=0\). (By soundness of \(\rm PA\), we can see at this point that the agent takes action 2, but this is unnecessary for the present argument.) It remains only to show that \({\rm PA}\nvdash{\tt A()}=1\to{\tt U()}\ne0\). \(\rm PA\) proves that if \(\rm PA\) is inconsistent, then the agent takes action 1, since its proof search trivially succeeds, and receives utility 0. By Gödel’s second incompleteness theorem, \({\rm PA}+\neg{\rm Con}({\rm PA})\) is consistent, so it has a model, and in that model \({\tt A()}=1\) and \({\tt U()}=0\), establishing the result.

We now have both \({\rm PA}\vdash{\tt A()}=1\to{\tt U()}=0\) and \({\rm PA}\nvdash{\tt A()}=1\to{\tt U()}\ne0\), so Garrabrant’s conjecture claims that in the counterfactual world where the agent takes action 1, it receives utility 0. By the structure of \(\tt U\), in the \({\tt A()}=1\) case, we also have \({\rm PA}\vdash{\tt A()}=1\to\square\bot\) and \({\rm PA}\nvdash{\tt A()}=1\to\neg\square\bot\), so Garrabrant’s conjecture similarly claims that the inconsistency of \(\rm PA\) holds in this world. Both of these seem intuitively wrong; we would expect that the agent receives utility 10 in this world, and that \(\rm PA\) is still consistent. Even if we do not expect these things very strongly — for example, we may think that our beliefs about this counterfactual world are best modeled by a probability distribution that places weight on both \({\tt U()}=0\) and \({\tt U()}=10\) — it is surprising that a notion of counterfactual would be certain that the less intuitive option, \({\tt U()}=0\), holds in this counterfactual world.

We can obtain further evidence that our intuition is correct by examining the structure of the argument that \({\rm PA}\vdash{\tt A()}=1\to{\tt U()}=0\). Central to this argument is a step where we reason from the assumption that \({\tt A()}=1\) to the conclusion that this happened because the agent’s proof search succeeded, and thus that \(\square\ulcorner{\tt A()}=1\rightarrow{\tt U()}=10\urcorner\) holds. This reasoning is causally backwards; we regard the proof search as the cause of the agent’s action and we reason backward from the effect to the cause. This is valid logically, but it is not what we mean by the counterfactual world where \({\tt A()}=1\). We can draw an analogy to graph surgery on Bayesian networks. There, in order to perform causal reasoning, we sever the links connecting a node to its causal parents, and only allow the counterfactual to change our probability distribution through its causal children. This is a different kind of reasoning, and it is one that this example shows we do not yet have a good analogue of in a logical context.
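The difference between the two kinds of reasoning can be made concrete with a toy Bayesian-network calculation. In the sketch below (the prior probability and the function names are illustrative, not from the post), the proof search is the causal parent of the action: conditioning on the action reasons backwards to the search, while graph surgery severs that link and leaves our beliefs about the search untouched.

```python
# Toy causal chain: proof search (cause) -> action (effect).
# Prior belief: the proof search almost certainly fails.
P_SEARCH_SUCCEEDS = 0.01

def action(search_succeeds):
    # Deterministic mechanism, as in the agent's code.
    return 1 if search_succeeds else 2

def search_given_observed_action(observed_action):
    # Evidential ("causally backwards") reasoning: P(search | action).
    prior = {True: P_SEARCH_SUCCEEDS, False: 1 - P_SEARCH_SUCCEEDS}
    weights = {s: p for s, p in prior.items() if action(s) == observed_action}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

def search_given_do_action(intervened_action):
    # Graph surgery: sever the search -> action link, so the
    # intervention carries no information about the search outcome.
    return {True: P_SEARCH_SUCCEEDS, False: 1 - P_SEARCH_SUCCEEDS}

# Observing A() = 1 forces "the search succeeded" with certainty --
# exactly the backwards step used in the proof above.
print(search_given_observed_action(1))   # {True: 1.0}

# Intervening do(A() := 1) leaves the prior on the search unchanged.
print(search_given_do_action(1))         # {True: 0.01, False: 0.99}
```

The first computation mirrors the provable implication \({\tt A()}=1\rightarrow{\tt U()}=0\); the second is the kind of counterfactual we intuitively want, and the kind this example shows we lack in a logical setting.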



by Sam Eisenstat 736 days ago

We can also construct an example where \({\rm PA} \vdash \phi \rightarrow \psi\) with a short proof and \(\rm PA\) also proves \(\phi \rightarrow \neg\psi\), but any such proof is much longer. We only need to put a bound on the proof length in \({\tt A}\)’s proof search. Then, the argument that \({\rm PA} + ({\tt A()} = 1 \wedge {\tt U()} \ne 0)\) asserts its own consistency still works, and is rather short: \(O(\log n)\) as the proof length bound \(n\) increases. However, there cannot be a proof of \({\tt A()} = 1 \rightarrow {\tt U()} = 10\) within \({\tt A}\)’s proof length bound: if the agent found one, it would immediately take action 1, and then \(\rm PA\) would prove both \({\tt U()} = 10\) and, by the short argument, \({\tt U()} = 0\), contradicting its consistency. In this case \(\rm PA\) can still prove \({\tt A()} = 1 \rightarrow {\tt U()} = 10\) simply by running the agent, which takes action 2, making the implication vacuous; but this argument shows that any such proof must be long, since it requires checking that the bounded proof search fails.
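The \(O(\log n)\) figure comes from the fact that the bound \(n\) enters the argument only as a written-out numeral, and (assuming efficient, binary-style numerals) a numeral for \(n\) has about \(\log_2 n\) digits. A quick check of that arithmetic, not a model of the proof system itself:

```python
# The binary numeral for n has n.bit_length() digits, which grows
# logarithmically in n -- the source of the O(log n) bound above.
for n in (10, 10**3, 10**6, 10**9):
    print(n, n.bit_length())
# 10 4
# 1000 10
# 1000000 20
# 1000000000 30
```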
