Intelligent Agent Foundations Forumsign up / log in
Proof Length and Logical Counterfactuals Revisited
post by Patrick LaVictoire 662 days ago | Sam Eisenstat, Jessica Taylor and Scott Garrabrant like this | 3 comments

Update: This version of the Trolljecture fails too; see the counterexample due to Sam.

In An Informal Conjecture on Proof Length and Logical Counterfactuals, Scott discussed a “trolljecture” from a MIRI workshop, which attempted to justify (some) logical counterfactuals based on the lengths of proofs of various implications. Then Sam produced a counterexample, and Benja pointed to another counterexample.

But at the most recent MIRI workshop, I talked with Jonathan Lee and Holger Dell about a related way of evaluating logical counterfactuals, and we came away with a revived trolljecture!


Let’s say we have a recursively enumerable set of axioms \(S\) (in the language of PA) and a statement \(\varphi\). Then we define the “distance” \(d_S(\varphi)\) as the length of the shortest proof of \(\varphi\) using the axioms \(S\). (We set \(d_S(\varphi)=\infty\) if there is no such proof.)

And now we define a valuation \[\nu_S(\varphi)=\frac{d_S(\neg\varphi)}{d_S(\neg\varphi)+d_S(\varphi)}\] which we will interpret as the “probability” of the logical counterfactual that \(S\) implies \(\varphi\). (This is undefined when \(\varphi\) is undecidable given \(S\); we could define it as a limit when searching over proofs of bounded length, in which case it would equal 0.5 in such cases. But as we shall see, this is not the case we care about.)

This valuation has the trivial properties that \(\nu_S(\varphi)=1\) if \(S\vdash\varphi\) and \(S\not\vdash\neg\varphi\), and similarly \(\nu_S(\varphi)=0\) if \(S\not\vdash\varphi\) and \(S\vdash\neg\varphi\). But what’s more interesting are the properties that this valuation has if \(S\) is inconsistent.

If there is a short proof of \(\varphi\) from \(S\), and a much longer proof of \(\neg\varphi\) from \(S\), then \(\nu_S(\varphi)\approx 1\). Moreover, we have an approximate coherence condition \[\nu_S(\varphi\wedge\psi) + \nu_S(\varphi\wedge\neg\psi) \approx \nu_S(\varphi)\] so long as either \(\varphi\) or \(\psi\) has a proof from \(S\) that is much shorter than the proof of \(\bot\) from \(S\). (See the notes below.)

And this gets some key examples “correct” if we add the axioms that we actually care about (and which make the problem truly counterfactual). For instance, let’s take Sam’s counterexample to the original Trolljecture:

def U():
  if A() = 1:
    if PA is consistent:
      return 10
    else:
      return 0
  else: return 5

def A():
  if PA ⊢ A() = 1 → U() = 10:
    return 1
  else: return 2

In that case, \(A()\) actually returns 2 and \(U()\) returns 5, but PA proves the “incorrect” counterfactual that \(A()=1\to U()=0\) and does not prove the “correct” counterfactual that \(A()=1\to U()=10\). (Read the linked post if this puzzles you.)

But what if we consider reasoning from the inconsistent premises that PA is consistent and yet \(A()=10\)? In that case, I claim that \(\nu_{\textsf{PA+Con(PA)}+(A()=1)}(U()=10)\) is near 1 while \(\nu_{\textsf{PA+Con(PA)}+(A()=1)}(U()=0)\) is near 0.

To see this, note that there is a short proof from these axioms that \(U()=10\) (look at the definition of \(U\), apply the axiom that \(A()=1\), apply the axiom \(\textsf{Con(PA)}\), and conclude that \(U()=10\)), but the shortest proof from those axioms that \(U()=0\) is longer: we would need to prove Godel’s Second Incompleteness Theorem to show that \(\textsf{Con(PA)}\to A()=2\), thus \(\neg \textsf{Con(PA)}\), and thus \(U()=0\).

(Of course, I could be missing a much shorter proof, but I think this particular example goes through. The same reasoning also holds for the example Benja mentioned.)

Therefore, I will go out on a limb and propose the following:

Revived Trolljecture: If we consider a modal agent \(A()\) inside a modal universe \(U()\), then for sufficiently large \(N\), the “correct” logical counterfactuals of the form “if \(A()=a\), then \(U()=u\)” are the ones such that the valuations \(\nu_{\textsf{PA+N}+(A()=a)}(U()=u)\) are high. In particular, if for \(N\) sufficiently large, there is a short proof in \(\textsf{PA+N}\) of \(A()=a\to U()=u\) and no comparably short proof of \(A()\neq a\), then \(u\) is unique with this property, and this is the “correct” counterfactual utility for \(A()=a\).

I’m not very confident that this is true as stated, but I have some reasonable hope that either a counterexample will be enlightening about logical counterfactuals, or that an improved analogue of the valuation will in fact be quite useful.

Notes on this valuation:

  • This is a claim about “third person” counterfactuals, not those done by an agent itself. Obviously, an agent that makes decisions based on this valuation will not be definable in modal logic, and I don’t yet make any claims about what such an agent might do in the PA versions of such problems.

  • The approximate coherence is of the following form: if the shortest proof of contradiction from \(S\) has length \(N\), and one of the statements \(\varphi\), \(\neg\varphi\), \(\psi\), \(\neg\psi\) has a proof from \(S\) of length \(n\ll N\), then \[\nu_S(\varphi\wedge\psi) + \nu_S(\varphi\wedge\neg\psi) = \nu_S(\varphi) + O(n/N).\] I don’t know a cleverer proof than going through the cases, but each case is straightforward. For instance, if \(d_S(\varphi)=n\), then we must have \(N-n\leq d_S(\neg\varphi) \leq N+1\), and so it follows quickly that \(\nu_S(\varphi) = 1 + O(n/N)\); then, considering the three cases \(d_S(\psi)\lesssim n\), \(d_S(\neg\psi)\lesssim n\), and neither, in each case it is clear that \(\nu_S(\varphi\wedge\psi) = \nu_S(\psi) + O(n/N)\).

  • It does not, however, seem to be the case that \(\nu_S(\varphi\wedge\psi) + \nu_S(\varphi\wedge\neg\psi) \approx \nu_S(\varphi)\) whenever there is an easily provable relation between \(\varphi\) and \(\psi\). I am not entirely sure that I’m not missing a necessary logical relation, but it seems like we could consider two sentences \(\varphi\) and \(\theta\) whose proofs are “independent” of the contradiction in \(S\) and of each other, and then let \(\psi=\varphi\wedge\theta\) so that \(\psi\to\varphi\) is quickly provable from \(S\). Say that \(d_S(\bot)=N\), \(d_S(\varphi)=N/2\), \(d_S(\theta)=N/3\); if these are independent in that way, we should be able to get \(d_S(\neg\varphi)\approx N\), \(d_S(\neg\theta)\approx N\), \(d_S(\psi)\approx d_S(\varphi) + d_S(\theta)\approx 5N/6\), and \(d_S(\neg\psi)\approx N\). Then \(\nu_S(\varphi)\approx 2/3\), \(\nu_S(\varphi\wedge\psi)\approx \nu_S(\psi) \approx 6/11\), and \(\nu_S(\varphi\wedge\neg\psi)\) will be at least 1/4 since \(d_S(\varphi\wedge\neg\psi)\approx N\) and \(d_S(\neg\varphi\vee\psi)\geq N/3\), so \(\nu_S(\varphi\wedge\psi) + \nu_S(\varphi\wedge\neg\psi) - \nu_S(\varphi) > 6/11+1/4-2/3>0\). I don’t know whether this could be used to break the revived trolljecture.

  • If the shortest proof of contradiction from \(S\) has length \(N\), then for any \(\varphi\), \(\frac{1}{N+2}\leq \nu_S(\varphi) \leq \frac{N+1}{N+2}\); the valuations are bounded away from 0 and 1. This may be a feature and not a bug: a mathematician could reason decently confidently about what math would look like in the counterfactual where Fermat’s Last Theorem were false (presuming that its actual proof is quite long), but it is hard to say very much about what math would look like in the counterfactual where \(1+1=3\). However, there are some examples, like naive set theory before Russell’s Paradox, where human mathematicians reasoned mostly coherently despite using axioms that (they did not notice) led quickly to contradictions; the valuation does not capture that aspect of mathematical intuition.

  • I restricted the new trolljecture to modal universes because it is well-specified how the stronger axiom schemas \(PA+N\) eventually decide the agent and the universe in the standard model of the natural numbers. If something like it holds up here, then there may be a more general version.



by Patrick LaVictoire 648 days ago | link

I can prove the property that for each hypothesis \(A()=a\) there is at most one \(u\) such that \(U()=u\) has a high valuation (for sufficiently high PA+N), with the following caveat: it can sometimes take many steps to prove that \(u\neq u'\) in PA+N, so we’ll need to include the length of that proof in our bound.

In what follows, we will take all subscripts of \(d\) and \(\nu\) to be \(PA+N, A()=a\) for \(N\) large.

For any \(\phi\), \(d(\bot) - d(\neg\phi)\leq d(\phi)\leq d(\bot)\) and thus \[1 - \frac{d(\phi)}{d(\bot)} \leq \nu(\phi) \leq \frac{d(\bot)}{d(\phi)+d(\bot)}.\]

Also, \(d(U()=u)+d(U()=u')+d(u\neq u')\geq d(\bot)\). This implies \(\max\{d(U()=u),d(U()=u')\}\geq \frac12(d(\bot)-d(u\neq u))\), which implies \[\min\{\nu(U()=u),\nu(U()=u')\}\leq \min\{\frac{d(\bot)}{d(U()=u)+d(\bot)},\frac{d(\bot)}{d(U()=u')+d(\bot)}\} \leq \frac{2d(\bot)}{3d(\bot)-d(u\neq u')}.\]

So we see that \(\nu(U()=u)\) and \(\nu(U()=u')\) cannot both be significantly larger than 2/3 if there is a short proof that \(u\neq u'\).

reply

by Patrick LaVictoire 608 days ago | link

Sam Eisenstat produced the following counterexample to a stronger version of the revived trolljecture (one using soundness rather than consistency in the escalated theories):

Take a statement \(X\) in the language of PA such that neither \(X\) nor \(\neg X\) has a short proof (less than length \(L\)) in PA+Sound(PA); \(X\) can be provably true, provably false, or undecidable in that system.

Define \(U\) to equal 10 if \(X\wedge A()=a\), 0 if \(\neg X\wedge A()=a\), and 5 if \(A()\neq a\).

Define \(A\) to equal \(a\) if there is a short proof in PA (less than length \(L\)) that \(A()=a\to U()=10\), and \(b\) otherwise. Clearly \(A()=b\) and \(U()=5\).

Now I claim that in the formal system PA+Sound(PA)+\((A()=a)\), there is a very short proof of \(U()=10\). And this is the case whether \(X\) is true, false, or undecidable!

In order to prove \(U()=0\), take the axiom \(A()=a\). It follows from the definition of \(A\) that there is a short proof in PA (less than length \(L\)) that \(A()=a\to U()=10\). And it follows from the soundness of PA that in fact \(A()=a\to U()=10\), and thus \(U()=10\).

On the other hand, we cannot quickly prove \(U()=0\), for if we could, then we could prove \(\neg X\). Things get tricky because we’re using the extra (inconsistent) axiom \(A()=a\), but intuitively, if \(X\) is some arithmetical assertion that has nothing to do with the decision problem, then there shouldn’t be any short proof of \(\neg X\) in this system either, and so the shortest proof of \(U()=0\) should be at least comparable to length \(L\).

reply



NEW LINKS

NEW POSTS

NEW DISCUSSION POSTS

RECENT COMMENTS

The AI defers to anything
by Paul Christiano on Corrigibility thoughts II: the robot operator | 0 likes

Thus anything that can
by Stuart Armstrong on Corrigibility thoughts II: the robot operator | 0 likes

Ah, thanks! That seems more
by Stuart Armstrong on Loebian cooperation in the tiling agents problem | 0 likes

It doesn't mean computation
by Vladimir Slepnev on Loebian cooperation in the tiling agents problem | 1 like

I'm not sure this would work,
by Stuart Armstrong on Loebian cooperation in the tiling agents problem | 0 likes

>How can the short term
by Stuart Armstrong on Humans are not agents: short vs long term | 0 likes

I expect a workable approach
by Paul Christiano on Corrigibility thoughts II: the robot operator | 0 likes

Not sure what your argument
by Stuart Armstrong on Corrigibility thoughts II: the robot operator | 0 likes

It is ‘a preference for
by Stuart Armstrong on Humans are not agents: short vs long term | 0 likes

Note that we don't need to
by Paul Christiano on ALBA requires incremental design of good long-term... | 0 likes

If I want my boat to travel
by Paul Christiano on Corrigibility thoughts II: the robot operator | 0 likes

I don't think it's much like
by Abram Demski on An Approach to Logically Updateless Decisions | 0 likes

Yeah, I like tail dependence.
by Sam Eisenstat on An Approach to Logically Updateless Decisions | 0 likes

This is basically the
by Paul Christiano on Cooperative Oracles: Stratified Pareto Optima and ... | 1 like

I think AsDT has a limited
by Abram Demski on An Approach to Logically Updateless Decisions | 2 likes

RSS

Privacy & Terms