post by Abram Demski 1201 days ago | Patrick LaVictoire and Scott Garrabrant like this | 3 comments

Vadim pointed out in a comment to my post on logical counterfactuals that a very similar idea had been explained in a LessWrong post summarizing work by Vladimir Slepnev and Paul Christiano at a MIRI workshop in 2013. The algorithm which they suggested was called UDT 1.5. In fact, the essential idea is already argued by Vladimir Slepnev (cousin_it) in a post from 2012: Should logical probabilities be updateless too?.

Here, I continue to develop these ideas and those in my logical dutch-book post. I present an alternate prior to the one which was used for UDT 1.5. I show that this new prior has naturalistic logical updates, a kind of improvement-endorsing property which seems likely to work well with UDT. This property also gets around Paul Christiano’s paradox of ignorance.

## Motivation

Slepnev motivates the problem with a variant of counterfactual mugging, inspired by a post by Wei Dai but simplified:

Suppose Omega (the near-omniscient near-omnipotent and almost perfectly trustworthy agent who prefers the pronouns ve/ver) appears before you. Omega tells you that ve has computed the millionth binary digit of $$\pi$$, and that before knowing what the digit was, ve had committed to the following procedure: if the millionth digit is zero, ask you for $100. If the millionth digit is one, give you$10,000 only if you would have given Omega the $100 had the digit been zero; otherwise, give you nothing. Omega then tells you that the millionth digit was in fact zero, and asks you for$100.

Suppose you already use a variant of UDT, so you take the actions you would have approved of ahead of time. If you are a “smart” agent who already knew the millionth digit of $$\pi$$ and only calculated expected utility of universes consistent with this knowledge, you’d turn Omega down. If you’re a “dumb” agent who doesn’t, you’d give Omega the $100. (This is true even though UDT uses a kind of logical counterfactual; a UDT agent takes a counterfactual on its possible actions, not anything else, so it still doesn’t consider the worlds where digits of $$\pi$$ are different.) This means that “dumb” agents don’t want to become “smart” agents. A UDT-like agent which improves its beliefs over time by proving more theorems will be temporally inconsistent. This calls into question approaches to logical uncertainty which rely on estimates which improve given more time, such as uniform coherence. My intuition is that we want to bite the bullet and be updateless about logical uncertainty as well as empirical uncertainty, because this maximizes expected value from a logically uncertain perspective. This makes it seem like the only reflectively consistent possibility. Scott Garrabrant gave me a different argument for the same conclusion. My recollection of the argument is as follows: Problems like this are a result of someone simulating you. You can’t be sure of whether you’re in the simulation, being fooled into thinking the millionth digit of $$\pi$$ is zero, or whether you’re in the real world. You can’t trust your own logic, since it could be spoofed. If you’re in the real world, you want to not give up the$100. If you’re in the simulation, then you’d like to give it up so that the real you gets \$10,000. However, as in the absentminded driver problem, the sleeping beauty problem, and other such problems, you shouldn’t solve the problem by assigning probabilities to the two possibilities; instead, you want to act according to a policy you’d endorse ahead of time. Hence, an updateless solution.

(An interesting question: to what extent do the details of Omega’s counterfactual computation determine whether we buy the whole argument?)

It seems we need a logically ignorant prior which is used to judge policies. This is a prior over impossible possible worlds, so it is bound to include some nonsense, but it can’t be as simple as assigning probability 1/2 to everything; there needs to be enough structure to get “reasonable” results. Slepnev and Christiano offered such a prior in the workshop results I mentioned, but I don’t see that it gets me the properties I want. I will construct one according to the properties I see as desirable, along the lines I started to develop in my logical dutch book post.

A logically updateless UDT should reduce to a logically updateful DT in circumstances which do not involve entanglements with impossible possible worlds. I won’t get that property here, but I will go for something suggestive: an approximation scheme for a logical prior which “endorses” improved approximations as being the appropriate update.

An AI which improves its logical consistency over time can be modeled by a sequence of approximate probability distributions $$P_1(\phi), P_2(\phi), P_3(\phi) ...$$ which are supposed to get better and better. For this post, the probability distribution will be over sentences of $$PA$$. I make the assumption that the change between any $$P_n$$ and $$P_{n+1}$$ in this sequence is due purely to information coming from a theorem-prover; at time $$n+1$$, we’ve proved more theorems, and taken them into account as logical constraints on beliefs. The type of constraints I will use will be partitions of truth: a set of sentences of which exactly one can be true. Coherence relative to a single constraint is defined by probabilities summing to one:

Relative Coherence: Where $$\sigma$$ is a finite set of sentences, $$P(\phi)$$ is coherent relative to $$\sigma$$ if and only if $$\Sigma_{\phi \in \sigma} P(\phi)=1$$.

If a function is coherent relative to all provable partitions of truth, it is coherent in the standard sense. So, we can suppose that this is the only type of information which the theorem-prover gives us. (This is a version of coherence which I got from Scott.)

As I found when working on the logical Dutch book post, the behavior of conditional probabilities gets tricky when we only have relative coherence; we don’t necessarily have basic properties of conjunction, so we can have things like $$P(\phi | \phi) \neq 1$$ and probabilities which depend on the order in which we condition on evidence. This is a problem for getting nice properties out of such probability functions. To solve this, I will require propositional coherence: the distribution will be coherent relative to every constraint which is provable by propositional logic alone. (Although this introduces an infinite number of constraints, $$P_n(\phi)$$ can still be computable, as I will show later.) This is probably overkill, but is convenient.

Let $$\ulcorner \sigma \urcorner$$ be the sentence asserting that $$\sigma$$ is a partition of truth. (For example, if $$\sigma = \{ \phi \}$$, then $$\ulcorner \sigma \urcorner = \phi$$; on the other hand, if $$\sigma = \{ \phi, \psi, \theta \}$$, then $$\ulcorner \sigma \urcorner = (\phi \vee \psi \vee \theta) \wedge \neg (\phi \wedge \psi) \wedge \neg (\phi \wedge \theta) \wedge \neg (\psi \wedge \theta)$$.) We define a naturalistic update as follows:

Naturalistic Logical Update: $$P(\phi)$$ has a naturalistic logical update if and only if:

• Where $$\sigma$$ is a partition of truth, $$P(\phi | \Box\ulcorner \sigma \urcorner)$$ is coherent relative to $$\sigma$$.

This is the successor to my notion of “information-adding coherence” in the logical dutch book post. Note that this definition only requires coherence relative to real logical constraints. If $$\sigma$$ is not a genuine partition of truth, $$P(\phi | \Box_n\ulcorner \sigma \urcorner)$$ may or may not be coherent relative to $$\sigma$$. This is because coherence conditioning on arbitrary constraints would be impossible: we could condition on an unsatisfiable set of constraints.

We could then define $$P_{n}(\phi)$$ by updating sequentially on theorems as we prove them. This won’t be quite what’s wanted, but it will give enough intuition for now. I’ll use $$C_n$$ to denote the set of constraints $$P_{n}(\phi)$$ has updated on.

For an updateless decision theory which uses $$P_0(\phi)$$ as a logically ignorant prior, this means the segment of expected utility of a policy coming from just those worlds in which constraints $$C_n$$ are provable should be exactly the expected utility under $$P_n(\phi)$$. If this possible world is sufficiently “isolated” from everything else (negligible probability of being simulated), then the preferred policy for this case should be very close to the policy for a logically updateful agent using $$P_n(\phi)$$. This suggests there’s some hope for a well-behaved UDT based on $$P_0(\phi)$$.

This is “naturalistic” in the sense that it doesn’t care where the proof comes from. The only difference between the internal theorem-prover and an external one is that the agent has perfect knowledge of it. If we instead update on observations which make it very probable that a external theorem-prover has proved something, the resulting distribution would be (almost) coherent with respect to those constraints. Note that this solves Paul Christiano’s paradox of ignorance: the counterintiutive effect was precisely that updating on information from an external theorem-prover was different from updating on an internal one. (Christiano 2014, section 6.1.5.)

This level of self-trust might appear concerning for Lobian reasons. We can apply the naturalistic update property to prove $$P(\phi | \Box \phi) = 1$$; since $$P(\phi | \Box \phi) = P(\phi \wedge \Box \phi) / P(\Box \phi)$$, we have $$P(\phi \wedge \Box \phi) = P(\Box \phi)$$; then $$P(\Box \phi \wedge \neg \phi)=0$$; so we have $$P(\Box \phi \rightarrow \phi)=1$$. However, recall that the naturalistic update property only applied to genuine partitions of truth; this line of argument only applies if $$\phi$$ is a theorem in the first place. Furthermore, even if we construct a prior for which $$P(\Box \phi \rightarrow \phi)=1$$ for a much wider class of $$\phi$$ than only theorems, this would still be a far cry from having $$\Box (\Box \phi \rightarrow \phi)$$ in the proof system itself.

It turns out such a distribution can be constructed by surprisingly simple means.

Theorem. There exists a computable probability distribution $$P^*(\phi)$$ which is coherent with respect to propositional constraints, and which has a naturalistic logical update.

Proof. Sample a random infinite theory $$T^*$$ as follows. Let $$\mu(\phi)$$ be a positive probability distribution over sentences. Sampling from $$\mu(\phi)$$ repeatedly, accept those samples into $$T^*$$ which are consistent with everything accepted so far under the provability logic $$GLS$$. (When I say a set of sentences is consistent under $$GLS$$, I mean the following: Take the conjunction of all the sentences. Translate this sentence to modal logic, hiding all details which aren’t either propositional logic or the provability predicate. If the negation of the result is a theorem of $$GLS$$, the set of sentences is said to be inconsistent; otherwise, it is consistent.)

Define $$P^*(\psi)$$ as the probability that $$\psi$$ will occur in such a $$T^*$$. This is computable, as follows. Generate the tree of possible random theories, ending in a leaf whenever $$\psi$$ or $$\neg \psi$$ is sampled. The probability that one of the two will eventually be added to $$T^*$$ is 1, because $$GLS$$ extends propositional logic and so has the property that if both $$\alpha \wedge \psi$$ and $$\alpha \wedge \neg \psi$$ are inconsistent, $$\alpha$$ is already inconsistent. Therefore, for any desired accuracy $$\varepsilon > 0$$, we can extend the tree far enough to compute $$P^*(\psi)$$ to within $$\varepsilon$$.

Suppose that $$\sigma$$ is a partition of truth which is provable by propositional means alone. Since $$GLS$$ extends propositional logic, any $$T*$$ sampled would contain at most one $$\phi \in \sigma$$ to avoid inconsistency. Furthermore, since any $$T^*$$ eventually contains either $$\phi$$ or $$\neg \phi$$ for each $$\phi$$, it will also contain at least one: otherwise, it would contain the negation of all sentences in $$\sigma$$, which would also be an inconsistency. Therefore, the probability of any propositional partition of truth sums to one, as required.

$$P^*(\phi)$$ also has a naturalistic update. $$GLS$$ has $$\Box \phi \rightarrow \phi$$ for all $$\phi$$. Therefore, $$P^*(\phi | \Box \ulcorner \sigma \urcorner)$$ is coherent relative to constraint $$\sigma$$. $$\Box$$

It might seem strange that I’ve used $$GLS$$ rather than $$GL$$. $$GLS$$ is the “truth provability logic”, which gives every true modal statement about provability in $$PA$$, rather those provable in $$PA$$. (Boolos, 1995.) $$GL$$ represents the knowledge about provability possessed by $$PA$$, whereas $$GLS$$ represents knowledge beyond $$PA$$. For example, $$GLS$$ proves $$\neg \Box \bot$$. It might seem odd to impose this level of knowledge about provability on the probability distribution. However, coherence relative to $$GL$$ would be odd in other ways. $$GL$$ obeys “necessitation”: $$\phi \rightarrow \Box \phi$$. This means that anything true must be provable. This seems wrong for our logical prior: we’re sampling random sentences to create a probability distribution in part because we know not everything is provable or disprovable, but want to have beliefs about those things anyway. Consistency with respect to $$GL$$ would make us wrongly believe that $$P(\Box \phi) + P(\Box \neg \phi) \geq 1$$. (Everything is provable or refutable or both.)

Of course, the main motivation here is that $$GL$$ wouldn’t get us a naturalistic logical update: we want to be consistent with constraint $$\sigma$$ by conditioning on the provability of $$\sigma$$, $$\Box \ulcorner \sigma \urcorner$$. In $$GL$$, $$\Box \psi$$ is consistent with $$\neg \psi$$, so conditioning on new provability facts coming from a theorem prover wouldn’t get us much at the object level. We would have to condition on $$\ulcorner \sigma \urcorner$$ directly, making the role of the theorem prover non-naturalistic. This would (I suspect) be bad news if we then try to apply it to an updateless decision theory.

## Updateless Decision Theory

Given that the motivation for naturalistic updates was an updateless one, it wouldn’t do to finish the post without at least sketching a UDT based on $$P^*$$.

As in UDT 1.5, we have an agent $$A()$$ search for a program with high expected value according to $$P^*(\phi)$$ and according to a utility function $$U()$$. ($$U()$$ must be encoded in $$PA$$, so I’ll make the assumption that it has natural-numbered values.) Define this as:

$argmax_{p \in programs} \Sigma_x x \cdot P^*(U()=x | A()=p)$

The set of programs may be finite, for example when operating on a finite machine. In this case, the operation is computable (if infeasible).

## Limitations

The above construction is not uniformly coherent. I haven’t even shown that it converges to something coherent as we update!

It’s not totally clear what happens as we update on constraints provable in $$PA$$. One concern is that it sounds dangerous to mix $$GLS$$ with $$PA$$ – one might get a system which obeys Lob’s theorem but also contains $$\Box \phi \rightarrow \phi$$. Things should be fine here, because we’re effectively just adding arithmetically true statements to $$PA$$. Although this creates a stronger logic which endorses things like $$\neg \Box \bot$$, the $$\Box$$ still refers to provability in $$PA$$ rather than provability in the new system.

One thing which I considered in the logical dutch book post which I didn’t manage to achieve here is updating on failed proof attempts. Trying to prove something for a long time without success should make us find it less likely. For this purpose, I introduce the notation $$\Box_n \phi$$, which says that there is a proof of $$\phi$$ of length at most $$n$$. We want updating on this to obey conservation of expected evidence:

Conservation of Expected Proof:

• $$P(\phi) = P(\phi | \Box_n \psi) P(\Box_n \psi) + P(\phi | \neg \Box_n \psi) P(\neg \Box_n \psi)$$

Note, however, that this already follows from propositional coherence alone. To make this meaningful, we also want:

Bounded Naturalistic Update

• Where $$\sigma$$ is a partition of truth, $$P(\phi | \Box_n\ulcorner \sigma \urcorner)$$ is coherent relative to $$\sigma$$.

This could likely be achieved by enforcing $$\Box_n \phi \rightarrow \Box \phi$$ for all $$\phi$$, in addition to everything already being enforced in $$P^*(\phi)$$. Showing that the result is still well-defined and computable requires more than just the decidability of $$GLS$$, which is why I didn’t do this here.

We could then define $$P_n(\phi)$$ by updating sequentially on all our proof attempts, successful or unsuccessful. (Differentiating proof attempts by a length-bound only is very course-grained, but we could make this finer-grained as desired.)

Works Cited:

Paul Christiano. 2014. Non-Omniscience, Probabilistic Inference, and Metamathematics. Technical report 2014–3. MIRI.

Boolos, George. The logic of provability. Cambridge university press, 1995.

 by Vladimir Slepnev 1179 days ago | Abram Demski likes this | link Nice work! I’m not too attached to the particular prior described in the “UDT 1.5” post, and I’m very happy that you’re trying to design something better to solve the same problem. That said, it’s a bit suspicious that you’re conditioning on provability rather than truth, which leads you to use GLS etc. Maybe I just don’t understand the “non-naturalistic” objection? Also I don’t understand why you have to use both propositional coherence and partitions of truth. Can’t you replace partitions of truth with arbitrary propositional combinations of sentences? reply
 by Abram Demski 1141 days ago | link The main reason to think it’s important to be “naturalistic” about the updates is reflective consistency. When there is negligible interaction between possible worlds, an updateless agent should behave like an updateful one. The updateless version of the agent would not endorse updating on $$\phi$$ rather than $$\Box \phi$$! Since it does not trust what its future self proves, it would self-modify to remove such an update if it were hard-wired. So I think updating on provability rather than truth is really the right thing. reply

### NEW DISCUSSION POSTS

[Note: This comment is three
 by Ryan Carey on A brief note on factoring out certain variables | 0 likes

There should be a chat icon
 by Alex Mennen on Meta: IAFF vs LessWrong | 0 likes

Apparently "You must be
 by Jessica Taylor on Meta: IAFF vs LessWrong | 1 like

There is a replacement for
 by Alex Mennen on Meta: IAFF vs LessWrong | 1 like

Regarding the physical
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think that we should expect
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think I understand your
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

This seems like a hack. The
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

After thinking some more,
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yes, I think that we're
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

My intuition is that it must
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

To first approximation, a
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Actually, I *am* including
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yeah, when I went back and
 by Alex Appel on Optimal and Causal Counterfactual Worlds | 0 likes

> Well, we could give up on
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes