Intelligent Agent Foundations Forumsign up / log in
Using modal fixed points to formalize logical causality
post by Vladimir Slepnev 1136 days ago | Abram Demski, Benja Fallenstein and Patrick LaVictoire like this | 3 comments

This post is a simplified introduction to existing ideas by Eliezer Yudkowsky, Wei Dai, Vladimir Nesov and myself. For those who already understand them, this post probably won’t contain anything new. As always, I take no personal credit for the ideas, only for the specific mathematical model.

People usually think about decision-making in terms of causality, where an agent’s action causes an outcome to happen. In this post I will outline a different idea of “causality”, which can work even if regular causality isn’t available. For example, in the world of natural numbers it doesn’t make sense to say that the sentence \(1+1=2\) somehow “causes” the sentence \(2+2=4\) to be true. Yet we can devise a notion of “logical causality” that will work in such worlds, and allow us to make decisions which maximize utility in some sense. The rest of this post is devoted to making these claims precise.

We start off with a simple formal model. Recall that provability logic is a modal logic where the operator \(\Box\) is interpreted as provability in Peano arithmetic. Let’s define a propositional formula with two variables \(F(a,o)=\Box(a→o)\). Since all occurrences of propositional variables are inside \(\Box\), we can build fixed points \(A\) such that \(A↔F(A,o)\) for different values of \(o\), and see what happens.

  1. If \(o=\top\) (the true sentence), then \(A=\top\), which is easy to verify directly.

  2. If \(o=\bot\) (the false sentence), then \(A=\Box\bot\), which can be verified by Löb’s theorem.

Now let’s go a little deeper, and let \(o\) itself be defined in terms of \(a\), using the fixed point to tie everything together:

  1. In the simplest case where \(o=a\), we see that \(A=\top\).

  2. If \(o=\neg a\) (the negation of \(a\)), we see that \(A=\Box\bot\). You can verify that by hand, using the methods from the previous points.

By now an interesting pattern is starting to emerge. Let’s say \(O\) is the formula that results from the variable \(o\) after applying the fixed point. In all cases except (2) where \(O\) is false by fiat, we see that \(A\) tries and succeeds to make \(O\) true! Metaphorically, here’s what \(A\) is doing: “If I can prove that my truth implies the truth of \(O\), then I choose to be true, otherwise I choose to be false”. That’s the basic idea of “logical causality”, in the simplest possible setting where it works.

Note that all sentences above can be interpreted as sentences about natural numbers within Peano arithmetic, by using Gödel numbering to spell out the provability operator. For example, in point (4) \(A\) would be a long sentence about the natural numbers, and \(O\) would be a slightly longer sentence that has \(A\) embedded inside it. In decision-making terms, \(A\) is the “action” and \(O\) is the “outcome” that logically depends on the action, but \(O\)’s dependence on \(A\) is not explicit, because both are closed formulas without free variables. Instead, the dependence is due to \(A\) knowing the Gödel number of \(O\).

To skip ahead a bit, it’s easy to go from a formalism about natural numbers to a formalism about computer programs, which know their own source code by quining. The examples above can be directly translated to programs that have access to a provability oracle for Peano arithmetic, or (with some caveats) to programs that successively search for proofs and check them manually. In fact, that was the original motivation for this line of research, because “programs trying to influence a larger program that they are embedded in” might be a good description of our own world, at least from the perspective of a program :-)

Going back to the math, perhaps the approach will break down if we have more possible choices than just true or false? Let’s assume a more general setting, where we have formulas \(F_1(a_1,\dotsc,a_m,o_1,\dotsc,o_n),\dotsc,F_m(a_1,\dotsc,a_m,o_1,\dotsc,o_n)\). We will denote truth assignments to \(a_1,\dotsc,a_m\) as \(\vec{a}\), and truth assignments to \(o_1,\dotsc,o_n\) as \(\vec{o}\). We will have a preference ordering on all possible \(\vec{o}\), and will be interested in fixed points \(A=\{A_1,\dotsc,A_m\}\) such that \(A_i↔F_i(A_1,\dotsc,A_m,o_1,\dotsc,o_n)\) for all \(i\). The formulas \(F_1,\dotsc,F_m\) will encode the execution of this algorithm:

  1. There are finitely many sentences of the form “if such-and such \(\vec{a}\) holds, then such-and-such \(\vec{o}\) holds”. Find all such sentences that are provable. If no such sentences were found, choose any \(\vec{a}\), e.g. all false.

  2. From all pairs \((\vec{a},\vec{o})\) found in the previous step, choose the \(\vec{a}\) whose \(\vec{o}\) is highest in our preference ordering. If there are multiple such \(\vec{a}\), choose any one, e.g. the lexicographically smallest.

  3. For each \(i\), define \(F_i\) to be true iff the chosen \(\vec{a}\) assigns true to \(a_i\).

To illustrate the definition of \(F_i\) in more detail, let’s work through the case where \(m=n=1\) and our preference ordering wants \(o_1\) to be true. On step 1 we have four possible sentences in lexicographic order: \(a_1→o_1\), \(a_1→\neg o_1\), \(\neg a_1→o_1\), and \(\neg a_1→\neg o_1\). Then \(F_1\) is true iff the chosen \(\vec{a}\) assigns true to \(a_1\), which can only happen on step 2. The corresponding \(\vec{o}\) can either assign true to \(o_1\), which happens iff the first sentence is provable, or assign false to \(o_1\), which happens iff the second sentence is provable but the first and third aren’t. Simplifying, we obtain the definition for \(F_1(a_1,o_1)=\Box(a_1→o_1)\vee(\Box(a_1→\neg o_1)\wedge\neg\Box(\neg a_1→o_1))\). By now it should be clear how to use the same algorithm for \(m,n>1\).

Which choices of \(o_1,\dotsc,o_n\) are amenable to this approach? Intuitively, it seems that “fair” deterministic problems are those where every choice of “actions” \(\vec{a}\) logically implies at least one “outcome” \(\vec{o}\), and these implications are apparent to the agent (i.e. provable). But that’s exactly the class of problems where our approach obviously gives the right answer! So it seems that having multiple possible choices doesn’t cause problems.

For example, let’s take \(m=2,n=1,o_1=a_1\wedge\neg a_2\), and assume that the preference ordering wants \(o_1\) to be true. Then it’s easy to see that the chosen \(\vec{a}\) is either (true, false), which provably implies \(o_1\), or something else that also provably implies \(o_1\). But the latter is impossible, because choosing any other \(\vec{a}\) makes \(o_1\) false, so it can’t be provable as long as the logic is sound. (Of course the logic doesn’t prove its own soundness, but we’re reasoning from the outside now.) Therefore the chosen \(\vec{a}\) is (true, false), and \(o_1\) is true.

One counterintuitive feature of our approach is that some “actions” \(\vec{a}\) might logically imply multiple different “outcomes” \(\vec{o}\) after taking the fixed point, because if an action is in fact not taken, it logically implies anything at all. However, the approach is designed so that the existence of such “spurious” logical implications can never lead to a suboptimal outcome. The proof of that is left as an easy exercise.

by Benja Fallenstein 1136 days ago | Patrick LaVictoire likes this | link

However, the approach is designed so that these “spurious” logical implications are unprovable, so they don’t interfere with decision-making. The proof of that is left as an easy exercise.

I don’t think this is technically true as stated; it seems to be possible that the agent proves some spurious counterfactuals as long as the outcome it does in fact obtain is the best possible one. (This is of course harmless!) Say the agent has two possible actions, \(\bar a^{(5)}\) and \(\bar a^{(10)}\), leading to outcomes \(\bar o^{(5)}\) and \(\bar o^{(10)}\), respectively. The latter is preferred, and these are the only two outcomes. Suppose that \(\bar a^{(10)}\) happens to be lexicographically lower than \(\bar a^{(5)}\) in the agent’s reasoning. Then it seems to be provable that the agent will in fact choose \(\bar a^{(10)}\), meaning that it’s provable that it won’t choose \(\bar a^{(5)}\), meaning that it finds both \((\bar a^{(5)},\bar o^{(5)})\) and the spurious \((\bar a^{(5)},\bar o^{(10)})\) in the first step.

So I think the correct statement is a disjunction: The agent obtains the highest possible outcome or it finds no spurious counterfactuals.


by Patrick LaVictoire 1131 days ago | link

Ooh, nice: we don’t need to eliminate all spurious counterfactuals, only the malignant ones!


by Benja Fallenstein 1136 days ago | link

I know this is supposed to be just introductory, but I actually think that the complete reformulation of UDT-with-a-halting-oracle in terms of modal logic is really interesting! For starters, it allows us to compare UDT and modal agents in the same framework (with the right \(o\)’s, we can see this version of UDT as a modal agent). It would also be really neat if we could write an “interpreter” that allows us to write UDT as a program calling a halting oracle, and then evaluate what it does by way of modal logic.

But also, it allows us to give a nice definition of “decision theory” and “decision problem” in the context with halting oracles. I was planning on posting soon about the definition that I showed you when you were visiting, which is designed for actually computable agents with bounded proof lengths, and is more complicated because of that. As a stepping stone to that, using the provability logic framework, I think we can define:

  • a decision theory to be a set of fully modalized formulas \(F_i(a_1,\dotsc,a_m,o_1,\dotsc,o_n)\), for \(i=1,\dotsc,m\) (fully modalized meaning that the propositional arguments only appear inside boxes);
  • and a decision problem to be a set of formulas \(G_j(a_1,\dotsc,a_m)\), for \(j=1,\dotsc,n\), which do not need to be modalized (which isn’t a problem because they’re not self-referential).

The condition that \(F_i\) must be fully modalized, but \(G_j\) doesn’t need to be, seems to be the natural thing corresponding to how in the bounded case, we allow the universe to run the agent, but the agent must use abstract reasoning about the universe, it can’t just run it.

Given such a set, we can define \(\tilde F_i(a_1,\dotsc,a_m)\), for \(i=1,\dotsc,m\), as follows: \[\tilde F_i(a_1,\dotsc,a_m) \,:=\, F_i(a_1,\dotsc,a_m,G_1(a_1,\dotsc,a_m),\dotsc,G_n(a_1,\dotsc,a_m)).\] Then the actual action taken by decision theory \(F\) when run on the decision problem \(G\) is the modal fixed point of the equations \(A_i\leftrightarrow\tilde F_i(A_1,\dotsc,A_m)\), for \(i=1,\dotsc,m\).






This is exactly the sort of
by Stuart Armstrong on Being legible to other agents by committing to usi... | 0 likes

When considering an embedder
by Jack Gallagher on Where does ADT Go Wrong? | 0 likes

The differences between this
by Abram Demski on Policy Selection Solves Most Problems | 0 likes

Looking "at the very
by Abram Demski on Policy Selection Solves Most Problems | 0 likes

Without reading closely, this
by Paul Christiano on Policy Selection Solves Most Problems | 1 like

>policy selection converges
by Stuart Armstrong on Policy Selection Solves Most Problems | 0 likes

Indeed there is some kind of
by Vadim Kosoy on Catastrophe Mitigation Using DRL | 0 likes

Very nice. I wonder whether
by Vadim Kosoy on Hyperreal Brouwer | 0 likes

Freezing the reward seems
by Vadim Kosoy on Resolving human inconsistency in a simple model | 0 likes

Unfortunately, it's not just
by Vadim Kosoy on Catastrophe Mitigation Using DRL | 0 likes

>We can solve the problem in
by Wei Dai on The Happy Dance Problem | 1 like

Maybe it's just my browser,
by Gordon Worley III on Catastrophe Mitigation Using DRL | 2 likes

At present, I think the main
by Abram Demski on Looking for Recommendations RE UDT vs. bounded com... | 0 likes

In the first round I'm
by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

Fine with it being shared
by Paul Christiano on Funding opportunity for AI alignment research | 0 likes


Privacy & Terms