Intelligent Agent Foundations Forumsign up / log in
"Evil" decision problems in provability logic
discussion post by Benja Fallenstein 1043 days ago | Kaya Stechly, Jessica Taylor, Nate Soares and Patrick LaVictoire like this | 4 comments

A while ago, drnickbone showed in a LessWrong post that for any decision theory, there is a “fair” decision problem on which that decision theory fails—that is, gets a low reward, even though a simple alternative decision theory gets a high reward. Although drnickbone’s argument was given in words, it’s intuitively clear that it could be formalized as math—except, what exactly do “decision theory” and “decision problem” mean in this context?

In this post, I’ll propose definitions of these notions in the context of provability logic, as used in Vladimir Slepnev’s modal version of UDT, and give a formalization of drnickbone’s result. This implies that no single modal decision theory, including UDT, can behave optimally on all “fair” decision problems. But in my next post, I’ll show that modal UDT is nevertheless optimal in a weaker sense: For every provably “fair” modal decision problem, there is a finite set of true axioms that can be added to PA, such that modal UDT using this extension of PA performs optimally on that decision problem. (This is a modal logic version of a result which Kenny Easwaran, Nate Soares and I recently showed in the context of bounded proof length, and which hasn’t been published yet.)

Prerequisite: A primer on provability logic.


Here’s an example of how we usually talk about decision problems (Vladimir Slepnev’s formalization of Newcomb’s problem from “A model of UDT with a halting oracle”):

  def U():
    # Fill boxes, according to predicted action.
    box1 = 1000
    box2 = 1000000 if (A() == 1) else 0
    # Compute reward, based on actual action.
    return box2 if (A() == 1) else (box1 + box2)

In other words, we have a universe program, \(U()\), which makes calls to an agent program, \(A()\).

The agent program, in turn, uses quining to refer to the source code of the universe program. But while the universe is allowed to call the agent to figure out how it will behave, the agent can’t just call the universe—or we’ll have an infinite regress. Instead, we generally consider agents that use abstract reasoning in the form of proofs to reason about the universe they’re in.

So a universe is a parameterless program, and a decision problem is a universe with a slot in it for an agent program. Similarly, an agent is a parameterless program, and a decision theory is an agent with a slot in it for the universe program. The universe can refer to the agent by calling it, but the agent can refer to the universe’s source (and to its own source) only inside quoted formulas.


Let’s say that a sequence of modal formulas \(\varphi_1,\dotsc,\varphi_n\) are provably mutually exclusive and exhaustive, or p.m.e.e. for short, if \(\mathrm{GL}\) proves that exactly one of them is true: For example, \(\varphi_1\), \(\varphi_2\) and \(\varphi_3\) are p.m.e.e. if \[\mathrm{GL}\,\vdash\, (\varphi_1\wedge\neg\varphi_2\wedge\neg\varphi_3) \,\vee\, (\neg\varphi_1\wedge\varphi_2\wedge\neg\varphi_3) \,\vee\, (\neg\varphi_1\wedge\neg\varphi_2\wedge\varphi_3).\]

We’ll say that a (\(k\)-outcome) modal universe is a sequence \(O_1,\dotsc,O_k\), or \(\vec O\) for short, of closed p.m.e.e. formulas, where \(O_i\) is interpreted as the proposition that the \(i\)’th of the \(k\) possible outcomes occurs.

Given a preference ranking on these \(k\) possible outcomes, which has \(n\le k\) equivalence classes, we’ll write \(U_i\) for the modal formula asserting that one of the outcomes in the \(i\)’th ranked equivalence class has occurred; for example, if the preference ordering is such that \(O_2\), \(O_5\) and \(O_7\) are the most preferred outcomes, then \(U_1\equiv O_2\vee O_5\vee O_7\). Clearly, \(U_1,\dotsc,U_n\) (or \(\vec U\) for short) is a sequence of p.m.e.e. formulas; for brevity, we can avoid talking about \(\vec O\), and call the sequence \(\vec U\) an \(n\)-level universe.

(I’m using \(\equiv\) to mean “same formula”.)

Similarly, an (\(m\)-action) modal agent is a sequence \(A_1,\dotsc,A_m\), or \(\vec A\) for short, of closed p.m.e.e. formulas, where \(A_i\) is interpreted as saying that the agent takes its \(i\)’th available action.

A modal (\(m\)-action, \(n\)-level) decision problem is an \(n\)-level universe with a slot for an \(m\)-action agent \(\vec A\): that is, a p.m.e.e. sequence \(P_1(\vec a),\dotsc,P_n(\vec a)\), or \(\vec P(\vec a)\) for short, where \(\vec a\equiv(a_1,\dotsc,a_m)\).

Similarly, a decision theory is an agent with a slot for a universe. However, the definition of decision theories isn’t exactly analogous to that of decision problems, because we want to reflect the idea that the agent can’t directly simulate the universe it’s living in, but can only make reference to it inside quoted formulas; in the modal logic world, this translates to: “inside boxes”. Thus, we define a modal (\(m\)-action, \(n\)-level) decision theory to be a p.m.e.e. sequence of fully modalized formulas \(T_1(\vec u),\dotsc,T_m(\vec u)\), or \(\vec T(\vec u)\) for short, where \(\vec u = (u_1,\dotsc,u_n)\).

(Aside: We could require one of these to be provably mutually exclusive and exhaustive only under the assumption that its argument is p.m.e.e., but my current guess is that it’s not worth it.)

Fixed points

Now we can use the fixed point theorem to show that any modal decision theory has a well-defined output on any modal decision problem, and vice versa: Since \(\vec T(\vec u)\) is fully modalized, so is \(\vec T(\vec P(\vec a))\)—if we substitute the \(P_i(\vec a)\) for \(u_i\), and all instances of \(u_i\) were inside boxes, then all instances of \(\vec a\) in the resulting formula will be inside boxes as well. Thus, we can find a fixed point \(\vec A\) satisfying \(\mathrm{GL}\vdash\vec A\leftrightarrow\vec T(\vec P(\vec A))\) (which I’m using as shorthand for stating that \(\mathrm{GL}\vdash A_i\leftrightarrow T_i(\vec P(\vec A))\) holds for \(i=1,\dotsc,m\)).

By the uniqueness theorem, for any \(\vec B\) satisfying \(\mathrm{GL}\vdash\vec B\leftrightarrow\vec T(\vec P(\vec B))\), we have \(\mathrm{GL}\vdash\vec A\leftrightarrow\vec B\), so our decision theory’s action is well-defined. Moreover, by the rule about substitution of equivalent formulas, it follows that \(\mathrm{GL}\vdash \vec P(\vec A)\leftrightarrow\vec P(\vec B)\), meaning that the preference level \(\vec U :\equiv \vec P(\vec A)\) that our decision theory achieves on this decision problem is well-defined as well.

Thus, for every pair of sequences \((\vec T(\vec u),\vec P(\vec a))\), there’s a pair \((\vec A,\vec U)\) such that \(\mathrm{GL}\vdash\vec A\leftrightarrow T(\vec U)\) and \(\mathrm{GL}\vdash\vec U\leftrightarrow P(\vec A)\); and this pair is unique up to provable equivalence.

Extensional (= “fair”) decision problems

drnickbone’s result states that for every decision theory, there is a “fair” decision problem which on which this decision theory fails, where “fair” means that an agent’s reward depends only on its actions, not on its source code: Any two agents which output the same actions will achieve the same outcome. We can formalize this in modal logic as \[\big(\vec a\leftrightarrow\vec b\big)\,\to\,\big(\vec P(\vec a)\leftrightarrow\vec P(\vec b)\big),\] where I’m writing \(\vec\varphi\leftrightarrow\vec\psi\) for the conjunction of the \(\varphi_i\leftrightarrow\psi_i\). We’ll say that a modal decision problem is provably extensional if GL proves the above fairness condition.

There is one case in which provable extensionality is easy to show: If all occurrences of propositional variables in \(\vec P(\vec a)\) are outside boxes, then the fairness condition is a simple propositional tautology, and all such tautologies are provable in GL.

(Aside: One might wonder whether provable extensionality doesn’t follow for all decision problems from the rule about substitution of equivalent formulas, but that’s only an inference rule: if you can prove in GL that \(\varphi\leftrightarrow\varphi'\), and you can also prove \(\psi(\varphi)\), then it follows that you can prove \(\psi(\varphi')\), but GL does not in general prove \((\varphi\leftrightarrow\varphi')\to(\psi(\varphi)\leftrightarrow\psi(\varphi'))\). Here’s a counterexample: Note that \({\square}\bot\) says “PA proves a contradiction”, so \(\neg{\square}\bot\) says “PA is consistent”. Hence, GL does not prove that \((\top\leftrightarrow\neg{\square}\bot)\to({\square}\top\leftrightarrow{\square}(\neg{\square}\bot))\), because that reduces to \(\neg{\square}\bot\to{\square}\neg{\square}\bot\), which says “if PA is consistent, then PA proves its own consistency”. That statement is false, and by soundness, GL only proves true things.)

Evil problems

Intuitively, given a \(m\)-action, \(n\)-level decision theory \(\vec T(\vec u)\) (where \(m,n\ge 2\)), drnickbone’s “evil” extensional decision problem \(\vec P(\vec a)\) works as follows: First, \(\vec P(\vec a)\) checks what \(\vec T(\vec u)\) would do on this same decision problem. Then it checks whether \(\vec a\) does the same thing. If so, the agent gets the second-best outcome; but if \(\vec a\) differs from \(\vec T(\vec u)\)’s action, the agent gets the best outcome.

To implement this idea, we first define a sequence \(F_1(\vec a,\vec b),\dotsc,F_m(\vec a,\vec b)\), where \(\vec a\) stands for the action of the actual agent and \(\vec b\equiv(b_1,\dotsc,b_m)\) stands for the action that \(\vec T(\vec u)\) would have taken, as follows:

  • \(F_1(\vec a,\vec b) :\equiv \neg\big(\vec a \leftrightarrow \vec b\big)\);
  • \(F_2(\vec a,\vec b) :\equiv \phantom{\neg}\big(\vec a \leftrightarrow \vec b\big)\); and
  • \(F_{j+2}(\vec a,\vec b) :\equiv \bot\).

Now, we use the fixed point theorem to find a sequence \(\vec B\) such that \(\mathrm{GL}\vdash B_i\leftrightarrow T_i(\vec F(\vec B,\vec B))\) for \(i=1,\dotsc,m\). Finally, we let \(\vec P(\vec a) :\equiv \vec F(\vec a,\vec B)\). It’s clear from the definition of \(\vec F(\vec a,\vec b)\) that the formulas in \(\vec P(\vec a)\) are p.m.e.e., as required for a decision problem; and \(\vec a\) only occurs outside boxes in \(\vec P(\vec a)\), so \(\vec P(\vec a)\) is provably extensional.

The point of this is that \(\vec B\) is now clearly the action of \(\vec T(\vec u)\) on the decision problem \(\vec P(\vec a)\): \(B_i\) is equivalent to \(T_i(\vec F(\vec B,\vec B))\), which is equivalent to \(T_i(\vec P(\vec B))\), and \(\mathrm{GL}\vdash\vec B\leftrightarrow \vec T(\vec P(\vec B))\) is exactly the defining condition of “\(\vec B\) says what \(\vec T(\vec u)\) does on \(\vec P(\vec a)\)”. Therefore, \(\vec P(\vec a)\) now implements the intuitive definition of the evil problem: it checks whether \(\vec a\) takes the same action as \(\vec T(\vec u)\); if so, it returns the second-best outcome; otherwise, it returns the best.

So \(\vec T(\vec u)\) gets only the second best outcome, but the following decision theory \(\vec T'(\vec u)\) gets the best one:

  • \(T'_1(\vec u) :\equiv \neg T_1(\vec u)\);
  • \(T'_2(\vec u) :\equiv \phantom{\neg}T_1(\vec u)\);
  • \(T'_{j+2}(\vec u) :\equiv \bot\).

by Paul Christiano 1038 days ago | link

If the agent is randomized (and the universe can’t observe this randomness) then this problem goes away. So it still seems (a bit) plausible that one can have an “optimal” decision theory. The formalization of this would be non-trivial. Intuitively, it seems like if UDT gets to condition on the output of a competitor X, then UDT ought to be able to produce an output at least as good as X’s (and it is essentially the unique decision theory with this property), for any extensional problem.

The modal logic setting might be OK for exploring these questions, but I suspect the results would have to be “pulled back” to a more conventional environment in order to be widely appreciated.

(Sorry if this would be more appropriate as a comment on your follow-up post.)


by Benja Fallenstein 1038 days ago | link

Actually, drnickbone’s original LessWrong post introducing evil problems also gives an extension to the case you are considering: The evil decision problem gives the agent three or more options, and rewards the one that the “victim” decision theory assigns the least probability to (breaking ties lexicographically). Then, no decision theory can put probability \(>1/N\) on the action that is rewarded in its probabilistic evil problem.


by Paul Christiano 1038 days ago | link

This setup plays a computational trick, and as a result I don’t think it violates the optimality standard I proposed. In order to decide what it should do, the CDT agent needs to think strictly longer than the UDT agent. But if the CDT agent thinks longer than the UDT agent, it’s totally unsurprising that it does better! (Basically, the problem just consists of a computational question which is chosen to be slightly too complex for the UDT agent. But the CDT agent is allowed to think as long as it likes. This entire family of problems appears to be predicated on the lack of computational limits for our agents.)

As a result, if the UDT agent is told what the CDT agent decides, then it can get the same performance as the CDT agent. This seems to illustrate that the CDT agent isn’t doing better by being wiser, just by knowing something the UDT agent doesn’t. (I wasn’t actually thinking about this case when I introduced the weakened criterion; the weakening is obviously necessary for UDT with 10 years of time to compete with CDT with 11 years of time, and I included it for that reason.)

Does this seem right? If so, is there a way to set up the problem that violates my weakened standard?

Incidentally, this problem involves a discontinuous dependence on UDT’s decision(both by the competitor and by the environment). I wonder if this discontinuous dependence is necessary?


by Paul Christiano 1038 days ago | link

Nevermind, we get the same problem in the Newcomb’s problem case when the simulated agent is running UDT, i.e. where P(big box contains $1M) = P(UDT agent 1-boxes after being told that the CDT agent 2-boxes).






Note that the problem with
by Vadim Kosoy on Open Problems Regarding Counterfactuals: An Introd... | 0 likes

Typos on page 5: *
by Vadim Kosoy on Open Problems Regarding Counterfactuals: An Introd... | 0 likes

Ah, you're right. So gain
by Abram Demski on Smoking Lesion Steelman | 0 likes

> Do you have ideas for how
by Jessica Taylor on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

I think I understand what
by Wei Dai on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

>You don’t have to solve
by Wei Dai on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

Your confusion is because you
by Vadim Kosoy on Delegative Inverse Reinforcement Learning | 0 likes

My confusion is the
by Tom Everitt on Delegative Inverse Reinforcement Learning | 0 likes

> First of all, it seems to
by Abram Demski on Smoking Lesion Steelman | 0 likes

> figure out what my values
by Vladimir Slepnev on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

I agree that selection bias
by Jessica Taylor on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

>It seems quite plausible
by Wei Dai on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

> defending against this type
by Paul Christiano on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

2. I think that we can avoid
by Paul Christiano on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

I hope you stay engaged with
by Wei Dai on Autopoietic systems and difficulty of AGI alignmen... | 0 likes


Privacy & Terms