Summary: I define memoryless Cartesian environments (which can model many familiar decision problems), note the similarity to memoryless POMDPs, and define a local optimality condition for policies, which can be roughly stated as “the policy is consistent with maximizing expected utility using CDT and subjective probabilities derived from SIA”. I show that this local optimality condition is necessary but not sufficient for global optimality (UDT).
Memoryless Cartesian environments
I’ll define a memoryless Cartesian environment to consist of:
 a set of states \(\mathcal{S}\)
 a set of actions \(\mathcal{A}\)
 a set of observations \(\mathcal{O}\)
 an initial state \(s_1 \in \mathcal{S}\)
 a transition function \(t : \mathcal{S} \times \mathcal{A} \rightarrow \Delta \mathcal{S}\), determining the distribution of states resulting from starting in a state and taking a certain action
 an observation function \(m : \mathcal{S} \rightarrow \mathcal{O}\), determining what the agent sees in a given state
 a set \(\mathcal{S}_T \subset \mathcal{S}\) of terminal states. If the environment reaches a terminal state, the game ends.
 a utility function \(U : \mathcal{S}_T \rightarrow [0, 1]\), measuring the value of each terminal state.
On each iteration, the agent observes some observation, and takes some action. Unlike in a POMDP, the agent has no memory of previous observations: the agent’s policy must take into account only the current observation. That is, the policy is of type \(\mathcal{O} \rightarrow \Delta \mathcal{A}\). In this analysis I’ll assume that, for any state and policy, the expected number of iterations in the Cartesian environment starting from that state and using that policy is finite.
Memoryless Cartesian environments can be used to define many familiar decision problems: for example, the absentminded driver problem, Newcomb’s problem with opaque or transparent boxes (assuming Omega runs a copy of the agent to make its prediction), and counterfactual mugging (also assuming Omega simulates the agent). Translating a decision problem to a memoryless Cartesian environment requires making some Cartesian assumptions/decisions, though; in the case of Newcomb’s problem, we have to isolate Omega’s simulation of the agent as a copy of the agent.
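As a concrete illustration, here is a minimal Python sketch of the absentminded driver problem written as a memoryless Cartesian environment. The state names, the dictionary layout, and the `rollout` helper are all hypothetical choices made for this sketch. The payoffs (0 for the first exit, 4 for the second exit, 1 for driving on) are the ones commonly used for this problem, rescaled into \([0, 1]\); they are an assumption, not something fixed by the definition above.

```python
import random

# Absentminded driver as a memoryless Cartesian environment (illustrative sketch).
# Assumed payoffs (0, 4, 1), divided by 4 so that utilities lie in [0, 1].
S1 = "first"                      # initial state s_1
ACTIONS = ("exit", "continue")    # action set A
T = {                             # transition function t: (state, action) -> {next state: prob}
    ("first", "exit"): {"exit1": 1.0},
    ("first", "continue"): {"second": 1.0},
    ("second", "exit"): {"exit2": 1.0},
    ("second", "continue"): {"end": 1.0},
}
# Observation function m; both intersections look the same to the agent.
# (Only non-terminal states are listed, a simplification of this sketch.)
M = {"first": "intersection", "second": "intersection"}
U = {"exit1": 0.0, "exit2": 1.0, "end": 0.25}   # utility on terminal states S_T


def rollout(policy, rng):
    """Sample one episode. `policy` maps observation -> {action: probability}."""
    s = S1
    while s not in U:
        dist = policy[M[s]]       # the policy sees only the current observation
        a = rng.choices(list(dist), weights=list(dist.values()))[0]
        nxt = T[(s, a)]
        s = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return U[s]


policy = {"intersection": {"exit": 1 / 3, "continue": 2 / 3}}
rng = random.Random(0)
mean = sum(rollout(policy, rng) for _ in range(20000)) / 20000
print(mean)  # Monte Carlo estimate of the policy's expected utility
```

Because the policy is indexed by observation rather than by state or history, the driver necessarily continues with the same probability at both intersections.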
Globally and locally optimal policies
Memoryless Cartesian environments are much like memoryless POMDPs, and the following analysis is quite similar to that given in some previous work on memoryless POMDPs: the main difference is that I am targeting (local) optimality given a known world model, while previous work usually targets asymptotic (local) optimality given an unknown world model.
Let us define the expected utility of a particular state, given a policy:
\[V_\pi(s) := U(s) \text{ if } s \in \mathcal{S}_T\] \[V_\pi(s) := \sum_a \pi(a \mid m(s)) \sum_{s'} t(s' \mid s, a) V_{\pi}(s') \text{ otherwise}\]
Although this definition is recursive, the recursion is well-founded (since the expected number of iterations starting from any particular state is finite). Note that the agent’s initial expected utility is just \(V_{\pi}(s_1)\). Now we can also define a Q function, determining the expected utility of being in a certain state and taking a certain action:
\[Q_{\pi}(s, a) := \sum_{s'} t(s' \mid s, a) V_{\pi}(s')\]
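The two definitions translate almost line for line into code. The following is a rough sketch on a hypothetical encoding of the absentminded driver (assumed payoffs 0, 4, 1 rescaled into \([0, 1]\)). It uses plain recursion, so it only terminates when the transition graph is acyclic; an environment with cycles would need a linear solve or an iterative method instead.

```python
# Hypothetical encoding of the absentminded driver (assumed payoffs, rescaled to [0, 1]).
S1 = "first"
T = {
    ("first", "exit"): {"exit1": 1.0},
    ("first", "continue"): {"second": 1.0},
    ("second", "exit"): {"exit2": 1.0},
    ("second", "continue"): {"end": 1.0},
}
M = {"first": "intersection", "second": "intersection"}  # observation function m
U = {"exit1": 0.0, "exit2": 1.0, "end": 0.25}            # terminal utilities


def V(pi, s):
    """Expected utility of state s under policy pi (observation -> {action: prob})."""
    if s in U:                                   # terminal state: V(s) = U(s)
        return U[s]
    return sum(p * Q(pi, s, a) for a, p in pi[M[s]].items())


def Q(pi, s, a):
    """Expected utility of taking action a in state s, then following pi."""
    return sum(p * V(pi, s2) for s2, p in T[(s, a)].items())


pi = {"intersection": {"exit": 1 / 3, "continue": 2 / 3}}
print(V(pi, S1))                    # the agent's initial expected utility
print(Q(pi, "first", "continue"))   # equals V(pi, "second") in this environment
```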
Let \(N\) be a random variable indicating the total number of iterations, and \(S_1, ..., S_N\) be random variables indicating the state on each iteration. It is now possible to define the frequency of a given state (i.e. the expected number of times the agent will encounter this state):
\[F_{\pi}(s) := \mathbb{E}\left[\sum_{i=1}^N [S_i = s] \,\middle|\, \pi\right]\]
These frequencies are bounded since the expectation of \(N\) is bounded. Given an observation, the agent may be uncertain which state it is in (since multiple states might result in the same observation). It is possible to use SIA to define subjective state probabilities using these frequencies:
\[SIA_{\pi}(s \mid o) := [m(s) = o] F_{\pi}(s)\]
Note that I’ve defined SIA to return an unnormalized probability distribution; this turns out to be useful later, since it naturally handles the case when the observation \(o\) occurs with probability 0.
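To make the frequencies and the unnormalized SIA weights concrete, here is a small sketch that computes them by pushing probability mass forward from \(s_1\) one step at a time. The environment is a hypothetical encoding of the absentminded driver, and the function names, tolerance, and step cap are choices of this sketch. It tracks only non-terminal states, which is all that the SIA weights need here.

```python
# Hypothetical encoding of the absentminded driver (assumed payoffs, rescaled to [0, 1]).
S1 = "first"
T = {
    ("first", "exit"): {"exit1": 1.0},
    ("first", "continue"): {"second": 1.0},
    ("second", "exit"): {"exit2": 1.0},
    ("second", "continue"): {"end": 1.0},
}
M = {"first": "intersection", "second": "intersection"}
U = {"exit1": 0.0, "exit2": 1.0, "end": 0.25}


def frequencies(pi, tol=1e-12, max_steps=10_000):
    """Expected number of visits F(s) to each non-terminal state under policy pi."""
    freq = {}
    mass = {S1: 1.0}              # distribution over the current non-terminal state
    for _ in range(max_steps):
        if sum(mass.values()) <= tol:
            break                 # (almost) every run has ended by now
        nxt = {}
        for s, w in mass.items():
            freq[s] = freq.get(s, 0.0) + w
            for a, pa in pi[M[s]].items():
                for s2, pt in T[(s, a)].items():
                    if s2 not in U:   # terminal states stop the run
                        nxt[s2] = nxt.get(s2, 0.0) + w * pa * pt
        mass = nxt
    return freq


def sia(pi, o):
    """Unnormalized SIA weights for observation o (states with m(s) != o are omitted)."""
    return {s: f for s, f in frequencies(pi).items() if M[s] == o}


pi = {"intersection": {"exit": 1 / 3, "continue": 2 / 3}}
weights = sia(pi, "intersection")
total = sum(weights.values())
print(weights)                                        # unnormalized weights
print({s: w / total for s, w in weights.items()})     # normalized, when total > 0
```

With this policy the first intersection is always visited and the second is visited with probability 2/3, so the normalized SIA probabilities come out to roughly 0.6 and 0.4.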
How might an agent decide which action to take? Under one approach (UDT), the agent simply computes the globally optimal policy \(\pi\) that results in maximum expected utility (that is, a policy \(\pi\) maximizing \(V_{\pi}(s_1)\)) and takes the action recommended by this policy (perhaps stochastically). While UDT is philosophically satisfying, it is not a very direct algorithm. It would be nice to have a better intuition for how an agent using UDT acts, such that we could (in some cases) derive a polynomial-time algorithm.
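For a sense of what "computing the globally optimal policy" means in the simplest case, the sketch below does a brute-force grid search over the single free parameter of the absentminded driver's policy (the probability of continuing). The environment encoding and payoffs are assumptions of this sketch, and a grid search like this is only workable because there is one observation and two actions; it is not a proposal for a general algorithm.

```python
# Hypothetical encoding of the absentminded driver (assumed payoffs, rescaled to [0, 1]).
S1 = "first"
T = {
    ("first", "exit"): {"exit1": 1.0},
    ("first", "continue"): {"second": 1.0},
    ("second", "exit"): {"exit2": 1.0},
    ("second", "continue"): {"end": 1.0},
}
M = {"first": "intersection", "second": "intersection"}
U = {"exit1": 0.0, "exit2": 1.0, "end": 0.25}


def V(pi, s):
    """Expected utility of state s under policy pi (acyclic environments only)."""
    if s in U:
        return U[s]
    return sum(
        pa * sum(pt * V(pi, s2) for s2, pt in T[(s, a)].items())
        for a, pa in pi[M[s]].items()
    )


def policy(p_continue):
    """The one-parameter family of policies for this environment."""
    return {"intersection": {"exit": 1.0 - p_continue, "continue": p_continue}}


# Evaluate V(s_1) on a grid of continuation probabilities and keep the best one.
grid = [i / 3000 for i in range(3001)]
best_p = max(grid, key=lambda p: V(policy(p), S1))
print(best_p, V(policy(best_p), S1))
```

Under the assumed payoffs, \(V_{\pi}(s_1) = p - 0.75 p^2\) where \(p\) is the probability of continuing, so the search should land near \(p = 2/3\).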
So let’s consider a local optimality condition. Intuitively, the condition states that if the agent has a nonzero probability of taking an action \(a\) given observation \(o\), then that action should maximize expected utility (given the agent’s uncertainty about which state it is in). More formally, the local optimality condition states:
\[\forall o \in \mathcal{O}, a \in \mathcal{A}: \pi(a \mid o) > 0 \rightarrow a \in \arg\max_{a'} \sum_s SIA_{\pi}(s \mid o) Q_{\pi}(s, a')\]
Philosophically, a policy is locally optimal iff it is consistent with CDT (using SIA probabilities). This local optimality condition is not sufficient for global optimality (for the same reason that not all Nash equilibria in cooperative games are optimal), but it is necessary. The proof follows.
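The condition can be checked mechanically on small examples. The sketch below does so for two hypothetical environments: the absentminded driver (assumed payoffs, rescaled into \([0, 1]\)) and a toy coordination problem in which the agent acts twice under the same observation and is paid 1 for choosing A both times, 0.5 for choosing B both times, and 0 otherwise. All names, payoffs, and the numerical tolerance are assumptions of this sketch, and the recursion in `V` assumes an acyclic environment.

```python
# Each environment: initial state s1, transitions T, observations M (non-terminal
# states only), terminal utilities U. Both are illustrative assumptions.
DRIVER = {
    "s1": "first",
    "T": {("first", "exit"): {"exit1": 1.0}, ("first", "continue"): {"second": 1.0},
          ("second", "exit"): {"exit2": 1.0}, ("second", "continue"): {"end": 1.0}},
    "M": {"first": "x", "second": "x"},
    "U": {"exit1": 0.0, "exit2": 1.0, "end": 0.25},
}
COORD = {
    "s1": "s1",
    "T": {("s1", "A"): {"sA": 1.0}, ("s1", "B"): {"sB": 1.0},
          ("sA", "A"): {"AA": 1.0}, ("sA", "B"): {"miss": 1.0},
          ("sB", "A"): {"miss": 1.0}, ("sB", "B"): {"BB": 1.0}},
    "M": {"s1": "o", "sA": "o", "sB": "o"},
    "U": {"AA": 1.0, "BB": 0.5, "miss": 0.0},
}


def V(env, pi, s):
    if s in env["U"]:
        return env["U"][s]
    return sum(p * Q(env, pi, s, a) for a, p in pi[env["M"][s]].items())


def Q(env, pi, s, a):
    return sum(p * V(env, pi, s2) for s2, p in env["T"][(s, a)].items())


def frequencies(env, pi, tol=1e-12, max_steps=10_000):
    """Expected visit counts of non-terminal states, by forward propagation."""
    freq, mass = {}, {env["s1"]: 1.0}
    for _ in range(max_steps):
        if sum(mass.values()) <= tol:
            break
        nxt = {}
        for s, w in mass.items():
            freq[s] = freq.get(s, 0.0) + w
            for a, pa in pi[env["M"][s]].items():
                for s2, pt in env["T"][(s, a)].items():
                    if s2 not in env["U"]:
                        nxt[s2] = nxt.get(s2, 0.0) + w * pa * pt
        mass = nxt
    return freq


def is_locally_optimal(env, pi, tol=1e-9):
    """Every action with positive probability must maximize the SIA-weighted Q value.
    The policy must list every action for each observation, including those with
    probability 0, since the action set is read off the policy's keys."""
    F = frequencies(env, pi)
    for o, dist in pi.items():
        score = {a: sum(f * Q(env, pi, s, a) for s, f in F.items() if env["M"][s] == o)
                 for a in dist}
        best = max(score.values())
        if any(p > 0 and score[a] < best - tol for a, p in dist.items()):
            return False
    return True


always_b = {"o": {"A": 0.0, "B": 1.0}}
always_a = {"o": {"A": 1.0, "B": 0.0}}
print(is_locally_optimal(DRIVER, {"x": {"exit": 1 / 3, "continue": 2 / 3}}))
print(is_locally_optimal(DRIVER, {"x": {"exit": 0.5, "continue": 0.5}}))
print(is_locally_optimal(COORD, always_b), V(COORD, always_b, "s1"))
print(is_locally_optimal(COORD, always_a), V(COORD, always_a, "s1"))
```

In the coordination example, always choosing B passes the check even though always choosing A achieves strictly higher expected utility, which illustrates why local optimality is not sufficient for global optimality.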
Global optimality implies local optimality
Let \(s\) be a state and \(\pi\) be a policy. Consider a perturbation of the policy \(\pi\): given observation \(o\), the agent will take action \(a_+\) more often, and action \(a_-\) less often. How will this slight perturbation affect expected utility \(V_{\pi}(s)\)?
\[d_{\pi}(o, a_+, a_-, s) := \frac{\partial}{\partial (\pi(a_+ \mid o) - \pi(a_- \mid o))} V_{\pi}(s)\]
\[d_{\pi}(o, a_+, a_-, s) = 0 \text{ if } s \in \mathcal{S}_T\] \[d_{\pi}(o, a_+, a_-, s) = \sum_{a} \pi(a \mid m(s)) \sum_{s'} t(s' \mid s, a) d_{\pi}(o, a_+, a_-, s') + [m(s) = o] (Q_{\pi}(s, a_+) - Q_{\pi}(s, a_-)) \text{ otherwise}\]
This has a natural interpretation: to compute \(d_{\pi}(o, a_+, a_-, s)\), we compute the expected value of simulating a run starting from \(s\) using policy \(\pi\) and summing \(Q_{\pi}(s', a_+) - Q_{\pi}(s', a_-)\) for all visited states \(s'\) with \(m(s') = o\).
To determine the optimal policy, we are concerned with \(d_{\pi}(o, a_+, a_-, s_1)\) for different observations \(o\) and actions \(a_+, a_-\). To compute this, we imagine starting from the state \(s_1\) and following policy \(\pi\), and sum \(Q_{\pi}(s, a_+) - Q_{\pi}(s, a_-)\) for all visited states \(s\) with \(m(s) = o\). This expected sum is actually equivalent to
\[\sum_{s, m(s) = o} F_{\pi}(s) (Q_{\pi}(s, a_+) - Q_{\pi}(s, a_-))\] \[= \sum_{s} SIA_{\pi}(s \mid o) (Q_{\pi}(s, a_+) - Q_{\pi}(s, a_-))\]
i.e. the expected value of \(Q_{\pi}(s, a_+) - Q_{\pi}(s, a_-)\) with \(s\) having \(SIA\) probabilities (up to a multiplicative constant). From here the implication should be clear: if a policy \(\pi\) is not locally optimal, then there is some \(o, a_+, a_-\) triple such that a small change in making \(a_+\) more likely and \(a_-\) less likely given observation \(o\) will increase expected utility (just set \(a_-\) to the non-optimal action having nonzero probability given \(o\), and set \(a_+\) to be a better alternative action). So this policy \(\pi\) would not be globally optimal either.
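The identity between the policy perturbation and the SIA-weighted difference of Q values can be checked numerically. The sketch below does this on a hypothetical encoding of the absentminded driver (assumed payoffs, rescaled into \([0, 1]\)), taking \(a_+\) to be "continue" and \(a_-\) to be "exit". The perturbation is implemented as moving a small probability `eps` from \(a_-\) to \(a_+\), and the derivative is estimated with a central finite difference; both are choices of this sketch.

```python
# Hypothetical encoding of the absentminded driver (assumed payoffs, rescaled to [0, 1]).
S1 = "first"
T = {
    ("first", "exit"): {"exit1": 1.0},
    ("first", "continue"): {"second": 1.0},
    ("second", "exit"): {"exit2": 1.0},
    ("second", "continue"): {"end": 1.0},
}
M = {"first": "x", "second": "x"}
U = {"exit1": 0.0, "exit2": 1.0, "end": 0.25}


def V(pi, s):
    if s in U:
        return U[s]
    return sum(p * Q(pi, s, a) for a, p in pi[M[s]].items())


def Q(pi, s, a):
    return sum(p * V(pi, s2) for s2, p in T[(s, a)].items())


def frequencies(pi, tol=1e-12, max_steps=10_000):
    """Expected visit counts of non-terminal states, by forward propagation."""
    freq, mass = {}, {S1: 1.0}
    for _ in range(max_steps):
        if sum(mass.values()) <= tol:
            break
        nxt = {}
        for s, w in mass.items():
            freq[s] = freq.get(s, 0.0) + w
            for a, pa in pi[M[s]].items():
                for s2, pt in T[(s, a)].items():
                    if s2 not in U:
                        nxt[s2] = nxt.get(s2, 0.0) + w * pa * pt
        mass = nxt
    return freq


def shifted(p, eps):
    """Policy continuing with probability p, after moving eps from 'exit' to 'continue'."""
    return {"x": {"exit": 1.0 - p - eps, "continue": p + eps}}


p, eps = 0.4, 1e-6
pi = shifted(p, 0.0)

# Left-hand side: finite-difference estimate of the change in V(s_1).
finite_diff = (V(shifted(p, eps), S1) - V(shifted(p, -eps), S1)) / (2 * eps)

# Right-hand side: sum over s of SIA(s | o) * (Q(s, a_+) - Q(s, a_-)).
F = frequencies(pi)
sia_sum = sum(f * (Q(pi, s, "continue") - Q(pi, s, "exit"))
              for s, f in F.items() if M[s] == "x")

print(finite_diff, sia_sum)
```

Both quantities should agree up to floating-point error, and both change sign at the continuation probability where the policy becomes locally optimal.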
Conclusion
In memoryless Cartesian environments, policies consistent with CDT+SIA are locally optimal in some sense, and all globally optimal (UDT) policies are locally optimal in this sense. Therefore, if we look at (Cartesian) UDT the right way, it’s doing CDT+SIA with some method for making sure the resulting policy is globally optimal rather than just locally optimal. It is not clear how to extend this analysis to non-Cartesian environments where logical updatelessness is important (e.g. agent simulates predictor), but this seems like a useful research avenue.
