by 258 23 days ago | Alex Appel and Abram Demski like this

Since Briggs [1] shows that EDT+SSA and CDT+SIA are both ex-ante-optimal policies in some class of cases, one might wonder whether the result of this post transfers to EDT+SSA. That is, in memoryless POMDPs, is every (ex ante) optimal policy also consistent with EDT+SSA in a similar sense? I think it is, as I will try to show below.

Given some existing policy $$\pi$$, EDT+SSA recommends that upon receiving observation $$o$$ we should choose an action from

$\arg\max_a \sum_{s_1,...,s_n} \sum_{i=1}^n SSA(s_i\text{ in }s_1,...,s_n\mid o, \pi_{o\rightarrow a})U(s_n).$

(For notational simplicity, I’ll assume that policies are deterministic, but, of course, actions may encode probability distributions.) Here, $$\pi_{o\rightarrow a}(o')=a$$ if $$o=o'$$ and $$\pi_{o\rightarrow a}(o')=\pi(o')$$ otherwise. $$SSA(s_i\text{ in }s_1,...,s_n\mid o, \pi_{o\rightarrow a})$$ is the SSA probability of being in state $$s_i$$ of the environment trajectory $$s_1,...,s_n$$ given the observation $$o$$ and the fact that one uses the policy $$\pi_{o\rightarrow a}$$.

The SSA probability $$SSA(s_i\text{ in }s_1,...,s_n\mid o, \pi_{o\rightarrow a})$$ is zero if $$m(s_i)\neq o$$, i.e., if the observation in state $$s_i$$ is not $$o$$, and

$SSA(s_i\text{ in }s_1,...,s_n\mid o, \pi_{o\rightarrow a}) = P(s_1,...,s_n\mid \pi_{o\rightarrow a}) \frac{1}{\#(o,s_1,...,s_n)}$

otherwise. Here, $$\#(o,s_1,...,s_n)=\sum_{i=1}^n \left[ m(s_i)=o \right]$$ is the number of times $$o$$ occurs in $$s_1,...,s_n$$. Note that this is the minimal reference class version of SSA, also known as the double-halfer rule (because it assigns 1/2 probability to tails in the Sleeping Beauty problem and sticks with 1/2 if it’s told that it’s Monday).
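To make the SSA definition above concrete, here is a minimal Python sketch that applies it to Sleeping Beauty and checks the double-halfer behaviour. The state names, the two observation functions `m_awake` and `m_day`, and the helper names are hypothetical choices made for illustration; they are not taken from the post or from Briggs. Since Beauty takes no actions, the history probabilities do not depend on the policy here.

```python
from fractions import Fraction

# Hypothetical encoding of Sleeping Beauty (not from the post): each history
# is a tuple of states, mapped to its probability. There are no actions here,
# so P(history | policy) does not depend on the policy.
HISTORIES = {
    ("heads_mon",): Fraction(1, 2),
    ("tails_mon", "tails_tue"): Fraction(1, 2),
}

def m_awake(state):
    """Observation function: Beauty only learns that she is awake."""
    return "awake"

def m_day(state):
    """Observation function: Beauty is also told which day it is."""
    return "mon" if state.endswith("mon") else "tue"

def count_obs(o, history, m):
    """#(o, s_1..s_n): number of states in the history with observation o."""
    return sum(1 for s in history if m(s) == o)

def ssa(i, history, o, m):
    """SSA probability of being in state history[i] of this history, given o."""
    if m(history[i]) != o:
        return Fraction(0)
    return HISTORIES[history] / count_obs(o, history, m)

def prob_tails(o, m):
    """Total SSA probability of the tails history, summed over its states."""
    return sum(
        ssa(i, h, o, m)
        for h in HISTORIES
        for i in range(len(h))
        if h[0].startswith("tails")
    )

print(prob_tails("awake", m_awake))  # 1/2
print(prob_tails("mon", m_day))      # 1/2
```

In the tails history the two awakenings each receive 1/4 under `m_awake`, so tails gets 1/2 in total; once Beauty is told it is Monday, only one state per history matches the observation and tails again gets 1/2.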
Inserting this into the above, we get

$\arg\max_a \sum_{s_1,...,s_n} \sum_{i=1}^n SSA(s_i\text{ in }s_1,...,s_n\mid o, \pi_{o\rightarrow a})U(s_n)=\arg\max_a \sum_{s_1,...,s_n\text{ with }o} \sum_{i=1...n, m(s_i)=o} \left( P(s_1,...,s_n\mid \pi_{o\rightarrow a}) \frac{1}{\#(o,s_1,...,s_n)} \right) U(s_n),$

where the first sum on the right-hand side is over all histories that give rise to observation $$o$$ at some point. Dividing by the number of agents with observation $$o$$ in a history and setting the policy for all agents at the same time cancel each other out: for each such history, the inner sum has exactly $$\#(o,s_1,...,s_n)$$ terms, each weighted by $$\frac{1}{\#(o,s_1,...,s_n)}$$, so the weights sum to 1. Hence this equals

$\arg\max_a \sum_{s_1,...,s_n\text{ with }o} P(s_1,...,s_n\mid \pi_{o\rightarrow a}) U(s_n)=\arg\max_a \sum_{s_1,...,s_n} P(s_1,...,s_n\mid \pi_{o\rightarrow a}) U(s_n).$

The last equality holds because a history in which $$o$$ never occurs has the same probability under every $$\pi_{o\rightarrow a}$$, so such histories only add a term that does not depend on $$a$$.

Obviously, any optimal policy chooses in agreement with this. But the same disclaimers apply: multiple policies satisfy the right-hand side of this equation, and not all of them are optimal.

[1] Rachael Briggs (2010): Putting a value on Beauty. In Tamar Szabo Gendler and John Hawthorne, editors, Oxford Studies in Epistemology: Volume 3, pages 3–34. Oxford University Press. http://joelvelasco.net/teaching/3865/briggs10-puttingavalueonbeauty.pdf
