Quantilal control for finite MDPs
post by Vanessa Kosoy 441 days ago | Ryan Carey, Alex Appel and Abram Demski like this | discuss

We introduce a variant of the concept of a “quantilizer” for the setting of choosing a policy for a finite Markov decision process (MDP), where the generic unknown cost is replaced by an unknown penalty term in the reward function. This is essentially a generalization of quantilization in repeated games with a cost independence assumption. We show that the “quantilal” policy shares some properties with the ordinary optimal policy, namely that (i) it can always be chosen to be Markov (ii) it can be chosen to be stationary when time discount is geometric (iii) the “quantilum” value of an MDP with geometric time discount is a continuous piecewise rational function of the parameters, and it converges when the discount parameter $$\lambda$$ approaches 1. Finally, we demonstrate a polynomial-time algorithm for computing the quantilal policy, showing that quantilization is not qualitatively harder than ordinary optimization.

## Background

Quantilization (introduced in Taylor 2015) is a method of dealing with “Extremal Goodhart’s Law”. According to Extremal Goodhart, when we attempt to optimize a utility function $${\operatorname{U}}^*: {\mathcal{A}}\rightarrow {\mathbb{R}}$$ by aggressively optimizing a proxy $${\operatorname{U}}: {\mathcal{A}}\rightarrow {\mathbb{R}}$$, we are likely to land outside of the domain where the proxy is useful. Quantilization addresses this by assuming an unknown cost function $${C}: {\mathcal{A}}\rightarrow [0,\infty)$$ whose expectation $${\underset{\zeta}{\operatorname{E}}{\left[{C}\right]}}$$ w.r.t. some reference probability measure $$\zeta \in \Delta{\mathcal{A}}$$ is bounded by 1. $$\zeta$$ can be thought of as defining the “domain” within which $${\operatorname{U}}$$ is well-behaved (for example it can be the probability measure of choices made by humans). We can then seek to maximize $${\underset{}{\operatorname{E}}{\left[{\operatorname{U}}\right]}}$$ while constraining $${\underset{}{\operatorname{E}}{\left[{C}\right]}}$$ by a fixed bound $$C_{\max}$$:

$\tilde{\xi}^* :\in {\underset{\xi \in \Delta{\mathcal{A}}}{\operatorname{arg\,max}}\,}{\left\{{\underset{\xi}{\operatorname{E}}{\left[{\operatorname{U}}\right]}}\;\middle\vert\;\forall {C}: {\mathcal{A}}\rightarrow [0,\infty): {\underset{\zeta}{\operatorname{E}}{\left[{C}\right]}} \leq 1 \implies {\underset{\xi}{\operatorname{E}}{\left[{C}\right]}} \leq C_{\max}\right\}}$

Alternatively, we can choose some parameter $$\eta\in(0,\infty)$$ and maximize the minimal guaranteed expectation of $${\operatorname{U}}-\eta {C}$$:

$\xi^* :\in {\underset{\xi \in \Delta{\mathcal{A}}}{\operatorname{arg\,max}}\,}\inf_{{C}:{\mathcal{A}}\rightarrow[0,\infty)}{\left\{{\underset{\xi}{\operatorname{E}}{\left[{\operatorname{U}}-\eta {C}\right]}}\;\middle\vert\;{\underset{\zeta}{\operatorname{E}}{\left[{C}\right]}} \leq 1\right\}}$

These two formulations are almost equivalent since both amount to finding a strategy that is Pareto efficient w.r.t. to the two objectives $${\underset{}{\operatorname{E}}{\left[{\operatorname{U}}\right]}}$$ and $$-{\underset{}{\operatorname{E}}{\left[{C}\right]}}$$. For $$\tilde{\xi}^*$$ the tradeoff is governed by the parameter $$C_{\max}$$ and for $$\xi^*$$ the tradeoff is governed by the parameter $$\eta$$. Indeed, it is easy to see that any $$\xi^*$$ is also optimal for the first criterion if we take $$C_{\max}={\underset{\xi^*}{\operatorname{E}}{\left[{C}\right]}}$$, and any $$\tilde{\xi}^*$$ is also optimal for the latter criterion for an appropriate choice of $$\eta$$ (it needs to be a subderivative of the Pareto frontier).

In the following, we will use as our starting point the second formulation, which can be thought of as a zero-sum game in which $${\operatorname{U}}-\eta {C}$$ is the utility function of an agent whose strategy set is $${\mathcal{A}}$$, and $${C}$$ is chosen by the adversary. The quantilal strategy $$\xi^*$$ is the Nash equilibrium of the game.

This formulation seems natural if we take $$\eta:={\underset{\zeta}{\operatorname{E}}{\left[\max{\left({\operatorname{U}}-{\operatorname{U}}^*,0\right)}\right]}}$$ (a measure of how “optimistic” is $${\operatorname{U}}$$ in the domain $$\zeta$$) and $${C}:=\eta^{-1}\max{\left({\operatorname{U}}-{\operatorname{U}}^*,0\right)}$$. In particular, the quantilum value (the value of the game) is a lower bound on the expectation of $${\operatorname{U}}^*$$.

In principle, this formalism can be applied to sequential interaction between an agent and an environment, if we replace $${\mathcal{A}}$$ by the set of policies. However, if it is possible to make structural assumptions about $${\operatorname{U}}$$ and $$C$$, we can do better. Taylor explores one such structural assumption, namely a sequence of independent games in which both $${\operatorname{U}}$$ and $${C}$$ are additive across the games. We consider a more general setting, namely that of a finite Markov decision process (MDP).

## Notation

Given a set $$A$$, the notation $$A^*$$ will denote the set of finite strings over alphabet $$A$$, i.e.

$A^* := \bigsqcup_{n = 0}^\infty A^n$

$$A^\omega$$ denotes the space of infinite strings over alphabet $$A$$, equipped with the product topology and the corresponding Borel sigma-algebra. Given $$x\in A^\omega$$ and $$n \in {\mathbb{N}}$$, $$x_n \in A$$ is the $$n$$-th symbol of the string $$x$$ (in our conventions, $$0 \in {\mathbb{N}}$$ so the string begins from the 0th symbol.) Given $$h \in A^*$$ and $$x \in A^\omega$$, the notation $$h \sqsubset x$$ means that $$h$$ is a prefix of $$x$$.

Given a set $$A$$, $$x\in A^\omega$$ and $$n\in{\mathbb{N}}$$, the notation $$x_{:n}$$ will indicate the prefix of $$x$$ of length $$n$$. That is, $$x_{:n} \in A^n$$ and $$x_{:n} \sqsubset x$$.

Given a measurable space $$X$$, we denote $$\Delta X$$ the space of probability measures on $$X$$.

Given measurable spaces $$X$$ and $$Y$$, the notation $$K: X {\xrightarrow{\text{k}}}Y$$ means that $$K$$ is a Markov kernel from $$X$$ to $$Y$$. Given $$x \in X$$, $$K(x)$$ is the corresponding probability measure on $$Y$$. Given $$A \subseteq Y$$ measurable, $$K(A \mid x) := K(x)(A)$$. Given $$y \in Y$$, $$K(y \mid x):=K{\left(\{y\}\;\middle\vert\;x\right)}$$. Given $$J: Y {\xrightarrow{\text{k}}}Z$$, $$JK: X {\xrightarrow{\text{k}}}Z$$ is the composition of $$J$$ and $$K$$, and when $$Y = X$$, $$K^n$$ is the $$n$$-th composition power.

## Results

A finite MDP is defined by a non-empty finite set of states $${\mathcal{S}}$$, a non-empty finite set of actions $${\mathcal{A}}$$, a transition kernel $${\mathcal{T}}: {\mathcal{S}}\times {\mathcal{A}}{\xrightarrow{\text{k}}}{\mathcal{S}}$$ and a reward function $${\mathcal{R}}: {\mathcal{S}}\rightarrow {\mathbb{R}}$$. To specify the utility function, we also need to fix a time discount function $$\gamma \in \Delta{\mathbb{N}}$$. This allows defining $${\operatorname{U}}: {\mathcal{S}}^\omega \rightarrow {\mathbb{R}}$$ by

${\operatorname{U}}(x):={\underset{n\sim\gamma}{\operatorname{E}}{\left[{\mathcal{R}}{\left(x_n\right)}\right]}}$

We also fix an initial distribution over states $$\zeta_0 \in \Delta {\mathcal{S}}$$. In “classical” MDP theory, it is sufficient to consider a deterministic initial state $$s_0 \in {\mathcal{S}}$$, since the optimal policy doesn’t depend on the initial state anyway. However, quantilization is different since the worst-case cost function depends on the initial conditions.

We now assume that the cost function $${C}: {\mathcal{S}}^\omega \rightarrow {\mathbb{R}}$$ (or, the true utility function $${\operatorname{U}}^*$$) has the same form. That is, there is some penalty function $${P}: {\mathcal{S}}\rightarrow [0,\infty)$$ s.t.

${C}(x) = {\underset{n\sim\gamma}{\operatorname{E}}{\left[{P}{\left(x_n\right)}\right]}}$

Given a policy $$\pi: {\mathcal{S}}^*\times{\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}$$ (where the $${\mathcal{S}}^*$$ factor represents the past history the factor $${\mathcal{S}}$$ represents the current state), we define $$\operatorname{H}{\pi} \in \Delta{\mathcal{S}}^\omega$$ in the usual way (the distribution over histories resulting from $$\pi$$). Finally, we fix $$\eta \in (0,\infty)$$. We are now ready to define quantilization in this setting

# Definition 1

$$\pi^* : {\mathcal{S}}^*\times{\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}$$ is said to be quantilal relatively to reference policy $$\sigma: {\mathcal{S}}^*\times{\mathcal{S}}{\xrightarrow{\text{k}}}A$$ when

$\pi^* :\in {\underset{\pi: {\mathcal{S}}^*\times{\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}}{\operatorname{arg\,max}}\,}\inf_{{P}:{\mathcal{S}}\rightarrow[0,\infty)}{\left\{{\underset{\substack{x\sim\operatorname{H}{\pi}\\n\sim\gamma}}{\operatorname{E}}{\left[{\mathcal{R}}{\left(x_n\right)}-\eta{P}{\left(x_n\right)}\right]}}\;\middle\vert\;{\underset{\substack{x\sim\operatorname{H}{\sigma}\\n\sim\gamma}}{\operatorname{E}}{\left[{P}{\left(x_n\right)}\right]}}\leq1\right\}}$

We also define the quantilum value $${\operatorname{QV}}\in {\mathbb{R}}$$ by

${\operatorname{QV}}:= \sup_{\pi: {\mathcal{S}}^*\times{\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}}\inf_{{P}:{\mathcal{S}}\rightarrow[0,\infty)}{\left\{{\underset{\substack{x\sim\operatorname{H}{\pi}\\n\sim\gamma}}{\operatorname{E}}{\left[{\mathcal{R}}{\left(x_n\right)}-\eta{P}{\left(x_n\right)}\right]}}\;\middle\vert\;{\underset{\substack{x\sim\operatorname{H}{\sigma}\\n\sim\gamma}}{\operatorname{E}}{\left[{P}{\left(x_n\right)}\right]}}\leq1\right\}}$

($${\operatorname{QV}}$$ cannot be $$-\infty$$ since taking $$\pi=\sigma$$ yields a lower bound of $${\underset{\substack{x\sim\operatorname{H}{\sigma}\\n\sim\gamma}}{\operatorname{E}}{\left[{\mathcal{R}}{\left(x_n\right)}\right]}}-\eta$$.)

In the original quantilization formalism, the quantilal strategy can be described more explicitly, as sampling according to the reference measure $$\zeta$$ from some top fraction of actions ranked by expected utility. Here, we don’t have an analogous description, but we can in some sense evaluate the infimum over $${P}$$.

For any $$\pi: {\mathcal{S}}^*\times{\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}$$, define $$\operatorname{Z}{\pi}\in\Delta{\mathcal{S}}$$ by

$\operatorname{Z}{\pi}(s):={\underset{\substack{x\sim\operatorname{H}{\pi}\\n\sim\gamma}}{\operatorname{Pr}}{\left[x_n=s\right]}}$

For any $$\mu,\nu\in\Delta{\mathcal{S}}$$, the notation $${\operatorname{D}_{\infty}{\left(\mu\middle\vert\middle\vert\nu\right)}}$$ signifies the Renyi divergence of order $$\infty$$:

${\operatorname{D}_{\infty}{\left(\mu\middle\vert\middle\vert\nu\right)}} := \ln \max_{\substack{s\in{\mathcal{S}}\\ \mu(s)> 0}}\frac{\mu(s)}{\nu(s)}$

In general, $${\operatorname{D}_{\infty}{\left(\mu\middle\vert\middle\vert\nu\right)}} \in [0,\infty]$$.

# Proposition 1

$$\pi^* : {\mathcal{S}}^*\times{\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}$$ is quantilal relatively to reference policy $$\sigma: {\mathcal{S}}^*\times{\mathcal{S}}{\xrightarrow{\text{k}}}A$$ if and only if

$\pi^* \in {\underset{\pi: {\mathcal{S}}^*\times{\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}}{\operatorname{arg\,max}}\,}{\left({\underset{\operatorname{Z}{\pi}}{\operatorname{E}}{\left[{\mathcal{R}}\right]}}-\eta \exp{{\operatorname{D}_{\infty}{\left(\operatorname{Z}{\pi}\middle\vert\middle\vert\operatorname{Z}{\sigma}\right)}}}\right)}$

Also, we have

${\operatorname{QV}}= \sup_{\pi: {\mathcal{S}}^*\times{\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}}{\left({\underset{\operatorname{Z}{\pi}}{\operatorname{E}}{\left[{\mathcal{R}}\right]}}-\eta \exp{{\operatorname{D}_{\infty}{\left(\operatorname{Z}{\pi}\middle\vert\middle\vert\operatorname{Z}{\sigma}\right)}}}\right)}$

If the maximization in Proposition 1 was over arbitrary $$\xi \in \Delta{\mathcal{S}}$$ rather than $$\xi$$ of the form $$\operatorname{Z}{\pi}$$, we would get ordinary quantilization and sampling $$\xi$$ would be equivalent to sampling some top fraction of $$\zeta:=\operatorname{Z}{\sigma}$$. However, in general, the image of the $$\operatorname{Z}$$ operator is some closed convex set which is not the entire $$\Delta{\mathcal{S}}$$.

So far we considered arbitrary (non-stationary) policies. From classical MDP theory, we know that an optimal policy can always be chosen to be Markov:

# Definition 2

$$\pi: {\mathcal{S}}^* \times {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}$$ is said to be a Markov policy, when there is some $$\pi_\text{M}: {\mathbb{N}}\times S {\xrightarrow{\text{k}}}{\mathcal{A}}$$ s.t. $$\pi(h,s)=\pi_\text{M}{\left({\left\vert h \right\vert},s\right)}$$.

Note that a priori it might be unclear whether even a non-stationary quantilal policy. However, we have

# Proposition 2

For any $$\sigma: {\mathcal{S}}^* \times {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}$$, there exists a Markov policy which is quantilal relatively to $$\sigma$$.

Now assume that our time discount function is geometric, i.e. there exists $$\lambda\in[0,1)$$ s.t. $$\gamma(n)=(1-\lambda)\lambda^n$$. Then it is known than an optimal policy can be chosen to be stationary:

# Definition 3

$$\pi: {\mathcal{S}}^* \times {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}$$ is said to be a stationary policy, when there is some $$\pi_\text{T}: S {\xrightarrow{\text{k}}}{\mathcal{A}}$$ s.t. $$\pi(h,s)=\pi_\text{T}{\left(s\right)}$$.

Once again, the situation for quantilal policies is analogous:

# Proposition 3

If $$\gamma$$ is geometric, then for any $$\sigma: {\mathcal{S}}^* \times {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}$$, , there exists a stationary policy which is quantilal relatively to $$\sigma$$.

What is not analogous is that an optimal policy can be chosen to be deterministic whereas, of course, this is not the case for quantilal policies.

It is known that the value of an optimal policy depends on the parameters as a piecewise rational function, and in particular it converges as $$\lambda \rightarrow 1$$ and has a Taylor expansion at $$\lambda=1$$. The quantilum value has the same property.

# Proposition 4

$${\operatorname{QV}}$$ is a piecewise rational continuous function of $$\lambda$$, $$\eta$$, the matrix elements of $${\mathcal{T}}$$, the values of $${\mathcal{R}}$$, the values of $$\zeta_0$$ and the values of $$\operatorname{Z}\sigma$$, with a final number of “pieces”.

# Corollary 1

Assume that $$\sigma$$ is a stationary policy. Then, $${\operatorname{QV}}$$ converges as $$\lambda \rightarrow 1$$, holding all other parameters fixed (in the sense that, $$\sigma$$ is fixed whereas $$\operatorname{Z}\sigma$$ changes as a function of $$\lambda$$). It is analytic in $$\lambda$$ for some interval $${\left[\lambda_0,1\right]}$$ and therefore can be described by a Taylor expansion around $$\lambda=1$$ inside this interval.

Note that for optimal policies, Proposition 4 holds for a simpler reason. Specifically, the optimal policy is piecewise constant (since it’s deterministic) and there is a Blackwell policy i.e. a fixed policy which is optimal for any $$\lambda$$ sufficiently close to 1. And, it is easy to see that for a constant policy, the value is a rational function of $$\lambda$$. On the other hand, the quantilal policy is not guaranteed to be locally constant anywhere. Nevertheless the quantilum value is still piecewise rational.

Finally, we address the question of the computational complexity of quantilization. We prove the following

# Proposition 5

Assume geometric time discount. Assume further that $${\mathcal{R}}(s)$$, $${\mathcal{T}}{\left(t\;\middle\vert\;s,a\right)}$$, $$\zeta_0(s)$$, $$\lambda$$, $$\operatorname{Z}{\sigma}(s)$$ and $$\eta$$ are rational numbers. Then:

1. There is an algorithm for computing $${\operatorname{QV}}$$ which runs in time polynomial in the size of the input $${\mathcal{R}}$$, $${\mathcal{T}}$$, $$\zeta_0$$, $$\lambda$$, $$\operatorname{Z}\sigma$$ and $$\eta$$. Also, if $$\sigma$$ is stationary and $$\sigma{\left(t\;\middle\vert\;s,a\right)}$$ are rational, then $$\operatorname{Z}{\sigma}(s)$$ are also rational and can be computed from $$\sigma$$, $${\mathcal{T}}$$, $$\zeta_0$$ and $$\lambda$$ in polynomial time.

2. Given an additional rational input parameter $$\epsilon\in(0,1)$$, there is an algorithm for computing a stationary policy which is an $$\epsilon$$-equilibrium in the zero-sum game associated with quantilization, which runs in time polynomial in the size of the input and $$\ln\frac{1}{\epsilon}$$.

EDIT: In fact, it is possible to do better and compute an exact quantilal policy in polynomial time.

## Future Directions

To tease a little, here are some developments of this work that I’m planning:

• Apply quantilization to reinforcement learning, i.e. when we don’t know the MDP in advance. In particular, I believe that the principles of quantilization can be used not only to deal with misspecified rewards, but also to deal with traps to some extent (assuming it a priori known that the reference policy has at most a small probability of falling into a trap). This has some philosophical implications on how humans avoid traps.

• Unify that formalism with DRL. The role of the reference policy will be played by the advisor (thus the reference policy is not known in advance but is learned online). This means we can drop the sanity condition for the advisor, at the price of (i) having a regret bound defined w.r.t. some kind of quantilum value rather than optimal value (ii) having a term in the regret bound proportional to the (presumably small) rate of falling into traps when following the reference (advisor) policy. It should be possible to further develop that by unifying it with the ideas of catastrophe DRL.

• Deal with more general environments, e.g. POMDPs and continuous state spaces.

# Proposition A.1

$\inf_{{P}:{\mathcal{S}}\rightarrow[0,\infty)}{\left\{{\underset{\substack{x\sim\operatorname{H}{\pi}\\n\sim\gamma}}{\operatorname{E}}{\left[{\mathcal{R}}{\left(x_n\right)}-\eta{P}{\left(x_n\right)}\right]}}\;\middle\vert\;{\underset{\substack{x\sim\operatorname{H}{\sigma}\\n\sim\gamma}}{\operatorname{E}}{\left[{P}{\left(x_n\right)}\right]}}\leq1\right\}}={\underset{\operatorname{Z}{\pi}}{\operatorname{E}}{\left[{\mathcal{R}}\right]}}-\eta \exp{{\operatorname{D}_{\infty}{\left(\operatorname{Z}{\pi}\middle\vert\middle\vert\operatorname{Z}{\sigma}\right)}}}$

# Proof of Proposition A.1

The definition of $$\operatorname{Z}$$ implies that

${\underset{\substack{x\sim\operatorname{H}{\pi}\\n\sim\gamma}}{\operatorname{E}}{\left[{\mathcal{R}}{\left(x_n\right)}-\eta{P}{\left(x_n\right)}\right]}}={\underset{\operatorname{Z}\pi}{\operatorname{E}}{\left[{\mathcal{R}}-\eta P\right]}}$

Also

${\underset{\substack{x\sim\operatorname{H}{\sigma}\\n\sim\gamma}}{\operatorname{E}}{\left[{P}{\left(x_n\right)}\right]}}={\underset{\operatorname{Z}{\sigma}}{\operatorname{E}}{\left[P\right]}}$

Observe that

${\underset{\operatorname{Z}{\pi}}{\operatorname{E}}{\left[P\right]}} = \sum_{s\in{\mathcal{S}}}\operatorname{Z}{\pi}(s)P(s) \leq \max_{\substack{s \in {\mathcal{S}}\\ \operatorname{Z}{\pi}(s) > 0}}\frac{\operatorname{Z}{\pi}(s)}{\operatorname{Z}{\sigma}(s)}\cdot\sum_{s\in{\mathcal{S}}}\operatorname{Z}{\sigma}(s)P(s) = \exp{{\operatorname{D}_{\infty}{\left(\operatorname{Z}\pi\middle\vert\middle\vert\operatorname{Z}\sigma\right)}}} \cdot {\underset{\operatorname{Z}{\sigma}}{\operatorname{E}}{\left[P\right]}}$

It follows that for any $$P$$ that satisfies the constraint $${\underset{\substack{x\sim\operatorname{H}{\sigma}\\n\sim\gamma}}{\operatorname{E}}{\left[{P}{\left(x_n\right)}\right]}} \leq 1$$, we have $${\underset{\operatorname{Z}{\sigma}}{\operatorname{E}}{\left[{P}\right]}} \leq 1$$ and therefore

${\underset{\substack{x\sim\operatorname{H}{\pi}\\n\sim\gamma}}{\operatorname{E}}{\left[{\mathcal{R}}{\left(x_n\right)}-\eta{P}{\left(x_n\right)}\right]}} = {\underset{\operatorname{Z}{\pi}}{\operatorname{E}}{\left[{\mathcal{R}}\right]}} - \eta {\underset{\operatorname{Z}{\pi}}{\operatorname{E}}{\left[{P}\right]}} \geq {\underset{\operatorname{Z}{\pi}}{\operatorname{E}}{\left[{\mathcal{R}}\right]}} - \eta\exp{{\operatorname{D}_{\infty}{\left(\operatorname{Z}\pi\middle\vert\middle\vert\operatorname{Z}\sigma\right)}}}$

To show that the inequality can be arbitrarily close to equality, choose $$s^* \in {\mathcal{S}}$$ s.t. $$\operatorname{Z}{\pi}(s^*) > 0$$ and $$\frac{\operatorname{Z}{\pi}{\left(s^*\right)}}{\operatorname{Z}{\sigma}{\left(s^*\right)}} = \exp{{\operatorname{D}_{\infty}{\left(\operatorname{Z}\pi\middle\vert\middle\vert\operatorname{Z}\sigma\right)}}}$$. If $$\operatorname{Z}{\sigma}{\left(s^*\right)} > 0$$, we can define $${P}^*$$ by

${P}^*(s):=\begin{cases} \operatorname{Z}{\sigma}(s^*)^{-1} \text{ if } s=s^* \\ 0 \text{ otherwise} \end{cases}$

Clearly $${\underset{\operatorname{Z}{\sigma}}{\operatorname{E}}{\left[{P}^*\right]}} = 1$$ and $${\underset{\operatorname{Z}{\pi}}{\operatorname{E}}{\left[P^*\right]}} = \exp{{\operatorname{D}_{\infty}{\left(\operatorname{Z}\pi\middle\vert\middle\vert\operatorname{Z}\sigma\right)}}}$$. We get

${\underset{\substack{x\sim\operatorname{H}{\pi}\\n\sim\gamma}}{\operatorname{E}}{\left[{\mathcal{R}}{\left(x_n\right)}-\eta{P}^*{\left(x_n\right)}\right]}} = {\underset{\operatorname{Z}{\pi}}{\operatorname{E}}{\left[{\mathcal{R}}\right]}} - \eta {\underset{\operatorname{Z}{\pi}}{\operatorname{E}}{\left[{P}^*\right]}} = {\underset{\operatorname{Z}{\pi}}{\operatorname{E}}{\left[{\mathcal{R}}\right]}} - \eta\exp{{\operatorname{D}_{\infty}{\left(\operatorname{Z}\pi\middle\vert\middle\vert\operatorname{Z}\sigma\right)}}}$

In the case $$\operatorname{Z}{\sigma}{\left(s^*\right)} > 0$$, we can take any $$M > 0$$ and define $${P}^M$$ by

${P}^M(s):=\begin{cases} M \text{ if } s=s^* \\ 0 \text{ otherwise} \end{cases}$

Clearly $${\underset{\operatorname{Z}{\sigma}}{\operatorname{E}}{\left[{P}^M\right]}} = 0 \leq 1$$ and $${\underset{\operatorname{Z}{\pi}}{\operatorname{E}}{\left[P^M\right]}} = M \cdot \operatorname{Z}{\pi}{\left(s^*\right)}$$. We get

${\underset{\substack{x\sim\operatorname{H}{\pi}\\n\sim\gamma}}{\operatorname{E}}{\left[{\mathcal{R}}{\left(x_n\right)}-\eta{P}^M{\left(x_n\right)}\right]}} = {\underset{\operatorname{Z}{\pi}}{\operatorname{E}}{\left[{\mathcal{R}}\right]}} - \eta {\underset{\operatorname{Z}{\pi}}{\operatorname{E}}{\left[{P}^M\right]}} = {\underset{\operatorname{Z}{\pi}}{\operatorname{E}}{\left[{\mathcal{R}}\right]}} - M \cdot \operatorname{Z}{\pi}{\left(s^*\right)}$

Since $$\operatorname{Z}{\pi}{\left(s^*\right)} > 0$$, we can make this expression arbitrarily low and therefore the infimum is $$-\infty$$ which is what we need since in this case $${\operatorname{D}_{\infty}{\left(\operatorname{Z}{\pi}\middle\vert\middle\vert\operatorname{Z}{\sigma}\right)}} = \infty$$.

Proposition 1 now follows immediately from Proposition A.1.

We will use the notation $$\Pi:={\left\{{\mathcal{S}}^* \times {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}\right\}}$$ (the space of all policies). We also have $$\Pi_{\text{M}}:={\left\{{\mathbb{N}}\times {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}\right\}}$$ (the space of Markov policies) and $$\Pi_{\text{T}}:={\left\{ {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}\right\}}$$ (the space of stationary policies). Mildly abusing notation, we will view $$\Pi_{\text{M}}$$ as a subspace of $$\Pi$$ and $$\Pi_{\text{T}}$$ as a subspace of $$\Pi_{\text{M}}$$.

# Proposition A.2

For any $$\sigma: {\mathcal{S}}^* \times {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}$$, there exists some policy which is quantilal relatively to $$\sigma$$.

# Proof of Proposition A.2

$$\Pi$$ is the product of a countable number of copies of $$\Delta{\mathcal{A}}$$ (indexed by $${\mathcal{S}}^* \times {\mathcal{S}}$$). $$\Delta{\mathcal{A}}$$ is naturally a topological space (a simplex), and we can thereby equip $$\Pi$$ by the product topology. By Tychonoff’s theorem, $$\Pi$$ is compact. Moreover, it is easy to see that $$\operatorname{Z}: \Pi \rightarrow \Delta{\mathcal{S}}$$ is a continuous mapping. Finally, observe that $${\operatorname{D}_{\infty}{\left(\xi\middle\vert\middle\vert\operatorname{Z}{\sigma}\right)}}$$ is lower semicontinuous in $$\xi$$ (since it is the maximum of a number of continuous functions) and therefore $${\underset{\xi}{\operatorname{E}}{\left[{\mathcal{R}}\right]}} - \eta\exp{{\operatorname{D}_{\infty}{\left(\xi\middle\vert\middle\vert\operatorname{Z}{\sigma}\right)}}}$$ is upper semicontinuous in $$\xi$$. By the extreme value theorem, it follows that a quantilal policy exists.

For any $$n\in{\mathbb{N}}$$, we define $$\operatorname{Z}_n: \Pi {\xrightarrow{\text{k}}}{\mathcal{S}}$$ by

$\operatorname{Z}_n{\pi}(s) := {\underset{x\sim\operatorname{H}{\pi}}{\operatorname{Pr}}{\left[x_n=s\right]}}$

# Proof of Proposition 2

By Proposition A.2, there is a quantilal policy $$\pi^*: {\mathcal{S}}^* \times {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}$$. We define $$\pi^\dagger: {\mathbb{N}}\times {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}$$ by

$\pi^\dagger(n,s):={\underset{x\sim\operatorname{H}{\pi^*}}{\operatorname{E}}{\left[\pi^*(x_{:n},s)\;\middle\vert\;x_n=s\right]}}$

We now prove by induction that for any $$n\in{\mathbb{N}}$$, $$\operatorname{Z}_n{\pi^*} = \operatorname{Z}_n{\pi^\dagger}$$.

For $$n=0$$, we have $$\operatorname{Z}_0{\pi^*} = \operatorname{Z}_0{\pi^\dagger} = \zeta_0$$.

For any $$n \in {\mathbb{N}}$$ and any $$\pi: {\mathcal{S}}^* \times {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}$$, we have

$\operatorname{Z}_{n+1}{\pi}(t) = {\underset{x\sim\operatorname{H}{\pi}}{\operatorname{Pr}}{\left[x_{n+1}=t\right]}} = {\underset{s\sim\operatorname{Z}_n{\pi}}{\operatorname{E}}{\left[{\underset{x\sim\operatorname{H}{\pi}}{\operatorname{Pr}}{\left[x_{n+1} = t\;\middle\vert\;x_n=s\right]}}\right]}} = {\underset{s\sim\operatorname{Z}_n{\pi}}{\operatorname{E}}{\left[{\underset{\substack{x\sim\operatorname{H}{\pi} \\ a\sim\pi{\left(x_{:n},s\right)}}}{\operatorname{E}}{\left[{\mathcal{T}}{\left(t\;\middle\vert\;s,a\right)}\;\middle\vert\;x_n=s\right]}}\right]}}$

To complete the induction step, observe that, by definition of $$\pi^\dagger$$

${\underset{\substack{x\sim\operatorname{H}{\pi} \\ a\sim\pi^*{\left(x_{:n},s\right)}}}{\operatorname{E}}{\left[{\mathcal{T}}{\left(t\;\middle\vert\;s,a\right)}\;\middle\vert\;x_n=s\right]}} = {\underset{a\sim\pi^\dagger{\left(n,s\right)}}{\operatorname{E}}{\left[{\mathcal{T}}{\left(t\;\middle\vert\;s,a\right)}\right]}}={\underset{\substack{x\sim\operatorname{H}{\pi} \\ a\sim\pi^\dagger{\left(n,s\right)}}}{\operatorname{E}}{\left[{\mathcal{T}}{\left(t\;\middle\vert\;s,a\right)}\;\middle\vert\;x_n=s\right]}}$

Now, for any $$\pi$$, $$\operatorname{Z}{\pi}={\underset{n\sim\gamma}{\operatorname{E}}{\left[\operatorname{Z}_n{\pi}\right]}}$$ and therefore $$\operatorname{Z}{\pi^\dagger} = \operatorname{Z}{\pi^*}$$. We conclude that $$\pi^\dagger$$ is also quantilal.

# Proof of Proposition 3

By Proposition 2, there is $$\pi^*: {\mathbb{N}}\times {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}$$ which is a Markov quantilal policy. We have

$\operatorname{Z}_{n+1}{\pi^*}(t) = {\underset{x\sim\operatorname{H}{\pi^*}}{\operatorname{Pr}}{\left[x_{n+1}=t\right]}} = {\underset{s\sim\operatorname{Z}_n{\pi^*}}{\operatorname{E}}{\left[{\underset{x\sim\operatorname{H}{\pi^*}}{\operatorname{Pr}}{\left[x_{n+1} = t\;\middle\vert\;x_n=s\right]}}\right]}} = {\underset{\substack{s\sim\operatorname{Z}_n{\pi^*}\\a\sim\pi^*{\left(n,s\right)}}}{\operatorname{E}}{\left[{\mathcal{T}}{\left(t\;\middle\vert\;s,a\right)}\right]}}$

Taking the expected value of this equation w.r.t. $$n\sim\gamma$$ we get

${\underset{n\sim\gamma}{\operatorname{E}}{\left[\operatorname{Z}_{n+1}{\pi^*}\right]}} = {\underset{\substack{n\sim\gamma\\s\sim\operatorname{Z}_n{\pi^*}\\a\sim\pi^*{\left(n,s\right)}}}{\operatorname{E}}{\left[{\mathcal{T}}(s,a)\right]}}$

Also, we have

$\operatorname{Z}\pi^* = {\underset{n\sim\gamma}{\operatorname{E}}{\left[\operatorname{Z}_n \pi^*\right]}} = (1-\lambda)\sum_{n=0}^\infty \lambda^n \operatorname{Z}_n \pi^* = (1-\lambda){\left(\zeta_0 + \lambda \sum_{n=0}^\infty \lambda^n \operatorname{Z}_{n+1}\pi^*\right)}=(1-\lambda)\zeta_0 + \lambda {\underset{n\sim\gamma}{\operatorname{E}}{\left[\operatorname{Z}_{n+1}{\pi^*}\right]}}$

It follows that

$\operatorname{Z}\pi^* = (1-\lambda)\zeta_0 + \lambda {\underset{\substack{n\sim\gamma\\s\sim\operatorname{Z}_n{\pi^*}\\a\sim\pi^*{\left(n,s\right)}}}{\operatorname{E}}{\left[{\mathcal{T}}(s,a)\right]}}$

$\operatorname{Z}\pi^* = (1-\lambda)\zeta_0 + \lambda {\underset{s\sim\operatorname{Z}\pi^*}{\operatorname{E}}{\left[{\underset{\substack{n\sim\gamma\\t\sim\operatorname{Z}_n\pi^*\\a\sim\pi^*(n,s)}}{\operatorname{E}}{\left[{\mathcal{T}}(s,a)\;\middle\vert\;t=s\right]}}\right]}}$

Define $$\pi^\dagger: {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}$$ by

$\pi^\dagger(s) := {\underset{\substack{n\sim\gamma\\t\sim\operatorname{Z}_n\pi^*}}{\operatorname{E}}{\left[\pi^*(n,s)\;\middle\vert\;t=s\right]}}$

We get

$\operatorname{Z}\pi^* = (1-\lambda)\zeta_0 + \lambda {\underset{s\sim\operatorname{Z}\pi^*}{\operatorname{E}}{\left[{\underset{a\sim\pi^\dagger(s)}{\operatorname{E}}{\left[{\mathcal{T}}(s,a)\right]}}\right]}}$

Define the linear operator $$T^\dagger: {\mathbb{R}}^{\mathcal{S}}\rightarrow {\mathbb{R}}^{\mathcal{S}}$$ by the matrix

$T^\dagger_{ts} := {\underset{a\sim\pi^\dagger(s)}{\operatorname{E}}{\left[{\mathcal{T}}{\left(t\;\middle\vert\;s,a\right)}\right]}}$

Viewing $$\Delta{\mathcal{S}}$$ as a subset of $${\mathbb{R}}^{\mathcal{S}}$$, we get

$\operatorname{Z}\pi^* = (1-\lambda)\zeta_0 + \lambda T^\dagger\operatorname{Z}\pi^*$

$\operatorname{Z}\pi^* = (1-\lambda){\left(\boldsymbol{1}-\lambda T^\dagger\right)}^{-1}\zeta_0$

On the other hand, we have

$\operatorname{Z}\pi^\dagger = (1-\lambda)\sum_{n=0}^\infty \lambda^n T^{\dagger n} \zeta_0 = (1-\lambda){\left(\boldsymbol{1}-\lambda T^\dagger\right)}^{-1}\zeta_0$

Therefore, $$\operatorname{Z}\pi^\dagger = \operatorname{Z}\pi^*$$ and $$\pi^\dagger$$ is also quantilal.

# Proposition A.3

Assume geometric time discount. Consider any $$\zeta\in\Delta{\mathcal{S}}$$. Define the linear operators $$I: {\mathbb{R}}^{\mathcal{S}}\rightarrow {\mathbb{R}}^{{\mathcal{S}}\times{\mathcal{A}}}$$ and $$T: {\mathbb{R}}^{\mathcal{S}}\rightarrow {\mathbb{R}}^{{\mathcal{S}}\times{\mathcal{A}}}$$ by the matrices

$I_{sa,t}:=[[t=s]]$

$T_{sa,t}:={\mathcal{T}}{\left(t\;\middle\vert\;s,a\right)}$

Define the linear operator $$A: {\mathbb{R}}^{\mathcal{S}}\oplus {\mathbb{R}}^{\mathcal{S}}\rightarrow {\mathbb{R}}^{{\mathcal{S}}\times{\mathcal{A}}}\oplus{\mathbb{R}}^{{\mathcal{S}}}\oplus{\mathbb{R}}$$ by

$A(V,P):={\left({\left(I-\lambda T\right)}V + \eta(1-\lambda)IP,\ P,\ -{\underset{\zeta}{\operatorname{E}}{\left[P\right]}}\right)}$

Define $$B\in {\mathbb{R}}^{{\mathcal{S}}\times{\mathcal{A}}}\oplus{\mathbb{R}}^{{\mathcal{S}}}\oplus{\mathbb{R}}$$ by

$B:={\left((1-\lambda)I{\mathcal{R}},\ \boldsymbol{0},\ -1\right)}$

Define $$D \subseteq {\mathbb{R}}^{\mathcal{S}}\oplus {\mathbb{R}}^{\mathcal{S}}$$ as follows

$D:={\left\{X \in {\mathbb{R}}^{\mathcal{S}}\oplus {\mathbb{R}}^{\mathcal{S}}\;\middle\vert\;AX \geq B\right\}}$

Here, vector inequalities are understood to be componentwise.

Then,

${\operatorname{QV}}^\zeta := \sup_{\pi: {\mathcal{S}}^*\times{\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}}{\left({\underset{\operatorname{Z}{\pi}}{\operatorname{E}}{\left[{\mathcal{R}}\right]}}-\eta \exp{{\operatorname{D}_{\infty}{\left(\operatorname{Z}{\pi}\middle\vert\middle\vert\zeta\right)}}}\right)} = \inf_{(V,P)\in D}{{\underset{\zeta_0}{\operatorname{E}}{\left[V\right]}}}$

In particular, if $$\zeta=\operatorname{Z}\sigma$$, the above expression describes $${\operatorname{QV}}$$

# Proof of Proposition A.3

Denote $$\Pi_{\det}:={\left\{{\mathcal{S}}^* \times {\mathcal{S}}\rightarrow {\mathcal{A}}\right\}}$$. As usual in extensive-form games, behavioral strategies are equivalent to mixed strategies and therefore the image of the pushforward operator $$\operatorname{Z}_*: \Delta\Pi_{\det} \rightarrow \Delta{\mathcal{S}}$$ is the same as the image of $$\operatorname{Z}: \Pi \rightarrow \Delta{\mathcal{S}}$$. It follows that

${\operatorname{QV}}= \sup_{\mu \in \Delta\Pi_{\det}}\inf_{{P}:{\mathcal{S}}\rightarrow[0,\infty)}{\left\{{\underset{\operatorname{Z}_*\mu}{\operatorname{E}}{\left[{\mathcal{R}}-\eta{P}\right]}}\;\middle\vert\;{\underset{\zeta}{\operatorname{E}}{\left[{P}\right]}}\leq1\right\}}$

$$\Pi_{\det}$$ is a compact Polish space (in the sense of the product topology) and therefore $$\Delta\Pi_{\det}$$ is compact in the weak topology. By Sion’s minimax theorem

${\operatorname{QV}}= \inf_{{P}:{\mathcal{S}}\rightarrow[0,\infty)}\max_{\pi \in \Pi_{\det}}{\left\{{\underset{\operatorname{Z}\pi}{\operatorname{E}}{\left[{\mathcal{R}}-\eta{P}\right]}}\;\middle\vert\;{\underset{\zeta}{\operatorname{E}}{\left[{P}\right]}}\leq1\right\}}$

Now consider any $$X=(V,P) \in D$$. $$AX \geq B$$ implies (looking at the $${\mathbb{R}}^{{\mathcal{S}}\times{\mathcal{A}}}$$ component) that

${\left(I-\lambda T\right)}V + \eta(1-\lambda)IP \geq (1-\lambda)I{\mathcal{R}}$

${\left(I-\lambda T\right)}V\geq (1-\lambda)I({\mathcal{R}}-\eta P)$

$V(s) - \lambda \sum_{t \in {\mathcal{S}}} {\mathcal{T}}{\left(t\;\middle\vert\;s,a\right)} V(t) \geq (1-\lambda){\left({\mathcal{R}}(s)-\eta P(s)\right)}$

$V(s) \geq (1-\lambda){\left({\mathcal{R}}(s)-\eta P(s)\right)} + \lambda \max_{a \in {\mathcal{A}}} \sum_{t \in {\mathcal{S}}} {\mathcal{T}}{\left(t\;\middle\vert\;s,a\right)} V(t)$

Therefore, there is some $${\mathcal{R}}' \geq {\mathcal{R}}$$ s.t.

$V(s) = (1-\lambda){\left({\mathcal{R}}'(s)-\eta P(s)\right)} + \lambda \max_{a \in {\mathcal{A}}} \sum_{t \in {\mathcal{S}}} {\mathcal{T}}{\left(t\;\middle\vert\;s,a\right)} V(t)$

The above is the Bellman equation for a modified MDP with reward function $${\mathcal{R}}'-\eta P$$. Denote $$\operatorname{Z}_s$$ the version of the $$\operatorname{Z}$$ operator for the initial state $$s$$ (instead of $$\zeta_0$$). We get

$V(s) = \max_{\pi\in\Pi_{\det}}{{\underset{\operatorname{Z}_s\pi}{\operatorname{E}}{\left[{\mathcal{R}}'-\eta P\right]}}} \geq \max_{\pi\in\Pi_{\det}}{{\underset{\operatorname{Z}_s\pi}{\operatorname{E}}{\left[{\mathcal{R}}-\eta P\right]}}}$

Observing that $${\underset{s\sim\zeta_0}{\operatorname{E}}{\left[\operatorname{Z}_s\pi\right]}} = \operatorname{Z}\pi$$, we get

${\underset{\zeta_0}{\operatorname{E}}{\left[V\right]}} \geq {\underset{s\sim\zeta_0}{\operatorname{E}}{\left[\max_{\pi\in\Pi_{\det}}{{\underset{\operatorname{Z}_s\pi}{\operatorname{E}}{\left[{\mathcal{R}}-\eta P\right]}}}\right]}} \geq \max_{\pi\in\Pi_{\det}} {\underset{\operatorname{Z}\pi}{\operatorname{E}}{\left[{\mathcal{R}}- \eta P\right]}}$

On the other hand, for any $$P\in{\mathbb{R}}^{\mathcal{S}}$$ s.t. $$P\geq\boldsymbol{0}$$ and $${\underset{\operatorname{Z}\sigma}{\operatorname{E}}{\left[P\right]}} \leq 1$$ (these inequalities correspond to the $${\mathbb{R}}^{\mathcal{S}}\oplus{\mathbb{R}}$$ component of $$AX \geq B$$), there is $$(V,P)\in D$$ s.t.

${\underset{\zeta_0}{\operatorname{E}}{\left[V\right]}} = \max_{\pi\in\Pi_{\det}} {\underset{\operatorname{Z}\pi}{\operatorname{E}}{\left[{\mathcal{R}}- \eta P\right]}}$

Namely, this $$V$$ is the solution of the Bellman equation for the reward function $${\mathcal{R}}-\eta P$$. Therefore, we have

$\max_{\pi\in\Pi_{\det}} {\underset{\operatorname{Z}\pi}{\operatorname{E}}{\left[{\mathcal{R}}- \eta P\right]}} = \min_{V\in{\mathbb{R}}^{\mathcal{S}}}{\left\{{\underset{\zeta_0}{\operatorname{E}}{\left[V\right]}}\;\middle\vert\;(V,P)\in D\right\}}$

Taking the infimum of both sides over $$P$$ inside the domain $${\left\{P\geq\boldsymbol{0},\ {\underset{\zeta}{\operatorname{E}}{\left[P\right]}} \leq 1\right\}}$$ we get

${\operatorname{QV}}^\zeta = \inf_{{P}:{\mathcal{S}}\rightarrow[0,\infty)}\max_{\pi \in \Pi_{\det}}{\left\{{\underset{\operatorname{Z}\pi}{\operatorname{E}}{\left[{\mathcal{R}}-\eta{P}\right]}}\;\middle\vert\;{\underset{\zeta}{\operatorname{E}}{\left[{P}\right]}}\leq1\right\}} = \inf_{(V,P) \in D}{{\underset{\zeta_0}{\operatorname{E}}{\left[V\right]}}}$

# Proof of Proposition 4

Consider the characterization of $${\operatorname{QV}}$$ in Proposition A.3. By general properties of systems of linear inequalities, there is some $$J \subseteq ({\mathcal{S}}\times {\mathcal{A}}) \sqcup {\mathcal{S}}\sqcup \{\bullet\}$$ s.t.

$D^\sharp_J := {\left\{X \in {\mathbb{R}}^{\mathcal{S}}\oplus {\mathbb{R}}^{\mathcal{S}}\;\middle\vert\;A_J X = B_J\right\}} \subseteq {\underset{(V,P) \in D}{\operatorname{arg\,min}}\,}{{\underset{\zeta_0}{\operatorname{E}}{\left[V\right]}}}$

Here, the notation $$A_J$$ means taking the submatrix of $$A$$ consisting of the rows corresponding to $$J$$. Similarly, $$B_J$$ is the subvector of $$B$$ consisting of the components corresponding to $$J$$.

(To see this, use the fact that the minimum of a linear function on a convex set is always attained on the boundary, and apply induction by dimension.)

Moreover, $$D_J^\sharp$$ has to be a single point $$X_J \in {\mathbb{R}}^{\mathcal{S}}\oplus {\mathbb{R}}^{\mathcal{S}}$$. Indeed, if it has more than one point then it contains a straight line $$L$$. The projection of $$L$$ on the second $${\mathbb{R}}^{\mathcal{S}}$$ has to be a single point $$P_0$$, otherwise some point in $$L$$ violates the inequality $$P \geq \boldsymbol{0}$$ which would contradict $$L \subseteq D$$. Therefore, the projection of $$L$$ on the first $${\mathbb{R}}^{\mathcal{S}}$$ is also a straight line $$L'$$. As in the proof of Proposition A.3, for any $$(V,P) \in D$$, $$V(s)$$ is an upper bound on the value of state $$s$$ in the MDP with reward function $${\mathcal{R}}-\eta P$$. In particular, if $${\left(V,P_0\right)} \in D$$ then $$V(s) \geq \min_{t\in{\mathcal{S}}} {\left(R(t)-\eta P_0(t)\right)}$$. However, since $$L'$$ is a line, there must be some $$s^*\in{\mathcal{S}}$$ s.t. $$V{\left(s^*\right)}$$ can be any real number for some $${\left(V,P_0\right)} \in L$$. This is a contradiction.

Denote $$Q: {\mathbb{R}}^{\mathcal{S}}\oplus {\mathbb{R}}^{\mathcal{S}}\rightarrow {\mathbb{R}}^{\mathcal{S}}$$ the projection operator on the first direct summand. It follows that we can always choose $$J$$ s.t. $${\left\vert J \right\vert} = 2{\left\vert {\mathcal{S}}\right\vert}$$ and we have

$X_J = A_J^{-1} B_J$

${\operatorname{QV}}= {\underset{\zeta_0}{\operatorname{E}}{\left[QA_J^{-1} B_J\right]}}$

For each $$J$$, the expression on the right hand side is a rational function of $$\zeta_0$$ and the matrix elements of $$A$$ and $$B$$ which, in turn, are polynomials in the parameters the dependence on which we are trying to establish. Therefore, this expression is a rational function in those parameters (unless $$\det A_J$$ vanishes identically, in which case this $$J$$ can ignored). So, $${\operatorname{QV}}$$ always equals one of a finite set of rational functions (corresponding to difference choices of $$J$$). The continuity of $${\operatorname{QV}}$$ also easily follows from its characterization as $$\min_{(V,P) \in D}{{\underset{\zeta_0}{\operatorname{E}}{\left[V\right]}}}$$.

# Proposition A.4

Assume geometric time discount and stationary $$\sigma$$. Then, $$\operatorname{Z}\sigma$$ is a rational function of $$\sigma$$, $${\mathcal{T}}$$, $$\zeta_0$$ and $$\lambda$$ with rational coefficients.

# Proof of Proposition A.4

Define the linear operator $$T^\sigma: {\mathbb{R}}^{\mathcal{S}}\rightarrow {\mathbb{R}}^{\mathcal{S}}$$ by the matrix

$T^\sigma_{ts} = {\underset{a\sim\sigma(s)}{\operatorname{E}}{\left[{\mathcal{T}}{\left(t\;\middle\vert\;s,a\right)}\right]}}$

We have

$\operatorname{Z}\sigma = (1-\lambda)\sum_{n=0}^\infty \lambda^n T^{\sigma n} \zeta_0 = (1-\lambda){\left(\boldsymbol{1}-\lambda T^\sigma\right)}^{-1}\zeta_0$

# Proof of Corollary 1

By Propositions 4 and A.4, $${\operatorname{QV}}$$ is a continuous piecewise rational function of $$\lambda$$ with a finite number of pieces. Two rational functions can only coincide at a finite number of points (since a polynomial can only have a finite number of roots), therefore there is only a finite number of values of $$\lambda$$ in which $${\operatorname{QV}}$$ “switches” from one rational function to another. It follows that there is some $$\lambda_0\in[0,1)$$ s.t. $${\operatorname{QV}}$$ is a fixed rational function of $$\lambda$$ in the interval $$[\lambda_0,1)$$.

Moreover, it always holds that

$\min_{s\in{\mathcal{S}}} {\mathcal{R}}(s) - \eta \leq {\operatorname{QV}}\leq \max_{s\in{\mathcal{S}}} {\mathcal{R}}(s)$

The first inequality holds since, the guaranteed performance of the quantilal policy is at least as good as the guaranteed performance of $$\sigma$$. The second inequality is a consequence of the requirement that $$P$$ is non-negative.

It follows that $${\operatorname{QV}}$$ cannot have a pole at $$\lambda = 1$$ and therefore must converge to a finite value there.

# Proof of Proposition 5

The algorithm for $${\operatorname{QV}}$$ is obtained from Proposition A.3 using linear programming.

The claim regarding $$\operatorname{Z}\sigma$$ for stationary $$\sigma$$ follows from Proposition A.4.

We now describe the algorithm for computing an $$\epsilon$$-equilibrium.

For any $$a \in {\mathcal{A}}$$, define $$d_a: {\mathcal{A}}\sqcup \{\bot\} \rightarrow {\mathcal{A}}$$ by

$d_a(b) := \begin{cases} a \text{ if } b=\bot \\ b \text{ otherwise} \end{cases}$

Consider any $$\beta: {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}\sqcup \{\bot\}$$. We define $${\mathcal{T}}_\beta: {\mathcal{S}}\times {\mathcal{A}}{\xrightarrow{\text{k}}}{\mathcal{S}}$$ by

${\mathcal{T}}_\beta(s,a) := {\underset{b \sim \beta(s)}{\operatorname{E}}{\left[{\mathcal{T}}{\left(s,d_a(b)\right)}\right]}}$

Let $$\operatorname{Z}_\beta: \Pi {\xrightarrow{\text{k}}}{\mathcal{S}}$$ be the $$\operatorname{Z}$$-operator for the MDP with transition kernel $${\mathcal{T}}_\beta$$, and define

${\operatorname{QV}}_\beta := \sup_{\pi: {\mathcal{S}}^*\times{\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}}{\left({\underset{\operatorname{Z}_\beta{\pi}}{\operatorname{E}}{\left[{\mathcal{R}}\right]}}-\eta \exp{{\operatorname{D}_{\infty}{\left(\operatorname{Z}_\beta{\pi}\middle\vert\middle\vert\operatorname{Z}{\sigma}\right)}}}\right)}$

It is easy to see that $${\operatorname{QV}}_\beta = {\operatorname{QV}}$$ if an only if there is $$\pi: {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}$$ quantilal s.t. $$\pi{\left(a\;\middle\vert\;s\right)} \geq \beta{\left(a\;\middle\vert\;s\right)}$$. Indeed, the MDP with kernel $${\mathcal{T}}_\beta$$ is equivalent to the original MDP under the constraint that, when in state $$s$$, any action $$a$$ has to be taken with the minimal probability $$\beta{\left(a\;\middle\vert\;s\right)}$$. In particular $${\operatorname{QV}}_\beta \leq {\operatorname{QV}}$$ (since we constrained the possible policies but not the possible penalties). So, if $$\pi$$ as above exists, then we can use it to construct a stationary policy for the new MDP with guaranteed value $${\operatorname{QV}}$$. Conversely, if the new MDP has a stationary policy with guaranteed value $${\operatorname{QV}}$$ then it can be used to construct the necessary $$\pi$$.

Using Proposition A.3, we can compute $${\operatorname{QV}}_\beta$$ for any rational $$\beta$$ by linear programming. This allows us to find the desired policy by binary search on $$\beta$$, one (state,action) pair at a time.

### NEW DISCUSSION POSTS

[Note: This comment is three
 by Ryan Carey on A brief note on factoring out certain variables | 0 likes

There should be a chat icon
 by Alex Mennen on Meta: IAFF vs LessWrong | 0 likes

Apparently "You must be
 by Jessica Taylor on Meta: IAFF vs LessWrong | 1 like

There is a replacement for
 by Alex Mennen on Meta: IAFF vs LessWrong | 1 like

Regarding the physical
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think that we should expect
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think I understand your
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

This seems like a hack. The
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

After thinking some more,
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yes, I think that we're
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

My intuition is that it must
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

To first approximation, a
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Actually, I *am* including
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yeah, when I went back and
 by Alex Appel on Optimal and Causal Counterfactual Worlds | 0 likes

> Well, we could give up on
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes