Intelligent Agent Foundations Forum
Computing an exact quantilal policy
discussion post by Vadim Kosoy 45 days ago

Earlier we established that a quantilal policy can be approximated to any given accuracy in polynomial time (see “Proposition 5”). Now we show that an exact quantilal policy can also be computed in polynomial time (in particular, there is always a rational quantilal policy).

We assume geometric time discount throughout.


Lemma 1

Consider \(\xi \in \Delta{\mathcal{S}}\). Define the linear operators \(E: {\mathbb{R}}^{{\mathcal{S}}\times {\mathcal{A}}} \rightarrow {\mathbb{R}}^{\mathcal{S}}\) and \(T: {\mathbb{R}}^{{\mathcal{S}}\times {\mathcal{A}}} \rightarrow {\mathbb{R}}^{\mathcal{S}}\) by

\[E_{t,sa} := [[t = s]]\]

\[T_{t,sa} := {\mathcal{T}}{\left(t\;\middle\vert\;s,a\right)}\]

(Note that this \(T\) is the transpose of the \(T\) defined in “Proposition A.3” of the previous essay.)

Then, \(\xi \in \operatorname{Im}{\operatorname{Z}}\) if and only if there is \(\phi \in \Delta({\mathcal{S}}\times{\mathcal{A}})\) s.t.

\[E\phi = \xi\]

\[(E-\lambda T) \phi = (1-\lambda)\zeta_0\]


Proof of Lemma 1

(This is actually well known, but we spell out the proof to be self-contained.)

Suppose that \(\xi \in \operatorname{Im}{\operatorname{Z}}\). We already know that this implies that there is a stationary policy \(\pi:{\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}\) s.t. \(\operatorname{Z}\pi = \xi\) (we abuse notation in the obvious way): see the proofs of “Proposition 2” and “Proposition 3”. Define the linear operator \(T^\pi: {\mathbb{R}}^{\mathcal{S}}\rightarrow {\mathbb{R}}^{\mathcal{S}}\) by

\[T_{ts}^\pi := {\underset{a\sim\pi(s)}{\operatorname{E}}{\left[{\mathcal{T}}{\left(t\;\middle\vert\;s,a\right)}\right]}}\]

It follows that

\[\xi = (1-\lambda)\sum_{n=0}^\infty \lambda^n {\left(T^\pi\right)}^n \zeta_0 = (1-\lambda){\left(\boldsymbol{1}-\lambda T^\pi\right)}^{-1} \zeta_0\]

\[{\left(\boldsymbol{1}-\lambda T^\pi\right)}\xi = (1-\lambda)\zeta_0\]

Define \(\phi\) by

\[\phi(s,a) := \xi(s) \pi(a \mid s)\]

We have

\[T^\pi\xi = \sum_{s\in{\mathcal{S}}} {\underset{a\sim\pi(s)}{\operatorname{E}}{\left[{\mathcal{T}}(s,a)\right]}} \xi(s) = \sum_{\substack{s\in{\mathcal{S}}\\a\in{\mathcal{A}}}}\pi(a \mid s){\mathcal{T}}(s,a)\xi(s) = \sum_{\substack{s\in{\mathcal{S}}\\a\in{\mathcal{A}}}} {\mathcal{T}}(s,a) \phi(s,a) = T\phi\]

Also, \(E\phi = \xi\), because \(\sum_{a\in{\mathcal{A}}} \pi(a \mid s) = 1\) for every \(s\). We get

\[(E-\lambda T)\phi = \xi - \lambda T^\pi\xi = {\left(\boldsymbol{1}-\lambda T^\pi\right)}\xi = (1-\lambda) \zeta_0\]

Conversely, suppose that \(\phi\) is as above. Since \(E\phi=\xi\), there is \(\pi: {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}\) s.t. for any \(s\in{\mathcal{S}}\), if \(\xi(s) \ne 0\) then

\[\pi(a \mid s) = \frac{\phi(s,a)}{\xi(s)}\]

Again, we have

\[\operatorname{Z}\pi = (1-\lambda){\left(\boldsymbol{1} - \lambda T^\pi\right)}^{-1} \zeta_0\]

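Also, \(\phi(s,a) = \xi(s)\pi(a \mid s)\) for all \(s\) and \(a\): when \(\xi(s) = 0\), both sides vanish because \(\phi \geq 0\) and \(E\phi = \xi\). Hence, by the same computation as before,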

\[(E - \lambda T)\phi = {\left(\boldsymbol{1}-\lambda T^\pi\right)}\xi\]

By assumption, the left-hand side equals \((1-\lambda) \zeta_0\). We conclude

\[\xi = (1-\lambda) {\left(\boldsymbol{1} - \lambda T^\pi\right)}^{-1} \zeta_0 = \operatorname{Z}\pi\]
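The forward direction of the lemma can be sanity-checked numerically. The following is a minimal sketch on a made-up toy MDP (the transition kernel `trans`, the policy `pi`, the discount `lam` and the initial distribution `zeta0` are all illustrative assumptions, not taken from the post). It computes \(\xi = \operatorname{Z}\pi\) by iterating the fixed-point equation \(\xi = (1-\lambda)\zeta_0 + \lambda T^\pi \xi\), sets \(\phi(s,a) := \xi(s)\pi(a \mid s)\), and measures how far \((E-\lambda T)\phi\) is from \((1-\lambda)\zeta_0\).

```python
# Numerical sanity check of Lemma 1 (forward direction) on a toy MDP.
# All numbers below are made up for illustration.

S, A = 3, 2                 # number of states and actions
lam = 0.9                   # geometric discount factor (lambda)
zeta0 = [1.0, 0.0, 0.0]     # initial state distribution

# trans[s][a][t] = T(t | s, a); each innermost list sums to 1.
trans = [
    [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]],
    [[0.0, 1.0, 0.0], [0.3, 0.0, 0.7]],
    [[1.0, 0.0, 0.0], [0.0, 0.2, 0.8]],
]
# pi[s][a] = pi(a | s), a stationary policy.
pi = [[0.4, 0.6], [1.0, 0.0], [0.25, 0.75]]


def apply_T_pi(x):
    """Return T^pi x, where T^pi[t][s] = sum_a pi(a|s) T(t|s,a)."""
    return [
        sum(pi[s][a] * trans[s][a][t] * x[s] for s in range(S) for a in range(A))
        for t in range(S)
    ]


# xi = (1 - lam) * sum_n lam^n (T^pi)^n zeta0, obtained by iterating
# xi <- (1 - lam) * zeta0 + lam * T^pi xi, which contracts at rate lam.
xi = list(zeta0)
for _ in range(2000):
    Tx = apply_T_pi(xi)
    xi = [(1 - lam) * zeta0[t] + lam * Tx[t] for t in range(S)]

# phi(s, a) := xi(s) * pi(a | s)
phi = [[xi[s] * pi[s][a] for a in range(A)] for s in range(S)]

# E phi and T phi, both vectors indexed by states.
E_phi = [sum(phi[t]) for t in range(S)]
T_phi = [
    sum(trans[s][a][t] * phi[s][a] for s in range(S) for a in range(A))
    for t in range(S)
]

# Largest violation of (E - lam T) phi = (1 - lam) zeta0; should be ~0.
residual = max(
    abs(E_phi[t] - lam * T_phi[t] - (1 - lam) * zeta0[t]) for t in range(S)
)
total_mass = sum(E_phi)  # should be ~1, since phi is a distribution
print(residual, total_mass)
```

The iteration is used only to keep the sketch dependency-free; solving the linear system \((\boldsymbol{1}-\lambda T^\pi)\xi = (1-\lambda)\zeta_0\) directly would serve equally well.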


Proposition

Assuming, as before, that all parameters are rational, there is a polynomial-time algorithm that computes a quantilal policy.


Proof

The algorithm starts by solving the following linear program. The indeterminates are \(\phi \in {\mathbb{R}}^{{\mathcal{S}}\times{\mathcal{A}}}\) and \({\operatorname{QV}}\in{\mathbb{R}}\). The goal is maximizing \({\operatorname{QV}}\). The constraints are

\[\forall s\in{\mathcal{S}},a\in{\mathcal{A}}: \phi(s,a) \geq 0\]

\[\sum_{\substack{s\in{\mathcal{S}}\\ a\in{\mathcal{A}}}} \phi(s,a) = 1\]

\[(E - \lambda T) \phi = (1-\lambda) \zeta_0\]

\[\forall s \in {\mathcal{S}}\setminus\operatorname{supp}{\operatorname{Z}\sigma},a\in{\mathcal{A}}: \phi(s,a) = 0\]

\[\forall s \in \operatorname{supp}{\operatorname{Z}\sigma}: {\operatorname{QV}}\leq \sum_{t\in{\mathcal{S}}} {\mathcal{R}}(t)\sum_{a\in{\mathcal{A}}}\phi(t,a) - \frac{\eta}{\operatorname{Z}\sigma(s)} \sum_{a\in{\mathcal{A}}}\phi(s,a)\]
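To make the constraints above concrete, here is a minimal sketch that assembles them in the usual "minimize \(c \cdot x\) subject to \(A_{eq}x = b_{eq}\), \(A_{ub}x \leq b_{ub}\)" form, using exact rationals. The function name `build_lp`, the variable layout and the two-state "stay/switch" MDP are illustrative assumptions; \(\operatorname{Z}\sigma\) is assumed to be supplied by the caller (here it was worked out by hand for a uniformly random \(\sigma\)). The sketch does not solve the program; it only checks that \(\phi\) induced by \(\sigma\) itself is feasible with \({\operatorname{QV}} = \operatorname{E}_{\operatorname{Z}\sigma}[{\mathcal{R}}] - \eta\).

```python
from fractions import Fraction as F


def build_lp(trans, reward, zeta0, z_sigma, lam, eta):
    """Assemble the LP as (c, A_eq, b_eq, A_ub, b_ub) over x = (phi..., QV).

    phi(s, a) sits at index s * nA + a and QV is the last coordinate.
    The bounds phi >= 0 (QV free) must be imposed separately by the solver.
    """
    nS, nA = len(trans), len(trans[0])
    n = nS * nA + 1
    support = [s for s in range(nS) if z_sigma[s] > 0]

    def idx(s, a):
        return s * nA + a

    c = [F(0)] * (n - 1) + [F(-1)]  # minimizing -QV maximizes QV

    A_eq, b_eq = [], []
    # Normalization: sum_{s,a} phi(s, a) = 1
    A_eq.append([F(1)] * (n - 1) + [F(0)])
    b_eq.append(F(1))
    # Flow constraints: (E - lam T) phi = (1 - lam) zeta0, one row per state t
    for t in range(nS):
        row = [F(0)] * n
        for s in range(nS):
            for a in range(nA):
                row[idx(s, a)] = F(int(t == s)) - lam * trans[s][a][t]
        A_eq.append(row)
        b_eq.append((1 - lam) * zeta0[t])
    # phi(s, a) = 0 for states outside supp Z sigma
    for s in range(nS):
        if s not in support:
            for a in range(nA):
                row = [F(0)] * n
                row[idx(s, a)] = F(1)
                A_eq.append(row)
                b_eq.append(F(0))

    # QV - sum_t R(t) sum_a phi(t,a) + eta / Zsigma(s) * sum_a phi(s,a) <= 0
    A_ub, b_ub = [], []
    for s in support:
        row = [F(0)] * n
        row[-1] = F(1)
        for t in range(nS):
            for a in range(nA):
                row[idx(t, a)] -= reward[t]
        for a in range(nA):
            row[idx(s, a)] += eta / z_sigma[s]
        A_ub.append(row)
        b_ub.append(F(0))
    return c, A_eq, b_eq, A_ub, b_ub


# Toy MDP (assumed): action 0 stays in the current state, action 1 switches.
trans = [[[1, 0], [0, 1]],
         [[0, 1], [1, 0]]]          # trans[s][a][t] = T(t | s, a)
reward = [F(0), F(1)]
zeta0 = [F(1), F(0)]
lam, eta = F(1, 2), F(1, 10)
z_sigma = [F(3, 4), F(1, 4)]        # Z sigma for uniformly random sigma

c, A_eq, b_eq, A_ub, b_ub = build_lp(trans, reward, zeta0, z_sigma, lam, eta)


def dot(row, x):
    return sum(r * v for r, v in zip(row, x))


# Candidate point: phi(s, a) = Zsigma(s) / 2 and QV = E[R] - eta = 3/20.
x = [F(3, 8), F(3, 8), F(1, 8), F(1, 8), F(3, 20)]
feasible = (all(dot(r, x) == b for r, b in zip(A_eq, b_eq))
            and all(dot(r, x) <= b for r, b in zip(A_ub, b_ub)))
print(feasible)
```

The normalization row is kept to mirror the constraint list, although it is implied by summing the flow constraints over all states.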

Then, the algorithm computes \(\pi: {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}\) s.t. for any \(s\in{\mathcal{S}}\), if \(\sum_{b\in{\mathcal{A}}}\phi(s,b) > 0\) then

\[\pi(a \mid s) := \frac{\phi(s,a)}{\sum_{b\in{\mathcal{A}}}\phi(s,b)}\]

For \(s\in{\mathcal{S}}\) s.t. \(\sum_{b\in{\mathcal{A}}}\phi(s,b) = 0\), \(\pi(s)\) is arbitrary.
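The normalization step can be sketched as follows, assuming \(\phi\) is given as a nested list of exact rationals. The function name `extract_policy` and the sample \(\phi\) are hypothetical, and the uniform choice at states of zero mass is just one of the arbitrary options the text allows.

```python
from fractions import Fraction


def extract_policy(phi):
    """Map phi[s][a] to pi[s][a] = phi(s, a) / sum_b phi(s, b).

    At states where sum_b phi(s, b) = 0 the policy is arbitrary;
    this sketch picks the uniform distribution there.
    """
    policy = []
    for row in phi:
        mass = sum(row)
        if mass > 0:
            policy.append([Fraction(p) / mass for p in row])
        else:
            policy.append([Fraction(1, len(row))] * len(row))
    return policy


# Made-up occupancy measure over 3 states and 2 actions.
phi = [
    [Fraction(3, 8), Fraction(1, 8)],
    [Fraction(0), Fraction(0)],
    [Fraction(1, 2), Fraction(0)],
]
pi = extract_policy(phi)
print(pi)
```

Because only divisions of rationals are involved, a rational \(\phi\) yields a rational \(\pi\).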

Now we need to explain why this algorithm is correct.

Observe that the first three constraints mean that \(\xi\in{\mathbb{R}}^{\mathcal{S}}\) defined by \(\xi(s) := \sum_{b\in{\mathcal{A}}} \phi(s,b)\) lies in \(\operatorname{Im}{\operatorname{Z}}\) (by Lemma 1) and, moreover, \(\phi(s,a) = \xi(s)\pi(a \mid s)\) for \(\pi:{\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}\) s.t. \(\xi = \operatorname{Z}\pi\). It remains to show that the linear program amounts to maximizing \({\underset{\xi}{\operatorname{E}}{\left[{\mathcal{R}}\right]}} - \eta\exp{\operatorname{D}_{\infty}{\left(\xi\middle\vert\middle\vert\operatorname{Z}\sigma\right)}}\) inside \(\operatorname{Im}\operatorname{Z}\). Indeed, the fourth constraint just means that \({\operatorname{D}_{\infty}{\left(\xi\middle\vert\middle\vert\operatorname{Z}\sigma\right)}} < \infty\). The last constraint implies that we are actually maximizing

\[\min_{s\in\operatorname{supp}\operatorname{Z}\sigma} {\left({\underset{\xi}{\operatorname{E}}{\left[{\mathcal{R}}\right]}} - \frac{\eta}{\operatorname{Z}\sigma(s)} \xi(s)\right)}\]

The latter is indeed \({\underset{\xi}{\operatorname{E}}{\left[{\mathcal{R}}\right]}} - \eta\exp{\operatorname{D}_{\infty}{\left(\xi\middle\vert\middle\vert\operatorname{Z}\sigma\right)}}\), since every \(s\in\operatorname{supp}{\operatorname{Z}\sigma}\) corresponds to a pure strategy of the adversary in the corresponding zero-sum game: namely, this strategy is setting the penalty function \(P: {\mathcal{S}}\rightarrow [0,\infty)\) to

\[P(t) = \frac{[[t=s]]}{\operatorname{Z}\sigma(s)}\]

(Strategies that place non-vanishing penalty on states outside of \(\operatorname{supp}{\operatorname{Z}\sigma}\) become irrelevant after imposing the 4th constraint. The remaining penalty functions form a simplex with vertices as above.)
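The identity used in this argument, namely that the minimum over \(s\in\operatorname{supp}\operatorname{Z}\sigma\) coincides with \(\operatorname{E}_\xi[{\mathcal{R}}] - \eta\max_s \xi(s)/\operatorname{Z}\sigma(s)\), can be checked on small numbers. The sketch below uses hypothetical function names and made-up values for \(\xi\), \(\operatorname{Z}\sigma\), \({\mathcal{R}}\) and \(\eta\); it does not check that \(\xi\) lies in \(\operatorname{Im}\operatorname{Z}\).

```python
from fractions import Fraction as F


def quantilal_objective(xi, z_sigma, reward, eta):
    """E_xi[R] - eta * max_s xi(s) / Zsigma(s).

    Returns -inf if xi puts mass outside supp Zsigma (D_inf is infinite).
    """
    if any(x > 0 and z == 0 for x, z in zip(xi, z_sigma)):
        return float("-inf")
    expected_reward = sum(r * x for r, x in zip(reward, xi))
    worst_ratio = max(x / z for x, z in zip(xi, z_sigma) if z > 0)
    return expected_reward - eta * worst_ratio


def lp_value(xi, z_sigma, reward, eta):
    """min over s in supp Zsigma of E_xi[R] - eta / Zsigma(s) * xi(s)."""
    expected_reward = sum(r * x for r, x in zip(reward, xi))
    return min(
        expected_reward - eta / z * x for x, z in zip(xi, z_sigma) if z > 0
    )


# Made-up numbers for illustration.
z_sigma = [F(3, 4), F(1, 4)]
reward = [F(0), F(1)]
eta = F(1, 10)
xi = [F(1, 2), F(1, 2)]

objective = quantilal_objective(xi, z_sigma, reward, eta)
value = lp_value(xi, z_sigma, reward, eta)
print(objective, value)
```

Each term of the minimum corresponds to one pure strategy of the adversary, so the minimum picks the state with the largest ratio \(\xi(s)/\operatorname{Z}\sigma(s)\).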




