Intelligent Agent Foundations Forumsign up / log in
Computing an exact quantilal policy
discussion post by Vadim Kosoy 100 days ago | discuss

Earlier we established that the quantilal policy can be computed in polynomial time to any given approximation (see “Proposition 5”). Now we show that an exact quantilal policy can be computed in polynomial time (in particular there is always a rational quantilal policy).

We assume geometric time discount throughout.

Lemma

Consider \(\xi \in \Delta{\mathcal{S}}\). Define the linear operators \(E: {\mathbb{R}}^{{\mathcal{S}}\times {\mathcal{A}}} \rightarrow {\mathbb{R}}^{\mathcal{S}}\) and \(T: {\mathbb{R}}^{{\mathcal{S}}\times {\mathcal{A}}} \rightarrow {\mathbb{R}}^{\mathcal{S}}\) by

\[E_{t,sa} := [[t = s]]\]

\[T_{t,sa} := {\mathcal{T}}{\left(t\;\middle\vert\;s,a\right)}\]

(Note that this \(T\) is the transpose of the \(T\) defined in “Proposition A.3” of the previous essay.)

Then, \(\xi \in \operatorname{Im}{\operatorname{Z}}\) if and only if there is \(\phi \in \Delta({\mathcal{S}}\times{\mathcal{A}})\) s.t.

\[E\phi = \xi\]

\[(E-\lambda T) \phi = (1-\lambda)\zeta_0\]

Proof

(This is actually well known, but we spell out the proof to be self-contained.)

Suppose that \(\xi \in \operatorname{Im}{\operatorname{Z}}\). We already know that this implies that there is a stationary policy \(\pi:{\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}\) s.t. \(\operatorname{Z}\pi = \xi\) (we abuse notation in the obvious way): see the proofs of “Proposition 2” and “Proposition 3”. Define the linear operator \(T^\pi: {\mathbb{R}}^{\mathcal{S}}\rightarrow {\mathbb{R}}^{\mathcal{S}}\) by

\[T_{ts}^\pi := {\underset{a\sim\pi(s)}{\operatorname{E}}{\left[{\mathcal{T}}{\left(t\;\middle\vert\;s,a\right)}\right]}}\]

It follows that

\[\xi = (1-\lambda)\sum_{n=0}^\infty \lambda^n T^{\pi n} \zeta_0 = (1-\lambda){\left(\boldsymbol{1}-\lambda T^\pi\right)}^{-1} \zeta_0\]

\[{\left(\boldsymbol{1}-\lambda T^\pi\right)}\xi = (1-\lambda)\zeta_0\]

Define \(\phi\) by

\[\phi(s,a) := \xi(s) \pi(a \mid s)\]

We have

\[T^\pi\xi = \sum_{s\in{\mathcal{S}}} {\underset{a\sim\pi(s)}{\operatorname{E}}{\left[{\mathcal{T}}(s,a)\right]}} \xi(s) = \sum_{\substack{s\in{\mathcal{S}}\\a\in{\mathcal{A}}}}\pi(a \mid s){\mathcal{T}}(s,a)\xi(s) = \sum_{\substack{s\in{\mathcal{S}}\\a\in{\mathcal{A}}}} {\mathcal{T}}(s,a) \phi(s,a) = T\phi\]

Also, obviously \(E\phi = \xi\). We get

\[(E-\lambda T)\phi = \xi - \lambda T^\pi\xi = {\left(\boldsymbol{1}-\lambda T^\pi\right)}\xi = (1-\lambda) \zeta_0\]

Conversely, suppose that \(\phi\) is as above. Since \(E\phi=\xi\), there is \(\pi: {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}\) s.t. for any \(s\in{\mathcal{S}}\), if \(\xi(s) \ne 0\) then

\[\pi(a \mid s) = \frac{\phi(s,a)}{\xi(s)}\]

Again, we have

\[\operatorname{Z}\pi = (1-\lambda){\left(\boldsymbol{1} - \lambda T^\pi\right)}^{-1} \zeta_0\]

Also, for the same reason as before

\[(E - \lambda T)\phi = {\left(\boldsymbol{1}-\lambda T^\pi\right)}\xi\]

By the assumption, the left hand side equals \((1-\lambda) \zeta_0\). We conclude

\[\xi = (1-\lambda) {\left(\boldsymbol{1} - \lambda T^\pi\right)}^{-1} \zeta_0 = \operatorname{Z}\pi\]

Theorem

Assuming all parameters are rational like before, there is a polynomial time algorithm that computes a quantilal policy.

Proof

The algorithm starts by solving the following linear program. The indeterminates are \(\phi \in {\mathbb{R}}^{{\mathcal{S}}\times{\mathcal{A}}}\) and \({\operatorname{QV}}\in{\mathbb{R}}\). The goal is maximizing \({\operatorname{QV}}\). The constraints are

\[\forall s\in{\mathcal{S}},a\in{\mathcal{A}}: \phi(s,a) \geq 0\]

\[\sum_{\substack{s\in{\mathcal{S}}\\ a\in{\mathcal{A}}}} \phi(s,a) = 1\]

\[(E - \lambda T) \phi = (1-\lambda) \zeta_0\]

\[\forall s \in {\mathcal{S}}\setminus\operatorname{supp}{\operatorname{Z}\sigma},a\in{\mathcal{A}}: \phi(s,a) = 0\]

\[\forall s \in \operatorname{supp}{\operatorname{Z}\sigma}: {\operatorname{QV}}\leq \sum_{t\in{\mathcal{S}}} {\mathcal{R}}(t)\sum_{a\in{\mathcal{A}}}\phi(t,a) - \frac{\eta}{\operatorname{Z}\sigma(s)} \sum_{a\in{\mathcal{A}}}\phi(s,a)\]

Then, the algorithm computes \(\pi: {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}\) s.t. for any \(s\in{\mathcal{S}}\), if \(\sum_{b\in{\mathcal{A}}}\phi(s,b) > 0\) then

\[\pi(a \mid s) := \frac{\phi(s,a)}{\sum_{b\in{\mathcal{A}}}\phi(s,b)}\]

For \(s\in{\mathcal{S}}\) s.t. \(\sum_{b\in{\mathcal{A}}}\phi(s,b) = 0\), \(\pi(s)\) is arbitrary.

Now we need to explain why this algorithm is correct.

Observe that, the first 3 constraints mean that \(\xi\in{\mathbb{R}}^{\mathcal{S}}\) defined by \(\xi(s) := \sum_{b\in{\mathcal{A}}} \phi(s,b)\) lies in \(\operatorname{Im}{\operatorname{Z}}\) (by Lemma 1) and, moreover, \(\phi(s,a) = \xi(s)\pi(a \mid s)\) for \(\pi:{\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{A}}\) s.t. \(\xi = \operatorname{Z}\pi\). It remains to show that the linear program amounts to maximizing \({\underset{\xi}{\operatorname{E}}{\left[{\mathcal{R}}\right]}} - \eta\exp{\operatorname{D}_{\infty}{\left(\xi\middle\vert\middle\vert\operatorname{Z}\sigma\right)}}\) inside \(\operatorname{Im}\operatorname{Z}\). Indeed, the 4th constraint just means that \({\operatorname{D}_{\infty}{\left(\xi\middle\vert\middle\vert\operatorname{Z}\sigma\right)}} < \infty\). The last constraint implies that we are actually maximizing

\[\min_{s\in\operatorname{supp}\operatorname{Z}\sigma} {\left({\underset{\xi}{\operatorname{E}}{\left[{\mathcal{R}}\right]}} - \frac{\eta}{\operatorname{Z}\sigma(s)} \xi(s)\right)}\]

The latter is indeed \({\underset{\xi}{\operatorname{E}}{\left[{\mathcal{R}}\right]}} - \eta\exp{\operatorname{D}_{\infty}{\left(\xi\middle\vert\middle\vert\operatorname{Z}\sigma\right)}}\), since every \(s\in\operatorname{supp}{\operatorname{Z}\sigma}\) corresponds to a pure strategy of the adversary in the corresponding zero-sum game: namely, this strategy is setting the penalty function \(P: {\mathcal{S}}\rightarrow [0,\infty)\) to

\[P(t) = \frac{[[t=s]]}{\operatorname{Z}\sigma(s)}\]

(Strategies that place non-vanishing penalty on states outside of \(\operatorname{supp}{\operatorname{Z}\sigma}\) become irrelevant after imposing the 4th constraint. The remaining penalty functions form a simplex with vertices as above.)



NEW LINKS

NEW POSTS

NEW DISCUSSION POSTS

RECENT COMMENTS

There should be a chat icon
by Alex Mennen on Meta: IAFF vs LessWrong | 0 likes

Apparently "You must be
by Jessica Taylor on Meta: IAFF vs LessWrong | 1 like

There is a replacement for
by Alex Mennen on Meta: IAFF vs LessWrong | 1 like

Regarding the physical
by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think that we should expect
by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think I understand your
by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

This seems like a hack. The
by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

After thinking some more,
by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yes, I think that we're
by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

My intuition is that it must
by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

To first approximation, a
by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Actually, I *am* including
by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yeah, when I went back and
by Alex Appel on Optimal and Causal Counterfactual Worlds | 0 likes

> Well, we could give up on
by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

> For another thing, consider
by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

RSS

Privacy & Terms