Why we want unbiased learning processes
post by Stuart Armstrong 30 days ago

Crossposted at Lesserwrong.

tl;dr: if an agent has a biased learning process, it may choose actions that are worse (with certainty) for every possible reward function it could be learning.

An agent learns its own reward function if there is a set $$\mathcal{R}$$ of possible reward functions, and there is a learning process $$P$$ that maps world-histories (and policies) to distributions over $$\mathcal{R}$$. Thus by interacting with the environment and choosing its own policies, the agent can learn which is the correct reward function it should be maximising.

Given a policy $$\pi$$, a history $$h$$, an environment $$\mu$$, and a reward $$R$$, we can compute the expected probability of $$R$$:

• $$\mathbb{E}^\mu_\pi P(R|h)$$.

Then a learning process is unbiased if that expression is independent of $$\pi$$, and biased otherwise. Biased processes are less desirable, as they allow the agent to manipulate the process through its choice of policy.

## Simple biased learning process

The most trivial example of a biased learning process is an agent that completely determines its reward function by its actions. Let $$\mathcal{R} = \{R_0, R_1\}$$, let the agent act only once, with two actions available, $$\{a_0, a_1\}$$ (hence a choice of “policy” is a choice of action), and set

• $$P(R_0|a_0)=P(R_1|a_1)=1$$.

Thus the agent can simply choose its reward function through its actions.
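As a minimal sketch of the bias check for this one-shot setting, the snippet below encodes $$P$$ as a table and tests whether the expected probability of each reward function depends on the policy. The encoding is an assumption made for illustration: the history is taken to be just the chosen action, the environment is trivial, and the names `P_BIASED`, `P_CONSTANT`, `expected_prob`, and `is_unbiased` are hypothetical. `P_CONSTANT` is an invented contrast case that does not appear in the post.

```python
from typing import Dict, List

# Hypothetical encoding of the one-shot example: the history is just the
# action taken, so P[action][reward] is the probability the learning
# process assigns to `reward` after that action.
Process = Dict[str, Dict[str, float]]

P_BIASED: Process = {
    "a0": {"R0": 1.0, "R1": 0.0},
    "a1": {"R0": 0.0, "R1": 1.0},
}

# Assumed contrast case (not from the post): the action has no influence.
P_CONSTANT: Process = {
    "a0": {"R0": 0.5, "R1": 0.5},
    "a1": {"R0": 0.5, "R1": 0.5},
}


def expected_prob(policy: Dict[str, float], reward: str, process: Process) -> float:
    """Expected probability of `reward` when the action is drawn from `policy`."""
    return sum(p_a * process[a][reward] for a, p_a in policy.items())


def is_unbiased(process: Process, tol: float = 1e-12) -> bool:
    """True if every deterministic policy gives the same expected probabilities.

    Checking deterministic policies is enough here, since the expectation
    is linear in the policy's action probabilities.
    """
    actions: List[str] = list(process)
    rewards = list(process[actions[0]])
    for r in rewards:
        values = [expected_prob({a: 1.0}, r, process) for a in actions]
        if max(values) - min(values) > tol:
            return False
    return True


print(is_unbiased(P_BIASED))
print(is_unbiased(P_CONSTANT))
```

Under these assumptions the first check reports that the process is biased (the probability of $$R_0$$ is $$1$$ under $$a_0$$ and $$0$$ under $$a_1$$), while the constant process passes.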

Note that some designs are a bit more sophisticated, and don’t allow the agent to choose its reward function directly through its actions. But this doesn’t help if the reward function depends on anything that is a predictable consequence of the agent’s actions. For example, if the agent can trick, coerce, or manipulate a human into saying “yes” or “no”, and $$P$$ is determined by the human’s response, then it doesn’t matter that $$P$$ is not defined directly through the agent’s actions: it is defined indirectly through them.

[Note that all $$P$$ that involve learning about external facts are unbiased learning processes, so it’s not as if unbiased means trivial]

## Strictly dominated behaviour

An agent with a biased learning process that wants to maximise the expectation of the true reward can sometimes follow strictly dominated policies. That means that there are policies $$\pi_0$$ and $$\pi_1$$ such that, for all histories $$h_i$$ possible given $$\pi_i$$ and all reward functions $$R\in\mathcal{R}$$,

• $$R(h_1) > R(h_0)$$.

And yet the agent will still choose $$\pi_0$$ to maximise reward.

For example, with $$P$$ and $$\mathcal{R}$$ defined as above, define $$R_0$$ and $$R_1$$ to be:

• $$\begin{array}{|c|c|c|}\hline & a_0 & a_1 \\ \hline R_0 & \mathbf{2} & 3 \\ \hline R_1 & 0 & \mathbf{1} \\ \hline \end{array}$$

Thus $$a_1$$ is always the better action, for both $$R_0$$ and for $$R_1$$; it is strictly dominant. However, since $$a_i$$ also determines which reward function is correct, the possible rewards the agent gets are the two bold numbers in the table: $$2$$, by choosing $$a_0$$ and hence making $$R_0$$ the correct reward function, and $$1$$, by choosing $$a_1$$ and hence making $$R_1$$ the correct reward function.

Then in order to maximise reward, the agent will choose the strictly dominated policy/action $$a_0$$.
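The arithmetic behind this choice can be sketched in a few lines. The dictionaries below simply transcribe the table and the learning process $$P$$ from the simple example; the names `REWARDS`, `expected_true_reward`, and `strictly_dominates` are hypothetical and chosen only for illustration.

```python
# Reward table from the example: REWARDS[reward][action].
REWARDS = {
    "R0": {"a0": 2, "a1": 3},
    "R1": {"a0": 0, "a1": 1},
}

# Biased learning process from the simple example: P[action][reward].
P = {
    "a0": {"R0": 1.0, "R1": 0.0},
    "a1": {"R0": 0.0, "R1": 1.0},
}

ACTIONS = ["a0", "a1"]


def expected_true_reward(action: str) -> float:
    """Expected value of the reward function that P makes correct, given the action."""
    return sum(P[action][r] * REWARDS[r][action] for r in REWARDS)


def strictly_dominates(a: str, b: str) -> bool:
    """True if action `a` is strictly better than `b` under every reward function."""
    return all(REWARDS[r][a] > REWARDS[r][b] for r in REWARDS)


# The agent maximises the expected true reward, which depends on P.
chosen = max(ACTIONS, key=expected_true_reward)

print(chosen, expected_true_reward("a0"), expected_true_reward("a1"))
print(strictly_dominates("a1", "a0"))
```

In this sketch the agent picks `a0` (expected true reward $$2$$ versus $$1$$), even though `a1` strictly dominates `a0` under both $$R_0$$ and $$R_1$$.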

## Unbiased learning

It’s possible to prove that if $$P$$ is unbiased, then this behaviour won’t occur, but doing so involves introducing a few more definitions and more machinery than is presented here, so I’ll defer this to my forthcoming paper.

## Note on expected dominance

[Reading the following is not relevant to understanding the main point of this post]

The dominant policy is defined so that for all $$R\in\mathcal{R}$$, $$R(h_1)>R(h_0)$$ for all histories $$h_i$$, possible given $$\pi_i$$.

We could instead talk about the expected reward given $$\pi_i$$. But in fact, it can make sense to choose policies that are strictly dominated in the expected-reward sense.

For example, let $$\pi_\emptyset$$ be a policy that does nothing (all rewards stay at $$0$$), and let $$\pi_P$$ be the policy that first checks which of $$R_0$$ and $$R_1$$ is correct (for a given $$P$$) and then maximises the correct one. If $$R_i$$ is maximised, it goes to $$1$$, while the other reward will go to $$-2$$.

Assume that the probability of either $$R_0$$ or $$R_1$$ being correct is $$1/2$$. Then it’s clear that $$\pi_P$$ is dominated in expectation by $$\pi_\emptyset$$, since

• $$\mathbb{E}_{\pi_\emptyset}^\mu R_i = 0$$,
• $$\mathbb{E}_{\pi_P}^\mu R_i = \tfrac{1}{2}(1)+\tfrac{1}{2}(-2) = -\tfrac{1}{2}$$.

Yet $$\pi_P$$ is clearly the right thing to do, since it allows us to maximise the correct reward ($$R_i$$ is only negative in worlds where it is not the correct reward).

So an unbiased agent can still choose a policy that is worse for every reward in expectation, if it’s confident that the (currently unknown) correct reward will get maximised more by this policy.
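The numbers in this example can be checked with a small sketch. It assumes two equally likely worlds, one where $$R_0$$ is correct and one where $$R_1$$ is correct, and it encodes the two policies by their outcomes only. The names `WORLDS`, `outcome`, `expected_reward`, and `expected_correct_reward` are hypothetical.

```python
# Probability that each reward function is the correct one (from the example).
WORLDS = {"R0": 0.5, "R1": 0.5}


def outcome(policy: str, correct: str) -> dict:
    """Value of each reward function after running `policy` in the world
    where `correct` is the true reward function."""
    if policy == "pi_empty":
        # Doing nothing leaves every reward at 0.
        return {r: 0.0 for r in WORLDS}
    # pi_P maximises the correct reward (to 1); the other one drops to -2.
    return {r: (1.0 if r == correct else -2.0) for r in WORLDS}


def expected_reward(policy: str, reward: str) -> float:
    """Expectation of one fixed reward function, whether or not it is correct."""
    return sum(p * outcome(policy, w)[reward] for w, p in WORLDS.items())


def expected_correct_reward(policy: str) -> float:
    """Expectation of whichever reward function is correct in each world."""
    return sum(p * outcome(policy, w)[w] for w, p in WORLDS.items())


for policy in ("pi_empty", "pi_P"):
    per_reward = {r: expected_reward(policy, r) for r in WORLDS}
    print(policy, per_reward, expected_correct_reward(policy))
```

Under these assumptions, `pi_P` has expectation $$-1/2$$ for each fixed $$R_i$$ against $$0$$ for `pi_empty`, yet its expected correct reward is $$1$$ against $$0$$, which is the gap between the two notions of dominance.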
