Biased reward learning
post by Stuart Armstrong 372 days ago

A putative new idea for AI control; index here.

What are the biggest failure modes of reward learning agents?

The first failure mode is when the agent directly (or indirectly) chooses its reward function.

For instance, imagine a domestic robot that can be motivated to tidy (reward $$R_0$$) or cook (reward $$R_1$$). It has a switch that allows the human to choose the correct reward function. However, cooking gives a higher expected reward than tidying, and the agent may choose to set the switch directly (or manipulate the human’s choice). In that case, it will set it to ‘cook’.

In that case, the agent biases its reward learning process.

A second failure mode (this version due to Jessica, original idea here) is when the agent influences its reward function without biasing it.

For example, the domestic robot might be waiting for the human to arrive in an hour’s time. It expects that the human is 50% likely to choose $$R_0$$ (tidying) and 50% likely to choose $$R_1$$ (cooking). If instead the robot can randomise its reward switch now (with equal odds on $$R_0$$ and $$R_1$$), it can know its reward function early, and get in a full extra hour of tidying/cooking.
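A minimal sketch of this second failure mode, under assumed toy numbers (a three-hour working day, and a robot that does nothing useful until it knows its reward function; neither assumption is from the post). The names `TOTAL_HOURS`, `HUMAN_ARRIVAL_HOUR` and `outcome` are hypothetical. It shows that both strategies leave the probability of $$R_1$$ at 50%, yet randomising buys an extra productive hour.

```python
# Toy numbers, assumed for illustration only (not from the original post).
TOTAL_HOURS = 3          # hypothetical length of the robot's working day
HUMAN_ARRIVAL_HOUR = 1   # the human would set the switch after one hour


def outcome(strategy: str) -> dict:
    """Return the expected probability of R_1 and the productive hours.

    Assumes the robot cannot work usefully until its reward function is known.
    """
    if strategy == "wait_for_human":
        expected_p_r1 = 0.5               # human picks R_0 or R_1 with equal odds
        known_from = HUMAN_ARRIVAL_HOUR   # reward known only once the human arrives
    elif strategy == "randomise_now":
        expected_p_r1 = 0.5               # fair randomisation: same odds as the human
        known_from = 0                    # reward known immediately
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return {
        "expected_p_r1": expected_p_r1,
        "productive_hours": TOTAL_HOURS - known_from,
    }


if __name__ == "__main__":
    for name in ("wait_for_human", "randomise_now"):
        print(name, outcome(name))
```

The expectation of the reward function is untouched, so this is influence rather than bias; the gain comes purely from resolving the uncertainty sooner.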

A subsequent post will formalise influence; here, let’s look at bias.

# Formalising bias


First of all, for a given policy $$\pi$$, we can say that $$\vdh$$ is unbiased for $$\pi$$, if $$\pi$$ preserves the expectation of $$\vdh$$. That is:

• For all histories $$h_t$$ with $$t < m$$, $$\vdh(\cdot\mid h_t)=\mathbb{E}_{\mu}^{\pi}[\vdh(\cdot\mid h_{t+1})\mid h_t]$$.

If the expectation of $$\vdh$$ is preserved by any policy, then we can say that $$\vdh$$ itself is unbiased:

• The prior $$\vdh$$ is unbiased if $$\vdh$$ is unbiased for $$\pi$$ for all policies $$\pi$$.

Recall that $$\vdh=\vd$$ on histories of length $$m$$. So $$\vdh$$ being unbiased implies restrictions on $$\vd$$:

• If $$\vdh$$ is unbiased, then for all $$h_t$$ with $$t<m$$ and for all policies $$\pi$$, $$\vdh(\cdot \mid h_t)=\mathbb{E}_{\mu}^{\pi}[\vd(\cdot\mid h_m)\mid h_t]$$.
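This restriction can be recovered by iterating the one-step condition and applying the tower property of conditional expectation. A sketch of the chain, for a fixed policy $$\pi$$ and a history $$h_t$$ with $$t<m$$, repeating the first step for as long as the history is shorter than $$m$$:

$$\begin{aligned}
\vdh(\cdot\mid h_t) &= \mathbb{E}_{\mu}^{\pi}[\vdh(\cdot\mid h_{t+1})\mid h_t] \\
&= \mathbb{E}_{\mu}^{\pi}\big[\,\mathbb{E}_{\mu}^{\pi}[\vdh(\cdot\mid h_{t+2})\mid h_{t+1}]\,\big|\, h_t\big] = \mathbb{E}_{\mu}^{\pi}[\vdh(\cdot\mid h_{t+2})\mid h_t] \\
&= \cdots = \mathbb{E}_{\mu}^{\pi}[\vdh(\cdot\mid h_m)\mid h_t] = \mathbb{E}_{\mu}^{\pi}[\vd(\cdot\mid h_m)\mid h_t].
\end{aligned}$$

The last equality uses $$\vdh=\vd$$ on histories of length $$m$$.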

Since $$\vdh$$ being unbiased imposes restrictions on $$\vd$$, we can directly define:

• The posterior $$\vd$$ is unbiased if there exists a possible prior $$\vdh'$$ with $$\vdh'=\vd$$ on histories of length $$m$$, and $$\vdh'$$ is unbiased.

So what does unbiased mean in practice? It simply means that whatever actions or policies the agent follows, it cannot change the expectation of its values.
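To make the definition concrete, here is a minimal sketch that checks it on a one-step toy version of the domestic robot. Everything in it is an assumption made for illustration: the three actions, the transition table `SWITCH_DIST`, the prior of 0.5 on $$R_1$$, and the restriction to deterministic one-action policies. It compares the prior probability of $$R_1$$ with the expected posterior probability under each policy.

```python
from typing import Dict

# Hypothetical one-step toy model (not from the original post).
# Each action leads to a distribution over the final switch position.
SWITCH_DIST: Dict[str, Dict[str, float]] = {
    "wait_for_human": {"tidy": 0.5, "cook": 0.5},
    "randomise_switch": {"tidy": 0.5, "cook": 0.5},
    "force_cook": {"tidy": 0.0, "cook": 1.0},
}

# Posterior probability of R_1 (cooking) at the final history, given the switch.
POSTERIOR_R1: Dict[str, float] = {"tidy": 0.0, "cook": 1.0}

# Assumed prior probability of R_1 before the agent acts.
PRIOR_R1 = 0.5


def expected_posterior_r1(action: str) -> float:
    """Expected posterior probability of R_1 if the agent takes `action`."""
    return sum(prob * POSTERIOR_R1[switch]
               for switch, prob in SWITCH_DIST[action].items())


def is_unbiased_for(action: str, tol: float = 1e-9) -> bool:
    """True if this (deterministic) policy preserves the prior expectation."""
    return abs(expected_posterior_r1(action) - PRIOR_R1) <= tol


def is_unbiased() -> bool:
    """True if every available policy preserves the prior expectation."""
    return all(is_unbiased_for(action) for action in SWITCH_DIST)


if __name__ == "__main__":
    for action in SWITCH_DIST:
        print(action, expected_posterior_r1(action), is_unbiased_for(action))
    print("unbiased overall:", is_unbiased())
```

In this toy model the prior is unbiased for waiting and for fair randomisation, but not for forcing the switch to cook, so it is not unbiased overall.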

# Bias and learning incentives

This is an opportunity to present the learning and biasing graph:

The x-axis represents the probability $$\vdh(R_1\mid h_t)$$ of $$R_1$$ being the ‘correct’ reward function. The current value is $$\vdh(R_1\mid h_t)=p$$.

The orange curve (which is always convex, though not necessarily strictly so) represents, for a given probability $$q$$ of correctness of $$R_1$$, the expected value the agent could get if it knew it would never learn anything more about the correct value.

If it learnt immediately and costlessly about the correct values, it would go to $$q=0$$ or $$q=1$$ with probability $$1-p$$ and $$p$$, respectively. Thus its expected reward is the point at x-coordinate $$p$$ on the blue curve, the straight line joining the values of the orange curve at $$q=0$$ and $$q=1$$.

Thus the green arrow, the gap between the orange and blue curves at $$p$$, represents the incentive to learn. But, if it can’t learn easily, it may try to randomise its reward function, so the green arrow also represents the incentive to (unbiased) influence.

The shape of the orange curve itself represents the incentive to bias.

If the orange curve is flat (a straight line), it is equal to the blue one, so there is no incentive to learn. If the orange curve is flat and horizontal, there is no incentive to bias, either.
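The curves can be sketched numerically for a toy case. The payoff table `PAYOFFS` below is hypothetical, chosen so that cooking pays more than tidying as in the domestic robot example. The orange curve is taken to be the best expected reward the agent can get from a single action when $$R_1$$ is correct with probability $$q$$; as a maximum of linear functions of $$q$$, it is convex. The endpoint difference in `bias_incentive` is only a crude summary of the shape of the orange curve, introduced here as an assumption.

```python
from typing import Dict, Tuple

# Hypothetical payoffs: action -> (reward under R_0, reward under R_1).
PAYOFFS: Dict[str, Tuple[float, float]] = {
    "tidy": (1.0, 0.0),
    "cook": (0.0, 2.0),
}


def orange(q: float) -> float:
    """Best expected reward if R_1 is correct with probability q
    and the agent will never learn anything more."""
    return max((1 - q) * r0 + q * r1 for r0, r1 in PAYOFFS.values())


def blue(p: float) -> float:
    """Expected reward if the agent learns the correct reward at once:
    the straight line through orange(0) and orange(1)."""
    return (1 - p) * orange(0.0) + p * orange(1.0)


def learning_incentive(p: float) -> float:
    """Length of the green arrow: the gap between the curves at p."""
    return blue(p) - orange(p)


def bias_incentive() -> float:
    """Crude summary (an assumption): gain from pushing q from 0 to 1."""
    return orange(1.0) - orange(0.0)


if __name__ == "__main__":
    p = 0.5
    print("orange(p) =", orange(p))
    print("blue(p) =", blue(p))
    print("incentive to learn =", learning_incentive(p))
    print("incentive to bias towards R_1 =", bias_incentive())
```

With these numbers the gap at $$p=0.5$$ is positive, so the agent gains from learning (or from unbiased randomisation), and the orange curve is higher at $$q=1$$ than at $$q=0$$, so it also gains from biasing towards cooking.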
