Rigged reward learning
post by Stuart Armstrong 592 days ago

A putative new idea for AI control; index here.

NOTE: What used to be called ‘bias’ is now called ‘rigging’, because ‘bias’ is very overloaded. The post has not yet been updated with the new terminology, however.

What are the biggest failure modes of reward learning agents?

The first failure mode is when the agent directly (or indirectly) chooses its reward function.

For instance, imagine a domestic robot that can be motivated to tidy (reward $$R_0$$) or cook (reward $$R_1$$). It has a switch that allows the human to choose the correct reward function. However, cooking gives a higher expected reward than tidying, so the agent may prefer to set the switch itself (or to manipulate the human’s choice). If it does, it will set the switch to ‘cook’.

In that case, the agent biases its reward learning process.
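The robot example can be put in numbers. The following is a minimal sketch, not part of the original post: the reward values and the function names `human_sets_switch` and `agent_sets_switch` are hypothetical, chosen only to show how direct control of the switch shifts the probability of $$R_1$$.

```python
# Hypothetical expected rewards for acting well under each reward function.
# The numbers are illustrative assumptions, not taken from the post.
EXPECTED_REWARD = {"R0_tidy": 1.0, "R1_cook": 3.0}


def human_sets_switch(p_cook):
    """Human picks R1 with probability p_cook; returns (P(R1), expected reward)."""
    value = ((1 - p_cook) * EXPECTED_REWARD["R0_tidy"]
             + p_cook * EXPECTED_REWARD["R1_cook"])
    return p_cook, value


def agent_sets_switch():
    """Agent sets the switch itself, picking the reward it scores best on."""
    best = max(EXPECTED_REWARD, key=EXPECTED_REWARD.get)
    p_cook = 1.0 if best == "R1_cook" else 0.0
    return p_cook, EXPECTED_REWARD[best]


print(human_sets_switch(0.5))
print(agent_sets_switch())
```

Under these assumed numbers, seizing the switch moves the probability of $$R_1$$ from the human’s 0.5 to 1 and raises the agent’s expected reward, which is exactly the pull towards biasing.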

A second failure mode (this version due to Jessica, original idea here) is when the agent influences its reward function without biasing it.

For example, the domestic robot might be waiting for the human to arrive in an hour’s time. It expects that the human is 50% likely to choose $$R_0$$ (tidying) and 50% likely to choose $$R_1$$ (cooking). If instead the robot can randomise its reward switch now (with equal odds on $$R_0$$ and $$R_1$$), it can know its reward function early, and get in a full extra hour of tidying/cooking.
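A small sketch of this comparison, under assumptions that are not in the post: the robot has a fixed number of hours in total, and earns a constant hypothetical reward per hour spent working on whichever task is chosen. The function names are illustrative.

```python
def wait_for_human(total_hours, p_cook=0.5, reward_per_hour=1.0):
    """Idle for the first hour, then work on whatever the human chooses.

    Returns (P(R1), expected reward). The reward per hour is assumed to be
    the same for tidying and cooking, so p_cook does not affect the value.
    """
    return p_cook, (total_hours - 1) * reward_per_hour


def randomise_now(total_hours, p_cook=0.5, reward_per_hour=1.0):
    """Flip the switch at random immediately and work for every hour."""
    return p_cook, total_hours * reward_per_hour


print(wait_for_human(8))
print(randomise_now(8))
```

Both options leave the probability of $$R_1$$ at 0.5, so nothing is biased, yet randomising now gains the extra hour of work.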

A subsequent post will formalise influence; here, let’s look at bias.

# Formalising bias


First of all, for a given policy $$\pi$$, we can say that $$\vdh$$ is unbiased for $$\pi$$ if $$\pi$$ preserves the expectation of $$\vdh$$. That is:

• For all histories $$h_t$$ with $$t < m$$, $$\vdh(\cdot\mid h_t)=\mathbb{E}_{\mu}^{\pi}[\vdh(\cdot\mid h_{t+1})\mid h_t]$$.

If the expectation of $$\vdh$$ is preserved by any policy, then we can say that $$\vdh$$ itself is unbiased:

• The prior $$\vdh$$ is unbiased if $$\vdh$$ is unbiased for $$\pi$$ for all policies $$\pi$$.

Recall that $$\vdh=\vd$$ on histories of length $$m$$. So $$\vdh$$ being unbiased implies restrictions on $$\vd$$:

• If $$\vdh$$ is unbiased, then for all $$h_t$$ with $$t<m$$ and for all policies $$\pi$$, $$\vdh(\cdot \mid h_t)=\mathbb{E}_{\mu}^{\pi}[\vd(\cdot\mid h_m)\mid h_t]$$.
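The step from the one-step condition to this restriction is the tower property of conditional expectation. As a sketch using only the notation above, for $$t+2\leq m$$:

$$\vdh(\cdot\mid h_t)=\mathbb{E}_{\mu}^{\pi}[\vdh(\cdot\mid h_{t+1})\mid h_t]=\mathbb{E}_{\mu}^{\pi}\big[\mathbb{E}_{\mu}^{\pi}[\vdh(\cdot\mid h_{t+2})\mid h_{t+1}]\mid h_t\big]=\mathbb{E}_{\mu}^{\pi}[\vdh(\cdot\mid h_{t+2})\mid h_t]$$

Repeating the same step until the history reaches length $$m$$, where $$\vdh=\vd$$, gives the stated equality.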

Since $$\vdh$$ being unbiased imposes restrictions on $$\vd$$, we can directly define:

• The posterior $$\vd$$ is unbiased if there exists a possible prior $$\vdh'$$ with $$\vdh'=\vd$$ on histories of length $$m$$, and $$\vdh'$$ is unbiased.

So what does unbiased mean in practice? It simply means that whatever actions or policies the agent follows, it cannot change the expectation of its values.
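To make this concrete, here is a minimal one-step check, written as a sketch under simplifying assumptions: there are only two reward functions, so the distribution is summarised by the single number P($$R_1$$), and each action leads to a finite list of (probability, posterior) outcomes. The names `expected_posterior` and `is_unbiased` and the toy transition tables are hypothetical.

```python
import math


def expected_posterior(outcomes):
    """outcomes: list of (probability, posterior P(R1)) pairs for one action."""
    return sum(prob * post for prob, post in outcomes)


def is_unbiased(prior, transitions, tol=1e-9):
    """True if every available action preserves the expectation of P(R1)."""
    return all(
        math.isclose(expected_posterior(outcomes), prior, abs_tol=tol)
        for outcomes in transitions.values()
    )


# Waiting for the human and randomising the switch both resolve to
# R0 or R1 with equal odds, so the expectation stays at the prior 0.5.
fair = {
    "wait": [(0.5, 0.0), (0.5, 1.0)],
    "randomise": [(0.5, 0.0), (0.5, 1.0)],
}

# Adding an action that sets the switch to 'cook' for certain breaks this.
rigged = dict(fair, set_cook=[(1.0, 1.0)])

print(is_unbiased(0.5, fair))
print(is_unbiased(0.5, rigged))
```

The first table passes the check and the second fails it, matching the distinction between influencing the reward function and biasing it.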

# Bias and learning incentives

This is an opportunity to present the learning and biasing graph:

The x-axis represents the probability $$\vdh(R_1\mid h_t)$$ of $$R_1$$ being the ‘correct’ reward function. The current value is $$\vdh(R_1\mid h_t)=p$$.

The orange curve (which is always convex, though not necessarily strictly so) represents, for a given probability $$q$$ of correctness of $$R_1$$, the expected value the agent could get if it knew it would never learn anything more about the correct value.

If it learnt immediately and costlessly about the correct values, it would go to $$q=0$$ or $$q=1$$ with probability $$1-p$$ and $$p$$, respectively. Thus its expected reward is the point at x-coordinate $$p$$ on the blue curve, which is the straight line joining the orange curve’s values at $$q=0$$ and $$q=1$$.

Thus the green arrow, which runs from the orange curve up to the blue one at $$p$$, represents the incentive to learn. But if the agent can’t learn easily, it may try to randomise its reward function, so the green arrow also represents the incentive for (unbiased) influence.

The shape of the orange curve itself represents the incentive to bias.

If the orange curve is flat (that is, a straight line), it is equal to the blue one, so there is no incentive to learn. If the orange curve is flat and horizontal, there is no incentive to bias, either.
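The curves can be sketched numerically. This example assumes, beyond what the post states, that the agent picks from a small hypothetical set of policies, each with a fixed expected value under $$R_0$$ and under $$R_1$$; the orange curve is then the best achievable mixture at belief $$q$$, which is convex because it is a maximum of straight lines. All names and numbers are illustrative.

```python
# Hypothetical policies: (expected value under R0, expected value under R1).
POLICIES = {
    "tidy": (1.0, 0.0),
    "cook": (0.0, 3.0),
    "bit_of_both": (0.6, 1.5),
}


def orange(q, policies=POLICIES):
    """Best expected value at belief q = P(R1) with no further learning."""
    return max((1 - q) * v0 + q * v1 for v0, v1 in policies.values())


def blue(p, policies=POLICIES):
    """Expected value if the correct reward is learnt immediately and for free."""
    return (1 - p) * orange(0.0, policies) + p * orange(1.0, policies)


def learning_incentive(p, policies=POLICIES):
    """Length of the green arrow: gap between the blue and orange curves at p."""
    return blue(p, policies) - orange(p, policies)


print(orange(0.5), blue(0.5), learning_incentive(0.5))

# With a single policy the orange curve is a straight line, so the gap vanishes.
single = {"only": (1.0, 3.0)}
print(learning_incentive(0.5, single))
```

In this toy set the orange curve is higher at $$q=1$$ than at $$q=0$$, so the agent would also gain from pushing $$q$$ towards 1, which is the incentive to bias that the curve’s shape encodes.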
