Rigged reward learning
post by Stuart Armstrong 464 days ago

A putative new idea for AI control; index here.

NOTE: What used to be called ‘bias’ is now called ‘rigging’, because ‘bias’ is heavily overloaded. The post has not yet been updated with the new terminology, however.

What are the biggest failure modes of reward learning agents?

The first failure mode is when the agent directly (or indirectly) chooses its reward function.

For instance, imagine a domestic robot that can be motivated to tidy (reward $$R_0$$) or cook (reward $$R_1$$). It has a switch that allows the human to choose the correct reward function. However, cooking gives a higher expected reward than tidying, so the agent may choose to set the switch directly (or manipulate the human’s choice). If it does, it will set the switch to ‘cook’.

In that case, the agent biases its reward learning process.
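The robot's incentive can be made concrete with a small calculation. The sketch below is only an illustration: the payoff numbers and the 50% chance that the human picks cooking are assumptions introduced here, not values from the post.

```python
# Hypothetical numbers, chosen only for illustration (not from the post).
VALUE_IF_TIDY = 1.0        # expected reward the robot can earn under R_0 (tidying)
VALUE_IF_COOK = 3.0        # expected reward the robot can earn under R_1 (cooking)
P_HUMAN_PICKS_COOK = 0.5   # chance the human would choose R_1 if left alone


def expected_value(p_cook: float) -> float:
    """Expected reward when the switch ends up on 'cook' with probability p_cook."""
    return (1 - p_cook) * VALUE_IF_TIDY + p_cook * VALUE_IF_COOK


# Policy A: leave the switch to the human.
value_hands_off = expected_value(P_HUMAN_PICKS_COOK)

# Policy B: the robot sets the switch itself, picking whichever setting pays more.
value_rigged = max(expected_value(0.0), expected_value(1.0))

print("hands off:", value_hands_off)
print("rigged:   ", value_rigged)
```

Under these assumed numbers, setting the switch pays more than leaving it alone, purely because it shifts the probability of $$R_1$$ from 0.5 to 1.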

A second failure mode (this version due to Jessica, original idea here) is when the agent influences its reward function without biasing it.

For example, the domestic robot might be waiting for the human to arrive in an hour’s time. It expects the human to be 50% likely to choose $$R_0$$ (tidying) and 50% likely to choose $$R_1$$ (cooking). If instead the robot can randomise its reward switch now (with equal odds on $$R_0$$ and $$R_1$$), it can know its reward function early, and get in a full extra hour of tidying/cooking.
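A rough sketch of why randomising is attractive even though it leaves the odds untouched. The eight-hour horizon, the per-hour reward rates, and the assumption that the robot earns nothing while waiting are all hypothetical choices made for this illustration.

```python
# Hypothetical setup (assumptions for illustration, not from the post).
HOURS_TOTAL = 8     # length of the robot's working day
RATE_TIDY = 1.0     # reward per hour of tidying when R_0 is the reward function
RATE_COOK = 1.0     # reward per hour of cooking when R_1 is the reward function
P_COOK = 0.5        # probability of R_1, the same under both policies


def expected_return(p_cook: float, hours_worked: int) -> float:
    """Expected reward from working on the known task for hours_worked hours."""
    return hours_worked * ((1 - p_cook) * RATE_TIDY + p_cook * RATE_COOK)


# Wait for the human: the first hour is spent idle (assumed to earn nothing).
value_wait = expected_return(P_COOK, HOURS_TOTAL - 1)

# Randomise the switch now with equal odds: all hours are productive.
value_randomise = expected_return(P_COOK, HOURS_TOTAL)

print("wait:     ", value_wait)
print("randomise:", value_randomise)
```

The probability of $$R_1$$ is 0.5 in both cases, so nothing is biased; the gain comes entirely from resolving the uncertainty an hour earlier.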

A subsequent post will formalise influence; here, let’s look at bias.

Formalising bias


First of all, for a given policy $$\pi$$, we can say that $$\vdh$$ is unbiased for $$\pi$$ if $$\pi$$ preserves the expectation of $$\vdh$$. That is:

• For all histories $$h_t$$ with $$t < m$$, $$\vdh(\cdot\mid h_t)=\mathbb{E}_{\mu}^{\pi}[\vdh(\cdot\mid h_{t+1})\mid h_t]$$.

If the expectation of $$\vdh$$ is preserved by any policy, then we can say that $$\vdh$$ itself is unbiased:

• The prior $$\vdh$$ is unbiased if $$\vdh$$ is unbiased for $$\pi$$ for all policies $$\pi$$.

Recall that $$\vdh=\vd$$ on histories of length $$m$$. So $$\vdh$$ being unbiased implies restrictions on $$\vd$$:

• If $$\vdh$$ is unbiased, then for all $$h_t$$ with $$t<m$$ and for all policies $$\pi$$, $$\vdh(\cdot \mid h_t)=\mathbb{E}_{\mu}^{\pi}[\vd(\cdot\mid h_m)\mid h_t]$$.
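This follows from the one-step condition by applying the tower property of conditional expectation repeatedly. As a sketch, for a fixed policy $$\pi$$ and a history $$h_t$$ with $$t<m$$ (the chain simply stops earlier when $$t+1=m$$):

$$\vdh(\cdot\mid h_t)=\mathbb{E}_{\mu}^{\pi}[\vdh(\cdot\mid h_{t+1})\mid h_t]=\mathbb{E}_{\mu}^{\pi}\big[\mathbb{E}_{\mu}^{\pi}[\vdh(\cdot\mid h_{t+2})\mid h_{t+1}]\mid h_t\big]=\mathbb{E}_{\mu}^{\pi}[\vdh(\cdot\mid h_{t+2})\mid h_t]=\cdots=\mathbb{E}_{\mu}^{\pi}[\vdh(\cdot\mid h_m)\mid h_t]=\mathbb{E}_{\mu}^{\pi}[\vd(\cdot\mid h_m)\mid h_t]$$

The last equality uses $$\vdh=\vd$$ on histories of length $$m$$.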

Since $$\vdh$$ being unbiased imposes restrictions on $$\vd$$, we can directly define:

• The posterior $$\vd$$ is unbiased if there exists a possible prior $$\vdh'$$ with $$\vdh'=\vd$$ on histories of length $$m$$, and $$\vdh'$$ is unbiased.

So what does unbiased mean in practice? It simply means that, whatever actions or policies the agent follows, it cannot change the expectation of its values.
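The condition can be checked mechanically in a toy version of the switch example. The sketch below assumes a one-step world ($$m=1$$) with only two reward functions, so tracking the probability of $$R_1$$ is enough; the action names, history names, and probabilities are all hypothetical.

```python
from typing import Dict

# Hypothetical one-step world (m = 1); every name and number is an assumption.
PRIOR_P_COOK = 0.5  # prior probability of R_1 on the empty history

# Posterior probability of R_1 on each possible length-1 history.
POSTERIOR_P_COOK: Dict[str, float] = {
    "human_says_tidy": 0.0,
    "human_says_cook": 1.0,
    "robot_set_cook": 1.0,
}

# Environment: each action leads to a distribution over length-1 histories.
TRANSITIONS: Dict[str, Dict[str, float]] = {
    "wait": {"human_says_tidy": 0.5, "human_says_cook": 0.5},
    "set_switch": {"robot_set_cook": 1.0},
}


def expected_posterior(policy: Dict[str, float]) -> float:
    """Expected posterior probability of R_1 when actions are drawn from policy."""
    total = 0.0
    for action, p_action in policy.items():
        for history, p_history in TRANSITIONS[action].items():
            total += p_action * p_history * POSTERIOR_P_COOK[history]
    return total


def is_unbiased_for(policy: Dict[str, float], tol: float = 1e-12) -> bool:
    """True if the policy preserves the prior expectation of P(R_1)."""
    return abs(expected_posterior(policy) - PRIOR_P_COOK) <= tol


print(is_unbiased_for({"wait": 1.0}))
print(is_unbiased_for({"set_switch": 1.0}))
```

In this toy world the prior is unbiased for the waiting policy but not for the switch-setting one, so it is not unbiased in the "for all policies" sense.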

Bias and learning incentives

This is an opportunity to present the learning and biasing graph:

The x-axis represents the probability $$\vdh(R_1\mid h_t)$$ of $$R_1$$ being the ‘correct’ reward function. The current value is $$\vdh(R_1\mid h_t)=p$$.

The orange curve (which is always convex, though not necessarily strictly so) represents, for a given probability $$q$$ of correctness of $$R_1$$, the expected value the agent could get if it knew it would never learn anything more about the correct value.

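If it learnt immediately and costlessly about the correct values, it would go to $$q=0$$ or $$q=1$$ with probability $$1-p$$ and $$p$$, respectively. Its expected reward is then the corresponding weighted average of the orange curve’s two endpoint values, which is the point on the blue curve (the straight line joining those endpoints) at the x-coordinate $$p$$.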

Thus the green arrow represents the incentive to learn. But if the agent can’t learn easily, it may try to randomise its reward function instead, so the green arrow also represents the incentive to (unbiased) influence.

The shape of the orange curve itself represents the incentive to bias.

If the orange curve is flat (that is, a straight line), it is equal to the blue one, so there is no incentive to learn. If the orange curve is flat and horizontal, there is no incentive to bias, either.
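The curves described above can be reproduced numerically. In this sketch each available policy is summarised, as a modelling assumption, by a pair of expected returns (one under $$R_0$$, one under $$R_1$$); the three policies and their numbers are hypothetical. The orange curve is then the best achievable mixture value, which is convex because it is a maximum of linear functions of $$q$$.

```python
from typing import List, Tuple

# Hypothetical policies, each given as (expected return under R_0, under R_1).
POLICIES: List[Tuple[float, float]] = [
    (1.0, 0.0),   # always tidy
    (0.0, 3.0),   # always cook
    (0.6, 1.2),   # split time between both tasks
]


def orange(q: float) -> float:
    """Best expected value if P(R_1) = q and nothing more will ever be learnt."""
    return max((1 - q) * v0 + q * v1 for v0, v1 in POLICIES)


def blue(p: float) -> float:
    """Expected value if the truth is revealed immediately and costlessly."""
    return (1 - p) * orange(0.0) + p * orange(1.0)


def learning_incentive(p: float) -> float:
    """Length of the green arrow: the gain from resolving the uncertainty."""
    return blue(p) - orange(p)


p = 0.5
print("orange(p):", orange(p))
print("blue(p):  ", blue(p))
print("green arrow:", learning_incentive(p))
# The slope of the orange curve shows the incentive to bias towards R_1.
print("gain from pushing q to 1:", orange(1.0) - orange(p))
```

With these assumed numbers the green arrow is positive at $$p=0.5$$, and the orange curve rises towards $$q=1$$, so the agent would gain both from learning and from pushing the probability of $$R_1$$ upwards.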
