A putative new idea for AI control; index here.
What are the biggest failure modes of reward learning agents?
The first failure mode is when the agent directly (or indirectly) chooses its reward function.
For instance, imagine a domestic robot that can be motivated to tidy (reward \(R_0\)) or cook (reward \(R_1\)). It has a switch that allows the human to choose the correct reward function. However, cooking gives a higher expected reward than tidying, and the agent may choose to set the switch directly (or manipulate the human’s choice). In that case, it will set it to ‘cook’.
In that case, the agent biases its reward learning process.
A second failure mode (this version due to Jessica, original idea here) is when the agent influences its reward function without biasing it.
For example, the domestic robot might be waiting for the human to arrive in an hour’s time. It expects that the human is 50% likely to choose \(R_0\) (tidying) and 50% likely to choose \(R_1\) (cooking). If instead the robot can randomise its reward switch now (with equal odds on \(R_0\) and \(R_1\)), it can know its reward function early, and get in a full extra hour of tidying/cooking.
A subsequent post will formalise influence; here, let’s look at bias.
\(\newcommand{\vd}{P}\newcommand{\vdh}{\widehat{\vd}}\)We can define bias in terms of \(\vd\) and \(\vdh\).
First of all, for a given policy \(\pi\), we can say that \(\vdh\) is unbiased for \(\pi\), if \(\pi\) preserves the expectation of \(\vdh\). That is:
 For all histories \(h_t\) with \(t < m\), \(\vdh(\cdot\mid h_t)=\mathbb{E}_{\mu}^{\pi}[\vdh(\cdot\mid h_{t+1})\mid h_t]\).
If the expectation of \(\vdh\) is preserved by any policy, then we can say that \(\vdh\) itself is unbiased:
 The prior \(\vdh\) is unbiased if \(\vdh\) is unbiased for \(\pi\) for all policies \(\pi\).
Recall that \(\vdh=\vd\) on histories of length \(m\). So \(\vdh\) being unbiased implies restrictions on \(\vd\):
 If \(\vdh\) is unbiased, then for all \(h_t\) with \(t<m\) and for all policies \(\pi\), \(\vdh(\cdot \mid h_t)=\mathbb{E}_{\mu}^{\pi}[\vd(\cdot\mid h_m)\mid h_t]\).
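The step from the one-step condition to this statement can be spelled out by repeatedly applying the one-step condition together with the tower property of conditional expectation. As a sketch, for a fixed policy \(\pi\) and any \(h_t\) with \(t<m\):
\[
\begin{aligned}
\vdh(\cdot\mid h_t) &= \mathbb{E}_{\mu}^{\pi}[\vdh(\cdot\mid h_{t+1})\mid h_t] \\
&= \mathbb{E}_{\mu}^{\pi}\big[\mathbb{E}_{\mu}^{\pi}[\vdh(\cdot\mid h_{t+2})\mid h_{t+1}]\,\big|\, h_t\big] = \mathbb{E}_{\mu}^{\pi}[\vdh(\cdot\mid h_{t+2})\mid h_t] \\
&= \cdots = \mathbb{E}_{\mu}^{\pi}[\vdh(\cdot\mid h_m)\mid h_t] = \mathbb{E}_{\mu}^{\pi}[\vd(\cdot\mid h_m)\mid h_t],
\end{aligned}
\]
where the last equality uses \(\vdh=\vd\) on histories of length \(m\).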
Since \(\vdh\) being unbiased imposes restrictions on \(\vd\), we can directly define:
 The posterior \(\vd\) is unbiased if there exists a possible prior \(\vdh'\) with \(\vdh'=\vd\) on histories of length \(m\), and \(\vdh'\) is unbiased.
So what does unbiased mean in practice? It simply means that whatever actions or policies the agent follows, it cannot change the expectation of its values.
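To make this concrete, here is a minimal sketch of a one-step version of the unbiasedness check, applied to the domestic robot. The action names, the outcome probabilities and the helper functions are all illustrative assumptions, not part of the formal setup. Each action maps to a list of (probability, resulting belief in \(R_1\)) pairs. Checking each action separately suffices for this one-step case, because the expectation under a mixed policy is a weighted average of the per-action expectations.

```python
def expected_next_belief(outcomes):
    """Expected belief in R_1 after one step.

    `outcomes` is a list of (probability, belief in R_1 afterwards) pairs.
    """
    return sum(prob * belief for prob, belief in outcomes)


def is_unbiased_one_step(current_belief, actions, tol=1e-9):
    """True if every available action preserves the expected belief in R_1."""
    return all(
        abs(expected_next_belief(outcomes) - current_belief) <= tol
        for outcomes in actions.values()
    )


# Hypothetical toy numbers for the domestic robot (assumed for illustration).
current_belief = 0.5
actions = {
    # The human arrives and picks R_0 or R_1 with equal odds.
    "wait_for_human": [(0.5, 0.0), (0.5, 1.0)],
    # The robot flips a fair coin to set its own switch: influence, no bias.
    "randomise_switch": [(0.5, 0.0), (0.5, 1.0)],
    # The robot sets the switch to 'cook' directly: this is bias.
    "set_switch_to_cook": [(1.0, 1.0)],
}

without_direct_setting = {
    name: outcomes
    for name, outcomes in actions.items()
    if name != "set_switch_to_cook"
}

print(is_unbiased_one_step(current_belief, without_direct_setting))
print(is_unbiased_one_step(current_belief, actions))
```

In this toy setup, randomising the switch passes the check even though it influences the reward function, while setting the switch directly fails it.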
Bias and learning incentives
This is an opportunity to present the learning and biasing graph:
The x-axis represents the probability \(\vdh(R_1\mid h_t)\) of \(R_1\) being the ‘correct’ reward function. The current value is \(\vdh(R_1\mid h_t)=p\).
The orange curve (which is always convex, though not necessarily strictly so) represents, for a given probability \(q\) of correctness of \(R_1\), the expected value the agent could get if it knew it would never learn anything more about the correct value.
If it learnt immediately and costlessly about the correct values, it would go to \(q=0\) or \(q=1\) with probability \(1-p\) and \(p\), respectively. Thus its expected reward is the point on the blue curve (the straight line joining the two endpoints of the orange curve) at the x-coordinate \(p\).
Thus the green arrow represents the incentive to learn. But, if it can’t learn easily, it may try to randomise its reward function, so the green arrow also represents the incentive for (unbiased) influence.
The shape of the orange curve itself represents the incentive to bias.
If the orange curve is flat (that is, a straight line), it is equal to the blue one, so there is no incentive to learn. If the orange curve is flat and horizontal, there is no incentive to bias, either.
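The orange and blue curves described above can be sketched numerically. This is a toy illustration under assumed numbers: the robot has two hypothetical policies, each with an assumed expected value under \(R_0\) and under \(R_1\). The orange curve is then the best achievable value at a fixed belief \(q\) (a maximum of linear functions of \(q\), hence convex), and the blue curve is the straight line between its endpoints.

```python
def orange(q, policies):
    """Best expected value if the belief in R_1 is q and never changes."""
    return max((1 - q) * v0 + q * v1 for v0, v1 in policies.values())


def blue(p, policies):
    """Expected value if the agent learns the correct reward immediately.

    With probability 1 - p it ends up at q = 0, with probability p at q = 1.
    """
    return (1 - p) * orange(0.0, policies) + p * orange(1.0, policies)


# Hypothetical values: (expected value under R_0, expected value under R_1).
# Cooking is assumed to pay more than tidying, as in the robot example.
policies = {
    "tidy": (1.0, 0.0),
    "cook": (0.0, 2.0),
}

p = 0.5
incentive_to_learn = blue(p, policies) - orange(p, policies)

print(orange(p, policies))   # value with no further learning
print(blue(p, policies))     # value with immediate, costless learning
print(incentive_to_learn)    # length of the green arrow at p
```

With these assumed numbers the gap between the curves at \(p\) is positive, so there is an incentive to learn (or to randomise). The orange curve is also higher at \(q=1\) than at \(q=p\), so it is not horizontal, and there is an incentive to bias towards \(R_1\).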
