A putative new idea for AI control; index here.
When dealing with the problem of bias, I stumbled upon what I believe is the best way of getting value indifference, one that solves almost all of the problems with the previous methods.
The problems with the old setups
My initial setup has a counterfactual which I’ll call a causal counterfactual, following Jessica’s terminology. Her whole post is a good way of illustrating what is happening. A button is due to be pressed or not pressed, by a human; this pressing will change the agent’s utility function.
There are low-probability quantum events that will artificially depress the button or prevent it from being pressed. The agent only cares about those low-probability events, which allows us to set up the formalism without worrying about changes in the probability of the human pressing the button. Then the events A=“the human presses the button” and A’=“the quantum event presses the button” were taken to be interchangeable, as were B=“the human doesn’t press the button” and B’=“the quantum event prevents the button from being pressed”.
The main problem is that these events are not really the same. So the agent will behave as if certain events are impossible (such as the human genuinely pressing the button), and is likely to face human resistance if the human tries to press the button and the button doesn’t press due to quantum events, or vice versa. This resistance will either lower the expected reward of the agent, or make them manipulate the human actions.
Benya has sketched out a problem with another approach to counterfactuals, which we can call “evidential counterfactuals”. Here we’re trying to fiddle with utilities to enforce that the agent treats the probability of A and B as being fixed (no A’ and B’ here).
The problem with that is that the agent can change the probabilities conditional on some rare event, but then behave as if the probabilities were still unchanged.
For instance, the agent could ensure that A only happens if it wins the lottery. And then it would behave, in effect, as if it had a large chance of winning the lottery (equal to the initial chance of A happening, before the agent took any action). That problem is illustrated here.
A final problem was that neither method allowed the agent to learn usefully: the methods did not distinguish between the agent merely learning that humans intended/didn’t intend to press the button (which is presumably fine), versus the agent influencing the humans to do so (which isn’t).
Solving the problem: prior modification
Here we’ll try and solve the problem in the reward learning POMDP framework, but the solution is easy to generalise. Note that the changing of probabilities of A versus B is similar to biasing, so this approach will attempt to make the agent bias-indifferent in a sensible way.
\(\newcommand{\vd}{P}\newcommand{\vdh}{\widehat{\vd}}\newcommand{\env}{\mu}\newcommand{\mR}{\mathcal{R}}\newcommand{\expect}{\mathbb{E}}\)To do this, take a standard reward learning agent, with posterior \(\vd\) and prior \(\vdh\) on the POMDP \(\env\).
The agent will be modified as follows: it will be given a different prior \(\vdh'\) (which will ultimately lead to a different posterior \(\vd'\)).
The prior \(\vdh'\) and \(\vdh\) will start off equal on the empty history:
 \(\vdh'(\cdot\mid\emptyset)=\vdh(\cdot\mid\emptyset)\).
Then the \(\vdh'\) will be defined inductively. Recall that action \(a\) is unbiased given history \(h_t\) if \(\vdh(\cdot\mid h_{t})=\expect_{\env}^a[\vdh(\cdot\mid h_{t+1})\mid h_t]\). Define the bias of \(a\) as:
 \(B(a,h_t)=\vdh'(\cdot\mid h_{t})-\expect_{\env}^a[\vdh(\cdot\mid h_{t+1})\mid h_t]\).
Thus \(B(a,h_t)\) is the bias of action \(a\), as measured between \(\vdh'\) and the expectation of \(\vdh\). This bias is used as a corrective term to \(\vdh'\), to make the agent suitably indifferent to biasing actions.
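As a minimal sketch of this definition, assume a finite set of reward functions, so that each distribution is just a list of probabilities. The bias is then a componentwise difference. The function names and the toy numbers below are illustrative assumptions, not part of the formalism.

```python
from fractions import Fraction as F

def expected_prior(futures):
    """Expectation of the unmodified prior at h_{t+1}.

    `futures` is a list of (probability, distribution) pairs: the possible
    values of the unmodified prior after action a, with their probabilities
    under the environment.
    """
    n = len(futures[0][1])
    return [sum(p * q[i] for p, q in futures) for i in range(n)]

def bias(modified_prior_now, futures):
    """B(a, h_t): modified prior at h_t minus the expected unmodified prior."""
    exp = expected_prior(futures)
    return [m - e for m, e in zip(modified_prior_now, exp)]

# Hypothetical toy case with two reward functions R_0 and R_1.
prior_now = [F(1, 2), F(1, 2)]
futures = [
    (F(1, 2), [F(7, 10), F(3, 10)]),
    (F(1, 2), [F(1, 2), F(1, 2)]),
]
B = bias(prior_now, futures)  # the action pushes the expectation towards R_0
```

Because both terms are probability distributions, the components of the bias always sum to zero.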
Indifference in this setting is defined by three criteria:
 The agent cannot benefit from a biasing action, if the agent gets no further information about the correct reward.
 The agent cannot benefit from a biasing action, if the agent expects to immediately get perfect information about what the correct reward is.
 Given the above, \(\vdh'\) maintains the distribution and structure of \(\vdh\) as much as possible.
The first criterion is implied by \(\vdh'(\cdot\mid h_{t})=\expect_{\env}^a[\vdh'(\cdot\mid h_{t+1})\mid h_t]\) for all actions \(a\), i.e. \(\vdh'\) being unbiased (if there is no further information about the correct reward, then there is a single well-defined \(\vdh'(\cdot\mid h_{t+1})\), conditional on \(h_t\) and \(a\), and being unbiased means that this is equal to \(\vdh'(\cdot\mid h_t)\)).
Then, given that \(\vdh'\) is unbiased, the second criterion simply means that \(\vdh'\) remains a distribution over \(\mR\) (as the consequence of perfect information is just a weighted average of the `pure \(R_i\)’ points). This seems obvious: what would \(\vdh'\) be but a distribution over \(\mR\)? See the next sections for what this criterion really means.
Indifference for small biases
Let \(S_{h_t,a}=\{\vdh(\cdot\mid h_{t+1}) \mid h_t, a\}\) be the set of possible future values of \(\vdh\) (given the history \(h_t\) and the action \(a\)). Note that \(S_{h_t,a}\) is a subset of the simplex \(\Delta\mR\), the set of probability distributions on \(\mR\).
The bias \(B(h_t,a)\) is `small’ if, for all \(q\in S_{h_t,a}\), \(q+B(h_t,a)\) is also an element of the simplex \(\Delta\mR\).
In that case, \(\vdh'(\cdot\mid h_{t+1})\) is simply defined as \(\vdh(\cdot\mid h_{t+1})+B(h_t,a)\). By assumption, this is an element of \(\Delta\mR\). The expectation of this expression is:
 \(\expect_{\env}^a[\vdh'(\cdot\mid h_{t+1})\mid h_t]=\expect_{\env}^a[\vdh(\cdot\mid h_{t+1})\mid h_t]+B(h_t,a) = \vdh'(\cdot\mid h_t)\).
Thus \(\vdh'\) is unbiased.
And since this is simply a translation, it preserves the structure of \(\vdh\). Thus it satisfies all the criteria above.
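The small-bias case can be checked numerically. The sketch below assumes two reward functions and hypothetical numbers: a modified prior of (1/2, 1/2) at the current history, and an action whose bias is (-1/10, 1/10). It tests whether the bias is small, translates every possible future prior, and confirms that the translated priors average back to the current modified prior.

```python
from fractions import Fraction as F

def in_simplex(q):
    """True if q is a probability distribution."""
    return all(x >= 0 for x in q) and sum(q) == 1

def shift(q, B):
    """Translate the distribution q by the bias vector B."""
    return [x + b for x, b in zip(q, B)]

def is_small(futures, B):
    """The bias is small if every shifted future prior stays in the simplex."""
    return all(in_simplex(shift(q, B)) for _, q in futures)

# (probability, unmodified prior at h_{t+1}) pairs; hypothetical numbers.
futures = [
    (F(1, 2), [F(7, 10), F(3, 10)]),
    (F(1, 2), [F(1, 2), F(1, 2)]),
]
B = [F(-1, 10), F(1, 10)]

shifted = [(p, shift(q, B)) for p, q in futures]
mean = [sum(p * q[i] for p, q in shifted) for i in range(2)]
```

Here `mean` comes out as (1/2, 1/2), the modified prior at the current history, which is the unbiasedness condition.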
Indifference for large biases
If the bias is large, in that there exists a possible value of \(\vdh(\cdot\mid h_{t+1})\) with \(\vdh(\cdot\mid h_{t+1})+B(h_t,a)\) not a point on \(\Delta\mR\), then we need to proceed differently.
As before, let \(S_{h_t,a}=\{\vdh(\cdot\mid h_{t+1}) \mid h_t, a\}\) be the set of possible future values of \(\vdh\) (given the history \(h_t\) and the action \(a\)), and for \(q\in S_{h_t,a}\), let \(p(q)\) be the probability of \(q\), given \(h_t\) and \(a\).
Then we want to replace \(q\) with \(\tau(q)\), where \(\tau(q)\) is `as close to’ \(q+B(h_t,a)\) as possible. Since \(\Delta\mR\) embeds in \(\mathbb{R}^\mR\), the Euclidean metric \(\|\cdot\|\) on the latter restricts to the former.
Thus consider the constrained optimisation problem for \(\tau\):
 Minimise \(\sum_{q \in S_{h_t,a}} p(q)\|\tau(q)-(q+B(h_t,a))\|^2\) subject to:
 \(\sum_{q\in S_{h_t,a}} p(q)\tau(q) = \vdh'(\cdot\mid h_t)\),
 \(\forall q\in S_{h_t,a}: \tau(q)\in\Delta\mR\).
Then define \(\vdh'(\cdot\mid h_{t+1})\) as \(\tau(\vdh(\cdot\mid h_{t+1}))\).
If we see \(\vdh(\cdot\mid h_{t+1})\) and \(\vdh'(\cdot\mid h_{t+1})\) as random variables dependent on \(h_t\) and \(a\), the optimisation problem is the same as saying that \(\vdh'\) is bias-free while \(\vdh(\cdot\mid h_{t+1})-\vdh'(\cdot\mid h_{t+1})\) has minimised variance.
The constraints are not contradictory: for instance \(\tau(q)=\vdh'(\cdot\mid h_t)\) will satisfy them. In fact they are all affine constraints. Then there must exist a unique set of elements \(\tau(q)\) that minimise the strictly convex quadratic function.
And obviously, if \(q+B(h_t,a)\) is always in \(\Delta\mR\), then \(\tau(q)=q+B(h_t,a)\) is the optimal solution, so this optimisation reproduces the `small biases’ case.
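For intuition, here is a minimal sketch of this optimisation in the special case of two reward functions, where a distribution is determined by its weight on \(R_0\) and the simplex is the interval [0, 1]. In that case the optimality conditions say that every target is shifted by one common amount and then clipped to [0, 1]. The sketch finds that shift by bisection. The function name, the restriction to two rewards, and the numbers are assumptions made for illustration, and all probabilities are assumed positive.

```python
def solve_tau(q_weights, probs, bias0, prior0, iters=200):
    """Two-reward version of the constrained optimisation.

    q_weights: weight on R_0 of each possible unmodified prior q at h_{t+1}
    probs:     probability p(q) of each q (positive, summing to 1)
    bias0:     R_0-component of the bias B(h_t, a)
    prior0:    weight on R_0 of the modified prior at h_t (in [0, 1])

    Minimises sum_q p(q) * (x_q - (q + bias0))**2 subject to
    sum_q p(q) * x_q == prior0 and 0 <= x_q <= 1, and returns the x_q.
    """
    targets = [q + bias0 for q in q_weights]

    def clip(v):
        return min(1.0, max(0.0, v))

    def mean_at(lam):
        return sum(p * clip(t + lam) for p, t in zip(probs, targets))

    # mean_at is non-decreasing in lam: 0 at `lo`, 1 at `hi`.
    lo = -max(targets) - 1.0
    hi = 2.0 - min(targets)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean_at(mid) < prior0:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    return [clip(t + lam) for t in targets]

# Hypothetical large-bias case: the first target, 1.0 + 0.25, leaves [0, 1].
tau = solve_tau([1.0, 0.5, 0.0], [0.2, 0.3, 0.5], bias0=0.25, prior0=0.6)
```

In this example the first value is clipped to 1 and the remaining two are each shifted by a further 0.0625, giving approximately (1, 0.8125, 0.3125), whose weighted mean is 0.6. When no target leaves [0, 1], the common shift is zero and the sketch returns the small-bias translation.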
The final values
This alternate prior leads to an alternate posterior \(\vd'\), simply defined by having it equal to \(\vdh'\) on complete histories: \(\vd'(\cdot\mid h_m)=\vdh'(\cdot\mid h_m)\).
Another alternative
It should be noted that if we’re willing to drop the condition `The agent cannot benefit from a biasing action, if the agent expects to immediately get perfect information about what the correct reward is’, then there’s a simpler solution: simply always define \(\vdh'(\cdot\mid h_{t+1})\) as \(\vdh(\cdot\mid h_{t+1})+B(h_t,a)\), applying the solution for small biases to large biases.
This means that \(\vdh'\) (and ultimately \(\vd'\)) need not be elements of \(\Delta\mR\). However, \(\vd'\) can still define a reward the agent can optimise, in the following sense:
 Given a complete history \(h_m\), the agent will maximise the reward \(R'=\sum_i R_i\vd'(R_i\mid h_m)\).
Since \(\vd'\) need not be in \(\Delta\mR\), some of these coefficients can be negative, but that still results in a consistent \(R'\) to maximise.
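A short sketch of why negative coefficients are harmless: the combined reward is just a linear combination of the individual reward functions, which is well defined whatever the signs of the weights. The two toy reward functions and the weights below are hypothetical.

```python
def combined_reward(weights, rewards):
    """Return R' = sum_i weights[i] * rewards[i] as a function of the outcome."""
    def r_prime(outcome):
        return sum(w * r(outcome) for w, r in zip(weights, rewards))
    return r_prime

# Hypothetical reward functions on a numeric outcome.
def r0(x):
    return x

def r1(x):
    return -x * x

# Weights sum to 1 but lie outside the simplex: the second is negative.
r_prime = combined_reward([1.2, -0.2], [r0, r1])
value = r_prime(1.0)  # 1.2 * 1.0 + (-0.2) * (-1.0)
```

The agent can rank outcomes by `r_prime` exactly as it would for weights inside the simplex.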
Properties of the approach
It’s clear the agent is indifferent to bias, but notice that this doesn’t prevent the agent from learning: once it gets an observation, \(\vdh'\) can change significantly. It’s just changes to its expectation that are controlled.
Notice also that the agent doesn’t believe, or act as if it believed, anything unlikely: its bets will be accurate.
And it doesn’t have the problem of lotteries. Assume that the agent has \(\vdh(R_0)=\vdh(R_1)=0.5\), and there is a lottery which the agent has one chance in a million of winning.
Then if it takes action \(a\) which ensures that \(\vd\) chooses \(R_0\) if and only if it wins the lottery, then with probability \(10^{-6}\) it ends up with reward function \(R_0\) and a won lottery, and with probability \(1-10^{-6}\) it ends up with reward function \(R_0\frac{1-2\cdot 10^{-6}}{2(1-10^{-6})} +R_1 \frac{1}{2(1-10^{-6})}\) and a lost lottery. The expected reward function is still \(R_0(0.5)+R_1(0.5)\); it has simply split this expectation differently across worlds where it’s won or not won the lottery.
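The arithmetic of the lottery example can be verified exactly with rational numbers. The variable names are illustrative. Each reward function is represented by its pair of weights on \(R_0\) and \(R_1\).

```python
from fractions import Fraction as F

p = F(1, 10**6)  # chance of winning the lottery

# Weights on (R_0, R_1) in each world.
win = [F(1), F(0)]
lose = [(1 - 2 * p) / (2 * (1 - p)), 1 / (2 * (1 - p))]

# Expected reward function across the two worlds.
expected = [p * w + (1 - p) * l for w, l in zip(win, lose)]
```

The weights in the lost-lottery world sum to one, and `expected` is exactly (1/2, 1/2), matching the initial prior.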
One thing that this approach doesn’t solve is the issue of the agent not following the exact reward function the humans want it to follow, due to accumulated bias. But first note that this will typically encourage the agent not to bias their reward learning, as it will tend to get higher reward when the humans agree with the agent’s reward function. Note secondly that even if the agent manipulates the human values, at the end, to agree with its own, this manipulation, in expectation, simply undoes previous manipulations the agent has done (which caused the biasing in the first place).
Those who find this still unsatisfactory can wait for the next post, where the agent is not simply indifferent to biasing actions, but is penalised for them.
Indifference and bias
Why has indifference been connected with bias, rather than the more general influence? Simply because the evidential counterfactual has problems with bias, meaning that that needs to be corrected first (the causal counterfactual is unbiased and uninfluenceable).
Indeed, we can generalise this solution to the influence problem, where it becomes the counterfactual approach (which I used to call stratification, before I realised what it was). See subsequent posts for this.
