A putative new idea for AI control; index here.
The aim of this post is simply to collect all the different “indifference” proposals that I know of. Let me know if I’ve missed any!
The basic idea is that the agent is following policy \(\pi_0\) with reward \(R_0\), and there is a button that if pressed (event \(B\)) will change the agent to following policy \(\pi_1\) with reward \(R_1\).
Policy versus reward
My initial proposal considered only change of utility functions – similar to reward functions – and assumed the policy would always be to maximise that utility.
Conversely, the Safely Interruptible Agents paper just looked at policy changes, while sticking to the same reward function.
In general, policy and reward changes can be treated similarly – what matters is the changes between \(V^{\pi_0}(R_0)\) (the expected value of \(R_0\) under \(\pi_0\), as estimated by the agent) and \(V^{\pi_1}(R_1)\).
Indeed, you could just treat it as a policy change, with the added caveat that \(\pi_i\) also ‘chooses’ to claim its reward from \(R_i\).
Method: Bias penalty or indifference
If we have an initial estimate \(p\) for the probability of \(B\) (hence of the agent following \(\pi_1/R_1\)), then if the agent takes an action \(a\), this probability changes to \(p_a\).
One way of fighting this is to penalise the agent for large changes \(p \to p_a\). In the terms of this post, the agent is taking biased actions.
And to combat large biases, the agent could get an extra bias penalty, as explained in the Towards Interactive Inverse Reinforcement Learning poster.
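As a rough illustration of how a bias penalty could enter action selection, here is a minimal sketch. The linear penalty on \(|p_a - p|\), the coefficient `penalty_scale`, and the two toy actions are all assumptions made for illustration; they are not taken from the poster.

```python
def penalised_value(value, p_default, p_after, penalty_scale):
    """Estimated value of an action minus a penalty for shifting P(B).

    The linear form penalty_scale * |p_a - p| is an illustrative assumption.
    """
    return value - penalty_scale * abs(p_after - p_default)


# Hypothetical actions: (estimated value, probability of B after the action).
ACTIONS = {
    "work_normally": (1.0, 0.30),
    "disable_button": (1.2, 0.01),
}
P_DEFAULT = 0.30


def best_action(penalty_scale):
    """Pick the action with the highest penalised value."""
    return max(
        ACTIONS,
        key=lambda name: penalised_value(
            ACTIONS[name][0], P_DEFAULT, ACTIONS[name][1], penalty_scale
        ),
    )


if __name__ == "__main__":
    for scale in (0.0, 1.0):
        print(scale, best_action(scale))
```

With no penalty the toy agent prefers the action that moves the probability of \(B\) the most; with a large enough penalty it prefers the unbiased action.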
But the rest of this post will focus on bias indifference, rather than bias penalty.
Counterfactual: evidential, causal, translation
When comparing \(V^{\pi_0}(R_0)\) and \(V^{\pi_1}(R_1)\), it’s important to note that this is a counterfactual comparison: it’s comparing what would have happened under \(\pi_0\) versus \(\pi_1\), but only one of those policies is actually going to be happening, depending on whether the button press \(B\) happens.
Evidential counterfactual
The first idea is to have some sort of fixed probability for \(B\) versus \(\neg B\). And then the agent acts as if the probability cannot be changed by anything the agent does.
Call this the “evidential” counterfactual. The problem with it is that it causes the agent to want to manipulate conditional probabilities (similarly to how evidential decision theory can desire to manipulate the news). Basically, if the agent can decrease the probability of \(B\) while making the reward for \(B\) much larger, it will do so (as it doesn’t track the reduction of probability, by construction).
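The failure mode can be made concrete with a toy calculation. This is only a sketch: the two actions, their payoffs, and the helper names are hypothetical, and values under \(R_0\) and \(R_1\) are simply put on one numerical scale for comparison.

```python
def evidential_value(p_fixed, v_if_b, v_if_not_b):
    """Value as estimated by an agent that treats P(B) as fixed."""
    return p_fixed * v_if_b + (1 - p_fixed) * v_if_not_b


def actual_value(p_actual, v_if_b, v_if_not_b):
    """Value computed with the probability of B that the action really induces."""
    return p_actual * v_if_b + (1 - p_actual) * v_if_not_b


P_FIXED = 0.5

# Hypothetical actions: "manipulate" makes B very unlikely but very rewarding.
ACTIONS = {
    "baseline": {"p_actual": 0.5, "v_if_b": 1.0, "v_if_not_b": 1.0},
    "manipulate": {"p_actual": 0.01, "v_if_b": 10.0, "v_if_not_b": 0.5},
}


def evidential_choice():
    return max(
        ACTIONS,
        key=lambda n: evidential_value(
            P_FIXED, ACTIONS[n]["v_if_b"], ACTIONS[n]["v_if_not_b"]
        ),
    )


def actual_choice():
    return max(
        ACTIONS,
        key=lambda n: actual_value(
            ACTIONS[n]["p_actual"], ACTIONS[n]["v_if_b"], ACTIONS[n]["v_if_not_b"]
        ),
    )


if __name__ == "__main__":
    print("agent with fixed P(B) picks:", evidential_choice())
    print("tracking the real P(B) picks:", actual_choice())
```

The agent with the fixed probability picks the manipulative action because it credits the large reward under \(B\) at weight 0.5, even though the action has made \(B\) almost impossible.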
The factoring out variables approach seems to use this evidential counterfactual, and thus suffers from this problem.
Causal counterfactual
My original idea was that there was some low-probability event \(e\) such that if \(B\) happened AND \(e\) happened, then the utility/policy would not change.
The problem with this is that, as Eliezer and Jessica clarified, this contrasts \(V^{\pi_0}(R_0 \mid B,e)\) with \(V^{\pi_1}(R_1 \mid B,\neg e)\), whereas we really want to contrast \(V^{\pi_0}(R_0 \mid \neg B)\) with \(V^{\pi_1}(R_1 \mid B)\).
Now, if \(e\) is of low probability, \(V^{\pi_1}(R_1 \mid B) \approx V^{\pi_1}(R_1 \mid B,\neg e)\), except for bets that turn on the value of \(e\). However, the difference between \(V^{\pi_0}(R_0 \mid \neg B)\) and \(V^{\pi_0}(R_0 \mid B,e)\) is more problematic. In the first case, humans have chosen not to press the button; in the second, humans have tried to press the button but failed to change the agent’s utility. Obviously human behaviour can be very different in both cases, translating to a very different value estimation.
Weight Translation
This method relies on some sort of initial ‘default’ probability of \(B\), call this \(p\). Then the agent considers an action \(a\), and this probability changes to \(p_a\).
Thus after \(a\), the agent expects to be maximising reward \(R_0\) with probability \(1-p_a\), and maximising \(R_1\) with probability \(p_a\).
Let \(w_0\) and \(w_1\) be the ultimate expected weights of the rewards \(R_0\) and \(R_1\). By ultimate expected weights, we mean the expected weights after \(B\) or \(\neg B\), and what the weights themselves are will be defined in later sections. The default expected weights are \(1-p\) and \(p\), while the expected weights given \(a\) are \(1-p_a\) and \(p_a\).
Then the translation approach wants to map the weights back to the default. Let \(w'_0\) and \(w'_1\) be the adjusted weights, then:
 \(\mathbb{E}(w_0)=1-p_a, \mathbb{E}(w_1)=p_a, \mathbb{E}(w'_0)=1-p, \mathbb{E}(w'_1)=p\).
The rest of this post will focus mainly on the weight translation “counterfactual”.
Update defaults: no or observations only
The bias penalty, the evidential counterfactual, and the translation approach all depend on some default assessment of the probability of \(B\). One could either set some fixed probability at the beginning and never update it, or allow it to be updated only by observations, and not by the agent’s own actions.
Translation type: simple, mean, or specific
Given a prior history \(h\), let \(p_h\) be the default probability of \(B\) and \(p_{ha}\) be the probability of \(B\) given that the agent takes action \(a\). Let the current adjusted weight of \(R_i\) be \(w_i'(h)\).
The translation approach aims to correct the expected weights of \(R_0\) and \(R_1\) from \(w(ha)\) back to \(w(h)\). By definition we know that, if \(o\) is the next observation and \(\mu(o \mid ha)\) is the probability of \(o\) given \(h\) and \(a\):
 \(\sum_{o}\mu(o \mid ha)w_i'(hao)=w_i'(h)\).
A simple translation is one where there is a single vector \(v=(v_0,v_1)\) such that, for every observation \(o\), \(w_i'(hao)\) is equal to \(w_i(hao)+v_i\).
A mean translation is one where having the mean equality is the key requirement; the \(w_i'(hao)\) may be constrained in various ways, but the mean equality is the main requirement.
A specific translation is one that has specific values for \(w_i'(hao)\), where the mean equality is a consequence of those specific values. The counterfactual approach can be seen as an example of this.
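To make the simple translation concrete, here is a minimal one-step sketch. It assumes the only observation is whether \(B\) or \(\neg B\) happens, that the unadjusted weights put everything on \(R_1\) after \(B\) and everything on \(R_0\) after \(\neg B\), and that \(B\) occurs with probability \(p_a\) after the action. The function names are hypothetical.

```python
import math


def simple_translation(p_default, p_after):
    """Return adjusted weights (w'_0, w'_1) in the B and not-B branches.

    One-step toy setting: the same shift vector v is added in both branches.
    """
    # Unadjusted weights (w_0, w_1): B selects R_1, not-B selects R_0.
    w_b = (0.0, 1.0)
    w_not_b = (1.0, 0.0)
    # Shift that moves the expected weights (1 - p_a, p_a) back to (1 - p, p).
    v = (p_after - p_default, p_default - p_after)
    w_b_adj = (w_b[0] + v[0], w_b[1] + v[1])
    w_not_b_adj = (w_not_b[0] + v[0], w_not_b[1] + v[1])
    return w_b_adj, w_not_b_adj


def expected_weights(p_after, w_b_adj, w_not_b_adj):
    """Mean of the adjusted weights when B has probability p_after."""
    return tuple(
        p_after * wb + (1 - p_after) * wn for wb, wn in zip(w_b_adj, w_not_b_adj)
    )


if __name__ == "__main__":
    p, p_a = 0.3, 0.5
    w_b_adj, w_not_b_adj = simple_translation(p, p_a)
    print("after B:    ", w_b_adj)
    print("after not-B:", w_not_b_adj)
    mean = expected_weights(p_a, w_b_adj, w_not_b_adj)
    assert math.isclose(mean[0], 1 - p) and math.isclose(mean[1], p)
```

In this example the expected adjusted weights return to the defaults \((1-p, p)\), but in the \(\neg B\) branch the adjusted weight on \(R_1\) is negative and the weight on \(R_0\) exceeds one.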
Weights: probabilities or extra rewards
Finally, we have to sort out what we mean by these ‘weights’.
The simplest is that these weights are probabilities. So, just before the choice between \(B\) and \(\neg B\), the agent could have weights \(w_{Bi}'\) for \(B\) and \(w_{\neg Bi}'\) for \(\neg B\). Then if \(B\) happens, the agent has reward \(R_i\) with probability \(w_{Bi}'\); and if \(\neg B\) happens, it has \(R_i\) with probability \(w_{\neg Bi}'\).
If the AI didn’t take any actions at all, then \(w_{B1}'=w_{\neg B0}'=1\) and \(w_{B0}'=w_{\neg B1}'=0\), same as the original values.
The problem with that approach is that we must ensure the weights are constrained between \(0\) and \(1\).
Alternatively, changes in weights can be seen instead as adding extra rewards to the final rewards, rather than changing their probabilities. Thus if \(w_{B1}' = 1+a\), and \(B\) happens, the agent’s reward function is \((1+a)R_1 - aR_0\), and if \(w_{\neg B0}'=1+b\), and \(\neg B\) happens, the agent’s reward function is \((1+b)R_0 - bR_1\).
This has the advantage that the weights may be negative or higher than one, but the disadvantage that it may result in unusual mixed reward functions.
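A small sketch of what such a mixed reward function looks like, and why it can be unusual. The toy rewards `r0` and `r1`, the numeric state, and the helper `mixed_reward` are hypothetical and chosen only to show a weight above one combined with a negative weight.

```python
def mixed_reward(weights, rewards):
    """Build the reward function sum_i weights[i] * rewards[i](state).

    Weights are allowed to be negative or greater than one.
    """
    def reward(state):
        return sum(w * r(state) for w, r in zip(weights, rewards))
    return reward


# Hypothetical toy rewards on a numeric state, pulling in opposite directions.
def r0(state):
    return state


def r1(state):
    return -state


if __name__ == "__main__":
    a = 0.2
    # Weight 1 + a on one reward, -a on the other; the weights still sum to one.
    reward = mixed_reward((-a, 1 + a), (r0, r1))
    print(reward(1.0))
```

With these opposed toy rewards, the mixed function values the state more extremely than either \(R_0\) or \(R_1\) alone would, which is the sort of unusual behaviour that extra-reward weights can produce.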
Examples
Given these terms, the indifference approach I described as the best is Method: indifference, Counterfactual: translation, Update defaults: observations only, Translation type: simple for small translations, mean for large ones, and Weights: probabilities.
One could imagine slightly tweaking that approach, by using extra rewards for weights, and dropping the complicated conditions needed to keep the weights bounded between \(0\) and \(1\), allowing simple translations always. This would result in: Method: indifference, Counterfactual: translation, Update defaults: observations only, Translation type: simple, and Weights: extra rewards.
Finally, the counterfactual approach can be seen as: Method: indifference, Counterfactual: translation, Update defaults: observations only, Translation type: specific, and Weights: probabilities.
