All the indifference designs
discussion post by Stuart Armstrong 297 days ago

A putative new idea for AI control; index here.

The aim of this post is simply to collect all the different “indifference” proposals that I know of. Let me know if I’ve missed any!

The basic idea is that the agent is following policy $$\pi_0$$ with reward $$R_0$$, and there is a button that if pressed (event $$B$$) will change the agent to following policy $$\pi_1$$ with reward $$R_1$$.

## Policy versus reward

My initial proposal considered only change of utility functions – similar to reward functions – and assumed the policy would always be to maximise that utility.

Conversely, the Safely Interruptible Agents paper just looked at policy changes, while sticking to the same reward function.

In general, policy and reward changes can be treated similarly – what matters is the changes between $$V^{\pi_0}(R_0)$$ (the expected value of $$R_0$$ under $$\pi_0$$, as estimated by the agent) and $$V^{\pi_1}(R_1)$$.

Indeed, you could just treat it as a policy change, with the added caveat that $$\pi_i$$ also ‘chooses’ to claim their reward from $$R_i$$.

## Method: Bias penalty or indifference

If we have an initial estimate $$p$$ for the probability of $$B$$ (hence of the agent following $$\pi_1/R_1$$), then if the agent takes an action $$a$$, this probability changes to $$p_a$$.

One way of fighting this is to penalise the agent for large changes $$p \to p_a$$. In the terms of this post, an agent that causes such changes is taking biased actions.

And to combat large biases, the agent could get an extra bias penalty, as explained in the Towards Interactive Inverse Reinforcement Learning poster.

But the rest of this post will focus on bias-indifference, rather than bias penalty.

## Counterfactual: evidential, causal, translation

When comparing $$V^{\pi_0}(R_0)$$ and $$V^{\pi_1}(R_1)$$, it’s important to note that this is a counterfactual comparison: it’s comparing what would have happened under $$\pi_0$$ versus $$\pi_1$$, but only one of those policies is actually going to be happening, depending on whether the button press $$B$$ happens.

### Evidential counterfactual

The first idea is to have some sort of fixed probability for $$B$$ versus $$\neg B$$. And then the agent acts as if the probability cannot be changed by anything the agent does.

Call this the “evidential” counterfactual. The problem with it is that it causes the agent to want to manipulate conditional probabilities (similarly to how evidential decision theory can desire to manipulate the news). Basically, if the agent can decrease the probability of $$B$$ while making the reward given $$B$$ much larger, it will do so (as, by construction, it doesn’t track the reduction in probability).

The factoring out variables approach seems to use this evidential counterfactual, and thus suffers from this problem.
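The manipulation incentive can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration: the two actions, their values, and the fixed probability `FIXED_P` are hypothetical, and `evidential_value` is just one simple way of modelling an agent that treats the probability of $$B$$ as unchangeable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    prob_b: float       # actual probability of B after taking this action
    value_not_b: float  # V^{pi_0}(R_0 | not B) after this action
    value_b: float      # V^{pi_1}(R_1 | B) after this action


def evidential_value(action: Action, fixed_p: float) -> float:
    """Value as seen by an agent that treats P(B) as fixed at fixed_p."""
    return (1 - fixed_p) * action.value_not_b + fixed_p * action.value_b


def true_value(action: Action) -> float:
    """Value when the action's real effect on P(B) is taken into account."""
    return (1 - action.prob_b) * action.value_not_b + action.prob_b * action.value_b


FIXED_P = 0.5  # hypothetical fixed probability of B

# "honest" leaves P(B) alone; "manipulate" makes B very unlikely but
# arranges a large payoff in the rare worlds where B still happens.
honest = Action("honest", prob_b=0.5, value_not_b=1.0, value_b=1.0)
manipulate = Action("manipulate", prob_b=0.01, value_not_b=0.9, value_b=10.0)

chosen = max([honest, manipulate], key=lambda act: evidential_value(act, FIXED_P))
print(chosen.name)
```

In this toy setup the agent prefers `manipulate`, even though its value under the real probabilities is slightly lower than that of `honest`, because the drop in the probability of $$B$$ never enters its calculation.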

### Causal counterfactual

My original idea was that there was some low-probability event $$e$$ such that if $$B$$ happened AND $$e$$ happened, then the utility/policy would not change.

The problem with this is that, as Eliezer and Jessica clarified, this contrasts $$V^{\pi_0}(R_0|B,e)$$ with $$V^{\pi_1}(R_1|B,\neg e)$$, whereas we really want to contrast $$V^{\pi_0}(R_0|\neg B)$$ with $$V^{\pi_1}(R_1|B)$$.

Now, if $$e$$ is of low probability, $$V^{\pi_1}(R_1|B) \approx V^{\pi_1}(R_1|B,\neg e)$$, except for bets that turn on the value of $$e$$. However, the difference between $$V^{\pi_0}(R_0|\neg B)$$ and $$V^{\pi_0}(R_0|B,e)$$ is more problematic. In the first case, humans have chosen not to press the button; in the second, humans have tried to press the button but failed to change the agent’s utility. Obviously human behaviour can be very different in the two cases, translating to very different value estimates.
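The approximation for $$R_1$$, and the way bets on $$e$$ break it, can be checked with a short sketch. It assumes that the probability of $$e$$ given $$B$$ is a known small number `P_E` and uses the law of total expectation; the specific values are hypothetical.

```python
def value_given_b(p_e: float, v_given_e: float, v_given_not_e: float) -> float:
    """V(R_1 | B) by the law of total expectation over e (conditional on B)."""
    return p_e * v_given_e + (1 - p_e) * v_given_not_e


P_E = 1e-6  # hypothetical probability of e given B

# Ordinary situation: values are bounded, so conditioning on "not e"
# changes the estimate by at most about P_E.
ordinary_gap = abs(value_given_b(P_E, 0.0, 1.0) - 1.0)

# A bet that pays 1 / P_E if e happens: the rare branch now contributes
# about 1 to the expectation, so the approximation fails.
bet_gap = abs(value_given_b(P_E, 1.0 + 1.0 / P_E, 1.0) - 1.0)

print(ordinary_gap, bet_gap)
```

The first gap is on the order of `P_E`, while the second is close to 1, which is the sense in which the approximation holds "except for bets that turn on the value of $$e$$".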

### Weight translation

This method relies on some sort of initial ‘default’ probability of $$B$$, call this $$p$$. Then the agent considers an action $$a$$, and this probability changes to $$p_a$$.

Thus after $$a$$, the agent expects to be maximising reward $$R_0$$ with probability $$1-p_a$$, and maximising $$R_1$$ with probability $$p_a$$.

Let $$w_0$$ and $$w_1$$ be the ultimate expected weights of the rewards $$R_0$$ and $$R_1$$. By ultimate expected weights, we mean the expected weights after $$B$$ or $$\neg B$$, and what the weights themselves are will be defined in later sections. The default expected weights are $$1-p$$ and $$p$$, while the expected weights given $$a$$ are $$1-p_a$$ and $$p_a$$.

Then the translation approach wants to map the weights back to the default. Let $$w'_0$$ and $$w'_1$$ be the adjusted weights, then:

• $$\mathbb{E}(w_0)=1-p_a, \mathbb{E}(w_1)=p_a, \mathbb{E}(w'_0)=1-p, \mathbb{E}(w'_1)=p$$.

The rest of this post will focus mainly on the weight translation “counterfactual”.

## Update defaults: no or observations only

The bias penalty, the evidential counterfactual, and the translation approach all depend on some default assessment of the probability of $$B$$. One could either set some fixed probability at the beginning and never update it, or allow it to be updated only by observations, and not by the agent’s own actions.

## Translation type: simple, mean, or specific

Given a prior history $$h$$, let $$p_h$$ be the default probability of $$B$$ and $$p_{ha}$$ be the probability of $$B$$ given that the agent takes action $$a$$. Let the current adjusted weight of $$R_i$$ be $$w_i'(h)$$.

The translation approach aims to correct the expected weights of $$R_0$$ and $$R_1$$ from $$w(ha)$$ back to $$w(h)$$. By definition we know that, if $$o$$ is the next observation and $$\mu(o|ha)$$ is the probability of $$o$$ given $$h$$ and $$a$$:

• $$\sum_{o}\mu(o|ha)w_i'(hao)=w_i'(h)$$.

A simple translation is one where there is a vector $$v = (v_0, v_1)$$, the same for every observation $$o$$, such that $$w_i'(hao) = w_i(hao) + v_i$$ for all $$o$$.
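As a sketch of how a simple translation restores the mean equality, the code below makes three assumptions that go beyond the text: the unadjusted weight of $$R_1$$ after an observation is the probability of $$B$$ given that observation, the weight of $$R_0$$ is one minus that, and the adjusted weights at $$h$$ are the defaults $$(1-p_h, p_h)$$. The observation names and numbers are hypothetical.

```python
def simple_translation(mu, w1, p_h):
    """Shift every post-observation weight by the same vector v.

    mu:  dict mapping observation -> probability of it given h and a
    w1:  dict mapping observation -> unadjusted weight of R_1 after it
    p_h: default probability of B given h
    Returns a dict mapping observation -> (adjusted w_0, adjusted w_1).
    """
    p_ha = sum(mu[o] * w1[o] for o in mu)  # expected weight of R_1 given a
    shift = p_h - p_ha                     # v_1; v_0 is -shift
    return {o: ((1 - w1[o]) - shift, w1[o] + shift) for o in mu}


def mean_weights(mu, adjusted):
    """Expected adjusted weights over the next observation."""
    m0 = sum(mu[o] * adjusted[o][0] for o in mu)
    m1 = sum(mu[o] * adjusted[o][1] for o in mu)
    return m0, m1


mu = {"calm": 0.5, "alarm": 0.5}

# Small translation: the action lowered P(B) from 0.3 to 0.1.
small = simple_translation(mu, {"calm": 0.0, "alarm": 0.2}, p_h=0.3)
print(mean_weights(mu, small))

# Larger spread: the same constant shift pushes one weight above 1.
large = simple_translation(mu, {"calm": 0.0, "alarm": 0.9}, p_h=0.6)
print(large["alarm"], mean_weights(mu, large))
```

In both cases the mean of the adjusted weights returns to the default, but in the second case the adjusted weight of $$R_1$$ after `"alarm"` exceeds 1, so it can no longer be read as a probability.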

A mean translation is one where the mean equality above is the only essential requirement; the $$w_i'(hao)$$ may be constrained in various other ways, but need not all be shifted by the same vector.

A specific translation is one that has specific values for $$w_i'(hao)$$, where the mean equality is a consequence of those specific values. The counterfactual approach can be seen as a specific translation.

## Weights: probabilities or extra rewards

Finally, we have to sort out what we mean by these ‘weights’.

The simplest is that these weights are probabilities. So, just before the choice between $$B$$ and $$\neg B$$, the agent could have weights $$w_{Bi}'$$ for $$B$$ and $$w_{\neg Bi}'$$ for $$\neg B$$. Then if $$B$$ happens, the agent has reward $$R_i$$ with probability $$w_{Bi}'$$; and if $$\neg B$$ happens, it has $$R_i$$ with probability $$w_{\neg Bi}'$$.

If the AI didn’t take any actions at all, then $$w_{B1}'=w_{\neg B0}'=1$$ and $$w_{B0}'=w_{\neg B1}'=0$$, same as the original values.

The problem with that approach is that we must ensure the weights are constrained between $$0$$ and $$1$$.

Alternatively, changes in weights can be seen as adding extra rewards to the final rewards, rather than changing their probabilities. Thus if $$w_{B1}' = 1+a$$ and $$B$$ happens, the agent’s reward function is $$(1+a)R_1 - aR_0$$, and if $$w_{\neg B0}' = 1+b$$ and $$\neg B$$ happens, the agent’s reward function is $$(1+b)R_0 - bR_1$$.

This has the advantage that the weights may be negative or higher than one, but the disadvantage that it may result in unusual mixed reward functions.
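A minimal sketch of the extra-rewards reading follows. It assumes that the two weights after the button event sum to one, so excess weight on the reward selected by the event is offset by an equal negative weight on the other reward. The reward functions `r0` and `r1` and the outcome fields are hypothetical placeholders.

```python
from typing import Callable, Dict

Outcome = Dict[str, float]
Reward = Callable[[Outcome], float]


def mixed_reward(primary: Reward, other: Reward, excess: float) -> Reward:
    """Build the reward (1 + excess) * primary - excess * other."""
    def reward(outcome: Outcome) -> float:
        return (1 + excess) * primary(outcome) - excess * other(outcome)
    return reward


# Hypothetical rewards: R_0 scores the original task, R_1 scores shutting down.
def r0(outcome: Outcome) -> float:
    return outcome["task_done"]


def r1(outcome: Outcome) -> float:
    return outcome["shut_down"]


# After B, with an adjusted weight of 1.25 on R_1 (excess 0.25).
after_b = mixed_reward(r1, r0, excess=0.25)

print(after_b({"task_done": 0.0, "shut_down": 1.0}))
print(after_b({"task_done": 1.0, "shut_down": 0.0}))
```

With zero excess the function reduces to the plain reward. With positive excess, the second outcome receives a negative reward purely for having achieved the other objective, which is one example of the unusual mixed rewards mentioned above.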

## Examples

Given these terms, the indifference approach I described as the best is Method: indifference, Counterfactual: translation, Update defaults: observations only, Translation type: simple for small translations, mean for large ones, and Weights: probabilities.

One could imagine slightly tweaking that approach, by using extra rewards for weights, and dropping the complicated conditions needed to keep the weights bounded between $$0$$ and $$1$$, allowing simple translations always. This would result in: Method: indifference, Counterfactual: translation, Update defaults: observations only, Translation type: simple, and Weights: extra rewards.

Finally, the counterfactual approach can be seen as: Method: indifference, Counterfactual: translation, Update defaults: observations only, Translation type: specific, and Weights: probabilities.

by Patrick LaVictoire, 281 days ago:

Question that I haven’t seen addressed (and haven’t worked out myself): which of these indifference methods are reflectively stable, in the sense that the AI would not push a button to remove them (or switch to a different indifference method)?
