# Indifference and compensatory rewards

by Stuart Armstrong

It's occurred to me that there is a framework in which all "indifference" results can be seen as corrective rewards, both for utility-function-change indifference and for policy-change indifference.

Imagine that the agent has reward $$R_0$$ and is following policy $$\pi_0$$, and we want to change it to having reward $$R_1$$ and following policy $$\pi_1$$. Then the corrective reward we need to pay it, so that it doesn't attempt to resist or cause that change, is simply the difference between the two expected values:

$$V(R_0|\pi_0) - V(R_1|\pi_1),$$

where $$V$$ is the agent's own valuation of the expected reward, conditional on the policy.

This explains why off-policy reward-based agents are already safely interruptible: since we change the policy, not the reward, $$R_0 = R_1$$. And since off-policy agents have value estimates that are indifferent to the policy followed, $$V(R_0|\pi_0) = V(R_1|\pi_1)$$, so the compensatory rewards are zero.
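The off-policy case can be sketched in a few lines. This is a minimal toy illustration, not the post's formal construction: the function names and the fixed value estimate are assumptions made for the example. The point is only that when the agent's valuation ignores the policy argument, the compensatory reward for a policy change is zero.

```python
def compensatory_reward(V, R0, pi0, R1, pi1):
    """Reward to pay the agent so it neither resists nor causes the change
    from (R0, pi0) to (R1, pi1): the difference V(R0|pi0) - V(R1|pi1)."""
    return V(R0, pi0) - V(R1, pi1)

def off_policy_V(R, pi):
    """An off-policy agent's value estimate: it evaluates the reward
    (e.g. under the greedy policy) regardless of the policy followed."""
    del pi  # the estimate is indifferent to the policy argument
    return R["expected_optimal_value"]  # stand-in for the agent's valuation

# Interruption changes the policy but not the reward, so R0 == R1:
R = {"expected_optimal_value": 10.0}
payment = compensatory_reward(off_policy_V, R, "pi_0", R, "pi_interrupted")
print(payment)  # → 0.0
```

Because `off_policy_V` discards its policy argument, the two terms of the difference coincide whenever the reward is unchanged, which is exactly the safe-interruptibility observation above.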
