by Stuart Armstrong 360 days ago

Suppose the humans have already decided whether to press the shutdown button or to order the AI to maximise paperclips. Let $$o_s$$ be the observation of the shutdown command and $$o_p$$ the observation of the paperclip-maximising command, with $$u_s$$ and $$u_p$$ the corresponding utilities. Then $$P$$ can be defined by $$P(u_s|h_{m-1}o_s)=1$$ and $$P(u_p|h_{m-1}o_p)=1$$, for all histories $$h_{m-1}$$. Next, define $$\widehat{P}$$ as the probability of $$o_s$$ versus $$o_p$$, conditional on the agent following a particular deterministic policy $$\pi^0$$. If the agent does indeed follow $$\pi^0$$, then $$\widehat{P}=\widehat{P}'$$. If it deviates from this policy, then $$\widehat{P}'$$ is altered in proportion to the expected change in $$\widehat{P}$$ caused by choosing a different action.
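To make the two definitions concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the string labels, the `toy_model` function, its 1/4 probability, and the action names are all hypothetical and do not come from the comment. The sketch covers $$P$$ and $$\widehat{P}$$ only; it does not attempt to model $$\widehat{P}'$$, whose definition is not given here.

```python
from fractions import Fraction

O_S, O_P = "o_s", "o_p"   # observations: shutdown command, paperclip command
U_S, U_P = "u_s", "u_p"   # the corresponding utilities


def P(utility, history):
    """P(utility | history), for a history that ends in o_s or o_p.

    Everything before the final observation (h_{m-1}) is ignored,
    matching "for all histories h_{m-1}".
    """
    last = history[-1]
    if last == O_S:
        return 1 if utility == U_S else 0
    if last == O_P:
        return 1 if utility == U_P else 0
    raise ValueError("history must end in o_s or o_p")


def P_hat(policy, model, history):
    """Probability of o_s versus o_p, given that the agent follows `policy`.

    `model` is a hypothetical stand-in for the agent's beliefs: it maps
    (history, action) to the probability of seeing the shutdown command.
    """
    p_shutdown = model(history, policy(history))
    return {O_S: p_shutdown, O_P: 1 - p_shutdown}


def pi0(history):
    """Assumed deterministic default policy: always take the action 'wait'."""
    return "wait"


def toy_model(history, action):
    """Assumed toy beliefs: 1/4 chance of shutdown unless the agent interferes."""
    return Fraction(1, 4) if action == "wait" else Fraction(0)


print(P(U_S, ("h", O_S)), P(U_P, ("h", O_S)))
print(P_hat(pi0, toy_model, ()))
```

Because `P_hat` is always evaluated with the fixed policy `pi0`, its value does not depend on what the agent actually does, which is the property the definition relies on.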
