by Jessica Taylor 366 days ago If we apply this to the shutdown problem, is it acceptable to say: $$\hat{P}(\cdot | h_t) = 100\% ~ U_N \text{ if the button has not been pressed in } h_t$$ $$\hat{P}(\cdot | h_t) = 100\% ~ U_S \text{ otherwise}$$ If not, what would you set $$\hat{P}$$ to? (I’m treating $$U_N$$ and $$U_S$$ as reward functions here, which seems fine.)

 by Stuart Armstrong 366 days ago For policies/actions that don’t affect the probability of humans pressing the button, $$\widehat{P}=P$$. For actions that do affect the probability a little bit, the effect of $$\widehat{P}$$ will be to undo this, by, for instance, slightly increasing the probability of $$U_S$$ given the button was pressed. I’m not completely sure what multiple actions with large changes of probability would lead to (in expectation, nothing, but in actual fact…)
 by Jessica Taylor 365 days ago Hmm… I’m finding that I’m unable to write down a simple shutdown problem in this framework (e.g. an environment where it should switch between maximizing paperclips and shutting down) to analyze what this algorithm does. To know what the algorithm does, I need to know what $$P$$ and $$\hat{P}$$ are (since these are parameters of the algorithm). From those I can derive $$P'$$ and $$\hat{P}'$$ to determine the agent’s action. But at the moment I have no way of proceeding, since I don’t know what $$P$$ and $$\hat{P}$$ are. Can you get me unstuck?
 by Stuart Armstrong 360 days ago Suppose the humans have already decided whether to press the shutdown button or order the AI to maximise paperclips. If $$o_s$$ is the observation of the shutdown command and $$o_p$$ the observation of the paperclip-maximising command, and $$u_s$$ and $$u_p$$ are the corresponding utilities, then $$P$$ can be defined as $$P(u_s|h_{m-1}o_s)=1$$ and $$P(u_p|h_{m-1}o_p)=1$$, for all histories $$h_{m-1}$$. Then define $$\widehat{P}$$ as the probability of $$o_s$$ versus $$o_p$$, conditional on the agent following a particular deterministic policy $$\pi^0$$. If the agent does indeed follow $$\pi^0$$, then $$\widehat{P}=\widehat{P}'$$. If it deviates from this policy, then $$\widehat{P}'$$ is altered in proportion to the expected change in $$\widehat{P}$$ caused by choosing a different action.
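
To make the setup in the comment above concrete, here is a minimal Python sketch of a toy version. Everything in it is an illustrative assumption rather than something specified in the thread: the action names, the probabilities in `PRESS_PROB`, and the helper names `P`, `pi_0`, `P_hat` and `expected_change` are all hypothetical. It encodes only $$P$$, $$\widehat{P}$$ under $$\pi^0$$, and the shift in the probability of $$o_s$$ when the agent deviates; it does not attempt to implement the adjustment that produces $$\widehat{P}'$$, since the thread does not give that formula.

```python
from fractions import Fraction

# Hypothetical toy model: all names and numbers below are illustrative assumptions.
O_S, O_P = "o_s", "o_p"  # observed shutdown command / paperclip command
U_S, U_P = "u_s", "u_p"  # shutdown utility / paperclip utility


def P(utility, observation):
    """P(u | h_{m-1} o): the observed command fixes the utility, for any history."""
    correct = {O_S: U_S, O_P: U_P}[observation]
    return 1 if utility == correct else 0


# Assumed probability that the observed command is o_s, given the agent's action.
PRESS_PROB = {
    "baseline": Fraction(1, 2),
    "discourage_press": Fraction(3, 10),
}


def pi_0(history):
    """Assumed deterministic reference policy pi^0: always take the baseline action."""
    return "baseline"


def P_hat(observation, history=()):
    """hat P: probability of o_s versus o_p if the agent follows pi^0."""
    p_s = PRESS_PROB[pi_0(history)]
    return p_s if observation == O_S else 1 - p_s


def expected_change(action, history=()):
    """Shift in the probability of o_s caused by taking `action` instead of pi^0's."""
    return PRESS_PROB[action] - PRESS_PROB[pi_0(history)]
```

In this toy model, following $$\pi^0$$ gives `expected_change("baseline") == 0`, matching the case $$\widehat{P}=\widehat{P}'$$, while a deviating action gives a nonzero shift, which is the quantity the comment says $$\widehat{P}'$$ should compensate for.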
