Intelligent Agent Foundations Forumsign up / log in
by Patrick LaVictoire 595 days ago | Jessica Taylor likes this | link | parent

It’s illustrating the failure of a further desideratum for the shutdown problem: we would like the AI to be able to update on and react to things that happen in the world which correlate with a certain channel, and yet still not attempt to influence that channel.

For motivation, assume a variant on the paperclip game:

  • the humans can be observed reaching for the button several turns before it is pressed
  • the humans’ decision to press the button is a stochastic function of environmental variables (like seeing that the AI has unexpectedly been hit by lightning, or has started producing Too Many Paperclips, etc)

We would like a solution which in some sense updates on the precursors to shutdown and minimizes the damage while still not attempting to influence the button press. (If doing such a thing robustly is impossible, we would like to discover this; Jessica mentioned that there is a version which does this but is not reflectively consistent.)

Intuitively, I could imagine a well-constructed AI reasoning “oh, they’re showing signs that they’re going to shut me down, guess my goal is wrong, I’ll initiate Safe Shutdown Protocol now rather than risk doing further damage”, but current formalizations don’t do this.



NEW LINKS

NEW POSTS

NEW DISCUSSION POSTS

RECENT COMMENTS

This is exactly the sort of
by Stuart Armstrong on Being legible to other agents by committing to usi... | 0 likes

When considering an embedder
by Jack Gallagher on Where does ADT Go Wrong? | 0 likes

The differences between this
by Abram Demski on Policy Selection Solves Most Problems | 0 likes

Looking "at the very
by Abram Demski on Policy Selection Solves Most Problems | 0 likes

Without reading closely, this
by Paul Christiano on Policy Selection Solves Most Problems | 1 like

>policy selection converges
by Stuart Armstrong on Policy Selection Solves Most Problems | 0 likes

Indeed there is some kind of
by Vadim Kosoy on Catastrophe Mitigation Using DRL | 0 likes

Very nice. I wonder whether
by Vadim Kosoy on Hyperreal Brouwer | 0 likes

Freezing the reward seems
by Vadim Kosoy on Resolving human inconsistency in a simple model | 0 likes

Unfortunately, it's not just
by Vadim Kosoy on Catastrophe Mitigation Using DRL | 0 likes

>We can solve the problem in
by Wei Dai on The Happy Dance Problem | 1 like

Maybe it's just my browser,
by Gordon Worley III on Catastrophe Mitigation Using DRL | 2 likes

At present, I think the main
by Abram Demski on Looking for Recommendations RE UDT vs. bounded com... | 0 likes

In the first round I'm
by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

Fine with it being shared
by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

RSS

Privacy & Terms