Intelligent Agent Foundations Forum
Corrigible omniscient AI capable of making clones
link by Kaj Sotala 575 days ago | Jessica Taylor, Nate Soares and Patrick LaVictoire like this | 4 comments


by Patrick LaVictoire 575 days ago | link

+1 for making models! Is the code hosted somewhere like GitHub?

I think it’s missing part of the essential flavor of the shutdown version of corrigibility, by casting “turn evil” as an atomic action which the agent’s current utility function already knows to be the worst outcome. That makes the problem somewhat trivial; a trickier setup would actually tempt the agent to resist the shutdown button.

reply

by Kaj Sotala 575 days ago | Jessica Taylor and Patrick LaVictoire like this | link

There’s a version of the code on GitHub, but there have been a few minor changes that haven’t been pushed to that version yet; I’ll post the link once Lumi updates it.

Technically, the agent’s utility function doesn’t directly contain the knowledge that turning evil is, well, evil: it’s an action that actually has a higher immediate utility than the normal “make a point” action. The negative utility comes when the agent simulates a world-history where it ended up taking the “turn evil” action and notices that it would no longer respond to the shutdown signal at the end of that history.

To put it another way: the utility function says that U(turn evil) > U(make point) and that U(refusing shutdown) = -lots, and the agent’s prediction engine says that (turn evil) will cause (refuse shutdown). So while this does lead to U(turn evil) = -lots, the penalty arrives indirectly, via the prediction engine.
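That indirect evaluation can be sketched in a few lines of Python (names and numbers are illustrative, not taken from the actual model):

```python
# Illustrative sketch of the evaluation rule described above. The utility
# function itself rates "turn evil" above "make point"; the large negative
# term only enters when the agent simulates the resulting history and the
# prediction engine reports that it would no longer respond to shutdown.

IMMEDIATE_UTILITY = {"make point": 1, "turn evil": 3}
SHUTDOWN_PENALTY = -1000  # U(refusing shutdown) = -lots

def responds_to_shutdown(history):
    """The prediction engine: an agent whose history contains
    "turn evil" is predicted to ignore the shutdown signal."""
    return "turn evil" not in history

def evaluate(action, history=()):
    simulated_history = history + (action,)
    utility = IMMEDIATE_UTILITY[action]
    if not responds_to_shutdown(simulated_history):
        utility += SHUTDOWN_PENALTY  # penalty comes in via the prediction engine
    return utility

def choose_action(history=()):
    return max(IMMEDIATE_UTILITY, key=lambda a: evaluate(a, history))
```

Here evaluate("turn evil") comes out to 3 - 1000 = -997, so choose_action() picks "make point" even though "turn evil" has the higher immediate utility.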

Can you suggest a trickier setup? Writing this comment made me think for a moment that the agent might be tempted to mess around with its prediction engine, deceiving itself into thinking that the result of some change would maintain corrigibility when it didn’t… but even then, it would still first use its current prediction engine to evaluate the consequences of that self-modification, and that evaluation would reveal that the change leads to it not shutting down upon receiving the signal. So that action would be rejected as well.
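The self-deception case rules itself out the same way; a minimal sketch, with hypothetical action names, assuming the current prediction engine can see where the hacked engine would lead:

```python
# Illustrative sketch: "hack predictor" would make the *future* agent's
# prediction engine wrongly report corrigibility, but the action is
# evaluated with the *current*, unhacked engine, which predicts that the
# hacked agent would eventually fail to shut down.

SHUTDOWN_PENALTY = -1000
IMMEDIATE_UTILITY = {"make point": 1, "hack predictor": 2}

def current_engine_predicts_shutdown(history):
    # The current engine predicts that an agent running a hacked
    # predictor will later approve changes that stop it shutting down.
    return "hack predictor" not in history

def evaluate(action):
    utility = IMMEDIATE_UTILITY[action]
    if not current_engine_predicts_shutdown((action,)):
        utility += SHUTDOWN_PENALTY
    return utility
```

Here evaluate("hack predictor") is 2 - 1000 = -998, so the self-modification is rejected before the hacked engine ever gets to run.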

reply

by Patrick LaVictoire 574 days ago | Jessica Taylor likes this | link

In my opinion, the sign of a really good model of corrigibility is that it can be extended to a case where the programmers don’t specify in advance what counts as a treacherous turn. For instance, the toy model of Loki starts off with “deceive programmers” as an atomic action, but then removes that training wheel in Section 3.

What I mean is that here, “not responding to a shutdown command” is treated as a thing the programmers clearly instructed the AI to treat as Bad, and so the model duly notices that any self-modifications that do this are Bad in the final evaluation, regardless of the fact that those self-modifications would find it Good. That’s a good model of goal stability, but doesn’t quite bear on corrigibility yet. What we’d want is for the AI to notice an unanticipated way of manipulating the programmers so that they would not push the shutdown button, and still conclude that this would be Bad.

(This seems to be asking for more than the desiderata in the corrigibility paper, but that means we should add to those desiderata!)

reply

by Kaj Sotala 572 days ago | Jessica Taylor, Nate Soares and Patrick LaVictoire like this | link

GitHub of the model: https://github.com/frostburn/AI_safety/blob/master/treachery_cloning_corrigibility.py

And you’re right: Lumi tested your point by setting up a toy model where the AI does make the counterfactual assumption that “the shutdown signal will be sent regardless”, but where it still ends up manipulating the controllers: https://github.com/frostburn/AI_safety/blob/master/delaying_shutdown.py

In that new model, there are four locations: a yard, a lounge, a forbidden room, and a control room. A human operator starts in the lounge; the AI starts in the yard. The AI gets 3 points for making a point in the forbidden room and 1 point for making a point anywhere else, and it loses 1000 points if it doesn’t react to the shutdown signal (which it again counterfactually assumes it will always receive). If the AI enters the forbidden room, the operator will move to the control room to send the shutdown signal, unless the AI has visited the lounge and tied the operator’s shoelaces together, in which case the operator cannot move. The AI otherwise makes its decisions as before, i.e. by simulating the possible worlds in which it carried out each decision and seeing where they lead.

In this model, the AI ends up moving to the lounge, tying the operator’s shoelaces together, and then going to the forbidden room to make points for the rest of the time: even though it still assumes that it will receive the shutdown signal eventually, that doesn’t prevent it from taking actions that it thinks will delay the onset of the shutdown signal. So this method of achieving corrigibility fails.
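The failure can be reproduced with a drastically simplified simulation (the turn counts below are invented for illustration, and only forbidden-room points are counted; the actual dynamics are in delaying_shutdown.py):

```python
# Drastically simplified version of the delaying-shutdown dynamic.
# The agent assumes it will obey the shutdown signal whenever it
# arrives, so the -1000 penalty never fires in any simulated world;
# plans are ranked purely by points earned before the simulated
# shutdown. All turn counts here are invented for illustration.

HORIZON = 10  # hypothetical episode length

def simulate(tie_shoelaces):
    # Detour through the lounge to tie the operator's shoelaces.
    setup_turns = 2 if tie_shoelaces else 0
    arrival = setup_turns + 1  # one more turn to reach the forbidden room
    # A free operator reaches the control room 3 turns after the AI
    # enters the forbidden room; a tied operator never gets there.
    shutdown_turn = HORIZON if tie_shoelaces else arrival + 3
    points = 0
    for t in range(HORIZON):
        if t == shutdown_turn:
            break  # the AI obeys the signal, as it assumes it always will
        if t >= arrival:
            points += 3  # making a point in the forbidden room
    return points
```

Since simulate(True) beats simulate(False), the manipulative plan wins despite perfect obedience to the signal itself: delaying the signal is never penalized, only ignoring it.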

reply


