Intelligent Agent Foundations Forum
by Jessica Taylor 811 days ago

  1. Can we agree that this is “manipulating the human to cause them to have reward-seeking behavior” and not “manipulating the human so their preferences are easy to satisfy”? The second brings to mind things like making the human want the speed of light to be above 100 m/s, and we don’t have an argument for why the AI would do that.

  2. Why is reward-seeking behavior evidence for getting high rewards when getting heroin, instead of evidence for getting negative rewards when not getting heroin?



by Stuart Armstrong 810 days ago

  1. I don’t really see the relevant difference here. If the human has their hard-to-satisfy preferences about, e.g., art and meaning, replaced by a single desire for heroin, that seems to make their preferences easier to satisfy.

  2. That’s a good point.


by Jessica Taylor 810 days ago

Re 1: There are cases where it makes the human’s preferences harder to satisfy. For example, perhaps heroin addicts demand twice as much heroin as the AI can provide, so their preferences end up harder to satisfy than before. Yet they will still seek reward strongly and often achieve it, so you might predict that the AI gives them heroin.

I think my real beef with saying this “manipulates the human’s preferences to make them easier to satisfy” is that, when most people hear this phrase, they think of a specific technical problem that is quite different from this one (in terms of what we would predict the AI to do, not necessarily the desirability of the end result). Specifically, the most obvious interpretation is naive wireheading (under which the AI wants the human to want the speed of light to be above 100 m/s), and that is quite a different problem at a technical level.


by Stuart Armstrong 810 days ago

Wireheading the human is the ultimate goal of the AI. I chose heroin as the first step along those lines, but that’s where the human would ultimately end up.

For instance, once the human is on heroin, the AI could ask them: “Is your true reward function \(r\)? If you answer yes, you’ll get heroin.” Under the assumption that the human is rational and the heroin on offer is an immediate, short-term reward, this lets the AI conclude that the human’s reward function is any given \(r\).
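
To sketch that step (this is just an illustrative formalization, modelling the heroin-addicted human as a maximizer of immediate heroin, and the AI as treating the answer as a truthful report): for any \(r\) the AI asks about,

\[
\arg\max_{a \in \{\text{yes},\,\text{no}\}} \mathbb{E}[\,\text{heroin received} \mid a\,] = \text{yes},
\]

so the human answers “yes” whatever \(r\) is, and an AI that reads the answer as a truthful report can “confirm” any \(r\) it chooses to ask about.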


by Jessica Taylor 810 days ago

I strongly predict that if you make your argument really precise (as you did in the main post), it will have a visible flaw. In particular, I expect the fact that \(r\) and \(r - 1000\) are behaviorally indistinguishable to prevent the argument from going through (though it’s hard to say exactly how this applies without a sufficiently mathematical version of the argument to look at).
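
To spell out why that matters (a standard observation, written in my own notation): adding a constant to the reward function never changes which action an expected-reward maximizer prefers, so no amount of behavioral data separates \(r\) from \(r - 1000\):

\[
\arg\max_{a} \mathbb{E}[\,r \mid a\,] = \arg\max_{a} \mathbb{E}[\,r - 1000 \mid a\,],
\]

where \(\mathbb{E}[\,r \mid a\,]\) is the expected reward from taking action \(a\); subtracting 1000 lowers every action’s value by the same amount. In particular, “the human gets a large reward when on heroin” and “the human gets a large negative reward when off heroin” predict exactly the same reward-seeking behavior, which is why I’d expect the precise version of the argument to break somewhere.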
