Intelligent Agent Foundations Forum
by Paul Christiano 1007 days ago | Patrick LaVictoire and Stuart Armstrong like this

This doesn’t seem meaningfully different from the normal IRL setup. See here for a baseline algorithm in small state spaces, and here for another algorithm that can be generalized to use function approximators.

The main novelty here is giving the agent a wrong preliminary reward function, which it might be able to use as a “hint” about the real goal. But it seems more natural to use the human’s utterances as evidence, and either learn a model of the relationship between utterances and goals, or work in a setting where we can model the user as modeling the robot and making goal-directed utterances (which provide evidence about their goals in the same way as a trajectory would).
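A minimal sketch of what "utterances as evidence" could look like, assuming a tiny discrete goal space and a Boltzmann-rational human. Everything here (the 5-cell line world, the candidate goals, the `BETA` value, and the function names) is a hypothetical toy, not one of the algorithms linked above. The point is only that a trajectory step and an utterance can update the same posterior over goals through the same Bayes rule.

```python
import math

# Hypothetical toy world: 5 cells on a line (0..4), three candidate goals.
CELLS = range(5)
GOALS = [0, 2, 4]
BETA = 2.0  # assumed rationality (inverse temperature) of the human


def normalize(weights):
    """Rescale a dict of non-negative weights so the values sum to 1."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def step_likelihood(state, action, goal, beta=BETA):
    """P(action | state, goal) for a Boltzmann-rational mover.

    Actions are -1, 0, +1; utility is minus the distance to the goal
    after taking the action (positions are clipped to the world).
    """
    def utility(a):
        nxt = min(4, max(0, state + a))
        return -abs(nxt - goal)

    scores = {a: math.exp(beta * utility(a)) for a in (-1, 0, 1)}
    return scores[action] / sum(scores.values())


def utterance_likelihood(utterance, goal, beta=BETA):
    """P(utterance | goal): the speaker names a cell, favoring cells near the goal."""
    scores = {u: math.exp(-beta * abs(u - goal)) for u in CELLS}
    return scores[utterance] / sum(scores.values())


def update(prior, likelihood_fn):
    """One Bayesian update of the belief over goals."""
    return normalize({g: p * likelihood_fn(g) for g, p in prior.items()})


prior = {g: 1.0 / len(GOALS) for g in GOALS}

# Evidence 1: a trajectory step. From cell 2 the human moves right (+1).
post_traj = update(prior, lambda g: step_likelihood(2, +1, g))

# Evidence 2: an utterance. The human names cell 4.
post_both = update(post_traj, lambda g: utterance_likelihood(4, g))

print(post_traj)
print(post_both)
```

In this toy, both kinds of evidence push belief toward goal 4. A learned model of the utterance-goal relationship would replace the hand-written `utterance_likelihood` with something fitted from data.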



by Stuart Armstrong 1006 days ago

But it seems more natural to use the human’s utterances as evidence, and either learn a model of the relationship between utterances and goals, or work in a setting where we can model the user as modeling the robot and making goal-directed utterances (which provide evidence about their goals in the same way as a trajectory would).

At extreme computing power, that would be the right approach (if we’ve managed to ground evidence of goals in the right way). But for lesser agents, I want to see if we can learn anything by doing it this way (and thanks for those links, but I’d already encountered them ^_^).


by Paul Christiano 1006 days ago

So what is the actual model here? We have IRL algorithms that in principle solve the proposed problem. The given hint doesn’t make the problem much easier information-theoretically, and if we are able to make collaborative IRL with language work then it doesn’t make the problem information-theoretically easier at all. Your concern doesn’t seem to be that IRL won’t work when the AI becomes powerful, but that it won’t work when the AI is weak.

If you want to study a problem like this, it seems like you have to engage with the algorithms which you are claiming won’t work in practice, and then show that some modification improves their practical performance.

(Also, it seems weird to write a post about an elaboration of IRL, whose utility is founded on an empirical claim about the behavior of IRL algorithms, without mentioning it at all.

ETA: never mind, I didn’t see the previous post where this is discussed.)



