Intelligent Agent Foundations Forum
by Jessica Taylor 658 days ago

“As Vadim pointed out, we don’t even know what we mean by ‘aligned versions’ of algorithms at the moment. So we wouldn’t know if we’re succeeding or failing.”

If we fail to make the intuition about aligned versions of algorithms more crisp than it currently is, then it’ll be pretty clear that we failed. It seems reasonable to be skeptical that we can make our intuitions about “aligned versions of algorithms” crisp and then go on to design competitive and provably aligned versions of all AI algorithms in common use. But it does seem like we will know whether we succeed at this task, and even before then we’ll have indications of progress, such as success or failure at formalizing and solving scalable AI control in successively more complex toy environments. (It seems like I have intuitions about what would constitute progress that are hard to convey over text, so I would not be surprised if you aren’t convinced that it’s possible to measure progress.)
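As a rough illustration of what “measuring progress” could mean here, a minimal sketch of an evaluation harness over a ladder of toy environments; all the names, thresholds, and the `solve_control_problem` stub below are hypothetical placeholders of my own, not anything concretely proposed in this thread:

```python
# Hypothetical sketch: tracking progress on "scalable control" as the highest
# complexity level at which we can produce an agent that is both competitive
# and satisfies a formal alignment criterion. Names are illustrative only.

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToyEnvironment:
    name: str
    complexity: int          # rough ordering: 1 = simplest
    task_threshold: float    # minimum task score to count as "competitive"


@dataclass
class Result:
    task_score: float        # how well the candidate agent does the task
    alignment_ok: bool       # whether it passes the formal alignment check


def evaluate_progress(
    environments: list[ToyEnvironment],
    solve_control_problem: Callable[[ToyEnvironment], Optional[Result]],
) -> int:
    """Return the highest complexity level at which the candidate agent is
    both competitive and passes the alignment check; the first failure marks
    the current frontier (i.e. the open problem)."""
    frontier = 0
    for env in sorted(environments, key=lambda e: e.complexity):
        result = solve_control_problem(env)
        if result is None:
            break  # could not even formalize the control problem at this level
        if result.task_score >= env.task_threshold and result.alignment_ok:
            frontier = env.complexity
        else:
            break  # competitive-and-aligned fails here
    return frontier
```

On this picture, “knowing whether we’re succeeding” just means watching whether that frontier moves up as the environments get harder, rather than requiring a crisp definition of alignment for arbitrary algorithms up front.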

“Orthogonality implies that alignment shouldn’t cost performance, but says nothing about the costs of ‘value loading’ (i.e. teaching an AI human values and verifying its value learning procedure and/or the values it has learned). Furthermore, value loading will probably be costly, because we don’t know how to do it, competitive dynamics make the opportunity cost of working on it large, and we don’t even have clear criteria for success.”

It seems like “value loading is very hard/costly” has to imply that the proposal in this comment thread is going to be very hard/costly, e.g. because one of Wei Dai’s objections to it proves fatal. But arguments of the form “human values are complex and hard to formalize” or “humans don’t know what we value” seem insufficient to establish this; Wei Dai’s objections in the thread are mostly not about value learning. (Sorry if you aren’t actually arguing “value loading is hard because human values are complex and hard to formalize” and I’m misinterpreting you.)
