Intelligent Agent Foundations Forumsign up / log in
by Jessica Taylor 347 days ago | link | parent

As Vadim pointed out, we don’t even know what we mean by “aligned versions” of algos, ATM. So we wouldn’t know if we’re succeeding or failing

If we fail to make the intuition about aligned versions of algorithms more crisp than it currently is, then it’ll be pretty clear that we failed. It seems reasonable to be skeptical that we can make our intuitions about “aligned versions of algorithms” crisp and then go on to design competitive and provably aligned versions of all AI algorithms in common use. But it does seem like we will know if we succeed at this task, and even before then we’ll have indications of progress such as success/failure at formalizing and solving scalable AI control in successively complex toy environments. (It seems like I have intuitions about what would constitute progress that are hard to convey over text, so I would not be surprised if you aren’t convinced that it’s possible to measure progress).

“Orthogonality implies that alignment shouldn’t cost performance, but says nothing about the costs of ‘value loading’ (i.e. teaching an AI human values and verifying its value learning procedure and/or the values it has learned). Furthermore, value loading will probably be costly, because we don’t know how to do it, competitive dynamics make the opportunity cost of working on it large, and we don’t even have clear criteria for success.”

It seems like “value loading is very hard/costly” has to imply that the proposal in this comment thread is going to be very hard/costly, e.g. because one of Wei Dai’s objections to it proves fatal. But it seems like arguments of the form “human values are complex and hard to formalize” or “humans don’t know what we value” are insufficient to establish this; Wei Dai’s objections in the thread are mostly not about value learning. (sorry if you aren’t arguing “value loading is hard because human values are complex and hard to formalize” and I’m misinterpreting you)



NEW LINKS

NEW POSTS

NEW DISCUSSION POSTS

RECENT COMMENTS

This is exactly the sort of
by Stuart Armstrong on Being legible to other agents by committing to usi... | 0 likes

When considering an embedder
by Jack Gallagher on Where does ADT Go Wrong? | 0 likes

The differences between this
by Abram Demski on Policy Selection Solves Most Problems | 0 likes

Looking "at the very
by Abram Demski on Policy Selection Solves Most Problems | 0 likes

Without reading closely, this
by Paul Christiano on Policy Selection Solves Most Problems | 1 like

>policy selection converges
by Stuart Armstrong on Policy Selection Solves Most Problems | 0 likes

Indeed there is some kind of
by Vadim Kosoy on Catastrophe Mitigation Using DRL | 0 likes

Very nice. I wonder whether
by Vadim Kosoy on Hyperreal Brouwer | 0 likes

Freezing the reward seems
by Vadim Kosoy on Resolving human inconsistency in a simple model | 0 likes

Unfortunately, it's not just
by Vadim Kosoy on Catastrophe Mitigation Using DRL | 0 likes

>We can solve the problem in
by Wei Dai on The Happy Dance Problem | 1 like

Maybe it's just my browser,
by Gordon Worley III on Catastrophe Mitigation Using DRL | 2 likes

At present, I think the main
by Abram Demski on Looking for Recommendations RE UDT vs. bounded com... | 0 likes

In the first round I'm
by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

Fine with it being shared
by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

RSS

Privacy & Terms