Intelligent Agent Foundations Forum
A note on misunderstanding the boundaries of models
discussion post by Stuart Armstrong

This is a reminder for me as much as for anyone. I often post ideas and impossibility results, but it’s important to remember that those results only apply within their models. And the models likely don’t match up exactly with either reality or what we want/need.

For instance, there are the various no-free-lunch theorems (any two optimization algorithms have the same performance when averaged across all possible problems) and Rice’s theorem (no general procedure can decide a non-trivial semantic property of every program).
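
As a toy illustration (mine, not part of the original post), here is a minimal sketch of the no-free-lunch claim: averaged over every function from a four-point domain to {0, 1}, a naive left-to-right scan and a “cleverer” adaptive rule need exactly the same expected number of evaluations before first seeing the maximum. The setup and all the names in it are invented purely for this example.

```python
import itertools

X = list(range(4))   # a four-point search space
Y = (0, 1)           # a two-value codomain

def evals_to_find_max(pick_next, f):
    """Evaluations a non-revisiting searcher needs before first seeing max(f)."""
    best, visited = max(f.values()), []
    while True:
        x = pick_next(visited, f)
        visited.append(x)
        if f[x] == best:
            return len(visited)

def left_to_right(visited, f):
    """Naive strategy: scan the points in a fixed order."""
    return next(x for x in X if x not in visited)

def adaptive(visited, f):
    """'Clever' strategy: after seeing a 0, jump to the far end of the unvisited points."""
    unvisited = [x for x in X if x not in visited]
    return unvisited[-1] if visited and f[visited[-1]] == 0 else unvisited[0]

# Average performance across *all* 2^4 = 16 functions f : X -> Y.
all_functions = [dict(zip(X, values)) for values in itertools.product(Y, repeat=len(X))]
for strategy in (left_to_right, adaptive):
    average = sum(evals_to_find_max(strategy, f) for f in all_functions) / len(all_functions)
    print(strategy.__name__, average)   # both strategies average 1.6875 evaluations
```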

These results lose their force once we note that we have strong evidence that we live in a very specific environment and have to solve only a small class of problems, and that most programs we encounter are (or can be) designed to be clear and understandable, rather than selected at random.
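
Continuing the toy sketch above (again my own illustration, not the author’s), restricting attention to a structured subclass of functions breaks the tie: on the five monotone “threshold” functions, a searcher that exploits the structure by starting from the right-hand end beats the naive scan, even though the two would tie when averaged over all sixteen functions.

```python
# The five monotone "threshold" functions on X = {0, 1, 2, 3}:
# 0000, 0001, 0011, 0111, 1111 (read as f(0)f(1)f(2)f(3)).
X = list(range(4))
thresholds = [{x: int(x >= k) for x in X} for k in range(5)]  # k = 4 gives the all-zero function

def evals_to_find_max(pick_next, f):
    """Evaluations a non-revisiting searcher needs before first seeing max(f)."""
    best, visited = max(f.values()), []
    while True:
        x = pick_next(visited, f)
        visited.append(x)
        if f[x] == best:
            return len(visited)

def left_to_right(visited, f):
    """Naive scan, ignorant of the structure."""
    return next(x for x in X if x not in visited)

def right_to_left(visited, f):
    """Structure-aware scan: any 1s must sit at the right-hand end."""
    return next(x for x in reversed(X) if x not in visited)

for strategy in (left_to_right, right_to_left):
    average = sum(evals_to_find_max(strategy, f) for f in thresholds) / len(thresholds)
    print(strategy.__name__, average)   # 2.2 for the naive scan, 1.0 for the structure-aware one
```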

Conversely, we have results on how to use Oracles safely. However, these results fall apart when you consider issues of acausal trade. The initial results still hold: the design seems safe if multiple iterations of the AI don’t care about each other’s reward. But the acausal trade example demonstrates that this assumption can fail in ways we didn’t expect.

Both of these cases stem from misunderstanding the boundaries of the model. For the no-free-lunch theorems, we might be informally thinking “of course we want the AI to be good at things in general”, without realising that this doesn’t perfectly match up to “good performance across every environment”, because “every environment” includes weird, pathological, and highly random environments. Similarly, “don’t have the AIs care about each other’s reward” seems like something that could be achieved with simple programming and boxing approaches, but what we actually achieve is “AIs that don’t care about each other’s rewards in the conventional human understanding of these terms”.

The boundary of the model did not lie where we thought it did.


