This is a reminder for me as much as for anyone. I’m often posting a bunch of ideas and impossibility results, but it’s important to remember that those results only apply within their models. And the models likely don’t match up exactly with either reality or what we want/need.
For instance, we have the various no-free-lunch theorems (any two optimization algorithms perform equally well when their performance is averaged across all possible problems), or Rice’s theorem (no general algorithm can decide a non-trivial semantic property for every program).
These results lose their bite once we note that we have strong evidence that we live in a very specific environment and have to solve only a small class of problems; and that most programs we meet are (or can be) designed to be clear and understandable, rather than being selected at random. The theorems remain true; they just stop describing the situation we are actually in.
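To make the no-free-lunch point concrete, here is a toy sketch rather than a faithful rendering of the theorems. Everything in it is my own assumption for illustration: a four-point domain, binary values, two non-adaptive search orders standing in for "optimization algorithms", and "non-decreasing functions" standing in for "the small class of problems we actually face".

```python
from itertools import product

# Toy setup (all hypothetical): problems are functions from DOMAIN to VALUES,
# stored as tuples so that f[x] is the value at point x.
DOMAIN = [0, 1, 2, 3]
VALUES = [0, 1]


def best_after(order, f, k):
    """Best value found after evaluating f at the first k points of a fixed search order."""
    return max(f[x] for x in order[:k])


def average_performance(order, problems, k):
    """Mean of best_after over a collection of problems."""
    return sum(best_after(order, f, k) for f in problems) / len(problems)


# Every possible problem: all 2**4 functions from DOMAIN to VALUES.
all_problems = list(product(VALUES, repeat=len(DOMAIN)))

# A small structured class: only the non-decreasing functions.
structured = [f for f in all_problems
              if all(f[i] <= f[i + 1] for i in range(len(f) - 1))]

# Two simple "algorithms": scan the domain in opposite directions.
left_to_right = [0, 1, 2, 3]
right_to_left = [3, 2, 1, 0]

for k in range(1, len(DOMAIN) + 1):
    print(k,
          average_performance(left_to_right, all_problems, k),
          average_performance(right_to_left, all_problems, k),
          average_performance(left_to_right, structured, k),
          average_performance(right_to_left, structured, k))
```

Averaged over all sixteen functions, the two scan orders score identically for every evaluation budget `k`. On the five non-decreasing functions, a single evaluation gives an average of 0.2 scanning left to right and 0.8 scanning right to left. The equality only holds inside the "all possible problems" model.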
Conversely, we can have results on how to use Oracles safely. However, these results fall apart when you consider issues of acausal trade. The initial results still hold – the design seems safe if multiple iterations of the AI don’t care about each other’s reward. However, the acausal trade example demonstrates that that assumption can fail in ways we didn’t expect.
Both of these cases stem from misunderstanding the boundaries of the model. For the no-free-lunch theorems, we might be informally thinking “of course we want the AI to be good at things in general”, without realising that this doesn’t perfectly match up to “good performance across every environment”, because “every environment” includes weird, pathological, and highly random environments. Similarly, “don’t have the AIs care about each other’s reward” seems like something that could be achieved with simple programming and boxing approaches, but what we actually achieve is “AIs that don’t care about each other’s rewards in the conventional human understanding of these terms”.
The boundary of the model did not lie where we thought it did.