A putative new idea for AI control; index here.
Pick a very unsafe goal: \(G=\)“AI, make this world richer and less unequal.” What does this mean as a goal, and can we make it safe?
I’ve started to sketch out how we can codify “human understanding” in terms of human ability to answer questions.
Here I’m investigating the reverse problem, to see whether the same idea can be used to give instructions to an AI.
For the purpose of this post, I’ll assume we have some sufficient measure of accuracy \(A\). This is a boolean-valued function that takes as input a human \(h\) (in a particular time and place), a string/description \(s\), and either a world \(w\) or a pair of worlds \(w\), \(w'\). Then \(A(h,s,w)\) is true iff the string \(s\), when presented to the human \(h\), is an understandably accurate description of \(w\); similarly, \(A(h,s,w,w')\) is true iff \(s\) is an understandably accurate description of the difference between \(w\) and \(w'\).
\(G\) is describing a world the human would not see as accurately described by \(G\).
Let \(w\) be our world, let \(\omega\) be any world, and let \(\omega_G\) be the world that \(G\) is meant to be describing (this is an informal definition, as we haven’t formalised what this means yet).
Humans have a poor understanding of causality, of what causes what in the real world \(w\) (and in \(\omega_G\)). A lot of strong political positions, for instance, seem predicated on denying the existence of certain tradeoffs. And no one has a complete understanding of all the physics, biology, and social sciences that best model our world. Thus the desiderata of \(G\) may be impossible to satisfy: there may be no plausible world \(\omega_G\) that is well described by \(G\).
And on a more fundamental level, we are simply ignorant of vast amounts of things about the world. No one knows all the basic statistical descriptors of our world, let alone the full distribution behind those descriptors.
Thus even if there were a plausible world \(\omega_G\) well-described by \(G\), if we had a full description of that world, we would think it very different from what we intended with \(G\) – just as if we had a full description of \(w\), we wouldn’t recognise our own world.
This suggests that \(G\) should in some way be seen as a description of the “difference” \(w - \omega_G\) between worlds.
Modelling worlds
Here we’re going to replace worlds \(\omega\) with models \(M(\omega)\) of those worlds. These models are made up of variables \(\{x_i\}\). Each of those variables has a description \(s_i\), and we use our measure of accuracy to ensure that these descriptions are understandable.
Specifically, if \(M(w)\) and \(M(\omega)\) are almost the same except that they have different values of \(x_i\) for \(i\) in a small set \(I\), then we say the descriptions are understandable if \(A(h,\{s_i,M(w)_i,M(\omega)_i \mid i \in I\},w,\omega)\) is true.
Thus the difference in the variables \(x_i\), along with the descriptions \(s_i\) of \(x_i\), is a good description of the difference between worlds.
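As a rough illustration of this construction, the Python sketch below treats a model as a dictionary from variable names to values and extracts the set \(I\) of variables on which two models differ, paired with their descriptions. The dictionary encoding, the function name `model_difference`, and the example variables and values are all hypothetical; the accuracy measure \(A\) is not implemented, only the argument that would be handed to it.

```python
from typing import Dict, List, Tuple

Model = Dict[str, float]  # assumed encoding: variable name -> value


def model_difference(
    descriptions: Dict[str, str], m_w: Model, m_omega: Model
) -> List[Tuple[str, float, float]]:
    """Return (s_i, M(w)_i, M(omega)_i) for every i on which the models differ."""
    diff = []
    for name in sorted(m_w):
        if m_w[name] != m_omega[name]:
            diff.append((descriptions[name], m_w[name], m_omega[name]))
    return diff


# Illustrative, made-up variables and values.
descriptions = {"x1": "human happiness", "x2": "median income", "x3": "life expectancy"}
m_w = {"x1": 0.6, "x2": 30000.0, "x3": 72.0}
m_omega = {"x1": 0.6, "x2": 45000.0, "x3": 72.0}

diff = model_difference(descriptions, m_w, m_omega)
print(diff)
```

The resulting list plays the role of the set passed to \(A\) alongside \(h\), \(w\) and \(\omega\); whether a human actually finds it an accurate description is exactly what \(A\) is assumed to decide.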
Lastly, the variables \(x_i\) are required to be important, to humans, based on their descriptions. Thus it is more likely to include \(s_i=\)“human happiness” rather than \(s_i=\)“electron density of Saturn”.
Testing the model: devil’s advocacy
Now, it should be obvious that there exist worlds \(\omega\) with very positive \(M(\omega)\) – every human is modelled as being alive, healthy, happy, free, flourishing, equal, etc. – that are nevertheless horrible places to live.
It’s not only a question of siren worlds, deceptive worlds designed to hide their badness. It’s more that \(s_i\) is only an accurate description of \(x_i\) in worlds that differ little from \(w\), and thus that constraining worlds to have specific \(M(\omega)\) does not constrain them to being well described by \(\{s_i\}\) and \(M(\omega)\).
And even if they were well-described, it’s possible that \(\{x_i\}\) do not capture all the variables that humans find important – the model may have missed some.
This is especially likely because humans often overlook important background features of their own lives that they don’t have to think about, and because we haven’t yet specified how to select all the variables in the model \(M(\cdot)\).
Enter the devil’s advocate AI, DAI.
Given a world \(\omega\) with model \(M(\omega)\), the job of the DAI is to highlight to humans all the ways that \(\omega\) can go wrong that are not already captured by \(M(\omega)\).
Specifically, DAI needs to produce a description string \(s\) such that:
 \(s\) describes the difference between \(w\) and \(\omega\) well; i.e. \(A(h,s,w,\omega)\) is true.
 \(s\) is not captured by the model difference; i.e. \((s,\{s_i\},M(w),M(\omega))\) is a more accurate description than \((\{s_i\},M(w),M(\omega))\).
 The human \(h\) agrees \(s\) is an important fact (alternatively, we might want them to agree \(s\) is an important and negative fact).
There may be a back-and-forth cycle with other AIs that defend \(\omega\) against the DAI, all of them using accurate descriptions, before the human decides whether \(s\) is important or not.
If the DAI loses, say that \(\omega\) is well-modelled by \(M(\omega)\).
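The three conditions on the DAI’s output can be sketched in Python as follows. This is only a toy: the human-dependent judgements are replaced by hypothetical callables backed by fixed sets, the DAI searches a finite list of candidate strings rather than all possible descriptions, and the back-and-forth with defending AIs is left out entirely. All names here are assumptions made for illustration.

```python
from typing import Callable, Iterable

Judgement = Callable[[str], bool]  # hypothetical stand-in for a human-dependent check


def objection_succeeds(s: str, accurate: Judgement, adds_to_model: Judgement,
                       important: Judgement) -> bool:
    """True if the string s meets all three conditions on the DAI's output."""
    return accurate(s) and adds_to_model(s) and important(s)


def well_modelled(candidates: Iterable[str], accurate: Judgement,
                  adds_to_model: Judgement, important: Judgement) -> bool:
    """omega is treated as well-modelled if no candidate objection succeeds."""
    return not any(
        objection_succeeds(s, accurate, adds_to_model, important) for s in candidates
    )


# Toy judgements backed by fixed sets; real ones would involve the human h.
accurate_strings = {"incomes rose", "happiness comes from sedatives in the water"}
already_in_model = {"incomes rose"}
important_strings = {"happiness comes from sedatives in the water"}

accurate = lambda s: s in accurate_strings          # stands in for A(h, s, w, omega)
adds_to_model = lambda s: s not in already_in_model  # s improves on the model difference
important = lambda s: s in important_strings         # h agrees s is important

verdict_with_objection = well_modelled(accurate_strings, accurate, adds_to_model, important)
verdict_without = well_modelled(["incomes rose"], accurate, adds_to_model, important)
print(verdict_with_objection, verdict_without)
```

In the first call the DAI finds an accurate, uncaptured, important fact, so \(\omega\) fails the test; in the second, the only candidate is already captured by the model, so the DAI loses.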
Cashing out the description \(G\)
We’re now ready to try and cash out the description of \(G\). First of all, we translate it into a requirement on the variables \(\{x_i\}\). We check whether this requirement translates well by comparing how humans interpret \(G\) versus how they interpret changes to \(\{x_i\}\). This gives us a measure \(G(w,\omega)\), which scores how far the variables of \(M(\omega)\) have moved in the direction of \(G\) compared with \(M(w)\).
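One very simple way such a measure could look, purely as an assumption for illustration, is a signed sum: each variable that \(G\) mentions gets a direction (+1 if \(G\) wants it higher, -1 if lower), and the score adds up the movement in that direction. The post does not commit to this form; the function name `goal_score`, the variables, and the values below are hypothetical, and the variables are assumed to be already normalised to comparable units.

```python
from typing import Dict


def goal_score(directions: Dict[str, int], m_w: Dict[str, float],
               m_omega: Dict[str, float]) -> float:
    """Sum of movements from M(w) to M(omega), signed by the direction G asks for."""
    return sum(d * (m_omega[name] - m_w[name]) for name, d in directions.items())


# "Richer and less unequal": wealth should go up, inequality should go down.
directions = {"wealth": +1, "inequality": -1}
m_w = {"wealth": 1.0, "inequality": 0.5}
m_omega = {"wealth": 1.5, "inequality": 0.25}

score = goal_score(directions, m_w, m_omega)
print(score)
```

A linear sum ignores tradeoffs between variables and how humans weigh them, which is part of what the comparison with human interpretations of \(G\) would need to check.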
Then we can finally define \(\omega_G\):
 \(\omega_G\) is well-modelled by \(M(\omega_G)\).
 \(\omega_G\) maximises/satisfices/quantilises \(G(w,\omega_G)\).
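The three selection rules in the second requirement differ in how hard they push on the score. The Python sketch below contrasts them over a finite set of candidate worlds that are assumed to have already passed the well-modelled test. The function names, the scores, and the uniform base distribution used for quantilising are assumptions, not something the post specifies.

```python
import random
from typing import Dict, Optional


def maximise(scores: Dict[str, float]) -> str:
    """Pick the world with the highest score."""
    return max(scores, key=lambda world: scores[world])


def satisfice(scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Pick the first world (in insertion order) whose score reaches the threshold."""
    for world, score in scores.items():
        if score >= threshold:
            return world
    return None


def quantilise(scores: Dict[str, float], q: float, rng: random.Random) -> str:
    """Sample uniformly from the top q fraction of worlds, ranked by score."""
    ranked = sorted(scores, key=lambda world: scores[world], reverse=True)
    k = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:k])


# Made-up scores G(w, omega) for four hypothetical candidate worlds.
scores = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.7}

best = maximise(scores)
good_enough = satisfice(scores, threshold=0.5)
sampled = quantilise(scores, q=0.5, rng=random.Random(0))
print(best, good_enough, sampled)
```

Maximising is the most exposed to any flaw in the score; satisficing and quantilising trade some score for less extreme optimisation pressure.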
Note that the first requirement can be used to fix the variables in \(M\): having many variables makes it easier to find well-described worlds (we may need to combine this with a prior that cuts down the number of variables, to make sure it doesn’t get too ridiculous).
