Intelligent Agent Foundations Forum
"Like this world, but..."
post by Stuart Armstrong 246 days ago

A putative new idea for AI control; index here.

Pick a very unsafe goal: \(G=\)“AI, make this world richer and less unequal.” What does this mean as a goal, and can we make it safe?

I’ve started to sketch out how we can codify “human understanding” in terms of human ability to answer questions.

Here I’m investigating the reverse problem, to see whether the same idea can be used to give instructions to an AI.

For the purpose of this post, I’ll assume we have some sufficient measure of accuracy \(A\). This is a boolean-valued function that takes as input a human \(h\) (in a particular time and place), a string/description \(s\), and either a world \(w\) or a pair of worlds \(w\), \(w'\). Then \(A(h,s,w)\) is true iff the string \(s\), when presented to the human \(h\), is an understandably accurate description of \(w\); likewise, \(A(h,s,w,w')\) is true iff \(s\) is an understandably accurate description of the difference between \(w\) and \(w'\).

\(G\) is describing a world the human would not see as accurately described by \(G\).

Let \(w\) be our world, let \(\omega\) be any world, and let \(\omega_G\) be the world that \(G\) is meant to be describing (this is an informal definition, as we haven’t formalised what this means yet).

Humans have a poor understanding of causality, of what causes what in the real world \(w\) (and in \(\omega_G\)). A lot of strong political positions, for instance, seem predicated on denying the existence of certain trade-offs. And no-one has a complete understanding of all the physics, biology, and social sciences that best model our world. Thus the desiderata of \(G\) may be impossible to satisfy; there is no plausible world \(\omega_G\) that is well described by \(G\).

And on a more basic and fundamental level, we are simply ignorant of vast amounts of things about the world. No-one knows all the basic statistical descriptors of our world, let alone the full distribution behind those descriptors.

Thus even if there were a plausible world \(\omega_G\) well-described by \(G\), if we had a full description of that world, we would think it very different from what we intended with \(G\) – just as if we had a full description of \(w\), we wouldn’t recognise our own world.

This suggests that \(G\) should in some way be seen as a description of the “difference” \(w-\omega_G\) between worlds.

Modelling worlds

Here we’re going to replace worlds \(\omega\) with models \(M(\omega)\) of those worlds. These models are made up of variables \(\{x_i\}\). Each of those variables has a description \(s_i\), and we use our measure of accuracy to ensure that these descriptions are understandable.

Specifically if \(M(w)\) and \(M(\omega)\) are almost the same except they have different values of \(x_i\) for \(i\) in a small set \(I\), then we say the descriptions are understandable if \(A(h,\{s_i,M(w)_i,M(\omega)_i| i \in I\},w,\omega)\) is true.

Thus the difference in the variables \(x_i\), along with the descriptions \(s_i\) of \(x_i\), is a good description of the difference between worlds.
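
As a minimal sketch of the comparison above, assume a model is represented as a plain mapping from variable names to values. That representation is chosen here for illustration and is not specified in the post. The function names, the example variables, and the `max_size` cutoff standing in for "a small set \(I\)" are all hypothetical. The returned triples are the material that would be handed to \(A\); the measure \(A\) itself is not implemented.

```python
from typing import Dict, List, Optional, Tuple

Model = Dict[str, float]  # hypothetical representation: variable name x_i -> value


def differing_variables(m_w: Model, m_omega: Model) -> List[str]:
    """Return the index set I: variables whose values differ between the models."""
    if m_w.keys() != m_omega.keys():
        raise ValueError("models must be defined over the same variables")
    return sorted(name for name in m_w if m_w[name] != m_omega[name])


def difference_description(
    m_w: Model,
    m_omega: Model,
    descriptions: Dict[str, str],
    max_size: int = 3,  # assumed cutoff for what counts as a "small" set I
) -> Optional[List[Tuple[str, float, float]]]:
    """Build {(s_i, M(w)_i, M(omega)_i) | i in I}, or None if I is not small."""
    index_set = differing_variables(m_w, m_omega)
    if len(index_set) > max_size:
        return None
    return [(descriptions[i], m_w[i], m_omega[i]) for i in index_set]


# Toy example with made-up variables and values.
m_w = {"x1": 0.4, "x2": 0.7, "x3": 1.0}
m_omega = {"x1": 0.6, "x2": 0.7, "x3": 1.0}
descriptions = {"x1": "average wealth", "x2": "human happiness", "x3": "freedom"}
triples = difference_description(m_w, m_omega, descriptions)
```

Here only `x1` differs, so `triples` holds a single entry pairing the description of `x1` with its value in each model.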

Lastly, the variables \(x_i\) are required to be important, to humans, based on their descriptions. Thus it is more likely to include \(s_i=\)“human happiness” rather than \(s_i=\)“electron density of Saturn”.

Testing the model: devil’s advocacy

Now, it should be obvious that there exist worlds \(\omega\) with very positive \(M(\omega)\) – every human is modelled as being alive, healthy, happy, free, flourishing, equal, etc… – that are nevertheless horrible places to live.

It’s not only a question of siren worlds, deceptive worlds designed to hide their badness. It’s more that \(s_i\) is only an accurate description of \(x_i\) in worlds that differ little from \(w\), and thus that constraining worlds to have specific \(M(\omega)\) does not constrain them to being well described by \(\{s_i\}\) and \(M(\omega)\).

And even if they were well-described, it’s possible that \(\{x_i\}\) does not capture all the variables that humans find important.

This is especially likely because humans often miss important background features of their own lives that they don’t have to think about, and because we haven’t yet specified how to select the variables in the model \(M(\cdot)\).

Enter the devil’s advocate AI, DAI.

Given a world \(\omega\) with model \(M(\omega)\), the job of the DAI is to highlight to humans all the ways that \(\omega\) can go wrong that are not already captured by \(M(\omega)\).

Specifically, DAI needs to produce a description string \(s\) such that:

  1. \(s\) describes the difference between \(w\) and \(\omega\) well; i.e. \(A(h,s,w,\omega)\) is true.
  2. \(s\) is not captured by the model difference; i.e. \((s,\{s_i\},M(w),M(\omega))\) is a more accurate description than \((\{s_i\},M(w),M(\omega))\).
  3. The human \(h\) agrees \(s\) is an important fact (alternatively, we might want them to agree \(s\) is an important and negative fact).

There may be a back-and-forth cycle with other AIs that defend \(\omega\) against the DAI, all of them using accurate descriptions, before the human decides whether \(s\) is important or not.

If the DAI loses, say that \(\omega\) is well-modelled by \(M(\omega)\).
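
The three conditions on \(s\), and the resulting notion of "well-modelled", can be sketched as follows. This is only an illustration under strong assumptions: the human \(h\) and the accuracy measure \(A\) are replaced by hypothetical callables bundled in a `Judges` object, the DAI's search is replaced by a fixed finite list of candidate strings, and the back-and-forth with defending AIs is not modelled at all.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Judges:
    """Hypothetical stand-ins for the human h and the accuracy measure A."""
    accurate: Callable[[str], bool]       # condition 1: A(h, s, w, omega) is true
    adds_accuracy: Callable[[str], bool]  # condition 2: s improves on the model difference
    important: Callable[[str], bool]      # condition 3: h agrees s is important


def dai_wins(s: str, judges: Judges) -> bool:
    """The DAI wins with s only if s meets all three conditions."""
    return judges.accurate(s) and judges.adds_accuracy(s) and judges.important(s)


def well_modelled(candidates: Sequence[str], judges: Judges) -> bool:
    """omega counts as well-modelled if no candidate string lets the DAI win."""
    return not any(dai_wins(s, judges) for s in candidates)


# Toy example: each judge is a simple membership test on made-up strings.
judges = Judges(
    accurate=lambda s: s in {"no one chooses their own job", "the sky is green"},
    adds_accuracy=lambda s: s != "people are richer",
    important=lambda s: s == "no one chooses their own job",
)
objections = ["people are richer", "the sky is green", "no one chooses their own job"]
verdict = well_modelled(objections, judges)
```

In this toy run the third string passes all three checks, so `verdict` is `False`: the DAI has found something important that the model misses.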

Cashing out the description \(G\)

We’re now ready to try and cash out the description of \(G\). First of all, we translate it into a requirement on the variables \(\{x_i\}\). We check whether this requirement translates well by comparing how humans interpret \(G\) versus how they interpret changes to \(\{x_i\}\). This allows a measure \(G(w,\omega)\) which counts how well the variables of \(M(\omega)\) are moved in the direction of \(G\) compared with \(M(w)\).
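
The post does not fix a formula for \(G(w,\omega)\). One simple possibility, offered purely as an assumption, is a signed sum: each variable that \(G\) mentions gets a direction (+1 if \(G\) wants it higher, -1 if lower), and the score adds up the changes from \(M(w)\) to \(M(\omega)\) in those directions. The function name `goal_score`, the variable names, and the numbers below are hypothetical.

```python
from typing import Dict

Model = Dict[str, float]  # hypothetical representation: variable name -> value


def goal_score(m_w: Model, m_omega: Model, directions: Dict[str, int]) -> float:
    """Assumed form of G(w, omega): signed sum of changes in the goal's direction.

    directions[name] is +1 if G wants the variable higher and -1 if lower.
    Variables that G does not mention are left out and contribute nothing.
    """
    return sum(
        sign * (m_omega[name] - m_w[name]) for name, sign in directions.items()
    )


# "Richer and less unequal": wealth should go up, inequality should go down.
m_w = {"wealth": 1.0, "inequality": 0.5, "happiness": 0.75}
m_omega = {"wealth": 1.5, "inequality": 0.25, "happiness": 0.75}
directions = {"wealth": +1, "inequality": -1}
score = goal_score(m_w, m_omega, directions)  # 0.5 + 0.25 = 0.75
```

A linear sum treats all variables as equally weighted and commensurable, which is a strong simplification; any real version would need the human check described above to confirm that the score tracks how people interpret \(G\).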

Then we can finally define \(\omega_G\):

  1. \(\omega_G\) is well-modelled by \(M(\omega_G)\).
  2. \(\omega_G\) maximises/satisfices/quantilises \(G(w,\omega_G)\).

Note that the first requirement can be used to fix the variables in \(M\): the more variables the model has, the easier it is to find well-modelled worlds (we may need to combine this with a prior that cuts down the number of variables, to make sure it doesn’t get too ridiculous).
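
The three options in the second requirement can be contrasted with a small sketch. It assumes a finite list of candidate worlds that have already passed the well-modelled check, and a score function standing in for \(G(w,\cdot)\); both are simplifications. The quantiliser here samples uniformly from the top \(q\) fraction of candidates, which takes the base distribution to be uniform over the list (an assumption made for illustration).

```python
import math
import random
from typing import Callable, Optional, Sequence, TypeVar

W = TypeVar("W")  # a candidate world, assumed already well-modelled


def maximise(worlds: Sequence[W], score: Callable[[W], float]) -> W:
    """Pick the candidate with the highest score."""
    return max(worlds, key=score)


def satisfice(
    worlds: Sequence[W], score: Callable[[W], float], threshold: float
) -> Optional[W]:
    """Pick the first candidate whose score meets the threshold, if any."""
    return next((w for w in worlds if score(w) >= threshold), None)


def quantilise(
    worlds: Sequence[W], score: Callable[[W], float], q: float, rng: random.Random
) -> W:
    """Sample uniformly from the top q fraction of candidates by score."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(worlds, key=score, reverse=True)
    top_k = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:top_k])


# Toy example with made-up worlds and scores.
scores = {"a": 1.0, "b": 4.0, "c": 3.0, "d": 2.0}
worlds = list(scores)
best = maximise(worlds, scores.get)
good_enough = satisfice(worlds, scores.get, threshold=3.0)
sampled = quantilise(worlds, scores.get, q=0.5, rng=random.Random(0))
```

With these numbers `best` is `"b"`, `good_enough` is also `"b"` (the first candidate in list order scoring at least 3.0), and `sampled` is one of `"b"` or `"c"`.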




