Intelligent Agent Foundations Forum
Divergent preferences and meta-preferences
post by Stuart Armstrong 142 days ago

A putative new idea for AI control; index here.

In simple graphical form, here is the problem of divergent human preferences:


Here the AI either chooses \(A\) or \(\neg A\), and as a consequence, the human then chooses \(B\) or \(\neg B\).

There are a variety of situations in which this is or isn’t a problem (when \(A\) or \(B\) or their negations aren’t defined, take them to be the negation of what is defined):

  • Not problems:
    • \(A/\neg A =\) “gives right shoe/left shoe”, \(B/\neg B =\) “adds left shoe/right shoe”.
    • \(A =\) “offers drink”, \(\neg B =\) “goes looking for extra drink”.
    • \(A =\) “gives money”, \(B =\) “makes large purchase”.
  • Potentially problems:
    • \(A/\neg A =\) “causes human to fall in love with X/Y”, \(B/\neg B =\) “moves to X’s/Y’s country”.
    • \(A/\neg A =\) “recommends studying X/Y”, \(B/\neg B =\) “choose profession P/Q”.
    • \(A =\) “lets human conceive child”, \(\neg B =\) “keeps up previous hobbies and friendships”.
  • Problems:
    • \(A =\) “coercive brain surgery”, \(B =\) anything.
    • \(A =\) “extreme manipulation”, \(B =\) almost anything.
    • \(A =\) “heroin injection”, \(B =\) “wants more heroin”.

So, what are the differences? For the “not problems”, it makes sense to model the human as having a single reward \(R\), variously “likes having a matching pair of shoes”, “needs a certain amount of fluids”, and “values certain purchases”. Then all the AI is doing is helping (or not helping) the human towards that goal.

As you move towards the “problems”, notice that they seem to involve two distinct human reward functions, \(R_A\) and \(R_{\neg A}\), and that the AI’s actions seem to choose which one the human will end up with. In the spirit of humans not being agents, this seems to be the AI determining what values the human will come to possess.

Grue, Bleen, and agency

Of course, you could always say that the human actually has reward \(R = I_A R_A + (1-I_A)R_{\neg A}\), where \(I_A\) is the indicator function as to whether the AI does action \(A\) or not.

Similarly to the grue and bleen problem, there is no logical way of distinguishing that “pieced-together” \(R\) from a more “natural” \(R\) (such as valuing pleasure, for instance). Thus there is no logical way of distinguishing the human being an agent from the human not being an agent, just from its preferences and behaviour.

However, from a learning and computational complexity point of view, it does make sense to distinguish “natural” \(R\)’s (where \(R_A\) and \(R_{\neg A}\) are essentially the same, despite the human’s actions being different) from composite \(R\)’s.
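As a rough illustration of this distinction (my own sketch, not from the post: the outcome space, reward values, and tolerance are all invented), a “pieced-together” reward and a crude check of whether the two branch rewards are “essentially the same” might look like this:

```python
# Minimal sketch (illustrative only): a "pieced-together" reward
# R = I_A * R_A + (1 - I_A) * R_not_A, plus a crude check for whether
# R_A and R_not_A are "essentially the same" over a small outcome space.
# The outcomes, reward values, and tolerance below are invented for illustration.

OUTCOMES = ["B", "not_B"]

# Hypothetical reward functions the human ends up with after the AI's action.
R_A     = {"B": 1.0, "not_B": 0.0}   # human values B after the AI does A
R_not_A = {"B": 0.0, "not_B": 1.0}   # human values not-B after the AI does not-A

def R(ai_does_A: bool, outcome: str) -> float:
    """Composite reward R = I_A * R_A + (1 - I_A) * R_not_A."""
    return R_A[outcome] if ai_does_A else R_not_A[outcome]

def rewards_essentially_same(r1, r2, tol=0.1) -> bool:
    """Crude 'naturalness' check: do the branch rewards agree on every outcome?"""
    return all(abs(r1[o] - r2[o]) <= tol for o in OUTCOMES)

print(rewards_essentially_same(R_A, R_A))      # True: a "natural" R
print(rewards_essentially_same(R_A, R_not_A))  # False: a composite, grue-like R
```

Logically the composite \(R\) is a perfectly good reward function; the check above only captures the learning-side intuition that its two branches pull in different directions.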

This allows us to define:

  • Preference divergence point: a preference divergence point is one where \(R_A\) and \(R_{\neg A}\) are sufficiently distinct, according to some criterion of distinctness.

Note that sometimes, \(R_A = R_A' + R'\) and \(R_{\neg A} = R_{\neg A}' + R'\): the two rewards \(R_A\) and \(R_{\neg A}\) overlap on a common piece \(R'\), but diverge on \(R_A'\) and \(R_{\neg A}'\). It makes sense to define this as a preference divergence point as well, if \(R_A'\) and \(R_{\neg A}'\) are “important” in the agent’s subsequent decisions. Importance is a somewhat hazy metric; it would, for instance, assess how much \(R'\) reward the human would sacrifice to increase \(R_A'\) or \(R_{\neg A}'\).
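One hedged way to make that importance measure concrete (the decomposition values and threshold below are invented, and this is only one possible reading of “importance”):

```python
# Toy sketch (illustrative assumptions throughout): decompose each branch reward
# into a shared part R_shared (= R') and a divergent part, then flag a preference
# divergence point when the divergent parts matter enough relative to R_shared.

OUTCOMES = ["x1", "x2", "x3"]

R_shared     = {"x1": 0.0, "x2": 2.0, "x3": 0.0}   # R'        : common to both branches
R_A_prime    = {"x1": 0.0, "x2": 0.0, "x3": 3.0}   # R_A'      : pulls towards x3
R_notA_prime = {"x1": 3.0, "x2": 0.0, "x3": 0.0}   # R_{not A}': pulls towards x1

def best(reward):
    """Outcome the human would pick if maximising this reward."""
    return max(OUTCOMES, key=lambda o: reward[o])

def sacrifice_of_shared(divergent):
    """How much R' the human gives up by optimising R' + divergent
    instead of R' alone -- one crude 'importance' measure."""
    combined = {o: R_shared[o] + divergent[o] for o in OUTCOMES}
    return R_shared[best(R_shared)] - R_shared[best(combined)]

IMPORTANCE_THRESHOLD = 1.0  # arbitrary criterion of "important enough"

importance = max(sacrifice_of_shared(R_A_prime), sacrifice_of_shared(R_notA_prime))
print("divergence point?", importance >= IMPORTANCE_THRESHOLD)  # True with these numbers
```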

Meta-preferences

From the perspective of the human’s revealed preferences, \(R(\mu)=I_A R_A + \mu(1-I_A) R_{\neg A}\) will predict the same behaviour for all scaling factors \(\mu > 0\): the human only ever acts under one of \(A\) or \(\neg A\), and positively rescaling the reward on that branch leaves their choices unchanged.

Thus at a preference divergence point, the AI’s behaviour, if it were an \(R(\mu)\) maximiser, would depend on the unobserved weighting between the two divergent preferences.

This is unsafe, especially if it is much easier to achieve a high value with one of the divergent preferences than with the other.
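A small numeric sketch of both points (all values invented): the human’s choice between \(B\) and \(\neg B\) is the same for every \(\mu > 0\), yet an AI maximising \(R(\mu)\) flips its choice of \(A\) versus \(\neg A\) as \(\mu\) varies, here because the \(\neg A\) branch is easier to score highly on.

```python
# Sketch (invented numbers): mu > 0 never changes the human's observed choice,
# but it does change which action an R(mu)-maximising AI takes, so the weighting
# between divergent preferences is not pinned down by revealed preferences.

# Achievable rewards on each branch (the not-A branch is "easier" to score on).
R_A_branch    = {"B": 1.0, "not_B": 0.2}    # human rewards if the AI does A
R_notA_branch = {"B": 0.3, "not_B": 5.0}    # human rewards if the AI does not-A

def human_choice(branch_rewards):
    return max(branch_rewards, key=lambda h: branch_rewards[h])

def ai_choice(mu):
    """AI picks A or not-A to maximise R(mu), anticipating the human's response."""
    value_A    = R_A_branch[human_choice(R_A_branch)]              # I_A term
    value_notA = mu * R_notA_branch[human_choice(R_notA_branch)]   # mu (1 - I_A) term
    return "A" if value_A >= value_notA else "not_A"

for mu in (0.1, 1.0, 10.0):
    # The human's behaviour on each branch is the same for every mu...
    assert human_choice(R_A_branch) == "B" and human_choice(R_notA_branch) == "not_B"
    # ...but the AI's choice depends on the unobserved weight mu.
    print(f"mu = {mu:>4}: AI chooses {ai_choice(mu)}")
```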

Thus preference divergence points are moments when the AI should turn explicitly to human meta-preferences to distinguish between them.

This can be made recursive. If we see the human meta-preferences as explicitly weighting \(R_A\) versus \(R_{\neg A}\), and hence determining \(R\), then consider a prior AI decision point \(Z\). If the human meta-preferences will differ depending on what the AI chooses at \(Z\), this gives two reward functions \(R_Z=I_A R_A+ \mu_Z(1-I_A)R_{\neg A}\) and \(R_{\neg Z}=I_A R_A+ \mu_{\neg Z}(1-I_A)R_{\neg A}\) with different weights \(\mu_Z\) and \(\mu_{\neg Z}\).

If these weights are sufficiently distinct, this could identify a meta-preference divergence point and hence a point where human meta-meta-preferences become relevant.
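A hedged sketch of that recursive step (the weights, threshold, and the ratio-based notion of “sufficiently distinct” are all mine, for illustration only):

```python
# Sketch of the recursive step (all values invented for illustration).
# An earlier AI decision Z determines which meta-preference weight mu the human
# ends up with; if mu_Z and mu_not_Z differ too much, that is a meta-preference
# divergence point, where meta-meta-preferences would become relevant.

mu_Z, mu_not_Z = 0.2, 5.0          # hypothetical weights induced by Z vs not-Z
META_DIVERGENCE_THRESHOLD = 2.0    # arbitrary criterion of "sufficiently distinct"

def R_weighted(mu, ai_does_A, r_A, r_not_A):
    """R = I_A * R_A + mu * (1 - I_A) * R_not_A, for one outcome's rewards."""
    return r_A if ai_does_A else mu * r_not_A

# "Sufficiently distinct" here: the ratio of the two weights exceeds a threshold.
meta_divergent = max(mu_Z, mu_not_Z) / min(mu_Z, mu_not_Z) > META_DIVERGENCE_THRESHOLD
print("meta-preference divergence point?", meta_divergent)   # True with these numbers
```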


