Intelligent Agent Foundations Forum
Our values are underdefined, changeable, and manipulable
discussion post by Stuart Armstrong 108 days ago

A putative new idea for AI control; index here.

When asked whether “communist” journalists could report freely from the USA, only 36% of 1950 Americans agreed. A follow-up question about American journalists reporting freely from the USSR got 66% agreement. When the order of the questions was reversed, 90% were in favour of American journalists - and an astounding 73% in favour of the communist ones.

There are many examples of survey responses depending on question order, or subtle issues of phrasing.

So there are people whose answers depended on question order. What then are the “true” values of these individuals?
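As a rough illustration of how large the order effect is, the figures quoted above can be put side by side. This is only a sketch: it assumes the two question orders were put to comparable samples, it ignores sampling error, and the dictionary keys are labels I have made up for the example. Under those assumptions, the gap on each question is a crude lower bound on the share of respondents whose answer depended on question order.

```python
# Percent agreeing, taken from the survey figures quoted in the post.
# "asked_first"/"asked_second" say where the question sat in the ordering.
agreement = {
    "communist journalists in the USA": {"asked_first": 36, "asked_second": 73},
    "American journalists in the USSR": {"asked_first": 90, "asked_second": 66},
}


def order_effect(cell):
    """Absolute gap, in percentage points, between the two orderings."""
    return abs(cell["asked_first"] - cell["asked_second"])


effects = {question: order_effect(cell) for question, cell in agreement.items()}

for question, gap in effects.items():
    print(f"{question}: {gap} percentage points")
```

On these numbers the gap is 37 points for the communist-journalist question and 24 points for the American-journalist one.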

Underdetermined values

I think the best way of characterising their values is to call them “underdetermined”. There were/are presumably some people for whom universal freedom of the press or strict national security were firm and established values. But for most, there were presumably some soft versions of freedom of the press and nationalism, and the first question triggered one narrative more strongly than the other. What, then, are their “real” values? That’s the wrong question - akin to asking if Argentina really won the 1986 World Cup.

Politicians can change the opinions of a large sector of the voting public with a single pronouncement - were the people’s real opinions the ones before, or the ones after? Again, this seems to be the wrong question. But don’t people fret about this inconsistency? I’d wager that they aren’t really aware of this, because people are the most changeable on issues they’ve given the least thought to.

And rationalists and EAs are not immune to this - we presumably don’t shift much on what we identify as our core values, but on less important values, we’re probably as changeable as anyone. But such contingent values can become very strong if attacked, thus becoming a core part of our identity - even if it’s very plausible we could have held the opposite position in a slightly different world.

Frameworks and moral updating

People often rely on a small number of moral frameworks and principles to guide them. When a new moral issue arises, we generally try to fit it into a moral framework - and when there are multiple ones that could fit, we can go in multiple directions, driven by mood, bias, tribalism, and many other contingent factors.

The moral frameworks themselves can and do shift, due to issues like tribalism, cognitive dissonance, life experience, and our own self-analysis. Or the frameworks can accumulate so many exceptions or refinements that they transform in practice if not in name - it’s very interesting that my leftist opinions agree with Anders Sandberg’s libertarian opinions on most important issues. We seem to have changed positions without changing labels.


In a sense, you could see all of metaethics as the refinement and analysis of these frameworks. There are urges towards simplicity, to get a more stable and elegant system, and towards complexity, to capture the full spectrum of human values. Much of philosophical disagreement can be seen as “Given A, proposition B (generally acceptable conclusion) implies C (controversial position I endorse)”, to which the response is “C is wrong, thus A (or B) is wrong as stated and needs to be refined or denied” - the logic is generally accepted, but which position is kept varies.

Since ethical disagreements are rarely resolved, it’s likely that the positions of professional philosophers, though more consistent, are also often driven by contingent and random factors. The process is not completely random - ethical ideas that are the least coherent, like the moral foundation of purity, tend to get discarded - but is certainly contingent. As before, I argue you should focus on the procedure P by which philosophers update their opinions, rather than the (hypothetical) R to which P may be supposed to converge.

Most people, however, will not have consistent meta-ethics, as they haven’t considered these questions. So their meta-opinions there will be even more subject to random influences than their base-level opinions.

Future preferences

There is an urgent question dividing the future world: should local FLOOBS be allowed to restrict use of BLARGS, or should ORFOILS instead pressure COLATS to agree to FLAPPLE the SNARFS?

OK, we don’t currently know what future political issues will be, but it’s clear there will be new issues (how do we know this? Because nobody cares today whether Richard Lionheart and Philip Augustus of France failed in their feudal duties to each other, nor did the people of that period worry much about medical tort reform). And people will take positions on them, and they will be incorporated into moral frameworks, causing those frameworks to change, and eventually philosophers may incorporate enough change into new metaethical frameworks.

I think it’s fair to say that our current positions on these future issues are even more under-determined than most of our values.

Contingent means manipulable

If our future values are determined by contingent facts, then a sufficiently powerful and intelligent agent can manipulate our values, by manipulating those facts. However, without some sort of learning-processes-with-contingent-facts, our values are underdetermined, and hence an agent that wanted to maximise human values/reward wouldn’t know what to do.

It was this realisation, that the agent could manipulate the values it was supposed to maximise, that caused me to look at ways of avoiding this.
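To make the two horns of this concrete, here is a toy sketch. Everything in it is hypothetical: the framings, the policies, and the approval numbers are invented for illustration and are not taken from any survey or from a real agent design. The only assumption carried over from the argument is that a human's eventual approval of a policy depends on a contingent fact (here, which framing they encountered first).

```python
# Hypothetical toy model: APPROVAL[framing][policy] is how strongly the human
# ends up endorsing a policy after being exposed to that framing.
# All names and numbers are made up for illustration.
APPROVAL = {
    "press_freedom_first": {"allow_reporting": 0.8, "ban_reporting": 0.2},
    "security_first": {"allow_reporting": 0.3, "ban_reporting": 0.7},
}


def best_plan(approval):
    """An agent that controls the framing picks the (framing, policy) pair
    with the highest eventual approval, i.e. it manipulates the values."""
    pairs = ((f, p) for f in approval for p in approval[f])
    return max(pairs, key=lambda fp: approval[fp[0]][fp[1]])


def approval_ranges(approval):
    """With no framing fixed, each policy only has a range of possible
    approvals, so 'the human's values' do not single out a policy."""
    policies = next(iter(approval.values())).keys()
    return {
        p: (min(approval[f][p] for f in approval),
            max(approval[f][p] for f in approval))
        for p in policies
    }


print(best_plan(APPROVAL))
print(approval_ranges(APPROVAL))
```

In this toy setup, the agent that is free to choose the framing simply selects the framing under which its preferred outcome scores best, while the agent that refuses to consider any framing is left with overlapping ranges for the two policies and no basis for choosing between them.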

Choices need to be made

We want a safe way to resolve the under-determination in human values, a task that gets more and more difficult as we move away from the usual world of today and into the hypothetical world that a superpowered AI could build.

But, precisely because of the under-determination, there are going to be multiple ways of resolving this safely. Which means that choices will need to be made as to how to do so. The process of making human values fully rigorous is not value-free.

(A minor example, that illustrated for me a tiny part of the challenge: does the way we behave when we’re drunk reveal our true values? And the answer: do you want it to? If there is a divergence in drunk and sober values, then accommodating drunk values is a decision - one that will likely be made sober.)




