There’s been a lot of work on how to reach agreement between people with different preferences or values. In practice, reaching agreement can be tricky, because of issues of extortion/trade and how the negotiations actually play out.\(\newcommand{\S}{\mathbb{S}}\)\(\newcommand{\R}{\mathbb{R}}\)\(\newcommand{\U}{\mathbb{U}}\)\(\newcommand{\O}{\mathbb{O}}\)\(\newcommand{\H}{\mathcal{H}}\)
To put those issues aside, let’s consider a much simpler case: where a single agent is uncertain about their own utility function. Then there is no issue of extortion, because the agent’s opponent is simply itself.
This type of comparison is called intertheoretic, rather than interpersonal.
A question of scale
It would seem that if the agent believed with probability \(p\) that it followed utility \(u\), and \(1-p\) that it followed utility \(v\), then it should simply follow utility \(w=(p)u + (1-p)v\).
But this is problematic, because \(u\) and \(v\) are only defined up to positive affine transformations. Translations are not a problem: sending \(u\) to \(u+c\) sends \(w\) to \(w+pc\). But scalings are: sending \(u\) to \(ru\) does not usually send \(w\) to any scaled version of \(w\).
So if we identify \([u]\) as the equivalence class of utilities equivalent to \(u\), then we can write \(w=(p)u + (1-p)v\), but it’s not meaningful to write \([w]=(p)[u] + (1-p)[v]\).
For clarity, we’ll call things like \(u\) (which map worlds to real values) utility functions, while \([u]\) will be called utility classes.
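The scaling problem above can be checked numerically. The following is a minimal sketch, not part of the original analysis: the two-strategy utilities `u` and `v`, the weight `p`, and the helper names `mix` and `best` are all made up for illustration.

```python
def mix(p, u, v):
    """Pointwise mixture w = p*u + (1-p)*v over a shared list of strategies."""
    return [p * a + (1 - p) * b for a, b in zip(u, v)]

def best(w):
    """Index of the strategy with the highest value under w."""
    return max(range(len(w)), key=lambda i: w[i])

p = 0.5
u = [1.0, 0.0]  # hypothetical: u prefers strategy 0
v = [0.0, 2.0]  # hypothetical: v prefers strategy 1

w = mix(p, u, v)

# Translating u by c only shifts w by the constant p*c, so rankings survive.
c = 3.0
w_shifted = mix(p, [a + c for a in u], v)

# Scaling u by r = 10 picks another representative of the same class [u],
# yet the mixture now ranks the strategies the other way round.
w_scaled = mix(p, [10 * a for a in u], v)

print(best(w), best(w_shifted), best(w_scaled))
```

Here `w` and `w_shifted` agree on the best strategy while `w_scaled` does not, even though `u` and `10 * u` represent the same utility class.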
The setup
This is work done in collaboration with Toby Ord, Owen Cotton-Barratt, and Will MacAskill. We had some slightly different emphases during that process. In this post, I’ll present my preferred version, while adding the more general approach at the end.
We will need the structure described in this post:
 A finite set \(\mathbb{S}\) of deterministic strategies the agent can take.
 A set \(\mathbb{U}\) of utility classes the agent might follow.
 A distribution \(p\) over \(\mathbb{U}\), reflecting the agent’s uncertainty over its own utility functions.
 Let \(\mathbb{U}_p \subset \mathbb{U}\) be the subset to which \(p\) assigns a nonzero weight. We’ll assume \(p\) puts no weight on trivial, constant utility functions.
We’ll assume here that \(p\) never gets updated; that is, the agent never sees any evidence that changes its values. The issue of updating \(p\) is analysed in the sections on reward learning agents.
We’ll be assuming that there is some function \(f\) that takes in \(\mathbb{S}\) and \(p\) and outputs a single utility class \(f(\mathbb{S},p) \in \mathbb{U}\) reflecting the agent’s values.
Basic axioms
 Relevant data: If the utility classes \([u]\) and \([v]\) have the same values on all of \(\mathbb{S}\), then they are interchangeable from \(f\)’s perspective. Thus, in the terminology of this post, we can identify \(\U\) with \(k(U)/\sim\).
This gives \(\U\) the structure of \(S \cup \{0\}\), where \(S\) is a sphere, and \(0\) corresponds to the trivial utility that is equal on all \(\S\). The topology of \(\U\) is the standard topology on \(S\), and the only open set containing \(\{0\}\) is the whole of \(\U\).
Then with a reasonable topology on the probability distribution on \(\U\) – such as the weak topology? – this leads to the next axiom:
Continuity: the function \(f\) is continuous in \(p\).
Individual normalisation: there is a function \(h\) that maps \(\mathbb{U}\) to individual utility functions, such that \(f(\mathbb{S},p)= \left[\int_{\mathbb{U}_p} h(u)p \right]\) (using \(p\) as a measure on \(\mathbb{U}_p\)).
The previous axiom means that all utility classes get normalised individually, then added together according to their weight in \(p\).
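As a rough illustration of "normalise individually, then add", here is a sketch for finitely many utility classes. The axioms do not pick out a particular \(h\); the midpoint-and-range normaliser below is an assumption chosen only because it is simple, and the names `h` and `f` in the code, along with the example numbers, are hypothetical.

```python
def h(u):
    """Hypothetical normaliser: centre u on its midpoint and scale its range to 1.

    The result is the same for every positive affine transformation of u,
    so it depends only on the utility class [u].
    """
    lo, hi = min(u), max(u)
    if hi == lo:
        raise ValueError("trivial (constant) utility cannot be normalised")
    mid = (hi + lo) / 2
    return [(x - mid) / (hi - lo) for x in u]

def f(weighted_utilities):
    """Representative of f(S, p): the p-weighted sum of normalised utilities.

    weighted_utilities is a list of (weight, utility) pairs, each utility
    being a list of values indexed by strategy.
    """
    n = len(weighted_utilities[0][1])
    total = [0.0] * n
    for weight, u in weighted_utilities:
        for i, x in enumerate(h(u)):
            total[i] += weight * x
    return total

u = [0.0, 1.0, 4.0]  # made-up utility over three strategies
v = [2.0, 0.0, 1.0]  # made-up utility over the same strategies

w = f([(0.6, u), (0.4, v)])

# Other representatives of the same two classes give the same answer.
u_alt = [10 * x + 7 for x in u]
v_alt = [0.5 * x - 3 for x in v]
w_alt = f([(0.6, u_alt), (0.4, v_alt)])

print(w, w_alt)
```

Because each utility is normalised before weighting, the choice of representative no longer affects the result, which is what makes the weighted sum meaningful at the level of classes.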
 Symmetry: If \(\rho\) is a stable permutation of \(\mathbb{S}\), then \(f(\mathbb{S},p) \circ \rho = f(\mathbb{S},p \circ \rho)\).
Symmetry essentially means that the labels of \(\mathbb{S}\), or the details of how the strategies are implemented, do not matter.
Utility reflection: \(h[-u]=-h[u]\).
Cloning indifference: If there exist \(s_1, s_2 \in\mathbb{S}\) such that for all \(u\) in \(\mathbb{U}\) on which \(p\) is nonzero, \(u(s_1)=u(s_2)\), then \(f(\mathbb{S},p)=f(\mathbb{S}-\{s_1\},p)\).
Cloning indifference means that the normalisation procedure does not care about multiple strategies that are equivalent on all possible utilities: it treats these strategies as if they were a single strategy.
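To make utility reflection (negating a utility negates its normalised form) and cloning indifference concrete, the sketch below checks both on one candidate normaliser. The midpoint-and-range rule `h`, the helper `drop`, and the example utilities are assumptions made for illustration, not the normalisation the post argues for.

```python
def h(u):
    """Hypothetical normaliser: centre u on its midpoint and scale its range to 1."""
    lo, hi = min(u), max(u)
    mid = (hi + lo) / 2
    return [(x - mid) / (hi - lo) for x in u]

def drop(values, i):
    """Remove the entry for strategy i."""
    return values[:i] + values[i + 1:]

# Made-up utilities over four strategies; strategies 0 and 1 are clones,
# since every utility assigns them the same value.
u = [3.0, 3.0, -1.0, 0.0]
v = [2.0, 2.0, 5.0, 1.0]
p_u, p_v = 0.3, 0.7

# Utility reflection: normalising -u gives minus the normalisation of u.
reflected = h([-x for x in u])
negated = [-x for x in h(u)]

# Cloning indifference: removing one clone before normalising gives the
# same weighted sum as removing it afterwards.
full = [p_u * a + p_v * b for a, b in zip(h(u), h(v))]
reduced = [p_u * a + p_v * b for a, b in zip(h(drop(u, 0)), h(drop(v, 0)))]

print(reflected == negated, drop(full, 0) == reduced)
```

This rule passes both checks because it only uses the minimum and maximum of each utility, which are unchanged by duplicating a strategy and simply swap roles under negation.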
We might want a stronger result, an independence of irrelevant alternatives. But this clashes with symmetry, so the following axioms attempt to get a weaker version of that requirement.
Relevance axioms
The above axioms are sufficient for the basics, but, as we’ll see, they’re compatible with a lot of different ways of combining utilities. The following two axioms attempt to put some sort of limitations on these possibilities.
First of all, we want to define events that are irrelevant. In the terminology of this post, let \(ha\) be a partial history (ending in an action), with at least two possible observations afterwards: \(o\) and \(o'\).
Then \(\S_{ha}=\S_{hao}\times\S_{hao'}\). Then if there exists a bijection \(\sigma\) between \(\S_{hao}\) and \(\S_{hao '}\) such that, for all \(u\) with \([u]\in\mathbb{U}_p\), \(u(s)=u(\sigma(s))\), then the observation \(o\) versus \(o'\) is irrelevant. See here for more on how to define \(u\) on \(\S_{hao}\) in this context.
Thus irrelevance means that the utilities in \(\mathbb{U}_p\) really do not ‘care’ about \(o\) versus \(o'\), and that the increased strategy set it allows is specious. So if we remove \(o\) as a possible observation (substituting \(o'\) instead) this should make no difference:
Weak irrelevance: If \(o\) versus \(o'\) given \(ha\) is irrelevant for \(p\), then making \(o\) (xor \(o'\)) impossible does not change \(f\).
Strong irrelevance: If \(o\) versus \(o'\) given \(ha\) is irrelevant for \(p\) and there is at least one other possible observation \(o''\) after \(ha\), then making \(o\) (xor \(o'\)) impossible does not change \(f\).
Full theory
In our full analysis, we considered other approaches and properties, and I’ll briefly list them here.
First of all, there is a set of prospects/options \(\mathbb{O}\) that may be different from the set of strategies \(\mathbb{S}\). This allows you to add other moral considerations, not just strictly consequentialist expected utility reasoning.
In this context, the \(f\) defined above was called a ‘rating function’, that rated the various utilities. With \(\mathbb{O}\), there are two other possibilities, the ‘choice function’ which selected the best option, and the permissibility function, which lists the options you are allowed to take.
If we’re considering options as outputs, rather than utilities, then we can do things like requiring the options to be Pareto only. We could also consider that the normalisation should stay the same if we remove the non-Pareto options or strategies. We might also consider that it’s the space of possible utilities that we should care about; so, for instance, if \(u(s_1)=1\), \(u(s_2)=0\) and \(u(s_3) = -1\), and similar results hold for all \([u]\) in \(\mathbb{U}_p\), then we may as well drop \(s_2\) from the strategy set, as its image is in the mixture of the other strategies.
Finally, some of the axioms above were presented in weaker forms (eg the individual normalisations) or stronger (eg independence of irrelevant alternatives).
