Model of human (ir)rationality
post by Stuart Armstrong 510 days ago

A putative new idea for AI control; index here.

This post is just an initial foray into modelling human irrationality, for the purpose of successful value learning. Its purpose is not to be a full model, but to have enough detail that various common situations can be successfully modelled. The important thing is to model humans in ways that humans can understand (as it’s our definition which determines what’s a bias and what’s a preference in humans).

## Humans, actions, and joint distributions

The human themselves is simply modelled as their brain (thus various human sense organs can be observed by the AI rather than being part of the description).

Let $$R$$ be the set of possible reward functions the human may be maximising. Let $$H_\pi$$ be the set of policies the human may be following. We’ll assume that $$H_\pi$$ is closed under the taking of mixed strategies.

The AI has a joint probability distribution $$P$$ over $$R$$, $$H_\pi$$ and events in the world. By conditioning on any element $$r\in R$$, $$P$$ defines a map $$\mu$$ from $$R$$ to probability distributions over $$H_\pi$$. Since $$H_\pi$$ is closed under the taking of mixed strategies, this means that $$\mu$$ can be seen as a map from $$R$$ to $$H_\pi$$.

The map $$\mu$$ and the marginal distribution $$P_R$$ ($$P$$ restricted to $$R$$) define $$P$$ entirely. Note that $$\mu$$ is what relates human actions to their explanation in terms of the reward $$R$$.
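As a minimal sketch of this decomposition (ignoring events in the world, and using made-up rewards, policies and numbers that are purely illustrative), the joint distribution can be rebuilt from $$P_R$$ and $$\mu$$, and then conditioned on an observed policy to get a posterior over rewards:

```python
# Hypothetical toy sets: two candidate rewards and two pure policies.
# Marginal P_R over rewards (assumed numbers, not from the post).
P_R = {"r_likes_tea": 0.5, "r_likes_coffee": 0.5}

# mu maps each reward to a mixed policy, i.e. a distribution over pure policies.
MU = {
    "r_likes_tea": {"always_tea": 0.75, "always_coffee": 0.25},
    "r_likes_coffee": {"always_tea": 0.25, "always_coffee": 0.75},
}


def joint(p_r, mu):
    """Rebuild the joint P(r, pi) = P_R(r) * mu(r)(pi)."""
    return {(r, pi): p_r[r] * mu[r][pi] for r in p_r for pi in mu[r]}


def posterior_over_rewards(p, observed_policy):
    """Condition the joint on an observed policy to get P(r | pi)."""
    weights = {r: q for (r, pi), q in p.items() if pi == observed_policy}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}


P = joint(P_R, MU)
print(posterior_over_rewards(P, "always_tea"))
```

With these numbers, seeing the human follow `always_tea` shifts the AI towards `r_likes_tea`; how far it shifts is entirely determined by $$\mu$$.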

## Basic properties of $$P$$

Here are a few properties $$P$$ could have:

1. The distribution $$P$$ is historical if $$P_R$$ is independent of any action the AI takes.
2. An AI’s action $$a$$ overwrites the reward if $$\mu$$ is constant, conditional on $$a$$, while $$P_R$$ is still “broad” (“broad” is not fully defined, but $$P_R$$ is certainly broad enough if it assigns non-zero probability to both an $$r$$ and $$-r$$).
3. The distribution $$P$$ is $$Q$$-rational if there exists a prior distribution $$Q$$ over the universe such that $$\mu$$ maps $$r\in R$$ to the optimal policy for an $$r$$-maximising agent with prior $$Q$$.
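The second property can be checked mechanically on a toy example. The sketch below is only illustrative: the names `mu_is_constant`, `is_broad` and `overwrites_reward` are hypothetical, and `is_broad` implements just the sufficient condition mentioned above (some $$r$$ and $$-r$$ both having non-zero probability), since “broad” is not fully defined.

```python
def mu_is_constant(mu):
    """True if every reward is mapped to the same (mixed) policy."""
    policies = list(mu.values())
    return all(p == policies[0] for p in policies)


def is_broad(p_r, negation):
    """Sufficient check only: some r and its negation both have non-zero probability."""
    return any(
        p_r.get(r, 0) > 0 and p_r.get(negation.get(r), 0) > 0 for r in p_r
    )


def overwrites_reward(mu_given_a, p_r_given_a, negation):
    """Action a overwrites the reward if mu is constant while P_R stays broad."""
    return mu_is_constant(mu_given_a) and is_broad(p_r_given_a, negation)


# Assumed toy data: a reward, its negation, and mu conditional on an action a.
negation = {"r": "minus_r", "minus_r": "r"}
p_r_given_a = {"r": 0.5, "minus_r": 0.5}
mu_overwritten = {"r": {"press_button": 1.0}, "minus_r": {"press_button": 1.0}}
mu_informative = {"r": {"press_button": 1.0}, "minus_r": {"do_nothing": 1.0}}

print(overwrites_reward(mu_overwritten, p_r_given_a, negation))  # True
print(overwrites_reward(mu_informative, p_r_given_a, negation))  # False
```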

It’s clear that if $$P$$ is historical, the AI will treat the human’s reward function as something it has to discover, and can’t influence. An action $$a$$ that overwrites the reward means that the human’s policy is fixed by action $$a$$, independently of whatever reward it might have. This is bad because a) the human actions are no longer informative to the AI about their reward, and b) the human actions are likely suboptimal with respect to that reward.

Note that stratification can be seen as taking a non-historical distribution, and making it historical via counterfactual.

## Advanced properties of $$P$$

Those basic properties can define a basic model of a human. But humans have far more biases and irrationalities. Though these are multiple and complicated, we’ll focus here on a few general properties that can capture a lot of these irrationalities in relatively “natural” ways.

By “natural”, we mean human understandable properties that encode biases in ways that are not too complicated and are close to how we understand them.

Humans are not perfect logical reasoners who fully and immediately know all the infinite implications of any statement. Now, modelling bounded rationality or logical uncertainty is going to be tricky, but we can for the moment simply assume that humans only partially update their probabilities when new data comes in.

Specifically, there is a function $$f$$ which maps an observation $$o$$ and previous history $$h_{<o}$$ to the set of statements that get updated.
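A minimal sketch of such an $$f$$, under loud assumptions: statements are plain strings, the selector is a made-up word-overlap rule, and `f` is also passed the current beliefs so it knows which statements exist (the post only gives it $$o$$ and $$h_{<o}$$). Statements picked out by `f` get a normal Bayesian update; everything else is left untouched.

```python
def bayes_update(prob, likelihood_ratio):
    """Standard odds-form Bayesian update of a single probability."""
    odds = prob / (1.0 - prob) * likelihood_ratio
    return odds / (1.0 + odds)


def f(observation, history, beliefs):
    """Hypothetical selector: update only statements sharing a word with the observation."""
    words = set(observation.split())
    return {s for s in beliefs if words & set(s.split())}


def limited_update(beliefs, observation, history, likelihood_ratios):
    """Update only the statements selected by f; leave the rest as they were."""
    touched = f(observation, history, beliefs)
    return {
        s: bayes_update(p, likelihood_ratios[s]) if s in touched else p
        for s, p in beliefs.items()
    }


beliefs = {"rain today": 0.5, "umbrella needed": 0.5}
# A perfect reasoner would update both statements on this evidence.
ratios = {"rain today": 3.0, "umbrella needed": 3.0}
new_beliefs = limited_update(beliefs, "dark clouds mean rain", [], ratios)
print(new_beliefs)
```

Here the human notices the implication for "rain today" but fails to carry it through to "umbrella needed", even though the evidence bears on both.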

Humans don’t tend to update their beliefs in a fully Bayesian fashion. Thus define an update function $$u$$ which is used instead of Bayesian updates. If we update the odds ratio of event $$A$$ according to evidence $$E$$, the correct Bayesian update is

• $$\frac{P(A|E)}{P(\neg A|E)} = \left(\frac{P(E|A)}{P(E|\neg A)}\right) \frac{P(A)}{P(\neg A)}$$.

Then $$u$$ could be a function of one variable:

• $$\frac{P(A|E)}{P(\neg A|E)} = u\left(\frac{P(E|A)}{P(E|\neg A)}\right) \frac{P(A)}{P(\neg A)}$$.

Or of two:

• $$\frac{P(A|E)}{P(\neg A|E)} = u\left(\frac{P(E|A)}{P(E|\neg A)}, \frac{P(A)}{P(\neg A)} \right) \frac{P(A)}{P(\neg A)}$$.

Or $$u$$ could also be a function of the observation $$o$$ and prior history $$h_{<o}$$.
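To make this concrete, here is a small sketch with two illustrative choices of $$u$$. Neither is claimed to be the right model of human updating: `conservative` is a one-variable $$u$$ that shrinks the likelihood ratio towards 1, and `confirmation_biased` is a two-variable $$u$$ that discounts evidence against the currently favoured side. The exponent `alpha` is an assumed parameter.

```python
def posterior_odds(prior_odds, likelihood_ratio, u=lambda lr, prior: lr):
    """Posterior odds = u(likelihood ratio, prior odds) * prior odds.

    The default u returns the likelihood ratio unchanged, i.e. a Bayesian update.
    """
    return u(likelihood_ratio, prior_odds) * prior_odds


def conservative(lr, prior_odds, alpha=0.5):
    """One-variable u: ignores the prior and shrinks the likelihood ratio toward 1."""
    return lr ** alpha


def confirmation_biased(lr, prior_odds, alpha=0.5):
    """Two-variable u: evidence against the favoured side is discounted."""
    agrees = (lr >= 1) == (prior_odds >= 1)
    return lr if agrees else lr ** alpha


print(posterior_odds(1.0, 4.0))                        # Bayesian: 4.0
print(posterior_odds(1.0, 4.0, conservative))          # under-updates: 2.0
print(posterior_odds(2.0, 0.25, confirmation_biased))  # 1.0, where Bayes gives 0.5
```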

Now, humans update some probabilities better than others, but we’ll defer that to when we talk of multi-agent models.

### Bounded rationality: inconceivable actions

Humans don’t fully explore the space of possible actions and policies, preferring to stick to those that are the most easily accessible. So most actions are literally inconceivable to us.

This can be modelled by a function $$g$$ which maps $$o$$ and/or $$h_{<o}$$ to a set of possible actions. This map will determine a subset $$G \subset H_\pi$$ of possible human policies (these are the policies that only ever take actions compatible with $$g$$).
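As an illustrative sketch, assume a toy world with two observations and three actions, a made-up $$g$$ that depends only on the current observation (not on $$h_{<o}$$), and deterministic memoryless policies. Then $$G$$ can be found by enumeration:

```python
from itertools import product

# Hypothetical toy world.
OBSERVATIONS = ["hungry", "tired"]
ACTIONS = ["eat", "sleep", "juggle"]


def g(observation):
    """Hypothetical conceivability map: the actions that come to mind for each observation."""
    return {"hungry": {"eat", "sleep"}, "tired": {"sleep"}}[observation]


def all_policies():
    """Every deterministic policy, as a dict from observation to action."""
    for choice in product(ACTIONS, repeat=len(OBSERVATIONS)):
        yield dict(zip(OBSERVATIONS, choice))


def compatible(policy):
    """A policy is in G if it only ever takes actions allowed by g."""
    return all(policy[o] in g(o) for o in OBSERVATIONS)


G = [p for p in all_policies() if compatible(p)]
print(len(list(all_policies())), len(G))  # 9 policies in total, 2 of them in G
```

Juggling never appears in any policy in `G`: it is not that the human rejects it, it simply never gets considered.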

### Multi-agent models

Multi-agent models are useful for modelling the various contradictions in the human psyche (system 1 vs system 2, conscious vs subconscious, short vs long term preferences etc).

This can be modelled as seeing the human as consisting of different agents $$A_0$$, $$A_1$$, … $$A_n$$, each of them with their own possible biases as described above (though they must share a common function $$g$$). There is a function $$L$$ which, taking $$o$$ and possibly $$h_{<o}$$ into account, weights the verdicts of the different subagents and outputs the ultimate action.

These multiple agents may be optimising for different reward functions, so $$\mu$$ can break down into multiple $$\mu_i$$, one for each agent.
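A minimal two-agent sketch, with everything in it assumed for illustration: the subagents `system_1` and `system_2` return fixed distributions over actions (standing in for $$\mu_i(r_i)$$), and $$L$$ is taken to be a simple observation-dependent linear mixture of their verdicts, which is only one of many forms it could take.

```python
def system_1(observation):
    """Hypothetical short-term subagent: mostly wants cake."""
    return {"eat_cake": 0.75, "eat_salad": 0.25}


def system_2(observation):
    """Hypothetical long-term subagent: mostly wants salad."""
    return {"eat_cake": 0.25, "eat_salad": 0.75}


def L(observation, verdicts):
    """Assumed weighting function: a linear mixture whose weights depend on the observation."""
    weights = [0.75, 0.25] if observation == "stressed" else [0.25, 0.75]
    combined = {}
    for w, verdict in zip(weights, verdicts):
        for action, p in verdict.items():
            combined[action] = combined.get(action, 0.0) + w * p
    return combined


def human_action_distribution(observation):
    """The human's (mixed) action: L applied to the subagents' verdicts."""
    return L(observation, [system_1(observation), system_2(observation)])


print(human_action_distribution("stressed"))
print(human_action_distribution("calm"))
```

The same pair of subagents produces different overall behaviour depending on the observation, purely through how $$L$$ weights them.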

### Recursion and introspection

This is the most complex situation here. Humans have explicit meta-preferences (“I don’t want to be racist”, “I want to be rational”, “I want to be right”, etc.) that influence how they update their beliefs. Generally these meta-preferences view the human as an integrated whole.

We can try and model this by positing meta preferences $$M$$ which are desirable properties of the agent as a whole, and an introspection function $$I$$ which will trigger occasionally, and map some feature of the agent closer to $$M$$. This is deliberately very vague, and I’ll try and flesh it out and formalise it as needed.
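Since this is deliberately vague, the following is only one possible toy reading, with every detail assumed. The “feature of the agent” is a hypothetical parameter `alpha` measuring how much of a Bayesian update the agent performs (1.0 meaning fully Bayesian), the meta-preference $$M$$ (“I want to be rational”) is the target value 1.0, and $$I$$ triggers every few steps and moves `alpha` part of the way towards that target.

```python
def introspect(alpha, target=1.0, step=0.5):
    """Hypothetical I: move the feature alpha part of the way toward the meta-preference M."""
    return alpha + step * (target - alpha)


def run(alpha, n_steps, trigger_every=3):
    """Track alpha over time, with introspection triggering only occasionally."""
    history = [alpha]
    for t in range(1, n_steps + 1):
        if t % trigger_every == 0:
            alpha = introspect(alpha)
        history.append(alpha)
    return history


print(run(0.5, 6))  # [0.5, 0.5, 0.5, 0.75, 0.75, 0.75, 0.875]
```

The agent gets closer to $$M$$ each time introspection fires, but never reaches it in finite time under this particular rule.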

## The modelled human

Thus we can define what it means for $$P$$ (and hence $$\mu$$) to represent an irrational human:

1. The prior $$P$$ is $$(\{A_i,f_i,u_i,r_i\},g,L)$$-consistent if the human’s actions are chosen by applying the agent weighting function $$L$$ to $$\mu_i(r_i)$$, where $$\mu_i$$ maps any $$r$$ to the action chosen by an agent $$A_i$$ with reward $$r$$, bounded rationality $$f_i$$, partial update $$u_i$$, and available actions $$g$$.
2. The prior $$P$$ is $$(\{A_i,f_i,u_i,r_i\},g,L,M,I)$$-consistent if it is as above, plus the action of the introspection function $$I$$ on the meta-preferences $$M$$.
