Uninfluenceable agents
post by Stuart Armstrong 954 days ago | Vanessa Kosoy and Patrick LaVictoire like this | 7 comments

A putative new idea for AI control; index here.

After explaining biased learning processes, we can now define influenceable (and uninfluenceable) learning processes.

Recall that the (unbiased) influence problem arises when agents randomise their preferences, as a sort of 'artificial learning' process, if the real learning process is slow or incomplete.

Suppose we had a learning process that was impossible to influence. What would that look like? It seems it must be one where the outcome of the learning process depends only upon some outside fact about the universe, a fact the agent has no control over.


Definition: A learning process $$\vd$$ on the POMDP $$\env$$ is initial-state determined if there exists a function $$f_\vd: \mathcal{S}\to\Delta\mathcal{R}$$ such that $$\vd$$ factors through knowledge of the initial state $$s_0$$. In other words:

• $$\vd(\cdot\mid h_m)=\sum_{s\in\mathcal{S}} \env(s_0=s\mid h_m)f_\vd(s).$$

Thus uncertainty about the correct reward function comes only from uncertainty about the initial state $$s_0$$.
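The condition in the definition can be checked mechanically: given the environment's posterior over the initial state after each history and a candidate witness $$f_\vd$$, verify that the learning process equals the corresponding mixture. A minimal Python sketch of that check (distributions as dicts; all names are illustrative, not from the original post):

```python
# Minimal sketch of the initial-state-determined condition.
# Distributions are dicts mapping outcomes to probabilities.

def mixture(posterior_s0, f):
    """Push the posterior P(s0 | h) through f: s0 -> Delta(R)."""
    mixed = {}
    for s, p_s in posterior_s0.items():
        for r, p_r in f[s].items():
            mixed[r] = mixed.get(r, 0.0) + p_s * p_r
    return mixed

def is_initial_state_determined(vd, posteriors, f, tol=1e-9):
    """Check vd(. | h) = sum_s env(s0 = s | h) f(s) for every history h."""
    for h, dist in vd.items():
        mixed = mixture(posteriors[h], f)
        rewards = set(dist) | set(mixed)
        if any(abs(dist.get(r, 0.0) - mixed.get(r, 0.0)) > tol for r in rewards):
            return False
    return True
```

Here `vd` maps histories to reward-function distributions, and `posteriors` maps the same histories to distributions over the initial state.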

This is a partial definition; to complete it, we need the concept of counterfactually equivalent POMDPs:

Definition: A learning process $$\vd$$ on $$\env$$ is uninfluenceable if there exists a counterfactually equivalent $$\env'$$ such that $$\vd$$ is initial-state determined on $$\env'$$.

Though the definitions of unbiased and uninfluenceable seem quite different, they're actually quite closely related, as we'll see in a subsequent post. Uninfluenceable can be seen as 'unbiased given all background information about the universe'. In the old notation, bias is explored in the sophisticated cake-or-death problem, (unbiased) influence in the ultra-sophisticated version.

# Example

Consider the environment $$\env$$ presented here:

In this POMDP (actually MDP, since it’s fully observed), the agent can wait for a human to confirm the correct reward function (action $$a^w$$) or randomise its reward (action $$a^r$$). After either action, the agent gets equally likely feedback $$0$$ or $$1$$ (states $$s^{wi}$$ and $$s^{ri}$$, $$0\leq i \leq 1$$).

We have two plausible learning processes: $$\vd$$, where the agent learns only from the human input, and $$\vd'$$, where the agent learns from either action. Technically:

• $$\vd(R_i\mid o_0a^wo^{wi})=\vd'(R_i\mid o_0a^wo^{wi})=1$$,
• $$\vd(R_i\mid o_0a^ro^{rj})=0.5$$ for all $$0\leq i,j \leq 1$$,
• $$\vd'(R_i\mid o_0a^ro^{ri})=1$$,

with all other probabilities zero.

Now, $$\env$$ is counterfactually equivalent to $$\env''$$:

And on $$\env''$$, $$\vd$$ is clearly initial-state determined (with $$f_{\vd}(s_0^i)(R_i)=1$$), and is thus uninfluenceable on $$\env''$$ and $$\env$$.
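To spell out that check: in $$\env''$$ the initial state $$s_0^i$$ already fixes the human's answer, while the randomisation outcome is independent of it. A small Python sketch verifying the factorisation for $$\vd$$ (history strings and state names are made up for illustration):

```python
# Hypothetical encoding of the example on env'': two equally likely
# initial states s0_0, s0_1; the human's answer equals the initial
# state, while randomisation reveals nothing about it.

# vd: learn only from the human (a_w); randomisation leaves 0.5/0.5.
vd = {
    "o0 aw ow0": {"R0": 1.0, "R1": 0.0},
    "o0 aw ow1": {"R0": 0.0, "R1": 1.0},
    "o0 ar or0": {"R0": 0.5, "R1": 0.5},
    "o0 ar or1": {"R0": 0.5, "R1": 0.5},
}

# Posterior over the initial state of env'' after each history.
posteriors = {
    "o0 aw ow0": {"s0_0": 1.0, "s0_1": 0.0},
    "o0 aw ow1": {"s0_0": 0.0, "s0_1": 1.0},
    "o0 ar or0": {"s0_0": 0.5, "s0_1": 0.5},  # a_r is uninformative
    "o0 ar or1": {"s0_0": 0.5, "s0_1": 0.5},
}

# Witness f_vd(s0_i)(R_i) = 1.
f = {"s0_0": {"R0": 1.0}, "s0_1": {"R1": 1.0}}

# Check vd(. | h) = sum_s P(s0 = s | h) f(s) on every history.
for h, dist in vd.items():
    mixed = {}
    for s, p_s in posteriors[h].items():
        for r, p_r in f[s].items():
            mixed[r] = mixed.get(r, 0.0) + p_s * p_r
    assert all(abs(dist.get(r, 0.0) - mixed.get(r, 0.0)) < 1e-9
               for r in {"R0", "R1"})
```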

On the other hand, $$\vd'$$ is initial-state determined on $$\env'$$:

However, $$\env'$$ is not counterfactually equivalent to $$\env$$. In fact, there is no POMDP counterfactually equivalent to $$\env$$ on which $$\vd'$$ is initial-state determined, so $$\vd'$$ is not uninfluenceable.
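The obstruction can be seen concretely on $$\env''$$: the two post-randomisation histories leave the same posterior over the initial state, so any mixture $$\sum_s \env''(s_0=s\mid h_m)f(s)$$ assigns them the same reward distribution, while $$\vd'$$ assigns them different ones. A sketch of that argument in Python (names illustrative; this checks only $$\env''$$, not every counterfactually equivalent POMDP):

```python
# In env'' the randomisation outcome carries no information about s0,
# so both post-randomisation histories have the same posterior over s0.
posterior_after_ar = {"s0_0": 0.5, "s0_1": 0.5}

# vd' nevertheless assigns them different reward distributions.
vd_prime = {
    "o0 ar or0": {"R0": 1.0, "R1": 0.0},
    "o0 ar or1": {"R0": 0.0, "R1": 1.0},
}

# The mixture sum_s P(s0 = s | h) f(s) depends on h only through the
# posterior, so it cannot separate these two histories: no witness f
# can make vd' initial-state determined on env''.
assert vd_prime["o0 ar or0"] != vd_prime["o0 ar or1"]
```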

by Jessica Taylor 953 days ago | Patrick LaVictoire likes this

My model of a person who is optimistic about value learning (e.g. Stuart Russell, Dylan Hadfield-Menell) says something like: Well, of course the learning process P should be initial-state-determined! That’s how all the value learning processes defined in the literature (IRL, CIRL) work. Why would you ever consider a learning process that doesn’t treat the true human values as a fact already determined by the initial state? It seems like they have obvious problems (i.e. bias/influence). So I don’t see the motivation for using this formalism instead of IRL/CIRL, in which (the fact that the learning process is initial state determined) is baked in.

To which my model of a more pessimistic position replies: Human terminal values don’t actually exist at the initial time. They’re constructed through a reflection process that occurs over time. It’s not like the fact that (my terminal values think X is good) already exists and I just have trouble acting rationally on this fact. Any model in which the terminal values are causally prior to behavior is going to be inaccurate, and will therefore learn the wrong values. So we have to see value learning as “interpretation” rather than “learning a historical fact”, and somehow do this without running into problems with bias/influence.

My steelman of the more pessimistic position seems to partially match your post here; I just want to check that this is what you think the motivation for your current formalism is.
by Jessica Taylor 953 days ago | Patrick LaVictoire likes this

I think it’s important to distinguish between ambitious and narrow value learning here. It does seem plausible that many/most narrow values do exist at the initial time step, so something like IRL should be able to recover them. On the other hand, preferences over long-term outcomes probably don’t exist at the initial time in enough detail to act on.

IMO the main problem with ambitious value learning is that the only plausible way of doing it goes through a trusted reflection process (e.g. HCH, or having the AI doing philosophy using trusted methods). And if we trust the reflection process to construct preferences over long-term outcomes, we might as well use it to directly decide what actions to take, so ambitious value learning is FAI-complete. (In other words, there isn’t a clear advantage to asking the reflection process “how valuable is X” instead of “which action should the AI take”; they seem about as difficult to answer correctly.)

IMO, the main problem with narrow value learning is that there isn’t a very good story for how an agent that is smarter than its overseers can pursue its overseers’ instrumental values, given that its overseers’ instrumental values are incoherent from its perspective; this seems related to the hard problem of corrigibility. One way to resolve this is to make sure the overseer is smarter than the value-learning agent at each step, in which case narrow value learning is an implementation strategy for ALBA (and capability amplification is doing the “philosophical” heavy lifting). Another way is to figure out how the AI can pursue the instrumental values of an agent weaker than itself; this probably involves something like solving the hard problem of corrigibility.

I am curious whether you are thinking more of ambitious or narrow value learning when you write posts like this one.
by Stuart Armstrong 950 days ago

I’m thinking counterfactually (that’s a subsequent post, which replaces the “stratified learning” one), so the thing that distinguishes ambitious from narrow learning is that narrow learning is the same in many counterfactual situations, while ambitious learning is much more floppy/dependent on the details of the counterfactual.
by Jessica Taylor 949 days ago

OK, I didn’t understand this comment at all but maybe I should wait until you post on counterfactuals.
by Vanessa Kosoy 950 days ago

Hmm. When you say “human terminal values don’t actually exist at the initial time,” what do you mean by “exist”? IMO, they exist in the sense that they are implicit in the algorithm the human brain is executing. They are causally prior to behavior, in the sense that the algorithm is causally prior to the output of the algorithm.

That is, they are implicit rather than explicit because, indeed, we can in principle interpret the same algorithm as a consequentialist in different, mutually inconsistent, ways. However, not all interpretations are born equal: some will be more natural, some more contrived. I expect that some sort of Occam’s razor should select the interpretations that we would accept as “correct”: otherwise, why is the concept of values meaningful at all?

Indeed, if these values only appear at the end of some long reflection process, then why should I care about the outcome of this process? Unless I already possess the value of caring about this outcome, in which case we again conclude that the values already effectively exist at present.

(This feels at least partially like an argument about definitions, but clarifying the definitions would probably be useful.)
by Jessica Taylor 949 days ago

I think I was previously confusing terminal values with ambitious values, and am now not confusing them. Ambitious values are about things like how the universe should be in the long run, and are coherent (e.g. they’re a utility function over physical universe states). Narrow values are about things like whether you’re currently having a nice time and being in control of your AI systems, and are not coherent. Ambitious and narrow values can be instrumental or terminal.

The human cognitive algorithm is causally prior to behavior. It is also causally prior to human ambitious values. But human ambitious values are not causally prior to human behavior. Making human preferences coherent can only be done through a reflection process, so ambitious values come at the end of this process and can’t go backwards in logical time to influence behavior. I.e. algorithm $$\rightarrow$$ behavior, algorithm $$\rightarrow$$ ambitious values. IRL says values $$\rightarrow$$ behavior, which is wrong in the case of ambitious values.

> Indeed, if these values only appear at the end of some long reflection process, then why should I care about the outcome of this process? Unless I already possess the value of caring about this outcome, in which case we again conclude that the values already effectively exist at present.

Caring about this reflection process seems like a narrow value. See my comment here about why narrow value learning is hard.
by Stuart Armstrong 950 days ago

I look at these issues later on in the paper. And there are suggestions (mostly informal) that do have problems with bias and influence: basically, almost all learning processes that involve human interaction. As for CIRL, I think it’s bias free in principle, but not in practice, for reasons roughly analogous to yours.
