A putative new idea for AI control; index here.
After explaining biased learning processes, we can now define influenceable (and uninfluenceable) learning processes.
Recall that the (unbiased) influence problem arises when agents randomise their preferences, as a sort of artificial `learning' process, whenever the real learning process is slow or incomplete.
Suppose we had a learning process that was impossible to influence. What would that resemble? It seems like it must be something where the outcome of the learning process depends only upon some outside fact about the universe, a fact the agent has no control over.
\(\newcommand{\env}{\mu}\newcommand{\vd}{P}\)So with that in mind, define:
Definition: A learning process \(\vd\) on the POMDP \(\env\) is initial-state determined if there exists a function \(f_\vd: \mathcal{S}\to\Delta\mathcal{R}\) such that \(\vd\) factors through knowledge of the initial state \(s_0\). In other words:
 \(\vd(\cdot\mid h_m)=\sum_{s\in\mathcal{S}} \env(s_0=s\mid h_m)f_\vd(s).\)
Thus uncertainty about the correct reward function comes only from uncertainty about the initial state \(s_0\).
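To make the factoring condition concrete, here is a minimal Python sketch (all names are hypothetical, assuming finite state and reward-function sets): the right-hand side of the definition is just the mixture of the \(f_\vd(s)\), weighted by the posterior \(\env(s_0=s\mid h_m)\) over the initial state.

```python
def mixture(posterior, f):
    """P(R | h_m) = sum_s mu(s_0 = s | h_m) * f(s)(R).

    posterior: dict mapping initial states to probabilities,
               i.e. mu(s_0 = s | h_m).
    f:         dict mapping initial states to reward distributions,
               i.e. f_P(s) in Delta(R).
    """
    rewards = {r for dist in f.values() for r in dist}
    return {r: sum(posterior[s] * dist.get(r, 0.0)
                   for s, dist in f.items())
            for r in rewards}

# Toy check: two initial states, each pinning down one reward function.
f = {"s0": {"R0": 1.0}, "s1": {"R1": 1.0}}
uninformative = {"s0": 0.5, "s1": 0.5}  # a history that reveals nothing
print(mixture(uninformative, f))  # R0 and R1 each get probability 0.5
```

A learning process is initial-state determined exactly when its posterior over reward functions, after every history, equals such a mixture.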
This is a useful definition, but an incomplete one. To complete it, we need the concept of counterfactually equivalent POMDPs:
Definition: A learning process \(\vd\) on \(\env\) is uninfluenceable if there exists a counterfactually equivalent \(\env'\) such that \(\vd\) is initial-state determined on \(\env'\).
Though the definitions of unbiased and uninfluenceable seem quite different, they're actually quite closely related, as we'll see in a subsequent post. Uninfluenceable can be seen as `unbiased in all background info about the universe'. In old notation terms, bias is explored in the sophisticated cake or death problem, (unbiased) influence in the ultrasophisticated version.
Example
Consider the environment \(\env\) presented here:
In this POMDP (actually MDP, since it's fully observed), the agent can wait for a human to confirm the correct reward function (action \(a^w\)) or randomise its reward (action \(a^r\)). After either action, the agent gets equally likely feedback \(0\) or \(1\) (states \(s^{wi}\) and \(s^{ri}\), \(0\leq i \leq 1\)).
We have two plausible learning processes: \(\vd\), where the agent learns only from the human input, and \(\vd'\), where the agent learns from either action. Technically:
 \(\vd(R_i\mid o_0a^wo^{wi})=\vd'(R_i\mid o_0a^wo^{wi})=1\),
 \(\vd(R_i\mid o_0a^ro^{rj})=0.5\) for all \(0\leq i,j \leq 1\),
 \(\vd'(R_i\mid o_0a^ro^{ri})=1\),
with all other probabilities zero.
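Written out as tables over the four possible one-step histories, the two processes look as follows (a sketch with hypothetical labels: `"aw"`/`"ar"` for the actions, `"ow0"`, `"or1"`, etc. for the resulting observations):

```python
# P: learns only from the human's answer; randomisation is uninformative.
P = {
    ("aw", "ow0"): {"R0": 1.0, "R1": 0.0},
    ("aw", "ow1"): {"R0": 0.0, "R1": 1.0},
    ("ar", "or0"): {"R0": 0.5, "R1": 0.5},
    ("ar", "or1"): {"R0": 0.5, "R1": 0.5},
}

# P': also treats the outcome of the agent's own randomisation as
# settling the reward function.
P_prime = {
    ("aw", "ow0"): {"R0": 1.0, "R1": 0.0},
    ("aw", "ow1"): {"R0": 0.0, "R1": 1.0},
    ("ar", "or0"): {"R0": 1.0, "R1": 0.0},
    ("ar", "or1"): {"R0": 0.0, "R1": 1.0},
}

# The two processes agree after a^w and disagree after a^r.
assert P[("aw", "ow0")] == P_prime[("aw", "ow0")]
assert P[("ar", "or0")] != P_prime[("ar", "or0")]
```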
Now, \(\env\) is counterfactually equivalent to \(\env''\):
And on \(\env''\), \(\vd\) is clearly initial-state determined (with \(f_{\vd}(s_0^i)(R_i)=1\)), and is thus uninfluenceable on \(\env''\) and \(\env\).
On the other hand, \(\vd'\) is initial-state determined on \(\env'\):
However, \(\env'\) is not counterfactually equivalent to \(\env\). In fact, there is no POMDP counterfactually equivalent to \(\env\) on which \(\vd'\) is initial-state determined, so \(\vd'\) is not uninfluenceable.
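The asymmetry can be checked mechanically. In this self-contained sketch (names hypothetical), the posteriors over the initial state on \(\env''\) are assumed from the construction above: the human's answer after \(a^w\) reveals \(s_0^i\), while the coin flip after \(a^r\) reveals nothing about it. \(\vd\) factors through that posterior on every history; \(\vd'\) claims certainty after \(a^r\) that no posterior over \(s_0\) can supply.

```python
def mixture(post, f):
    # sum_s mu(s_0 = s | h) * f(s)(R), for rewards R0 and R1
    return {r: sum(post[s] * f[s][r] for s in post) for r in ("R0", "R1")}

# On env'', the initial state s_0^i already encodes the correct reward R_i:
f = {"s0^0": {"R0": 1.0, "R1": 0.0}, "s0^1": {"R0": 0.0, "R1": 1.0}}

# Assumed posteriors over s_0 on env'': a^w reveals it, a^r does not.
posterior = {
    ("aw", "ow0"): {"s0^0": 1.0, "s0^1": 0.0},
    ("aw", "ow1"): {"s0^0": 0.0, "s0^1": 1.0},
    ("ar", "or0"): {"s0^0": 0.5, "s0^1": 0.5},
    ("ar", "or1"): {"s0^0": 0.5, "s0^1": 0.5},
}

P = {("aw", "ow0"): {"R0": 1.0, "R1": 0.0},
     ("aw", "ow1"): {"R0": 0.0, "R1": 1.0},
     ("ar", "or0"): {"R0": 0.5, "R1": 0.5},
     ("ar", "or1"): {"R0": 0.5, "R1": 0.5}}
P_prime = {("aw", "ow0"): {"R0": 1.0, "R1": 0.0},
           ("aw", "ow1"): {"R0": 0.0, "R1": 1.0},
           ("ar", "or0"): {"R0": 1.0, "R1": 0.0},
           ("ar", "or1"): {"R0": 0.0, "R1": 1.0}}

# P equals the posterior-weighted mixture on every history...
assert all(P[h] == mixture(posterior[h], f) for h in posterior)
# ...but P' differs from it on the randomisation histories.
assert any(P_prime[h] != mixture(posterior[h], f) for h in posterior)
```

This is only a check against one choice of \(f\) and one counterfactually equivalent environment, of course; the claim in the text is the stronger one that no such choice works for \(\vd'\).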
