To study algorithms that can modify their own reward functions, we can define vector-valued versions of reinforcement learning concepts.
Imagine that there are several different goods we could care about; then a utility function is represented by a preference vector \(\vec\theta\). Furthermore, if it is possible for the agent (or the environment, or other agents) to modify \(\vec\theta\), then we will want to index these preference vectors by timestep.
Consider an agent that can take actions, some of which affect its own reward function. This agent would (and should) wirehead if it attempts to maximize the discounted rewards as calculated by its future selves; i.e. at timestep \(n\) it would choose actions to maximize
\begin{eqnarray} U_n = \sum_{k\geq n} \gamma_k \vec{x}_k\cdot\vec{\theta}_k\end{eqnarray}
where \(\vec x_k\) is the vector of goods gained at time \(k\), \(\vec \theta_k\) is the preference vector at timestep \(k\), and \(\gamma_k\) is the time discount factor at time \(k\). (We will often use the case of an exponential discount \(\gamma^k\) for \(0<\gamma<1\).)
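As a quick sanity check on the definition, here is a minimal sketch of computing \(U_n\) for the exponential-discount case; the function name and the trajectory numbers are illustrative, not from the text.

```python
import numpy as np

# U_n = sum_{k>=n} gamma^k (x_k . theta_k), with exponential discount gamma^k.
def discounted_utility(xs, thetas, gamma, n=0):
    """xs[k], thetas[k]: vectors of goods and preferences at timestep k."""
    return sum(gamma**k * np.dot(xs[k], thetas[k]) for k in range(n, len(xs)))

# Two timesteps, two goods: gain one unit of good 0, then two units of good 1,
# with a fixed preference vector.
xs = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
thetas = [np.array([1.0, 0.5]), np.array([1.0, 0.5])]
print(discounted_utility(xs, thetas, gamma=0.9))  # 1*1 + 0.9*(2*0.5) = 1.9
```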
However, we might instead maximize the value of tomorrow’s actions in light of today’s reward function,
\begin{eqnarray} V_n = \sum_{k\geq n} \gamma_k\vec{x}_k\cdot\vec{\theta}_{n} \end{eqnarray}
(the only difference being \(\vec \theta_n\) rather than \(\vec \theta_k\)). Genuinely maximizing this should lead to more stable goals. Concretely, we can consider environments that offer “bribes” to self-modify: a learner maximizing \(U_n\) would generally accept such bribes, while a learner maximizing \(V_n\) would be cautious about doing so.
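A toy bribe scenario makes the divergence concrete. The numbers below are illustrative assumptions: accepting the bribe rewrites \(\vec\theta\) from \([1,0]\) to \([0,1]\) and the environment then delivers goods \([0,10]\) each step, while refusing keeps \(\vec\theta=[1,0]\) and delivers \([1,0]\) each step.

```python
import numpy as np

gamma, horizon = 0.9, 50
theta_now = np.array([1.0, 0.0])  # the current preference vector theta_n

def returns(x, theta_future):
    """U scores the goods with the future theta_k; V with today's theta_n."""
    disc = sum(gamma**k for k in range(horizon))
    U = disc * np.dot(x, theta_future)
    V = disc * np.dot(x, theta_now)
    return U, V

U_accept, V_accept = returns(np.array([0.0, 10.0]), np.array([0.0, 1.0]))
U_refuse, V_refuse = returns(np.array([1.0, 0.0]), theta_now)

print(U_accept > U_refuse)  # True: the U_n-maximizer takes the bribe
print(V_refuse > V_accept)  # True: the V_n-maximizer declines it
```

The bribe looks great to the \(U_n\)-maximizer because its future selves will endorse the new goods; the \(V_n\)-maximizer scores those same goods with today's \(\vec\theta_n\) and finds them worthless.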
So what do we see when we adapt existing RL algorithms to such problems? There's then a distinction between Q-learning and SARSA, where Q-learning foolishly accepts bribes that SARSA passes on, and this seems to be the flip side of the concept of interruptibility!
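The asymmetry is visible in the standard tabular update targets themselves: Q-learning backs up the *best* next-state action value, SARSA backs up the value of the action *actually taken*. The Q-values below are hypothetical, with action 1 standing in for a high-apparent-value "bribe".

```python
# Standard tabular update targets, side by side.
def q_learning_target(r, gamma, q_next):
    # Off-policy: evaluates the next state by its best action,
    # even if the current policy would not take that action.
    return r + gamma * max(q_next.values())

def sarsa_target(r, gamma, q_next, a_next):
    # On-policy: evaluates the next state by the action actually taken.
    return r + gamma * q_next[a_next]

# Hypothetical next-state values: action 1 is the "bribe".
q_next = {0: 1.0, 1: 5.0}
print(q_learning_target(0.0, 0.9, q_next))       # 4.5: backs up the bribe
print(sarsa_target(0.0, 0.9, q_next, a_next=0))  # 0.9: backs up the action taken
```

Because Q-learning's target always takes the max, it treats the bribe's apparent value as real when propagating values backward, whereas SARSA's estimate reflects what its policy actually does.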
