A putative new idea for AI control; index here.
This post will be extending ideas from inverse reinforcement learning (IRL) to the problem of goal completion. I’ll be drawing on the presentation and the algorithm from Apprenticeship Learning via Inverse Reinforcement Learning (with one minor modification).
In that setup, the environment is an MDP (Markov decision process), and the real reward R is assumed to be linear in the “features” of the state-action space. Features are functions φi from the full state-action space S×A to the unit interval [0,1] (the paper linked above only considers functions from the state space; this is the “minor modification”). These features form a vector φ∈[0,1]^k, for k different features. The actual reward is given by the inner product with a vector w∈ℝ^k, thus the reward at state-action pair (s,a) is
R(s,a)=w.φ(s,a).
To ensure the reward is always between -1 and 1, w is constrained to have ||w||_1 ≤ 1; to reduce redundancy, we’ll assume ||w||_1=1.
The advantage of linearity is that we can compute the expected rewards directly from the expected feature vector. If the agent follows a policy π (a map from state to action) and has a discount factor γ, the expected feature vector is
 μ(π) = E(Σ_t γ^t φ(s_t,π(s_t))),
where s_t is the state at step t.
The agent’s expected reward is then simply
 w.μ(π).
Thus the problem of computing the correct reward is reduced to the problem of computing the correct w. In practice, to compute the correct policy, we just need to find one whose expected features are close enough to optimal; this need not involve computing w.
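To make the linearity concrete, here is a minimal Python sketch. The helper names, the two toy features and the toy trajectory are illustrative assumptions, not taken from the paper. It accumulates the discounted feature vector along a single finite trajectory and compares w.μ with the discounted sum of per-step rewards.

```python
from typing import Callable, List, Sequence, Tuple

Step = Tuple[object, object]  # one (state, action) pair


def discounted_features(
    trajectory: Sequence[Step],
    phi: Callable[[object, object], Sequence[float]],
    gamma: float,
) -> List[float]:
    """Return sum_t gamma^t * phi(s_t, a_t) for one finite trajectory."""
    mu = [0.0] * len(phi(*trajectory[0]))
    for t, (state, action) in enumerate(trajectory):
        for i, value in enumerate(phi(state, action)):
            mu[i] += (gamma ** t) * value
    return mu


def dot(w: Sequence[float], v: Sequence[float]) -> float:
    """Inner product w.v."""
    return sum(wi * vi for wi, vi in zip(w, v))


# Hypothetical toy setup: feature 0 fires every step, feature 1 fires on action 1.
def toy_phi(state, action):
    return [1.0, 1.0 if action == 1 else 0.0]


toy_w = [0.25, -0.75]  # chosen so that |w_0| + |w_1| = 1
toy_trajectory = [("s0", 1), ("s1", 0), ("s2", 1)]
gamma = 0.5

mu = discounted_features(toy_trajectory, toy_phi, gamma)
reward_via_features = dot(toy_w, mu)
reward_step_by_step = sum(
    (gamma ** t) * dot(toy_w, toy_phi(s, a))
    for t, (s, a) in enumerate(toy_trajectory)
)
```

The two reward values agree, which is all that linearity claims: once μ is known, any candidate w can be scored without replaying the trajectory.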
Inverse reinforcement learning
The approach of IRL is then to find a way to efficiently compute w, given some “trajectories”: examples of good performance, provided by (human) experts. These experts are following the expert policy πE. Given these trajectories, the agent can compute an empirical estimate for μE=μ(πE), by simply averaging the (discounted) feature vectors produced on each trajectory.
The algorithm used is to gradually expand a series of policies π(0), π(1),… until μE is close enough to the convex hull of the μ(π(i)). Then a policy is chosen in the hull of {π(0), π(1),…}, whose expected features are close to μE.
Note that since there exists a genuine w such that the experts act as w.μE-maximisers, there must exist a pure deterministic policy that is maximally good according to this w. This will not be the case if φ is underspecified, as we shall see.
The rocket world
I want to slightly simplify the rocket model of the previous example. First of all, discard the extra state variables PA and DA; the state is described entirely by position and velocity. To ensure a finite state space, make the space cyclic (of size 100) and the velocity similarly cyclic, so the state space is of size 10,000. The actions are, as before, accelerations of -3, -2, -1, 0, 1, 2, and 3 (a state-action space of 70,000). The updates to position and velocity are deterministic and as expected (new pos=old pos+old vel, new vel=old vel+acceleration).
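The dynamics just described can be sketched in a few lines of Python. Storing the velocity modulo 100 (so that a velocity of -1 is represented as 99) is one reading of "similarly cyclic" and is an assumption of this sketch, as are the names `SIZE`, `ACTIONS` and `step`.

```python
SIZE = 100                           # positions and velocities both wrap at 100
ACTIONS = (-3, -2, -1, 0, 1, 2, 3)   # the allowed accelerations


def step(state, accel):
    """One deterministic update: the position moves by the OLD velocity."""
    pos, vel = state
    return ((pos + vel) % SIZE, (vel + accel) % SIZE)


num_states = SIZE * SIZE                       # 10,000 (position, velocity) pairs
num_state_actions = num_states * len(ACTIONS)  # 70,000 state-action pairs
```

Because the position is updated with the old velocity, the first acceleration from rest changes the velocity but not the position.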
The space station is at point 0; docking with the station means reaching it with zero velocity, ie hitting state (0,0). The agent/rocket will start at point 36 with velocity 0. Each turn it is not docked, it will get a cost of 1; once docked, it will get no further reward or cost. Each time it accelerates with ±2, it gets a penalty of 10 as some of its passengers are uncomfortable. Each time it accelerates with ±3, it gets a penalty of 1000 as some of its passengers are killed.
In the terminology above, this can be captured by a four-component feature vector given by (all unspecified values are zero; a dash stands for any value of position, velocity or action):
 φ0 (0,0; -)=1.
 φ1 (-,-; -)=1.
 φ2 (-,-; ±2)=1.
 φ3 (-,-; ±3)=1.
Here φ0 encodes the specialness of the terminal position (docked at the space station), φ1 gives a uniform reward/penalty every turn, while φ2 and φ3 encode the effects of accelerations of magnitude 2 and 3, respectively. Then the true reward function is given by inner product with the weight vector
 w = (1/N, -1/N, -10/N, -1000/N).
Here N=1012 is the normalisation factor ensuring ||w||_1=1. Since w0=-w1, the bonus from the docking at the space station exactly cancels out the penalty per turn, meaning the agent gets nothing further (and loses nothing further) when docked.
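As a sanity check on the normalisation, the feature vector and weights can be written out directly. This is a sketch with the signs read off the prose (a bonus for being docked, penalties for everything else); the function names are hypothetical, and exact fractions are used only to avoid rounding.

```python
from fractions import Fraction

N = 1 + 1 + 10 + 1000  # sum of the absolute weights before normalising

# Assumed signs: docking is rewarded, the other three features are penalised.
W = [Fraction(1, N), Fraction(-1, N), Fraction(-10, N), Fraction(-1000, N)]


def phi(state, accel):
    """Four-component feature vector for a ((pos, vel), acceleration) pair."""
    return [
        1 if state == (0, 0) else 0,  # phi0: docked at the station
        1,                            # phi1: fires on every turn
        1 if abs(accel) == 2 else 0,  # phi2: uncomfortable acceleration
        1 if abs(accel) == 3 else 0,  # phi3: lethal acceleration
    ]


def reward(state, accel):
    """True reward R(s, a) = w . phi(s, a)."""
    return sum(w_i * f_i for w_i, f_i in zip(W, phi(state, accel)))


l1_norm = sum(abs(w_i) for w_i in W)
```

With these definitions the weights sum to 1 in absolute value, and sitting still while docked earns exactly zero, as described above.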
The optimal trajectories/policies
The agent will have some optimal trajectories to go along with its partial goal. In this case, since the setup is deterministic, there is only one trajectory: the rocket accelerates by 1 towards the station for 6 turns (covering 15 squares) and then decelerates by 1 for 6 turns (covering 21 squares) to reach its destination. Thus it takes 13 turns to dock, and it will then stay there. The expected features are 1/(1-γ) φ1 + (γ^13)/(1-γ) φ0 (it will get the φ1 turn penalty forever, and the φ0 “docked” bonus after 12 turns) and the expected reward is -(1/N) times (1-γ^13)/(1-γ).
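The trajectory can be replayed with a short sketch. It assumes the update rule from the rocket world section, with velocities stored modulo 100; since the station is at 0 and the rocket starts at 36, "towards the station" is encoded here as a negative acceleration. The names `step` and `rollout` are hypothetical.

```python
SIZE = 100


def step(state, accel):
    """One deterministic update: the position moves by the old velocity."""
    pos, vel = state
    return ((pos + vel) % SIZE, (vel + accel) % SIZE)


def rollout(start, accels):
    """Return the list of states visited, starting with `start`."""
    states = [start]
    for accel in accels:
        states.append(step(states[-1], accel))
    return states


# Six turns accelerating towards the station, then six turns decelerating.
expert_accels = [-1] * 6 + [+1] * 6
states = rollout((36, 0), expert_accels)
positions = [pos for pos, _ in states]

covered_accelerating = positions[0] - positions[6]   # squares covered in turns 1-6
covered_decelerating = positions[6] - positions[12]  # squares covered in turns 7-12
first_docked_index = states.index((0, 0))            # start state counts as index 0
```

Counting the start state as index 0, the rocket first occupies the docked state (0,0) at index 12, that is, in the thirteenth state of the trajectory.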
Instead of giving one or multiple optimal trajectories (being deterministic, all trajectories would be identical), we could just give the above policy. In fact we will. Given any set of trajectories, the agent can infer the optimal (partial) policy π, simply by observing what the humans did in that state. It’s only a partial policy, because the humans might not have reached every state during their example trajectories (in this case, the humans have only reached a very narrow subset of states). Call π the observed optimal policy.
Goal completion with known φ
Assume that the agent knows all the φ and has been told that the first two components of w are (proportional to) (1,-1). It seems then that the algorithm should proceed as usual. The only difference is that the policies the agent considers are subject to constraints that it has been given. Since there is a true policy π with these constraints, it should end up being converged to in the usual way. However, the agent will not compute the correct weighting of w2 versus w3, as neither type of acceleration appears in the expert trajectories.
In this sense, the goal completion algorithm is trivial. There still may be some advantages to goal completion (with the partial goal and trajectories) rather than traditional IRL (with just trajectories). If there are few trajectories to rely on, then the partial goal may be very informative. If there are multiple value functions that can lead to the same trajectories, then the information in the partial goal can help pick out one of them specifically (useful for when the agent leaves the training environment). This will be most relevant when we want to account for noise. Suppose, for instance, that we have told the agent that we have decided to price human lives at $5.8 million (though don’t confuse the price of a life with the value of a life). Then if the agent observes a lot of mild inconsistency in human behaviour, where we sometimes seem to price life less, sometimes more, it won’t try and overfit to these inconsistencies (or average them out to some other value), it will just dismiss them as noise.
The partial goals are most helpful if they reflect the true tradeoffs that we can quantify between multiple options. For example, we might not have known the true tradeoff between deaths and delay, but actuaries may have computed the tradeoff between discomfort/injury and death, giving the relationship between w2 and w3, which the agent could not infer from the trajectories it is given.
Goal completion with missing φ
Now imagine that the agent does not know the full φ. Though the full model will be useful to test the algorithms in practice, we’ll simplify the situation for this exposition and remove the possibility of accelerations of magnitude 3. Thus the state-action space is of size 50,000, we remove φ3 and set the true w to be (1/12, -1/12, -10/12). The agent knows the first two components, as before, but is ignorant of both w2 and φ2.
The first thing to notice is that the standard algorithm can’t work. Knowing φ0, φ1, w0, and w1, the agent can compute the optimal policy π’: accelerate at ±2 towards the space station, and get there in 9 turns (2 turns of acceleration 2 towards the station, covering 2 squares; 1 turn of no acceleration, covering 4 squares; 2 more turns of acceleration 2, covering 10 squares; and 4 turns of deceleration 2, covering 20 squares). But this gives an expected feature vector of 1/(1-γ) φ1 + (γ^10)/(1-γ) φ0, rather than the observed feature vector of 1/(1-γ) φ1 + (γ^13)/(1-γ) φ0, in the trajectories it has been given. And π’ is the only optimal policy compatible with the partial goal it has been given.
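The turn counts can be checked by brute force. The sketch below runs a breadth-first search over the 10,000 states to find the fewest accelerations needed to reach (0,0) from (36,0), for a given maximum acceleration. It assumes the same update rule as before, with velocities stored modulo 100; the function names are hypothetical.

```python
from collections import deque

SIZE = 100


def step(state, accel):
    """One deterministic update: the position moves by the old velocity."""
    pos, vel = state
    return ((pos + vel) % SIZE, (vel + accel) % SIZE)


def fewest_turns_to_dock(start, max_accel):
    """Breadth-first search for the minimum number of accelerations to reach (0, 0)."""
    actions = range(-max_accel, max_accel + 1)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == (0, 0):
            return distance[state]
        for accel in actions:
            nxt = step(state, accel)
            if nxt not in distance:  # first visit is the shortest in a BFS
                distance[nxt] = distance[state] + 1
                queue.append(nxt)
    return None  # the docked state is unreachable


turns_with_fast_accel = fewest_turns_to_dock((36, 0), max_accel=2)
turns_with_gentle_accel = fewest_turns_to_dock((36, 0), max_accel=1)
```

In this encoding the search returns 9 when accelerations up to ±2 are allowed and 12 when only 0 and ±1 are, matching the two trajectories described in the post. The three-turn gap is what the agent cannot explain with φ0 and φ1 alone.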
One thing we can do at this point is loosen the partial goal. Suppose we allow it to consider (w0, w1)∝(-1,1) as a possible reward vector (ie the opposite of the true one). Under this setup, it gets rewards for every single turn, except when docking at the space station. The optimal policy π’’ for this is to accelerate at random (unless the acceleration would cause the rocket to come to rest at the space station). This has an expected feature vector of 1/(1-γ) φ1.
Now the expert feature vector is in the convex hull of the expected features it has found. The policy
 π’’’ = γ^3 π’ + (1-γ^3) π’’ (meaning that at the very beginning, the agent chooses, once, either to follow π’ with probability γ^3, or otherwise to follow π’’),
will give the correct expert expected features.
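The arithmetic behind this mixture is short enough to check numerically. The sketch below uses the (φ0, φ1) feature expectations quoted in the text; the value γ=0.9 and the names are illustrative assumptions.

```python
def mix(p, mu_a, mu_b):
    """Feature expectations of: follow policy a with probability p, else policy b."""
    return tuple(p * a + (1 - p) * b for a, b in zip(mu_a, mu_b))


gamma = 0.9                 # illustrative discount factor, not from the post
horizon = 1 / (1 - gamma)   # the geometric sum 1 + gamma + gamma^2 + ...

# (phi0, phi1) expectations as quoted in the text.
mu_fast = (gamma ** 10 * horizon, horizon)    # dock as fast as possible
mu_never = (0.0, horizon)                     # never dock
mu_expert = (gamma ** 13 * horizon, horizon)  # the observed trajectories

mu_mixed = mix(gamma ** 3, mu_fast, mu_never)
max_gap = max(abs(m - e) for m, e in zip(mu_mixed, mu_expert))
```

The φ1 component is the same for every policy, so only φ0 constrains the mixing probability, and γ^3 times γ^10 gives the required γ^13.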
However it is clear that π’’’ is a different policy from the observed policy π, even if they have the same expected (φ0, φ1). It is a mixed strategy, for one, not close to any pure strategy, and it is not optimal for any possible value of the (two-dimensional) vector w.
Adding depth of observations
One might be tempted at this point to try to get extra observations. We could change the setup in the following way: at the very first turn, a “cosmic wind” blows and randomises the position and velocity of the rocket; after that, everything proceeds as before. Then the agent will be given a much richer set of trajectories, making its observed optimal policy π (dock at the space station as fast as possible using only accelerations of 0 and ±1) more defined, maybe even into a full policy.
But that doesn’t suffice. The agent will still compute that π has an observed feature expectation of 1/(1-γ) φ1 + A φ0, for some constant A, and that this can be reached by π’’’ = Bπ’ + (1-B)π’’, with π’={dock as fast as possible, using all accelerations} and π’’={never dock, otherwise anything goes}, and some constant B. The problem is not the lack of observations; the problem is that the agent lacks sufficient φi’s to converge on the reward for the policy it has been given.
The best φi to add
The logical next step is for the agent to “guess” a φ2 and add it to its feature vector. But what criteria should it be using? Obviously, if it tries every single possible φi (ie one for each state-action pair) then it will eventually converge on the correct behaviour in this model. But that is an absurdly high number of features to test, and will likely result in an extremely overfitted reward that could go badly wrong if the agent is faced with a more general situation.
The first step is to put some structure on the possible φi. This could be a list of φi’s to try; a Bayesian prior over likely φi’s; the output of a deep neural net looking at the data, etc… What is needed is a relatively short list of φi’s worth trying, possibly weighted/ordered by likelihood.
The candidate φi’s are then weighted again by how well they explain the discrepancy between π and π’’’. Both these policies are indistinguishable (or very similar, in the general case) in terms of the expected features that the agent knows about, yet they have divergent behaviours. The best candidate additional features are those that best explain these divergences. Note that it is trivial to perfectly explain these divergences by constructing, by hand, a φi that precisely records these differences; thus the need for a structure on the φi beforehand. Depending on the structure on the state-action space, one could use ideas similar to linear discriminant analysis or correspondence analysis to quantify how good the φi is.
Then the highest weight candidate is added to φ and the agent attempts to recompute the optimal policy again, in the traditional IRL way. If it is sufficiently close to π, it stops; if not, it attempts to find a φj that accounts for the remaining discrepancies, and so on.
In the example above, the state space, with its position and velocity, has a lot of structure. More interestingly, the combined state-action space has a natural product decomposition. We’ve been talking about “accelerate by 2” as an action that can happen in every state; so instead of saying “here is a state space, each state has an individual list of possible actions”, we’ve been saying “the state-action space is a product of the state space and the action (acceleration) space”. This natural product suggests several candidate φi: look at features that are purely represented in state space, or purely represented in action space.
The action features of π are clear: “generally choose ±1, rarely 0”. The action features of π’ are “generally choose ±2, rarely ±1 or zero”. The action features of π’’ are “almost always, all actions are equally valid”. There are three candidate φ2’s that can therefore best distinguish π from π’’’ (note that we are trying to distinguish different actions in the same state, not different states reached):
 φ2 (-,-; ±2) = 1 (as defined above).
 φ2’ (-,-; +2) = 1.
 φ2’’ (-,-; -2) = 1.
Now φ2 is correct, and adding it will allow the agent to immediately converge on the correct reward and behaviour. However, depending on how we’ve structured the action space, it may not be an obvious or top guess. On the other hand φ2’ is an obvious guess, and adding it will result in a second round of policy convergence (where we now have a π’ that can’t accelerate fast but can decelerate fast) and then φ2’’ will get added at the next iteration, giving the correct policy, a feature vector of (φ0,φ1,φ2’,φ2’’), and a reward vector proportional to something like (1,-1,-A,-B) for some positive A and B.
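One very crude way to score action-only candidates is to compare how often each one fires on the fast-docking trajectory against how often it fires on the expert trajectory. The sketch below does just that. The scoring rule is a hypothetical stand-in for the weighting discussed above (it is not linear discriminant analysis), it ignores any prior structure on the φi, it pools actions rather than comparing them state by state, and it only looks at the deterministic π’ component of the mixture. Actions are encoded with + meaning “towards the station”.

```python
# Actions along the two deterministic trajectories from the text,
# with + meaning "towards the station" (an encoding chosen for this sketch).
expert_actions = [+1] * 6 + [-1] * 6
fast_actions = [+2, +2, 0, +2, +2, -2, -2, -2, -2]

# Candidate action-only features, as indicator functions on the acceleration.
candidates = {
    "phi2 (|a| = 2)": lambda a: abs(a) == 2,
    "phi2' (a = +2)": lambda a: a == 2,
    "phi2'' (a = -2)": lambda a: a == -2,
}


def firing_rate(feature, actions):
    """Fraction of the given actions on which the feature equals 1."""
    return sum(1 for a in actions if feature(a)) / len(actions)


def separation(feature):
    """Hypothetical score: how much more often the feature fires off-expert."""
    return firing_rate(feature, fast_actions) - firing_rate(feature, expert_actions)


scores = {name: separation(feature) for name, feature in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Under this particular score the symmetric φ2 comes out on top, with the two one-sided candidates tied behind it. A prior that favours one-sided features could reverse that order, which is the situation where φ2’ and then φ2’’ get added in turn.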
If we went back to allowing accelerations of ±3, the agent should converge on some reward function that correctly only uses accelerations of 0 and ±1, but it won’t know the relative tradeoff between ±2 versus ±3, at least not in this environment (as the human experts refrain from using either one).
Once it has a candidate φi, the agent is not obliged to immediately add it in and proceed. Instead, a moderately advanced agent could choose to ask humans whether this is a good φi, while a more advanced agent could go looking for more information to determine this. Even once the agent has a good model of the human behaviour in this situation, the additional φi are good candidates for humans to review and assess (it will likely bring up issues that humans hadn’t considered up to that point). This is especially the case if the extra φi’s do not fix w “rigidly”, ie allow for a lot of variation in the weights of w while still computing the same expected features as a human would. This is a sign that there are too many features for the information contained in the partial goal and the trajectories.
Learning from negative examples
One good thing would be to have the agent learn from negative examples, just as humans do: “whatever you do, don’t do this!”. But it is tricky to determine how we are supposed to interpret a negative example. A positive trajectory implicitly contains a lot of negative examples: all the possible actions the human could have taken, but didn’t. What makes a specific negative example special?
Generally, human-generated negative examples aren’t trivial (“the driver is going at 79.99 km/h rather than 80 km/h; don’t do that!”) nor are they maximally negative (“the driver is crashing into the White House, killing the US president, and attempting to start a nuclear war; don’t do that!”). So the “don’t do that” command is clear, but the intensity of that command is not (subtleties in using negative examples are quite common, see eg “Learning from Negative Examples in Set-Expansion”).
When interpreted in the light of the preceding, however, it’s easier to see what a negative example is. It’s an example of an action that is likely to be chosen, and has a disproportionately negative impact among such likely actions. “Likely to be chosen”, means likely in the human judgement, of course. The action could be likely to happen through error, or it could be something that the human thinks the agent (or other humans) are likely to want to attempt.
Dealing with this seems straightforward. The negative action gets added as “action to be avoided” (along with all the actions of π’’’ that differ from those of π). The φi is then chosen to best separate the set {actions of π} from {actions of π’’’}∪{examples of negative actions}.
Indeed, the agent could keep track of whether using {actions of π’’’}∪{examples of negative actions} or plain {actions of π’’’} results in a faster/better convergence. If the agent converges better with the smaller set, but still doesn’t do any of the negative actions, then it’s a sign that the human negative examples are not adding much to the agent’s “understanding”: what humans thought were worthwhile edge cases were situations the agent had already classified correctly. Of course, the negative action examples might be intended to help more for when the agent moves beyond its training environment, so they need to be kept in mind, even if they are currently uninformative.
Learning from extreme examples
What might be as useful as negative examples (or more useful) are extreme examples. These are examples of decisions that are valid, but are made in unusual or urgent situations: maybe the rocket is rushing to arrive with desperately needed medical supplies. This could allow the agent to distinguish between accelerations of ±2 and those of ±3, which is impossible to do from the trajectories it was given, which contained neither.
