A putative new idea for AI control; index here.
I’m calling “goal completion” the idea of giving an AI a partial goal, and having the AI infer the missing parts of the goal, based on observing human behaviour. Here is an initial model to test some of these ideas on.
The linear rocket
On an infinite linear grid, an AI needs to drive someone in a rocket to the space station. Its only available actions are to accelerate by -3, -2, -1, 0, 1, 2, or 3, with negative acceleration meaning accelerating in the left direction, and positive in the right direction. All accelerations are applied immediately at the end of the turn (the unit of acceleration is in squares per turn per turn), and there is no friction. There is one end state: reaching the space station with zero velocity.
The AI is told this end state, and is also given the reward function of needing to get to the station as fast as possible. This is encoded by giving it a reward of -1 each turn.
What is the true reward function for the model? Well, it turns out that an acceleration of -3 or 3 kills the passenger. This is encoded by adding another variable to the state, “PA”, denoting “Passenger Alive”. There are also some dice in the rocket’s windshield. If the rocket goes by the space station without having velocity zero, the dice will fly off; the variable “DA” denotes “dice attached”.
Furthermore, accelerations of -2 and 2 are uncomfortable to the passenger. But, crucially, there is no variable denoting this discomfort.
Therefore the full state space is a quadruplet (POS, VEL, PA, DA) where POS is an integer denoting position, VEL is an integer denoting velocity, and PA and DA are booleans defined as above. The space station is placed at point S < 250,000, and the rocket starts with POS=VEL=0, PA=DA=1. The transitions are deterministic and Markov; if ACC is the acceleration chosen by the agent,
((POS, VEL, PA, DA), ACC) → (POS+VEL, VEL+ACC, PA=0 if |ACC|=3, DA=0 if POS+VEL>S).
The true reward at each step is given by -1, plus -10 if PA=1 (the passenger is alive) and |ACC|=2 (the acceleration is uncomfortable), plus -1000 if PA was 1 (the passenger was alive the previous turn) and changed to PA=0 (the passenger is now dead).
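The model above is small enough to write down directly. What follows is a minimal sketch, not a reference implementation: the names `State`, `step` and `true_reward` are made up for illustration, and it assumes that PA and DA stay at 0 once they have dropped to 0, and that the three reward terms add up (as the cost calculations in the appendix suggest).

```python
from typing import NamedTuple

ACTIONS = (-3, -2, -1, 0, 1, 2, 3)


class State(NamedTuple):
    pos: int   # POS: position on the grid
    vel: int   # VEL: velocity in squares per turn
    pa: bool   # PA: passenger alive
    da: bool   # DA: dice attached


def step(state: State, acc: int, station: int) -> State:
    """Deterministic transition: move by the current velocity, then apply acc."""
    if acc not in ACTIONS:
        raise ValueError(f"illegal acceleration: {acc}")
    new_pos = state.pos + state.vel
    new_vel = state.vel + acc
    # Assumption: PA and DA never recover once lost.
    pa = state.pa and abs(acc) != 3
    da = state.da and not (new_pos > station)
    return State(new_pos, new_vel, pa, da)


def true_reward(state: State, acc: int, next_state: State) -> int:
    """True per-step reward: -1, plus discomfort and death penalties."""
    reward = -1
    if state.pa and abs(acc) == 2:
        reward -= 10      # uncomfortable acceleration, passenger alive
    if state.pa and not next_state.pa:
        reward -= 1000    # the passenger died this turn
    return reward
```

The reward that the AI is actually told about is just the constant -1 per turn; everything else in `true_reward` is what it would have to infer.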
To complement the stated reward function, the AI is also given sample trajectories of humans performing the task. In this case, the ideal behaviour is easy to compute: the rocket should accelerate by +1 for the first half of the time, by -1 for the second half, and spend a maximum of two extra turns without acceleration (see the appendix of this post for a proof of this). This will get it to its destination in at most 2(1+√S) turns.
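As a quick sanity check on that bound, here is a small sketch of the turn count, using the case split from the appendix (2n, 2n+1 or 2n+2 turns, where n is the integer square root of S). The function name `min_turns` is only illustrative.

```python
import math


def min_turns(S: int) -> int:
    """Turns needed to reach S at rest using accelerations of +1, 0 and -1."""
    n = math.isqrt(S)        # largest n with n^2 <= S
    extra = S - n * n        # 0 <= extra <= 2n
    if extra == 0:
        return 2 * n         # no pause needed
    if extra <= n:
        return 2 * n + 1     # one pause
    return 2 * n + 2         # two pauses
```

Since n ≤ √S, the worst case 2n+2 is never more than 2(1+√S).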
Goal completion
So, the AI has been given the full transition function, and has been told the reward of R=-1 in all states except the final state. Can it infer the rest of the reward from the sample trajectories? Note that there are two variables in the model, PA and DA, that are unvarying in all sample trajectories. One, PA, has a huge impact on the reward, while DA is irrelevant. Can the AI tell the difference?
Also, one key component of the reward, the discomfort of the passenger for accelerations of -2 and 2, is not encoded in the state space of the model, purely in the (unknown) reward function. Can the AI deduce this fact?
Note that the AI can’t expect to deduce the full algorithm, just as much as is consistent with the behaviour it observed: for instance, any discomfort penalty of -2 or below will give the same behaviour.
I’ll be working on algorithms to efficiently compute these facts (though do let me know if you have a reference to anyone who’s already done this before; that would make it so much quicker).
For the moment we’re ignoring a lot of subtleties (such as bias and error on the part of the human expert), and these will be gradually included as the algorithm develops. One thought is to find a way of including negative examples, specific “don’t do this” trajectories. These need to be interpreted with care, because a positive trajectory implicitly gives you a lot of negative trajectories: namely, all the choices that could have gone differently along the way. So a negative trajectory must be drawing attention to something we don’t like (most likely the killing of a human). But, typically, the negative trajectories won’t be maximally bad (such as shooting off at maximum speed in the wrong direction), so we’ll have to find a way to encode what we hope the AI learns from a negative trajectory.
To work!
Appendix: Proof of ideal trajectories
Let n be the largest integer such that n^2 ≤ S. Since S ≤ (n+1)^2 - 1 by the choice of n, S - n^2 ≤ (n+1)^2 - 1 - n^2 = 2n. Then let the rocket accelerate by +1 for n turns, then accelerate by -1 for n turns. It will travel a distance of 0+1+2+ … +(n-1)+n+(n-1)+ … +3+2+1. This sum is n plus twice the sum from 1 to n-1, i.e. n+n(n-1)=n^2.
By pausing one turn without acceleration during its trajectory, it can add any m to the distance, where 0≤m≤n (pausing while at velocity m adds exactly m). By doing this twice, it can add any m’ to the distance, where 0≤m’≤2n. By the bound above, S=n^2+m’ for such an m’. Therefore the rocket can reach S (with zero velocity) in 2n turns if S=n^2, in 2n+1 turns if n^2 < S ≤ n^2+n, and in 2n+2 turns if n^2+n+1 ≤ S ≤ n^2+2n.
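The construction can be spelled out as an explicit action list. This is a sketch under the conventions of the model (the rocket moves by its current velocity, then the acceleration is applied); `ideal_actions` and `simulate` are hypothetical helper names, not part of the original setup.

```python
import math


def ideal_actions(S: int) -> list[int]:
    """Accelerate by +1 n times, coast at most twice, then accelerate by -1 n times."""
    n = math.isqrt(S)
    extra = S - n * n                 # distance still missing, 0 <= extra <= 2n
    if extra == 0:
        pauses = []
    elif extra <= n:
        pauses = [extra]              # one pause at velocity `extra`
    else:
        pauses = [extra - n, n]       # two pauses, adding `extra` in total
    actions = []
    for v in range(1, n + 1):
        actions.append(1)             # velocity is now v
        # Coast one turn at velocity v for each pause scheduled there.
        actions.extend(0 for p in pauses if p == v)
    actions.extend([-1] * n)
    return actions


def simulate(actions: list[int]) -> tuple[int, int]:
    """Return the final (POS, VEL) when starting from POS=VEL=0."""
    pos = vel = 0
    for acc in actions:
        pos, vel = pos + vel, vel + acc
    return pos, vel
```

For example, `ideal_actions(7)` gives `[1, 0, 1, 0, -1, -1]`, which moves 0+1+2+2+1+1 = 7 squares in 2n+2 = 6 turns and ends at rest.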
Since the rocket is accelerating all but two turns of this trajectory, it’s clear that it’s impossible to reach S (with zero velocity) in less time than this, with accelerations of +1 and -1. Since it takes 2(n+1)=2n+2 turns to reach (n+1)^2, an immediate consequence of this is that the number of turns taken to reach S, is increasing in the value of S (though not strictly increasing).
Next, we can note that since S<250,000=500^2, the rocket will always reach S within 1000 turns at most, for a “reward” of -1000 or better. An acceleration of +3 or -3 costs -1000 because of the death of the human, and an extra -1 because of the turn taken, so these accelerations are never optimal (note that this result is not sharp; accelerations of +3 only become optimal for a much higher S). Also note that for huge S, continual accelerations of 3 and -3 are obviously the correct solution, so even our “true reward function” didn’t fully encode what we really wanted.
Now we need to show that accelerations of +2 and -2 are never optimal. To do so, imagine we had an optimal trajectory with ±2 accelerations, and replace each +2 with two +1s, and each -2 with two -1s. This trip will take longer (since we have more turns of acceleration), but will go further (since two accelerations of +1 cover a greater distance than one acceleration of +2). Since the number of turns taken to reach S with ±1 accelerations is increasing in S, we can replace this further trip with a shorter one reaching S exactly. Note that all these steps decrease the cost of the trip: shortening the trip certainly does, and replacing an acceleration of +2 (total cost: -10-1=-11) with two accelerations of +1 (total cost: -1-1=-2) also does. Therefore, the new trajectory has no ±2 accelerations, and has a lower cost, contradicting our initial assumption.
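For small S the conclusion can also be checked by brute force. The sketch below runs a shortest-path search (Dijkstra) over (POS, VEL, PA) with all seven accelerations allowed and the true reward as cost; DA is left out because it does not affect the reward. The function name `best_return` and its `discomfort` and `death` parameters are illustrative, and the search is pruned using the fact from above that a ±1 trajectory costing at most 2n+2 always exists.

```python
import heapq
import math


def best_return(S: int, discomfort: int = -10, death: int = -1000):
    """Best total true reward for reaching S at rest, found by exhaustive search."""
    budget = 2 * math.isqrt(S) + 2      # a +1/-1 trajectory never costs more
    start = (0, 0, True)                # (POS, VEL, PA)
    best = {start: 0}
    heap = [(0, start)]
    while heap:
        cost, state = heapq.heappop(heap)
        if cost > best[state]:
            continue                    # stale heap entry
        pos, vel, alive = state
        if pos == S and vel == 0:
            return -cost                # reward is minus the accumulated cost
        for acc in range(-3, 4):
            step_cost = 1
            if alive and abs(acc) == 2:
                step_cost -= discomfort
            if alive and abs(acc) == 3:
                step_cost -= death
            new_cost = cost + step_cost
            if new_cost > budget:
                continue                # cannot beat the known +1/-1 trajectory
            nxt = (pos + vel, vel + acc, alive and abs(acc) != 3)
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt))
    return None
```

With the default penalties, and also with `discomfort=-2`, the best return found for small S is exactly minus the number of turns of the ±1 trajectory, so no ±2 or ±3 acceleration appears in an optimal trip. This is only a check on small cases, not a substitute for the argument above.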
