Improbable Oversight, An Attempt at Informed Oversight link by William Saunders 971 days ago | Jessica Taylor and Patrick LaVictoire like this | 8 comments

 by Jessica Taylor 969 days ago | Patrick LaVictoire likes this | link This seems like a reasonable approach. My main worry is that, if the probability of sampling good advice is very low (less than 1 in 1 billion), then good advice won’t appear at all in the training process, so training would produce exactly the same agent as if the advice were never reported. This is essentially the “high variance” problem that you point out. In the case of plagiarism, it’s not clear how to come up with a proposal distribution assigning non-negligible probability to good advice (i.e. a list of references to plagiarized works). reply
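To make the 1-in-a-billion worry concrete, here is a minimal sketch of the chance that good advice never shows up during training. It assumes each training episode is an independent draw from the advice distribution; the function name and the episode counts are illustrative and not from the post.

```python
import math


def prob_never_sampled(p_good: float, n_episodes: int) -> float:
    """Probability that n independent draws all miss the good advice."""
    # (1 - p)^n, computed via log1p so a tiny p_good does not lose precision.
    return math.exp(n_episodes * math.log1p(-p_good))


if __name__ == "__main__":
    # Hypothetical training lengths, chosen only for illustration.
    for n in (10**6, 10**9, 10**10):
        print(n, prob_never_sampled(1e-9, n))
```

Under this assumption, a training run needs on the order of 1/p episodes before the good advice is likely to appear even once.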
 by William Saunders 969 days ago | Jessica Taylor likes this | link If B is “sufficiently more intelligent” than A, then B can do things like monitor A’s access to the library, read parts of A’s memory state, etc. and this information can be used to detect plagiarism directly or yield a better advice distribution. An adversarial model could also be used in this capacity, and increased in intelligence along with A. If A is “sufficiently intelligent” to “correctly model how $$B^\prime(a)$$ is generated from $$B(a)$$”, I claim that A never needs to see an example of punishment to want to adopt correct behaviour, because it will be able to perform the expected value calculations and arrive at the correct answer. I’m currently convinced of this claim, but I don’t know how to state it more rigorously. You might be able to use an off-policy learner, and show it a number of generated examples of plagiarism and the appropriate (negative) rewards, with Improbable Oversight ensuring that this information matches the true expected values in the setup (though this may not lead to the agent behaving safely). Or maybe you could train the agent initially in more limited environments where Improbable Oversight can be turned on and off, with the agent clearly informed about whether it is turned on or off, and have IO turned on when it moves to more complex environments. reply
 by Jessica Taylor 968 days ago | link “If B is ‘sufficiently more intelligent’ than A, then B can do things like monitor A’s access to the library, read parts of A’s memory state, etc. and this information can be used to detect plagiarism directly or yield a better advice distribution.” Yes, but at this point it seems like this proposal is mostly the same as Paul’s original proposal for informed oversight. It has the same uncertainties about it (e.g. is B able to understand A’s architecture, when A’s architecture is not optimized for understandability). Though, I suppose it’s slightly easier to have a 1/1000 chance of detecting plagiarism than a 90% chance, so this could still be useful. “I claim that A never needs to see an example of punishment to want to adopt correct behaviour, because it will be able to perform the expected value calculations and arrive at the correct answer.” It seems like this is based on the assumption that you can specify the agent’s utility function to care about such things. Reinforcement learning agents only acquire their “utility function” through training, so extremely large punishments have to show up in the training process somewhere. In general, the agent will need on the order of $$Var(R | a) / \epsilon^2$$ training examples to converge to good beliefs about $$E[R | a]$$ within $$\epsilon$$, where $$R$$ is a random variable indicating how much reward the agent receives and $$a$$ is the action it takes. It seems like this might work if the agent is able to access low-variance estimates of $$E[R | a]$$, though. “You might be able to use an off-policy learner, and show it a number of generated examples of plagiarism and the appropriate (negative) rewards” Seems like this training data won’t distinguish “you get punished for plagiarizing” from “you get punished for plagiarizing in the exact method used in the off-policy examples”.
That is, the agent will be uncertain about whether certain forms of plagiarism will receive punishments, since those types don’t appear in the off-policy training set. This seems insufficient when we want to prevent bad actions that we’ll have trouble thinking of ahead of time (e.g. novel ways of doing steganography). reply
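The $$Var(R | a) / \epsilon^2$$ estimate above can be illustrated with a rough sketch. It assumes a two-outcome reward (a large penalty with small probability, zero otherwise) with the expected penalty held fixed; that reward model, the function name, and its parameters are illustrative assumptions rather than part of the original proposal.

```python
def samples_needed(p_caught: float, expected_penalty: float, eps: float) -> float:
    """Rough sample count Var(R | a) / eps^2 for a rare, large punishment.

    Assumed reward model: with probability p_caught the agent receives a
    penalty of size expected_penalty / p_caught, otherwise it receives zero.
    """
    penalty = expected_penalty / p_caught                   # size of the rare punishment
    variance = penalty ** 2 * p_caught * (1.0 - p_caught)   # variance of a scaled Bernoulli
    return variance / eps ** 2


if __name__ == "__main__":
    # Hypothetical detection probabilities; expected penalty and eps are held fixed.
    for p in (0.9, 1e-3, 1e-9):
        print(p, samples_needed(p, expected_penalty=1.0, eps=0.1))
```

With the expected penalty fixed, the variance scales roughly as 1/p, so the number of examples needed grows in proportion as detection becomes rarer.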
 by William Saunders 961 days ago | Jessica Taylor likes this | link I don’t quite have the exact specification of what I have in mind yet. Fortunately, this seems like a problem that I could try to address in a toy model with current techniques, so I can think about this a bit more and try to come up with a concrete system which would work. I think that it should be possible to construct a reinforcement learner that can make use of side information. One proposal would be something like: construct A so that it internally estimates (with good uncertainty models) $$F(a,r)$$ and $$B(a)$$, and uses those estimates to predict and maximize $$E[B^{\prime}(a)]$$ (perhaps using something as simple as a Monte Carlo simulation that can run for a large number of draws). Then allow A to, as an action or a prelude to acting, ask questions about the value of $$F(a,r)$$ for specific $$(a,r)$$ pairs, or for estimates of $$F(a)$$ (with some model of how estimates of $$F(a)$$ might be incorrect). If A performs correct value-of-information calculations, it should value asking these questions in the training process and correctly learn values, even if it never experiences a situation where it is caught and punished. reply
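A minimal sketch of the Monte Carlo step mentioned above might look like the following. The thread does not restate how $$B^{\prime}(a)$$ is built from $$B(a)$$, so the rule used here (return $$B(a)$$ unless the sampled advice r makes $$F(a,r)$$ true, in which case return a penalty of $$-k$$) is a placeholder assumption, as are the uniform advice distribution and every name in the code.

```python
import random
from typing import Callable, Sequence


def estimate_expected_b_prime(
    action: str,
    b: Callable[[str], float],        # A's internal estimate of B(a)
    f: Callable[[str, str], bool],    # A's internal estimate of F(a, r)
    advice: Sequence[str],            # advice set, sampled uniformly (assumption)
    k: float,                         # penalty size (assumed form of the punishment)
    n_draws: int = 10_000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of E[B'(a)] under the placeholder penalty rule."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        r = rng.choice(advice)
        # Placeholder rule: penalty -k if the advice exposes the action, else B(a).
        total += -k if f(action, r) else b(action)
    return total / n_draws


if __name__ == "__main__":
    # Toy library of 100 books; the "copied" action is exposed by exactly one of them.
    library = [f"book{i}" for i in range(100)]
    score = lambda a: 1.0
    exposes = lambda a, r: a == "copied" and r == "book7"
    print(estimate_expected_b_prime("original", score, exposes, library, k=500.0))
    print(estimate_expected_b_prime("copied", score, exposes, library, k=500.0))
```

An agent that maximizes this estimate never has to be caught in training; it only needs accurate internal models of `b` and `f`, which is the point of the proposal above.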
 by Patrick LaVictoire 970 days ago | link Even the last version might have odd incentives. If A knew that the chances were high enough that a genuinely original book by A would be seen as rare plagiarism of some book unknown to A, the dominant strategy could be to instead commit the most obvious plagiarism ever, in order to minimize the penalty that cannot be reliably avoided. reply
 by William Saunders 970 days ago | Patrick LaVictoire likes this | link This falls in with the question of whether we can distinguish whether A was intentionally vs. unintentionally committing a forbidden action. If the advice class R only contains information about things external to A, then there is no way for this method to distinguish between the two, and we should forbid anything that could be intentional bad behaviour. However, we might be able to include information about A’s internal state in the advice. For example, the advice is a pair (book, location in A’s memory trace), and F(a,r) is only true if the location in A’s memory trace indicates that A plagiarized that particular book (of course, you need to be sure that you’d be able to spot something in A’s memory trace). At least the method fails safely in this case. The null action will always be preferred to committing obvious plagiarism (and committing obvious plagiarism is a pretty great problem to have compared to a silent failure!). And you can always tune k to make A more willing to trade action goodness for forbidden-ness probability, or reduce the size of the library, to alter the incentives if A refuses to do anything when you turn it on. reply
 by Patrick LaVictoire 966 days ago | link If you’re confident of getting a memory trace for all books consulted, then there are simpler ways of preventing plagiarism in the informed oversight case: have the overseer read only the books consulted by the agent (or choose randomly among them for the ones to read). The informed oversight problem here assumes that the internals of A are potentially opaque to B, even though B has greater capabilities than A. reply
 by William Saunders 962 days ago | link Another method for dealing with this case: Prior to the algorithm, pick a “safe set” S of (input, action) pairs that, on some reflection, should obviously not be forbidden. Then only punish actions which have a larger set of values r for which F(a,r) is true than, say, the mean of all actions in S (could be done by setting the evaluation of the null action $$a_{null}$$ to equal the evaluation of some random element in S). This means that A will choose actions that are at least as safe as those in S (which might be suboptimally influenced by the size of the advice set $$R_F$$, but would not lead to paralysis). This could compromise safety if S is chosen poorly, or if there are pathological cases where some unsafe action has fewer reasons to think it unsafe than the actions in S (this seems unlikely at first glance). reply
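The safe-set threshold described above can be sketched as follows. This is a simplified illustration, not a specification: it treats the elements of S as bare actions rather than (input, action) pairs, enumerates a small finite advice set exhaustively, and uses hypothetical names throughout.

```python
from statistics import mean
from typing import Callable, Sequence


def flag_count(action: str, advice: Sequence[str], f: Callable[[str, str], bool]) -> int:
    """Number of advice values r for which F(a, r) is true."""
    return sum(1 for r in advice if f(action, r))


def should_punish(
    action: str,
    safe_set: Sequence[str],          # S: actions judged obviously acceptable (assumed non-empty)
    advice: Sequence[str],
    f: Callable[[str, str], bool],
) -> bool:
    """Punish only if the action is flagged by more advice than the mean over S."""
    baseline = mean(flag_count(s, advice, f) for s in safe_set)
    return flag_count(action, advice, f) > baseline


if __name__ == "__main__":
    # Toy predicate: advice r "flags" an action if r occurs in it as a substring.
    advice = ["x", "y", "z"]
    flags = lambda a, r: r in a
    safe = ["ax", "b"]                # flagged 1 and 0 times, so the baseline is 0.5
    print(should_punish("xyz", safe, advice, flags))
    print(should_punish("b", safe, advice, flags))
```

Note that with a mean threshold, members of S that are flagged more often than average would themselves be punished; the alternative in the comment (evaluating $$a_{null}$$ as a random element of S) handles the baseline differently.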
