by Paul Christiano 292 days ago | Jessica Taylor likes this

I meant that I may be able to sample pairs from some attack distribution without being able to harden my function against that distribution.

Suppose that I have a program $$\widetilde{f}$$, with outputs in $$[0, 1]$$, which implements my desired reward function $$f$$, except that it has a bunch of vulnerabilities $$\widetilde{a}_i$$ on which it mistakenly outputs 1 (when it really should output 0). Suppose further that I am able to sample vulnerabilities roughly as effectively as my AI. Then I can sample vulnerabilities $$\widetilde{a}$$ and provide the pairs $$(\widetilde{a}, 0)$$ to train my reward function, along with a bunch of pairs $$(a, \widetilde{f}(a))$$ for actions $$a$$ produced by the agent.

This doesn't quite work as stated, but you could imagine learning $$f$$ despite having no access to it. (This is very similar to adversarial training / red teams.)
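A minimal sketch of the data construction described above, under loud assumptions: the integer action space, the parity-based `true_f`, the `VULNERABILITIES` set, and the table-based learner are all hypothetical stand-ins, not anything from the comment. Sampled vulnerabilities are labelled 0, the value the reward should output on them, and the learner only ever sees `f_tilde` and the sampled attacks, never `true_f`.

```python
import random

def true_f(a):
    # Hypothetical intended reward f: 1 for even actions, 0 otherwise.
    # The learner below never calls this; it exists only for checking.
    return 1.0 if a % 2 == 0 else 0.0

# Hypothetical vulnerabilities: odd actions, so the intended reward is 0.
VULNERABILITIES = {3, 17, 41}

def f_tilde(a):
    # The flawed program: agrees with f except on vulnerabilities,
    # where it mistakenly outputs 1.
    return 1.0 if a in VULNERABILITIES else true_f(a)

def build_training_pairs(agent_actions, sampled_vulnerabilities):
    # Pairs (a, f_tilde(a)) for actions produced by the agent ...
    pairs = [(a, f_tilde(a)) for a in agent_actions]
    # ... plus pairs (a_vuln, 0) for vulnerabilities found by sampling.
    pairs += [(a, 0.0) for a in sampled_vulnerabilities]
    return pairs

def fit_table(pairs):
    # Assumed toy learner: keep the lowest label seen for each action,
    # so a red-team label of 0 overrides a mistaken 1 from f_tilde.
    table = {}
    for a, r in pairs:
        table[a] = min(r, table.get(a, 1.0))
    return table

rng = random.Random(0)
agent_actions = [rng.randrange(50) for _ in range(200)]
sampled = rng.choices(sorted(VULNERABILITIES), k=20)
learned = fit_table(build_training_pairs(agent_actions, sampled))
```

The "lowest label wins" rule is one arbitrary way to resolve the conflict when the agent itself produces a vulnerability that `f_tilde` labels 1. A vulnerability that the agent finds but the sampler misses keeps its mistaken label, which is one concrete way the scheme "doesn't quite work as stated".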
