by Paul Christiano

I meant that I may be able to sample pairs from some attack distribution without being able to harden my function against that distribution. Suppose I have a program $$\widetilde{f}$$, taking values in $$[0, 1]$$, which implements my desired reward function except that it has a bunch of vulnerabilities $$\widetilde{a}_i$$ on which it mistakenly outputs 1 (when it really should output 0). Suppose further that I am able to sample vulnerabilities roughly as effectively as my AI. Then I can sample vulnerabilities $$\widetilde{a}$$ and provide the pairs $$(\widetilde{a}, -1)$$ to train my reward function, alongside a bunch of pairs $$(a, \widetilde{f}(a))$$ for actions $$a$$ produced by the agent. This doesn't quite work as stated, but you could imagine learning the corrected reward function $$f$$ this way despite having no direct access to it. (This is very similar to adversarial training / red teams.)
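As a toy illustration of this data-construction scheme, here is a minimal sketch. The helpers `sample_vulnerability`, `sample_agent_action`, `f_tilde`, and `featurize` are hypothetical stand-ins for the attack sampler, the agent's action distribution, the flawed reward program $$\widetilde{f}$$, and an action featurizer; the regressor is just a placeholder for whatever model class is used to learn the reward.

```python
# Minimal sketch, assuming hypothetical helpers: sample_vulnerability,
# sample_agent_action, f_tilde (the flawed reward program), and
# featurize (maps actions to feature vectors). Model choice is
# illustrative, not part of the proposal.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def build_training_set(n_attacks, n_agent,
                       sample_vulnerability, sample_agent_action,
                       f_tilde, featurize):
    X, y = [], []
    # Sampled vulnerabilities get the corrective label -1, overriding
    # f_tilde's mistaken output of 1 on them.
    for _ in range(n_attacks):
        a = sample_vulnerability()
        X.append(featurize(a))
        y.append(-1.0)
    # Ordinary agent actions keep whatever label f_tilde assigns.
    for _ in range(n_agent):
        a = sample_agent_action()
        X.append(featurize(a))
        y.append(f_tilde(a))
    return np.array(X), np.array(y)

def learn_reward(X, y):
    # Fit a surrogate reward model on the combined data; the hope is
    # that it matches f_tilde away from the attack distribution while
    # assigning low reward on the vulnerabilities.
    model = GradientBoostingRegressor()
    model.fit(X, y)
    return lambda a_features: model.predict([a_features])[0]
```

The key design point the sketch makes concrete: the learned reward is trained only on labels we can actually produce (sampled attacks plus $$\widetilde{f}$$'s outputs on agent actions), never on the inaccessible corrected function $$f$$ itself.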
