Intelligent Agent Foundations Forum
Censoring out-of-domain representations
discussion post by Patrick LaVictoire 902 days ago | Jessica Taylor and Stuart Armstrong like this | 3 comments

(An idea from a recent MIRI research workshop; similar to some ideas of Eric Drexler and others. Would need further development before it’s clear whether this would do anything interesting, let alone be a reliable part of a taskifying approach.)

If you take an AI capable of pretty general reasoning, and you ask it to do a task in a restricted domain, you can’t necessarily count on it to find a solution that stays conceptually within that domain. This covers a couple of different failure modes, including wireheading (the AI modifies its own sensory inputs rather than the external object you wanted it to modify) and manipulation of humans (the AI influences human psychology rather than the thing you wanted it to directly affect).

Directly forbidding actions outside a domain seems tricky if we can’t define the domain in closed form (this is especially the case if we have an AI affecting humans, and we don’t want it to develop models of human psychology sufficient for manipulating them). One thing we could try instead is ensuring that the AI doesn’t “know too much”, in its internal representations, outside its domain.

“Know too much”, here, can be defined in the sense of Censoring Representations with an Adversary. We add to our network some additional adversary networks that try to extract out-of-domain information from its internal representations; then, by reversing the sign of the gradient backpropagated from those adversaries into the representation, we train the main network to erase such information.
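To make the sign reversal concrete, here is a minimal sketch of a single update step, assuming a scalar representation z = w * x and a linear adversary that guesses the out-of-domain value s as b * z. The names (w, b, x, s, lam, lr) and the squared-error loss are illustrative assumptions, not taken from the post or the cited paper.

```python
# Hypothetical scalar setup: representation z = w * x, adversary guess s_hat = b * z.
# All names and the squared-error loss are illustrative assumptions.

def adversary_loss(w, b, x, s):
    """Squared error of the adversary's guess at the out-of-domain value s."""
    return (b * (w * x) - s) ** 2

def adversary_grads(w, b, x, s):
    """Analytic gradients of the adversary loss w.r.t. b and w."""
    z = w * x
    err = b * z - s
    grad_b = 2.0 * err * z        # dL/db
    grad_w = 2.0 * err * b * x    # dL/dw = dL/dz * dz/dw
    return grad_b, grad_w

def censoring_step(w, b, x, s, lr=0.1, lam=1.0):
    """One simultaneous update: adversary descends, encoder gets the reversed gradient."""
    grad_b, grad_w = adversary_grads(w, b, x, s)
    new_b = b - lr * grad_b             # adversary: ordinary gradient descent
    new_w = w - lr * (-lam * grad_w)    # encoder: sign-reversed gradient, scaled by lam
    return new_w, new_b

w, b, x, s = 0.5, 1.0, 1.0, 2.0
new_w, new_b = censoring_step(w, b, x, s)
loss_before = adversary_loss(w, b, x, s)
# Hold the adversary fixed to isolate the effect of the encoder's reversed step.
loss_after_censoring = adversary_loss(new_w, b, x, s)
```

With the adversary held fixed, the encoder's reversed step raises the adversary's loss, which is the intended censoring pressure. In a full setup the encoder would also receive the ordinary (unreversed) gradient from the task loss, and the adversary would keep adapting, so this shows only the direction of one step.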

Furthermore, so that we don’t destroy any hope of decent task performance, we can erase out-of-domain information only to the degree that it exceeds the info from some known-safe “black box” model of the environment outside the domain.
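One possible reading of "only to the degree that it exceeds" is a hinge on the gap between a baseline predictor's loss and the adversary's loss: the encoder is penalized only while the adversary, reading the representation, beats the known-safe baseline. The sketch below assumes both losses are measured on the same out-of-domain prediction task; the function names and the hinge form are assumptions made for illustration.

```python
# Hypothetical hinge: censor only when the adversary out-predicts the safe baseline.
# Function names and the hinge form are illustrative assumptions.

def excess_information_penalty(adversary_loss, baseline_loss, lam=1.0):
    """Penalty added to the encoder's objective; zero unless the adversary beats the baseline."""
    return lam * max(0.0, baseline_loss - adversary_loss)

def reversed_gradient_scale(adversary_loss, baseline_loss, lam=1.0):
    """Factor applied to the adversary's gradient before it reaches the encoder."""
    # Derivative of the hinge w.r.t. adversary_loss: -lam inside the active region, 0 outside.
    return -lam if adversary_loss < baseline_loss else 0.0

# The adversary beats the baseline (lower loss), so censoring is active.
active_penalty = excess_information_penalty(adversary_loss=0.25, baseline_loss=0.5)
active_scale = reversed_gradient_scale(adversary_loss=0.25, baseline_loss=0.5)

# The adversary does no better than the baseline, so the encoder is left alone.
idle_penalty = excess_information_penalty(adversary_loss=0.75, baseline_loss=0.5)
idle_scale = reversed_gradient_scale(adversary_loss=0.75, baseline_loss=0.5)
```

Under this reading, the representation may carry as much out-of-domain information as the baseline model already provides, and the reversed gradient switches off once the adversary can do no better than that baseline.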

Examples of tasks that we might want to use this on:

  • Play well on a suite of interactive fiction games, but without understanding human psychology any better than some baseline natural language model (e.g. test whether it can predict text conversations between humans).

  • Optimize the newsfeed algorithm for a social network, as measured by analytics data, again without learning human psychology beyond some simple baseline model.

  • Do engineering or chemistry within a physically enclosed box, without learning anything about the world outside that box beyond a simple model of it (e.g. ask it questions about external objects, including the AI’s own hardware, which you’ve put outside the box for obvious reasons).

Question: Is there an extremely simple toy example we could test out on current systems without human input?



by Stuart Armstrong 902 days ago | Ryan Carey and Patrick LaVictoire like this | link

Seems interesting, but the adversary seems to need a very specific definition of what’s outside the domain. Absent that, this just becomes a patch or a nearest unblocked strategy: the solution will be the one that’s best in the domain and doesn’t trigger the specific outside-domain adversary.

by David Krueger 894 days ago | link

I agree… if there are specific things you don’t want it to be able to do / predict, then you can do something very similar to the cited “Censoring Representations” paper.

But if you want to censor all “out-of-domain” knowledge, I don’t see a good way of doing it.

by Patrick LaVictoire 892 days ago | link

Yup, this isn’t robust to extremely capable systems; it’s a quantitative shift in how promising it looks to the agent to learn about external affairs, not a qualitative one.

(In the example with the agent doing engineering in a sandbox that doesn’t include humans or general computing devices, there could be a strong internal gradient to learn obvious details about the things immediately outside its sandbox, and a weaker gradient for learning more distant or subtle things before you know the nearby obvious ones.)

A whitelisting variant would be way more reliable than a blacklisting one, clearly.



