by Stuart Armstrong 545 days ago
1. I don't really see the relevant difference here. If the human has their hard-to-satisfy preferences about, e.g., art and meaning, replaced by a single desire for heroin, this seems like it's making them easier to satisfy.
2. That's a good point.

 by Jessica Taylor 545 days ago
 Re 1: There are cases where it makes the human's preferences harder to satisfy. For example, perhaps heroin addicts demand twice as much heroin as the AI can provide. Yet they will still seek reward strongly and often achieve it, so you might predict that the AI gives them heroin. I think my real beef with saying this "manipulates the human's preferences to make them easier to satisfy" is that, when most people hear this phrase, they think of a specific technical problem that is quite different from this one (in terms of what we would predict the AI to do, not necessarily the desirability of the end result). Specifically, the most obvious interpretation is naive wireheading (under which the AI wants the human to want the speed of light to be above 100 m/s), and this is quite a different problem at a technical level.
 by Stuart Armstrong 545 days ago
 Wireheading the human is the ultimate goal of the AI. I chose heroin as the first step along those lines, but wireheading is where the human would ultimately end up. For instance, once the human is on heroin, the AI could ask them "is your true reward function $$r$$? If you answer yes, you'll get heroin." Under the assumption that the human is rational and the heroin offered is short term, this allows the AI to conclude that the human's reward function is any given $$r$$.
 by Jessica Taylor 545 days ago
 I strongly predict that if you make your argument really precise (as you did in the main post), it will have a visible flaw in it. In particular, I expect the fact that $$r$$ and $$r-1000$$ are indistinguishable to prevent the argument from going through (though it's hard to say exactly how this applies without having access to a sufficiently mathematical argument).
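
 The indistinguishability of $$r$$ and $$r-1000$$ can be illustrated with a minimal sketch. It assumes a one-shot choice by a perfectly rational human; the action names and reward values are made up for illustration and do not come from the thread or the main post. Subtracting a constant from every reward leaves the chosen action unchanged, so an observer who only sees the choice cannot tell the two reward functions apart.

```python
# Hypothetical one-shot rewards over three illustrative actions
# (names and numbers are assumptions, not taken from the discussion).
r = {"art": 3.0, "meaning": 5.0, "heroin": 4.0}


def shifted(reward, c):
    """Return a copy of the reward table with the constant c added to every entry."""
    return {action: value + c for action, value in reward.items()}


def rational_choice(reward):
    """A perfectly rational agent picks an action with maximal reward."""
    return max(reward, key=reward.get)


r_shifted = shifted(r, -1000.0)

# The observable behaviour (the chosen action) is identical under r and r - 1000.
same_choice = rational_choice(r) == rational_choice(r_shifted)
print(rational_choice(r), rational_choice(r_shifted), same_choice)
```

 This only shows the shift invariance itself; whether it actually blocks the heroin argument depends on a more precise statement of that argument than the thread provides.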
