Intelligent Agent Foundations Forum
Exploiting EDT
post by Benja Fallenstein

The problem with EDT is, as David Lewis put it, its “irrational policy of managing the news” (Lewis, 1981): it chooses actions not only because of their effects on the world, but also because of what taking those actions tells it about events it can’t affect at all. The canonical example is the smoking lesion problem.

I’ve long been uncomfortable with the smoking lesion problem as the case against EDT, because an AI system would know its own utility function, and would therefore know whether or not it values “smoking” (presumably, in the AI case, it would be some different goal); if it updates on this fact, it behaves correctly in the smoking lesion problem. (This is an AI-centric version of the “tickle defense” of EDT.) Nate and I have come up with a variant I find much more convincing: a way to get EDT agents to pay you for managing the news for them, which works by the same mechanism that makes these agents one-box in Newcomb’s problem. (It’s a variation of the thought experiment in my LessWrong post on “the sin of updating when you can change whether you exist”.)

Suppose that there’s an EDT agent around that plays the stock market. It’s pretty good at doing so, and has amassed a substantial net worth, but, unsurprisingly, it’s not perfect; there’s always a small chance of its investments going south. Currently, for example, there’s a bit of a hubbub around the CEO of one of the companies the agent has been investing in, and the agent assigns a 0.4% chance that there’s a scandal about to be revealed that will force that CEO to resign; if that’s the case, it expects to lose $150,000,000.

Along comes a clever AI researcher, who is known to be extremely good at predicting how different kinds of AI systems will react in simple thought experiments, who has access to the source code of our agent, and who happens to have access to information about whether or not there’s a scandal. The researcher could find out and sell the information to the agent, but since it’s a low-probability scenario, the value of that information wouldn’t be that large. Instead, the researcher has a better idea (a code sketch of the procedure follows the list): they are going to

  1. find out whether or not there is a scandal;
  2. figure out whether or not the agent will pay them $100,000,000 if they ask it to in the next step;
  3. if either (a) there is no scandal and the researcher predicts that the agent will pay up, or (b) there is a scandal and they predict that the agent won’t pay up, then they will send the agent a pre-drafted e-mail that explains this whole procedure, explains that either case (a) or (b) turned out to be true (but not which one), and asks the agent to immediately pay the researcher $100,000,000.
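
To make this concrete, here is a minimal sketch of the researcher’s procedure in Python. The predictor predict_agent_pays is a hypothetical stand-in for the researcher’s ability to forecast, from the agent’s source code, whether it will pay when asked; it is not something the scenario spells out.

    def researcher_procedure(scandal, predict_agent_pays):
        """Return True iff the researcher sends the pre-drafted e-mail."""
        will_pay = predict_agent_pays()        # step 2: predict the agent's response
        case_a = (not scandal) and will_pay    # (a) no scandal and the agent would pay
        case_b = scandal and (not will_pay)    # (b) scandal and the agent would refuse
        return case_a or case_b                # step 3: send the e-mail iff (a) or (b)

Note that if the agent is the paying type, the e-mail goes out exactly when there is no scandal; if it is the refusing type, it goes out exactly when there is one.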

We assume that the agent has sufficient information about the researcher to know that they are trustworthy and a good predictor of the agent’s behavior. Then, as an EDT agent, it’ll calculate expected utility by conditioning on its observations and on the action it’s considering taking, as follows:

  • Conditioning on receiving the e-mail and paying up, the AI is in case (a): it’s paying $100,000,000, but there is no scandal, so it’s not losing the $150,000,000.
  • Conditioning on receiving the e-mail and not paying up, the AI is in case (b): it’s not paying the researcher, but there’s a scandal, and it’s going to lose $150,000,000.

The first of these leads to higher conditional expected utility, so the AI pays up. The researcher decides to use the money to fund investigations into better decision theories.
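
For concreteness, the comparison the agent makes on receiving the e-mail can be sketched as follows; this is a rough rendering under the post’s assumption that the agent fully trusts the researcher’s honesty and predictive accuracy (figures in millions of dollars, names illustrative):

    PAYMENT = 100  # what the researcher asks for ($100,000,000)
    LOSS = 150     # loss to the agent if the scandal is real ($150,000,000)

    def conditional_expected_utility(pay):
        # Conditioning on the e-mail and the candidate action:
        # paying implies case (a), no scandal, so the only cost is the payment;
        # refusing implies case (b), a scandal, so the agent takes the full loss.
        return -PAYMENT if pay else -LOSS

    pays_up = conditional_expected_utility(True) > conditional_expected_utility(False)
    print(pays_up)  # True: -100 > -150, so the EDT agent pays

And since the agent is the paying type, the e-mail goes out precisely in the no-scandal worlds, i.e. with 99.6% probability, and that is when the $100,000,000 changes hands; a refuser would only ever receive the e-mail in the scandal case, where it loses nothing beyond what it was going to lose anyway.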




