Exploiting EDT
post by Benja Fallenstein 1679 days ago | Ryan Carey, Abram Demski, Daniel Dewey, Nate Soares, Patrick LaVictoire and Paul Christiano like this | 9 comments

The problem with EDT is, as David Lewis put it, its “irrational policy of managing the news” (Lewis, 1981): it chooses actions not only because of their effects on the world, but also because of what the fact that it’s taking these actions tells it about events the agent can’t affect at all. The canonical example is the smoking lesion problem.

I’ve long been uncomfortable with the smoking lesion problem as the case against EDT, because an AI system would know its own utility function, and would therefore know whether or not it values “smoking” (presumably in the AI case it would be a different goal), and if it updates on this fact it would behave correctly in the smoking lesion problem. (This is an AI-centric version of the “tickle defense” of EDT.)

Nate and I have come up with a variant I find much more convincing: a way to get EDT agents to pay you for managing the news for them, which works by the same mechanism that makes these agents one-box in Newcomb’s problem. (It’s a variation of the thought experiment in my LessWrong post on “the sin of updating when you can change whether you exist”.)

Suppose there’s an EDT agent that plays the stock market. It’s pretty good at doing so, and has amassed a substantial net worth, but, unsurprisingly, it’s not perfect; there’s always a small chance of its investments going south. Currently, for example, there’s a bit of a hubbub around the CEO of one of the companies the agent has been investing in, and the agent assigns a 0.4% chance that there’s a scandal about to be revealed which will force that CEO to resign; if that’s the case, it expects to lose $150,000,000.
Along comes a clever AI researcher, who is known to be extremely good at predicting how different kinds of AI systems will react to simple thought experiments, who has access to the source code of our agent, and who happens to be able to access information about whether or not there’s a scandal. The researcher could find out and sell the information to the agent, but since it’s a low-probability scenario, the value of that information wouldn’t be that large. Instead, the researcher has a better idea. They are going to:

1. find out whether or not there is a scandal;
2. figure out whether or not the agent will pay them $100,000,000 if they ask it to in the next step;
3. if either (a) there is no scandal and the researcher predicts that the agent will pay up, or (b) there is a scandal and they predict that the agent won’t pay up, send the agent a pre-drafted e-mail that explains this whole procedure, explains that either case (a) or (b) turned out to be true (but not which one), and asks the agent to immediately pay the researcher $100,000,000.

We assume that the agent has sufficient information about the researcher to know that they are trustworthy and a good predictor of the agent’s behavior. Then, as an EDT agent, it’ll calculate expected utility by conditioning on its observations and on the action it’s considering taking, as follows:

- Conditioning on receiving the e-mail and paying up, the AI is in case (a): it’s paying $100,000,000, but there is no scandal, so it’s not losing the $150,000,000.
- Conditioning on receiving the e-mail and not paying up, the AI is in case (b): it’s not paying the researcher, but there’s a scandal, and it’s going to lose $150,000,000.

The first of these leads to higher conditional expected utility, so the AI pays up. The researcher decides to use the money to fund investigations into better decision theories.
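The conditional expected utilities above can be checked with a small Python sketch. It is only an illustration under stated assumptions: the researcher is modeled as a perfect predictor (the prediction always equals the action actually taken), the dollar figures and the 0.4% prior are the ones from the post, and all function and constant names are made up for this example.

```python
from fractions import Fraction

# Figures from the post.
P_SCANDAL = Fraction(4, 1000)   # 0.4% prior probability of a scandal
LOSS_IF_SCANDAL = 150_000_000   # dollars lost if the scandal is real
PAYMENT = 100_000_000           # dollars the researcher asks for


def email_sent(scandal: bool, predicted_pay: bool) -> bool:
    """The researcher's rule: send the e-mail in case (a) or case (b)."""
    case_a = (not scandal) and predicted_pay
    case_b = scandal and (not predicted_pay)
    return case_a or case_b


def utility(scandal: bool, pays: bool) -> int:
    """Change in the agent's wealth, in dollars."""
    return -(LOSS_IF_SCANDAL if scandal else 0) - (PAYMENT if pays else 0)


def edt_expected_utility(pays: bool) -> Fraction:
    """E[utility | e-mail received, agent takes this action].

    Assumption (not a general fact): the predictor is perfect, so the
    prediction is identical to the action being conditioned on.
    """
    total_weight = Fraction(0)
    total_utility = Fraction(0)
    for scandal in (False, True):
        # Keep only the worlds consistent with having received the e-mail.
        if not email_sent(scandal, predicted_pay=pays):
            continue
        weight = P_SCANDAL if scandal else 1 - P_SCANDAL
        total_weight += weight
        total_utility += weight * utility(scandal, pays)
    return total_utility / total_weight


def edt_choice() -> str:
    """Pick the action with the higher conditional expected utility."""
    return "pay" if edt_expected_utility(True) > edt_expected_utility(False) else "refuse"


# Prior expected loss from the scandal, before any e-mail arrives.
PRIOR_EXPECTED_LOSS = P_SCANDAL * LOSS_IF_SCANDAL
```

Under these assumptions, paying comes out at -$100,000,000 and refusing at -$150,000,000, so `edt_choice()` returns `"pay"`. The prior expected loss is only $600,000 (0.4% of $150,000,000), which illustrates why simply selling the information would fetch far less than the scheme does.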

by Daniel Dewey 1678 days ago | Ryan Carey and Nate Soares like this

Nice example! I think I understood better why this picks out the particular weakness of EDT (and why it’s not a general exploit that can be used against any DT) when I thought of it less as a money-pump and more as “Not only does EDT want to manage the news, you can get it to pay you a lot for the privilege”.

by Abram Demski 1677 days ago | Nate Soares likes this

There is a nuance that needs to be mentioned here. If the EDT agent is aware of the researcher’s ploys ahead of time, it will set things up so that e-mails from the researcher go straight to the spam folder, block the researcher’s calls, and so on. It is not actually happy to pay the researcher for managing the news! This is less pathological than listening to the researcher and paying up, but it’s still an odd news-management strategy that’s a result of EDT.

by Benja Fallenstein 1677 days ago | Abram Demski and Nate Soares like this

True. This looks to me like an effect of EDT not being stable under self-modification, although here the issue is the agent handicapping itself through external means rather than self-modification. It’s like this: if you offer a CDT agent a potion that will make it unable to lift more than one box before it enters Newcomb’s problem (i.e., before Omega makes its observation of the agent), then it’ll cheerfully take it and pay you for the privilege.

by Benja Fallenstein 1678 days ago

Thanks! I didn’t really think at all about whether or not “money-pump” was the appropriate word (I’m not sure what the exact definition is); I have now changed “way to money-pump EDT agents” into “way to get EDT agents to pay you for managing the news for them”.

by Daniel Dewey 1678 days ago

Hm, I don’t know what the definition is either. In my head, it means “can get an arbitrary amount of money from”, e.g. by taking it around a preference loop as many times as you like. In any case, glad the feedback was helpful.

by Patrick LaVictoire 1679 days ago | Daniel Dewey likes this

I like this! You could also post it to Less Wrong without any modifications.

by Daniel Dewey 1678 days ago

I wonder if this example can be used to help pin down desiderata for decisions or decision counterfactuals. What axiom(s) for decisions would avoid this general class of exploits?

by Abram Demski 1677 days ago

I find this surprising, and quite interesting. Here’s what I’m getting when I try to translate the Tickle Defense:

“If this argument works, the AI should be able to recognize that, and predict the AI researcher’s prediction. It knows that it is already the type of agent that will say yes, effectively screening off its action from the AI researcher’s prediction. When it conditions on refusing to pay, it still predicts that the AI researcher thought it would pay up, and expects the fiasco with the same probability as ever. Therefore, it refuses to pay. By way of contradiction, we conclude that the original argument doesn’t work.”

This is implausible, since it seems quite likely that conditioning on its “don’t pay up” action causes the AI to consider a universe in which this whole argument doesn’t work (and the AI researcher sent it a letter knowing that it wouldn’t pay, following (b) in the strategy). However, it does highlight the importance of how the EDT agent is computing impossible possible worlds.

by Abram Demski 1677 days ago | Nate Soares likes this

More technically, we might assume that the AI is using a good finite-time approximation of one of the logical priors that have been explored, conditioned on the description of the scenario. We include a logical description of its own source code and physical computer [making the agent unable to consider disruptions to its machine, but this isn’t important]. The agent makes decisions by the ambient chicken rule: if the agent can prove what action it will take, it does something different from that. Otherwise, it takes the action with the highest expected utility (according to the Bayesian conditional).

Then, the agent cannot predict that it will give the researcher money, because it doesn’t know whether it will trip its chicken clause. However, it knows that the researcher will make a correct prediction. So, it seems that it will pay up. The tickle defense fails as a result of the chicken rule.
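
The decision rule described in this comment can be sketched roughly as follows. This is only an illustration: the proof search and the logical prior are not modeled at all, and are replaced by two plain arguments (`proved_action` and `expected_utility`), both hypothetical names introduced for this example.

```python
from typing import Dict, Optional, Sequence


def chicken_rule_decision(
    actions: Sequence[str],
    expected_utility: Dict[str, float],
    proved_action: Optional[str] = None,
) -> str:
    """Choose an action by the chicken rule as described in the comment.

    `proved_action` stands in for the outcome of a bounded proof search:
    the action the agent has proved it will take, or None if it found no
    such proof. `expected_utility` stands in for the Bayesian conditional
    expected utility of each action. Both are assumptions of this sketch.
    """
    if proved_action is not None:
        # Chicken clause: do something different from what was proved.
        return next(a for a in actions if a != proved_action)
    # No proof found: take the action with the highest expected utility.
    return max(actions, key=lambda a: expected_utility[a])


# Conditional expected utilities from the original post's scenario.
utilities = {"pay": -100_000_000.0, "refuse": -150_000_000.0}
```

With no proof available, `chicken_rule_decision(["pay", "refuse"], utilities)` returns `"pay"`; if the agent had somehow proved it would pay, the chicken clause would make it return `"refuse"` instead, which is why the agent cannot simply take its own payment for granted.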
