Intelligent Agent Foundations Forum
Exploiting EDT
post by Benja Fallenstein 1012 days ago | Ryan Carey, Abram Demski, Daniel Dewey, Nate Soares, Patrick LaVictoire and Paul Christiano like this | 9 comments

The problem with EDT is, as David Lewis put it, its “irrational policy of managing the news” (Lewis, 1981): it chooses actions not only because of their effects on the world, but also because of what the fact that it’s taking these actions tells it about events the agent can’t affect at all. The canonical example is the smoking lesion problem.

I’ve long been uncomfortable with the smoking lesion problem as the case against EDT, because an AI system would know its own utility function, and would therefore know whether or not it values “smoking” (presumably in the AI case it would be a different goal); if it updates on this fact, it behaves correctly in the smoking lesion problem. (This is an AI-centric version of the “tickle defense” of EDT.) Nate and I have come up with a variant I find much more convincing: a way to get EDT agents to pay you for managing the news for them, which works by the same mechanism that makes these agents one-box in Newcomb’s problem. (It’s a variation of the thought experiment in my LessWrong post on “the sin of updating when you can change whether you exist”.)


Suppose that there’s this EDT agent around which plays the stock market. It’s pretty good at doing so, and has amassed a substantial net worth, but, unsurprisingly, it’s not perfect; there’s always a small chance of its investments going south. Currently, for example, there’s a bit of a hubbub around the CEO of one of the companies the agent has been investing in, and the agent assigns a 0.4% chance that there’s a scandal about to be revealed which will force that CEO to resign; if that’s the case, it expects to lose $150,000,000.

Along comes a clever AI researcher, who is known to be extremely good at predicting how different kinds of AI systems will react in simple thought experiments, who has access to the source code of our agent, and who happens to be able to access information about whether or not there’s a scandal. The researcher could find out and sell the information to the agent, but since it’s a low-probability scenario, the value of that information wouldn’t be very large. Instead, the researcher has a better idea: they are going to

  1. find out whether or not there is a scandal;
  2. figure out whether or not the agent will pay them $100,000,000 if they ask it to in the next step;
  3. if either (a) there is no scandal and the researcher predicts that the agent will pay up, or (b) there is a scandal and they predict that the agent won’t pay up, then they will send the agent a pre-drafted e-mail that explains this whole procedure, explains that either case (a) or (b) turned out to be true (but not which one), and asks the agent to immediately pay the researcher $100,000,000.
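The procedure above can be sketched as a toy simulation (a minimal model under my own assumptions: the function name and structure are mine, and the researcher’s prediction is modeled as a boolean standing in for an assumed-perfect predictor, with the agent then acting as predicted):

```python
P_SCANDAL = 0.004         # agent's credence that there is a scandal
PAYMENT = 100_000_000     # what the researcher asks for

def researcher_profit(scandal: bool, agent_pays: bool) -> int:
    """One run of the three-step procedure above.

    `agent_pays` stands in for the researcher's (assumed perfect)
    prediction of the agent's response. Returns the researcher's
    profit in dollars.
    """
    # Step 3: send the e-mail iff case (a) or case (b) holds.
    email_sent = (not scandal and agent_pays) or (scandal and not agent_pays)
    if email_sent and agent_pays:
        return PAYMENT
    return 0

# Against an agent that pays up, the researcher collects $100M in
# every no-scandal world, i.e. with probability 99.6%:
ev = (1 - P_SCANDAL) * researcher_profit(scandal=False, agent_pays=True)
print(ev)  # 99600000.0
```

Note that against a paying agent, the researcher never even sends the e-mail in the scandal worlds, so they profit in exactly the no-scandal cases.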

We assume that the agent has sufficient information about the researcher to know that they are trustworthy and a good predictor of the agent’s behavior. Then, as an EDT agent, it’ll calculate expected utility by conditioning on its observations and on the action it’s considering taking, as follows:

  • Conditioning on receiving the e-mail and paying up, the AI is in case (a): it’s paying $100,000,000, but there is no scandal, so it’s not losing the $150,000,000.
  • Conditioning on receiving the e-mail and not paying up, the AI is in case (b): it’s not paying the researcher, but there’s a scandal, and it’s going to lose $150,000,000.

The first of these leads to higher conditional expected utility, so the AI pays up. The researcher decides to use the money to fund investigations into better decision theories.
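The conditional expected-utility comparison above can be written out as a minimal sketch (assuming, as in the post, that the researcher’s prediction is perfect, so conditioning on the e-mail plus an action pins down which case the agent is in; names are mine):

```python
PAYMENT = 100_000_000
SCANDAL_LOSS = 150_000_000

def utility_given_email(pays: bool) -> int:
    """Utility conditional on receiving the e-mail and choosing `pays`.

    With a perfect predictor, (e-mail received, agent pays) implies
    case (a): no scandal. (E-mail received, agent refuses) implies
    case (b): a scandal, and the $150M loss.
    """
    if pays:
        return -PAYMENT       # case (a): pay $100M, avoid no loss at all
    return -SCANDAL_LOSS      # case (b): keep the $100M, lose $150M

# EDT takes the action with the higher conditional expected utility:
choice = max([True, False], key=utility_given_email)
print(choice)  # True -- the agent pays up
```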



by Daniel Dewey 1010 days ago | Ryan Carey and Nate Soares like this | link

Nice example! I think I understood better why this picks out the particular weakness of EDT (and why it’s not a general exploit that can be used against any DT) when I thought of it less as a money-pump and more as “Not only does EDT want to manage the news, you can get it to pay you a lot for the privilege”.

reply

by Abram Demski 1009 days ago | Nate Soares likes this | link

There is a nuance that needs to be mentioned here. If the EDT agent is aware of the researcher’s ploys ahead of time, it will set things up so that emails from the researcher go straight to the spam folder, block the researcher’s calls, and so on. It is not actually happy to pay the researcher for managing the news!

This is less pathological than listening to the researcher and paying up, but it’s still an odd news-management strategy that results from EDT.

reply

by Benja Fallenstein 1009 days ago | Abram Demski and Nate Soares like this | link

True. This looks to me like an effect of EDT not being stable under self-modification, although here the issue is the agent handicapping itself through external means rather than through self-modification—like, if you offer a CDT agent a potion that will make it unable to lift more than one box before it enters Newcomb’s problem (i.e., before Omega makes its observation of the agent), then it’ll cheerfully take it and pay you for the privilege.

reply

by Benja Fallenstein 1010 days ago | link

Thanks! I didn’t really think at all about whether or not “money-pump” was the appropriate word (I’m not sure what the exact definition is); have now changed “way to money-pump EDT agents” into “way to get EDT agents to pay you for managing the news for them”.

reply

by Daniel Dewey 1010 days ago | link

Hm, I don’t know what the definition is either. In my head, it means “can get an arbitrary amount of money from”, e.g. by taking it around a preference loop as many times as you like. In any case, glad the feedback was helpful.

reply

by Patrick LaVictoire 1011 days ago | Daniel Dewey likes this | link

I like this! You could also post it to Less Wrong without any modifications.

reply

by Daniel Dewey 1010 days ago | link

I wonder if this example can be used to help pin down desiderata for decisions or decision counterfactuals. What axiom(s) for decisions would avoid this general class of exploits?

reply

by Abram Demski 1009 days ago | link

I find this surprising, and quite interesting.

Here’s what I’m getting when I try to translate the Tickle Defense:

“If this argument works, the AI should be able to recognize that, and predict the AI researcher’s prediction. It knows that it is already the type of agent that will say yes, effectively screening-off its action from the AI researcher’s prediction. When it conditions on refusing to pay, it still predicts that the AI researcher thought it would pay up, and expects the fiasco with the same probability as ever. Therefore, it refuses to pay. By way of contradiction, we conclude that the original argument doesn’t work.”

This is implausible, since it seems quite likely that conditioning on its “don’t pay up” action causes the AI to consider a universe in which this whole argument doesn’t work (and the AI researcher sent it a letter knowing that it wouldn’t pay, following (b) in the strategy). However, it does highlight the importance of how the EDT agent is computing impossible possible worlds.

reply

by Abram Demski 1009 days ago | Nate Soares likes this | link

More technically, we might assume that the AI is using a good finite-time approximation of one of the logical priors that have been explored, conditioned on a description of the scenario. We include a logical description of its own source code and physical computer [making the agent unable to consider disruptions to its machine, but this isn’t important]. The agent decides actions by the ambient chicken rule: if the agent can prove what action it will take, it does something different from that; otherwise, it takes the action with the highest expected utility (according to the Bayesian conditional).

Then, the agent cannot predict that it will give the researcher money, because it doesn’t know whether it will trip its chicken clause. However, it knows that the researcher will make a correct prediction. So, it seems that it will pay up.
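A toy sketch of the decision rule being described (the bounded proof search is not implemented here; `provable_action` stands in for its result, and the action names and payoffs are mine, taken from the scenario above):

```python
def chicken_rule_decide(actions, expected_utility, provable_action=None):
    """Toy version of the ambient chicken rule.

    If the agent can prove which action it will take (`provable_action`
    is that action), it does something different from that; otherwise
    it takes the action maximizing expected utility under conditioning.
    """
    if provable_action is not None:
        # Chicken clause: deliberately diagonalize against the proof.
        return next(a for a in actions if a != provable_action)
    return max(actions, key=expected_utility)

eu = {"pay": -100_000_000, "refuse": -150_000_000}
print(chicken_rule_decide(["pay", "refuse"], eu.get))           # pay
print(chicken_rule_decide(["pay", "refuse"], eu.get, "pay"))    # refuse
```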

The tickle defense fails as a result of the chicken rule.

reply


