Can you control the past?

Joe Carlsmith

(Cross-posted from Hands and Cities. Lots of stuff familiar to LessWrong folks interested in decision theory.)

I think that you can “control” events you have no causal interaction with, including events in the past, and that this is a wild and disorienting fact, with uncertain but possibly significant implications. This post attempts to impart such disorientation.

My main example is a prisoner’s dilemma between perfect deterministic software twins, exposed to the exact same inputs. This example shows, I think, that you can write on whiteboards light-years away, with no delays; you can move the arm of another person, in another room, just by moving your own. This, I claim, is extremely weird.

My topic, more broadly, is the implications of this weirdness for the theory of instrumental rationality (“decision theory”). Many philosophers, and many parts of common sense, favor causal decision theory (CDT), on which, roughly, you should pick the action that causes the best outcomes in expectation. I think that deterministic twins, along with other examples, show that CDT is wrong. And I don’t think that uncertainty about “who are you,” or “where your algorithm is,” can save it.

Granted that CDT is wrong, though, I’m not sure what’s right. The most famous alternative is evidential decision theory (EDT), on which, roughly, you should choose the action you would be happiest to learn you had chosen. I think that EDT is more attractive (and more confusing) than many philosophers give it credit for, and that some putative counterexamples don’t withstand scrutiny. But EDT has problems, too.

In particular, I suspect that attractive versions of EDT (and perhaps, attractive attempts to recapture the spirit of CDT) require something in the vicinity of “following the policy that you would’ve wanted yourself to commit to, from some epistemic position that ‘forgets’ information you now know.” I don’t think that the most immediate objection to this – namely, that it implies choosing lower pay-offs even when you know them with certainty – is decisive (though some debates in this vicinity seem to me verbal). But it also seems extremely unclear what epistemic position you should evaluate policies from, and what policy such a position actually implies.

Overall, rejecting the common-sense comforts of CDT, and accepting the possibility of some kind of “acausal control,” leaves us in strange and uncertain territory. I think we should do it anyway. But we should also tread carefully.

I. Grandpappy Omega

Decision theorists often assume that instrumental rationality is about maximizing expected utility in some sense. The question is: what sense?

The most famous debate is between CDT and EDT. CDT chooses the action that will have the best effects. EDT chooses the action whose performance would be the best news.

More specifically: CDT and EDT disagree about the type of “if” to use when evaluating the utility to expect, if you do X. CDT uses a counterfactual type of “if” — one that holds fixed the probability of everything outside of action X’s causal influence, then plays out the consequences of doing X. In this sense, it doesn’t allow your choice to serve as “evidence” about anything you can’t cause — even when your choice is such evidence.

EDT, by contrast, uses a conditional “if.” That is, to evaluate X, it updates your overall picture of the world to reflect the assumption that action X has been performed, and then sees how good the world looks in expectation. In this sense, it takes all the evidence into account, including the evidence that your having done X would provide.

To see what this difference looks like in action, consider:

Newcomb’s problem: You face two boxes: a transparent box, containing a thousand dollars, and an opaque box, which contains either a million dollars, or nothing. You can take (a) only the opaque box (one-boxing), or (b) both boxes (two-boxing). Yesterday, Omega — a superintelligent AI — put a million dollars in the opaque box if she predicted you’d one-box, and nothing if she predicted you’d two-box. Omega’s predictions are almost always right.

CDT two-boxes. Your choice, after all, is evidence about what’s in the opaque box, but it doesn’t actually affect what’s in the box — by the time you’re choosing, the opaque box is either already empty, or already full. So CDT assigns some probability p to the box being full, and then holds that probability fixed in evaluating different actions. Let’s say p is 1%. CDT’s expected payoffs are then:

One-boxing: 1% probability of $1M, 99% probability of nothing = $10K.
Two-boxing: 1% probability of $1M + $1K, 99% probability of $1K = $11K.

Note that there’s some ambiguity, here, about whether CDT then updates p based on its knowledge that it’s about to two-box, then recalculates the expected utilities, and only goes forward if it finds equilibrium. And in some problems, this sort of recalculation makes CDT’s decision-making unstable — see e.g. Gibbard and Harper’s (1978) “Death in Damascus.” But in Newcomb’s problem, no matter what p you use, CDT always says that two-boxing is $1K better, and so two-boxes regardless of what it thinks Omega did, or what evidence its own plans provide.
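The claim that two-boxing comes out $1K ahead for every p can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `cdt_expected_values` and the constants are labels chosen here, and the one modeling assumption is the CDT-style one described above, namely that p is held fixed across actions.

```python
# Dollar amounts from the Newcomb setup above.
MILLION = 1_000_000
THOUSAND = 1_000


def cdt_expected_values(p_full):
    """Return (one_box_ev, two_box_ev) with p_full held fixed across actions.

    Holding p_full fixed is the CDT-style assumption: the choice is not
    allowed to shift the probability that the opaque box is full.
    """
    one_box = p_full * MILLION
    two_box = p_full * (MILLION + THOUSAND) + (1 - p_full) * THOUSAND
    return one_box, two_box


for p in (0.0, 0.01, 0.5, 0.99, 1.0):
    one_box, two_box = cdt_expected_values(p)
    print(f"p={p:.2f}  one-box=${one_box:,.0f}  two-box=${two_box:,.0f}  "
          f"gap=${two_box - one_box:,.0f}")
```

Whatever value of p is plugged in, the gap between the two columns stays at $1K, which is why no amount of updating p changes what CDT recommends here.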

EDT, by contrast, one-boxes. Learning that you one-boxed, after all, is the better news: it means that Omega probably put a million in the opaque box. More specifically, in comparing one-boxing with two-boxing, EDT changes the probability that the box is full. Why? Because, well, the probability is different, conditional on one-boxing vs. two-boxing. Thus, EDT’s pay-offs are:

One-boxing: ~100% chance of $1M = ~$1M.
Two-boxing: ~100% chance of $1K = ~$1K.
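The “~100%” figures depend on how reliable Omega is. As a rough sketch, suppose we summarize that reliability with a single hypothetical parameter `accuracy`: the probability that Omega’s prediction matches the action you actually take. The problem statement only says “almost always right,” so this parameter and the function name are assumptions made for illustration.

```python
MILLION = 1_000_000
THOUSAND = 1_000


def edt_expected_values(accuracy):
    """Return (one_box_ev, two_box_ev), conditioning on the action taken.

    accuracy is an assumed parameter: the probability that Omega's
    prediction matches the action you actually take.
    """
    # Given that you one-box, the box is full with probability `accuracy`.
    one_box = accuracy * MILLION
    # Given that you two-box, the box is full only if Omega erred.
    two_box = (1 - accuracy) * MILLION + THOUSAND
    return one_box, two_box


for accuracy in (0.5, 0.6, 0.8, 0.99):
    one_box, two_box = edt_expected_values(accuracy)
    print(f"accuracy={accuracy:.2f}  one-box=${one_box:,.0f}  "
          f"two-box=${two_box:,.0f}")
```

On this simple model, one-boxing has the higher conditional expectation whenever `accuracy` exceeds 0.5005, so the result does not hinge on Omega being anywhere near perfect.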

What’s the right choice? I think: one-boxing, and I’ll say much more about why below. But I feel the pull towards two-boxing, for CDT-ish reasons.

Imagine, for example, that you have a friend who can see what’s in the opaque box (see Drescher (2006) for this framing). You ask them: what choice will leave me richer? They start to answer. But wait: did you even need to ask? Whether the opaque box is empty or full, you know what they’re going to say. Every single time, the answer will be: two-boxing, dumbo. Omega, after all, is gone; the box’s contents are fixed; the past is past. The question now is simply whether you want an extra $1,000, or not.

I find that my two-boxing intuition strengthens if Omega is your great grandfather, long dead (h/t Amanda Askell for suggesting this framing to me years ago), and if we specify that he’s merely a “pretty good” predictor; one who is right, say, 80% of the time (EDT still says to one-box, in this case). Suppose that he left the boxes in the attic of your family estate, for you to open on your 18th birthday. At the appointed time, you climb the dusty staircase; you brush the cobwebs off the antique boxes; you see the thousand through the glass. Are you really supposed to just leave it there, sitting in the attic? What sort of rationality is that?

Sometimes, one-boxers object: if two-boxers are so rational, why do the one-boxers end up so much richer? But two-boxers can answer: because Omega has chosen to give better options to agents who will choose irrationally. Two-boxers make the best of a worse situation: they almost always face a choice between nothing or $1K, and they, rationally, choose $1K. One-boxers, by contrast, make the worse of a better situation: they almost always face a choice between $1M or $1M+$1K, and they, irrationally, choose $1M.

But wouldn’t a two-boxer want to modify themselves, ahead of Omega’s prediction, to become a one-boxer? Depending on the modification and the circumstances: yes. But depending on the modification and the circumstances, it can be rational to self-modify into any old thing — especially if rich and powerful superintelligences are going around rewarding irrationality. If Omega will give you millions if you believe that Paris is in Ohio, self-modifying to make such a mistake might be worth it; but the Eiffel Tower stays put. At the very least, then, arguments from incentives towards self-modification require more specificity. (Though we might try to provide this specificity, by focusing on self-modifications whose advantages are sufficiently robust, and/or on a restricted class of cases that we deem “fair.”)

CDT’s arguments and replies to objections here are simple, flat-footed, and I think, quite strong. Indeed, many philosophers are convinced by something in the vicinity (see e.g. the 2009 PhilPapers survey, in which two-boxing, at 31%, beats one-boxing, at 21%, with the other 47% answering “other” – though we might wonder what “other” amounts to in a case with only two options). And more broadly, I think that relative to EDT at least, CDT fits better with a certain kind of common sense. Action, we think, isn’t about manipulating our evidence about what’s already the case – what David Lewis calls “managing the news.” Rather, action is about causing stuff. In this sense, CDT feels to me like a basic and hard-headed default. In my head, it’s the “man on the street’s” decision theory. It’s not trying to get “too fancy.” It can feel like solid ground.

II. Writing on whiteboards light-years away

Nevertheless, I think that CDT is wrong. Here’s the case that convinces me most.

Perfect deterministic twin prisoner’s dilemma: You’re a deterministic AI system, who only wants money for yourself (you don’t care about copies of yourself). The authorities make a perfect copy of you, separate you and your copy by a large distance, and then expose you both, in simulation, to exactly identical inputs (let’s say, a room, a whiteboard, some markers, etc). You both face the following choice: either (a) send a million dollars to the other (“cooperate”), or (b) take a thousand dollars for yourself (“defect”).

(Prisoner’s dilemmas, with varying degrees of similarity between the participants, are common in the decision theory literature: see e.g. Lewis (1979), and Hofstadter (1985)).

CDT, in this case, defects. After all, your choice can’t causally influence your copy’s choice: you’re in your room, and he’s in his, far away. Indeed, we can specify that such influence is physically impossible – by the time information about your choice, traveling at the speed of light, can reach him, he’ll have already chosen (and vice versa). And regardless of what he chooses, you get more money by taking the thousand.

But defecting in this case, I claim, is totally crazy. Why? Because absent some kind of computer malfunction, both of you will make the same choice, as a matter of logical necessity. If you press the defect button, so will he; if you cooperate, so will he. The two of you, after all, are exact mirror images. You move in unison; you speak, and think, and reach for buttons, in perfect synchrony. Watching the two of you is like watching the same movie on two screens.

Indeed, for all intents and purposes, you control what he does. Imagine, for example, that you want to get something written on his whiteboard: let’s say, the words “I am the egg man; you are the walrus.” What to do? Just write it on your own whiteboard. Go ahead, try it. It will really work. When you two rendezvous after this is all over, his whiteboard will bear the words you chose. In this sense, your whiteboard is a strange kind of portal; a slate via which you can etch your choices into his far-away world; a chance to act, spookily, at a distance.

And it’s not just whiteboards: you can make him do whatever you want – dance a silly samba, bang his head against the wall, press the cooperate button — just by doing it yourself. He is your puppet. Invisible strings, more powerful and direct than any that operate via mere causality, tie every movement of your mind and body to his.

What’s more: such strings can’t be severed. Try, for example, to make the two whiteboards different. Imagine that you’ll get ten million dollars if you succeed. It doesn’t matter: you’ll fail. Your most whimsical impulse, your most intricate mental acrobatics, your special-est snowflake self, will never suffice: you can no more write “up” while he writes “down” than you can floss while the man in the bathroom mirror brushes his teeth. In this sense, if you find yourself reasoning about scenarios where he presses one button, and you press another – e.g., “even if he cooperates, it would be better for me to defect” – then you are misunderstanding your situation. Those scenarios just aren’t on the table. The available outcomes here are only defect-defect, and cooperate-cooperate. You can get a thousand, by defecting, or you can get a million, by cooperating; but you can’t get less, or more.
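One way to see why the mixed outcomes are off the table is to model each twin as the same deterministic function applied to the same input. The sketch below is a toy: the `policies` dict and the `run_twins` helper are hypothetical names, and a real agent is vastly more complicated, but the structural point carries over.

```python
from itertools import product


def my_payoff(my_move, twin_move):
    """Payoff to me: cooperating sends $1M to the other, defecting takes $1K."""
    received = 1_000_000 if twin_move == "cooperate" else 0
    taken = 1_000 if my_move == "defect" else 0
    return received + taken


def run_twins(policy, inputs):
    """Run the same deterministic policy twice on identical inputs."""
    return policy(inputs), policy(inputs)


# A few hypothetical deterministic policies; any pure function would do.
policies = {
    "always_cooperate": lambda inputs: "cooperate",
    "always_defect": lambda inputs: "defect",
    "parity_of_input": lambda inputs: "cooperate" if len(inputs) % 2 == 0 else "defect",
}

room = "a room, a whiteboard, some markers"
reachable = {run_twins(policy, room) for policy in policies.values()}
all_pairs = set(product(["cooperate", "defect"], repeat=2))

print("reachable:", sorted(reachable))
print("never reached:", sorted(all_pairs - reachable))
for mine, twins in sorted(reachable):
    print(f"({mine}, {twins}) pays me ${my_payoff(mine, twins):,}")
```

However the policy is written, as long as it is a pure function of its inputs, the pairs (cooperate, defect) and (defect, cooperate) never show up in `reachable`.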

To me, it’s an extremely easy choice. Just press the “give myself a million dollars” button! Indeed, at this point, if someone tells me “I defect on a perfect, deterministic copy of myself, exposed to identical inputs,” I feel like: really?

Note that this doesn’t seem like a case where any idiosyncratic predictors are going around rewarding irrationality. Nor, indeed, does it feel to me like “cooperating is an irrational choice, but it would be better for me to be the type of person who makes such a choice” or “You should pre-commit to cooperating ahead of time, however silly it will seem in the moment” (I’ll discuss cases that have more of this flavor later). Rather, it feels like what compels me is a direct, object-level argument, which could be made equally well before the copying or after. This argument recognizes a form of acausal “control” that our everyday notion of agency does not countenance, but which, pretty clearly, needs to be taken into account. Indeed, in effect, I feel like the case discovers a kind of magic; a mechanism for writing on whiteboards light-years away; a way of moving my copy’s hand to the cooperate button, or the defect button, just by moving mine. Ignoring this magic feels like ignoring a genuine and decision-relevant feature of the real world.

III. Who is the eggman, and who is the walrus?

I want to acknowledge and emphasize, though, that this kind of magic is extremely weird. Recognizing it, I think, involves a genuinely different way of understanding your situation, and your power. It makes your choices reverberate in new directions; it gives you a new type of control, over things you once thought beyond your sphere of influence – including, I’ll suggest, over events in the past (more on this below).

What’s more, I think, it changes – and clarifies – your sense of what your agency amounts to. Consider: who is the eggman, here, and who is the walrus? Suppose you want to send your copy a message: “hello, this is a message from your copy.” So you write it on your whiteboard, and thus on his. You step back, and see a message on your own whiteboard: “hello, this is a message from your copy.” Did he write that to you? Was that your way of writing to him? Are you actually alone, writing to yourself? All three at once. I said earlier that your copy is your puppet. But equally, you are his puppet. But more truly, neither of you are puppets. Rather, you are both free men, in a strange but actually possible situation. You stand in front of your whiteboard, and it is genuinely up to you what you write, or do. You can write “I am a little lollypop, booka booka boo.” You can draw a demon kitten eating a windmill. You can scream, and dance, and wave your arms around, however you damn well please. Feel the wind on your face, cowboy: this is liberty. And yet, he will do the same. And yet, you two will always move in unison.

We can think of the magic, here, as arising centrally because compatibilism about free will is true. Let’s say you got copied on Monday, and it’s Friday, now – the day both copies will choose. On Monday, there was already an answer as to what button you and your copy will press, given exposure to the Friday inputs. Maybe we haven’t computed the answer yet (or maybe we have); but regardless, it’s fixed: we just need to crunch the numbers, run the deterministic code. From this sort of pre-determination comes a classic argument against free will: if the past and the physical laws (or their computational analogs, e.g. your state on Monday, and the rest of the code that will be run on Friday) are only compatible with your performing one of (a) or (b), then you can’t be free to choose either, because this would imply that you are free to choose the past/the physical laws, which you can’t. Here, though, we pull a “one person’s reductio is another’s discovery”: because only one of (a) or (b) is compatible with the past/the physical laws, and because you are free to choose (a) or (b), it turns out that in some sense, you’re free to choose the past/the physical laws (or, their computational analogs).

What? That can’t be right. But isn’t it, in the practically relevant sense? Consider: the case is basically one where, if it’s the case that your state on Monday (call this Monday-Joe), copied and evolved according to deterministic process P, outputs “cooperate,” then you get a million dollars; and if it outputs “defect,” you get a thousand dollars (see e.g. Ahmed (2014)’s “Betting on the Past” for an even simpler version of this). It’s Friday now. The state of Monday-Joe is fixed; Monday-Joe lives in the past. And process P, let’s say, was fixed on Monday, too. In this sense, the question of what Monday-Joe + process P outputs is already fixed. You, on Friday, are evolving-Joe: that is, Monday-Joe-in-the-midst-of-evolving-according-to-process-P. If you choose cooperate, it will always have been the case that Monday-Joe + process P outputs cooperate. If you choose defect, it will always have been the case that Monday-Joe + process P outputs defect. In this very real sense – the same sense at stake in every choice in a deterministic world – you get to choose what will have always been the case, even before your choice.

Try it. It will really work. Make your Friday choice, then leave the simulation, go get an old and isolated copy of Monday-Joe and Process P – one that’s been housed, since Monday, somewhere you could not have touched or tampered with — press play, and watch what comes out the other end. You won’t be surprised.
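The “press play” experiment can be mimicked with any deterministic program. In the toy sketch below, `monday_state` and `process_p` are stand-ins invented for illustration (a small dict and an arbitrary hash-based update rule); nothing about them is meant to model a real mind, only the replay structure of the thought experiment.

```python
import copy
import hashlib


def process_p(state, steps=5):
    """Deterministically evolve a state, then output a choice.

    The update rule (repeated SHA-256 hashing) is an arbitrary stand-in
    for whatever deterministic computation runs between Monday and Friday.
    """
    digest = state["memory"].encode()
    for _ in range(steps):
        digest = hashlib.sha256(digest).digest()
    return "cooperate" if digest[0] % 2 == 0 else "defect"


monday_state = {"memory": "Monday-Joe"}

# Archive an isolated copy on Monday, before anything else happens.
archived_monday_state = copy.deepcopy(monday_state)

# Friday: the live run makes its choice.
friday_choice = process_p(monday_state)

# Later: press play on the untouched archive.
replayed_choice = process_p(archived_monday_state)

print(friday_choice, replayed_choice, friday_choice == replayed_choice)
```

The replay agrees with the Friday choice on every run, not because Friday reaches back and edits the archive, but because both are outputs of the same state under the same process.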

Is that changing the past? In one sense: no. It’s not that Joe’s state on Monday was X, but then because of what Evolving-Joe did on Friday, Joe’s state on Monday became Y instead. Nor does the output of Monday-Joe + Process P alter over the course of the week. Don’t be silly. You can’t change these things like you can change the contents of your fridge: milk on one day, juice on the next. It’s not milk at noon on Monday, and then on Friday, juice at noon on Monday instead. We must distinguish between the ability to “change things” in this sense, and the ability to “control” them in some broader sense.

But nevertheless: you get to decide, on Friday, the thing that will always have been true; the one thing that will always have been in your fridge, since the beginning of time. And perhaps this approaches, ultimately, the full sense of compatibilist decision-making, compatibilist “control,” even in cases of causal influence. Perhaps, that is, you can change the past, here, about as much as you can change the future in a deterministic world: that is, not at all, and enough to matter for practical purposes. After all, in such a world, the future is already fixed by the past. Your ability to decide that future was, therefore, always puzzling. Perhaps your ability to decide the past isn’t much more so (though certainly, it’s no less).

CDT can’t handle this kind of thing. CDT imagines that we have severed the ties between you and your copy, between you and the history that determines every aspect of you. It imagines that you can hold your copy’s arm fixed, and move yours freely; that you can break apart the future from the past, and let the future swing, at your pleasure, along some physically (indeed, logically!) impossible hinge. But you can’t. The echoes of your choice started before you chose. You are implicated in a structure that reverberates in all directions. You pull your arm, and the past and the universe trail behind; and yet, the past and universe push your arm; and yet, neither: you, the past, the future, the universe, are all born in the same timeless instant — free, fixed, consistent, a full and living painting of someone painting it as they go along.

And CDT’s mistake, here, is not just abstract misconception: rather, it misleads you in straightforward and practically-relevant ways. In particular, it prompts CDT to compare actions using expected utilities that you shouldn’t actually expect – which, when you step back, seems pretty silly. Suppose, for example, that as a CDT agent, you start out with a credence p that your copy will defect of 99%. Thus, as in Newcomb’s problem above, your payoffs are:

Expected utility from defecting: $1K guaranteed + $10K from a 1% probability of getting a million from my copy = $11K.
Expected utility from cooperating: $10K from a 1% probability of getting a million from my copy = $10K.

But you shouldn’t actually expect only $10k, if you cooperate, given the logical necessity of his doing what you do. That’s just … not the right number. So why are you considering it? This is no time to play around with fantasy distributions over outcomes; there’s real money on the line. And of course, this sort of objection will hold for any p. As long as you and your copy’s choice are correlated, CDT is going to ignore that correlation, hold p constant given different actions, and in that sense, prompt you to choose as though your probabilities are wrong.
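To make the mismatch concrete, here is a small sketch comparing the numbers CDT uses with the numbers a twin should actually expect. The function names are made up for this illustration, and the second function simply assumes that the copy’s choice always matches yours, as the case stipulates.

```python
MILLION = 1_000_000
THOUSAND = 1_000


def cdt_values(p_copy_defects):
    """CDT-style values: the copy's defection probability is held fixed."""
    p_copy_cooperates = 1 - p_copy_defects
    cooperate = p_copy_cooperates * MILLION
    defect = THOUSAND + p_copy_cooperates * MILLION
    return {"cooperate": cooperate, "defect": defect}


def values_given_perfect_correlation():
    """What to expect if the copy necessarily mirrors your choice."""
    return {"cooperate": MILLION, "defect": THOUSAND}


for p in (0.99, 0.5, 0.01):
    cdt = cdt_values(p)
    print(f"p={p:.2f}  CDT cooperate=${cdt['cooperate']:,.0f}  "
          f"CDT defect=${cdt['defect']:,.0f}")

print("With perfect correlation:", values_given_perfect_correlation())
```

For every p, the CDT column ranks defecting $1K higher, while the correlated figures rank cooperating higher by $999K.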

EDT does better, here, of course: choosing based on what utility you, as a Bayesian, should actually expect, given different actions, is EDT’s forté by definition, and a powerful argument in its favor (see e.g. Christiano’s “simple argument for EDT”). And the considerations about compatibilism and determinism I’ve been discussing seem friendly to EDT as well. After all, if you are living in an already-painted painting, it seems unsurprising if choice comes down to something like “managing the news.” The problem with managing the news, after all, was supposed to be that the news was already fixed. But in an already-painted painting, the future has already been fixed, too: you just don’t know what it is. And when you act, you start to find out. Insofar as you can choose how to act – and per compatibilism, you can – then you can choose what you’re going to find out, and in that sense, influence it. Do you hope that this already-fixed universe is one where you eat a sandwich? Well, go make a sandwich! If you do, you’ll discover that your dream for the universe has always been true, since the beginning of time. If you don’t make a sandwich, though, your dream will die. Why should the applicability of such reasoning be limited by the scope of “causation” (whatever that is)?

IV. What if the case is less clean?

I took pains, above, to specify that the copying process was perfect, and the inputs received exactly identical. It’s perfectly possible to satisfy this constraint, and we don’t need to use “atom-for-atom” copies and the like, or assume determinism at a physical level; we can just make you an AI system running in a deterministic simulation. What’s more, this constraint helps make the point more vivid; and it suffices, I think, to show that CDT is wrong.

However, I don’t think it’s necessary. Consider, for example, a version where there are small errors in the copying process; or in which you get a blue hat, and your copy, a red; or in which your environment involves some amount of randomness. These may or may not suffice to ruin your ability to write exactly what you want on his whiteboard. But very plausibly, the strong correlation between your choice of button, and his, will persist: and to the extent it does, this information is worthy of inclusion in your decision-making process.
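How strong does the correlation need to be before it matters? A rough way to get a feel for this is to summarize the correlation with a single number. The sketch below assumes a hypothetical parameter `q`, the probability that your copy’s choice matches yours conditional on whatever you choose. This is a simplification introduced here for illustration, not part of the case as described.

```python
MILLION = 1_000_000
THOUSAND = 1_000


def conditional_values(q):
    """Expected payoff conditional on each action, given match probability q."""
    cooperate = q * MILLION                  # copy also cooperates with prob. q
    defect = THOUSAND + (1 - q) * MILLION    # copy cooperates anyway with prob. 1 - q
    return cooperate, defect


def break_even_q():
    """Solve q * M = K + (1 - q) * M for q."""
    return (MILLION + THOUSAND) / (2 * MILLION)


print(f"break-even q = {break_even_q():.4f}")
for q in (0.5, 0.6, 0.9, 1.0):
    cooperate, defect = conditional_values(q)
    better = "cooperate" if cooperate > defect else "defect"
    print(f"q={q:.2f}  cooperate=${cooperate:,.0f}  defect=${defect:,.0f}  -> {better}")
```

With these particular payoffs, the break-even point sits only slightly above one half, so on this simplified model the copying can be quite noisy before cooperating stops having the higher conditional expectation.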

What if you know that your copy has already chosen, before you make your choice? To the extent that the correlations between your choice and his persist in such conditions, I think that the same argument applies. Note, though, that your knowing that he’s already chosen means that the two of you got different inputs in a sense that seems more likely to affect your decision-making than getting different colored hats. That is, you saw a light indicating “your copy has already chosen”; he didn’t; and some people, faced with a light of that kind, start acting all weird about how “his choice is already made, I can’t affect it, might as well defect” and so on, in a way that they don’t when the light is off. So the question of what sorts of correlations are still at stake is more up for grabs. Does learning that you cooperate, after seeing such a light, still make it more likely that he cooperated, without seeing one? If so, that seems worth considering.

(This sort of “different inputs” dynamic also blocks certain types of loops/contradictions that could come from learning what a deterministic copy of you already did. E.g., if you learn what he chose – say, that he cooperated – before you make your choice, it’s still compatible with the case’s set up that you defect, as long as he got different inputs: e.g., he didn’t also learn that you cooperated. If he did “learn” that you cooperated, then things are getting more complicated. In particular, either you will in fact cooperate, or some feature of the case’s set-up is false. This is similar to how, if you travel back in time and try to kill your grandfather, either you will in fact fail, or the case’s set-up is false. Or to how, if you hear an infallible prediction that you’ll do X, then either you will in fact do X, or the prediction wasn’t infallible after all.)

V. Monopoly money

I think that “perfect deterministic twin prisoner’s dilemma”-type cases suffice to show that CDT is wrong. But I also want to note another type of argument I find persuasive, in the context of Newcomb’s problem, and which also evokes the type of “magic” I have in mind.

Imagine doing “tryout runs” of Newcomb’s problem, using monopoly money, as many times as you’d like, before facing the real case (h/t Drescher (2006) again). You try different patterns of one-boxing and two-boxing, over and over. Every time you one-box, the opaque box is full. Each time you two-box, it’s empty.

You find yourself thinking: “wow, this Omega character is no joke.” But you try getting fancier. You fake left, then go right — reaching for the one box, then lunging for the second box too at the last moment. You try increasingly complex chains of reasoning. Before choosing, you try deceiving yourself, bonking yourself on the head, taking heavy doses of hallucinogens. But to no avail. You can’t pull a fast one on ol’ Omega. Omega is right every time.

Indeed, pretty quickly, it starts to feel like you can basically just decide what the opaque box will contain. “Shazam!” you say, waving your arms over the boxes: “I hereby make it the case that Omega put a million dollars into the box.” And thus, as you one-box, it is so. “Shazam!” you say again, waving your arms over a new set of boxes: “I hereby make it the case that Omega left the box empty.” And thus, as you two-box, it is so. With Omega’s help, you feel like you have become a magician. With Omega’s help, you feel like you can choose the past.
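The tryout runs can be imitated in a toy simulation. The sketch below assumes, as one possible mechanism that the problem statement does not specify, that Omega predicts by running the very same deterministic decision procedure you will later run. The names `omega_fill_box`, `play_round`, and `fancy_strategy` are hypothetical.

```python
MILLION = 1_000_000
THOUSAND = 1_000


def omega_fill_box(decide, round_number):
    """Omega 'predicts' by running the agent's own deterministic procedure."""
    predicted = decide(round_number)
    return MILLION if predicted == "one-box" else 0


def play_round(decide, round_number):
    """Fill the box first ('yesterday'), then let the agent choose ('today')."""
    opaque = omega_fill_box(decide, round_number)
    choice = decide(round_number)
    payout = opaque if choice == "one-box" else opaque + THOUSAND
    return choice, opaque, payout


def fancy_strategy(round_number):
    """An attempt at unpredictability that is still a deterministic rule."""
    return "one-box" if (round_number * 7 + 3) % 5 < 2 else "two-box"


results = [play_round(fancy_strategy, n) for n in range(10)]
for choice, opaque, payout in results:
    print(f"{choice:8s}  opaque box held ${opaque:,}  payout ${payout:,}")
```

Under this assumed mechanism, however convoluted `fancy_strategy` gets, every one-boxing round finds the box full and every two-boxing round finds it empty, so the only payouts that ever occur are $1M and $1K.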

Now, finally, you face the true test, the real boxes, the legal tender. What will you choose? Here, I expect some feeling like: “I know this one; I’ve played this game before.” That is, I expect to have learned, in my gut, what one-boxing, or two-boxing, will lead to – to feel viscerally that there are really only two available outcomes here: I get a million dollars, by one-boxing, or I get a thousand, by two-boxing. The choice seems clear.

VI. Against undue focus on folk-theoretical names

Of course, the same two-boxing responses I noted above apply here, too. It’s true that every time you one-box, you would’ve gotten an extra $1,000 if you’d two-boxed, assuming CDT’s “counterfactual” construal of “would.” It’s true that you leave the $1,000 on the table; that this is predictably regrettable for some sense of “regret”; and we can say, for this reason, that “Omega is just play-rewarding your play-irrationality.” I don’t have especially deep responses to these objections. But I find myself persuaded, nevertheless, that one-boxing is the way to go.

Or at least, it’s my way. When I step back in Newcomb’s case, I don’t feel especially attached to the idea that it’s the way, the only “rational” choice (though I admit I feel this non-attachment less in perfect twin prisoner’s dilemmas, where defecting just seems to me pretty crazy). Rather, it feels like my conviction about one-boxing starts to bypass debates about what’s “rational” or “irrational.” Faced with the boxes, I don’t feel like I’m asking myself “what’s the rational choice?” I feel like I’m, well, deciding what to do. In one sense of “rational” – e.g., the counterfactual sense – two-boxing is rational. In another sense – the conditional sense – one-boxing is. What’s the “true sense,” the “real rationality”? Mu. Who cares? What’s that question even about? Perhaps, for the normative realists, there is some “true rationality,” etched into the platonic realm; a single privileged way that the normative Gods demand that you arrange your mind, on pain of being… what? “Faulty”? Silly? Subject to a certain sort of criticism? But for the anti-realists, there is just the world, different ways of doing things, different ways of using words, different amounts of money that actually end up in your pocket. Let’s not get too hung up on what gets called what.

There’s a great line from David Lewis, which I often think of on those rare and clear-cut occasions when philosophical debate starts to border on the terminological.

“Why care about objective value or ethical reality? The sanction is that if you do not, your inner states will fail to deserve folk-theoretical names. Not a threat that will strike terror into the hearts of the wicked! But whoever thought that philosophy could replace the hangman?”

I want to highlight, in particular, the idea of “failing to deserve folk-theoretical names.” Too often, philosophy – especially normative philosophy — devolves into a debate about what kind of name-calling is appropriate, when. But faced with the boxes, or the buttons, our eyes should not be on the folk-theoretical names at stake. Rather, our eyes should be on the choice itself.

Note that my point here is not that “rationality is about winning” (see e.g. Yudkowsky (2009)). “Winning,” here, is subject to the same ambiguity as “rational.” One-boxers tend to end up richer, yes. But faced with a choice between $1k, or nothing (the choice that the two-boxer is actually presented with), $1k is the winning choice. Still, I am with Yudkowsky in spirit, in that I think that too much interest in the word “rational” here is apt to move our eyes from the prize.

(All that said, I’m going to continue, in what follows, to use the standard language of “what’s rational,” “what you should do,” etc, in discussing these cases. I hope that this language will be interpreted in a sense that connects directly to the actual, visceral process of deciding what to do, name-calling be damned. I acknowledge, though, that there’s a possible motte-and-bailey dynamic here, where the one-boxer goes in hard for claims like “CDT is wrong” and “c’mon, defecting in perfect twin prisoner’s dilemmas is just ridiculous!” and then backs off to “hey man, you’ve got your way, I’ve got my way, what’s all this obsession with the word ‘rationality’?” when pressed about the counterintuitive consequences of their own position. And more broadly, it can be hard to combine object level normative debate, which often proceeds with a kind of “realist” flavor, with adequate consciousness and communication of some more fundamental meta-ethical arbitrariness. If necessary, we might go back through the whole post and try to rewrite it in more explicitly anti-realist terms – e.g., “I reject CDT.” But I’ll skip that, partly because I suspect that something beyond naive meta-ethical realism gets lost in this sort of move, even if we don’t have an explicit account of what it is.)

VII. Identity crises are no defense of CDT

I’ve now covered two data-points that I take to speak very strongly against CDT: namely, that one should cooperate in a twin prisoner’s dilemma, and that one should one-box in Newcomb’s problem. I want to briefly discuss an unusual way of trying to get CDT to one-box: namely, by appealing to uncertainty about whether you are faced with the real boxes, or whether you are in a simulation being used by Omega to predict your future choice (see e.g. Aaronson (2005) and Critch (2017) for suggestions in this vein, though not necessarily in these specific terms). Basically, I don’t think this move works, in general, as a way of saving CDT, though the type of uncertainty in question might be relevant in other ways.

How is the story supposed to go? Imagine that you know that the way Omega predicts whether you’ll one-box, or two-box, is by running an extremely high-fidelity simulation of you. And suppose that both real-you and sim-you only care about what happens to real-you. By hypothesis, sim-you shouldn’t be able to figure out whether he’s simulated or real, because then he’ll serve as worse evidence about real-you’s future behavior (for example, if sim-you appears in a room with writing on the wall saying “you’re the sim,” then he can just one-box, thereby causing Omega to add the money to the opaque box, thereby allowing real-you, appearing in a room saying “you’re the real one,” to two-box, get the full million-point-one, and make Omega’s “prediction” wrong). So it needs to be the case that you’re uncertain – let’s say, 50-50 — about whether you’re simulated or not. Thus, the thought goes, you should one-box, because there’s a 50% chance that doing so will cause Omega to put the million in the box, and your real-self (who will also, presumably, one-box, given the similarity between you) will get it.

(Calculation: feel free to skip. Suppose that you currently expect yourself to one-box, as both real-you and sim-you, with 99% probability. Then the CDT calculation runs as follows:

50% chance you’re the sim, in which case:
- EV of one-boxing: 99% chance real-you gets $1M, 1% chance real-you gets $1M + $1K = $1,000,010.
- EV of two-boxing: 99% chance real-you gets nothing, 1% chance real-you gets $1K = $10.
50% chance you’re real, in which case:
- EV of one-boxing: 99% chance real-you gets $1M, 1% chance real-you gets nothing = $990,000.
- EV of two-boxing: 99% chance real-you gets $1M + $1K, 1% chance real-you gets $1K = $991,000.
So overall:
- EV of one-boxing = 50% * $1,000,010 + 50% * $990,000 = $995,005.
- EV of two-boxing = 50% * $10 + 50% * $991,000 = $495,505.

Depending on the details, CDT may then need to adjust its probability that both sim-you and real-you one-box. But high confidence that both versions of you one-box is a stable equilibrium (e.g., CDT still one-boxes, given such a belief); whereas high confidence that both will two-box is not (e.g., CDT one-boxes, given such a belief). There are also some problems, here, with making such calculations consistent with assigning a specific probability to Omega being right in her prediction, but I’m setting those aside.)
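For readers who want to check the arithmetic, here is a minimal Python sketch of the calculation above. The function and parameter names are my own, and the model assumes (as the text does) that both versions of you care only about real-you and that the other version one-boxes with a single probability p. It reproduces the $995,005 and $495,505 figures at p = 0.99, and it also lets you try a low p to see why high confidence in two-boxing is not a stable belief for CDT here.

```python
MILLION = 1_000_000
THOUSAND = 1_000

def cdt_values(p_other_one_boxes, p_sim=0.5):
    """Causal expected value (dollars to real-you) of one-boxing and two-boxing.

    p_other_one_boxes: credence that the other version of you one-boxes.
    p_sim: credence that you are the simulation (50-50 in the text).
    """
    p = p_other_one_boxes
    # If you are the sim, your choice causes the opaque box to be full or empty;
    # real-you then one-boxes with probability p.
    sim_one = p * MILLION + (1 - p) * (MILLION + THOUSAND)
    sim_two = p * 0 + (1 - p) * THOUSAND
    # If you are real, the box is already full iff sim-you one-boxed (probability p).
    real_one = p * MILLION
    real_two = p * (MILLION + THOUSAND) + (1 - p) * THOUSAND
    one_box = p_sim * sim_one + (1 - p_sim) * real_one
    two_box = p_sim * sim_two + (1 - p_sim) * real_two
    return one_box, two_box

for p in (0.99, 0.01):
    one_box, two_box = cdt_values(p)
    print(f"p={p}: one-box EV = {one_box:,.0f}, two-box EV = {two_box:,.0f}")
```

In this model the gap between the two options is p_sim * $1M minus (1 - p_sim) * $1K, which does not depend on p at all. That is why one-boxing wins for any belief about what the other version will do, as long as the probability of being the sim is not tiny.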

My objections here are:

1. This move doesn’t work if you’re indexically selfish (e.g., you don’t care about copies of yourself).
2. This move doesn’t work for twin prisoner’s dilemma cases more broadly.
3. It’s not clear that simulations are necessary for predicting your actions in the relevant cases.
4. In general, it really doesn’t feel like this type of thing is driving my convictions about these cases.

Let’s start with (1). Suppose that real-you and sim-you aren’t united in sole concern for real-you. Rather, suppose that you’re both out for yourselves. Sim-you, let’s suppose, faces bleak prospects: whatever happens, Omega is going to shut down the simulation right after sim-you’s choice gets made. So sim-you doesn’t give a shit about this whole ridiculous situation with the god-damn boxes; the world is dust and ashes. Real-you, by contrast, is a CDT agent. So real-you, left to his own devices, is a two-boxer. Hence, sim-you doesn’t care, and real-you wants to two-box; and thus, uncertain about who you are, you two-box.

(Calculation, feel free to skip. Suppose you start out 99% confident that both versions of you will two-box. Thus:

50% chance you’re the sim, in which case: you get nothing no matter what.
50% chance you’re real, in which case:
- EV of one-boxing: 1% chance of $1M, 99% chance of nothing = $10,000.
- EV of two-boxing: 1% chance of $1M + $1K, 99% chance of $1K = $11,000.
So overall:
- EV of one-boxing = 50% * $0 + 50% * $10,000 = $5,000.
- EV of two-boxing = 50% * $0 + 50% * $11,000 = $5,500.

This dynamic holds regardless of your initial probabilities on how different versions of you will act, and regardless of your probability on being the sim vs. being real.)
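The claim that this holds for any starting probabilities can be checked with a short Python sketch. The names are illustrative, and the model assumes that sim-you gets nothing whatever happens, while real-you's box is full with whatever probability you assign to sim-you one-boxing.

```python
MILLION = 1_000_000
THOUSAND = 1_000

def selfish_cdt_values(p_sim_one_boxes, p_sim=0.5):
    """Causal expected value to *you* (whichever copy you are) of each option."""
    p = p_sim_one_boxes
    sim_payoff = 0  # the simulation is shut down regardless of its choice
    # If you are real, the box contents are already fixed: full with probability p.
    real_one = p * MILLION
    real_two = p * (MILLION + THOUSAND) + (1 - p) * THOUSAND
    one_box = p_sim * sim_payoff + (1 - p_sim) * real_one
    two_box = p_sim * sim_payoff + (1 - p_sim) * real_two
    return one_box, two_box

one_box, two_box = selfish_cdt_values(0.01)
print(f"one-box EV = {one_box:,.0f}, two-box EV = {two_box:,.0f}")

# Two-boxing comes out ahead by (1 - p_sim) * $1K across a grid of beliefs.
for p in (0.0, 0.01, 0.5, 0.99, 1.0):
    for p_sim in (0.0, 0.1, 0.5, 0.9):
        one_box, two_box = selfish_cdt_values(p, p_sim)
        assert two_box > one_box
```

The only case where the two options tie is certainty that you are the sim, in which case nothing you do matters to you.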

Of course, real-you can try to “acausally induce” sim-you to one-box, by one-boxing himself. But “acausally inducing” other versions of yourself to do stuff isn’t the CDT way; rather, it’s the type of magical thinking silliness that CDT is supposed to eschew.

Perhaps one objects: sim-you should care about real-you! For one thing, though, this seems unobvious: indexical selfishness seems perfectly consistent and understandable (and indeed, for anti-realists, you can care about whatever you want). But more importantly, it’s an objection to a utility function, rather than to two-boxing per se; and decision theorists don’t generally go in for objecting to utility functions. If the claim is that “CDT is compatible with indexically altruistic agents one-boxing in Newcomb cases involving simulations,” then fair enough. But what about everyone else?

This leads us to objection (2): namely, that the twin prisoner’s dilemma, which I take to be one of the strongest reasons to reject CDT, is precisely a case of indexical selfishness. Perhaps I am uncertain about which copy I am; but regardless, I only care about myself; and on CDT, whatever that other guy does, I should defect. But defecting on your perfect deterministic twin, I claim, is totally crazy, even if you are indexically selfish. So CDT, I think, is still wrong.

What’s more, as I noted above, we can imagine versions of the case where I do know who I am; for example, I am the one with the blue hat, he’s the one with the red hat; I am the one who wants to create flourishing Utopias, and he (the authorities changed my values during the copying process) wants to create paperclips. Unlike “sim vs. real,” these distinctions are epistemically accessible. Still, though, if my choices are sufficiently correlated with those of my copy (and mutual cooperation is sufficiently beneficial), I should cooperate.

This is related to objection (3): namely, that not all cases where CDT gives the wrong verdicts involve simulations, or uncertainty about “who you are.” Twin prisoner’s dilemmas, where you are slightly but discernibly different from your twin, are one example: no simulations or predictions necessary. But we might also wonder about Newcomb cases more broadly. Does Omega really need to be predicting your behavior via a simulation or model that you might actually be, in order for one-boxing to be the right call? This seems, at least, a substantive additional claim. And we might wonder about e.g. predicting your behavior via your genes (see e.g. Oesterheld (2015)), by observing lots of people who are “a lot like you,” or via some other unknown method.

That said, I want to acknowledge that one of the arguments for one-boxing that I find most persuasive – e.g., running the case lots of times with “play money,” before deciding what to do for real – works a lot better in contexts with very fine-grained prediction capabilities. This is because when I’m “playing around” with no real stakes, it makes more sense to imagine me using intricate and arbitrary decision-making processes, which the incentives at stake in the real case will not constrain. Thus, for example, maybe I try forms of pseudo-randomization (“I’ll one-box if the number of letters in the sentence I’m about to make up is odd” – see Aaronson here); maybe I try spinning myself around with my eyes closed, then pressing whichever button I see first; and so on. In order for Omega’s predictions to stay well-correlated with my behavior, here, it seems plausible she needs a very (unrealistically?) high-fidelity model. And we can say something similar about the twin prisoner’s dilemma. That is, the argument for cooperating is most compelling when his arm literally moves in logically-necessary lock-step with your own, as you reach towards the buttons. Once that’s not true, if we try to imagine a “play money” version of the case, then even with fairly minor psychological differences, your and your copy’s modes of “playing around” might de-correlate fast.

This feature of the intuitive landscape seems instructive. The sense that you acausally “control” what Omega predicts, or what your copy does, seems strongest when you can, as it were, do any old thing, for any old reason, and the correlation will remain. Once the correlation requires further constraints, the intuitive case weakens. That said, if you’re in the real case, with the real incentives, then it’s ultimately the correlation given those incentives that seems relevant: e.g., maybe Omega is accurate only for real-money cases; maybe you and your copy are only highly correlated when the real money comes out. In such a case, I think, you should still one-box/cooperate.

My final objection to the “appeal to uncertainty about who you are” sort of view is just: it doesn’t feel like uncertainty about whether I’m a simulation is actually driving my one-boxing impulse. In the play-money Newcomb case, for example, I feel like what actually persuades me is a visceral sense that “one-boxing is going to result in me having a million dollars, two-boxing is going to result in me having a thousand dollars.” Questions about whether I’m a simulation, or whether Omega needs to simulate me in order to achieve this level of accuracy, just aren’t coming into it.

I conclude, then, that simulation uncertainty and related ideas can’t save CDT. Aaronson thinks that he can “pocket the $1,000,000, but still believe that the future doesn’t affect the past.” I think he’s wrong — at least in many cases where one wants the million, and can get it. He should face, I think, a weirder music.

VIII. Maybe EDT?

But what sort of music, exactly? And exactly how weird are we talking? I don’t know.

Consider, for example, EDT – CDT’s most famous rival. I think that a lot of philosophers write off EDT too quickly. As I mentioned earlier, EDT has the unique and compelling distinction of being the only view to use the utility you should actually expect, given the performance of action X, in order to calculate the expected utility of performing action X. In this sense, it’s the basic, simple-minded Bayesian’s decision theory; the type of decision theory you would use if you were, you know, trying to predict the outcomes of different actions.
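To make the contrast concrete, here is a minimal Python sketch of the two calculations on Newcomb’s problem. The 99% predictor accuracy, the function names, and the way CDT is modeled (holding the probability that the opaque box is full fixed across actions) are illustrative assumptions rather than anything specified above.

```python
MILLION = 1_000_000
THOUSAND = 1_000

def edt_value(action, accuracy=0.99):
    """E[payoff | action]: condition the box's contents on the action taken.

    `accuracy` (an assumed figure) is the chance Omega's prediction matches the action.
    """
    p_full = accuracy if action == "one-box" else 1 - accuracy
    bonus = THOUSAND if action == "two-box" else 0
    return p_full * MILLION + bonus

def cdt_value(action, p_full):
    """Causal expected payoff: the chance that the box is full is held fixed at
    p_full, because the action cannot cause the contents to change."""
    bonus = THOUSAND if action == "two-box" else 0
    return p_full * MILLION + bonus

for action in ("one-box", "two-box"):
    print(action,
          "| EDT:", round(edt_value(action)),
          "| CDT with p_full=0.5:", round(cdt_value(action, 0.5)))
```

On this toy model EDT is just computing the payoff you should actually expect given each action, whereas CDT’s two-boxing comes out exactly $1K ahead for every fixed p_full.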

What’s more, a number of prominent objections to EDT seem to me much more complicated than they’re often made out to be. Consider, for example, the accusation that EDT endorses attempts to “manage the news.” There’s something true about this, but we should also be puzzled by it. Managing the news is obviously fine when you can influence the events the news is about. It’s fine, for example, to “manage the news” about whether you get a promotion, by working harder at the office. And it’s interestingly hard to “manage the news” successfully – e.g., change your rational credence in how good the future will be – with respect to things you can’t influence. Suppose, for example, that you’re worried (at, say, 70% credence) that your favored candidate lost yesterday’s election. Do you “manage the news” by refusing to read the morning’s newspaper, or by scribbling over the front page “Favored Candidate Wins Decisively!”? No: if you’re rational, your credence in the loss is still 70%.

Or take a somewhat more complicated case, discussed in Ahmed (2014). Suppose that you wake up not knowing what time it is, and all your clocks are broken. You hope that you’re not already late to work, and you consider running, to avoid either being late at all, or being later. Suppose, further, that people who run to work tend to be already late. Should you refrain from running, on the grounds that running would make it more likely that you’re already late? No. But plausibly, EDT doesn’t say you should, because running to work, in this case, wouldn’t be additional evidence that you’re already late, once we condition on the fact that you don’t know when you woke up, the reasons (including the subtle hunches about what time it might be) that you’d be running, and so on. After all, many of the already-late people running for work know that they’re already late, and are running for that reason. Your situation is different.

OK, so what does it take for the problematic type of news-management to be possible? This question matters, I think, because in some of the examples where EDT is supposed to go in for the problematic type of news-management, it’s not clear that the news-management in question would succeed. Consider:

Smoking lesion: Almost everyone who smokes has a fatal lesion, and almost everyone who doesn’t smoke doesn’t have this lesion. However, smoking doesn’t cause the lesion. Rather, the lesion causes people to smoke. Dying from the lesion is terrible, but smoking is pretty good. Should you smoke?

EDT, the objection goes, doesn’t smoke, here, because smoking increases your credence that you have the lesion. But this, the thought goes, is stupid. You’ve either already got the lesion, or you don’t have it and won’t get it. Either way, you should smoke. Not smoking is just “managing the news.”

I used to treat this case as a fairly decisive reason to reject EDT. Now I feel more confused about it. For starters, EDT clearly smokes in some versions of the case. Suppose, for example, that the way the lesion causes people to smoke is by making them want to smoke. Conditional on someone wanting to smoke, though, there’s no additional correlation between actually smoking and having the lesion. Thus, if you notice that you want to smoke (e.g., you feel a “tickle”), then that’s the bad news right there: you’ve already got all the smoking-related evidence you’re going to get about whether you’ve got the lesion. Actually smoking, or not, doesn’t change the news: so, no need for further management. This sort of argument will work for any mechanism of influence on your decision that you notice and update on. Thus the so-called “Tickle Defense” of EDT.
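The screening-off structure behind the tickle defense can be illustrated with a toy model in Python. Everything numerical here is a made-up assumption: a 50% lesion base rate, a lesion that acts only by producing the desire to smoke, and smoking that depends only on that desire. The sketch computes posteriors by brute-force enumeration of the eight possible worlds.

```python
from itertools import product

# All probabilities below are hypothetical, chosen only for illustration.
P_LESION = 0.5
P_DESIRE_GIVEN_LESION = {True: 0.95, False: 0.05}  # lesion -> desire ("tickle")
P_SMOKE_GIVEN_DESIRE = {True: 0.9, False: 0.1}     # desire -> smoking

def joint(lesion, desire, smoke):
    """Probability of one fully specified world under the causal chain above."""
    p = P_LESION if lesion else 1 - P_LESION
    p_desire = P_DESIRE_GIVEN_LESION[lesion]
    p *= p_desire if desire else 1 - p_desire
    p_smoke = P_SMOKE_GIVEN_DESIRE[desire]
    p *= p_smoke if smoke else 1 - p_smoke
    return p

def p_lesion_given(**observed):
    """P(lesion | observed), where observed may fix 'desire' and/or 'smoke'."""
    numerator = denominator = 0.0
    for lesion, desire, smoke in product([True, False], repeat=3):
        world = {"desire": desire, "smoke": smoke}
        if any(world[name] != value for name, value in observed.items()):
            continue  # this world is inconsistent with what was observed
        p = joint(lesion, desire, smoke)
        denominator += p
        if lesion:
            numerator += p
    return numerator / denominator

print("P(lesion | smoke)            =", round(p_lesion_given(smoke=True), 3))
print("P(lesion | no smoke)         =", round(p_lesion_given(smoke=False), 3))
print("P(lesion | tickle, smoke)    =", round(p_lesion_given(desire=True, smoke=True), 3))
print("P(lesion | tickle, no smoke) =", round(p_lesion_given(desire=True, smoke=False), 3))
```

Across the population, smoking is strong evidence of the lesion. Once the tickle is noticed and conditioned on, though, the last two posteriors coincide, so the act itself carries no further news. The defense depends on the lesion having no route to the decision other than one you observe, which is exactly the condition questioned next.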

Ok, but what if you don’t notice any tickle, or whatever other mechanism of influence is at stake? As Ahmed (2014, p. 91) characterizes it, the tickle defense assumes that all the inputs to your decision-making are “transparent” to you. But this seems like a strong condition, and granted greater ignorance, my sense is that in some versions of the case (for example, versions where the lesion makes you assign positive utility to smoking, but you don’t know what your utility function is, even as you use it in making decisions), EDT is indeed going to give the intuitively wrong result (see e.g., Demski’s “Smoking Lesion Steelman” for a worked example). Christiano argues that this is fine – “No matter how good your decision procedure is, if you don’t know a critical fact about the situation then you can make a decision that looks bad” – but I’m not so sure: prima facie, not smoking in smoking-lesion type cases seems like the type of mistake one ought to be able to avoid, even granted uncertainty about some aspects of your own psychology, and/or how the lesion works.

More generally, though, my sense is that really trying to dig into the details of tickle-defense type moves gets complicated fast, and that there’s some tension between (a) trying to craft a version of EDT where the “tickle defense” always works – e.g., one that somehow updates on everything influencing its decision-making (I’m not sure how this is supposed to work) – and (b) keeping EDT meaningfully distinct from CDT (see e.g. Demski’s sequence “EDT = CDT?”). Maybe some people are OK with collapsing the distinction, and OK, even, if EDT starts two-boxing in Newcomb’s problems (see e.g. Demski’s final comments here), and defecting on deterministic twins (I’ve been setting this possibility aside above, and following the standard understanding of how EDT acts in these cases). But for my part, a key reason I’m interested in EDT at all is because I’m interested in one-boxing and cooperating. Maybe I can get this in other ways (see e.g. the discussion of “follow the policy you would’ve committed to” below); but then, I think, EDT will lose much of its appeal (though not all; I also like the “basic Bayesian-ness” of it).

One other note on smoking lesion. You might think that the “do it over and over with monopoly money” type argument that I found persuasive earlier will give the intuitively wrong verdict on smoking lesion, suggesting that such an argument shouldn’t be trusted. After all, we might think, almost every time you smoke in a “play life,” you’ll end up with the play-lesion; and every time you don’t, you won’t. But note that when we dig in on this, the smoking lesion case can start to break in a maybe-instructive manner.

Suppose, for example, that I know that the base rate of lesions in the population is 50%, and I get “spawned” over and over into the world, where I can choose to smoke, or not. How can my “playing around” remain consistent with this 50% base rate? Imagine, for example, that I decide to refrain from smoking a million times in a row. If the case’s hypothesized correlations hold, then I will in fact spawn, consistently, without the lesion. In that case, though, it starts to look like my choice of whether to smoke or not actually is exerting a type of “control” over whether I get born as someone with the lesion – in defiance of the base rate. And if my choice can do that, then it’s not actually clear to me that non-smoking, here, is so crazy.

Maybe we could rule this out by fiat? “Well, if the base rate is 50%, then it turns out you will, in fact, decide to ‘play around’ in a way that involves smoking ~50% of the time” (thanks to Katja Grace for discussion). But this feels a bit forced, and inconsistent with the spirit of “play around however you want; it’ll basically always work” – the spirit that I find persuasive in Newcomb’s case and sufficiently-high-fidelity twin prisoner’s dilemmas. Alternatively, we could specify that I’m not allowed to know the base rate, and then we can shift it around to remain consistent with my making whatever play choices I want and spawning at the base rate. But now it looks like I can control the base rate of lesions! And if I can do that, once again, I start to wonder about whether non-smoking is so crazy after all.

That said, maybe the right thing to say here is just that the correlations posited in smoking lesion don’t persist under conditions of “play around however you want” – something that I expect holds true of various versions of Newcomb’s problem and Twin Prisoner’s dilemma as well.

What about other putative counter-examples to EDT? There are lots to consider, but at least one other one – namely, “Yankees vs. Red Sox” (see Arntzenius (2008)) — strikes me as dubious (though also, elegant). In this case, the Yankees win 90% of games, and you face a choice between the following bets:

                      Yankees win    Red Sox win
You bet on Yankees         1             -2
You bet on Red Sox        -1              2

Or, if we think of the outcomes here as “you win your bet” and “you lose your bet” instead, we get:

                      You win your bet    You lose your bet
You bet on Yankees           1                   -2
You bet on Red Sox           2                   -1

Before you choose your bet, an Oracle tells you whether you’re going to win your next bet. The issue is that once you condition on winning or losing (regardless of which), you should always bet on the Red Sox. So, the thought goes, EDT always bets on the Red Sox, and loses money 90% of the time. Betting on the Yankees every time does much better.
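A small Python sketch of the two ways of scoring the bets may help here. The payoffs and the 90% figure come from the case; the function names, and the assumption that the Oracle is perfectly accurate, are mine.

```python
P_YANKEES_WIN = 0.9
TEAMS = ("Yankees", "Red Sox")

# PAYOFF[(bet, winner)] as in the first matrix.
PAYOFF = {
    ("Yankees", "Yankees"): 1,
    ("Yankees", "Red Sox"): -2,
    ("Red Sox", "Yankees"): -1,
    ("Red Sox", "Red Sox"): 2,
}

def unconditional_ev(bet):
    """Expected payoff of a bet using only the 90% base rate."""
    return (P_YANKEES_WIN * PAYOFF[(bet, "Yankees")]
            + (1 - P_YANKEES_WIN) * PAYOFF[(bet, "Red Sox")])

def payoff_given_oracle(bet, you_win):
    """Payoff of a bet once you condition on the Oracle's 'you win' / 'you lose'."""
    other = TEAMS[1] if bet == TEAMS[0] else TEAMS[0]
    winner = bet if you_win else other
    return PAYOFF[(bet, winner)]

for bet in TEAMS:
    print(f"bet on {bet}: base-rate EV = {unconditional_ev(bet):+.1f}, "
          f"given 'win' = {payoff_given_oracle(bet, True):+d}, "
          f"given 'lose' = {payoff_given_oracle(bet, False):+d}")
```

Conditional on either announcement, the Red Sox row is better by one unit, while on the base rate alone the Yankees bet has the higher expectation. Note that `payoff_given_oracle` quietly lets the game’s winner vary with your bet while holding “you win” fixed.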

But something is fishy here. Specifically, the Oracle’s prediction, together with your knowledge of your own decision, leaks information that should render your decision-making unstable. Suppose, for example, that the Oracle tells you that you will lose your next bet. You then reason: “Conditional on knowing that I will lose my bet, I should bet on the Red Sox. But given that I’ll lose, this means that the Yankees will win, which means I should bet on the Yankees, which means I will win my bet. But I can’t win my bet, so the Yankees will lose, so I should bet on the Red Sox,” and so on. That is, you oscillate between reasoning using the second matrix, and reasoning using the first; and you never settle down.

(Note that if we allow for playing around with monopoly money, then this case, too, suffers from the same base-rate related problems as smoking lesion: e.g., either you can change the base rates of Yankee victory at will, or you’re somehow forced to play around in a manner consistent with both the 90% base rate and the Oracle’s accuracy, or somehow the Oracle’s accuracy doesn’t hold in conditions where you can play around.)

Even if we set aside smoking lesion and Yankees vs. Red Sox, though, there is at least one counterexample to EDT that seems to me pretty solidly damning, namely:

XOR blackmail: Having termites in your house is a million-dollar loss, and you don’t know if you have them. A credible and accurate predictor finds out if you have termites, then writes the following letter: “I am sending you this letter if and only if (a) I predict that you will pay me $1,000 upon receiving it, or (b) you have termites, but not both.” She then makes her prediction and follows the letter’s outlined procedure. If you receive the letter, should you pay?

(See Yudkowsky and Soares (2017), p. 24).

EDT pays, here. Why? Because conditional on paying, it’s much less likely that you’ve got termites, so paying is much better news than not paying. If you refuse to pay, you should call the exterminator (or do whatever you do with termites) pronto; if you pay, you can relax.

Or at least, you can relax for a bit. But if you’re EDT, you’re getting these letters all the time. Maybe the predictor decides to pull this stunt every day. You’re flooded with letters, all reflecting the prediction that you’ll pay. If you’d only stop paying, the letters would slow to a base-rate-of-termites-sized trickle. Try it with monopoly money: as you spawn over and over, you’ll find you can modulate the frequency of letter receipt at will, just by deciding to pay, or not, on the next round. But in real life, once you’ve got the letter, do you ever wise up, and decide, instead of paying, to already have termites? On EDT, it’s not clear (at least to me) why you would, absent some other change to the situation. Termites, after all, are terrible. And look at this letter, already sitting in your hand! It only comes given one of two conditions…
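Here is a rough Python sketch of how the two policies fare, per round of the stunt. It assumes a perfectly accurate predictor and a hypothetical 1% base rate of termites; neither number is part of the case as stated, and the function name is my own.

```python
TERMITE_DAMAGE = 1_000_000
PAYMENT = 1_000

def xor_blackmail(pays_on_letter, p_termites=0.01):
    """Return (letter frequency, P(termites | letter), expected loss per round).

    Assumes a perfectly accurate predictor, so "she predicts you pay" is just
    your policy. p_termites is a hypothetical base rate.
    """
    letter_freq = 0.0
    termites_and_letter = 0.0
    expected_loss = 0.0
    for termites, prob in ((True, p_termites), (False, 1 - p_termites)):
        letter_sent = pays_on_letter != termites  # exclusive or
        loss = TERMITE_DAMAGE if termites else 0
        if letter_sent:
            letter_freq += prob
            if termites:
                termites_and_letter += prob
            if pays_on_letter:
                loss += PAYMENT
        expected_loss += prob * loss
    p_termites_given_letter = (
        termites_and_letter / letter_freq if letter_freq > 0 else None
    )
    return letter_freq, p_termites_given_letter, expected_loss

for policy in (True, False):
    freq, posterior, loss = xor_blackmail(policy)
    label = "pay" if policy else "refuse"
    print(f"{label}: letters in {freq:.0%} of rounds, "
          f"P(termites | letter) = {posterior}, expected loss = {loss:,.0f}")
```

In this model the payer never has termites when holding a letter, which is why paying looks like such good news. But the payer gets letters in nearly every round, has termites exactly as often as the refuser, and ends up poorer by the payments.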

Perhaps one thinks: the core issue here isn’t that you’re getting so many letters. Even if you know that the predictor is only going to pull this stunt once, paying seems pretty silly. Why? It’s that old thing about the past having already happened, about the opaque box already being empty or full. You’ve either already got termites, or you don’t, dude: stop trying to manage the news.

But is that the core issue? Consider:

More active termite blackmail: The predictor gets more aggressive. Once a year, she writes the following letter: “I predicted whether you would pay me $1,000 upon receipt of this letter. If I predicted ‘yes,’ I left your house alone. If I predicted ‘no,’ I gave you termites.” Then she predicts, obeys the procedure, and sends. If you receive the letter, should you pay?

Here, the “it’s too late, dude” objection still applies. CDT ignores letters like this. But CDT also gets given termites once a year. EDT, by contrast, pays, and stays termite free. What’s more, by hypothesis, the stunt gets pulled on everyone the same number of times, regardless of their payment patterns. In this sense, it’s more directly analogous to Newcomb’s problem. And I find that paying, here, seems more intuitive than in the previous case (though the fact that you ultimately want to deter this sort of behavior from occurring at all may bring in additional complications; if it helps, we can specify that the predictor’s not actually in this for getting money or for giving people termites — rather, she just likes putting people in weird decision-theory situations, and will do this regardless of how her victims respond).

We can consider other problems with EDT as well, beyond XOR blackmail. For example, a naïve formulation of EDT has trouble with cases where it starts out certain about what it’s going to do, or even very confident (see e.g. the “cosmic ray problem” on p. 24 of Yudkowsky and Soares (2017)). And more generally, the “managing the news” flavor of EDT makes it feel, to me, like the type of thing one could come up with counter-examples to. But it’s XOR blackmail, I find, that currently gives me the most pause (and note, too, that in XOR blackmail, we can imagine that you have arbitrary introspective access, such that tickle-defense type questions about whether all the factors influencing your decision are “transparent” or not don’t really apply). And I think that the importance of the way paying influences how many letters you get, as opposed to its trying to “control the past” more broadly, may be instructive.

Summarizing this section, then, my current sense is that:

EDT’s “basic Bayesianism” makes it attractive.
Really digging into EDT, especially re: tickle defenses, can get kind of gnarly.
Yankees vs. Red Sox isn’t a good counterargument to EDT.
EDT messes up in XOR blackmail.
There are probably a bunch of other problems with EDT that I’m not really considering/engaging with.

Does this make EDT better or worse than CDT? Currently, I’m weakly inclined to say “better” – at least in theory. But trying to actually implement EDT also seems more liable to lead to pretty silly stuff. I’ll discuss some of this silly stuff in the final section. First, though, and motivated by XOR blackmail, I want to discuss one more broad bucket of decision-theoretic options and examples – namely, those associated with following policies you would’ve wanted yourself to commit to, even when it hurts.

IX. What would you have wanted yourself to commit to?

Consider:

Parfit’s hitchhiker: You are stranded in the desert without cash, and you’ll die if you don’t get to the city soon. A selfish man comes along in a car. He is an extremely accurate predictor, and he’ll take you to the city if he predicts that once you arrive, you’ll go to an ATM, withdraw ten thousand dollars, and give it to him. However, once you get to the city, he’ll be powerless to stop you from not paying.

If you get to the city, should you pay him? Both CDT and EDT answer: no. By the time you get to the city, the risk of death in the desert is gone. Paying him, then, is pure loss (assuming you don’t value his welfare, and there are no other downstream consequences). Because they answer this way, though, both CDT and EDT agents rarely make it to the city: the man predicts, accurately, that they won’t pay.

Is this a problem? Some might answer: no, because paying in the city is clearly irrational. In particular, it violates what MacAskill (2019) calls:

Guaranteed Payoffs: When you’re certain about what the pay-offs of your different options would be, you should choose the option with the highest pay-off.

Guaranteed Payoffs, we should all agree, is an attractive principle, at least in the abstract. If you’re not taking the higher payoff, when you know exactly what payoffs your different actions will lead to, then what the heck are you doing, and why would we call it “rationality”?
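The tension can be put in numbers with a small Python sketch. The dollar value placed on survival and the driver’s 99% accuracy are hypothetical assumptions of mine; only the $10,000 payment comes from the case.

```python
VALUE_OF_SURVIVAL = 1_000_000  # hypothetical dollar value of reaching the city alive
PAYMENT = 10_000

def policy_value(pays_in_city, accuracy=0.99):
    """Expected value, assessed in the desert, of being an agent with this policy.

    `accuracy` (an assumed figure) is the chance the driver predicts your policy
    correctly; you get a ride only if he predicts that you will pay.
    """
    p_ride = accuracy if pays_in_city else 1 - accuracy
    cost = PAYMENT if pays_in_city else 0
    return p_ride * (VALUE_OF_SURVIVAL - cost)

def value_once_in_city(pays_in_city):
    """Payoff of the one remaining choice, once survival is already secured."""
    return -PAYMENT if pays_in_city else 0

for pays in (True, False):
    label = "pay" if pays else "don't pay"
    print(f"{label}: policy value in the desert = {policy_value(pays):,.0f}, "
          f"payoff of the act in the city = {value_once_in_city(pays):,}")
```

Evaluated as a policy, paying wins by a wide margin; evaluated as an act with known payoffs, not paying wins by exactly $10,000. Guaranteed Payoffs is a claim about the second comparison.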

On the other hand, is paying the driver really so silly? To me, it doesn’t feel that way. Indeed, I feel happy to pay, here (though I also think that the case brings in extra heuristics about promise-keeping and gratitude that may muddy the waters; better to run it with a mean and non-conscious AI system who demands that you just burn the money in the street, and kills itself before you even get to the ATM). What’s more, I want to be the type of person who pays. Indeed: if, in the desert, I could set up some elaborate and costly self-binding scheme – say, a bomb that blows off my arm, in the city, if I don’t pay – such that paying in the city becomes straightforwardly incentivized, I would want to do it. But if that’s true, we might wonder, why not skip all this expensive faff with the bomb, and just, you know, pay in the city? After all, what if there are no bombs around to strap to my arm? What if I don’t know how to make bombs? Need my survival be subject to such contingencies? Why not learn, and practice, that oh-so valuable (and portable, and reliably available) skill instead: how to make, and actually keep, commitments? (h/t Carl Shulman, years ago, for suggesting this sort of framing.)

That said, various questions tend to blur together here – and once we pull them apart, it’s not clear to me how much substantive (as opposed to merely verbal) debate remains. Everyone agrees that it’s better to be the type of person who pays. Everyone agrees that if you can credibly commit to paying, you should do it; and that the ability to make and keep commitments is an extremely useful one. Indeed, everyone agrees that, if you’re a CDT or EDT agent about to face this case, it’s better, if you can, to self-modify into some other type of agent – one that will pay in the city (and are commitments and self-modifications really so different? Is cognition itself so different from self-modification?). As far as I can tell (and I’m not alone in thinking this), the only remaining dispute is whether, given these facts, we should baptize the action of paying in the city with the word “rational,” or if we should instead call it “an irrational action, but one that follows from a disposition it’s rational to cultivate, a self-modification it’s rational to make, a policy it’s rational to commit to,” and so on.

Is that an interesting question? What’s actually at stake, when we ask it? I’m not sure. As I mentioned above, I tend towards anti-realism about normativity; and for anti-realists, debates about the “true rationality” aren’t especially deep. Ultimately, there are just different ways of arranging your mind, different ways of making decisions, different shapes that can be given to this strange clay of self and world. Ultimately, that is, the question is just: what you in fact do in the city, and what in fact that decision means, implies, causes, and so on. We talk about “rationality” as a means of groping towards greater wisdom and clarity about these implications, effects, and so on; but if you understand all of this, and make your decisions in light of full information, additional disputes about what compliments and insults are appropriate don’t seem especially pressing.

All that said, terminology aside, I do think that Parfit’s hitchhiker-type cases can lead to genuinely practical and visceral forms of internal conflict. Consider:

Deterrence: You have a button that will destroy the world. The aliens want to invade, but they want the world intact, and they won’t invade if they predict that you’ll destroy the world upon observing their invasion. Being enslaved by the aliens is better than death; but freedom far better. The aliens predict that you won’t press the button, and so start to invade. Should you destroy the world?

This is far from a fanciful thought experiment. Rather, this is precisely the type of dynamic that decision-makers with real nuclear codes at their fingertips have to deal with. Same with tree-huggers chaining themselves to trees, teenagers playing chicken, and so on.

Or, more fancifully, consider:

Counterfactual mugging: Omega doesn’t know whether the X-th digit of pi is even or odd. Before finding out, she makes the following commitment. If the X-th digit of pi is odd, she will ask you for a thousand dollars. If the X-th digit is even, she will predict whether you would’ve given her the thousand had the X-th digit been odd, and she will give you a million if she predicts “yes.” The X-th digit is odd, and Omega asks you for the thousand. Should you pay?

(I use logical randomness, rather than e.g. coin-flipping, to make it more difficult to appeal to concern about versions of yourself that live in other quantum branches, possible worlds, and so on. Thanks to Katja Grace for suggesting this. That said, perhaps some such appeals are available regardless. For example, how did X get decided?)
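
The conflict in this case can be made concrete with a little arithmetic. The sketch below is only an illustration, and it rests on assumptions that are not part of the case as stated: that the "prior" position assigns credence 0.5 to the X-th digit being odd, and that Omega's prediction is perfectly accurate. The function names are hypothetical.

```python
def ex_ante_value(pays_when_asked: bool, p_odd: float = 0.5) -> float:
    """Expected dollars for a policy, evaluated before the digit's parity is known.

    Illustrative assumptions: Omega predicts perfectly, and p_odd is your
    credence that the X-th digit is odd.
    """
    cost_if_odd = -1_000 if pays_when_asked else 0
    reward_if_even = 1_000_000 if pays_when_asked else 0
    return p_odd * cost_if_odd + (1 - p_odd) * reward_if_even


def ex_post_value(pays_when_asked: bool) -> float:
    """Dollars once you already know the digit is odd and Omega is asking."""
    return -1_000 if pays_when_asked else 0


for pays in (True, False):
    print(pays, ex_ante_value(pays), ex_post_value(pays))
```

Under these assumptions the paying policy looks better from the prior position, while refusing looks better once the digit is known to be odd. The numbers say nothing about which of those two positions is the right one to evaluate from.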

Finally, consider a version of Newcomb’s problem in which both boxes are transparent – e.g., you can see how Omega has predicted you’ll behave. Suppose you find that Omega has predicted that you’ll one-box, and so left the million there. Should you one-box, or two-box? What if Omega has predicted that you’ll two-box?
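
One way to see what is at stake in the transparent version is to score whole policies rather than single actions. The sketch below assumes, purely for illustration, that Omega leaves the million in the big box if and only if your policy one-boxes when it sees the million there, and that the boxes hold $1,000,000 and $1,000. That rule for Omega, and all the names, are assumptions rather than part of the case as stated.

```python
from itertools import product

ONE_BOX, TWO_BOX = "one-box", "two-box"


def payoff(policy: dict) -> int:
    """Dollars earned by a policy mapping what you see ('full' or 'empty') to an action.

    Assumption: Omega fills the big box iff the policy one-boxes on seeing it full.
    """
    box_full = policy["full"] == ONE_BOX
    action = policy["full" if box_full else "empty"]
    big = 1_000_000 if box_full else 0
    small = 1_000 if action == TWO_BOX else 0
    return big + small


# Enumerate all four policies: an action for each possible observation.
policies = [
    dict(zip(("full", "empty"), actions))
    for actions in product((ONE_BOX, TWO_BOX), repeat=2)
]
for policy in policies:
    print(policy, payoff(policy))
```

In this toy model, the policies that one-box on seeing the million end up with it, even though, with the million in plain view, taking both boxes would add a thousand.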

We can think of all these cases as involving an inconsistency between the policy that an agent would want to adopt, at some prior point in time/from some epistemic position (e.g., before the aliens invade, before we know the value of the X-th digit, before Omega makes her predictions), and the action that Guaranteed Payoffs would mandate given full information. And there are lots of other cases in this vein as well (see e.g., The Absent-Minded Driver, and the literature on dynamic inconsistency in game theory).

There is a certain broad class of decision theories, a number of which are associated with the Machine Intelligence Research Institute (MIRI), that put resolving this type of inconsistency in favor of something like “the policy you would’ve wanted to adopt” at center stage. (In general, MIRI’s work on decision theory has heavily influenced my own thinking – influence on display throughout this post. See also Meacham (2010) for another view in this vein, as well as the work of Wei Dai and others on “updatelessness.”) There are lots of different ways to do this (see e.g. the discussion of the 2x2x3 matrix here), and I don’t feel like I have a strong grip on all of the relevant choice-points. Many of these views are united, though, in violating Guaranteed Payoffs, for reasons that feel, spiritually, pretty similar.

What’s more, and importantly, these theories tend to get cases like XOR blackmail right, where e.g. classic EDT gets them wrong. Consider, for example, whether before you receive any letter, you would want to commit to paying, or not paying, upon receipt. If we assume that the base rate of termites will stay constant regardless, then committing to not paying seems the clear choice. After all, doing so won’t make it more likely that you get termites; rather, it’ll make it less likely that you get letters.
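
The commitment argument can be checked with a small calculation. The sketch below assumes the usual setup, in which the letter is sent if and only if exactly one of "you have termites" and "you would pay on receipt" is true, and it uses illustrative dollar figures (a $1,000,000 termite bill and a $1,000 demand). The function names are hypothetical.

```python
def expected_cost(pays_on_letter: bool, p_termites: float,
                  damage: float = 1_000_000, demand: float = 1_000) -> float:
    """Expected dollars lost under a policy fixed before any letter arrives."""
    cost = 0.0
    for termites, prob in ((True, p_termites), (False, 1 - p_termites)):
        # The letter is sent iff exactly one of the two conditions holds.
        letter = termites != pays_on_letter
        paid = demand if (letter and pays_on_letter) else 0.0
        cost += prob * ((damage if termites else 0.0) + paid)
    return cost


def letter_probability(pays_on_letter: bool, p_termites: float) -> float:
    """Chance of receiving a letter at all under each policy."""
    return 1 - p_termites if pays_on_letter else p_termites


for pays in (True, False):
    print(pays, expected_cost(pays, 0.25), letter_probability(pays, 0.25))
```

With the base rate of termites held fixed, the two policies carry the same expected termite damage. The paying policy adds the demand in every termite-free world, and when termites are rare it also receives more letters.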

If necessary, these theories can also get results like one-boxing, and cooperating with your twin, without appeal to any weird magic about controlling the past. After all, one-boxing and cooperating are both policies that you would want yourself to commit to, at least from some epistemic positions, even in a plain-old, common-sense, CDT-spirited world. Maybe executing these policies looks like trying to execute some kind of acausal control — and maybe, indeed, advocates of such policies talk in terms of such control. But maybe this is just talk. After all, executing policies that violate Guaranteed Payoffs looks pretty weird in general (for example, it looks like burning money for certain), and perhaps we need not take decisions about how to conceptualize such violations all that seriously: the main thing is what happens with the money.

A key price of this approach, though, is the whole “burning money for certain” thing; and here, perhaps, some people will want to get off the train. “Look, I was down for one-boxing, or for cooperating with my twin, when I didn’t actually know the payoffs in question. But violating Guaranteed Payoffs is just too much! You’re just destroying value for certain. That’s all. That’s the whole thing you do. You blow up the world, trying to prevent something that you know has already happened. Yes, it’s good to commit to doing that ex ante. But ex post, isn’t it also just obviously stupid?”

For people with this combination of views, though, I think it’s important to keep in mind the spiritual continuity between violating Guaranteed Payoffs, and one-boxing/cooperating more generally. After all, one of the strongest arguments for two-boxing is that, if you knew what was in the box (like, e.g., your friend does), you’d be in a Guaranteed Payoffs-type situation, and then a follower of Guaranteed Payoffs would two-box every time. Indeed, I think that part of why “great grandpappy Omega, now long dead, leaves the boxes in the attic” prompts a two-boxing intuition is that in the attic, you sense that you’re about to move from a non-transparent Newcomb’s problem to a transparent one. That is, after you bring the one-box down from the attic, and open it, the other box isn’t going to disappear. The attic door is still open. The stairs still beckon. You could just go back up there and get that thousand. Why not do it? If you got the million, it’s not going to evaporate. And if you didn’t get the million, what’s the use of letting a thousand go to waste? But that’s just the type of thinking that leads to empty boxes…

X. Building statues to the average of all logically-impossible Gods

Overall, I don’t see violations of Guaranteed Payoffs as a decisive reason to reject approaches in the vein of “act in line with the policy you would’ve wanted to commit to from some epistemic position P” – and some disputes in this vicinity strike me as verbal rather than substantive. That said, I do want to flag an additional source of uncertainty about such approaches: namely, that it seems extremely unclear what they actually imply.

In particular, all the “violate Guaranteed Payoffs” cases above rely on some implied “prior” epistemic position (e.g., before the aliens invade, before Omega has made her prediction, etc), relative to which the policy in question is evaluated. But why is that the position, instead of some other one? Even if we were just “rewinding” your own epistemology (e.g., to back before you knew that the aliens were invading, but after you learned about how they were going to make their decision), there would be a question of how far to rewind. Back to your childhood? Back to before you were born, and were an innocent platonic soul about to be spawned into the world? What features does this soul have? In what order were those features added? Does your platonic soul know basic facts about logic? What credence does it have that it’ll get born as a square circle, or into a world where 2+2=5? What in the goddamn hell are we talking about?

Also, it isn’t just a question of “rewinding” your own epistemology to some earlier epistemic position you (or even, a stripped-down version of you) held. There may be no actual time when you knew the information you’re trying to “remember” (e.g., that Omega is going to pull a counter-factual-mugging type stunt) but not the information you’re trying to “forget” (e.g., that the X-th digit of pi is odd). So it seems like the epistemic position in question may need to be one that no one – and certainly not you — has ever, in fact, occupied. How are we supposed to pick out such a position? What desiderata are even relevant? I haven’t engaged much with questions in this vein, but currently, I basically just don’t know how this is supposed to work. (I’m also not the only one with these questions. See e.g. Demski here, on the possibility that “updatelessness is doomed,” and Christiano here. And they’ve thought more about it.)

What’s more, some (basically all?) of these epistemic positions don’t seem particularly exciting from a “winning” perspective – and not just because they violate Guaranteed Payoffs. For example: weren’t you a member of some funky religion as a child – one that you now reject? And weren’t you more generally kind of dumb and ignorant? Are you sure you want to commit to a policy from that epistemic position (see e.g. Kokotajlo (2019) for more)? Or are we, maybe, imagining a superintelligent version of your childhood self, who knows everything? But wait: don’t forget to forget stuff, too, like what will end up in the boxes. But what should we “forget,” what should we “remember,” and what should we learn-for-the-first-time-because-apparently-we’re-talking-about-superintelligences now?

And even if we had such an attractive and privileged epistemic position identified, it seems additionally hard (totally impossible?) to know what policy this position would actually imply. Suppose, to take a normal everyday example that definitely doesn’t involve any theoretical problems, that you are about to be inserted as a random “soul” into a random “world.” What policy should you commit to? As Parfit’s Hitchhiker, should you pay in the city? Or should you, perhaps, commit to not getting into the man’s car at all, even if doing so is free, in order to disincentivize your younger self from taking ill-advised trips into the desert? Or should you, perhaps, commit to carving the desert sands into statues of square circles, and then burning yourself at the stake as an offering to the average of all logically impossible Gods? One feels, perhaps, a bit at sea; and a bit at risk of, as it were, doing something dumb. After all, you’ve already gone in for burning value for certain; you’ve already started trying to reason like someone you’re not, in a situation that you aren’t in. And without constraints like “don’t burn value for certain” as a basic filter on your action space, the floodgates open wide. One worries about swimming well in such water.

XI. Living with magic

Overall, the main thing I want to communicate in this post is: I think that the perfect deterministic twin’s prisoner’s dilemma case basically shows that there is such a thing as “acausal control,” and that this is super duper weird. For all intents and purposes, you can decide what gets written on whiteboards light-years away; you can move another man’s arm, in lock-step with your own, without any causal contact between him and you. It actually works, and that, I think, is pretty crazy. It’s not the type of power we think of ourselves as having. It’s not the type of power we’re used to trying to wield.

What does trying to wield it actually look like, especially in our actual lives? I’m not sure. I don’t have a worked-out decision theory that makes sense of this type of thing, let alone a view about how to apply it. As a first pass, though, I’d probably start by trying to figure out what EDT actually implies, once you account for (a) tickle-defense type stuff, and (b) decorrelations between your decision and the decisions of others that arise because you’re doing some kind of funky EDT-type reasoning, and they probably aren’t.

For example: suppose that you want other people to vote in the upcoming election. Does this give you reason to vote, not out of some sort of abstract “be the change you want to see in the world” type of ethic, but because, more concretely, your voting, even in causal isolation from everyone else, will literally (if acausally) increase non-you voter turnout? Let’s first stop and really grok that voting for this reason is a weird thing to do. You’re not just trying to obey some Kantian maxim, or to do your civic duty. You’re not just saying “what if everyone acted like that?” in the abstract, like a schoolteacher to an errant child, with no expectation that “everyone,” as it were, actually will. And you’re certainly not knocking on doors or driving neighbors to the polls. Rather, you’re literally trying to influence the behavior of other people you’ll never interact with, by walking down to the voting booth on your causally isolated island. Indeed, maybe your island is in a different time zone, and you know that the polls everywhere else are closed. Still, you reason, your choice’s influence can slip the surly bonds of space and time; the evening news can still be managed (indeed, some non-EDT decision theories vote even after they’ve seen the evening news).

Is this sort of thinking remotely sensible? Well, note that the EDT version, at least, makes sense only if you should actually expect a higher non-you voter turnout, conditional on you voting for this sort of reason, than otherwise. If the voting population is “perfect deterministic copies of myself who will see the exact same inputs,” this condition holds; and it holds in various weaker conditions, too. How much does it hold in the real world, though? That’s much less clear; and as ever, if you’re considering trying to manage the news, the first thing to check is whether the news is actually manageable.
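
The condition in question can be written down in a toy model. The sketch below assumes, purely for illustration, that some fraction of the other voters are perfect copies who do whatever you do, while the rest vote at a fixed base rate regardless of your choice. This mixture model and its parameter names are assumptions, not a claim about real electorates.

```python
def expected_other_turnout(i_vote: bool, n_others: int,
                           copy_fraction: float, base_rate: float) -> float:
    """Expected number of other voters, conditional on your own choice.

    Toy assumption: copy_fraction of the others mirror your choice exactly,
    and the remainder turn out independently at base_rate.
    """
    copies = n_others * copy_fraction
    independents = n_others - copies
    return (copies if i_vote else 0.0) + independents * base_rate


def news_is_manageable(n_others: int, copy_fraction: float, base_rate: float) -> bool:
    """True iff voting is evidence of higher turnout among the others."""
    voting = expected_other_turnout(True, n_others, copy_fraction, base_rate)
    abstaining = expected_other_turnout(False, n_others, copy_fraction, base_rate)
    return voting > abstaining


print(news_is_manageable(1000, 1.0, 0.5))  # all others are perfect copies
print(news_is_manageable(1000, 0.0, 0.5))  # no correlation at all
```

In this model the evidential gain from voting is just the number of copies, so everything turns on how large that fraction is in the real world.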

In particular, as Abram Demski emphasizes here, the greater the role of weird-decision-theory type calculations in your thinking, the less correlated your decisions will be with those of others who are thinking in less esoteric ways. Perhaps you should consider the influence of your behavior on the other people interested in non-causal decision-theories (evening news: “the weird decision theorists turn out in droves!”); but it’s a smaller demographic. That said, what sorts of correlations are at stake here is an empirical question, and there’s no guarantee that something common-sensical will emerge victorious. It seems possible, for example, that many people are implicitly implementing some proto-version of your decision theory, even if they’re not explicit about it.

Here’s another case that seems to me even weirder. Suppose that you’re reading about some prison camps from World War I. They sound horrible, but the description leaves many details unspecified, and you find yourself hoping that the guards in the prison camps were as nice as would be compatible with the historical evidence you’ve seen thus far. Does this give you, perhaps, some weak reason to be nicer to other people, in your own life, on the grounds that there is some weak correlation between your niceness, and the niceness of the guards? You’re all, after all, humans; you’ve got extremely similar genes; you’re subject to broadly similar influences; perhaps you and some of the guards are implementing vaguely similar decision procedures at some level; perhaps even (who knows?) there was some explicit decision theory happening in the trenches. Should you try to be the change you want to see in the past? Should you, now, try to improve the conditions in World War I prison camps? And if so: have you, perhaps, lost your marbles?

Perhaps some people will answer: look, the correlations are too weak, here, for such reasoning to get off the ground. To others, though, this will seem the wrong sort of reply. The issue isn’t that you’re wrong, empirically, about the correlations at stake – indeed, the extent of such correlations seems, in some sense, an open question. The issue is that you’re trying to improve the past at all.

There are other weird applications to consider as well. For example, once you can “control” things you have no causal interaction with, your sphere of possible control could in principle expand throughout a very large universe, allowing you to “influence” the behavior of aliens, other quantum branches, and so on (see e.g. Oesterheld (2017) for more). Indeed, there’s an argument for treating yourself as capable of such influence, even if you have comparatively low credence on the relevant funky decision theories, because being able to influence the behavior of tons of agents raises the stakes of your choice (see e.g. MacAskill et al (2019)). And taken seriously enough, the possibility of non-causal influence can lead to a very non-standard picture of the future – one in which “interactions” between causally-isolated civilizations throughout the universe/multi-verse move much closer to center stage.

Once you’ve started trying to acausally influence the behavior of aliens throughout the multiverse, though, one starts to wonder even more about the whole lost-your-marbles thing. And even if you’re OK with this sort of thing in principle, it’s a much further question whether you should expect any efforts in this broad funky-decision-theoretic vein to go well in practice. Indeed, my strong suspicion is that with respect to multiverse-wide whatever whatevers, for example, any such efforts, undertaken with our current level of understanding, will end up looking very misguided in hindsight, even if the decision theory that motivated them ends up vindicated. Here I think of Bostrom’s “ladder of deliberation,” in which one notices that whether an intervention seems like a good or bad idea switches back and forth as one reasons about it more, with no end in sight, thus inducing corresponding pessimism about the reliability of one’s current conclusions. Even if the weird-decision-theory ladder is sound, we are, I think, on a pretty early rung.

Overall, this whole “acausal control” thing is strange stuff. I think we should be careful with it, and generally avoid doing things that look stupid by normal lights, especially in the everyday situations our common-sense is used to dealing with. But the possibility of new, weird forms of control over the world also seems like the type of thing that could be important; and I think that perfect deterministic twins demonstrate that something in this vicinity is, at least sometimes, real. Its nature and implications, therefore, seem worth attention.

(My thanks to Paul Christiano, Bastian Stern, Nisan Stiennon, and especially to Katja Grace and Ketan Ramakrishnan, for discussion. And thanks, as well, to Abram Demski, Scott Garrabrant, Nick Beckstead, Rob Bensinger, and Ben Pace, for this exchange on related topics.)

Vladimir Nesov (3y)

Agent's policy determines how its instances act, but in general it also determines which instances exist, and that motivates thinking of the agent as the algorithm channeled by instances rather than as one of the instances controlling the others, or as all instances controlling each other. For example, in Newcomb's problem, you might be sitting inside the box with the $1M, and if you two-box, you have never existed. Grandpa decides to only have children if his grandchildren one-box. Or some copies in distant rooms numbered (on the outside) 1 to 5 writing integers on blackboards, with only the rooms whose number differs from the integer written by at most 1 being occupied. In the occupied rooms, the shape of the digits is exactly the same, but the choice of the integers determines which (if any) of the rooms are occupied. You may carefully write a 7, and all rooms are empty.
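
The blackboard example is small enough to write out directly. The following is a minimal sketch of the occupancy rule as described in the comment (rooms numbered 1 to 5, occupied only if the room number differs from the written integer by at most 1); the function name is hypothetical.

```python
def occupied_rooms(written: int, rooms=range(1, 6)) -> list:
    """Rooms that contain a copy, given the integer the shared policy writes."""
    return [room for room in rooms if abs(room - written) <= 1]


for n in (1, 3, 6, 7):
    print(n, occupied_rooms(n))
```

Writing 3 leaves rooms 2, 3 and 4 occupied, while writing 7 leaves every room empty: the same choice that fixes what appears on each blackboard also fixes which instances are there to write it.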

If you are the algorithm, which algorithm are you, and what instances are running you? Unfortunate policy decisions, such as thinking too much, can sever control over some instances, as in ASP, or when (as an instance) retracting too much knowledge (UDT-style) and then (as a resulting algorithm) having to examine too many possible states of knowledge or of possible observations, grasping at a wider scope but losing traction, because the instances can no longer channel such an algorithm. Decisions of some precursor algorithm may even determine which successor algorithm an instance is running, not just which policy a fixed algorithm executes, in which case identifying with the instance is even less coherent than if it can merely cease to exist.

Steve Byrnes (1y), Review for 2021 Review

For a long time, I could more-or-less follow the logical arguments related to e.g. Newcomb’s problem, but I didn’t really get it, like, it still felt wrong and stupid at some deep level. But when I read Joe’s description of “Perfect deterministic twin prisoner’s dilemma” in this post, and the surrounding discussion, thinking about that really helped me finally break through that cloud of vague doubt, and viscerally understand what everyone’s been talking about this whole time. The whole post is excellent; very strong recommend for the 2021 review.

Dweomite (3y)

I agree that figuring out what you "should have" precommitted can be fraught.

One possible response to that problem is to set aside some time to think about hypotheticals and figure out now what precommitments you would like to make, instead of waiting for those scenarios to actually happen. So the perspective is "actual you, at this exact moment".

I sometimes suspect you could view MIRI's decision theories as an example of this strategy.

Alice: Hey, Bob, have you seen this "Newcomb's problem" thing?
Bob: Fascinating. As we both have unshakable faith in CDT, we can easily agree that two-boxing is correct if you are surprised by this problem, but that you should precommit to one-boxing if you have the opportunity.
Alice: I was thinking--now that we've realized this, why not precommit to one-boxing right now? You know, just in case. The premise of the problem is that Omega has some sort of access to our actual decision-making algorithm, so in principle we can precommit just by deciding to precommit.
Bob: That seems unobjectionable, but not very useful in expectation; we're very unlikely to encounter this exact scenario. It seems like what we really ought to do is make a precommitment for the whole class of problems of which Newcomb's problem is just one example.
Alice: Hm, that seems tricky to formally define. I'm not sure I can stick to the precommitment unless I understand it rigorously. Maybe if...
--Alice & Bob do a bunch of math, and eventually come up with a decision strategy that looks a lot like MIRI's decision theory, all without ever questioning that CDT is absolutely philosophically correct?--

Possibly it's not that simple; I'm not confident that I appreciate all the nuances of MIRI's reasoning.

David Xu (3y)

The output of this process is something people have taken to calling Son-of-CDT; the problem (insofar as we understand Son-of-CDT well enough to talk about its behavior) is that the resulting decision theory continues to neglect correlations that existed prior to self-modification.

(In your terms: Alice and Bob would only one-box in Newcomb variants where Omega based his prediction on them after they came up with their new decision theory; Newcomb variants where Omega's prediction occurred before they had their talk would still be met with two-boxing, even if Omega is stipulated to be able to predict the outcome of the talk.)

This still does not seem like particularly sane behavior, which means, unfortunately, that there's no real way for a CDT agent to fix itself: it was born with too dumb of a prior for even self-modification to save it.

Vladimir Nesov (3y)

One way of noticing the Son-of-CDT issue dxu mentioned is thinking of CDT as not just being unable to control the events outside the future lightcone, but as not caring about the events outside the future lightcone. So even if it self-modifies, it's not going to accept tradeoffs between the future and not-the-future of the self-modification event, as that would involve changing its preference (and somehow reinventing preference for the events it didn't care about just before the self-modification event).

With time, CDT continually becomes numb to events outside its future, loses parts of its values. Self-modifying to Son-of-CDT stops further loss, but doesn't reverse past loss.

Dweomite (3y)

Suppose you run your twins scenario, and the twins both defect. You visit one of the twins to discuss the outcome.

Consider the statement: "If you had cooperated, your twin would also have cooperated, and you would have received $1M instead of $1K." I think this is formally provable, given the premises.

Now consider the statement: "If you had cooperated, your twin would still have defected, and you would have received $0 instead of $1K." I think this is also formally provable, given the premises. Because we have assumed a deterministic AI that we already know will defect given this particular set of inputs! Any statement that begins "if you had cooperated..." is assuming a contradiction, from which literally anything is formally provable.

You say in the post that only the cooperate-cooperate and defect-defect outcomes are on the table, because cooperate-defect is impossible by the scenario's construction. I think that cooperate-cooperate and defect-defect aren't both on the table, either. Only one of those outcomes is consistent with the AI program that you already copied. If we can say you don't need to worry about cooperate-defect because it's impossible by construction, then in precisely what sense are cooperate-cooperate and defect-defect both still "possible"?

I feel like most people have a mental model for deterministic systems (billiard balls bouncing off each other, etc.) and a separate mental model for agents. If you can get your audience to invoke both of these models at once, you have probably instantiated in their minds a combined model with some latent contradiction in it. Then, by leading your audience down a specific path of reasoning, you can use that latent contradiction to prove essentially whatever you want.

(To give a simple example, I've often seen people ask variations of "does (some combinatorial game) have a 50/50 win rate if both sides play optimally?" A combinatorial game, played optimally, has only one outcome, which must occur 100% of the time; but non-mathematicians often fail to notice this, and apply their usual model of "agents playing a game" even though the question constrained the "agents" to optimal play.)
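
The point about optimal play can be illustrated with a tiny solver. The game below is chosen only for illustration and is not taken from the comment: a single pile of stones, players alternately remove 1, 2 or 3, and whoever takes the last stone wins.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def mover_wins(stones: int) -> bool:
    """True iff the player about to move wins when both sides play optimally."""
    # With no stones left there are no legal moves, so any() over an empty
    # sequence is False: the previous player took the last stone and won.
    return any(
        not mover_wins(stones - take)
        for take in (1, 2, 3)
        if take <= stones
    )


for n in range(1, 9):
    print(n, mover_wins(n))
```

For each starting pile the solver returns a single fixed answer rather than a win rate. In this particular game the player to move loses exactly when the pile size is a multiple of 4.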

I notice this post uses a lot of phrases like "it actually works" and "try it yourself" when talking about the twins example. Unless there's been a recent breakthrough in mind uploading that I haven't heard about, this wording implies empirical confirmation that I'm pretty confident you don't have (and can't get).

If you were forced to express your hypothetical scenarios in computer source code, instead of informal English descriptions, I think it would probably be pretty easy to run some empirical tests and see which strategies actually get better outcomes. But I don't know, and I suspect you don't know, how to "faithfully" represent any of these examples as source code. This leaves me suspicious that perhaps all the interesting results are just confusions, rather than facts about the universe.
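
For concreteness, here is one possible encoding of the twins case as source code. It is only a sketch, and it does not settle whether any such encoding is "faithful" in the sense at issue. The payoff rule (cooperating sends $1,000,000 to the other player, defecting keeps $1,000 for yourself) is an assumption chosen to match the dollar figures quoted above, and all names are hypothetical.

```python
from typing import Callable, Tuple

Program = Callable[[str], str]  # maps an input string to "C" or "D"


def payoffs(a: str, b: str) -> Tuple[int, int]:
    """Dollar payoffs (first player, second player) for a pair of moves."""
    def one_side(mine: str, theirs: str) -> int:
        received = 1_000_000 if theirs == "C" else 0  # sent by a cooperating twin
        kept = 1_000 if mine == "D" else 0            # kept by defecting
        return received + kept
    return one_side(a, b), one_side(b, a)


def run_twins(program: Program, shared_input: str) -> Tuple[int, int]:
    """Run the same deterministic program twice on the same input."""
    return payoffs(program(shared_input), program(shared_input))


def always_cooperate(_shared_input: str) -> str:
    return "C"


def always_defect(_shared_input: str) -> str:
    return "D"


print(run_twins(always_cooperate, "same inputs"))
print(run_twins(always_defect, "same inputs"))
```

In this encoding, mismatched outcomes are unreachable for any deterministic program, and the cooperating program ends up with more money than the defecting one. Whether comparing two different programs in this way captures what "if you had cooperated" means for a single fixed program is exactly the question the comment raises.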
