by Johannes Treutlein 286 days ago | Tom Everitt, Alex Appel, Abram Demski and Paul Christiano like this | link | parent | on: Smoking Lesion Steelman From my perspective, I don’t think it’s been adequately established that we should prefer updateless CDT to updateless EDT I agree with this. It would be nice to have an example which doesn’t arise from an obviously bad agent design, but I don’t have one. I’d also be interested in finding such a problem. I am not sure whether your smoking lesion steelman actually makes a decisive case against evidential decision theory. If an agent knows about their utility function on some level, but not on the epistemic level, then this can just as well be made into a counter-example to causal decision theory. For example, consider a decision problem with the following payoff matrix: Smoke-lover: Smokes: Killed: 10 Not killed: -90 Doesn’t smoke: Killed: 0 Not killed: 0 Non-smoke-lover: Smokes: Killed: -100 Not killed: -100 Doesn’t smoke: Killed: 0 Not killed: 0 For some reason, the agent doesn’t care whether they live or die. Also, let’s say that smoking makes a smoke-lover happy, but afterwards, they get terribly sick and lose 100 utilons. So they would only smoke if they knew they were going to be killed afterwards. The non-smoke-lover doesn’t want to smoke in any case. Now, smoke-loving evidential decision theorists rightly choose smoking: they know that robots with a non-smoke-loving utility function would never have any reason to smoke, no matter which probabilities they assign. So if they end up smoking, then this means they are certainly smoke-lovers. It follows that they will be killed, and conditional on that state, smoking gives 10 more utility than not smoking. Causal decision theory, on the other hand, seems to recommend a suboptimal action. Let $$a_1$$ be smoking, $$a_2$$ not smoking, $$S_1$$ being a smoke-lover, and $$S_2$$ being a non-smoke-lover. Moreover, say the prior probability $$P(S_1)$$ is $$0.5$$. Then, for a smoke-loving CDT bot, the expected utility of smoking is just $$\mathbb{E}[U|a_1]=P(S_1)\cdot U(S_1\wedge a_1)+P(S_2)\cdot U(S_2\wedge a_1)=0.5\cdot 10 + 0.5\cdot (-90) = -40$$, which is less then the certain $$0$$ utilons for $$a_2$$. Assigning a credence of around $$1$$ to $$P(S_1|a_1)$$, a smoke-loving EDT bot calculates $$\mathbb{E}[U|a_1]=P(S_1|a_1)\cdot U(S_1\wedge a_1)+P(S_2|a_1)\cdot U(S_2\wedge a_1)\approx 1 \cdot 10 + 0\cdot (-90) = 10$$, which is higher than the expected utility of $$a_2$$. The reason CDT fails here doesn’t seem to lie in a mistaken causal structure. Also, I’m not sure whether the problem for EDT in the smoking lesion steelman is really that it can’t condition on all its inputs. If EDT can’t condition on something, then EDT doesn’t account for this information, but this doesn’t seem to be a problem per se. In my opinion, the problem lies in an inconsistency in the expected utility equations. Smoke-loving EDT bots calculate the probability of being a non-smoke-lover, but then the utility they get is actually the one from being a smoke-lover. For this reason, they can get some “back-handed” information about their own utility function from their actions. The agents basically fail to condition two factors of the same product on the same knowledge. Say we don’t know our own utility function on an epistemic level. Ordinarily, we would calculate the expected utility of an action, both as smoke-lovers and as non-smoke-lovers, as follows: $$\mathbb{E}[U|a]=P(S_1|a)\cdot \mathbb{E}[U|S_1, a]+P(S_2|a)\cdot \mathbb{E}[U|S_2, a]$$, where, if $$U_{1}$$ ($$U_{2}$$) is the utility function of a smoke-lover (non-smoke-lover), $$\mathbb{E}[U|S_i, a]$$ is equal to $$\mathbb{E}[U_{i}|a]$$. In this case, we don’t get any information about our utility function from our own action, and hence, no Newcomb-like problem arises. I’m unsure whether there is any causal decision theory derivative that gets my case (or all other possible cases in this setting) right. It seems like as long as the agent isn’t certain to be a smoke-lover from the start, there are still payoffs for which CDT would (wrongly) choose not to smoke.
by Alex Appel 3 days ago | Abram Demski likes this | link | on: Smoking Lesion Steelman

I think that in that case, the agent shouldn’t smoke, and CDT is right, although there is side-channel information that can be used to come to the conclusion that the agent should smoke. Here’s a reframing of the provided payoff matrix that makes this argument clearer. (also, your problem as stated should have 0 utility for a nonsmoker imagining the situation where they smoke and get killed)

Let’s say that there is a kingdom which contains two types of people, good people and evil people, and a person doesn’t necessarily know which type they are. There is a magical sword enchanted with a heavenly aura, and if a good person wields the sword, it will guide them do heroic things, for +10 utility (according to a good person) and 0 utility (according to a bad person). However, if an evil person wields the sword, it will afflict them for the rest of their life with extreme itchiness, for -100 utility (according to everyone).

good person’s utility estimates:

• takes sword

• I’m good: 10

• I’m evil: -90

• don’t take sword: 0

evil person’s utility estimates:

• takes sword

• I’m good: 0

• I’m evil: -100

• don’t take sword: 0

As you can clearly see, this is the exact same payoff matrix as the previous example. However, now it’s clear that if a (secretly good) CDT agent believes that most of society is evil, then it’s a bad idea to pick up the sword, because the agent is probably evil (according to the info they have) and will be tormented with itchiness for the rest of their life, and if it believes that most of society is good, then it’s a good idea to pick up the sword. Further, this situation is intuitively clear enough to argue that CDT just straight-up gets the right answer in this case.

A human (with some degree of introspective power) in this case, could correctly reason “oh hey I just got a little warm fuzzy feeling upon thinking of the hypothetical where I wield the sword and it doesn’t curse me. This is evidence that I’m good, because an evil person would not have that response, so I can safely wield the sword and will do so”.

However, what the human is doing in this case is using side-channel information that isn’t present in the problem description. They’re directly experiencing sense data as a result of the utility calculation outputting 10 in that hypothetical, and updating on that. In a society where everyone was really terrible at introspection so the only access they had to their decision algorithm was seeing their actual decision, (and assuming no previous decision problems that good and evil people decide differently on so the good person could learn that they were good by their actions), it seems to me like there’s a very intuitively strong case for not picking up the sword/not smoking.

 No Constant Distribution Can be a Logical Inductor discussion post by Alex Appel 16 days ago | Abram Demski, Jessica Taylor and Stuart Armstrong like this | 1 comment

Two minor comments. First, the bitstrings that you use do not all correspond to worlds, since, for example, $$\rm{Con}(\rm{PA}+\rm{Con}(\rm{PA}))$$ implies $$\rm{Con}(\rm{PA})$$, as $$\rm{PA}$$ is a subtheory of $$\rm{PA} + \rm{Con}(\rm{PA})$$. This can be fixed by instead using a tree of sentences that all diagonalize against themselves. Tsvi and I used a construction in this spirit in A limit-computable, self-reflective distribution, for example.

Second, I believe that weakening #2 in this post also cannot be satisfied by any constant distribution. To sketch my reasoning, a trader can try to buy a sequence of sentences $$\phi_1, \phi_1 \wedge \phi_2, \dots$$, spending $$\2^{-n}$$ on the $$n$$th sentence $$\phi_1 \wedge \dots \wedge \phi_n$$. It should choose the sequence of sentences so that $$\phi_1 \wedge \dots \wedge \phi_n$$ has probability at most $$2^{-n}$$, and then it will make an infinite amount of money if the sentences are simultaneously true.

The way to do this is to choose each $$\phi_n$$ from a list of all sentences. If at any point you notice that $$\phi_1 \wedge \dots \wedge \phi_n$$ has too high a probability, then pick a new sentence for $$\phi_n$$. We can sell all the conjunctions $$\phi_1 \wedge \dots \wedge \phi_k$$ for $$k \ge n$$ and get back the original amount payed by hypothesis. Then, if we can keep using sharper continuous tests of the probabilities of the sentences $$\phi_1 \wedge \dots \wedge \phi_n$$ over time, we will settle down to a sequence with the desired property.

In order to turn this sketch into a proof, we need to be more careful about how these things are to be made continuous, but it seems routine.

 by Jessica Taylor 17 days ago | link | parent | on: Musings on Exploration The true reason to do exploration seems to be because the agent believes the action it is taking will not lead to an irreversible trap, and because it believes that the action will reveal information about the true environment that enables a better policy later on, which in expectation up to the time horizon, outweighs the temporary loss incurred due to exploring. My understanding of logical inductor exploration (e.g. in asymptotic decision theory) is that the exploration steps the agent learns from mostly don’t happen in its own lifetime, rather they happen in the lifetimes of similar but simpler agents. This allows exploration to work for single-shot problems such as 5 and 10. Intuitively, if you are in a 5 and 10 problem and your brain has size 10^1000, then you can simulate someone whose brain has size 10^999 doing a 5 and 10 problem, and thereby learn the relation between the agent’s action and how much utility they get. So each particular agent has some chance of exploring irrecoverably, but in aggregate not many of them will (and it’s hard to predict which will and which won’t). As far as I can tell, the only strategy that doesn’t have some sort of targetable exploration behavior is Thompson sampling. Thompson sampling still randomizes (it randomizes its belief about the world it’s in) and is therefore vulnerable to troll bridge.
by Alex Appel 16 days ago | link | on: Musings on Exploration

A: While that is a really interesting note that I hadn’t spotted before, the standard formulation of exploration steps in logical inductor decision theory involve infinite exploration steps over all time, so even though an agent of this type would be able to inductively learn from what other agents do in different decision problems in less time than it naively appears, that wouldn’t make it explore less.

B: What I intended with the remark about Thompson sampling was that troll bridge functions on there being two distinct causes of “attempting to cross the bridge”. One is crossing because you believe it to be the best action, and the other is crossing because an exploration step occurred, and Thompson sampling doesn’t have a split decision criterion like this. Although now that you point it out, it is possible to make a Thompson sampling variant where the troll blows up the bridge when “crossing the bridge” is not the highest-ranked action.

 Musings on Exploration discussion post by Alex Appel 20 days ago | Vadim Kosoy likes this | 4 comments

The true reason to do exploration seems to be because the agent believes the action it is taking will not lead to an irreversible trap, and because it believes that the action will reveal information about the true environment that enables a better policy later on, which in expectation up to the time horizon, outweighs the temporary loss incurred due to exploring.

My understanding of logical inductor exploration (e.g. in asymptotic decision theory) is that the exploration steps the agent learns from mostly don’t happen in its own lifetime, rather they happen in the lifetimes of similar but simpler agents. This allows exploration to work for single-shot problems such as 5 and 10. Intuitively, if you are in a 5 and 10 problem and your brain has size 10^1000, then you can simulate someone whose brain has size 10^999 doing a 5 and 10 problem, and thereby learn the relation between the agent’s action and how much utility they get. So each particular agent has some chance of exploring irrecoverably, but in aggregate not many of them will (and it’s hard to predict which will and which won’t).

As far as I can tell, the only strategy that doesn’t have some sort of targetable exploration behavior is Thompson sampling.

Thompson sampling still randomizes (it randomizes its belief about the world it’s in) and is therefore vulnerable to troll bridge.

 Musings on Exploration discussion post by Alex Appel 20 days ago | Vadim Kosoy likes this | 4 comments
by Vadim Kosoy 19 days ago | Alex Appel likes this | link | on: Musings on Exploration

Traps are a somewhat bigger issue than you seem to think, when you take acausal attack into account. Your prior contains low complexity hypotheses that intentionally produce good predictions until some critical pivot point at which they switch to something manipulative. So, every time your reach such a pivot point you are facing a decision with irreversible consequences and there is no prior evidence to help you. Delegative learning gets around this my having the agent delegate precisely at the pivot point.

Even disregarding that, “Once the agent has figured out some of how the world works, most environments/hypotheses where there is a trap have evidential clues elsewhere to rule them out” is not quite true.

The notion of a “trap” is relative to the way you organize your uncertainty about the world. Saying that the environment might contain traps is saying that the class of environments you consider is unlearnable. However, the specification of a Bayesian agent only depends on the prior that you get by “averaging” the environments in this class. Different ways of decomposing the prior into hypotheses might yield learnable or unlearnable classes.

For example, consider an environment in which taking action A leads to heaven with probability 70% and hell with probability 30% whereas taking action B leads to heaven with probability 50% and hell with probability 50%. In this environment, taking action A is the better choice and there is no problem. However, if you decompose it into a mixture of deterministic environments then, from that perspective, you have a trap.

To give a “realistic” example, imagine that, we think that quantum random is truly random, but actually there is an underlying theory which allows predicting quantum events deterministically. Then, a strategy that seems optimal from the perspective of an agent that only knows quantum theory might be “suicidal” (permanently sacrificing value) from the perspective of an agent that knows the deeper underlying theory.

As another example, imagine that (i) the only way to escape the heat death of our universe is by controlled vacuum collapse and (ii) because we don’t know in which string vacuum we are, there is no way to be certain about the outcome of a controlled vacuum collapse without high energy experiments that have a significant chance of triggering an uncontrolled vacuum collapse. AFAIK this situation is consistent with our knowledge of physics. So, if you consider the different string vacua to be different hypotheses, we are facing a trap. On the other hand, if you have some theory that gives you a probability distribution over these vacua then there is a well-defined optimal strategy.

The point is, it is probably not meaningful/realistic to claim that we can design an agent that will almost certainly successfully deal with all traps, but it is meaningful and realistic to claim that we can design an agent that will be optimal relatively to our own posterior belief state (which is, more or less by definition, the sort of agent that it is a good idea to build).

The reason “explorative” algorithms such as PSRL (Posterior Sampling Reinforcement Learning) cannot be trivially replaced by the Bayes optimal policy, is that the Bayes optimal policy is (more) computationally intractable. For example, if you consider a finite set of finite MDP hypotheses then PSRL can be implemented in polynomial time but the Bayes optimal policy cannot (in fact I am not 100% sure about this, but I think that hole can be patched using the PCP theorem).

 Musings on Exploration discussion post by Alex Appel 20 days ago | Vadim Kosoy likes this | 4 comments
by Abram Demski 19 days ago | link | on: Musings on Exploration

I’m not convinced exploration doesn’t tile. Exploration steps would not be self-modified away if they’re actually useful/important, and if the agent can recognize this.

In the case of chicken-rule, it’s unclear how to make a decision at all without it. Plus, the exploration never needs to occur – though, subjectively, the agent can’t know that, so that doesn’t really effect the question of tiling. But, the agent could see that removing the chicken step removes the ability of the agent to reason about the consequences of alternate actions.

I think it’s somewhat plausible that you can get something which needs chicken rule as a foundation, but which can decide not to use it in cases like troll bridge, because it is deciding whether to use the chicken rule for sensible reasons (where “sensible” includes the chicken rule itself).

Deciding to use the chicken rule is a transparent-Newcomb-like problem: your behavior in the case that you do see a proof of your own action affects your ability to reason in the case that you don’t.

The same things seem to apply to exploration.

Given our current level of understanding, chicken rule and true exploration seem closely analogous. However, it’s quite plausible that this will stop being the case with a better understanding. In particular, Sam recently pointed out to me that Löb’s theorem doesn’t go through for $$\Box \phi := \mathbb{P}(\phi) \geq 0.95$$. We have a decent picture of what tiling looks like in pure logical settings, but that’s shaped very highly by Löb’s theorem. So, tiling considerations for exploration could look very different from those for chicken-rule.

 A Difficulty With Density-Zero Exploration discussion post by Alex Appel 27 days ago | 1 comment

Update: This isn’t really an issue, you just need to impose an assumption that there is some function $$f$$ such that $$f(n)>n$$, and $$f(n)$$ is computable in time polynomial in $$f(n)$$, and you always find out whether exploration happened on turn $$f(n)$$ after $$\mathcal{O}(f(n+1))$$ days.

This is just the condition that there’s a subsequence where good feedback is possible, and is discussed significantly in section 4.3 of the logical induction paper.

If there’s a subsequence B (of your subsequence of interest, A) where you can get good feedback, then there’s infinite exploration steps on subsequence B (and also on A because it contains B)

This post is hereby deprecated. Still right, just not that relevant.

 by Abram Demski 36 days ago | link | parent | on: Distributed Cooperation Cool! I’m happy to see this written up finally. It’s been a big source of intuitions for me, so it’s good to see that the proof checks out. A possible next step to all this is to try to specify proof-based DT agents which could play this game (or something similar) based on Löbian handshakes. (In fact, part of the original motivation was to try to bring the cooperative-oracle model closer to the Löb-based cooperation you can get in prisoner’s dilemma with visible source code.) It’s unfortunate that you had to add the pareto-improvement condition to the cell rank. That part seems especially unlikely to drop out of a more general decision theory. I think I see another serious complication: Yes, not all points in $$C_{\vec{i}}$$ attain the same utility, but for a sufficiently small ϵ, the cell is really small, and for any player, the utility over the cell is well-approximated by the utility attained at the middle point in the cell. So, for any desired utility-approximation accuracy $$\delta$$, you can choose $$\epsilon$$ sufficiently small to achieve it. But, a pareto-optima in the set of middle points can be arbitrarily worse for some player than any pareto-optima of the full game; IE, taking the midpoints can hide arbitrarily large pareto improvements. For example, suppose that $$\delta$$=0.001. A pareto optima of the midpoints might give the utility vector (2, 2, 3) for the three players. There could be another midpoint (100, 100, 2.9999999), very near where the true game contains a point (100, 100, 3). So, it seems the pareto optimum of the game on midpoints which is found by the process in the post can be arbitrarily sub-optimal for all but one player, with no guarantee that this gets better as $$\epsilon$$ shrinks.
by Alex Appel 36 days ago | Abram Demski likes this | link | on: Distributed Cooperation

If you drop the Pareto-improvement condition from the cell rank, and just have “everyone sorts things by their own utility”, then you won’t necessarily get a Pareto-optimal outcome (within the set of cell center-points), but you will at least get a point where there are no strict Pareto improvements (no points that leave everyone better off).

The difference between the two is… let’s say we’ve got a 2-player 2-move game that in utility-space, makes some sort of quadrilateral. If the top and right edges join at 90 degrees, the Pareto-frontier would be the point on the corner, and the set of “no strict Pareto improvements” would be the top and the right edges.

If that corner is obtuse, then both “Pareto frontier” and “no strict Pareto improvements” agree that both line edges are within the set, and if the corner is acute, then both “Pareto frontier” and “no strict Pareto improvements” agree that only the corner is within the set. It actually isn’t much of a difference, it only manifests when the utilities for a player are exactly equal, and is easily changed by a little bit of noise.

The utility-approximation issue you pointed out seems to be pointing towards the impossibility of guaranteeing limiting to a point on the Pareto frontier (when you make the cell size smaller and smaller), precisely because of that “this set is unstable under arbitrarily small noise” issue.

But, the “set of all points that have no strict Pareto improvements by more than $$\delta$$ for all players”, ie, the $$\delta$$-fuzzed version of “set of points with no strict pareto improvement”, does seem to be robust against a little bit of noise, and doesn’t require the Pareto-improvement condition on everyone’s ranking of cells.

So I’m thinking that if that’s all we can attain (because of the complication you pointed out), then it lets us drop that inelegant Pareto-improvement condition.

I’ll work on the proof that for sufficiently small cell size $$\epsilon$$, you can get an outcome within $$\delta$$ of the set of “no strict Pareto improvements available”

Nice job spotting that flaw.

Distributed Cooperation
post by Alex Appel 36 days ago | Abram Demski and Scott Garrabrant like this | 2 comments

Reflective oracles can be approximated by computing Nash equilibria. But is there some procedure that produces a Pareto-optimal equilibrium in a game, aka, a point produced by a Cooperative oracle? It turns out there is. There are some interesting philosophical aspects to it, which will be typed up in the next post.

The result is not original to me, it’s been floating around MIRI for a while. I think Scott, Sam, and Abram worked on it, but there might have been others. All I did was formalize it a bit, and generalize from the 2-player 2-move case to the n-player n-move case. With the formalism here, it’s a bit hard to intuitively understand what’s going on, so I’ll indicate where to visualize an appropriate 3-dimensional object.

by Abram Demski 36 days ago | link | on: Distributed Cooperation

Cool! I’m happy to see this written up finally. It’s been a big source of intuitions for me, so it’s good to see that the proof checks out.

A possible next step to all this is to try to specify proof-based DT agents which could play this game (or something similar) based on Löbian handshakes. (In fact, part of the original motivation was to try to bring the cooperative-oracle model closer to the Löb-based cooperation you can get in prisoner’s dilemma with visible source code.)

It’s unfortunate that you had to add the pareto-improvement condition to the cell rank. That part seems especially unlikely to drop out of a more general decision theory.

I think I see another serious complication:

Yes, not all points in $$C_{\vec{i}}$$ attain the same utility, but for a sufficiently small ϵ, the cell is really small, and for any player, the utility over the cell is well-approximated by the utility attained at the middle point in the cell.

So, for any desired utility-approximation accuracy $$\delta$$, you can choose $$\epsilon$$ sufficiently small to achieve it. But, a pareto-optima in the set of middle points can be arbitrarily worse for some player than any pareto-optima of the full game; IE, taking the midpoints can hide arbitrarily large pareto improvements.

For example, suppose that $$\delta$$=0.001. A pareto optima of the midpoints might give the utility vector (2, 2, 3) for the three players. There could be another midpoint (100, 100, 2.9999999), very near where the true game contains a point (100, 100, 3).

So, it seems the pareto optimum of the game on midpoints which is found by the process in the post can be arbitrarily sub-optimal for all but one player, with no guarantee that this gets better as $$\epsilon$$ shrinks.

 by 258 87 days ago | Alex Appel and Abram Demski like this | link | parent | on: In memoryless Cartesian environments, every UDT po... Since Briggs [1] shows that EDT+SSA and CDT+SIA are both ex-ante-optimal policies in some class of cases, one might wonder whether the result of this post transfers to EDT+SSA. I.e., in memoryless POMDPs, is every (ex ante) optimal policy also consistent with EDT+SSA in a similar sense. I think it is, as I will try to show below. Given some existing policy $$\pi$$, EDT+SSA recommends that upon receiving observation $$o$$ we should choose an action from $\arg\max_a \sum_{s_1,...,s_n} \sum_{i=1}^n SSA(s_i\text{ in }s_1,...,s_n\mid o, \pi_{o\rightarrow a})U(s_n).$ (For notational simplicity, I’ll assume that policies are deterministic, but, of course, actions may encode probability distributions.) Here, $$\pi_{o\rightarrow a}(o')=a$$ if $$o=o'$$ and $$\pi_{o\rightarrow a}(o')=\pi(o')$$ otherwise. $$SSA(s_i\text{ in }s_1,...,s_n\mid o, \pi_{o\rightarrow a})$$ is the SSA probability of being in state $$s_i$$ of the environment trajectory $$s_1,...,s_n$$ given the observation $$o$$ and the fact that one uses the policy $$\pi_{o\rightarrow a}$$. The SSA probability $$SSA(s_i\text{ in }s_1,...,s_n\mid o, \pi_{o\rightarrow a})$$ is zero if $$m(s_i)\neq o$$ and $SSA(s_i\text{ in }s_1,...,s_n\mid o, \pi_{o\rightarrow a}) = P(s_1,...,s_n\mid \pi_{o\rightarrow a}) \frac{1}{\#(o,s_1,...,s_n)}$ otherwise. Here, $$\#(o,s_1,...,s_n)=\sum_{i=1}^n \left[ m(s_i)=o \right]$$ is the number of times $$o$$ occurs in $$\#(o,s_1,...,s_n)$$. Note that this is the minimal reference class version of SSA, also known as the double-halfer rule (because it assigns 1/2 probability to tails in the Sleeping Beauty problem and sticks with 1/2 if it’s told that it’s Monday). Inserting this into the above, we get $\arg\max_a \sum_{s_1,...,s_n} \sum_{i=1}^n SSA(s_i\text{ in }s_1,...,s_n\mid o, \pi_{o\rightarrow a})U(s_n)=\arg\max_a \sum_{s_1,...,s_n\text{ with }o} \sum_{i=1...n, m(s_i)=o} \left( P(s_1,...,s_n\mid \pi_{o\rightarrow a}) \frac{1}{\#(o,s_1,...,s_n)} \right) U(s_n),$ where the first sum on the right-hand side is over all histories that give rise to observation $$o$$ at some point. Dividing by the number of agents with observation $$o$$ in a history and setting the policy for all agents at the same time cancel each other out, such that this equals $\arg\max_a \sum_{s_1,...,s_n\text{ with }o} P(s_1,...,s_n\mid \pi_{o\rightarrow a}) U(s_n)=\arg\max_a \sum_{s_1,...,s_n} P(s_1,...,s_n\mid \pi_{o\rightarrow a}) U(s_n).$ Obviously, any optimal policy chooses in agreement with this. But the same disclaimers apply; multiple policies satisfy the right-hand side of this equation and not all of these are optimal. [1] Rachael Briggs (2010): Putting a value on Beauty. In Tamar Szabo Gendler and John Hawthorne, editors, Oxford Studies in Epistemology: Volume 3, pages 3–34. Oxford University Press, 2010. http://joelvelasco.net/teaching/3865/briggs10-puttingavalueonbeauty.pdf

Caveat: The version of EDT provided above only takes dependences between instances of EDT making the same observation into account. Other dependences are possible because different decision situations may be completely “isomorphic”/symmetric even if the observations are different. It turns out that the result is not valid once one takes such dependences into account, as shown by Conitzer [2]. I propose a possible solution in https://casparoesterheld.com/2017/10/22/a-behaviorist-approach-to-building-phenomenological-bridges/ . Roughly speaking, my solution is to identify with all objects in the world that are perfectly correlated with you. However, the underlying motivation is unrelated to Conitzer’s example.

[2] Vincent Conitzer: A Dutch Book against Sleeping Beauties Who Are Evidential Decision Theorists. Synthese, Volume 192, Issue 9, pp. 2887-2899, October 2015. https://arxiv.org/pdf/1705.03560.pdf

 Stable Pointers to Value II: Environmental Goals discussion post by Abram Demski 73 days ago | 1 comment

Delegative Reinforcement Learning solves this problem by keeping humans in the loop while preserving consequentialist reasoning. Ofc currently the theory is based on a lot of simplification and the ultimate learning protocol will probably look differently, but I think that the basic mechanism (delegation combined with model-based reasoning) is sound.

Further Progress on a Bayesian Version of Logical Uncertainty
post by Alex Appel 81 days ago | Scott Garrabrant likes this | 1 comment

I’d like to credit Daniel Demski for helpful discussion.

Intermediate update:

The handwavy argument about how you’d get propositional inconsistency in the limit of imposing the constraint of “the string cannot contain $$a\wedge b\wedge c...\to\neg\phi$$ and $$a$$ and $$b$$ and… and $$\phi$$

is less clear than I thought. The problem is that, while the prior may learn that that constraint applies as it updates on more sentences, that particular constraint can get you into situations where adding either $$\phi$$ or $$\neg\phi$$ leads to a violation of the constraint.

So, running the prior far enough forward leads to the probability distribution being nearly certain that, while that particular constraint applied in the past, it will stop applying at some point in the future by vetoing both possible extensions of a string of sentences, and then less-constrained conditions will apply from that point forward.

On one hand, if you don’t have the computational resources to enforce full propositional consistency, it’s expected that most of the worlds you generate will be propositionally inconsistent, and midway through generating them you’ll realize that some of them are indeed propositionally inconsistent.

On the other hand, we want to be able to believe that constraints capable of painting themselves into a corner will apply to reality forevermore.

I’ll think about this a bit more. One possible line of attack is having $$\mathbb{P}(\phi)$$ and $$\mathbb{P}(\neg\phi)$$ not add up to one, because it’s possible that the sentence generating process will just stop cold before one of the two shows up, and renormalizing them to 1. But I’d have to check if it’s still possible to $$\varepsilon$$-approximate the distribution if we introduce this renormalization, and to be honest, I wouldn’t be surprised if there was a more elegant way around this.

EDIT: yes it’s still possible to $$\varepsilon$$-approximate the distribution in known time if you have $$\mathbb{P}(\phi)$$ refer to $$\frac{probability to encounter \phi first}{1-probability to halt first}$$, although the bounds are really loose. This is because if most of the execution paths involve halting before the sentence is sampled, $$\varepsilon$$-error in the probability of sampling $$\phi$$ first will get blown up by the small denominator.

Will type up the proof later, but it basically proceeds by looking at the probability mass associated with “sample the trivial constraint that accepts everything, and sample it again on each successive round”, because this slice of probability mass has a near-guarantee of hitting $$\phi$$, and then showing that even this tiny slice has substantially more probability mass than the cumulative probability of ever sampling a really rare sentence or not hitting any of $$\phi$$, $$\neg\phi$$, or the string terminating.

In memoryless Cartesian environments, every UDT policy is a CDT+SIA policy
post by Jessica Taylor 681 days ago | Vadim Kosoy and Abram Demski like this | 3 comments

Summary: I define a memoryless Cartesian environments (which can model many familiar decision problems), note the similarity to memoryless POMDPs, and define a local optimality condition for policies, which can be roughly stated as “the policy is consistent with maximizing expected utility using CDT and subjective probabilities derived from SIA”. I show that this local optimality condition is necesssary but not sufficient for global optimality (UDT).

Since Briggs [1] shows that EDT+SSA and CDT+SIA are both ex-ante-optimal policies in some class of cases, one might wonder whether the result of this post transfers to EDT+SSA. I.e., in memoryless POMDPs, is every (ex ante) optimal policy also consistent with EDT+SSA in a similar sense. I think it is, as I will try to show below.

Given some existing policy $$\pi$$, EDT+SSA recommends that upon receiving observation $$o$$ we should choose an action from $\arg\max_a \sum_{s_1,...,s_n} \sum_{i=1}^n SSA(s_i\text{ in }s_1,...,s_n\mid o, \pi_{o\rightarrow a})U(s_n).$ (For notational simplicity, I’ll assume that policies are deterministic, but, of course, actions may encode probability distributions.) Here, $$\pi_{o\rightarrow a}(o')=a$$ if $$o=o'$$ and $$\pi_{o\rightarrow a}(o')=\pi(o')$$ otherwise. $$SSA(s_i\text{ in }s_1,...,s_n\mid o, \pi_{o\rightarrow a})$$ is the SSA probability of being in state $$s_i$$ of the environment trajectory $$s_1,...,s_n$$ given the observation $$o$$ and the fact that one uses the policy $$\pi_{o\rightarrow a}$$.

The SSA probability $$SSA(s_i\text{ in }s_1,...,s_n\mid o, \pi_{o\rightarrow a})$$ is zero if $$m(s_i)\neq o$$ and $SSA(s_i\text{ in }s_1,...,s_n\mid o, \pi_{o\rightarrow a}) = P(s_1,...,s_n\mid \pi_{o\rightarrow a}) \frac{1}{\#(o,s_1,...,s_n)}$ otherwise. Here, $$\#(o,s_1,...,s_n)=\sum_{i=1}^n \left[ m(s_i)=o \right]$$ is the number of times $$o$$ occurs in $$\#(o,s_1,...,s_n)$$. Note that this is the minimal reference class version of SSA, also known as the double-halfer rule (because it assigns 1/2 probability to tails in the Sleeping Beauty problem and sticks with 1/2 if it’s told that it’s Monday).

Inserting this into the above, we get $\arg\max_a \sum_{s_1,...,s_n} \sum_{i=1}^n SSA(s_i\text{ in }s_1,...,s_n\mid o, \pi_{o\rightarrow a})U(s_n)=\arg\max_a \sum_{s_1,...,s_n\text{ with }o} \sum_{i=1...n, m(s_i)=o} \left( P(s_1,...,s_n\mid \pi_{o\rightarrow a}) \frac{1}{\#(o,s_1,...,s_n)} \right) U(s_n),$ where the first sum on the right-hand side is over all histories that give rise to observation $$o$$ at some point. Dividing by the number of agents with observation $$o$$ in a history and setting the policy for all agents at the same time cancel each other out, such that this equals $\arg\max_a \sum_{s_1,...,s_n\text{ with }o} P(s_1,...,s_n\mid \pi_{o\rightarrow a}) U(s_n)=\arg\max_a \sum_{s_1,...,s_n} P(s_1,...,s_n\mid \pi_{o\rightarrow a}) U(s_n).$ Obviously, any optimal policy chooses in agreement with this. But the same disclaimers apply; multiple policies satisfy the right-hand side of this equation and not all of these are optimal.

[1] Rachael Briggs (2010): Putting a value on Beauty. In Tamar Szabo Gendler and John Hawthorne, editors, Oxford Studies in Epistemology: Volume 3, pages 3–34. Oxford University Press, 2010. http://joelvelasco.net/teaching/3865/briggs10-puttingavalueonbeauty.pdf

Logical counterfactuals and differential privacy
post by Nisan Stiennon 91 days ago | Abram Demski and Scott Garrabrant like this | 1 comment

This idea was informed by discussions with Abram Demski, Scott Garrabrant, and the MIRIchi discussion group.

This doesn’t quite work. The theorem and examples only work if you maximize the unconditional mutual information, $$H(X;Y)$$, not $$H(X;Y|A)$$. And the choice of $$X$$ is doing a lot of work — it’s not enough to make it “sufficiently rich”.

An Untrollable Mathematician
post by Abram Demski 89 days ago | Alex Appel, Sam Eisenstat, Vadim Kosoy, Jack Gallagher, Jessica Taylor, Paul Christiano, Scott Garrabrant and Vladimir Slepnev like this | 1 comment

Follow-up to All Mathematicians are Trollable.

It is relatively easy to see that no computable Bayesian prior on logic can converge to a single coherent probability distribution as we update it on logical statements. Furthermore, the non-convergence behavior is about as bad as could be: someone selecting the ordering of provable statements to update on can drive the Bayesian’s beliefs arbitrarily up or down, arbitrarily many times, despite only saying true things. I called this wild non-convergence behavior “trollability”. Previously, I showed that if the Bayesian updates on the provabilily of a sentence rather than updating on the sentence itself, it is still trollable. I left open the question of whether some other side information could save us. Sam Eisenstat has closed this question, providing a simple logical prior and a way of doing a Bayesian update on it which (1) cannot be trolled, and (2) converges to a coherent distribution.

by Sam Eisenstat 89 days ago | Abram Demski likes this | link | on: An Untrollable Mathematician

I at first didn’t understand your argument for claim (2), so I wrote an alternate proof that’s a bit more obvious/careful. I now see why it works, but I’ll give my version below for anyone interested. In any case, what you really mean is the probability of deciding a sentence outside of $$\Phi$$ by having it announced by nature; there may be a high probability of sentences being decided indirectly via sentences in $$\Phi$$.

Instead of choosing $$\Phi$$ as you describe, pick $$\Phi$$ so that the probability $$\mu(\Phi)$$ of sampling something in $$\Phi$$ is greater than $$1 - \mu(\psi) \cdot \varepsilon / 2$$. Then, the probability of sampling something in $$\Phi - \{\psi\}$$ is at least $$1 - \mu(\psi) \cdot (1 + \varepsilon / 2)$$. Hence, no matter what sentences have been decided already, the probability that repeatedly sampling from $$\mu$$ selects $$\psi$$ before it selects any sentence outside of $$\Phi$$ is at least

\begin{align*} \sum_{k = 0}^\infty (1 - \mu(\psi) \cdot (1 + \varepsilon / 2))^k \cdot \mu(\psi) & = \frac{\mu(\psi)}{\mu(\psi) \cdot (1 + \varepsilon / 2)} \\ & > 1 - \varepsilon / 2 \end{align*}

as desired.

Furthermore, this argument makes it clear that the probability distribution we converge to depends only on the set of sentences which the environment will eventually assert, not on their ordering!

Oh, I didn’t notice that aspect of things. That’s pretty cool.

 by Abram Demski 101 days ago | link | parent | on: The set of Logical Inductors is not Convex This uses logical inductors of distinctly different strengths. I wonder if there’s some kind of convexity result for logical inductors which can see each other? Suppose traders in $$\mathbb{P}_n$$ have access to $$\mathbb{P}'_n$$ and vice versa. Or perhaps just assume that the markets cannot be arbitrarily exploited by such traders. Then, are linear combinations also logical inductors?

This is somewhat related to what I wrote about here. If you consider only what I call convex gamblers/traders and fix some weighting (“prior”) over the gamblers then there is a natural convex set of dominant forecasters (for each history, it is the set of minima of some convex function on $$\Delta\mathcal{O}^\omega$$.)

The set of Logical Inductors is not Convex
post by Scott Garrabrant 573 days ago | Sam Eisenstat, Abram Demski and Patrick LaVictoire like this | 3 comments

Sam Eisenstat asked the following interesting question: Given two logical inductors over the same deductive process, is every (rational) convex combination of them also a logical inductor? Surprisingly, the answer is no! Here is my counterexample.

This uses logical inductors of distinctly different strengths. I wonder if there’s some kind of convexity result for logical inductors which can see each other? Suppose traders in $$\mathbb{P}_n$$ have access to $$\mathbb{P}'_n$$ and vice versa. Or perhaps just assume that the markets cannot be arbitrarily exploited by such traders. Then, are linear combinations also logical inductors?

Smoking Lesion Steelman II
post by Abram Demski 205 days ago | Tom Everitt and Scott Garrabrant like this | 1 comment

After Johannes Treutlein’s comment on Smoking Lesion Steelman, and a number of other considerations, I had almost entirely given up on CDT. However, there were still nagging questions about whether the kind of self-ignorance needed in Smoking Lesion Steelman could arise naturally, how it should be dealt with if so, and what role counterfactuals ought to play in decision theory if CDT-like behavior is incorrect. Today I sat down to collect all the arguments which have been rolling around in my head on this and related issues, and arrived at a place much closer to CDT than I expected.

Nice writeup. Is one-boxing in Newcomb an equilibria?

 by Alex Appel 117 days ago | link | parent | on: Delegative Inverse Reinforcement Learning I don’t believe that $$x_{:n}^{!k}$$ was defined anywhere, but we “use the definition” in the proof of Lemma 1. As far as I can tell, it’s a set of (j,y) pairs, where j is the index of a hypothesis, and y is an infinite history string, rather like the set $$h^{!k}$$. How do the definitions of $$h^{!k}$$ and $$x^{!k}_{:n}$$ differ?

Hi Alex!

The definition of $$h^{!k}$$ makes sense for any $$h$$, that is, the superscript $$!k$$ in this context is a mapping from finite histories to sets of pairs as you said. In the line in question we just apply this mapping to $$x_{:n}$$ where $$x$$ is a bound variable coming from the expected value.

I hope this helps?

Delegative Inverse Reinforcement Learning
post by Vadim Kosoy 296 days ago | Alex Appel likes this | 11 comments

We introduce a reinforcement-like learning setting we call Delegative Inverse Reinforcement Learning (DIRL). In DIRL, the agent can, at any point of time, delegate the choice of action to an “advisor”. The agent knows neither the environment nor the reward function, whereas the advisor knows both. Thus, DIRL can be regarded as a special case of CIRL. A similar setting was studied in Clouse 1997, but as far as we can tell, the relevant literature offers few theoretical results and virtually all researchers focus on the MDP case (please correct me if I’m wrong). On the other hand, we consider general environments (not necessarily MDP or even POMDP) and prove a natural performance guarantee.

A summary that might be informative to other people: Where does the $$\omega(\frac{2}{3})$$ requirement on the growth rate of the “rationality parameter” $$\beta$$ come from?

Well, the expected loss of the agent comes from two sources. Making a suboptimal choice on its own, and incurring a loss from consulting a not-fully-rational advisor. The policy of the agent is basically “defer to the advisor when the expected loss over all time of acting (relative to the optimal move by an agent who knew the true environment) is too high”. Too high, in this case, cashes out as “higher than $$\beta(t)^{-1}t^{-1/x}$$”, where t is the time discount parameter and $$\beta$$ is the level-of-rationality parameter. Note that as the operator gets more rational, the agent gets less reluctant about deferring. Also note that t is reversed from what you might think, high values of t mean that the agent has a very distant planning horizon, low values mean the agent is more present-oriented.

On most rounds, the agent acts on its own, so the expected all-time loss on a single round from taking suboptimal choices is on the order of $$\beta(t)^{-1}t^{-1/x}$$, and also we’re summing up over about t rounds (technically exponential discount, but they’re similar enough). So the loss from acting on its own ends up being about $$\beta(t)^{-1}t^{(x-1)/x}$$.

On the other hand, delegation will happen on at most ~$$t^{2/x}$$ rounds, with a loss of $$\beta(t)^{-1}$$ value, so the loss from delegation ends up being around $$\beta(t)^{-1}t^{2/x}$$.

Setting these two losses equal to each other/minimizing the exponent on the t when they are smooshed together gets you x=3. And then $$\beta(t)$$ must grow asymptotically faster than $$t^{2/3}$$ to have the loss shrink to 0. So that’s basically where the 2/3 comes from, it comes from setting the delegation threshold to equalize long-term losses from the AI acting on its own, and the human picking bad choices, as the time horizon t goes to infinity.

Delegative Inverse Reinforcement Learning
post by Vadim Kosoy 296 days ago | Alex Appel likes this | 11 comments

We introduce a reinforcement-like learning setting we call Delegative Inverse Reinforcement Learning (DIRL). In DIRL, the agent can, at any point of time, delegate the choice of action to an “advisor”. The agent knows neither the environment nor the reward function, whereas the advisor knows both. Thus, DIRL can be regarded as a special case of CIRL. A similar setting was studied in Clouse 1997, but as far as we can tell, the relevant literature offers few theoretical results and virtually all researchers focus on the MDP case (please correct me if I’m wrong). On the other hand, we consider general environments (not necessarily MDP or even POMDP) and prove a natural performance guarantee.

I don’t believe that $$x_{:n}^{!k}$$ was defined anywhere, but we “use the definition” in the proof of Lemma 1.

As far as I can tell, it’s a set of (j,y) pairs, where j is the index of a hypothesis, and y is an infinite history string, rather like the set $$h^{!k}$$.

How do the definitions of $$h^{!k}$$ and $$x^{!k}_{:n}$$ differ?

Being legible to other agents by committing to using weaker reasoning systems
post by Alex Mennen 141 days ago | Stuart Armstrong and Vladimir Slepnev like this | 1 comment

Suppose that an agent $$A_{1}$$ reasons in a sound theory $$T_{1}$$, and an agent $$A_{2}$$ reasons in a theory $$T_{2}$$, such that $$T_{1}$$ proves that $$T_{2}$$ is sound. Now suppose $$A_{1}$$ is trying to reason in a way that is legible to $$A_{2}$$, in the sense that $$A_{2}$$ can rely on $$A_{1}$$ to reach correct conclusions. One way of doing this is for $$A_{1}$$ to restrict itself to some weaker theory $$T_{3}$$, which $$T_{2}$$ proves is sound, for the purposes of any reasoning that it wants to be legible to $$A_{2}$$. Of course, in order for this to work, not only would $$A_{1}$$ have to restrict itself to using $$T_{3}$$, but $$A_{2}$$ would to trust that $$A_{1}$$ had done so. A plausible way for that to happen is for $$A_{1}$$ to reach the decision quickly enough that $$A_{2}$$ can simulate $$A_{1}$$ making the decision to restrict itself to using $$T_{3}$$.

This is exactly the sort of thing I’ve wanted for ASP (Agent Simulates Predictor).

One problem that’s always blocked me, is how to know when to do this, rather than using it add-hoc - is there an easy way to know that there’s an agent out in the universe using a more limited reasoning system?

 Where does ADT Go Wrong? discussion post by Abram Demski 156 days ago | Jack Gallagher and Jessica Taylor like this | 1 comment

When considering an embedder $$F$$, in universe $$U$$, in response to which SADT picks policy $$\pi$$, I would be tempted to apply the following coherence condition:

$E[F(\pi)] = E[F(DDT)] = E[U]$

(all approximately of course)

I’m not sure if this would work though. This is definitely a necessary condition for reasonable counterfactuals, but not obviously sufficient.

A potentially useful augmentation is to use absolute expected difference: $E[|F(\pi) - F(DDT)|] = E[|F(DDT) - U|] = 0$

 by Paul Christiano 144 days ago | Vladimir Slepnev likes this | link | parent | on: Policy Selection Solves Most Problems Without reading closely, this seems very close to UDT2. Is there a problem that this gets right which UDT2 gets wrong (or for which there is ambiguity about the specification of UDT2?) Without thinking too carefully, I don’t believe the troll bridge argument. We have to be super careful about “sufficiently large,” and about Lob’s theorem. To see whether the proof goes through, it seems instructive to consider the case where a trader with 90% of the initial mass really wants to cross the bridge. What happens when they try?

The differences between this and UDT2:

1. This is something we can define precisely, whereas UDT2 isn’t.
2. Rather than being totally updateless, this is just mostly updateless, with the parameter $$f$$ determining how updateless it is.

I don’t think there’s a problem this gets right which we’d expect UDT2 to get wrong.

If we’re using the version of logical induction where the belief jumps to 100% as soon as something gets proved, then a weighty trader who believes crossing the bridge is good will just get knocked out immediately if the theorem prover starts proving that crossing is bad (which helps that step inside the Löbian proof go through). (I’d be surprised if the analysis turns out much different for the kind of LI which merely rapidly comes to believe things which get proved, but I can see how that distinction might block the proof.) But certainly it would be good to check this more thoroughly.

 by Stuart Armstrong 145 days ago | link | parent | on: Policy Selection Solves Most Problems policy selection converges to giving Omega the money so long as the difficulty of computing the coin exceeds the power of the market at $$f(n)$$ time. Would it be sensible to just look for muggings (and ASPs) at the very beginning of the process, and then decide immediately what to do as soon as one is detected? Come to think of that, precommitting to ignoring knowledge about the result of the coin seems to be the best strategy here; does this cash out into anything useful in this formalism?

Looking “at the very beginning” won’t work – the beliefs of the initial state of the logical inductor won’t be good enough to sensibly detect these things and decide what to do about them.

While ignoring the coin is OK as special-case reasoning, I don’t think everything falls nicely into the bucket of “information you want to ignore” vs “information you want to update on”. The more general concept which captures both is to ask “how do I want to react to thin information, in terms of my action?” – which is of course the idea of policy selection.

Older

NEW DISCUSSION POSTS

I think that in that case,
 by Alex Appel on Smoking Lesion Steelman | 1 like

 by Sam Eisenstat on No Constant Distribution Can be a Logical Inductor | 1 like

A: While that is a really
 by Alex Appel on Musings on Exploration | 0 likes

> The true reason to do
 by Jessica Taylor on Musings on Exploration | 0 likes

 by Vadim Kosoy on Musings on Exploration | 1 like

I'm not convinced exploration
 by Abram Demski on Musings on Exploration | 0 likes

Update: This isn't really an
 by Alex Appel on A Difficulty With Density-Zero Exploration | 0 likes

If you drop the
 by Alex Appel on Distributed Cooperation | 1 like

Cool! I'm happy to see this
 by Abram Demski on Distributed Cooperation | 0 likes

Caveat: The version of EDT
 by 258 on In memoryless Cartesian environments, every UDT po... | 2 likes

[Delegative Reinforcement
 by Vadim Kosoy on Stable Pointers to Value II: Environmental Goals | 1 like

Intermediate update: The
 by Alex Appel on Further Progress on a Bayesian Version of Logical ... | 0 likes

Since Briggs [1] shows that
 by 258 on In memoryless Cartesian environments, every UDT po... | 2 likes

This doesn't quite work. The
 by Nisan Stiennon on Logical counterfactuals and differential privacy | 0 likes

I at first didn't understand
 by Sam Eisenstat on An Untrollable Mathematician | 1 like