by Sam Eisenstat 18 days ago | Alex Appel likes this | link | parent | on: A Loophole for Self-Applicative Soundness

As you say, this isn’t a proof, but it wouldn’t be too surprising if this were consistent. There is some $$k \in \mathbb{N}$$ such that $$\square_n \phi \to \phi$$ has a proof of length $$n^k$$, by a result of Pavel Pudlák (On the length of proofs of finitistic consistency statements in first order theories). Here I’m making the dependence on $$n$$ explicit, but not the dependence on $$\phi$$. I haven’t looked at it closely, but the proof strategy in Theorems 5.4 and 5.5 suggests that $$k$$ will not depend on $$\phi$$, as long as we only ask for the weaker property that $$\square_n \phi \to \phi$$ is provable in length $$n^k$$ for sentences $$\phi$$ of length at most $$k$$.

I found an improved version by Pudlák that gives a way to construct a proof of $$\phi$$ from $$\Box_{n}\phi$$ with length $$\mathcal{O}(n)$$. The improved version is here.

There are restrictions to this result, though. One is that the C-rule must apply to the logic. This is just the ability to go from $$\exists x:\phi(x)$$ to instantiating a $$c$$ such that $$\phi(c)$$. Pretty much all reasonable theorem provers have this.

The second restriction is that the theory must be finitely axiomatizable. No axiom schemas allowed. Again, this isn’t much of a restriction in practice, because NBG set theory, which proves the consistency of ZFC, is finitely axiomatizable.

The proof strategy is basically as follows. It’s shown that the shortest proof of a statement with quantifier depth $$n$$ must have a length of $$\Omega(n^2)$$, if the maximum quantifier depth in the proof is $$2n$$ or greater.

This can be flipped around to conclude that if there’s a length-n proof of $$\phi$$, the maximum quantifier depth in the proof can be at most $$\mathcal{O}(\sqrt{n})$$.
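Spelled out, the flip is just inverting the length bound (a sketch; $$c$$ stands for the implicit constant in the $$\Omega(n^2)$$ bound):

```latex
% First part: a proof using quantifier depth d (at least twice the
% theorem's depth) has length L >= c d^2 for some constant c > 0.
% Inverting: any proof of length n satisfies c d^2 <= n, so
% d <= sqrt(n/c), i.e. the maximum quantifier depth is O(sqrt(n)).
L \ge c\,d^{2}
\quad\Longrightarrow\quad
d \le \sqrt{L/c} = \mathcal{O}\!\left(\sqrt{L}\right)
```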

The second part of the proof involves constructing a bounded-quantifier version of a truth predicate. By Tarski’s undefinability of truth, a full truth predicate cannot be constructed, but it’s possible to exhibit a formula for which it’s provable that $$qd(\overline{\psi})\le n\to(Sat_{n}(\overline{\psi},x)\leftrightarrow\Sigma(Sat_{n},\overline{\psi},x))$$ ($$\Sigma$$ is the formula laying out Tarski’s conditions for something to be a truth predicate). Also, if $$n\ge$$ the quantifier depth of $$\psi$$, there’s a proof of $$Sat_{n}(\overline{\psi},x)\leftrightarrow\psi[x]$$ ($$\psi[x]$$ is the sentence $$\psi$$ with its free variables substituted with the elements enumerated in the list $$x$$). Also, there’s a proof that $$Sat_{n}$$ is preserved under inference rules and logical axioms, as long as everything stays below a quantifier depth of $$n$$.

All these proofs can be done in $$\mathcal{O}(n^2)$$ lines. One factor of $$n$$ comes from the formula abbreviated as $$Sat_{n}(x,y)$$ getting longer at a linear rate, and the other factor comes from having to prove $$Sat_{n}$$ for each $$n$$ separately as an ingredient for the next proof.

Combining the two parts: the $$\mathcal{O}(\sqrt{n})$$ bound on the quantifier depth, together with the $$\mathcal{O}(n^2)$$ bound on how long it takes to prove things about the truth predicate, means it takes $$\mathcal{O}(n)$$ steps to prove all the relevant theorems about a truth predicate of sufficiently large bounded quantifier depth. Then you can just go: “the statement that is claimed to have been proved must have $$Sat_{n}$$ apply to it, and we’ve proved that this is equivalent to the statement itself.”
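To make the shape of the bounded truth predicate concrete, here is a toy Python sketch (entirely illustrative, not Pudlák’s construction: a finite domain and a tuple-based formula AST are both assumptions): a `sat` evaluator that, like $$Sat_{n}$$, only handles formulas of quantifier depth at most $$n$$.

```python
# Toy bounded "truth predicate": evaluates formulas over a small finite
# domain, refusing any formula whose quantifier depth exceeds the bound n,
# mirroring how Sat_n only satisfies Tarski's conditions up to depth n.

DOMAIN = range(4)  # stand-in for a finite fragment of the intended model

def qd(phi):
    """Quantifier depth of a formula given as a nested-tuple AST."""
    tag = phi[0]
    if tag in ("lt", "eq"):
        return 0
    if tag in ("and", "or"):
        return max(qd(phi[1]), qd(phi[2]))
    if tag == "not":
        return qd(phi[1])
    if tag in ("forall", "exists"):
        return 1 + qd(phi[2])
    raise ValueError(tag)

def sat(n, phi, env):
    """Sat_n(phi, env): only defined when qd(phi) <= n."""
    if qd(phi) > n:
        raise ValueError("quantifier depth exceeds bound")
    tag = phi[0]
    if tag == "lt":
        return env[phi[1]] < env[phi[2]]
    if tag == "eq":
        return env[phi[1]] == env[phi[2]]
    if tag == "and":
        return sat(n, phi[1], env) and sat(n, phi[2], env)
    if tag == "or":
        return sat(n, phi[1], env) or sat(n, phi[2], env)
    if tag == "not":
        return not sat(n, phi[1], env)
    if tag == "forall":
        return all(sat(n - 1, phi[2], {**env, phi[1]: v}) for v in DOMAIN)
    if tag == "exists":
        return any(sat(n - 1, phi[2], {**env, phi[1]: v}) for v in DOMAIN)

# forall x exists y. x < y -- false on a finite domain (the max element).
phi = ("forall", "x", ("exists", "y", ("lt", "x", "y")))
print(qd(phi), sat(2, phi, {}))  # 2 False
```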

As a further bonus, a single $$\mathcal{O}(n)$$-length proof can establish the consistency of the theory itself for all $$n$$-length proofs.

It seems like a useful project to develop a program that automatically writes a proof of this form, to assess whether abstract unpacking of bounded proofs is usable in practice, but it would require digging into a bunch of finicky details of exactly how to encode a mathematical theory inside itself.

by Alex Appel 18 days ago | link | parent | on: A Loophole for Self-Applicative Soundness

Caught a flaw with this proposal in the currently stated form, though it is probably patchable. When unpacking a proof, at some point the sentence $$\Box_{1}\phi$$ will be reached as a conclusion, which is a false statement.

I misunderstood your proposal, but you don’t need to do this work to get what you want. You can just take each sentence $$\square_n \phi \to \phi$$ as an axiom, but declare that this axiom takes $$n$$ symbols to invoke. This could be done by changing the notion of length of a proof, or by taking axioms $$\psi_{\phi,n} \to (\square_n \phi \to \phi)$$ and $$\psi_{\phi,n}$$ with $$\psi_{\phi,n}$$ very long.
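A minimal sketch of that accounting idea (all names hypothetical): pad the axiom with a guard sentence $$\psi_{\phi,n}$$ whose written length grows linearly in $$n$$, so that invoking the soundness axiom for $$\square_n \phi$$ costs about $$n$$ symbols.

```python
# Sketch of the padding trick: instead of changing the proof system,
# pair each soundness axiom with a guard sentence psi_{phi,n} whose
# written length is ~n, so "invoking" Box_n(phi) -> phi costs n symbols.

def guard(phi, n):
    # psi_{phi,n}: any easily provable sentence padded to length ~n,
    # e.g. a conjunction of n copies of a tautology T.
    return " & ".join(["T"] * n)

def invocation_cost(phi, n):
    # To use the axiom psi -> (Box_n phi -> phi), you must first write
    # down psi itself, which takes a number of symbols linear in n.
    return len(guard(phi, n))

print(invocation_cost("phi", 5))  # grows linearly with n
```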

 A Loophole for Self-Applicative Soundness discussion post by Alex Appel 18 days ago | Abram Demski likes this | 4 comments


post by Jessica Taylor 571 days ago | Ryan Carey, Patrick LaVictoire and Paul Christiano like this | 1 comment

Summary: in approximating a scheme like HCH, we would like some notion of “the best the prediction can be given available AI capabilities”. There’s a natural notion of “the best prediction of a human we should expect to get”. In general this doesn’t yield good predictions of HCH, but it does yield an HCH-like computation model that seems useful.

Note: I currently think that the basic picture of getting within $$\epsilon$$ of a good prediction is actually pretty sketchy. I wrote about the sample complexity here. In addition to the sample-complexity issues, the requirement is for predictors to be Bayes-optimal, but Bayes-optimality is not possible for bounded reasoners. This matters because, e.g., some adversarial predictor might make very good predictions on some subset of questions (because it’s spending its compute on those specifically), causing other predictors to be filtered out (if those questions are used to determine who the best predictor is). I don’t know what kind of analysis could get the $$\epsilon$$-accuracy result at this point.

 Doubts about Updatelessness discussion post by Alex Appel 50 days ago | Abram Demski likes this | 2 comments

Counterfactual mugging doesn’t require spoofing. Consider the following problem:

Suppose no one, given $$10^{5}$$ steps of computation, is able to compute any information about the parity of the $$10^{10}$$th digit of $$\pi$$, and everyone, given $$10^{100}$$ steps of computation, is able to compute the $$10^{10}$$th digit of $$\pi$$. Suppose that at time $$t$$, everyone has $$10^5$$ steps of computation, and at a later time $$t'$$, everyone has $$10^{100}$$ steps of computation. At the initial time $$t$$, Omega selects a probability $$p$$ equal to the conditional probability Omega assigns to the agent paying \$1 at time $$t'$$, conditional on the digit being odd. (This could be because Omega is a logical inductor, or because Omega is a CDT agent whose utility function is such that selecting this value of $$p$$ is optimal.) At time $$t'$$, if the digit is even, a fair coin with probability $$p$$ of coming up heads is flipped, and if it comes up heads, Omega pays the agent \$10. If instead the digit is odd, then the agent has the option of paying Omega \$1.

This contains no spoofing, and the optimal policy for the agent is to pay up if asked.
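A quick numeric check of that claim, assuming a 50-50 prior on the parity of the digit (the prior is an assumption; only the comparison matters):

```python
# Ex-ante value of the policy "pay $1 if asked" vs. "refuse".
# Omega sets p = P(agent pays | digit odd); committing to pay makes p = 1.

def expected_value(pays: bool) -> float:
    p = 1.0 if pays else 0.0   # Omega's estimate of paying, given odd
    even_branch = p * 10       # coin with bias p pays the agent $10
    odd_branch = -1 if pays else 0
    return 0.5 * even_branch + 0.5 * odd_branch

print(expected_value(True), expected_value(False))  # 4.5 0.0
```

So the committed payer nets \$4.50 in expectation versus \$0 for the refuser, which is why paying up is the optimal policy.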


What do you mean by “in full generality instead of the partial version attained by policy selection”?

 Maximally efficient agents will probably have an anti-daemon immune system discussion post by Jessica Taylor 485 days ago | Ryan Carey, Patrick LaVictoire and Scott Garrabrant like this | 1 comment

It seems relatively plausible that it’s “daemons all the way down,” and that a sophisticated agent from the daemon-distribution accepts this as the price of doing business (it loses value from being overtaken by its daemons, but gains the same amount of value on average from overtaking others). The main concern of such an agent would be defecting daemons that build anti-daemon immune systems, so that they can increase their influence by taking over parents while avoiding being taken over themselves. However, if we have a sufficiently competitive internal environment, then those defectors will be outcompeted anyway.

In this case, if we also have fractal immune systems causing log(complexity) overhead, then the orthogonality thesis is probably not true. The result would be that agents end up pursuing a “grand bargain” of whatever distribution of values efficient daemons have, rather than including a large component in the bargain for values like ours, and there would be no way for humans to subvert this directly (we may be able to subvert it indirectly by coordinating and then trading, i.e. only building an efficient but daemon-prone agent after confirming that daemon-values pay us enough to make it worth our while. But this kind of thing seems radically confusing and is unlikely to be sorted out by humans.) The process of internal value shifting amongst daemons would continue in some abstract sense, though they would eventually end up pursuing the convergent bargain of their values (in the same way that a hyperbolic discounter ends up behaving consistently after reflection).

I think this is the most likely way the orthogonality thesis could fail. When there was an Arbital poll on this question a few years ago, I had by far the lowest probability on the orthogonality thesis and was quite surprised by other commenters’ confidence.

Fortunately, even if there is logarithmic overhead, it currently looks quite unlikely to me that the constants are bad enough for this to be an unrecoverable problem for us today. But as you say, it would be a dealbreaker for any attempt to prove asymptotic efficiency.

by Johannes Treutlein 347 days ago | Tom Everitt, Alex Appel, Abram Demski and Paul Christiano like this | link | parent | on: Smoking Lesion Steelman

“From my perspective, I don’t think it’s been adequately established that we should prefer updateless CDT to updateless EDT”

I agree with this. It would be nice to have an example which doesn’t arise from an obviously bad agent design, but I don’t have one. I’d also be interested in finding such a problem.

I am not sure whether your smoking lesion steelman actually makes a decisive case against evidential decision theory. If an agent knows about their utility function on some level, but not on the epistemic level, then this can just as well be made into a counter-example to causal decision theory. For example, consider a decision problem with the following payoff matrix:

Smoke-lover:
• Smokes:
  • Killed: 10
  • Not killed: -90
• Doesn’t smoke:
  • Killed: 0
  • Not killed: 0

Non-smoke-lover:
• Smokes:
  • Killed: -100
  • Not killed: -100
• Doesn’t smoke:
  • Killed: 0
  • Not killed: 0

For some reason, the agent doesn’t care whether they live or die. Also, let’s say that smoking makes a smoke-lover happy, but afterwards, they get terribly sick and lose 100 utilons. So they would only smoke if they knew they were going to be killed afterwards. The non-smoke-lover doesn’t want to smoke in any case.

Now, smoke-loving evidential decision theorists rightly choose smoking: they know that robots with a non-smoke-loving utility function would never have any reason to smoke, no matter which probabilities they assign. So if they end up smoking, then this means they are certainly smoke-lovers. It follows that they will be killed, and conditional on that state, smoking gives 10 more utility than not smoking.

Causal decision theory, on the other hand, seems to recommend a suboptimal action. Let $$a_1$$ be smoking, $$a_2$$ not smoking, $$S_1$$ being a smoke-lover, and $$S_2$$ being a non-smoke-lover. Moreover, say the prior probability $$P(S_1)$$ is $$0.5$$.
Then, for a smoke-loving CDT bot, the expected utility of smoking is just $$\mathbb{E}[U|a_1]=P(S_1)\cdot U(S_1\wedge a_1)+P(S_2)\cdot U(S_2\wedge a_1)=0.5\cdot 10 + 0.5\cdot (-90) = -40$$, which is less than the certain $$0$$ utilons for $$a_2$$. Assigning a credence of around $$1$$ to $$P(S_1|a_1)$$, a smoke-loving EDT bot calculates $$\mathbb{E}[U|a_1]=P(S_1|a_1)\cdot U(S_1\wedge a_1)+P(S_2|a_1)\cdot U(S_2\wedge a_1)\approx 1 \cdot 10 + 0\cdot (-90) = 10$$, which is higher than the expected utility of $$a_2$$. The reason CDT fails here doesn’t seem to lie in a mistaken causal structure.

Also, I’m not sure whether the problem for EDT in the smoking lesion steelman is really that it can’t condition on all its inputs. If EDT can’t condition on something, then EDT doesn’t account for this information, but this doesn’t seem to be a problem per se. In my opinion, the problem lies in an inconsistency in the expected utility equations. Smoke-loving EDT bots calculate the probability of being a non-smoke-lover, but then the utility they get is actually the one from being a smoke-lover. For this reason, they can get some “back-handed” information about their own utility function from their actions. The agents basically fail to condition two factors of the same product on the same knowledge.

Say we don’t know our own utility function on an epistemic level. Ordinarily, we would calculate the expected utility of an action, both as smoke-lovers and as non-smoke-lovers, as follows: $$\mathbb{E}[U|a]=P(S_1|a)\cdot \mathbb{E}[U|S_1, a]+P(S_2|a)\cdot \mathbb{E}[U|S_2, a]$$, where, if $$U_{1}$$ ($$U_{2}$$) is the utility function of a smoke-lover (non-smoke-lover), $$\mathbb{E}[U|S_i, a]$$ is equal to $$\mathbb{E}[U_{i}|a]$$. In this case, we don’t get any information about our utility function from our own action, and hence, no Newcomb-like problem arises.
I’m unsure whether there is any causal decision theory derivative that gets my case (or all other possible cases in this setting) right. It seems like as long as the agent isn’t certain to be a smoke-lover from the start, there are still payoffs for which CDT would (wrongly) choose not to smoke.
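The two expected-utility computations above can be checked mechanically; this sketch just reproduces the arithmetic from the comment:

```python
# Payoffs for the smoke-lover (S1) and non-smoke-lover (S2) acting
# a1 = smoke, a2 = don't smoke, in the killed-branch relevant to each.
U = {("S1", "a1"): 10, ("S2", "a1"): -90, ("S1", "a2"): 0, ("S2", "a2"): 0}

# CDT: uses the prior P(S1) = 0.5, ignoring the evidence the action carries.
cdt_smoke = 0.5 * U[("S1", "a1")] + 0.5 * U[("S2", "a1")]

# EDT: conditions on the action; smoking is near-certain evidence of S1.
p = 1.0  # P(S1 | a1), approximately 1 as in the comment
edt_smoke = p * U[("S1", "a1")] + (1 - p) * U[("S2", "a1")]

print(cdt_smoke, edt_smoke)  # -40.0 10.0
```

So CDT evaluates smoking at -40 (worse than the certain 0 of not smoking) while EDT evaluates it at 10, exactly the divergence the comment describes.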
by Alex Appel 64 days ago | Abram Demski likes this | link | on: Smoking Lesion Steelman

I think that in that case, the agent shouldn’t smoke, and CDT is right, although there is side-channel information that can be used to come to the conclusion that the agent should smoke. Here’s a reframing of the provided payoff matrix that makes this argument clearer. (also, your problem as stated should have 0 utility for a nonsmoker imagining the situation where they smoke and get killed)

Let’s say that there is a kingdom which contains two types of people, good people and evil people, and a person doesn’t necessarily know which type they are. There is a magical sword enchanted with a heavenly aura, and if a good person wields the sword, it will guide them to do heroic things, for +10 utility (according to a good person) and 0 utility (according to an evil person). However, if an evil person wields the sword, it will afflict them for the rest of their life with extreme itchiness, for -100 utility (according to everyone).

good person’s utility estimates:

• takes sword
  • I’m good: 10
  • I’m evil: -90
• don’t take sword: 0

evil person’s utility estimates:

• takes sword
  • I’m good: 0
  • I’m evil: -100
• don’t take sword: 0

As you can clearly see, this is the exact same payoff matrix as the previous example. However, now it’s clear that if a (secretly good) CDT agent believes that most of society is evil, then it’s a bad idea to pick up the sword, because the agent is probably evil (according to the info they have) and will be tormented with itchiness for the rest of their life, and if it believes that most of society is good, then it’s a good idea to pick up the sword. Further, this situation is intuitively clear enough to argue that CDT just straight-up gets the right answer in this case.
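To make “believes that most of society is evil/good” quantitative: with the good person’s payoffs above and prior $$g$$ = P(I’m good), CDT’s expected utility of taking the sword crosses zero at $$g = 0.9$$ (a quick sketch; the threshold follows directly from the stated payoffs):

```python
# For the (secretly good) CDT agent: EU(take) = g*10 + (1-g)*(-90),
# EU(don't take) = 0, so taking the sword is better exactly when g > 0.9.

def eu_take(g: float) -> float:
    return g * 10 + (1 - g) * (-90)

assert eu_take(0.95) > 0   # mostly-good society: take the sword
assert eu_take(0.5) < 0    # mostly-evil (or 50-50) society: leave it
print(eu_take(0.9))        # ~0 -- the indifference point
```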

A human (with some degree of introspective power) in this case, could correctly reason “oh hey I just got a little warm fuzzy feeling upon thinking of the hypothetical where I wield the sword and it doesn’t curse me. This is evidence that I’m good, because an evil person would not have that response, so I can safely wield the sword and will do so”.

However, what the human is doing in this case is using side-channel information that isn’t present in the problem description. They’re directly experiencing sense data as a result of the utility calculation outputting 10 in that hypothetical, and updating on that. In a society where everyone was really terrible at introspection, so the only access anyone had to their decision algorithm was seeing their actual decision (and assuming no previous decision problems that good and evil people decide differently on, by which a good person could learn that they were good from their actions), it seems to me like there’s a very intuitively strong case for not picking up the sword/not smoking.

 No Constant Distribution Can be a Logical Inductor discussion post by Alex Appel 76 days ago | Sam Eisenstat, Abram Demski, Jessica Taylor and Stuart Armstrong like this | 1 comment

Two minor comments. First, the bitstrings that you use do not all correspond to worlds, since, for example, $$\rm{Con}(\rm{PA}+\rm{Con}(\rm{PA}))$$ implies $$\rm{Con}(\rm{PA})$$, as $$\rm{PA}$$ is a subtheory of $$\rm{PA} + \rm{Con}(\rm{PA})$$. This can be fixed by instead using a tree of sentences that all diagonalize against themselves. Tsvi and I used a construction in this spirit in A limit-computable, self-reflective distribution, for example.

Second, I believe that weakening #2 in this post also cannot be satisfied by any constant distribution. To sketch my reasoning, a trader can try to buy a sequence of sentences $$\phi_1, \phi_1 \wedge \phi_2, \dots$$, spending $$\$2^{-n}$$ on the $$n$$th sentence $$\phi_1 \wedge \dots \wedge \phi_n$$. It should choose the sequence of sentences so that $$\phi_1 \wedge \dots \wedge \phi_n$$ has probability at most $$2^{-n}$$, and then it will make an infinite amount of money if the sentences are simultaneously true.

The way to do this is to choose each $$\phi_n$$ from a list of all sentences. If at any point you notice that $$\phi_1 \wedge \dots \wedge \phi_n$$ has too high a probability, then pick a new sentence for $$\phi_n$$. We can sell all the conjunctions $$\phi_1 \wedge \dots \wedge \phi_k$$ for $$k \ge n$$ and get back the original amount paid, by hypothesis. Then, if we can keep using sharper continuous tests of the probabilities of the sentences $$\phi_1 \wedge \dots \wedge \phi_n$$ over time, we will settle down to a sequence with the desired property.

In order to turn this sketch into a proof, we need to be more careful about how these things are to be made continuous, but it seems routine.
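A back-of-the-envelope version of the budget argument (purely illustrative prices, here taken to be exactly $$2^{-n}$$): total spending stays under \$1, while the payoff, if all conjuncts are true, grows without bound in the number of sentences.

```python
# Spending $2^{-n} on the n-th conjunction: total outlay is bounded by $1,
# while -- if every conjunct is true -- each $2^{-n} buys at least one full
# share (price <= 2^{-n}) worth $1, so the payoff grows without bound.

def trade(prices):
    # prices[n-1] = market price of phi_1 & ... & phi_n, assumed <= 2^-n
    spent, shares = 0.0, 0.0
    for n, price in enumerate(prices, start=1):
        budget = 2.0 ** (-n)
        spent += budget
        shares += budget / price  # shares bought; >= 1 when price <= 2^-n
    return spent, shares

spent, payoff_if_true = trade([2.0 ** (-n) for n in range(1, 21)])
print(spent, payoff_if_true)  # spent < 1, payoff grows linearly (here 20.0)
```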

by Jessica Taylor 78 days ago | link | parent | on: Musings on Exploration

“The true reason to do exploration seems to be because the agent believes the action it is taking will not lead to an irreversible trap, and because it believes that the action will reveal information about the true environment that enables a better policy later on, which in expectation up to the time horizon, outweighs the temporary loss incurred due to exploring.”

My understanding of logical inductor exploration (e.g. in asymptotic decision theory) is that the exploration steps the agent learns from mostly don’t happen in its own lifetime; rather, they happen in the lifetimes of similar but simpler agents. This allows exploration to work for single-shot problems such as 5 and 10. Intuitively, if you are in a 5 and 10 problem and your brain has size 10^1000, then you can simulate someone whose brain has size 10^999 doing a 5 and 10 problem, and thereby learn the relation between the agent’s action and how much utility they get. So each particular agent has some chance of exploring irrecoverably, but in aggregate not many of them will (and it’s hard to predict which will and which won’t).

“As far as I can tell, the only strategy that doesn’t have some sort of targetable exploration behavior is Thompson sampling.”

Thompson sampling still randomizes (it randomizes its belief about the world it’s in) and is therefore vulnerable to troll bridge.
by Alex Appel 77 days ago | link | on: Musings on Exploration

A: While that is a really interesting point that I hadn’t spotted before, the standard formulation of exploration steps in logical inductor decision theory involves infinitely many exploration steps over all time. So even though an agent of this type could inductively learn from what other agents do in different decision problems in less time than it naively appears, that wouldn’t make it explore less.

B: What I intended with the remark about Thompson sampling was that troll bridge relies on there being two distinct causes of “attempting to cross the bridge”. One is crossing because you believe it to be the best action, and the other is crossing because an exploration step occurred, and Thompson sampling doesn’t have a split decision criterion like this. Although, now that you point it out, it is possible to make a Thompson sampling variant where the troll blows up the bridge when “crossing the bridge” is not the highest-ranked action.

 Musings on Exploration discussion post by Alex Appel 81 days ago | Vadim Kosoy likes this | 4 comments


by Vadim Kosoy 79 days ago | Alex Appel likes this | link | on: Musings on Exploration

Traps are a somewhat bigger issue than you seem to think, when you take acausal attack into account. Your prior contains low-complexity hypotheses that intentionally produce good predictions until some critical pivot point, at which they switch to something manipulative. So, every time you reach such a pivot point you are facing a decision with irreversible consequences, and there is no prior evidence to help you. Delegative learning gets around this by having the agent delegate precisely at the pivot point.

Even disregarding that, “Once the agent has figured out some of how the world works, most environments/hypotheses where there is a trap have evidential clues elsewhere to rule them out” is not quite true.

The notion of a “trap” is relative to the way you organize your uncertainty about the world. Saying that the environment might contain traps is saying that the class of environments you consider is unlearnable. However, the specification of a Bayesian agent only depends on the prior that you get by “averaging” the environments in this class. Different ways of decomposing the prior into hypotheses might yield learnable or unlearnable classes.

For example, consider an environment in which taking action A leads to heaven with probability 70% and hell with probability 30% whereas taking action B leads to heaven with probability 50% and hell with probability 50%. In this environment, taking action A is the better choice and there is no problem. However, if you decompose it into a mixture of deterministic environments then, from that perspective, you have a trap.
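The arithmetic of that example, plus the decomposition that manufactures the trap (the utilities $$\pm 1$$ for heaven/hell are an assumption; only the ordering matters):

```python
# Mixed environment: action A -> heaven w.p. 0.7, action B -> heaven
# w.p. 0.5. With heaven = +1 and hell = -1, A is simply the better action.
ev_A = 0.7 * 1 + 0.3 * (-1)   # ~0.4
ev_B = 0.5 * 1 + 0.5 * (-1)   # 0.0
assert ev_A > ev_B  # no trap at this level of description

# Decomposed into deterministic hypotheses (what A does x what B does),
# the same prior now contains hypotheses where choosing A leads to hell
# with certainty -- a "trap" created purely by the choice of decomposition.
hypotheses = [(a, b) for a in ("heaven", "hell") for b in ("heaven", "hell")]
print(ev_A, ev_B, len(hypotheses))
```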

To give a “realistic” example, imagine that we think quantum randomness is truly random, but actually there is an underlying theory which allows predicting quantum events deterministically. Then, a strategy that seems optimal from the perspective of an agent that only knows quantum theory might be “suicidal” (permanently sacrificing value) from the perspective of an agent that knows the deeper underlying theory.

As another example, imagine that (i) the only way to escape the heat death of our universe is by controlled vacuum collapse and (ii) because we don’t know in which string vacuum we are, there is no way to be certain about the outcome of a controlled vacuum collapse without high energy experiments that have a significant chance of triggering an uncontrolled vacuum collapse. AFAIK this situation is consistent with our knowledge of physics. So, if you consider the different string vacua to be different hypotheses, we are facing a trap. On the other hand, if you have some theory that gives you a probability distribution over these vacua then there is a well-defined optimal strategy.

The point is, it is probably not meaningful/realistic to claim that we can design an agent that will almost certainly successfully deal with all traps, but it is meaningful and realistic to claim that we can design an agent that will be optimal relative to our own posterior belief state (which is, more or less by definition, the sort of agent that it is a good idea to build).

The reason “explorative” algorithms such as PSRL (Posterior Sampling Reinforcement Learning) cannot be trivially replaced by the Bayes optimal policy, is that the Bayes optimal policy is (more) computationally intractable. For example, if you consider a finite set of finite MDP hypotheses then PSRL can be implemented in polynomial time but the Bayes optimal policy cannot (in fact I am not 100% sure about this, but I think that hole can be patched using the PCP theorem).

by Abram Demski 80 days ago | link | on: Musings on Exploration

I’m not convinced exploration doesn’t tile. Exploration steps would not be self-modified away if they’re actually useful/important, and if the agent can recognize this.

In the case of chicken-rule, it’s unclear how to make a decision at all without it. Plus, the exploration never needs to occur – though, subjectively, the agent can’t know that, so that doesn’t really affect the question of tiling. But the agent could see that removing the chicken step removes its ability to reason about the consequences of alternate actions.

I think it’s somewhat plausible that you can get something which needs chicken rule as a foundation, but which can decide not to use it in cases like troll bridge, because it is deciding whether to use the chicken rule for sensible reasons (where “sensible” includes the chicken rule itself).

Deciding to use the chicken rule is a transparent-Newcomb-like problem: your behavior in the case that you do see a proof of your own action affects your ability to reason in the case that you don’t.

The same things seem to apply to exploration.

Given our current level of understanding, chicken rule and true exploration seem closely analogous. However, it’s quite plausible that this will stop being the case with a better understanding. In particular, Sam recently pointed out to me that Löb’s theorem doesn’t go through for $$\Box \phi := \mathbb{P}(\phi) \geq 0.95$$. We have a decent picture of what tiling looks like in pure logical settings, but that’s shaped very highly by Löb’s theorem. So, tiling considerations for exploration could look very different from those for chicken-rule.

 A Difficulty With Density-Zero Exploration discussion post by Alex Appel 88 days ago | 1 comment

Update: This isn’t really an issue, you just need to impose an assumption that there is some function $$f$$ such that $$f(n)>n$$, and $$f(n)$$ is computable in time polynomial in $$f(n)$$, and you always find out whether exploration happened on turn $$f(n)$$ after $$\mathcal{O}(f(n+1))$$ days.

This is just the condition that there’s a subsequence where good feedback is possible, and is discussed significantly in section 4.3 of the logical induction paper.

If there’s a subsequence B (of your subsequence of interest, A) where you can get good feedback, then there are infinitely many exploration steps on subsequence B (and also on A, because it contains B).

This post is hereby deprecated. Still right, just not that relevant.

by Abram Demski 96 days ago | link | parent | on: Distributed Cooperation

Cool! I’m happy to see this written up finally. It’s been a big source of intuitions for me, so it’s good to see that the proof checks out.

A possible next step to all this is to try to specify proof-based DT agents which could play this game (or something similar) based on Löbian handshakes. (In fact, part of the original motivation was to try to bring the cooperative-oracle model closer to the Löb-based cooperation you can get in prisoner’s dilemma with visible source code.)

It’s unfortunate that you had to add the Pareto-improvement condition to the cell rank. That part seems especially unlikely to drop out of a more general decision theory.

I think I see another serious complication:

“Yes, not all points in $$C_{\vec{i}}$$ attain the same utility, but for a sufficiently small $$\epsilon$$, the cell is really small, and for any player, the utility over the cell is well-approximated by the utility attained at the middle point in the cell. So, for any desired utility-approximation accuracy $$\delta$$, you can choose $$\epsilon$$ sufficiently small to achieve it.”

But a Pareto optimum in the set of middle points can be arbitrarily worse for some player than any Pareto optimum of the full game; i.e., taking the midpoints can hide arbitrarily large Pareto improvements. For example, suppose that $$\delta = 0.001$$. A Pareto optimum of the midpoints might give the utility vector (2, 2, 3) for the three players. There could be another midpoint (100, 100, 2.9999999), very near where the true game contains a point (100, 100, 3). So, it seems the Pareto optimum of the game on midpoints which is found by the process in the post can be arbitrarily sub-optimal for all but one player, with no guarantee that this gets better as $$\epsilon$$ shrinks.
by Alex Appel 96 days ago | Abram Demski likes this | link | on: Distributed Cooperation

If you drop the Pareto-improvement condition from the cell rank, and just have “everyone sorts things by their own utility”, then you won’t necessarily get a Pareto-optimal outcome (within the set of cell center-points), but you will at least get a point where there are no strict Pareto improvements (no points that leave everyone better off).

The difference between the two is… let’s say we’ve got a 2-player, 2-move game whose outcomes, in utility-space, make some sort of quadrilateral. If the top and right edges join at 90 degrees, the Pareto frontier would be just the corner point, and the set of “no strict Pareto improvements” would be the top and right edges.

If that corner is obtuse, then both “Pareto frontier” and “no strict Pareto improvements” agree that both edges are within the set, and if the corner is acute, then both agree that only the corner is within the set. It actually isn’t much of a difference: it only manifests when the utilities for a player are exactly equal, and it is easily changed by a little bit of noise.

The utility-approximation issue you pointed out seems to be pointing towards the impossibility of guaranteeing limiting to a point on the Pareto frontier (when you make the cell size smaller and smaller), precisely because of that “this set is unstable under arbitrarily small noise” issue.

But the “set of all points that have no strict Pareto improvements by more than $$\delta$$ for all players”, i.e., the $$\delta$$-fuzzed version of the “set of points with no strict Pareto improvement”, does seem to be robust against a little bit of noise, and doesn’t require the Pareto-improvement condition on everyone’s ranking of cells.
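To make the corner example concrete, here is a small sketch (my own illustration; all names are mine) that computes, for a finite set of utility vectors, both the Pareto frontier and the set with no strict Pareto improvements (optionally $$\delta$$-fuzzed), on a right-angle corner like the one described above.

```python
def pareto_frontier(points):
    # Points not weakly dominated: no q that is >= everywhere and > somewhere.
    return [p for p in points
            if not any(all(qi >= pi for qi, pi in zip(q, p)) and
                       any(qi > pi for qi, pi in zip(q, p))
                       for q in points if q != p)]

def no_strict_improvement(points, delta=0.0):
    # Points with no strict Pareto improvement by more than delta:
    # no q that is better than p by more than delta for *every* player.
    return [p for p in points
            if not any(all(qi > pi + delta for qi, pi in zip(q, p))
                       for q in points)]

# Top edge and right edge of the feasible set, meeting at a 90-degree corner.
corner_game = [(0.0, 1.0), (0.5, 1.0), (1.0, 1.0), (1.0, 0.5), (1.0, 0.0)]

print(pareto_frontier(corner_game))        # just the corner: [(1.0, 1.0)]
print(no_strict_improvement(corner_game))  # both whole edges (all 5 points)
```

The Pareto frontier collapses to the corner, while the no-strict-improvement set keeps both edges, and taking `delta > 0` only enlarges the latter, which is what makes it stable under small noise.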

So I’m thinking that if that’s all we can attain (because of the complication you pointed out), then it lets us drop that inelegant Pareto-improvement condition.

I’ll work on the proof that for sufficiently small cell size $$\epsilon$$, you can get an outcome within $$\delta$$ of the set of “no strict Pareto improvements available”.

Nice job spotting that flaw.

Distributed Cooperation
post by Alex Appel 96 days ago | Abram Demski and Scott Garrabrant like this | 2 comments

Reflective oracles can be approximated by computing Nash equilibria. But is there some procedure that produces a Pareto-optimal equilibrium in a game, aka, a point produced by a Cooperative oracle? It turns out there is. There are some interesting philosophical aspects to it, which will be typed up in the next post.

The result is not original to me; it’s been floating around MIRI for a while. I think Scott, Sam, and Abram worked on it, but there might have been others. All I did was formalize it a bit and generalize from the 2-player 2-move case to the n-player n-move case. With the formalism here, it’s a bit hard to intuitively understand what’s going on, so I’ll indicate where to visualize an appropriate 3-dimensional object.


by 258 148 days ago | Alex Appel and Abram Demski like this | link | parent | on: In memoryless Cartesian environments, every UDT po...

Since Briggs [1] shows that EDT+SSA and CDT+SIA are both ex-ante-optimal policies in some class of cases, one might wonder whether the result of this post transfers to EDT+SSA. I.e., in memoryless POMDPs, is every (ex ante) optimal policy also consistent with EDT+SSA in a similar sense? I think it is, as I will try to show below.

Given some existing policy $$\pi$$, EDT+SSA recommends that upon receiving observation $$o$$ we should choose an action from $\arg\max_a \sum_{s_1,...,s_n} \sum_{i=1}^n SSA(s_i\text{ in }s_1,...,s_n\mid o, \pi_{o\rightarrow a})U(s_n).$ (For notational simplicity, I’ll assume that policies are deterministic, but, of course, actions may encode probability distributions.) Here, $$\pi_{o\rightarrow a}(o')=a$$ if $$o=o'$$ and $$\pi_{o\rightarrow a}(o')=\pi(o')$$ otherwise. $$SSA(s_i\text{ in }s_1,...,s_n\mid o, \pi_{o\rightarrow a})$$ is the SSA probability of being in state $$s_i$$ of the environment trajectory $$s_1,...,s_n$$ given the observation $$o$$ and the fact that one uses the policy $$\pi_{o\rightarrow a}$$.

The SSA probability $$SSA(s_i\text{ in }s_1,...,s_n\mid o, \pi_{o\rightarrow a})$$ is zero if $$m(s_i)\neq o$$ and $SSA(s_i\text{ in }s_1,...,s_n\mid o, \pi_{o\rightarrow a}) = P(s_1,...,s_n\mid \pi_{o\rightarrow a}) \frac{1}{\#(o,s_1,...,s_n)}$ otherwise. Here, $$\#(o,s_1,...,s_n)=\sum_{i=1}^n \left[ m(s_i)=o \right]$$ is the number of times $$o$$ occurs among the observations $$m(s_1),...,m(s_n)$$. Note that this is the minimal reference class version of SSA, also known as the double-halfer rule (because it assigns probability 1/2 to tails in the Sleeping Beauty problem and sticks with 1/2 if it’s told that it’s Monday).

Inserting this into the above, we get $\arg\max_a \sum_{s_1,...,s_n} \sum_{i=1}^n SSA(s_i\text{ in }s_1,...,s_n\mid o, \pi_{o\rightarrow a})U(s_n)=\arg\max_a \sum_{s_1,...,s_n\text{ with }o} \sum_{i=1...n, m(s_i)=o} \left( P(s_1,...,s_n\mid \pi_{o\rightarrow a}) \frac{1}{\#(o,s_1,...,s_n)} \right) U(s_n),$ where the first sum on the right-hand side is over all histories that give rise to observation $$o$$ at some point. Dividing by the number of agents with observation $$o$$ in a history and setting the policy for all of those agents at once cancel each other out, so this equals $\arg\max_a \sum_{s_1,...,s_n\text{ with }o} P(s_1,...,s_n\mid \pi_{o\rightarrow a}) U(s_n)=\arg\max_a \sum_{s_1,...,s_n} P(s_1,...,s_n\mid \pi_{o\rightarrow a}) U(s_n).$ Obviously, any optimal policy chooses in agreement with this. But the same disclaimers apply: multiple policies satisfy the right-hand side of this equation, and not all of these are optimal.

[1] Rachael Briggs (2010): Putting a value on Beauty. In Tamar Szabo Gendler and John Hawthorne, editors, Oxford Studies in Epistemology: Volume 3, pages 3–34. Oxford University Press. http://joelvelasco.net/teaching/3865/briggs10-puttingavalueonbeauty.pdf
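The minimal-reference-class SSA used above can be checked directly on Sleeping Beauty. The sketch below is my own toy encoding (function and variable names are mine, not from the comment): each trajectory carries a prior probability and a list of per-state observations, and the trajectory’s probability is split evenly among its states matching the observation.

```python
from fractions import Fraction

def ssa_probs(trajectories, obs):
    """Minimal-reference-class SSA. trajectories: list of
    (prior probability, list of per-state observations). Returns a dict
    (trajectory index, state index) -> probability of being that
    observer-moment, given that one observes obs."""
    weights = {}
    for t, (p, observations) in enumerate(trajectories):
        matches = [i for i, m in enumerate(observations) if m == obs]
        for i in matches:
            # P(trajectory) split evenly among its matching observer-moments
            weights[(t, i)] = p / len(matches)
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

half = Fraction(1, 2)

# Sleeping Beauty: heads -> one awakening, tails -> two awakenings.
# Before learning the day, every awakening looks the same ("awake"):
blurred = [(half, ["awake"]), (half, ["awake", "awake"])]
p_tails = sum(v for (t, _), v in ssa_probs(blurred, "awake").items() if t == 1)
assert p_tails == half  # the halfer answer

# After being told it's Monday, the reference class shrinks to Monday
# awakenings, and the answer *stays* 1/2: the double-halfer rule.
labeled = [(half, ["Mon"]), (half, ["Mon", "Tue"])]
p_tails_mon = sum(v for (t, _), v in ssa_probs(labeled, "Mon").items() if t == 1)
assert p_tails_mon == half
print(p_tails, p_tails_mon)  # 1/2 1/2
```

Note that shrinking the reference class on new information is exactly what makes this the double-halfer rule rather than the thirder (SIA) answer.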

Caveat: The version of EDT provided above only takes dependences between instances of EDT making the same observation into account. Other dependences are possible because different decision situations may be completely “isomorphic”/symmetric even if the observations are different. It turns out that the result is not valid once one takes such dependences into account, as shown by Conitzer [2]. I propose a possible solution in https://casparoesterheld.com/2017/10/22/a-behaviorist-approach-to-building-phenomenological-bridges/ . Roughly speaking, my solution is to identify with all objects in the world that are perfectly correlated with you. However, the underlying motivation is unrelated to Conitzer’s example.

[2] Vincent Conitzer: A Dutch Book against Sleeping Beauties Who Are Evidential Decision Theorists. Synthese, Volume 192, Issue 9, pp. 2887-2899, October 2015. https://arxiv.org/pdf/1705.03560.pdf

Stable Pointers to Value II: Environmental Goals
discussion post by Abram Demski 133 days ago | 1 comment

Delegative Reinforcement Learning solves this problem by keeping humans in the loop while preserving consequentialist reasoning. Of course, the theory is currently based on a lot of simplifications, and the ultimate learning protocol will probably look different, but I think that the basic mechanism (delegation combined with model-based reasoning) is sound.

Further Progress on a Bayesian Version of Logical Uncertainty
post by Alex Appel 141 days ago | Scott Garrabrant likes this | 1 comment

I’d like to credit Daniel Demski for helpful discussion.

Intermediate update:

The handwavy argument about how you’d get propositional inconsistency in the limit of imposing the constraint of “the string cannot contain $$a\wedge b\wedge c...\to\neg\phi$$ and $$a$$ and $$b$$ and… and $$\phi$$” is less clear than I thought. The problem is that, while the prior may learn that that constraint applies as it updates on more sentences, that particular constraint can get you into situations where adding either $$\phi$$ or $$\neg\phi$$ leads to a violation of the constraint.

So, running the prior far enough forward leads to the probability distribution being nearly certain that, while that particular constraint applied in the past, it will stop applying at some point in the future by vetoing both possible extensions of a string of sentences, and then less-constrained conditions will apply from that point forward.

On one hand, if you don’t have the computational resources to enforce full propositional consistency, it’s expected that most of the worlds you generate will be propositionally inconsistent, and midway through generating them you’ll realize that some of them are indeed propositionally inconsistent.

On the other hand, we want to be able to believe that constraints capable of painting themselves into a corner will apply to reality forevermore.

I’ll think about this a bit more. One possible line of attack is letting $$\mathbb{P}(\phi)$$ and $$\mathbb{P}(\neg\phi)$$ not add up to one, because it’s possible that the sentence-generating process will just stop cold before either of the two shows up, and then renormalizing them to sum to 1. But I’d have to check whether it’s still possible to $$\varepsilon$$-approximate the distribution if we introduce this renormalization, and to be honest, I wouldn’t be surprised if there was a more elegant way around this.

EDIT: yes, it’s still possible to $$\varepsilon$$-approximate the distribution in known time if you have $$\mathbb{P}(\phi)$$ refer to $$\frac{\text{probability of encountering }\phi\text{ first}}{1-\text{probability of halting first}}$$, although the bounds are really loose. This is because, if most of the execution paths involve halting before the sentence is sampled, an $$\varepsilon$$-error in the probability of sampling $$\phi$$ first will get blown up by the small denominator.
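As a toy numeric sketch of that renormalization and the blow-up (my own numbers, purely illustrative): suppose a run encounters $$\phi$$ first with probability 0.004, $$\neg\phi$$ first with probability 0.006, and halts first with probability 0.99.

```python
p_phi_first, p_not_phi_first, p_halt_first = 0.004, 0.006, 0.99

# Renormalize by the probability of not halting first:
p_phi = p_phi_first / (1 - p_halt_first)          # approx. 0.4
p_not_phi = p_not_phi_first / (1 - p_halt_first)  # approx. 0.6
assert abs(p_phi + p_not_phi - 1.0) < 1e-9        # they now sum to one

# The loose-bounds phenomenon: a small absolute error in estimating
# p_phi_first is amplified by the small denominator.
eps = 0.001
amplified = eps / (1 - p_halt_first)
print(amplified)  # approx. 0.1, i.e. a 100x blow-up
```

So an estimator that is accurate to within 0.001 on the raw sampling probabilities is only accurate to within about 0.1 after renormalization, which is why the approximation bounds degrade when halting dominates.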

Will type up the proof later, but it basically proceeds by looking at the probability mass associated with “sample the trivial constraint that accepts everything, and sample it again on each successive round”, because this slice of probability mass has a near-guarantee of hitting $$\phi$$, and then showing that even this tiny slice has substantially more probability mass than the cumulative probability of ever sampling a really rare sentence or not hitting any of $$\phi$$, $$\neg\phi$$, or the string terminating.

In memoryless Cartesian environments, every UDT policy is a CDT+SIA policy
post by Jessica Taylor 741 days ago | Vadim Kosoy and Abram Demski like this | 3 comments

Summary: I define memoryless Cartesian environments (which can model many familiar decision problems), note the similarity to memoryless POMDPs, and define a local optimality condition for policies, which can be roughly stated as “the policy is consistent with maximizing expected utility using CDT and subjective probabilities derived from SIA”. I show that this local optimality condition is necessary but not sufficient for global optimality (UDT).


Logical counterfactuals and differential privacy
post by Nisan Stiennon 151 days ago | Abram Demski and Scott Garrabrant like this | 1 comment

This idea was informed by discussions with Abram Demski, Scott Garrabrant, and the MIRIchi discussion group.

This doesn’t quite work. The theorem and examples only work if you maximize the unconditional mutual information, $$I(X;Y)$$, not $$I(X;Y\mid A)$$. And the choice of $$X$$ is doing a lot of work — it’s not enough to make it “sufficiently rich”.

An Untrollable Mathematician
post by Abram Demski 150 days ago | Alex Appel, Sam Eisenstat, Vadim Kosoy, Jack Gallagher, Jessica Taylor, Paul Christiano, Scott Garrabrant and Vladimir Slepnev like this | 1 comment

Follow-up to All Mathematicians are Trollable.

It is relatively easy to see that no computable Bayesian prior on logic can converge to a single coherent probability distribution as we update it on logical statements. Furthermore, the non-convergence behavior is about as bad as could be: someone selecting the ordering of provable statements to update on can drive the Bayesian’s beliefs arbitrarily up or down, arbitrarily many times, despite only saying true things. I called this wild non-convergence behavior “trollability”. Previously, I showed that if the Bayesian updates on the provability of a sentence rather than updating on the sentence itself, it is still trollable. I left open the question of whether some other side information could save us. Sam Eisenstat has closed this question, providing a simple logical prior and a way of doing a Bayesian update on it which (1) cannot be trolled, and (2) converges to a coherent distribution.

by Sam Eisenstat 149 days ago | Abram Demski likes this | link | on: An Untrollable Mathematician

I at first didn’t understand your argument for claim (2), so I wrote an alternate proof that’s a bit more obvious/careful. I now see why it works, but I’ll give my version below for anyone interested. In any case, what you really mean is the probability of deciding a sentence outside of $$\Phi$$ by having it announced by nature; there may be a high probability of sentences being decided indirectly via sentences in $$\Phi$$.

Instead of choosing $$\Phi$$ as you describe, pick $$\Phi$$ so that the probability $$\mu(\Phi)$$ of sampling something in $$\Phi$$ is greater than $$1 - \mu(\psi) \cdot \varepsilon / 2$$. Then, the probability of sampling something in $$\Phi - \{\psi\}$$ is at least $$1 - \mu(\psi) \cdot (1 + \varepsilon / 2)$$. Hence, no matter what sentences have been decided already, the probability that repeatedly sampling from $$\mu$$ selects $$\psi$$ before it selects any sentence outside of $$\Phi$$ is at least

\begin{align*} \sum_{k = 0}^\infty (1 - \mu(\psi) \cdot (1 + \varepsilon / 2))^k \cdot \mu(\psi) & = \frac{\mu(\psi)}{\mu(\psi) \cdot (1 + \varepsilon / 2)} \\ & > 1 - \varepsilon / 2 \end{align*}

as desired.
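The displayed bound is just the geometric series $$\sum_{k} r^k = \frac{1}{1-r}$$ with $$r = 1 - \mu(\psi)(1+\varepsilon/2)$$. A quick numeric sanity check (my own, with arbitrary illustrative values of $$\mu(\psi)$$ and $$\varepsilon$$):

```python
mu_psi, eps = 0.01, 0.1

r = 1 - mu_psi * (1 + eps / 2)           # per-round continuation probability
partial = sum(r**k * mu_psi for k in range(10000))
closed = mu_psi / (mu_psi * (1 + eps / 2))  # = 1 / (1 + eps/2)

assert abs(partial - closed) < 1e-9  # series matches the closed form
assert closed > 1 - eps / 2          # and exceeds 1 - eps/2, as claimed
```

The second assertion holds for any $$\varepsilon > 0$$, since $$\frac{1}{1+\varepsilon/2} > 1 - \varepsilon/2$$ whenever $$(1-\varepsilon/2)(1+\varepsilon/2) = 1 - \varepsilon^2/4 < 1$$.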

Furthermore, this argument makes it clear that the probability distribution we converge to depends only on the set of sentences which the environment will eventually assert, not on their ordering!

Oh, I didn’t notice that aspect of things. That’s pretty cool.

by Abram Demski 162 days ago | link | parent | on: The set of Logical Inductors is not Convex

This uses logical inductors of distinctly different strengths. I wonder if there’s some kind of convexity result for logical inductors which can see each other? Suppose traders in $$\mathbb{P}_n$$ have access to $$\mathbb{P}'_n$$ and vice versa. Or perhaps just assume that the markets cannot be arbitrarily exploited by such traders. Then, are linear combinations also logical inductors?

This is somewhat related to what I wrote about here. If you consider only what I call convex gamblers/traders and fix some weighting (“prior”) over the gamblers, then there is a natural convex set of dominant forecasters (for each history, it is the set of minima of some convex function on $$\Delta\mathcal{O}^\omega$$).

The set of Logical Inductors is not Convex
post by Scott Garrabrant 633 days ago | Sam Eisenstat, Abram Demski and Patrick LaVictoire like this | 3 comments

Sam Eisenstat asked the following interesting question: Given two logical inductors over the same deductive process, is every (rational) convex combination of them also a logical inductor? Surprisingly, the answer is no! Here is my counterexample.

