1. An Untrollable Mathematician
post by Abram Demski 510 days ago | Alex Appel, Sam Eisenstat, Vanessa Kosoy, Jack Gallagher, Jessica Taylor, Paul Christiano, Scott Garrabrant and Vladimir Slepnev like this | 1 comment

Follow-up to All Mathematicians are Trollable.

It is relatively easy to see that no computable Bayesian prior on logic can converge to a single coherent probability distribution as we update it on logical statements. Furthermore, the non-convergence behavior is about as bad as it could be: someone selecting the ordering of provable statements to update on can drive the Bayesian’s beliefs arbitrarily up or down, arbitrarily many times, despite only saying true things. I called this wild non-convergence behavior “trollability”. Previously, I showed that if the Bayesian updates on the provability of a sentence rather than on the sentence itself, it is still trollable. I left open the question of whether some other side information could save us. Sam Eisenstat has closed this question, providing a simple logical prior and a way of doing a Bayesian update on it which (1) cannot be trolled, and (2) converges to a coherent distribution.
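The ordering-dependence can be seen in a tiny toy model (the setup below is my illustration, not the post's construction): a Bayesian prior over the four truth assignments to two sentences A and B, conditioned on true disjunctions whose order a troll chooses.

```python
from fractions import Fraction

# Toy illustration of trolling a Bayesian on logic: a uniform prior over
# the four truth assignments to sentences A and B. The troll asserts only
# true statements, yet the order it chooses drives P(A) up and back down.
worlds = {(a, b): Fraction(1, 4) for a in (True, False) for b in (True, False)}
# Take the actual world to be (A=True, B=True), so both assertions below hold.

def prob_A(w):
    return sum(p for (a, _), p in w.items() if a) / sum(w.values())

def update(w, statement):
    # Bayesian conditioning: keep only the worlds where the statement is true.
    return {s: p for s, p in w.items() if statement(s)}

print(prob_A(worlds))                                  # 1/2 with no evidence
worlds = update(worlds, lambda s: s[0] or s[1])        # troll asserts "A or B"
print(prob_A(worlds))                                  # 2/3: P(A) rises
worlds = update(worlds, lambda s: (not s[0]) or s[1])  # asserts "not-A or B"
print(prob_A(worlds))                                  # 1/2: P(A) falls again
```

A troll can alternate such disjunctions indefinitely; Eisenstat's construction is precisely a prior and update rule for which no such ordering causes oscillation.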

2. Autopoietic systems and difficulty of AGI alignment
post by Jessica Taylor 668 days ago | Ryan Carey, Owen Cotton-Barratt and Paul Christiano like this | 13 comments

I have recently come to the opinion that AGI alignment is probably extremely hard. But it’s not clear exactly what AGI or AGI alignment are. And there are some forms of alignment of “AI” systems that are easy. Here I operationalize “AGI” and “AGI alignment” in some different ways and evaluate their difficulties.

3. Density Zero Exploration
post by Alex Mennen 669 days ago | Abram Demski, Paul Christiano and Scott Garrabrant like this | discuss

The idea here is due to Scott Garrabrant. All I did was write it.

4. Logical Induction with incomputable sequences
post by Alex Mennen 669 days ago | Abram Demski, Paul Christiano and Scott Garrabrant like this | discuss

In the definition of a logical inductor, the deductive process is required to be computable. This, of course, does not allow the logical inductor to use randomness, or predict uncomputable sequences. The way traders were defined in the logical induction paper, this was necessary, because the traders were not given access to the output of the deductive process.

5. The Three Levels of Goodhart's Curse
post by Scott Garrabrant 669 days ago | Vanessa Kosoy, Abram Demski and Paul Christiano like this | 2 comments

Note: I now consider this post deprecated and instead recommend this updated version.

Goodhart’s curse is a neologism coined by Eliezer Yudkowsky, stating that “neutrally optimizing a proxy measure U of V seeks out upward divergence of U from V.” It is related to many nearby concepts (e.g. the tails come apart, the winner’s curse, the optimizer’s curse, regression to the mean, overfitting, edge instantiation, Goodhart’s law). I claim that there are three main mechanisms through which Goodhart’s curse operates.
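The regressional mechanism can be demonstrated numerically in a few lines (a minimal sketch of my own, not taken from the post): if the proxy is U = V plus independent noise, selecting hard on U systematically finds points where U overshoots V, even though U is unbiased on average.

```python
import random

# Regressional Goodhart in miniature: the proxy U = V + noise is unbiased,
# but the argmax of U is a point where the noise happened to be large.
random.seed(0)
samples = [(v + random.gauss(0, 1), v)          # (proxy U, true value V)
           for v in (random.gauss(0, 1) for _ in range(10_000))]

best_u, v_at_best = max(samples)                # optimize the proxy
print(best_u - v_at_best)                       # upward divergence of U from V

avg_gap = sum(u - v for u, v in samples) / len(samples)
print(avg_gap)                                  # near 0 without selection
```

Without selection the gap averages out to roughly zero; conditioning on being the maximizer of U is exactly what produces the divergence.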

6. Current thoughts on Paul Christiano's research agenda
post by Jessica Taylor 700 days ago | Ryan Carey, Owen Cotton-Barratt, Sam Eisenstat, Paul Christiano, Stuart Armstrong and Wei Dai like this | 15 comments

This post summarizes my thoughts on Paul Christiano’s agenda in general and ALBA in particular.

7. Smoking Lesion Steelman
post by Abram Demski 715 days ago | Tom Everitt, Sam Eisenstat, Vanessa Kosoy, Paul Christiano and Scott Garrabrant like this | 10 comments

It seems plausible to me that any example I’ve seen so far which seems to require causal/counterfactual reasoning is more properly solved by taking the right updateless perspective, and taking the action or policy which achieves maximum expected utility from that perspective. If this were the right view, then the aim would be to construct something like updateless EDT.

I give a variant of the smoking lesion problem which overcomes an objection to the classic smoking lesion, and which is solved correctly by CDT, but which is not solved by updateless EDT.

8. Where's the first benign agent?
link by Jacob Kopczynski 793 days ago | Patrick LaVictoire and Paul Christiano like this | 15 comments

9. On motivations for MIRI's highly reliable agent design research
post by Jessica Taylor 874 days ago | Ryan Carey, Sam Eisenstat, Daniel Dewey, Nate Soares, Patrick LaVictoire, Paul Christiano, Tsvi Benson-Tilsen and Vladimir Nesov like this | 10 comments

(this post came out of a conversation between me and Owen Cotton-Barratt, plus a follow-up conversation with Nate)

10. Towards learning incomplete models using inner prediction markets
discussion post by Vanessa Kosoy 890 days ago | Jessica Taylor and Paul Christiano like this | 4 comments

11. Pursuing convergent instrumental subgoals on the user's behalf doesn't always require good priors
discussion post by Jessica Taylor 899 days ago | Daniel Dewey, Paul Christiano and Stuart Armstrong like this | 9 comments

12. Predicting HCH using expert advice
post by Jessica Taylor 931 days ago | Ryan Carey, Patrick LaVictoire and Paul Christiano like this | 1 comment

Summary: in approximating a scheme like HCH, we would like some notion of “the best prediction possible given available AI capabilities”. There’s a natural notion of “the best prediction of a human we should expect to get”. In general this doesn’t yield good predictions of HCH, but it does yield an HCH-like computation model that seems useful.

13. postCDT: Decision Theory using post-selected Bayes nets
post by Scott Garrabrant 952 days ago | Ryan Carey, Patrick LaVictoire and Paul Christiano like this | 1 comment

The purpose of this post is to document a minor idea about a new type of decision theory that works using a Bayes net. This is not a concrete proposal, since I will give no insight on which Bayes net to use. I am not that excited by this proposal, but think it is worth writing up anyway.

14. Attacking the grain of truth problem using Bayes-Savage agents
discussion post by Vanessa Kosoy 970 days ago | Paul Christiano likes this | discuss

15. Asymptotic Decision Theory
link by Jack Gallagher 975 days ago | Abram Demski, Jessica Taylor, Patrick LaVictoire, Paul Christiano and Tsvi Benson-Tilsen like this | 2 comments

16. Online Learning 1: Bias-detecting online learners
post by Ryan Carey 983 days ago | Vanessa Kosoy, Jessica Taylor, Nate Soares and Paul Christiano like this | 7 comments

Note: This describes an idea of Jessica Taylor’s, and is the first of several posts about aspects of online learning.

17. What does it mean for correct operation to rely on transfer learning?
post by Jessica Taylor 1200 days ago | Daniel Dewey, Patrick LaVictoire, Paul Christiano and Stuart Armstrong like this | discuss

Summary: Some approaches to AI value alignment rely on transfer learning. I attempt to explain this idea more clearly.

18. A possible training procedure for human-imitators
discussion post by Jessica Taylor 1216 days ago | Patrick LaVictoire and Paul Christiano like this | 4 comments

19. Some work on connecting UDT and Reinforcement Learning
link by David Krueger 1277 days ago | Patrick LaVictoire and Paul Christiano like this | 5 comments

20. Existence of distributions that are expectation-reflective and know it
post by Tsvi Benson-Tilsen 1286 days ago | Kaya Stechly, Abram Demski, Jessica Taylor, Nate Soares and Paul Christiano like this | discuss

We prove the existence of a probability distribution over a theory $${T}$$ with the property that for certain definable quantities $${\varphi}$$, the expected value $${E}[{\ulcorner {\varphi}\urcorner}]$$ assigned by a definable expectation function $${E}$$ is accurate, i.e. it equals the actual expectation of $${\varphi}$$; and with the property that it assigns probability 1 to $${E}$$ behaving this way. This may be useful for self-verification, by allowing an agent to satisfy a reflective consistency property and at the same time believe that it and similar agents satisfy that property. Thanks to Sam Eisenstat for listening to an earlier version of this proof and pointing out a significant gap in the argument. The proof presented here has not been vetted yet.

21. A limit-computable, self-reflective distribution
post by Tsvi Benson-Tilsen 1310 days ago | Sam Eisenstat, Vanessa Kosoy, Abram Demski, Jessica Taylor, Nate Soares, Patrick LaVictoire, Paul Christiano and Scott Garrabrant like this | 1 comment

We present a $$\Delta_2$$-definable probability distribution $${\Psi}$$ that satisfies Christiano’s reflection schema for its own defining formula. The strategy is analogous to the chicken step employed by modal decision theory to obfuscate itself from the eyes of $${\mathsf{PA}}$$; we will prevent the base theory $${T}$$ from knowing much about $${\Psi}$$, so that $${\Psi}$$ can be coherent over $${T}$$ and also consistently believe in reflection statements. So, the method used here is technical and not fundamental, but it does at least show that limit-computable and reflective distributions exist. These results are due to Sam Eisenstat and me, and this post benefited greatly from extensive notes from Sam; any remaining errors are probably mine.

Prerequisites: we assume familiarity with Christiano’s original result and the methods used there. In particular, we will freely use Kakutani’s fixed point theorem. See Christiano et al.’s paper.
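For reference, the reflection schema in question, as I recall it from Christiano et al.'s paper (stated here for $${\Psi}$$; check the paper for the precise form), says that for every sentence $${\varphi}$$ and all rationals $$a < b$$:

$$a < {\Psi}({\ulcorner {\varphi} \urcorner}) < b \;\implies\; {\Psi}\left({\ulcorner a < {\Psi}({\ulcorner {\varphi} \urcorner}) < b \urcorner}\right) = 1.$$

That is, whenever the distribution's value on a sentence lies strictly inside a rational interval, the distribution is certain of that fact about itself.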

22. Multibit reflective oracles
discussion post by Benja Fallenstein 1605 days ago | Jessica Taylor, Nate Soares and Paul Christiano like this | discuss

23. Improving the modal UDT optimality result
discussion post by Benja Fallenstein 1666 days ago | Patrick LaVictoire and Paul Christiano like this | 2 comments

24. Exploiting EDT
post by Benja Fallenstein 1681 days ago | Ryan Carey, Abram Demski, Daniel Dewey, Nate Soares, Patrick LaVictoire and Paul Christiano like this | 9 comments

The problem with EDT is, as David Lewis put it, its “irrational policy of managing the news” (Lewis, 1981): it chooses actions not only because of their effects on the world, but also because of what the fact that it’s taking these actions tells it about events the agent can’t affect at all. The canonical example is the smoking lesion problem.

I’ve long been uncomfortable with the smoking lesion problem as the case against EDT, because an AI system would know its own utility function, and would therefore know whether or not it values “smoking” (presumably in the AI case it would be a different goal), and if it updates on this fact it would behave correctly in the smoking lesion. (This is an AI-centric version of the “tickle defense” of EDT.) Nate and I have come up with a variant I find much more convincing: a way to get EDT agents to pay you for managing the news for them, which works by the same mechanism that makes these agents one-box in Newcomb’s problem. (It’s a variation of the thought experiment in my LessWrong post on “the sin of updating when you can change whether you exist”.)
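The news-managing behavior can be made concrete with a naive-EDT calculation on the classic smoking lesion (the numbers below are illustrative and my own, not from the post): EDT conditions on its own action, so choosing to smoke counts as evidence for the lesion even though it has no causal effect on it.

```python
from fractions import Fraction

# Naive EDT on the smoking lesion: a lesion both causes cancer and
# inclines the agent to smoke; smoking itself is causally harmless.
P_lesion = Fraction(1, 2)
P_smoke_given = {True: Fraction(9, 10), False: Fraction(1, 10)}  # P(smoke | lesion)
U_smoke, U_cancer = 10, -100   # smoking is enjoyable; cancer is very bad

def edt_eu(smoke):
    # EDT conditions on the action: compute P(lesion | action) by Bayes' rule.
    joint = lambda lesion: (P_lesion if lesion else 1 - P_lesion) * \
        (P_smoke_given[lesion] if smoke else 1 - P_smoke_given[lesion])
    p_lesion_given_action = joint(True) / (joint(True) + joint(False))
    return (U_smoke if smoke else 0) + p_lesion_given_action * U_cancer

print(edt_eu(True))   # -80: smoking is "bad news" about the lesion
print(edt_eu(False))  # -10: so EDT refrains, managing the news
# A CDT agent holds P(lesion) fixed at 1/2 regardless of action, computes
# 10 - 50 vs. 0 - 50, and smokes, since smoking cannot cause the lesion.
```

The tickle defense sketched above breaks this calculation: an agent that already knows its own utility function updates on that knowledge instead of on the action, which is why the post needs a different thought experiment to exhibit exploitable news-managing.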
