1. No Constant Distribution Can be a Logical Inductor discussion post by Alex Appel 129 days ago | Sam Eisenstat, Vadim Kosoy, Abram Demski, Jessica Taylor and Stuart Armstrong like this | 1 comment
2. Musings on Exploration discussion post by Alex Appel 134 days ago | Vadim Kosoy likes this | 4 comments
3. An Untrollable Mathematician
post by Abram Demski 203 days ago | Alex Appel, Sam Eisenstat, Vadim Kosoy, Jack Gallagher, Jessica Taylor, Paul Christiano, Scott Garrabrant and Vladimir Slepnev like this | 1 comment

Follow-up to All Mathematicians are Trollable.

It is relatively easy to see that no computable Bayesian prior on logic can converge to a single coherent probability distribution as we update it on logical statements. Furthermore, the non-convergence behavior is about as bad as could be: someone selecting the ordering of provable statements to update on can drive the Bayesian’s beliefs arbitrarily up or down, arbitrarily many times, despite only saying true things. I called this wild non-convergence behavior “trollability”. Previously, I showed that if the Bayesian updates on the provability of a sentence rather than updating on the sentence itself, it is still trollable. I left open the question of whether some other side information could save us. Sam Eisenstat has closed this question, providing a simple logical prior and a way of doing a Bayesian update on it which (1) cannot be trolled, and (2) converges to a coherent distribution.
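To make the trolling mechanism concrete, here is a toy sketch (my own illustration, not Sam’s construction): the prior treats a target sentence $$A$$ and auxiliary sentences $$X_{i}$$ as independent fair coins, and we assume $$A$$ and every $$X_{i}$$ are in fact true, so the troll only ever asserts truths. Alternately revealing the disjunction $$A\vee X_{i}$$ and then $$X_{i}$$ itself bounces the posterior on $$A$$ between 2/3 and 1/2 forever.

```python
from fractions import Fraction
from itertools import product

def trolled_beliefs(n_rounds=3):
    # Atom 0 is the target sentence A; atoms 1..n_rounds are X_1..X_n.
    # Prior: uniform over truth assignments (each atom independently 1/2).
    worlds = list(product([False, True], repeat=1 + n_rounds))
    constraints = []  # each revealed (true) statement, as a predicate on worlds

    def p_A():
        live = [w for w in worlds if all(c(w) for c in constraints)]
        return Fraction(sum(w[0] for w in live), len(live))

    beliefs = []
    for i in range(1, n_rounds + 1):
        # The troll reveals the true disjunction (A or X_i): belief in A rises.
        constraints.append(lambda w, i=i: w[0] or w[i])
        beliefs.append(p_A())
        # Then reveals the true sentence X_i: the boost evaporates.
        constraints.append(lambda w, i=i: w[i])
        beliefs.append(p_A())
    return beliefs

print(trolled_beliefs())  # alternates: 2/3, 1/2, 2/3, 1/2, ...
```

With fresh $$X_{i}$$ available at every round, the oscillation never damps, which is the non-convergence the excerpt describes.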

4. Reflective oracles as a solution to the converse Lawvere problem
post by Sam Eisenstat 271 days ago | Alex Mennen, Alex Appel, Vadim Kosoy, Abram Demski, Jessica Taylor, Scott Garrabrant and Vladimir Slepnev like this | discuss

1 Introduction

Before the work of Turing, one could justifiably be skeptical of the idea of a universal computable function. After all, there is no computable function $$f\colon\mathbb{N}\times\mathbb{N}\to\mathbb{N}$$ such that for all computable $$g\colon\mathbb{N}\to\mathbb{N}$$ there is some index $$i_{g}$$ such that $$f\left(i_{g},n\right)=g\left(n\right)$$ for all $$n$$. If there were, we could pick $$g\left(n\right)=f\left(n,n\right)+1$$, and then $g\left(i_{g}\right)=f\left(i_{g},i_{g}\right)+1=g\left(i_{g}\right)+1,$ a contradiction. Of course, universal Turing machines don’t run into this obstacle; as Gödel put it, “By a kind of miracle it is not necessary to distinguish orders, and the diagonal procedure does not lead outside the defined notion.” [1]

The miracle of Turing machines is that there is a partial computable function $$f\colon\mathbb{N}\times\mathbb{N}\to\mathbb{N}\cup\left\{ \bot\right\}$$ such that for all partial computable $$g\colon\mathbb{N}\to\mathbb{N}\cup\left\{ \bot\right\}$$ there is an index $$i$$ such that $$f\left(i,n\right)=g\left(n\right)$$ for all $$n$$. Here, we look at a different “miracle”, that of reflective oracles [2,3]. As we will see in Theorem 1, given a reflective oracle $$O$$, there is a (stochastic) $$O$$-computable function $$f\colon\mathbb{N}\times\mathbb{N}\to\left\{ 0,1\right\}$$ such that for any (stochastic) $$O$$-computable function $$g\colon\mathbb{N}\to\left\{ 0,1\right\}$$, there is some index $$i$$ such that $$f\left(i,n\right)$$ and $$g\left(n\right)$$ have the same distribution for all $$n$$. This existence theorem seems to skirt even closer to the contradiction mentioned above.
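A toy model of the contrast (a hypothetical Python sketch, not from the paper): with partial functions, placing the diagonal program at its own index produces divergence, modeled here by unbounded recursion, rather than a contradiction.

```python
# A toy "universal function": programs are entries in a table, and
# f(i, n) runs program i on input n (a stand-in for a universal machine).
programs = []

def f(i, n):
    return programs[i](n)

# The diagonal program g(n) = f(n, n) + 1 from the argument above.
def g(n):
    return f(n, n) + 1

programs.append(lambda n: n * 2)  # index 0: an ordinary program
programs.append(g)                # index 1: g placed at its own index

print(f(0, 21))  # ordinary programs run fine: prints 42

# g(1) = f(1, 1) + 1 = g(1) + 1 never returns; the infinite recursion
# models divergence, which is how partial functions evade the paradox.
try:
    f(1, 1)
except RecursionError:
    print("g at its own index diverges; no contradiction arises")
```

The total-function version of the argument has no such escape hatch, which is exactly the contradiction derived in the introduction.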

We use this idea to answer “in spirit” the converse Lawvere problem posed in [4]. These methods also generalize to prove a similar analogue of the ubiquitous converse Lawvere problem from [5]. The original questions, stated in terms of topology, remain open, but I find that the model proposed here, using computability, is equally satisfying from the point of view of studying reflective agents. Those references can be consulted for more motivation on these problems from the perspective of reflective agency.

Section 3 proves the main lemma and the converse Lawvere theorem for reflective oracles. In section 4, we use that to give a (circular) proof of Brouwer’s fixed point theorem, as mentioned in [4]. In section 5, we prove the ubiquitous converse Lawvere theorem for reflective oracles.

5. Announcing the AI Alignment Prize link by Vladimir Slepnev 283 days ago | Vadim Kosoy likes this | discuss
6. Hyperreal Brouwer
post by Scott Garrabrant 313 days ago | Vadim Kosoy and Stuart Armstrong like this | 2 comments

This post explains how to view Kakutani’s fixed point theorem as a special case of Brouwer’s fixed point theorem with hyperreal numbers. This post is just math intuitions, but I found them useful in thinking about Kakutani’s fixed point theorem and many things in agent foundations. This came out of conversations with Sam Eisenstat.

7. Funding opportunity for AI alignment research link by Paul Christiano 352 days ago | Vadim Kosoy likes this | 3 comments
8. The Three Levels of Goodhart's Curse
post by Scott Garrabrant 363 days ago | Vadim Kosoy, Abram Demski and Paul Christiano like this | 2 comments

Note: I now consider this post deprecated and instead recommend this updated version.

Goodhart’s curse is a neologism coined by Eliezer Yudkowsky for the observation that “neutrally optimizing a proxy measure U of V seeks out upward divergence of U from V.” It is related to many nearby concepts (e.g. the tails come apart, winner’s curse, optimizer’s curse, regression to the mean, overfitting, edge instantiation, Goodhart’s law). I claim that there are three main mechanisms through which Goodhart’s curse operates.
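A minimal numerical sketch of the regressional mechanism (my illustration, not from the post): if the proxy is $$U = V + \text{noise}$$, then selecting hard on $$U$$ systematically picks points whose $$U$$ overstates $$V$$.

```python
import random

random.seed(0)

# Proxy U = V + noise: optimizing U selects for upward noise as well as
# for high V, so U diverges upward from V at the selected optimum.
points = []
for _ in range(10_000):
    v = random.gauss(0, 1)
    points.append((v, v + random.gauss(0, 1)))

best_v, best_u = max(points, key=lambda p: p[1])
print(f"selected by proxy: U = {best_u:.2f}, true value V = {best_v:.2f}")
```

Under this selection pressure the proxy value of the winner almost always exceeds its true value, which is the upward divergence the quote describes.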

9. Open Problems Regarding Counterfactuals: An Introduction For Beginners link by Alex Appel 393 days ago | Vadim Kosoy, Tsvi Benson-Tilsen, Vladimir Nesov and Wei Dai like this | 2 comments
10. Smoking Lesion Steelman
post by Abram Demski 409 days ago | Tom Everitt, Sam Eisenstat, Vadim Kosoy, Paul Christiano and Scott Garrabrant like this | 10 comments

It seems plausible to me that any example I’ve seen so far which seems to require causal/counterfactual reasoning is more properly solved by taking the right updateless perspective, and taking the action or policy which achieves maximum expected utility from that perspective. If this were the right view, then the aim would be to construct something like updateless EDT.

I give a variant of the smoking lesion problem which overcomes an objection to the classic smoking lesion, and which is solved correctly by CDT, but which is not solved by updateless EDT.

11. A cheating approach to the tiling agents problem
post by Vladimir Slepnev 410 days ago | Alex Mennen, Vadim Kosoy and Abram Demski like this | 2 comments

(This post resulted from a conversation with Wei Dai.)

Formalizing the tiling agents problem is very delicate. In this post I’ll show a toy problem and a solution to it, which arguably meets all the desiderata stated before, but only by cheating in a new and unusual way.

Here’s a summary of the toy problem: we ask an agent to solve a difficult math question and also design a successor agent. Then the successor must solve another math question and design its own successor, and so on. The questions get harder each time, so they can’t all be solved in advance, and each of them requires believing in Peano arithmetic (PA). This goes on for a fixed number of rounds, and the final reward is the number of correct answers.

Moreover, we will demand that the agent must handle both subtasks (solving the math question and designing the successor) using the same logic. Finally, we will demand that the agent be able to reproduce itself on each round, not just design a custom-made successor that solves the math question with PA and reproduces itself by quining.

12. Loebian cooperation in the tiling agents problem
post by Vladimir Slepnev 416 days ago | Alex Mennen, Vadim Kosoy, Abram Demski, Patrick LaVictoire and Stuart Armstrong like this | 4 comments

The tiling agents problem is about formalizing how AIs can create successor AIs that are at least as smart. Here’s a toy model I came up with, which is similar to Benya’s old model but simpler. A computer program X is asked one of two questions:

• Would you like some chocolate?

• Here’s the source code of another program Y. Do you accept it as your successor?
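As a deliberately cheap baseline for this toy model (my own illustration, not from the post): an agent that answers the first question directly and accepts a successor only if the successor’s declared source matches its own, i.e. the “reproduces itself by quining” behavior that entry 11 above asks us to move beyond.

```python
# Hypothetical self-description; a real quining agent would construct this
# from its own code rather than store it as a constant.
AGENT_SOURCE = "answer chocolate with yes; accept successors with identical source"

def agent(question, successor_source=None):
    # Toy agent for the two-question model above. It accepts a successor
    # only if the successor's declared source matches its own -- the trivial
    # "tiling by quining" baseline, which proof-based acceptance criteria
    # (e.g. Loebian cooperation) aim to improve on.
    if question == "chocolate":
        return "yes"
    if question == "successor":
        return successor_source == AGENT_SOURCE
    raise ValueError(f"unknown question: {question}")

print(agent("chocolate"))                      # "yes"
print(agent("successor", AGENT_SOURCE))        # True: exact copy accepted
print(agent("successor", "while True: pass"))  # False: anything else rejected
```

The interesting question, which the post addresses, is how an agent can accept successors that are *not* exact copies without its trust degrading.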

13. Cooperative Oracles: Stratified Pareto Optima and Almost Stratified Pareto Optima
post by Scott Garrabrant 438 days ago | Vadim Kosoy, Patrick LaVictoire and Stuart Armstrong like this | 8 comments

In this post, we generalize the notions in Cooperative Oracles: Nonexploited Bargaining to deal with the possibility of introducing extra agents that have no control but have preferences. We further generalize this to infinitely many agents. (Part of the series started here.)

14. Cooperative Oracles: Nonexploited Bargaining
post by Scott Garrabrant 459 days ago | Vadim Kosoy, Jessica Taylor, Patrick LaVictoire and Stuart Armstrong like this | 6 comments

In this post, we formalize and generalize the phenomenon described in the Eliezer Yudkowsky post Cooperating with agents with different ideas of fairness, while resisting exploitation. (Part of the series started here.)

15. Finding reflective oracle distributions using a Kakutani map discussion post by Jessica Taylor 470 days ago | Vadim Kosoy likes this | discuss
16. A correlated analogue of reflective oracles
post by Jessica Taylor 470 days ago | Sam Eisenstat, Vadim Kosoy, Abram Demski and Scott Garrabrant like this | discuss

Summary: Reflective oracles correspond to Nash equilibria. A correlated version of reflective oracles exists and corresponds to correlated equilibria. The set of these objects is convex, which is useful.
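To unpack the equilibrium correspondence (standard game theory, not from the post): in the game of Chicken, the distribution putting 1/3 on each of (D,C), (C,D) and (C,C) is a classic correlated equilibrium that is not a mixture of Nash equilibria, and its incentive constraints can be checked directly.

```python
from fractions import Fraction

F = Fraction
# Game of Chicken, payoffs as (row, col); D = dare, C = chicken.
payoff = {("D", "D"): (0, 0), ("D", "C"): (7, 2),
          ("C", "D"): (2, 7), ("C", "C"): (6, 6)}
# A mediator draws one of three outcomes uniformly and privately
# recommends each player their own coordinate.
dist = {("D", "C"): F(1, 3), ("C", "D"): F(1, 3), ("C", "C"): F(1, 3)}

def is_correlated_eq():
    for player in (0, 1):
        for rec in "DC":
            alt = "C" if rec == "D" else "D"
            stay = dev = F(0)
            for (a, b), p in dist.items():
                mine, theirs = (a, b)[player], (a, b)[1 - player]
                if mine != rec:
                    continue
                # Expected payoff (unnormalized) of obeying vs deviating,
                # conditional on receiving recommendation `rec`.
                outcome = lambda m: (m, theirs) if player == 0 else (theirs, m)
                stay += p * payoff[outcome(rec)][player]
                dev += p * payoff[outcome(alt)][player]
            if dev > stay:
                return False
    return True

print(is_correlated_eq())  # True: no player gains by deviating
```

The set of such distributions is convex (a system of linear inequalities), which is the useful property the summary alludes to for the correlated analogue.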

17. Generalizing Foundations of Decision Theory II
post by Abram Demski 478 days ago | Sam Eisenstat, Vadim Kosoy, Jessica Taylor and Patrick LaVictoire like this | 4 comments

As promised in the previous post, I develop my formalism for justifying as many of the decision-theoretic axioms as possible with generalized dutch-book arguments. (I’ll use the term “generalized dutch-book” to refer to arguments with a family resemblance to dutch-book or money-pump.) The eventual goal is to relax these assumptions in a way which addresses bounded processing power, but for now the goal is to get as much of classical decision theory as possible justified by a generalized dutch-book.
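A standard money-pump illustration of the kind such generalized dutch-book arguments build on (my example, not from the post): an agent with cyclic strict preferences pays a small fee for each “upgrade” and cycles back to where it started, strictly poorer.

```python
# Toy money pump: cyclic preferences A > B > C > A make the agent
# exploitable by a sequence of fee-charging trades.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}  # (x, y) means x is strictly preferred to y

def run_pump(start, offers, fee=1):
    holding, wealth = start, 0
    for offer in offers:
        if (offer, holding) in prefers:  # agent strictly prefers the offer
            holding, wealth = offer, wealth - fee
    return holding, wealth

holding, wealth = run_pump("A", ["C", "B", "A"])
print(holding, wealth)  # back to holding A, but 3 units poorer
```

Ruling out exactly this kind of guaranteed loss is what forces the preference axioms (here, transitivity) in the dutch-book style of justification.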

18. Formal Open Problem in Decision Theory
post by Scott Garrabrant 502 days ago | Marcello Herreshoff, Sam Eisenstat, Vadim Kosoy, Jessica Taylor, Patrick LaVictoire and Stuart Armstrong like this | 13 comments

In this post, I present a new formal open problem. A positive answer would be valuable for decision theory research. A negative answer would be helpful, mostly for figuring out what is the closest we can get to a positive answer. I also give some motivation for the problem, and some partial progress.

Open Problem: Does there exist a topological space $$X$$ (in some convenient category of topological spaces) such that there exists a continuous surjection from $$X$$ to the space $$[0,1]^X$$ (of continuous functions from $$X$$ to $$[0,1]$$)?
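One reason the problem is delicate (a standard diagonal observation, not from the post itself): a continuous surjection $$s\colon X\to\left[0,1\right]^{X}$$ does not immediately yield a Cantor-style contradiction. Given any continuous $$d\colon\left[0,1\right]\to\left[0,1\right]$$, the diagonal map

$$h\left(x\right)=d\left(s\left(x\right)\left(x\right)\right)$$

is continuous, so by surjectivity $$h=s\left(x_{0}\right)$$ for some $$x_{0}$$, and then $$s\left(x_{0}\right)\left(x_{0}\right)=d\left(s\left(x_{0}\right)\left(x_{0}\right)\right)$$ is a fixed point of $$d$$. Since every continuous map of $$\left[0,1\right]$$ to itself has a fixed point, no contradiction arises; instead, such a space would give a (one-dimensional) proof of Brouwer’s fixed point theorem, matching the connection noted in the reflective-oracle post above.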

19. Generalizing Foundations of Decision Theory discussion post by Abram Demski 534 days ago | Ryan Carey, Vadim Kosoy, Jessica Taylor and Scott Garrabrant like this | 8 comments
20. Entangled Equilibria and the Twin Prisoners' Dilemma
post by Scott Garrabrant 549 days ago | Vadim Kosoy and Patrick LaVictoire like this | 2 comments

In this post, I present a generalization of Nash equilibria to non-CDT agents. I will use this formulation to model mutual cooperation in a twin prisoners’ dilemma, caused by the belief that the other player is similar to you, and not by mutual prediction. (This post came mostly out of a conversation with Sam Eisenstat, as well as contributions from Tsvi Benson-Tilsen and Jessica Taylor)

21. Neural nets designing neural nets link by Stuart Armstrong 573 days ago | Vadim Kosoy likes this | discuss
22. My current take on the Paul-MIRI disagreement on alignability of messy AI
post by Jessica Taylor 599 days ago | Ryan Carey, Vadim Kosoy, Daniel Dewey, Patrick LaVictoire, Scott Garrabrant and Stuart Armstrong like this | 40 comments

Paul Christiano and “MIRI” have disagreed on an important research question for a long time: should we focus research on aligning “messy” AGI (e.g. one found through gradient descent or brute force search) with human values, or on developing “principled” AGI (based on theories similar to Bayesian probability theory)? I’m going to present my current model of this disagreement and additional thoughts about it.

23. Uninfluenceable agents
post by Stuart Armstrong 614 days ago | Vadim Kosoy and Patrick LaVictoire like this | 7 comments

A putative new idea for AI control; index here.

After explaining biased learning processes, we can now define influenceable (and uninfluenceable) learning processes.

Recall that the (unbiased) influence problem arises from agents randomising their preferences, as a sort of artificial ‘learning’ process, when the real learning process is slow or incomplete.

24. The universal prior is malign link by Paul Christiano 622 days ago | Ryan Carey, Vadim Kosoy, Jessica Taylor and Patrick LaVictoire like this | 4 comments
25. Online Learning 3: Adversarial bandit learning with catastrophes
post by Ryan Carey 638 days ago | Vadim Kosoy and Patrick LaVictoire like this | discuss

Note: This describes an idea of Jessica Taylor’s.

In order to better understand how machine learning systems might avoid catastrophic behavior, we are interested in modeling this as an adversarial learning problem.
