1. Existence of distributions that are expectation-reflective and know it
post by Tsvi Benson-Tilsen 1315 days ago | Kaya Stechly, Abram Demski, Jessica Taylor, Nate Soares and Paul Christiano like this | discuss

We prove the existence of a probability distribution over a theory $${T}$$ with the property that for certain definable quantities $${\varphi}$$, the expectation of the value of a function $${E}[{\ulcorner {\varphi}\urcorner}]$$ is accurate, i.e. it equals the actual expectation of $${\varphi}$$; and with the property that it assigns probability 1 to $${E}$$ behaving this way. This may be useful for self-verification, by allowing an agent to satisfy a reflective consistency property and at the same time believe itself or similar agents to satisfy the same property. Thanks to Sam Eisenstat for listening to an earlier version of this proof, and pointing out a significant gap in the argument. The proof presented here has not been vetted yet.

2. What do we need value learning for?
post by Jessica Taylor 1326 days ago | Kaya Stechly, Abram Demski and Nate Soares like this | discuss

I will be writing a sequence of posts about value learning. The purpose of these posts is to create more explicit models of some value learning ideas, such as those discussed in The Value Learning Problem. Although these explicit models are unlikely to capture the complexity of real value learning systems, it is at least helpful to have some explicit model of value learning in mind when thinking about problems such as corrigibility.

This came up because I was discussing value learning with some people at MIRI and FHI. There were disagreements about some aspects of the problem, such as whether a value-learning AI could automatically learn how to be corrigible. I realized that my thinking about value learning was somewhat confused. Making concrete models will make my thinking clearer and also create more common models that people can discuss.

A value learning model is an algorithm that observes human behaviors and determines what values humans have. Roughly, the model consists of:

1. a type of values, $$\mathcal{V}$$
2. a prior over values, $$P(V)$$
3. a conditional distribution of human behavior given their values and observation, $$P(A | V, O)$$

Of course, this is very simplified: in real life the model must account for beliefs, memory, etc. Such a model can be used for multiple purposes. Each of these purposes requires different things from the model. It is important to look at these applications when constructing these models, so it is clear what target we are shooting for.
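The three components listed above can be put together in a minimal sketch. It assumes, beyond what the post states, that values are inferred by plain Bayesian updating over a finite set of candidate values; the names `posterior_over_values` and `noisy_choice`, and the tea/coffee example, are hypothetical illustrations rather than anything from the post.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Value = Hashable
Action = Hashable
Observation = Hashable


def posterior_over_values(
    prior: Dict[Value, float],
    likelihood: Callable[[Action, Value, Observation], float],
    history: List[Tuple[Observation, Action]],
) -> Dict[Value, float]:
    """Compute P(V | history) from the prior P(V) and the behavior model P(A | V, O).

    Assumption: behavior at each step is conditionally independent given V and O.
    """
    unnormalized = {}
    for v, p_v in prior.items():
        weight = p_v
        for o, a in history:
            weight *= likelihood(a, v, o)  # multiply in P(a | v, o)
        unnormalized[v] = weight
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("history has zero probability under every candidate value")
    return {v: w / total for v, w in unnormalized.items()}


def noisy_choice(action: Action, value: Value, observation: Observation) -> float:
    """Hypothetical P(A | V, O): the human picks the preferred drink 90% of the time."""
    if observation != "both_offered":
        return 0.5  # uninformative in any other situation
    preferred = "tea" if value == "likes_tea" else "coffee"
    return 0.9 if action == preferred else 0.1


# Type of values: two candidates; prior P(V): uniform.
prior = {"likes_tea": 0.5, "likes_coffee": 0.5}
history = [("both_offered", "tea"), ("both_offered", "tea")]
posterior = posterior_over_values(prior, noisy_choice, history)
```

After two observed choices of tea, the posterior concentrates on `likes_tea`. A realistic model would also need to represent beliefs and memory, as noted above; this sketch omits them.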

 3. Sequential Extensions of Causal and Evidential Decision Theory link by Tom Everitt 1370 days ago | Kaya Stechly and Patrick LaVictoire like this | discuss
 4. Attempting to refine "maximization" with 3 new -izers link by Pasha Kamyshev 1436 days ago | Kaya Stechly and Patrick LaVictoire like this | 1 comment
5. PA+100 cannot always predict modal UDT
post by Jessica Taylor 1527 days ago | Kaya Stechly, Benja Fallenstein and Patrick LaVictoire like this | 3 comments

Summary: we might expect $$PA + m$$ to be able to predict the behavior of an escalating modal UDT agent using $$PA + n$$ for $$n < m$$. However, this is not the case. Furthermore, in environments where $$PA + m$$ predicts the agent, escalating modal UDT can be outperformed by a constant agent.

6. Reflective probabilistic logic cannot assign positive probability to its own coherence and an inner reflection principle
post by Jessica Taylor 1537 days ago | Kaya Stechly, Benja Fallenstein, Nate Soares, Patrick LaVictoire and Stuart Armstrong like this | 5 comments

Summary: although we can find an assignment of probabilities to logical statements that satisfies an outer reflection principle, we have more difficulty satisfying both the outer reflection principle and an inner reflection principle. This post presents an impossibility result.

7. Forum Digest: Corrigibility, utility indifference, & related control ideas
post by Benja Fallenstein 1577 days ago | Kaya Stechly, Jessica Taylor, Nate Soares, Patrick LaVictoire and Stuart Armstrong like this | 3 comments

This is a quick recap of the posts on this forum that deal with corrigibility (making sure that if you get an agent’s goal system wrong, it doesn’t try to prevent you from changing it), utility indifference (the idea of removing incentives to manipulate you into changing or not changing the agent’s goal system, by adding rewards to its utility function so that it gets the same utility in both cases), and related AI control ideas. It’s current as of 3/21/15.

8. Forum Digest: Reflective Oracles
post by Jessica Taylor 1578 days ago | Kaya Stechly, Sam Eisenstat, Benja Fallenstein, Nate Soares and Patrick LaVictoire like this | discuss

Summary: This is a quick expository recap, with links, of writing (in papers and on this forum) on reflective oracles, through 3/21/15. Read this if you want to learn more about reflective oracles, or if you’re curious about what we’ve been working on lately!

 9. Single-bit reflective oracles are enough discussion post by Benja Fallenstein 1582 days ago | Kaya Stechly, Jessica Taylor, Nate Soares and Patrick LaVictoire like this | 1 comment
10. The odd counterfactuals of playing chicken
post by Benja Fallenstein 1626 days ago | Kaya Stechly, Nate Soares, Patrick LaVictoire and Stuart Armstrong like this | discuss

In this post, I examine an odd consequence of “playing chicken with the universe”, as used in proof-based UDT. Let’s say that our agent uses PA, and that it has a provability oracle, so that if it doesn’t find a proof, there really isn’t one. In this case, one way of looking at UDT is to say that it treats the models of PA as impossible possible worlds: UDT thinks that taking action $$a$$ leads to utility $$u$$ iff the universe program $$U()$$ returns $$u$$ in all models $$\mathcal{M}$$ in which $$A()$$ returns $$a$$. The chicken step ensures that for every $$a$$, there is at least one model $$\mathcal{M}$$ in which this is true. But how? Well, even though in the “real world”, $$\mathbb{N}$$, the sentence $$\ulcorner{\bar A() \neq \bar a}\urcorner$$ isn’t provable—that is, $$\mathbb{N}\vDash\neg\square\ulcorner{\bar{\bar A}() \neq \bar{\bar a}}\urcorner$$—there are other models $$\mathcal{M}$$ such that $$\mathcal{M}\vDash\square\ulcorner{\bar{\bar A}() \neq \bar{\bar a}}\urcorner$$, and in these models, the chicken step can make $$A()$$ output $$a$$.

In general, the only “impossible possible worlds” in which $$A() = a$$ are models $$\mathcal{M}$$ according to which it is provable that $$A() \neq a$$. In this post, I show that this odd way of constructing the counterfactual “what would happen if I did $$a$$” can cause problems for modal UDT and the corresponding notion of third-party counterfactuals.

11. Third-person counterfactuals
post by Benja Fallenstein 1627 days ago | Kaya Stechly, Nate Soares and Patrick LaVictoire like this | 4 comments

If you’re thinking about the counterfactual world where you do X in the process of deciding whether to do X, let’s call that a first-person counterfactual. If you’re thinking about it in the process of deciding whether another agent A should have done X instead of Y, let’s call that a third-person counterfactual. The definition of, e.g., modal UDT uses first-person counterfactuals, but when we try to prove a theorem showing that modal UDT is “optimal” in some sense, then we need to use third-person counterfactuals.

UDT’s first-person counterfactuals are logical counterfactuals, but our optimality result evaluates UDT by using physical third-party counterfactuals: it asks, would another agent have done better, not, would a different action by the same agent have led to a better outcome? The former is easier to analyze, but the latter seems to be what we really care about. Nate’s recent post on “global UDT” points towards turning UDT into a notion of third-party counterfactuals, and describes some problems. In this post, I’ll give a fuller UDT-based notion of logical third-party counterfactuals, which at least fails visibly (returns an error) in the kinds of cases Nate describes. However, in a follow-up post I’ll give an example where this definition returns a non-error value which intuitively seems wrong.

12. Predictors that don't try to manipulate you(?)
post by Benja Fallenstein 1705 days ago | Kaya Stechly, Nate Soares and Patrick LaVictoire like this | 1 comment

One idea for how to make a safe superintelligent agent is to make a system that only answers questions, but doesn’t try to act in the world: an “oracle”, in Nick Bostrom’s terms. One of the things that make this difficult is that it’s not clear what it should mean, formally, to optimize for “truthfully answering questions” without in some way trying to influence the world. Intuitively: won’t an agent that is trying to maximize the number of truthfully answered questions want to manipulate you into asking easier questions?

In this post, I consider a formal toy model of an agent that is trying to make correct predictions about future input, but, in a certain formal sense, has no incentive to make its future input easier to predict. I’m not claiming that this model definitely avoids all unintended optimizations—I don’t understand it that deeply yet—but it seems like an interesting proposal that is worth thinking about.

 13. An optimality result for modal UDT discussion post by Benja Fallenstein 1707 days ago | Kaya Stechly, Abram Demski, Nate Soares and Patrick LaVictoire like this | discuss
 14. "Evil" decision problems in provability logic discussion post by Benja Fallenstein 1708 days ago | Kaya Stechly, Jessica Taylor, Nate Soares and Patrick LaVictoire like this | 4 comments
 15. Topological truth predicates: Towards a model of perfect Bayesian agents discussion post by Benja Fallenstein 1716 days ago | Kaya Stechly, Abram Demski, Nate Soares, Tsvi Benson-Tilsen and Vladimir Slepnev like this | 6 comments
