 1.  Existence of distributions that are expectation-reflective and know it   post by Tsvi Benson-Tilsen 1315 days ago  Kaya Stechly, Abram Demski, Jessica Taylor, Nate Soares and Paul Christiano like this  discuss  
 We prove the existence of a probability distribution over a theory \({T}\) with the property that for certain definable quantities \({\varphi}\), the expected value \({E}[{\ulcorner {\varphi}\urcorner}]\) assigned by the function \({E}\) is accurate, i.e. it equals the actual expectation of \({\varphi}\); and with the property that the distribution assigns probability 1 to \({E}\) behaving this way. This may be useful for self-verification, by allowing an agent to satisfy a reflective consistency property and at the same time believe that it, or similar agents, satisfies the same property. Thanks to Sam Eisenstat for listening to an earlier version of this proof, and pointing out a significant gap in the argument. The proof presented here has not been vetted yet.  
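Restated in symbols (our paraphrase of the abstract, not the post's exact formalization; here \(\mathbb{P}\) denotes the distribution over the theory \({T}\)):

```latex
% Property (1), accuracy: for the definable quantities \varphi in question,
% the value E assigns to the quotation of \varphi matches, in expectation,
% the true expectation of \varphi under \mathbb{P}:
\mathbb{E}_{\mathbb{P}}\!\left[\, E[\ulcorner \varphi \urcorner] \,\right]
  \;=\; \mathbb{E}_{\mathbb{P}}[\varphi]
% Property (2), self-knowledge: \mathbb{P} assigns probability 1 to the
% (internalized) statement that E satisfies property (1):
\mathbb{P}\bigl(\text{``$E$ is expectation-reflective''}\bigr) = 1
```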
  2.  What do we need value learning for?   post by Jessica Taylor 1326 days ago  Kaya Stechly, Abram Demski and Nate Soares like this  discuss  
 I will be writing a sequence of posts about value learning. The purpose of these posts is to create more explicit models of some value learning ideas, such as those discussed in The Value Learning Problem. Although these explicit models are unlikely to capture the complexity of real value learning systems, it is at least helpful to have some explicit model of value learning in mind when thinking about problems such as corrigibility.
This came up because I was discussing value learning with some people at MIRI and FHI. There were disagreements about some aspects of the problem, such as whether a value-learning AI could automatically learn how to be corrigible. I realized that my thinking about value learning was somewhat confused. Making concrete models will make my thinking clearer and also create more common models that people can discuss.
A value learning model is an algorithm that observes human behaviors and determines what values humans have. Roughly, the model consists of:
 a type of values, \(\mathcal{V}\)
 a prior over values, \(P(V)\)
 a conditional distribution of human behavior given their values and observation, \(P(A \mid V, O)\)
Of course, this is very simplified: in real life the model must account for beliefs, memory, etc. Such a model can be used for multiple purposes. Each of these purposes requires different things from the model. It is important to look at these applications when constructing these models, so it is clear what target we are shooting for.
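As a toy illustration of the three components above, here is a minimal discrete Bayesian sketch. Everything concrete in it (the two candidate value systems, the 90% likelihood, the fruit scenario) is a hypothetical example, not from the post:

```python
# Toy value-learning model: a finite set of candidate values V,
# a prior P(V), and a likelihood P(A | V, O) for the human's action
# given their values and observation. Bayes' rule gives P(V | A, O).

VALUES = ["likes_apples", "likes_oranges"]

def prior(v):
    # Uniform prior over the two candidate value systems.
    return 0.5

def likelihood(action, v, observation):
    # P(A | V, O): a human who likes apples takes the apple 90% of
    # the time when both fruits are offered, and vice versa.
    if observation == "both_offered":
        p_apple = 0.9 if v == "likes_apples" else 0.1
        return p_apple if action == "take_apple" else 1 - p_apple
    return 1.0  # uninformative observation

def posterior(action, observation):
    # P(V | A, O) is proportional to P(A | V, O) * P(V).
    unnorm = {v: likelihood(action, v, observation) * prior(v) for v in VALUES}
    z = sum(unnorm.values())
    return {v: p / z for v, p in unnorm.items()}

# Observing the human take the apple shifts belief toward "likes_apples".
post = posterior("take_apple", "both_offered")
```

A real system would replace the likelihood with a model of a bounded, noisy planner, but even this skeleton makes clear which of the three ingredients each design decision lives in.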
 
         10.  The odd counterfactuals of playing chicken   post by Benja Fallenstein 1626 days ago  Kaya Stechly, Nate Soares, Patrick LaVictoire and Stuart Armstrong like this  discuss  
 In this post, I examine an odd consequence of “playing chicken with the universe”, as used in proof-based UDT. Let’s say that our agent uses PA, and that it has a provability oracle, so that if it doesn’t find a proof, there really isn’t one. In this case, one way of looking at UDT is to say that it treats the models of PA as impossible possible worlds: UDT thinks that taking action \(a\) leads to utility \(u\) iff the universe program \(U()\) returns \(u\) in all models \(\mathcal{M}\) in which \(A()\) returns \(a\). The chicken step ensures that for every \(a\), there is at least one model \(\mathcal{M}\) in which this is true. But how? Well, even though in the “real world”, \(\mathbb{N}\), the sentence \(\ulcorner{\bar A() \neq \bar a}\urcorner\) isn’t provable—that is, \(\mathbb{N}\vDash\neg\square\ulcorner{\bar{\bar A}() \neq \bar{\bar a}}\urcorner\)—there are other models \(\mathcal{M}\) such that \(\mathcal{M}\vDash\square\ulcorner{\bar{\bar A}() \neq \bar{\bar a}}\urcorner\), and in these models, the chicken step can make \(A()\) output \(a\).
In general, the only “impossible possible worlds” in which \(A() = a\) are models \(\mathcal{M}\) according to which it is provable that \(A() \neq a\). In this post, I show that this odd way of constructing the counterfactual “what would happen if I did \(a\)” can cause problems for modal UDT and the corresponding notion of third-party counterfactuals.
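The chicken step itself can be sketched in a few lines. This is a deliberately crude simplification (the "provability oracle" is just membership in an explicit set of sentences, and the action names are invented), but it shows the mechanism the post analyzes:

```python
# Toy sketch of the chicken step in proof-based UDT. The agent first
# checks, for each action a, whether the theory proves "A() != a";
# if so, it takes a immediately, defying the proof. This guarantees
# that no consistent extension of the theory rules out any action.

ACTIONS = ["a1", "a2"]

def chicken_agent(provable):
    """provable: the set of sentences the oracle reports as provable."""
    for a in ACTIONS:
        if f"A() != {a}" in provable:
            return a  # chicken step: do the thing the theory says we won't
    # No action is provably ruled out; fall back to the agent's
    # ordinary decision rule (here, trivially, the first action).
    return ACTIONS[0]
```

In models where the theory's provability predicate (unsoundly) asserts a proof of \(A() \neq a\), this branch fires and forces \(A() = a\), which is exactly the odd counterfactual structure the post examines.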
 
  11.  Third-person counterfactuals   post by Benja Fallenstein 1627 days ago  Kaya Stechly, Nate Soares and Patrick LaVictoire like this  4 comments  
 If you’re thinking about the counterfactual world where you do X in the process of deciding whether to do X, let’s call that a first-person counterfactual. If you’re thinking about it in the process of deciding whether another agent A should have done X instead of Y, let’s call that a third-person counterfactual. The definition of, e.g., modal UDT uses first-person counterfactuals, but when we try to prove a theorem showing that modal UDT is “optimal” in some sense, then we need to use third-person counterfactuals.
UDT’s first-person counterfactuals are logical counterfactuals, but our optimality result evaluates UDT by using physical third-party counterfactuals: it asks, would another agent have done better, not, would a different action by the same agent have led to a better outcome? The former is easier to analyze, but the latter seems to be what we really care about. Nate’s recent post on “global UDT” points towards turning UDT into a notion of third-party counterfactuals, and describes some problems. In this post, I’ll give a fuller UDT-based notion of logical third-party counterfactuals, which at least fails visibly (returns an error) in the kinds of cases Nate describes. However, in a follow-up post I’ll give an example where this definition returns a non-error value which intuitively seems wrong.
 
  12.  Predictors that don't try to manipulate you(?)   post by Benja Fallenstein 1705 days ago  Kaya Stechly, Nate Soares and Patrick LaVictoire like this  1 comment  
 One idea for how to make a safe superintelligent agent is to make a system that only answers questions, but doesn’t try to act in the world—an “oracle”, in Nick Bostrom’s terms. One of the things that make this difficult is that it’s not clear what it should mean, formally, to optimize for “truthfully answering questions” without in some way trying to influence the world; intuitively: Won’t an agent that is trying to maximize the number of truthfully answered questions want to manipulate you into asking easier questions?
In this post, I consider a formal toy model of an agent that is trying to make correct predictions about future input, but, in a certain formal sense, has no incentive to make its future input easier to predict. I’m not claiming that this model definitely avoids all unintended optimizations—I don’t understand it that deeply yet—but it seems like an interesting proposal that is worth thinking about.
 
    
