    Conditioning on Conditionals   post by Scott Garrabrant 7 hours ago  Abram Demski likes this
 (From conversations with Sam, Abram, Tsvi, Marcello, and Ashwin Sah) A basic EDT agent starts with a prior, updates on a bunch of observations, and then has a choice between various actions. It conditions on each possible action it could take, and takes the action for which this conditional leads to the highest expected utility. An updateless (but non-policy-selection) EDT agent has a problem here. It wants not to update on the observations, but it wants to condition on the fact that it takes a specific action given its observations. It is not obvious what this conditional should look like. In this post, I argue for a particular way to interpret conditioning on this conditional (of taking a specific action given a specific observation).
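The basic EDT decision rule described above can be sketched over a finite joint distribution. The distribution, utilities, and action names below are illustrative placeholders, not anything from the post:

```python
# Minimal sketch of a basic EDT decision rule: condition on each
# possible action and pick the one with the highest conditional
# expected utility. All numbers and names are illustrative.

def edt_choice(joint, utility, actions):
    """joint: dict mapping (action, outcome) -> probability
    utility: dict mapping outcome -> utility
    Returns the action maximizing E[U | A = a]."""
    best_action, best_eu = None, float("-inf")
    for a in actions:
        # P(A = a): total mass on rows taking this action.
        p_a = sum(p for (act, _), p in joint.items() if act == a)
        if p_a == 0:
            continue  # conditioning on a zero-probability action is undefined
        # E[U | A = a] = sum over outcomes of U(o) * P(o | a).
        eu = sum(p * utility[o] for (act, o), p in joint.items() if act == a) / p_a
        if eu > best_eu:
            best_action, best_eu = a, eu
    return best_action

# Toy example: action "b" is correlated with the good outcome.
joint = {("a", "good"): 0.1, ("a", "bad"): 0.4,
         ("b", "good"): 0.4, ("b", "bad"): 0.1}
utility = {"good": 1.0, "bad": 0.0}
print(edt_choice(joint, utility, ["a", "b"]))  # -> b
```

The updateless difficulty the post addresses is exactly that this conditioning step is well-defined for actions but not obviously so for conditionals of the form "action given observation".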
     "Like this world, but..."   post by Stuart Armstrong 33 days ago
 A putative new idea for AI control; index here.
Pick a very unsafe goal: \(G=\)“AI, make this world richer and less unequal.” What does this mean as a goal, and can we make it safe?
I’ve started to sketch out how we can codify “human understanding” in terms of human ability to answer questions.
Here I’m investigating the reverse problem, to see whether the same idea can be used to give instructions to an AI.
 
   Smoking Lesion Steelman   post by Abram Demski 46 days ago  Sam Eisenstat, Vadim Kosoy, Paul Christiano and Scott Garrabrant like this  5 comments  
 It seems plausible to me that any example I’ve seen so far which seems to require causal/counterfactual reasoning is more properly solved by taking the right updateless perspective, and taking the action or policy which achieves maximum expected utility from that perspective. If this were the right view, then the aim would be to construct something like updateless EDT.
I give a variant of the smoking lesion problem which overcomes an objection to the classic smoking lesion, and which is solved correctly by CDT, but which is not solved by updateless EDT.
 
  Delegative Inverse Reinforcement Learning   post by Vadim Kosoy 46 days ago  6 comments  
 We introduce a reinforcement-like learning setting we call Delegative Inverse Reinforcement Learning (DIRL). In DIRL, the agent can, at any point in time, delegate the choice of action to an “advisor”. The agent knows neither the environment nor the reward function, whereas the advisor knows both. Thus, DIRL can be regarded as a special case of CIRL. A similar setting was studied in Clouse 1997, but as far as we can tell, the relevant literature offers few theoretical results, and virtually all researchers focus on the MDP case (please correct me if I’m wrong). On the other hand, we consider general environments (not necessarily MDP or even POMDP) and prove a natural performance guarantee.
The use of an advisor allows us to kill two birds with one stone: learning the reward function and safe exploration (i.e. avoiding both the Scylla of “Bayesian paranoia” and the Charybdis of falling into traps). We prove that, given certain assumptions about the advisor, a Bayesian DIRL agent (whose prior is supported on some countable set of hypotheses) is guaranteed to attain most of the value in the slowly falling time discount (long-term planning) limit (assuming one of the hypotheses in the prior is true). The assumptions about the advisor are quite strong, but the advisor is not required to be fully optimal: a “soft maximizer” satisfies the conditions. Moreover, we allow for the existence of “corrupt states” in which the advisor stops being a relevant signal, thus demonstrating that this approach can deal with wireheading and avoid manipulating the advisor, at least in principle (the assumptions about the advisor are still unrealistically strong). Finally, we consider advisors that don’t know the environment but have some beliefs about it, and show that in this case the agent converges to Bayes-optimality w.r.t. the advisor’s beliefs, which is arguably the best we can expect.
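The core delegation mechanism can be caricatured in a few lines. The confidence-threshold rule and the representation of hypotheses (a dict from state to recommended action) below are my own simplifications for illustration, not the paper's actual algorithm:

```python
# Illustrative sketch of delegation in DIRL: the agent acts on its own
# only when its posterior puts enough mass on a single recommended
# action, and otherwise delegates the choice to the advisor. The
# threshold rule is a simplification of my own.

def step(hyp_actions, posterior, state, advisor, threshold=0.9):
    """hyp_actions: one dict per hypothesis, mapping state -> action.
    posterior: probability of each hypothesis.
    Returns (action, delegated?)."""
    votes = {}
    for hyp, p in zip(hyp_actions, posterior):
        a = hyp[state]
        votes[a] = votes.get(a, 0.0) + p
    action, mass = max(votes.items(), key=lambda kv: kv[1])
    if mass >= threshold:
        return action, False       # confident: act autonomously
    return advisor(state), True    # uncertain: delegate to the advisor

# Two hypotheses disagree about the safe action, so the agent delegates;
# once the posterior concentrates, it acts on its own.
hyps = [{"s0": "left"}, {"s0": "right"}]
advisor = lambda s: "left"         # advisor knows the true environment
print(step(hyps, [0.6, 0.4], "s0", advisor))    # -> ('left', True)
print(step(hyps, [0.95, 0.05], "s0", advisor))  # -> ('left', False)
```

The point of delegating exactly when the hypotheses disagree is that the advisor's choices double as both reward information and trap avoidance, which is the "two birds with one stone" above.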
 
  A cheating approach to the tiling agents problem   post by Vladimir Slepnev 47 days ago  Alex Mennen and Vadim Kosoy like this  3 comments  
 (This post resulted from a conversation with Wei Dai.)
Formalizing the tiling agents problem is very delicate. In this post I’ll show a toy problem and a solution to it, which arguably meets all the desiderata stated before, but only by cheating in a new and unusual way.
Here’s a summary of the toy problem: we ask an agent to solve a difficult math question and also design a successor agent. Then the successor must solve another math question and design its own successor, and so on. The questions get harder each time, so they can’t all be solved in advance, and each of them requires believing in Peano arithmetic (PA). This goes on for a fixed number of rounds, and the final reward is the number of correct answers.
Moreover, we will demand that the agent must handle both subtasks (solving the math question and designing the successor) using the same logic. Finally, we will demand that the agent be able to reproduce itself on each round, not just design a custom-made successor that solves the math question with PA and reproduces itself by quining.
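The shape of the toy problem can be sketched as a harness: each round, the current agent must both answer a question and emit its successor, and the final reward counts correct answers. The agent shown is the trivial self-reproducing baseline that the post explicitly wants to go beyond; all names and questions are placeholders of my own:

```python
# Bare-bones harness for the toy tiling problem: each round the current
# agent answers a math question and designs a successor; the score is
# the number of correct answers. Illustrative only.

def run(initial_agent, questions, answers):
    score, agent = 0, initial_agent
    for q, correct in zip(questions, answers):
        answer, successor = agent(q)  # one call handles both subtasks
        if answer == correct:
            score += 1
        agent = successor             # the successor takes the next round
    return score

# The trivial baseline: "solve" arithmetic questions by evaluation and
# reproduce by returning itself (a quine-style successor).
def quine_agent(question):
    return eval(question), quine_agent

print(run(quine_agent, ["2+2", "3*7"], [4, 21]))  # -> 2
```

The hard part, which this harness deliberately elides, is an agent that trusts PA enough to answer while also trusting its successor's reasoning without the shortcut of quining.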
 
       Futarchy Fix   post by Abram Demski 79 days ago  Scott Garrabrant and Stuart Armstrong like this  9 comments  
 Robin Hanson’s Futarchy is a proposal to let prediction markets make governmental decisions. We can view an operating Futarchy as an agent, and ask if it is aligned with the interests of its constituents. I am aware of two main failures of alignment: (1) since predicting rare events is rewarded in proportion to their rareness, prediction markets heavily incentivise causing rare events to happen (I’ll call this the entropy-market problem); (2) it seems prediction markets would not be able to assign probability to existential risk, since you can’t collect on bets after everyone’s dead (I’ll call this the existential risk problem). I provide three formulations of (1) and solve two of them, and make some comments on (2). (Thanks to Scott for pointing out the second of these problems to me; I don’t remember who originally told me about the first problem, but also thanks.)
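A back-of-the-envelope calculation shows why the entropy-market problem scales with rareness: shares on a rare event trade near its probability, so a bettor who can cheaply cause the event collects a payoff inversely proportional to that probability. All numbers below are made up for illustration:

```python
# Illustrative arithmetic for the entropy-market problem: buying shares
# on a rare event at its market probability, then causing the event.
# All figures are invented for the example.

p = 0.01            # market probability of the rare event
stake = 1_000       # dollars spent on "event happens" shares at price p
cost_to_cause = 50_000  # hypothetical cost of making the event happen

shares = stake / p  # each share pays $1 if the event occurs -> 100,000 shares
profit = shares - stake - cost_to_cause
print(profit)       # -> 49000.0: causing the event is profitable
```

Halving p doubles the payoff on the same stake, which is the sense in which the market rewards rare-event prediction "in proportion to rareness" and so subsidizes causing such events.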
 
        