 Humans can be assigned any values whatsoever...   post by Stuart Armstrong 5 days ago  discuss  
 A putative new idea for AI control; index here. Crossposted at LessWrong 2.0. This post has nothing really new for this message board, but I’m posting it here because of the subsequent posts I’m intending to write.
Humans have no values… nor does any agent, unless you make strong assumptions about their rationality. And depending on those assumptions, you can get humans to have any values.
 
  Hyperreal Brouwer   post by Scott Garrabrant 12 days ago  Stuart Armstrong likes this  1 comment  
 This post explains how to view Kakutani’s fixed point theorem as a special case of Brouwer’s fixed point theorem with hyperreal numbers. This post is just math intuitions, but I found them useful in thinking about Kakutani’s fixed point theorem and many things in agent foundations. This came out of conversations with Sam Eisenstat.  
    Resolving human inconsistency in a simple model   post by Stuart Armstrong 14 days ago  Abram Demski likes this  discuss  
 A putative new idea for AI control; index here.
This post will present a simple model of an inconsistent human, and ponder how to resolve their inconsistency.
Let \(\bf{H}\) be our agent, in a turn-based world. Let \(R^l\) and \(R^s\) be two simple reward functions at each turn. The reward \(R^l\) is thought of as being a ‘long-term’ reward, while \(R^s\) is a short-term one.
 
  Smoking Lesion Steelman II   post by Abram Demski 18 days ago  Scott Garrabrant likes this  discuss  
 After Johannes Treutlein’s comment on Smoking Lesion Steelman, and a number of other considerations, I had almost entirely given up on CDT. However, there were still nagging questions about whether the kind of self-ignorance needed in Smoking Lesion Steelman could arise naturally, how it should be dealt with if so, and what role counterfactuals ought to play in decision theory if CDT-like behavior is incorrect. Today I sat down to collect all the arguments which have been rolling around in my head on this and related issues, and arrived at a place much closer to CDT than I expected.
 
   Delegative Reinforcement Learning with a Merely Sane Advisor   post by Vadim Kosoy 48 days ago  discuss  
 Previously, we defined a setting called “Delegative Inverse Reinforcement Learning” (DIRL) in which the agent can delegate actions to an “advisor” and the reward is only visible to the advisor as well. We proved a sublinear regret bound (converted to traditional normalization in online learning, the bound is \(O(n^{2/3})\)) for one-shot DIRL (as opposed to standard regret bounds in RL which are only applicable in the episodic setting). However, this required a rather strong assumption about the advisor: in particular, the advisor had to choose the optimal action with maximal likelihood. Here, we consider “Delegative Reinforcement Learning” (DRL), i.e. a similar setting in which the reward is directly observable by the agent. We also restrict our attention to finite MDP environments (we believe these results can be generalized to a much larger class of environments, but not to arbitrary environments). On the other hand, the assumption about the advisor is much weaker: the advisor is only required to avoid catastrophic actions (i.e. actions that lose value to zeroth order in the interest rate) and assign some positive probability to a nearly optimal action. As before, we prove a one-shot regret bound (in traditional normalization, \(O(n^{3/4})\)). Analogously to before, we allow for “corrupt” states in which both the advisor and the reward signal stop being reliable.
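 A short worked step shows why both bounds count as sublinear. It assumes that \(\mathrm{Reg}(n)\) denotes cumulative regret after \(n\) steps in the traditional normalization; the symbol is introduced here for illustration and is not taken from the post.

\[
\mathrm{Reg}(n) = O(n^{2/3}) \;\Longrightarrow\; \frac{\mathrm{Reg}(n)}{n} = O(n^{-1/3}) \to 0,
\qquad
\mathrm{Reg}(n) = O(n^{3/4}) \;\Longrightarrow\; \frac{\mathrm{Reg}(n)}{n} = O(n^{-1/4}) \to 0 .
\]

 In both cases the average per-step regret vanishes as \(n \to \infty\). The DRL exponent \(3/4\) is larger than the DIRL exponent \(2/3\), so the DRL guarantee decays more slowly.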
 
      Conditioning on Conditionals   post by Scott Garrabrant 62 days ago  Abram Demski likes this  discuss  
 (From conversations with Sam, Abram, Tsvi, Marcello, and Ashwin Sah) A basic EDT agent starts with a prior, updates on a bunch of observations, and then has a choice between various actions. It conditions on each possible action it could take, and takes the action for which this conditional leads to the highest expected utility. An updateless (but non-policy-selection) EDT agent has a problem here. It wants to not update on the observations, but it wants to condition on the fact that it takes a specific action given its observations. It is not obvious what this conditional should look like. In this post, I argue for a particular way to interpret this conditioning on this conditional (of taking a specific action given a specific observation).
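 The basic (updateful) EDT procedure described above can be sketched in a few lines. This is a minimal illustration, not the construction from the post: the `Outcome` record, the toy `JOINT` distribution, and the function names are all hypothetical, and the prior is assumed to be a finite joint distribution over worlds, observations, actions, and utilities.

```python
from typing import List, NamedTuple, Optional, Sequence


class Outcome(NamedTuple):
    """One atom of a hypothetical finite prior (illustrative only)."""
    prob: float
    world: str
    observation: str
    action: str
    utility: float


def conditional_expected_utility(
    joint: Sequence[Outcome], observation: str, action: str
) -> Optional[float]:
    """E[utility | observation, action], or None if the event has probability 0."""
    matching = [o for o in joint if o.observation == observation and o.action == action]
    mass = sum(o.prob for o in matching)
    if mass == 0:
        return None  # conditioning on a probability-zero event is undefined
    return sum(o.prob * o.utility for o in matching) / mass


def edt_choice(joint: Sequence[Outcome], observation: str, actions: Sequence[str]) -> str:
    """Update on the observation, condition on each action, pick the best one."""
    scored = {}
    for action in actions:
        eu = conditional_expected_utility(joint, observation, action)
        if eu is not None:
            scored[action] = eu
    # Raises ValueError if no action has positive probability given the observation.
    return max(scored, key=scored.get)


# Toy prior (assumed numbers): the sky is observed, the weather is hidden.
JOINT: List[Outcome] = [
    Outcome(0.15, "rain", "cloudy", "umbrella", 5.0),
    Outcome(0.05, "dry", "cloudy", "umbrella", 3.0),
    Outcome(0.15, "rain", "cloudy", "none", 0.0),
    Outcome(0.05, "dry", "cloudy", "none", 6.0),
    Outcome(0.05, "rain", "clear", "umbrella", 5.0),
    Outcome(0.25, "dry", "clear", "umbrella", 3.0),
    Outcome(0.05, "rain", "clear", "none", 0.0),
    Outcome(0.25, "dry", "clear", "none", 6.0),
]
ACTIONS = ["umbrella", "none"]
```

 Calling `edt_choice(JOINT, "cloudy", ACTIONS)` selects the action whose conditional expected utility is highest after updating on the observation. An updateless agent cannot simply reuse this function, because it does not want to restrict the prior to the observed branch; how to condition instead is the question the post addresses.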
     "Like this world, but..."   post by Stuart Armstrong 95 days ago  discuss  
 A putative new idea for AI control; index here.
Pick a very unsafe goal: \(G=\)“AI, make this world richer and less unequal.” What does this mean as a goal, and can we make it safe?
I’ve started to sketch out how we can codify “human understanding” in terms of human ability to answer questions.
Here I’m investigating the reverse problem, to see whether the same idea can be used to give instructions to an AI.
 
   Smoking Lesion Steelman   post by Abram Demski 108 days ago  Tom Everitt, Sam Eisenstat, Vadim Kosoy, Paul Christiano and Scott Garrabrant like this  8 comments  
 It seems plausible to me that any example I’ve seen so far which seems to require causal/counterfactual reasoning is more properly solved by taking the right updateless perspective, and taking the action or policy which achieves maximum expected utility from that perspective. If this were the right view, then the aim would be to construct something like updateless EDT.
I give a variant of the smoking lesion problem which overcomes an objection to the classic smoking lesion, and which is solved correctly by CDT, but which is not solved by updateless EDT.
 
  Delegative Inverse Reinforcement Learning   post by Vadim Kosoy 109 days ago  8 comments  
 We introduce a reinforcement-like learning setting we call Delegative Inverse Reinforcement Learning (DIRL). In DIRL, the agent can, at any point of time, delegate the choice of action to an “advisor”. The agent knows neither the environment nor the reward function, whereas the advisor knows both. Thus, DIRL can be regarded as a special case of CIRL. A similar setting was studied in Clouse 1997, but as far as we can tell, the relevant literature offers few theoretical results and virtually all researchers focus on the MDP case (please correct me if I’m wrong). On the other hand, we consider general environments (not necessarily MDP or even POMDP) and prove a natural performance guarantee.
The use of an advisor allows us to kill two birds with one stone: learning the reward function and safe exploration (i.e. avoiding both the Scylla of “Bayesian paranoia” and the Charybdis of falling into traps). We prove that, given certain assumptions about the advisor, a Bayesian DIRL agent (whose prior is supported on some countable set of hypotheses) is guaranteed to attain most of the value in the slow falling time discount (long-term planning) limit (assuming one of the hypotheses in the prior is true). The assumption about the advisor is quite strong, but the advisor is not required to be fully optimal: a “soft maximizer” satisfies the conditions. Moreover, we allow for the existence of “corrupt states” in which the advisor stops being a relevant signal, thus demonstrating that this approach can deal with wireheading and avoid manipulating the advisor, at least in principle (the assumption about the advisor is still unrealistically strong). Finally we consider advisors that don’t know the environment but have some beliefs about the environment, and show that in this case the agent converges to Bayes-optimality w.r.t. the advisor’s beliefs, which is arguably the best we can expect.
 
  A cheating approach to the tiling agents problem   post by Vladimir Slepnev 110 days ago  Alex Mennen and Vadim Kosoy like this  2 comments  
 (This post resulted from a conversation with Wei Dai.)
Formalizing the tiling agents problem is very delicate. In this post I’ll show a toy problem and a solution to it, which arguably meets all the desiderata stated before, but only by cheating in a new and unusual way.
Here’s a summary of the toy problem: we ask an agent to solve a difficult math question and also design a successor agent. Then the successor must solve another math question and design its own successor, and so on. The questions get harder each time, so they can’t all be solved in advance, and each of them requires believing in Peano arithmetic (PA). This goes on for a fixed number of rounds, and the final reward is the number of correct answers.
Moreover, we will demand that the agent must handle both subtasks (solving the math question and designing the successor) using the same logic. Finally, we will demand that the agent be able to reproduce itself on each round, not just design a custom-made successor that solves the math question with PA and reproduces itself by quining.
 
     
