         Distributed Cooperation  post by Alex Appel 36 days ago  Abram Demski and Scott Garrabrant like this  2 comments  
Reflective oracles can be approximated by computing Nash equilibria. But is there some procedure that produces a Paretooptimal equilibrium in a game, aka, a point produced by a Cooperative oracle? It turns out there is. There are some interesting philosophical aspects to it, which will be typed up in the next post.
The result is not original to me, it’s been floating around MIRI for a while. I think Scott, Sam, and Abram worked on it, but there might have been others. All I did was formalize it a bit, and generalize from the 2player 2move case to the nplayer nmove case. With the formalism here, it’s a bit hard to intuitively understand what’s going on, so I’ll indicate where to visualize an appropriate 3dimensional object.

 
     In memoryless Cartesian environments, every UDT policy is a CDT+SIA policy  post by Jessica Taylor 681 days ago  Vadim Kosoy and Abram Demski like this  3 comments  
Summary: I define a memoryless Cartesian environments (which can model many familiar decision problems), note the similarity to memoryless POMDPs, and define a local optimality condition for policies, which can be roughly stated as “the policy is consistent with maximizing expected utility using CDT and subjective probabilities derived from SIA”. I show that this local optimality condition is necesssary but not sufficient for global optimality (UDT).

 
   An Untrollable Mathematician  post by Abram Demski 89 days ago  Alex Appel, Sam Eisenstat, Vadim Kosoy, Jack Gallagher, Jessica Taylor, Paul Christiano, Scott Garrabrant and Vladimir Slepnev like this  1 comment  
Followup to All Mathematicians are Trollable.
It is relatively easy to see that no computable Bayesian prior on logic can converge to a single coherent probability distribution as we update it on logical statements. Furthermore, the nonconvergence behavior is about as bad as could be: someone selecting the ordering of provable statements to update on can drive the Bayesian’s beliefs arbitrarily up or down, arbitrarily many times, despite only saying true things. I called this wild nonconvergence behavior “trollability”. Previously, I showed that if the Bayesian updates on the provabilily of a sentence rather than updating on the sentence itself, it is still trollable. I left open the question of whether some other side information could save us. Sam Eisenstat has closed this question, providing a simple logical prior and a way of doing a Bayesian update on it which (1) cannot be trolled, and (2) converges to a coherent distribution.

 
    Smoking Lesion Steelman II  post by Abram Demski 205 days ago  Tom Everitt and Scott Garrabrant like this  1 comment  
After Johannes Treutlein’s comment on Smoking Lesion Steelman, and a number of other considerations, I had almost entirely given up on CDT. However, there were still nagging questions about whether the kind of selfignorance needed in Smoking Lesion Steelman could arise naturally, how it should be dealt with if so, and what role counterfactuals ought to play in decision theory if CDTlike behavior is incorrect. Today I sat down to collect all the arguments which have been rolling around in my head on this and related issues, and arrived at a place much closer to CDT than I expected.

 
   Delegative Inverse Reinforcement Learning  post by Vadim Kosoy 296 days ago  Alex Appel likes this  11 comments  
We introduce a reinforcementlike learning setting we call Delegative Inverse Reinforcement Learning (DIRL). In DIRL, the agent can, at any point of time, delegate the choice of action to an “advisor”. The agent knows neither the environment nor the reward function, whereas the advisor knows both. Thus, DIRL can be regarded as a special case of CIRL. A similar setting was studied in Clouse 1997, but as far as we can tell, the relevant literature offers few theoretical results and virtually all researchers focus on the MDP case (please correct me if I’m wrong). On the other hand, we consider general environments (not necessarily MDP or even POMDP) and prove a natural performance guarantee.
The use of an advisor allows us to kill two birds with one stone: learning the reward function and safe exploration (i.e. avoiding both the Scylla of “Bayesian paranoia” and the Charybdis of falling into traps). We prove that, given certain assumption about the advisor, a Bayesian DIRL agent (whose prior is supported on some countable set of hypotheses) is guaranteed to attain most of the value in the slow falling time discount (longterm planning) limit (assuming one of the hypotheses in the prior is true). The assumption about the advisor is quite strong, but the advisor is not required to be fully optimal: a “soft maximizer” satisfies the conditions. Moreover, we allow for the existence of “corrupt states” in which the advisor stops being a relevant signal, thus demonstrating that this approach can deal with wireheading and avoid manipulating the advisor, at least in principle (the assumption about the advisor is still unrealistically strong). Finally we consider advisors that don’t know the environment but have some beliefs about the environment, and show that in this case the agent converges to Bayesoptimality w.r.t. the advisor’s beliefs, which is arguably the best we can expect.

 
  Delegative Inverse Reinforcement Learning  post by Vadim Kosoy 296 days ago  Alex Appel likes this  11 comments  
We introduce a reinforcementlike learning setting we call Delegative Inverse Reinforcement Learning (DIRL). In DIRL, the agent can, at any point of time, delegate the choice of action to an “advisor”. The agent knows neither the environment nor the reward function, whereas the advisor knows both. Thus, DIRL can be regarded as a special case of CIRL. A similar setting was studied in Clouse 1997, but as far as we can tell, the relevant literature offers few theoretical results and virtually all researchers focus on the MDP case (please correct me if I’m wrong). On the other hand, we consider general environments (not necessarily MDP or even POMDP) and prove a natural performance guarantee.
The use of an advisor allows us to kill two birds with one stone: learning the reward function and safe exploration (i.e. avoiding both the Scylla of “Bayesian paranoia” and the Charybdis of falling into traps). We prove that, given certain assumption about the advisor, a Bayesian DIRL agent (whose prior is supported on some countable set of hypotheses) is guaranteed to attain most of the value in the slow falling time discount (longterm planning) limit (assuming one of the hypotheses in the prior is true). The assumption about the advisor is quite strong, but the advisor is not required to be fully optimal: a “soft maximizer” satisfies the conditions. Moreover, we allow for the existence of “corrupt states” in which the advisor stops being a relevant signal, thus demonstrating that this approach can deal with wireheading and avoid manipulating the advisor, at least in principle (the assumption about the advisor is still unrealistically strong). Finally we consider advisors that don’t know the environment but have some beliefs about the environment, and show that in this case the agent converges to Bayesoptimality w.r.t. the advisor’s beliefs, which is arguably the best we can expect.

 
  Being legible to other agents by committing to using weaker reasoning systems  post by Alex Mennen 141 days ago  Stuart Armstrong and Vladimir Slepnev like this  1 comment  
Suppose that an agent \(A_{1}\) reasons in a sound theory \(T_{1}\), and an agent \(A_{2}\) reasons in a theory \(T_{2}\), such that \(T_{1}\) proves that \(T_{2}\) is sound. Now suppose \(A_{1}\) is trying to reason in a way that is legible to \(A_{2}\), in the sense that \(A_{2}\) can rely on \(A_{1}\) to reach correct conclusions. One way of doing this is for \(A_{1}\) to restrict itself to some weaker theory \(T_{3}\), which \(T_{2}\) proves is sound, for the purposes of any reasoning that it wants to be legible to \(A_{2}\). Of course, in order for this to work, not only would \(A_{1}\) have to restrict itself to using \(T_{3}\), but \(A_{2}\) would to trust that \(A_{1}\) had done so. A plausible way for that to happen is for \(A_{1}\) to reach the decision quickly enough that \(A_{2}\) can simulate \(A_{1}\) making the decision to restrict itself to using \(T_{3}\). 
 
    
Older 
 NEW POSTSNEW DISCUSSION POSTSI think that in that case,
Two minor comments. First,
by Sam Eisenstat on No Constant Distribution Can be a Logical Inductor  1 like 
A: While that is a really
> The true reason to do
A few comments.
Traps are
I'm not convinced exploration
Update: This isn't really an
by Alex Appel on A Difficulty With DensityZero Exploration  0 likes 
If you drop the
Cool! I'm happy to see this
Caveat: The version of EDT
by 258 on In memoryless Cartesian environments, every UDT po...  2 likes 
[Delegative Reinforcement
by Vadim Kosoy on Stable Pointers to Value II: Environmental Goals  1 like 
Intermediate update:
The
by Alex Appel on Further Progress on a Bayesian Version of Logical ...  0 likes 
Since Briggs [1] shows that
by 258 on In memoryless Cartesian environments, every UDT po...  2 likes 
This doesn't quite work. The
by Nisan Stiennon on Logical counterfactuals and differential privacy  0 likes 
I at first didn't understand
