 1.  Delegative Reinforcement Learning with a Merely Sane Advisor   post by Vadim Kosoy 20 days ago  discuss  
Previously, we defined a setting called “Delegative Inverse Reinforcement Learning” (DIRL) in which the agent can delegate actions to an “advisor” and the reward is visible only to the advisor. We proved a sublinear regret bound (converted to traditional normalization in online learning, the bound is \(O(n^{2/3})\)) for one-shot DIRL (as opposed to standard regret bounds in RL, which are only applicable in the episodic setting). However, this required a rather strong assumption about the advisor: in particular, the advisor had to choose the optimal action with maximal likelihood. Here, we consider “Delegative Reinforcement Learning” (DRL), i.e. a similar setting in which the reward is directly observable by the agent. We also restrict our attention to finite MDP environments (we believe these results can be generalized to a much larger class of environments, but not to arbitrary environments). On the other hand, the assumption about the advisor is much weaker: the advisor is only required to avoid catastrophic actions (i.e. actions that lose value to zeroth order in the interest rate) and to assign some positive probability to a nearly optimal action. As before, we prove a one-shot regret bound (in traditional normalization, \(O(n^{3/4})\)). Analogously to before, we allow for “corrupt” states in which both the advisor and the reward signal stop being reliable.
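As a rough illustration of the delegation interface (a sketch of my own, not the paper's algorithm; `agent_is_confident`, `agent_policy`, and `advisor_policy` are hypothetical placeholders), a DRL-style step either acts directly or hands the choice to the advisor:

```python
# Sketch of the DRL delegation interface (illustration only, not the
# paper's construction). At each step the agent either acts on its own
# or delegates, in which case the advisor picks the action. The advisor
# is only assumed "sane": it avoids catastrophic actions and gives some
# near-optimal action positive probability, not maximal probability.
def step(state, agent_is_confident, agent_policy, advisor_policy):
    if agent_is_confident(state):
        return agent_policy(state), False   # agent acts on its own
    return advisor_policy(state), True      # delegated to the advisor


# Usage: a toy state where the agent is unsure and therefore delegates.
action, delegated = step("s0",
                         agent_is_confident=lambda s: False,
                         agent_policy=lambda s: "explore",
                         advisor_policy=lambda s: "safe_action")
print(action, delegated)
```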
 
    4.  Delegative Inverse Reinforcement Learning   post by Vadim Kosoy 80 days ago  8 comments  
We introduce a reinforcement-like learning setting we call Delegative Inverse Reinforcement Learning (DIRL). In DIRL, the agent can, at any point in time, delegate the choice of action to an “advisor”. The agent knows neither the environment nor the reward function, whereas the advisor knows both. Thus, DIRL can be regarded as a special case of CIRL. A similar setting was studied in Clouse 1997, but as far as we can tell, the relevant literature offers few theoretical results, and virtually all researchers focus on the MDP case (please correct me if I’m wrong). On the other hand, we consider general environments (not necessarily MDP or even POMDP) and prove a natural performance guarantee.
The use of an advisor allows us to kill two birds with one stone: learning the reward function and safe exploration (i.e. avoiding both the Scylla of “Bayesian paranoia” and the Charybdis of falling into traps). We prove that, given certain assumptions about the advisor, a Bayesian DIRL agent (whose prior is supported on some countable set of hypotheses) is guaranteed to attain most of the value in the limit of slowly falling time discount (long-term planning), assuming one of the hypotheses in the prior is true. The assumption about the advisor is quite strong, but the advisor is not required to be fully optimal: a “soft maximizer” satisfies the conditions. Moreover, we allow for the existence of “corrupt states” in which the advisor stops being a relevant signal, thus demonstrating that this approach can deal with wireheading and avoid manipulating the advisor, at least in principle (the assumption about the advisor is still unrealistically strong). Finally, we consider advisors that don’t know the environment but have some beliefs about it, and show that in this case the agent converges to Bayes-optimality w.r.t. the advisor’s beliefs, which is arguably the best we can expect.
 
        11.  Minimax forecasting   post by Vadim Kosoy 282 days ago  2 comments  
This post continues the research programme of attacking the grain of truth problem by departure from the Bayesian paradigm. In the previous post, I suggested using Savage’s minimax regret decision rule, but here I fall back to the simple minimax decision rule. This is because the mathematics is considerably simpler, and minimax should be sufficient to get IUD play in general games and Nash equilibrium in zero-sum two-player games. I hope to build on these results to get analogous results for minimax regret in the future.
We consider “semi-Bayesian” agents following the minimax expected utility decision rule, in oblivious environments with full monitoring (a setting that we will refer to as “forecasting”). This setting is considered in order to avoid the need to enforce exploration, as a preparation for the analysis of general environments. We show that such agents satisfy a certain asymptotic optimality theorem. Intuitively, this theorem means that whenever the environment satisfies an incomplete model that is included in the prior, the agent will eventually learn this model, i.e. extract at least as much utility as can be guaranteed for this model.
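For intuition, the minimax expected utility rule over a finite hypothesis set can be sketched as follows (my toy construction, not the post's formal setting; the actions, hypotheses, and utilities are invented for illustration):

```python
# Toy sketch of the minimax expected-utility decision rule: choose the
# action whose worst-case expected utility over the hypotheses is largest.
def minimax_action(actions, hypotheses, expected_utility):
    # expected_utility(action, hypothesis) -> float
    return max(actions,
               key=lambda a: min(expected_utility(a, h) for h in hypotheses))


# Usage: two actions, two hypotheses about the environment (made-up numbers).
eu = {("safe", "h1"): 1.0, ("safe", "h2"): 1.0,
      ("risky", "h1"): 3.0, ("risky", "h2"): -5.0}
best = minimax_action(["safe", "risky"], ["h1", "h2"],
                      lambda a, h: eu[(a, h)])
print(best)  # "safe": its worst case (1.0) beats "risky"'s worst case (-5.0)
```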
 
   13.  IRL is hard   post by Vadim Kosoy 371 days ago  6 comments  
We show that, assuming the existence of public-key cryptography, there is an environment in which Inverse Reinforcement Learning is computationally intractable, even though the “teacher” agent, the environment and the utility functions are computable in polynomial time and there is only 1 bit of information to learn.
  14.  Stabilizing logical counterfactuals by pseudorandomization   post by Vadim Kosoy 581 days ago  Abram Demski likes this  2 comments  
Previously, we discussed the construction of logical counterfactuals in the language of optimal predictors. These counterfactuals were found to be well-behaved when a certain non-degeneracy condition is met, which can be understood as a bound on the agent’s ability to predict itself. We also demonstrated that desired game-theoretic behavior seems to require randomization (thermalizing instead of maximizing), which has to be logical randomization to implement meta-threat game theory by logical counterfactuals. Both of these considerations suggest that the agent has to pseudorandomize (randomize in the logical uncertainty sense) its own behavior. Here, we show how to implement this pseudorandomization and prove that it indeed guarantees the non-degeneracy condition.
Results
The proofs of the results are given in Appendix A.
 
  15.  Logical counterfactuals for random algorithms   post by Vadim Kosoy 622 days ago  Abram Demski, Nate Soares and Patrick LaVictoire like this  discuss  
Updateless decision theory was informally defined by Wei Dai in terms of logical conditional expected utility, where the condition corresponds to an algorithm (the agent) producing a given output (action or policy). This kind of conditional expected value can be formalised by optimal predictors. However, since the optimal predictor systems required to apply optimal predictors to decision theory generally have random advice, we need counterfactuals well-defined for random algorithms, i.e. algorithms that produce different outputs with different probabilities depending on internal coin tosses. We propose to define these counterfactuals by a generalization of the notion of conditional expected utility which amounts to linear regression of utility with respect to the probabilities of different outputs in the space of “impossible possible worlds.” We formalise this idea by introducing “relative optimal predictors,” and prove the analogue of the conditional probability formula (which takes matrix form) and uniqueness theorems.
Motivation
We start by explaining the analogous construction in classical probability theory and proceed to defining the logical counterpart in the Results section.
Consider a probability measure \(\zeta\) on some space, a random variable \(u\) representing utility, a finite set \(\mathcal{A}\) representing possible actions and another random variable \(p\) taking values in \([0,1]^{\mathcal{A}}\) and satisfying \(\sum_{a \in \mathcal{A}} p_a = 1\), representing the probabilities of taking different actions. For a deterministic algorithm, \(p\) takes values in \(\{0,1\}^{\mathcal{A}}\), allowing us to define conditional expected utility as
\[u_a := \operatorname{E}_\zeta[u \mid p_a = 1] = \frac{\operatorname{E}_\zeta[u p_a]}{\operatorname{E}_\zeta[p_a]}\]
In the general case, it is tempting to consider
\[\operatorname{E}_{\zeta \ltimes p}[u \mid a] = \frac{\operatorname{E}_\zeta[u p_a]}{\operatorname{E}_\zeta[p_a]}\]
where \(\zeta \ltimes p\) stands for the semidirect product of \(\zeta\) with \(p\), the latter regarded as a Markov kernel with target \(\mathcal{A}\). However, this would lead to behavior similar to EDT, since conditioning on \(a\) is meaningful even for a single “world” (i.e. completely deterministic \(u\) and \(p\)). Instead, we select \(u^* \in {\mathbb{R}}^{\mathcal{A}}\) that minimizes \(\operatorname{E}_\zeta[(u - p^t u^*)^2]\) (we regard elements of \({\mathbb{R}}^{\mathcal{A}}\) as column vectors, so \(p^t\) is a row vector). This means \(u^*\) has to satisfy the matrix equation
\[\operatorname{E}_\zeta[p p^t] u^* = \operatorname{E}_\zeta[u p]\]
The solution to this equation is only unique when \(\operatorname{E}_\zeta[p p^t]\) is nondegenerate. This corresponds to requiring positive probability of the condition for usual conditional expected values. In case \(p\) takes values in \(\{0,1\}^{\mathcal{A}}\), \(u^*\) is the usual conditional expected value.
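The construction above can be checked numerically on a toy example (my illustration; the worlds and numbers are invented). With deterministic \(p\), solving \(\operatorname{E}_\zeta[p p^t] u^* = \operatorname{E}_\zeta[u p]\) reproduces the ordinary conditional expectations:

```python
# Toy check of the regression definition of counterfactual utility:
# solve E_zeta[p p^t] u* = E_zeta[u p] for two actions.
# Each world is (probability under zeta, utility u, action-probability vector p).
worlds = [
    (0.5, 1.0, [1.0, 0.0]),  # action 0 taken deterministically, utility 1
    (0.3, 0.0, [1.0, 0.0]),  # action 0 taken, utility 0
    (0.2, 4.0, [0.0, 1.0]),  # action 1 taken, utility 4
]

# M = E_zeta[p p^t] (2x2 matrix), b = E_zeta[u p] (length-2 vector)
M = [[sum(w * p[i] * p[j] for w, _, p in worlds) for j in range(2)]
     for i in range(2)]
b = [sum(w * u * p[i] for w, u, p in worlds) for i in range(2)]

# Solve M u* = b by Cramer's rule (requires M non-degenerate).
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
u_star = [
    (b[0] * M[1][1] - b[1] * M[0][1]) / det,
    (M[0][0] * b[1] - M[1][0] * b[0]) / det,
]

# Since p is deterministic here, u* equals the conditional expectations:
# u*_0 = E[u | p_0 = 1] = (0.5*1.0 + 0.3*0.0) / 0.8 = 0.625, u*_1 = 4.0.
print(u_star)
```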
Preliminaries
 
  16.  Implementing CDT with optimal predictor systems   post by Vadim Kosoy 639 days ago  Patrick LaVictoire likes this  2 comments  
We consider transparent games between bounded CDT agents (“transparent” meaning each player has a model of the other players). The agents compute the expected utility of each possible action by executing an optimal predictor of a causal counterfactual, i.e. an optimal predictor for a function that evaluates the other players and computes the utility for the selected action. Since the agents simultaneously attempt to predict each other, the optimal predictors form an optimal predictor system for the reflective system comprising the causal counterfactuals of all agents. We show that for strict maximizers, the resulting outcome is a bounded analogue of an approximate Nash equilibrium, i.e. a strategy which is an optimal response within certain resource constraints, up to an asymptotically small error. For “thermalizers” (agents that choose an action with probability proportional to \(2^{\frac{u}{T}}\)), we get a similar result with expected utility \(\operatorname{E}_s[u]\) replaced by “free utility” \(\operatorname{E}_s[u]+T \operatorname{H}(s)\). Thus, such optimal predictor systems behave like bounded counterparts of reflective oracles.
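The thermalizer/free-utility relationship can be checked on a toy example (my own sanity check, not the post's formal result; entropy is taken in bits to pair with the base-2 weights): the distribution with \(s_a \propto 2^{u_a/T}\) maximizes \(\operatorname{E}_s[u] + T\operatorname{H}(s)\).

```python
import math

def thermalize(u, T):
    # Thermalizer distribution: s_a proportional to 2^(u_a / T).
    w = [2 ** (x / T) for x in u]
    z = sum(w)
    return [x / z for x in w]

def free_utility(s, u, T):
    # E_s[u] + T * H(s), with entropy in bits to match the base-2 weights.
    eu = sum(p * x for p, x in zip(s, u))
    h = -sum(p * math.log2(p) for p in s if p > 0)
    return eu + T * h

# Toy utilities and temperature (made-up numbers).
u, T = [1.0, 0.0, 2.0], 0.5
s = thermalize(u, T)

# The thermalizer beats both the uniform distribution and the strict
# maximizer (which has zero entropy) on free utility.
uniform = [1 / 3] * 3
strict = [0.0, 0.0, 1.0]
print(free_utility(s, u, T), free_utility(uniform, u, T), free_utility(strict, u, T))
```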
Preliminaries
The proofs for this section are given in Appendix A.
We redefine \(\mathcal{E}_{2(ll,\phi)}\) and \(\mathcal{E}_{2(ll)}\) to be somewhat smaller proto-error spaces which nevertheless yield the same existence theorems as before. This is thanks to Lemma A.1.
 
  17.  Reflection with optimal predictors   post by Vadim Kosoy 663 days ago  Patrick LaVictoire likes this  discuss  
 A change in terminology: It is convenient when important concepts have short names. The concept of an “optimal predictor scheme” seems much more important than its historical predecessor, the “optimal predictor”. Therefore “optimal predictor schemes” will be henceforth called just “optimal predictors” while the previous concept of “optimal predictor” might be called “flat optimal predictor”.
We study systems of computations which have access to optimal predictors for each other. We expect such systems to play an important role in decision theory (where self-prediction is required to define logical counterfactuals and mutual prediction is required for a collection of agents in a game) and Vingean reflection (where the different computations correspond to different successor agents). The previously known existence theorems for optimal predictors are not directly applicable to this case. To overcome this, we prove new, specifically tailored existence theorems.
The Results section states the main novelties, Appendix A contains adaptations of old theorems, Appendix B proves selected claims from Appendix A and Appendix C proves the novel results.
Results
 
   19.  Bounded Solomonoff induction using optimal predictor schemes   post by Vadim Kosoy 679 days ago  Patrick LaVictoire likes this  discuss  
Most of the content of this post was covered by the talk I gave at the Los Angeles MIRIx in October, minus the proofs and a minor amendment to Theorem 1 (the role of \(\Delta_{sqp,\phi}^2\)).
We define variants of the concept of a generatable distributional estimation problem and show that these variants also admit a uniformly universal optimal predictor scheme. We show how to use this to implement a form of bounded Solomonoff induction.
Results
We have previously defined a “word ensemble” to be a collection \(\{\mu^k\}_{k \in {\mathbb{N}}}\) of probability measures on \({{\{ 0, 1 \}^*}}\) s.t. for some polynomial \(p\), \(\operatorname{supp}\mu^k \subseteq {{\{ 0, 1 \}^{\leq{p(k)}}}}\). This was convenient when the formalism was based on Boolean circuits but unnecessary for Turing machines. It is enough to assume that the Turing machine is allowed to read only the beginning of the input and thus halt in time arbitrarily smaller than the length of the input. In the following we will use “word ensemble” to mean an arbitrary sequence of probability measures on \({{\{ 0, 1 \}^*}}\), allow such word ensembles in distributional estimation problems etc.
All proofs are in the Appendix.
We start by defining “\(\Delta(log)\)sampler” and “\(\Delta(log)\)generator” for \(\Delta\) an error space of rank 2 (they were previously defined for an error space of rank 1). Fix such an error space.
 
     23.  Towards reflection with relative optimal predictor schemes   post by Vadim Kosoy 720 days ago  Jessica Taylor likes this  discuss  
We construct a family of oracles relative to which all estimation problems that have approximate solutions in exponential time are generatable. Relative to these oracles, we can use the optimal predictor scheme \(\Lambda\) to assign expectation values to arbitrary deterministic computations, within time that is super-quasipolynomial in the logarithm of the computation time. In particular, this should allow reflection in the sense of Garrabrant, once we succeed in derandomizing \(\Lambda\) relative to the same oracle.
Results
 
   