1. Meta: IAFF vs LessWrong discussion post by Vadim Kosoy 324 days ago | Jessica Taylor likes this | 5 comments
2.The Learning-Theoretic AI Alignment Research Agenda
post by Vadim Kosoy 324 days ago | Alex Appel and Jessica Taylor like this | 36 comments

In this essay I will try to explain the overall structure and motivation of my AI alignment research agenda. The discussion is informal and no new theorems are proved here. The main features of my research agenda, as I explain them here, are

• Viewing AI alignment theory as part of a general abstract theory of intelligence

• Using desiderata and axiomatic definitions as starting points, rather than specific algorithms and constructions

• Formulating alignment problems in the language of learning theory

• Evaluating solutions by their formal mathematical properties, ultimately aiming at a quantitative theory of risk assessment

• Relying on the mathematical intuition derived from learning theory to pave the way to solving philosophical questions

 3. Computing an exact quantilal policy discussion post by Vadim Kosoy 403 days ago | discuss
4.Quantilal control for finite MDPs
post by Vadim Kosoy 415 days ago | Ryan Carey, Alex Appel and Abram Demski like this | discuss

We introduce a variant of the concept of a “quantilizer” for the setting of choosing a policy for a finite Markov decision process (MDP), where the generic unknown cost is replaced by an unknown penalty term in the reward function. This is essentially a generalization of quantilization in repeated games with a cost independence assumption. We show that the “quantilal” policy shares some properties with the ordinary optimal policy, namely that (i) it can always be chosen to be Markov, (ii) it can be chosen to be stationary when time discount is geometric, and (iii) the “quantilum” value of an MDP with geometric time discount is a continuous piecewise rational function of the parameters, and it converges when the discount parameter $$\lambda$$ approaches 1. Finally, we demonstrate a polynomial-time algorithm for computing the quantilal policy, showing that quantilization is not qualitatively harder than ordinary optimization.
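
For orientation, here is a minimal sketch of the ordinary quantilizer that this post generalizes. It is not the quantilal policy for MDPs and not the post's algorithm. It assumes a finite action set, a known base distribution and a utility estimate; the function names and the toy numbers are hypothetical. A q-quantilizer keeps the top q fraction of base probability mass, ranked by utility, and renormalizes.

```python
import random


def quantilizer_distribution(base, utility, q):
    """Return the q-quantilizer distribution over a finite action set.

    base:    dict mapping action -> base probability (assumed to sum to 1)
    utility: dict mapping action -> estimated utility
    q:       fraction of base probability mass to keep, in (0, 1]
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    # Rank actions from highest to lowest utility.
    ranked = sorted(base, key=lambda a: utility[a], reverse=True)
    remaining = q
    result = {}
    for a in ranked:
        if remaining <= 1e-12:  # the top-q mass has been used up
            break
        mass = min(base[a], remaining)  # the last action kept may be split
        result[a] = mass / q            # renormalize so the result sums to 1
        remaining -= mass
    return result


def sample(dist, rng=random):
    """Draw one action from a dict mapping action -> probability."""
    actions = list(dist)
    return rng.choices(actions, weights=[dist[a] for a in actions])[0]


# Toy example (hypothetical numbers).
base = {"a": 0.5, "b": 0.3, "c": 0.2}
utility = {"a": 0.0, "b": 1.0, "c": 2.0}
top_half = quantilizer_distribution(base, utility, 0.5)
```

By construction, no action receives more than 1/q times its base probability, which is the property that limits how much an unknown cost can be amplified relative to the base distribution.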

5.More precise regret bound for DRL
post by Vadim Kosoy 514 days ago | Alex Appel likes this | discuss

We derive a regret bound for DRL reflecting dependence on:

• Number of hypotheses

• Mixing time of MDP hypotheses

• The probability with which the advisor takes optimal actions

That is, the regret bound we get is fully explicit up to a multiplicative constant (which can also be made explicit). Currently we focus on plain (as opposed to catastrophe) and uniform (finite number of hypotheses, uniform prior) DRL, although this result can and should be extended to the catastrophe and/or non-uniform settings.

 6. Why DRL doesn't work for arbitrary environments discussion post by Vadim Kosoy 536 days ago | discuss
 7. Catastrophe Mitigation Using DRL (Appendices) discussion post by Vadim Kosoy 544 days ago | discuss
8.Catastrophe Mitigation Using DRL

Previously we derived a regret bound for DRL which assumed the advisor is “locally sane.” Such an advisor can only take actions that don’t lose any value in the long term. In particular, if the environment contains a latent catastrophe that manifests with a certain rate (such as the possibility of an UFAI), a locally sane advisor has to take the optimal course of action to mitigate it, since every delay yields a positive probability of the catastrophe manifesting and leading to permanent loss of value. This state of affairs is unsatisfactory, since we would like to have performance guarantees for an AI that can mitigate catastrophes that the human operator cannot mitigate on their own. To address this problem, we introduce a new form of DRL where in every hypothetical environment the set of uncorrupted states is divided into “dangerous” (impending catastrophe) and “safe” (catastrophe was mitigated). The advisor is then only required to be locally sane in safe states, whereas in dangerous states certain “leaking” of long-term value is allowed. We derive a regret bound in this setting as a function of the time discount factor, the expected value of catastrophe mitigation time for the optimal policy, and the “value leak” rate (i.e. essentially the rate of catastrophe occurrence). The form of this regret bound implies that in certain asymptotic regimes, the agent attains near-optimal expected utility (and in particular mitigates the catastrophe with probability close to 1), whereas the advisor on its own fails to mitigate the catastrophe with probability close to 1.

9.Delegative Reinforcement Learning with a Merely Sane Advisor
post by Vadim Kosoy 628 days ago | discuss

Previously, we defined a setting called “Delegative Inverse Reinforcement Learning” (DIRL) in which the agent can delegate actions to an “advisor” and the reward is visible only to the advisor. We proved a sublinear regret bound (converted to traditional normalization in online learning, the bound is $$O(n^{2/3})$$) for one-shot DIRL (as opposed to standard regret bounds in RL which are only applicable in the episodic setting). However, this required a rather strong assumption about the advisor: in particular, the advisor had to choose the optimal action with maximal likelihood. Here, we consider “Delegative Reinforcement Learning” (DRL), i.e. a similar setting in which the reward is directly observable by the agent. We also restrict our attention to finite MDP environments (we believe these results can be generalized to a much larger class of environments, but not to arbitrary environments). On the other hand, the assumption about the advisor is much weaker: the advisor is only required to avoid catastrophic actions (i.e. actions that lose value to zeroth order in the interest rate) and assign some positive probability to a nearly optimal action. As before, we prove a one-shot regret bound (in traditional normalization, $$O(n^{3/4})$$). Analogously to before, we allow for “corrupt” states in which both the advisor and the reward signal stop being reliable.
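
To spell out what sublinearity buys in the traditional normalization, write $$\mathrm{Reg}(n)$$ for the cumulative regret after $$n$$ steps (a notation introduced here for illustration only, not taken from the post). Dividing by $$n$$ gives the average regret per step, which vanishes for both bounds:

$\mathrm{Reg}(n) = O(n^{2/3}) \implies \frac{\mathrm{Reg}(n)}{n} = O(n^{-1/3}) \to 0, \qquad \mathrm{Reg}(n) = O(n^{3/4}) \implies \frac{\mathrm{Reg}(n)}{n} = O(n^{-1/4}) \to 0$

The weaker assumption on the advisor therefore comes with a slower rate, $$n^{-1/4}$$ instead of $$n^{-1/3}$$.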

 10. On the computational feasibility of forecasting using gamblers discussion post by Vadim Kosoy 671 days ago | discuss
 11. Improved formalism for corruption in DIRL discussion post by Vadim Kosoy 677 days ago | discuss
12.Delegative Inverse Reinforcement Learning
post by Vadim Kosoy 688 days ago | Alex Appel likes this | 11 comments

We introduce a reinforcement-like learning setting we call Delegative Inverse Reinforcement Learning (DIRL). In DIRL, the agent can, at any point of time, delegate the choice of action to an “advisor”. The agent knows neither the environment nor the reward function, whereas the advisor knows both. Thus, DIRL can be regarded as a special case of CIRL. A similar setting was studied in Clouse 1997, but as far as we can tell, the relevant literature offers few theoretical results and virtually all researchers focus on the MDP case (please correct me if I’m wrong). On the other hand, we consider general environments (not necessarily MDP or even POMDP) and prove a natural performance guarantee.

13.Learning incomplete models using dominant markets
post by Vadim Kosoy 794 days ago | Jessica Taylor likes this | discuss

This post is a formal treatment of the idea outlined here.

Given a countable set of incomplete models, we define a forecasting function that converges in the Kantorovich-Rubinstein metric with probability 1 to every one of the models which is satisfied by the true environment. This is analogous to Blackwell-Dubins merging of opinions for complete models, except that Kantorovich-Rubinstein convergence is weaker than convergence in total variation. The forecasting function is a dominant stochastic market for a suitably constructed set of traders.
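
A standard example, not taken from the post, shows why Kantorovich-Rubinstein convergence is weaker than convergence in total variation. On $$[0,1]$$, take the point masses $$\delta_{1/n}$$ and $$\delta_0$$. Testing against 1-Lipschitz functions $$f$$ gives

$d_{\mathrm{KR}}(\delta_{1/n}, \delta_0) = \sup_{f \text{ 1-Lipschitz}} \left| f(1/n) - f(0) \right| = \frac{1}{n} \to 0$

whereas testing against the event $$\{0\}$$ gives

$d_{\mathrm{TV}}(\delta_{1/n}, \delta_0) = \sup_{A} \left| \delta_{1/n}(A) - \delta_0(A) \right| = 1 \quad \text{for all } n$

So the sequence converges in the Kantorovich-Rubinstein metric but not in total variation.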

14.Dominant stochastic markets
post by Vadim Kosoy 801 days ago | discuss

We generalize the formalism of dominant markets to account for stochastic “deductive processes,” and prove a theorem regarding the asymptotic behavior of such markets. In a following post, we will show how to use these tools to formalize the ideas outlined here.

 15. A measure-theoretic generalization of logical induction discussion post by Vadim Kosoy 855 days ago | Jessica Taylor and Scott Garrabrant like this | discuss
 16. Towards learning incomplete models using inner prediction markets discussion post by Vadim Kosoy 862 days ago | Jessica Taylor and Paul Christiano like this | 4 comments
 17. Subagent perfect minimax discussion post by Vadim Kosoy 864 days ago | discuss
 18. Minimax and dynamic (in)consistency discussion post by Vadim Kosoy 890 days ago | discuss
19.Minimax forecasting

This post continues the research programme of attacking the grain of truth problem by departure from the Bayesian paradigm. In the previous post, I suggested using Savage’s minimax regret decision rule, but here I fall back to the simple minimax decision rule. This is because the mathematics is considerably simpler, and minimax should be sufficient to get IUD play in general games and Nash equilibrium in zero-sum two-player games. I hope to build on these results to get analogous results for minimax regret in the future.

We consider “semi-Bayesian” agents following the minimax expected utility decision rule, in oblivious environments with full monitoring (a setting that we will refer to as “forecasting”). This setting is considered in order to avoid the need to enforce exploration, as a preparation for analysis of general environments. We show that such agents satisfy a certain asymptotic optimality theorem. Intuitively, this theorem means that whenever the environment satisfies an incomplete model that is included in the prior, the agent will eventually learn this model, i.e. extract at least as much utility as can be guaranteed for this model.
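
As a rough illustration of the minimax expected utility rule, here is a one-shot toy sketch. It is not the forecasting setting analyzed in the post: it assumes a finite set of pure actions and a finite set of candidate environments standing in for an incomplete model, and all names and payoffs are hypothetical.

```python
def minimax_action(actions, environments, expected_utility):
    """Pick the action whose worst-case expected utility is largest.

    expected_utility(a, e) returns the expected utility of action a
    in environment e.
    """
    def worst_case(a):
        # Utility guaranteed by action a against every candidate environment.
        return min(expected_utility(a, e) for e in environments)

    best = max(actions, key=worst_case)
    return best, worst_case(best)


# Toy payoff table (hypothetical numbers).
payoff = {
    ("safe", "e1"): 0.5, ("safe", "e2"): 0.5,
    ("risky", "e1"): 1.0, ("risky", "e2"): 0.0,
}
action, value = minimax_action(
    ["safe", "risky"], ["e1", "e2"], lambda a, e: payoff[(a, e)]
)
```

Here the rule selects "safe" with a guaranteed value of 0.5, since "risky" guarantees only 0.0 in the worst case. This sketch considers pure actions only; allowing randomized actions could raise the guaranteed value in other examples.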

 20. Attacking the grain of truth problem using Bayes-Savage agents discussion post by Vadim Kosoy 942 days ago | Paul Christiano likes this | discuss
21.IRL is hard

We show that assuming the existence of public-key cryptography, there is an environment in which Inverse Reinforcement Learning is computationally intractable, even though the “teacher” agent, the environment and the utility functions are computable in polynomial-time and there is only 1 bit of information to learn.

22.Stabilizing logical counterfactuals by pseudorandomization
post by Vadim Kosoy 1189 days ago | Abram Demski likes this | 2 comments

Previously, we discussed the construction of logical counterfactuals in the language of optimal predictors. These counterfactuals were found to be well-behaved when a certain non-degeneracy condition is met which can be understood as a bound on the agent’s ability to predict itself. We also demonstrated that desired game-theoretic behavior seems to require randomization (thermalizing instead of maximizing) which has to be logical randomization to implement metathreat game theory by logical counterfactuals. Both of these considerations suggest that the agent has to pseudorandomize (randomize in the logical uncertainty sense) its own behavior. Here, we show how to implement this pseudorandomization and prove it indeed guarantees the non-degeneracy condition.

## Results

The proofs of the results are given in Appendix A.

23.Logical counterfactuals for random algorithms
post by Vadim Kosoy 1230 days ago | Abram Demski, Nate Soares and Patrick LaVictoire like this | discuss

Updateless decision theory was informally defined by Wei Dai in terms of logical conditional expected utility, where the condition corresponds to an algorithm (the agent) producing a given output (action or policy). Conditional expected values of this kind can be formalised by optimal predictors. However, since optimal predictor systems, which are required to apply optimal predictors to decision theory, generally have random advice, we need counterfactuals that are well-defined for random algorithms, i.e. algorithms that produce different outputs with different probabilities depending on internal coin tosses. We propose to define these counterfactuals by a generalization of the notion of conditional expected utility which amounts to linear regression of utility with respect to the probabilities of different outputs in the space of “impossible possible worlds.” We formalise this idea by introducing “relative optimal predictors,” and prove the analogue of the conditional probability formula (which takes matrix form) and uniqueness theorems.

## Motivation

We start by explaining the analogous construction in classical probability theory and proceed to defining the logical counterpart in the Results section.

Consider a probability measure $$\zeta$$ on some space, a random variable $$u$$ representing utility, a finite set $$\mathcal{A}$$ representing possible actions and another random variable $$p$$ taking values in $$[0,1]^{\mathcal{A}}$$ and satisfying $$\sum_{a \in \mathcal{A}} p_a = 1$$, representing the probabilities of taking different actions. For a deterministic algorithm, $$p$$ takes values in $$\{0,1\}^{\mathcal{A}}$$, which allows us to define conditional expected utility as

$u_a := \operatorname{E}_\zeta[u \mid p_a = 1] = \frac{\operatorname{E}_\zeta[u p_a]}{\operatorname{E}_\zeta[p_a]}$

In the general case, it is tempting to consider

$\operatorname{E}_{\zeta \ltimes p}[u \mid a] = \frac{\operatorname{E}_\zeta[u p_a]}{\operatorname{E}_\zeta[p_a]}$

where $$\zeta \ltimes p$$ stands for the semidirect product of $$\zeta$$ with $$p$$, the latter regarded as a Markov kernel with target $$\mathcal{A}$$. However, this would lead to behavior similar to EDT since conditioning by $$a$$ is meaningful even for a single “world” (i.e. completely deterministic $$u$$ and $$p$$). Instead, we select $$u^* \in {\mathbb{R}}^{\mathcal{A}}$$ that minimizes $$\operatorname{E}_\zeta[(u - p^t u^*)^2]$$ (we regard elements of $${\mathbb{R}}^{\mathcal{A}}$$ as column vectors so $$p^t$$ is a row vector). This means $$u^*$$ has to satisfy the matrix equation

$\operatorname{E}_\zeta[p p^t] u^* = \operatorname{E}_\zeta[u p]$

The solution to this equation is only unique when $$\operatorname{E}_\zeta[p p^t]$$ is non-degenerate. This corresponds to requiring positive probability of the condition for usual conditional expected values. In case $$p$$ takes values in $$\{0,1\}^{\mathcal{A}}$$, $$u^*$$ is the usual conditional expected value.
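
The classical construction above can be checked numerically. The following is a minimal sketch for two actions and a finite set of weighted “worlds”; the data layout, the function name and the numbers are assumptions made for illustration. It solves $$\operatorname{E}_\zeta[p p^t] u^* = \operatorname{E}_\zeta[u p]$$ directly and refuses to answer in the degenerate case.

```python
def counterfactual_utilities(worlds):
    """Solve E[p p^t] u* = E[u p] for two actions.

    worlds: list of (weight, u, (p0, p1)) with weights summing to 1,
            where (p0, p1) are the action probabilities in that world.
    """
    m00 = m01 = m11 = b0 = b1 = 0.0
    for w, u, (p0, p1) in worlds:
        # Accumulate the entries of the symmetric matrix E[p p^t].
        m00 += w * p0 * p0
        m01 += w * p0 * p1
        m11 += w * p1 * p1
        # Accumulate the entries of the vector E[u p].
        b0 += w * u * p0
        b1 += w * u * p1
    det = m00 * m11 - m01 * m01
    if abs(det) < 1e-12:
        raise ValueError("E[p p^t] is degenerate, so u* is not unique")
    # Cramer's rule for the 2x2 system.
    return ((m11 * b0 - m01 * b1) / det, (m00 * b1 - m01 * b0) / det)


# Deterministic p: u* reduces to the usual conditional expectations.
deterministic = [
    (0.5, 1.0, (1.0, 0.0)),
    (0.25, 3.0, (0.0, 1.0)),
    (0.25, 5.0, (0.0, 1.0)),
]
u_det = counterfactual_utilities(deterministic)

# Random p with u = p^t (2, 6) in every world: regression recovers (2, 6).
randomized = [
    (0.5, 5.0, (0.25, 0.75)),
    (0.5, 4.0, (0.5, 0.5)),
]
u_rand = counterfactual_utilities(randomized)
```

In the deterministic example the result is (1.0, 4.0), matching $$\operatorname{E}_\zeta[u \mid p_a = 1]$$ for each action. A single world with $$p = (0.5, 0.5)$$ makes the matrix degenerate, which is the regression counterpart of conditioning on a probability-zero event.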

## Preliminaries

24.Implementing CDT with optimal predictor systems
post by Vadim Kosoy 1247 days ago | Patrick LaVictoire likes this | 2 comments

We consider transparent games between bounded CDT agents (“transparent” meaning each player has a model of the other players). The agents compute the expected utility of each possible action by executing an optimal predictor of a causal counterfactual, i.e. an optimal predictor for a function that evaluates the other players and computes the utility for the selected action. Since the agents simultaneously attempt to predict each other, the optimal predictors form an optimal predictor system for the reflective system comprised by the causal counterfactuals of all agents. We show that for strict maximizers, the resulting outcome is a bounded analogue of an approximate Nash equilibrium, i.e. a strategy which is an optimal response within certain resource constraints up to an asymptotically small error. For “thermalizers” (agents that choose an action with probability proportional to $$2^{\frac{u}{T}}$$), we get a similar result with expected utility $$\operatorname{E}_s[u]$$ replaced by “free utility” $$\operatorname{E}_s[u]+T \operatorname{H}(s)$$. Thus, such optimal predictor systems behave like bounded counterparts of reflective oracles.
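
To make the “thermalizer” and “free utility” notions concrete, here is a small sketch for a single agent with fixed utilities. It ignores the game-theoretic and bounded-computation aspects of the post entirely, and it assumes $$\operatorname{H}$$ is Shannon entropy in bits (consistent with the base 2 in $$2^{\frac{u}{T}}$$). Function names and numbers are hypothetical.

```python
import math


def thermal_strategy(utilities, T):
    """Return probabilities proportional to 2 ** (u / T)."""
    m = max(utilities)  # shift by the maximum for numerical stability
    weights = [2.0 ** ((u - m) / T) for u in utilities]
    z = sum(weights)
    return [w / z for w in weights]


def free_utility(strategy, utilities, T):
    """Compute E_s[u] + T * H(s), with entropy H measured in bits."""
    expected = sum(s * u for s, u in zip(strategy, utilities))
    entropy = -sum(s * math.log2(s) for s in strategy if s > 0)
    return expected + T * entropy


utilities = [0.0, 1.0, 2.0]
T = 1.0
s = thermal_strategy(utilities, T)            # proportional to 1 : 2 : 4
f_thermal = free_utility(s, utilities, T)
f_greedy = free_utility([0.0, 0.0, 1.0], utilities, T)
f_uniform = free_utility([1 / 3, 1 / 3, 1 / 3], utilities, T)
```

Under these assumptions the thermal strategy attains free utility $$T \log_2 \sum_a 2^{u_a / T}$$, which for this example is $$\log_2 7$$, and this exceeds the free utility of both the strict maximizer and the uniform strategy.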

## Preliminaries

The proofs for this section are given in Appendix A.

We redefine $$\mathcal{E}_{2(ll,\phi)}$$ and $$\mathcal{E}_{2(ll)}$$ to be somewhat smaller proto-error spaces which nevertheless yield the same existence theorems as before. This is thanks to Lemma A.1.

25.Reflection with optimal predictors
post by Vadim Kosoy 1271 days ago | Patrick LaVictoire likes this | discuss

A change in terminology: It is convenient when important concepts have short names. The concept of an “optimal predictor scheme” seems much more important than its historical predecessor, the “optimal predictor”. Therefore “optimal predictor schemes” will be henceforth called just “optimal predictors” while the previous concept of “optimal predictor” might be called “flat optimal predictor”.

We study systems of computations which have access to optimal predictors for each other. We expect such systems to play an important role in decision theory (where self-prediction is required to define logical counterfactuals and mutual prediction is required for a collection of agents in a game) and Vingean reflection (where the different computations correspond to different successor agents). The previously known existence theorems for optimal predictors are not directly applicable to this case. To overcome this we prove new, specifically tailored existence theorems.

The Results section states the main novelties, Appendix A contains adaptations of old theorems, Appendix B proves selected claims from Appendix A and Appendix C proves the novel results.

## Results
