Meta: IAFF vs LessWrong discussion post by Vadim Kosoy 113 days ago | Jessica Taylor likes this | 5 comments
The Learning-Theoretic AI Alignment Research Agenda
post by Vadim Kosoy 113 days ago | Alex Appel and Jessica Taylor like this | 36 comments

In this essay I will try to explain the overall structure and motivation of my AI alignment research agenda. The discussion is informal and no new theorems are proved here. The main features of my research agenda, as I explain them here, are

• Viewing AI alignment theory as part of a general abstract theory of intelligence

• Using desiderata and axiomatic definitions as starting points, rather than specific algorithms and constructions

• Formulating alignment problems in the language of learning theory

• Evaluating solutions by their formal mathematical properties, ultimately aiming at a quantitative theory of risk assessment

• Relying on the mathematical intuition derived from learning theory to pave the way to solving philosophical questions

Logical Inductor Tiling and Why it's Hard
post by Alex Appel 138 days ago | Sam Eisenstat and Abram Demski like this | discuss

(Tiling result due to Sam, exposition of obstacles due to me)

 A Loophole for Self-Applicative Soundness discussion post by Alex Appel 140 days ago | Abram Demski likes this | 4 comments
Logical Inductors Converge to Correlated Equilibria (Kinda)
post by Alex Appel 149 days ago | Sam Eisenstat and Jessica Taylor like this | 1 comment

Logical inductors of “similar strength”, playing against each other in a repeated game, will converge to correlated equilibria of the one-shot game, for the same reason that players that react to the past plays of their opponent converge to correlated equilibria. In fact, this proof is essentially just the proof from Calibrated Learning and Correlated Equilibrium by Forster (1997), adapted to a logical inductor setting.

 Logical Inductor Lemmas discussion post by Alex Appel 149 days ago | discuss
Two Notions of Best Response
post by Alex Appel 149 days ago | discuss

In game theory, there are two different notions of “best response” at play. Causal best-response corresponds to standard game-theoretic reasoning, because it assumes that the joint probability distribution over everyone else’s moves remains unchanged if one player changes their move. The second one, Evidential best-response, can model cases where the actions of the various players are not subjectively independent, such as Death in Damascus, Twin Prisoner’s Dilemma, Troll Bridge, Newcomb, and Smoking Lesion, and will be useful to analyze the behavior of logical inductors in repeated games. This is just a quick rundown of the basic properties of these two notions of best response.

 Doubts about Updatelessness discussion post by Alex Appel 172 days ago | Abram Demski likes this | 3 comments
 Computing an exact quantilal policy discussion post by Vadim Kosoy 193 days ago | discuss
 Resource-Limited Reflective Oracles discussion post by Alex Appel 194 days ago | Sam Eisenstat, Abram Demski and Jessica Taylor like this | 1 comment
 No Constant Distribution Can be a Logical Inductor discussion post by Alex Appel 198 days ago | Sam Eisenstat, Vadim Kosoy, Abram Demski, Jessica Taylor and Stuart Armstrong like this | 1 comment
 Musings on Exploration discussion post by Alex Appel 202 days ago | Vadim Kosoy likes this | 4 comments
Quantilal control for finite MDPs
post by Vadim Kosoy 205 days ago | Ryan Carey, Alex Appel and Abram Demski like this | discuss

We introduce a variant of the concept of a “quantilizer” for the setting of choosing a policy for a finite Markov decision process (MDP), where the generic unknown cost is replaced by an unknown penalty term in the reward function. This is essentially a generalization of quantilization in repeated games with a cost independence assumption. We show that the “quantilal” policy shares some properties with the ordinary optimal policy, namely that (i) it can always be chosen to be Markov (ii) it can be chosen to be stationary when time discount is geometric (iii) the “quantilum” value of an MDP with geometric time discount is a continuous piecewise rational function of the parameters, and it converges when the discount parameter $$\lambda$$ approaches 1. Finally, we demonstrate a polynomial-time algorithm for computing the quantilal policy, showing that quantilization is not qualitatively harder than ordinary optimization.

 A Difficulty With Density-Zero Exploration discussion post by Alex Appel 209 days ago | 1 comment
Distributed Cooperation
post by Alex Appel 218 days ago | Abram Demski and Scott Garrabrant like this | 2 comments

Reflective oracles can be approximated by computing Nash equilibria. But is there some procedure that produces a Pareto-optimal equilibrium in a game, aka, a point produced by a Cooperative oracle? It turns out there is. There are some interesting philosophical aspects to it, which will be typed up in the next post.

The result is not original to me, it’s been floating around MIRI for a while. I think Scott, Sam, and Abram worked on it, but there might have been others. All I did was formalize it a bit, and generalize from the 2-player 2-move case to the n-player n-move case. With the formalism here, it’s a bit hard to intuitively understand what’s going on, so I’ll indicate where to visualize an appropriate 3-dimensional object.

 Passing Troll Bridge discussion post by Alex Appel 241 days ago | Abram Demski likes this | discuss
Why we want unbiased learning processes
post by Stuart Armstrong 244 days ago | discuss

Crossposted at Lesserwrong.

tl;dr: if an agent has a biased learning process, it may choose actions that are worse (with certainty) for every possible reward function it could be learning.

 Two Types of Updatelessness discussion post by Abram Demski 248 days ago | discuss
 Stable Pointers to Value II: Environmental Goals discussion post by Abram Demski 255 days ago | 1 comment
Further Progress on a Bayesian Version of Logical Uncertainty
post by Alex Appel 263 days ago | Scott Garrabrant likes this | 1 comment

I’d like to credit Daniel Demski for helpful discussion.

 Strategy Nonconvexity Induced by a Choice of Potential Oracles discussion post by Alex Appel 268 days ago | Abram Demski likes this | discuss
An Untrollable Mathematician
post by Abram Demski 271 days ago | Alex Appel, Sam Eisenstat, Vadim Kosoy, Jack Gallagher, Jessica Taylor, Paul Christiano, Scott Garrabrant and Vladimir Slepnev like this | 1 comment

Follow-up to All Mathematicians are Trollable.

It is relatively easy to see that no computable Bayesian prior on logic can converge to a single coherent probability distribution as we update it on logical statements. Furthermore, the non-convergence behavior is about as bad as could be: someone selecting the ordering of provable statements to update on can drive the Bayesian’s beliefs arbitrarily up or down, arbitrarily many times, despite only saying true things. I called this wild non-convergence behavior “trollability”. Previously, I showed that if the Bayesian updates on the provabilily of a sentence rather than updating on the sentence itself, it is still trollable. I left open the question of whether some other side information could save us. Sam Eisenstat has closed this question, providing a simple logical prior and a way of doing a Bayesian update on it which (1) cannot be trolled, and (2) converges to a coherent distribution.

Logical counterfactuals and differential privacy
post by Nisan Stiennon 273 days ago | Abram Demski and Scott Garrabrant like this | 1 comment

This idea was informed by discussions with Abram Demski, Scott Garrabrant, and the MIRIchi discussion group.

More precise regret bound for DRL
post by Vadim Kosoy 303 days ago | Alex Appel likes this | discuss

We derive a regret bound for DRL reflecting dependence on:

• Number of hypotheses

• Mixing time of MDP hypotheses

• The probability with which the advisor takes optimal actions

That is, the regret bound we get is fully explicit up to a multiplicative constant (which can also be made explicit). Currently we focus on plain (as opposed to catastrophe) and uniform (finite number of hypotheses, uniform prior) DRL, although this result can and should be extended to the catastrophe and/or non-uniform settings.

 Value learning subproblem: learning goals of simple agents discussion post by Alex Mennen 308 days ago | discuss
Older

### NEW DISCUSSION POSTS

[Note: This comment is three
 by Ryan Carey on A brief note on factoring out certain variables | 0 likes

There should be a chat icon
 by Alex Mennen on Meta: IAFF vs LessWrong | 0 likes

Apparently "You must be
 by Jessica Taylor on Meta: IAFF vs LessWrong | 1 like

There is a replacement for
 by Alex Mennen on Meta: IAFF vs LessWrong | 1 like

Regarding the physical
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think that we should expect
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think I understand your
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

This seems like a hack. The
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

After thinking some more,
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yes, I think that we're
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

My intuition is that it must
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

To first approximation, a
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Actually, I *am* including
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yeah, when I went back and
 by Alex Appel on Optimal and Causal Counterfactual Worlds | 0 likes

> Well, we could give up on
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes