A measure-theoretic generalization of logical induction discussion post by Vadim Kosoy 23 hours ago | Jessica Taylor likes this | discuss
 Open problem: thin logical priors discussion post by Tsvi Benson-Tilsen 4 days ago | Ryan Carey, Jessica Taylor and Scott Garrabrant like this | 2 comments
 Towards learning incomplete models using inner prediction markets discussion post by Vadim Kosoy 8 days ago | Paul Christiano likes this | 4 comments
 Subagent perfect minimax discussion post by Vadim Kosoy 9 days ago | discuss
 Pursuing convergent instrumental subgoals on the user's behalf doesn't always require good priors discussion post by Jessica Taylor 17 days ago | Daniel Dewey, Paul Christiano and Stuart Armstrong like this | 9 comments
My current take on the Paul-MIRI disagreement on alignability of messy AI
post by Jessica Taylor 23 days ago | Ryan Carey, Vadim Kosoy, Daniel Dewey, Scott Garrabrant and Stuart Armstrong like this | 40 comments

Paul Christiano and “MIRI” have disagreed on an important research question for a long time: should we focus research on aligning “messy” AGI (e.g. one found through gradient descent or brute force search) with human values, or on developing “principled” AGI (based on theories similar to Bayesian probability theory)? I’m going to present my current model of this disagreement and additional thoughts about it.

 Ontology, lost purposes, and instrumental goals discussion post by Stuart Armstrong 25 days ago | discuss
The best value indifference method (so far)
post by Stuart Armstrong 33 days ago | 8 comments

When dealing with the problem of bias, I stumbled upon what I believe is the best way of getting value indifference, one that solves almost all of the problems with the previous methods.

 Minimax and dynamic (in)consistency discussion post by Vadim Kosoy 36 days ago | discuss
Minimax forecasting

This post continues the research programme of attacking the grain of truth problem by departing from the Bayesian paradigm. In the previous post, I suggested using Savage’s minimax regret decision rule, but here I fall back to the simple minimax decision rule. This is because the mathematics is considerably simpler, and minimax should be sufficient to get IUD play in general games and Nash equilibrium in zero-sum two-player games. I hope to build on these results to get analogous results for minimax regret in the future.

We consider “semi-Bayesian” agents following the minimax expected utility decision rule, in oblivious environments with full monitoring (a setting that we will refer to as “forecasting”). This setting is considered in order to avoid the need to enforce exploration, as a preparation for analysis of general environments. We show that such agents satisfy a certain asymptotic optimality theorem. Intuitively, this theorem means that whenever the environment satisfies an incomplete model that is included in the prior, the agent will eventually learn this model, i.e. extract at least as much utility as can be guaranteed for this model.
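A toy sketch of the decision rule (not the post's formalism): treat each incomplete model as a set of candidate outcome distributions, and have the agent choose the action that maximizes its worst-case expected utility over every distribution consistent with the models. All names and numbers below are illustrative assumptions.

```python
# Minimax expected utility over incomplete models: each "model" is a set
# of possible outcome distributions (dicts mapping outcome -> probability).
# The agent picks argmax_a min_p E_{o ~ p}[utility(a, o)].

def minimax_action(actions, models, utility):
    def worst_case(a):
        return min(
            sum(p[o] * utility(a, o) for o in p)
            for model in models
            for p in model
        )
    return max(actions, key=worst_case)

# Example: outcomes are "rain"/"sun"; one incomplete model constrains
# P(rain) only to lie in {0.6, 0.8}.
model = [{"rain": 0.6, "sun": 0.4}, {"rain": 0.8, "sun": 0.2}]
utility = lambda a, o: 1.0 if a == o else 0.0
act = minimax_action(["rain", "sun"], [model], utility)
print(act)  # "rain": guaranteed expected utility 0.6, vs 0.2 for "sun"
```

The worst-case guarantee is the analogue of "extracting at least as much utility as can be guaranteed for this model" in the optimality statement above.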

Uninfluenceable agents
post by Stuart Armstrong 39 days ago | Vadim Kosoy likes this | 7 comments

After explaining biased learning processes, we can now define influenceable (and uninfluenceable) learning processes.

Recall that the (unbiased) influence problem is due to agents randomising their preferences, as a sort of artificial ‘learning’ process, if the real learning process is slow or incomplete.

 Counterfactuals on POMDP discussion post by Stuart Armstrong 39 days ago | discuss
How to judge moral learning failure
post by Stuart Armstrong 39 days ago | 2 comments

I’m finding many different results that show problems and biases with the reward learning process.

But there is a meta problem, which is answering the question: “If the AI gets it wrong, how bad is it?”. It’s clear that there are some outcomes which might be slightly suboptimal – like the complete extinction of all intelligent life across the entire reachable universe. But it’s not so clear what to do if the error is smaller.

Biased reward learning
post by Stuart Armstrong 39 days ago | Ryan Carey likes this | discuss

What are the biggest failure modes of reward learning agents?

The first failure mode is when the agent directly (or indirectly) chooses its reward function.

Reward/value learning for reinforcement learning
post by Stuart Armstrong 47 days ago | 2 comments

Along with Jan Leike and Laurent Orseau, I’ve been working to formalise many of the issues with AIs learning human values.

I’ll be presenting part of this at NIPS and the whole of it at some later conference. Therefore it seems best to formulate the whole problem in the reinforcement learning formalism. The results can generally be easily reformulated for general systems (including expected utility).

 My recent posts discussion post by Paul Christiano 47 days ago | Ryan Carey, Jessica Taylor, Patrick LaVictoire, Stuart Armstrong and Tsvi Benson-Tilsen like this | discuss
post by Jessica Taylor 49 days ago | Ryan Carey, Patrick LaVictoire and Paul Christiano like this | discuss

Summary: in approximating a scheme like HCH, we would like some notion of “the best the prediction can be given available AI capabilities”. There’s a natural notion of “the best prediction of a human we should expect to get”. In general this doesn’t yield good predictions of HCH, but it does yield an HCH-like computation model that seems useful.

 ALBA requires incremental design of good long-term memory systems discussion post by Jessica Taylor 49 days ago | Ryan Carey likes this | discuss
 An algorithm with preferences: from zero to one variable discussion post by Stuart Armstrong 61 days ago | Ryan Carey, Jessica Taylor and Patrick LaVictoire like this | discuss
Online Learning 3: Adversarial bandit learning with catastrophes
post by Ryan Carey 62 days ago | Vadim Kosoy and Patrick LaVictoire like this | discuss

Note: This describes an idea of Jessica Taylor’s.

In order to better understand how machine learning systems might avoid catastrophic behavior, we are interested in modeling this as an adversarial learning problem.
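For readers unfamiliar with the setting, a standard baseline for adversarial bandits is the Exp3 family; the sketch below is a simplified Exp3 learner (assumed here as illustration, not the algorithm from the post), which keeps low regret against the best fixed arm even when rewards are chosen adversarially.

```python
import math
import random

def exp3(n_arms, rewards, eta, rng):
    """Simplified Exp3: rewards is a list over rounds of per-arm reward
    vectors in [0, 1]; only the pulled arm's reward is observed."""
    weights = [1.0] * n_arms
    total_reward = 0.0
    for r in rewards:
        s = sum(weights)
        probs = [w / s for w in weights]
        # Sample an arm from the current exponential-weights distribution.
        arm = rng.choices(range(n_arms), weights=probs)[0]
        total_reward += r[arm]
        # Importance-weighted estimate of the unobserved reward vector.
        est = r[arm] / probs[arm]
        weights[arm] *= math.exp(eta * est)
    return total_reward

rng = random.Random(0)
T = 500
rewards = [[0.0, 1.0] for _ in range(T)]  # arm 1 dominates every round
total = exp3(2, rewards, eta=0.1, rng=rng)
```

The catastrophe-aware variants discussed in this series modify such learners so that rare, high-cost events are weighted appropriately rather than averaged away.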

postCDT: Decision Theory using post-selected Bayes nets
post by Scott Garrabrant 70 days ago | Ryan Carey, Patrick LaVictoire and Paul Christiano like this | 1 comment

The purpose of this post is to document a minor idea about a new type of decision theory that works using a Bayes net. This is not a concrete proposal, since I will give no insight on which Bayes net to use. I am not that excited by this proposal, but think it is worth writing up anyway.

Updatelessness and Son of X
post by Scott Garrabrant 72 days ago | Ryan Carey, Abram Demski and Jessica Taylor like this | 8 comments

The purpose of this post is to discuss the relationship between the concepts of Updatelessness and the “Son of” operator.

 A failed attempt at Updatelessness using Universal Inductors discussion post by Scott Garrabrant 73 days ago | Jessica Taylor and Patrick LaVictoire like this | 1 comment
Vector-Valued Reinforcement Learning
post by Patrick LaVictoire 76 days ago | Ryan Carey and Jessica Taylor like this | 1 comment

In order to study algorithms that can modify their own reward functions, we can define vector-valued versions of reinforcement learning concepts.

Imagine that there are several different goods that we could care about; then a utility function is represented by a preference vector $$\vec\theta$$. Furthermore, if it is possible for the agent (or the environment or other agents) to modify $$\vec \theta$$, then we will want to index them by the timestep.

Consider an agent that can take actions, some of which affect its own reward function. This agent would (and should) wirehead if it attempts to maximize the discounted rewards as calculated by its future selves; i.e. at timestep $$n$$ it would choose actions to maximize \begin{eqnarray} U_n = \sum_{k\geq n} \gamma_k \vec{x}_k\cdot\vec{\theta}_k\end{eqnarray}

where $$\vec x_k$$ is the vector of goods gained at time $$k$$, $$\vec \theta_k$$ is the preference vector at timestep $$k$$, and $$\gamma_k$$ is the time discount factor at time $$k$$. (We will often use the case of an exponential discount $$\gamma^k$$ for $$0<\gamma<1$$.)

However, we might instead maximize the value of tomorrow’s actions in light of today’s reward function, \begin{eqnarray} V_n = \sum_{k\geq n} \gamma_k\vec{x}_k\cdot\vec{\theta}_{n} \end{eqnarray}

(the only difference being $$\vec \theta_n$$ rather than $$\vec \theta_k$$). Genuinely maximizing this should lead to more stable goals; concretely, we can consider environments that can offer “bribes” to self-modify, and a learner maximizing $$U_n$$ would generally accept such bribes, while a learner maximizing $$V_n$$ would be cautious about doing so.
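The $$U_n$$ versus $$V_n$$ contrast can be made concrete with a small numerical sketch (the goods, preference vectors, and bribe sizes below are assumed for illustration, not taken from the post): an environment offers extra goods in exchange for the agent replacing its preference vector.

```python
gamma = 0.9  # exponential discount

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def U(xs, thetas):
    """Discounted return with each step scored by that step's own preferences
    (the wireheading-prone objective U_n)."""
    return sum(gamma**k * dot(x, th) for k, (x, th) in enumerate(zip(xs, thetas)))

def V(xs, thetas):
    """Discounted return with every step scored by the initial preferences
    (the goal-stable objective V_n)."""
    return sum(gamma**k * dot(x, thetas[0]) for k, x in enumerate(xs))

theta0 = [1.0, 0.0]       # initially only good 0 matters
theta_bribe = [0.0, 1.0]  # preferences after accepting the bribe

# Refuse the bribe: keep theta0 and earn one unit of good 0 each step.
refuse_xs = [[1.0, 0.0]] * 3
refuse_thetas = [theta0] * 3
# Accept: preferences flip, and the environment pays double in good 1.
accept_xs = [[0.0, 2.0]] * 3
accept_thetas = [theta0, theta_bribe, theta_bribe]

# A U-maximizer takes the bribe; a V-maximizer refuses it.
print(U(accept_xs, accept_thetas) > U(refuse_xs, refuse_thetas))  # True
print(V(accept_xs, accept_thetas) > V(refuse_xs, refuse_thetas))  # False
```

The accepting trajectory scores higher under $$U_n$$ because the later terms are evaluated by the already-modified $$\vec\theta_k$$, while under $$V_n$$ those same goods are worthless to the original $$\vec\theta_n$$.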

So what do we see when we adapt existing RL algorithms to such problems? A distinction emerges between Q-learning and SARSA: Q-learning foolishly accepts bribes that SARSA passes on, and this seems to be the flip side of the concept of interruptibility!

Online Learning 2: Exploration-only bandit learning with catastrophes
post by Ryan Carey 78 days ago | 5 comments

Note: This describes an idea of Jessica Taylor’s.

The usual training procedures for machine learning models are not always well-equipped to avoid rare catastrophes. In order to maintain the safety of powerful AI systems, it will be important to have training procedures that can efficiently learn from such events. [1]

We can model this situation with the problem of exploration-only online bandit learning. We will show that if agents allocate more of their attention to risky inputs, they can more efficiently achieve a low regret on this problem.
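A minimal sketch of the attention-allocation idea (the setup and riskiness scores are assumed, not the post's construction): during a pure-exploration phase, sampling inputs in proportion to a prior riskiness score concentrates queries on the inputs likeliest to be catastrophic, so rare dangerous behavior is estimated from far more samples than uniform exploration would give it.

```python
import random

def explore(inputs, risk_score, is_catastrophe, budget, rng):
    """Sample inputs proportionally to risk_score and estimate each
    input's catastrophe rate from the queries it received."""
    weights = [risk_score(x) for x in inputs]
    total = sum(weights)
    counts = {x: 0 for x in inputs}
    hits = {x: 0 for x in inputs}
    for _ in range(budget):
        # Draw an input with probability proportional to its risk score.
        r, acc = rng.random() * total, 0.0
        for x, w in zip(inputs, weights):
            acc += w
            if r <= acc:
                break
        counts[x] += 1
        hits[x] += is_catastrophe(x)
    return {x: hits[x] / counts[x] if counts[x] else None for x in inputs}

rng = random.Random(0)
rates = explore(
    inputs=["safe", "risky"],
    risk_score=lambda x: 9.0 if x == "risky" else 1.0,  # assumed prior
    is_catastrophe=lambda x: rng.random() < (0.3 if x == "risky" else 0.0),
    budget=1000,
    rng=rng,
)
```

With this weighting, the "risky" input receives roughly 90% of the query budget, so its catastrophe rate is estimated with much lower variance than under uniform sampling at the same budget.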
