Reward/value learning for reinforcement learning
post by Stuart Armstrong 3 days ago | 2 comments

Along with Jan Leike and Laurent Orseau, I’ve been working to formalise many of the issues with AIs learning human values.

I’ll be presenting part of this at NIPS and the whole of it at some later conference. Therefore it seems best to formulate the whole problem in the reinforcement learning formalism. The results can generally be easily reformulated for general systems (including expected utility).

 My recent posts discussion post by Paul Christiano 4 days ago | Ryan Carey, Jessica Taylor and Tsvi Benson-Tilsen like this | discuss
post by Jessica Taylor 6 days ago | Ryan Carey, Patrick LaVictoire and Paul Christiano like this | discuss

Summary: in approximating a scheme like HCH, we would like some notion of “the best the prediction can be given available AI capabilities”. There’s a natural notion of “the best prediction of a human we should expect to get”. In general this doesn’t yield good predictions of HCH, but it does yield an HCH-like computation model that seems useful.

 ALBA requires incremental design of good long-term memory systems discussion post by Jessica Taylor 6 days ago | Ryan Carey likes this | discuss
 An algorithm with preferences: from zero to one variable discussion post by Stuart Armstrong 18 days ago | Ryan Carey, Jessica Taylor and Patrick LaVictoire like this | discuss
Online Learning 3: Adversarial bandit learning with catastrophes
post by Ryan Carey 19 days ago | discuss

Note: This describes an idea of Jessica Taylor’s.

In order to better understand how machine learning systems might avoid catastrophic behavior, we are interested in modeling this as an adversarial learning problem.

postCDT: Decision Theory using post-selected Bayes nets
post by Scott Garrabrant 27 days ago | Ryan Carey, Patrick LaVictoire and Paul Christiano like this | discuss

The purpose of this post is to document a minor idea about a new type of decision theory that works using a Bayes net. This is not a concrete proposal, since I will give no insight on which Bayes net to use. I am not that excited by this proposal, but think it is worth writing up anyway.

Updatelessness and Son of X
post by Scott Garrabrant 29 days ago | Ryan Carey, Abram Demski and Jessica Taylor like this | 7 comments

The purpose of this post is to discuss the relationship between the concepts of Updatelessness and the “Son of” operator.

 A failed attempt at Updatelessness using Universal Inductors discussion post by Scott Garrabrant 30 days ago | Jessica Taylor and Patrick LaVictoire like this | 1 comment
Vector-Valued Reinforcement Learning
post by Patrick LaVictoire 33 days ago | Ryan Carey and Jessica Taylor like this | 1 comment

In order to study algorithms that can modify their own reward functions, we can define vector-valued versions of reinforcement learning concepts.

Imagine that there are several different goods that we could care about; then a utility function is represented by a preference vector $$\vec\theta$$. Furthermore, if it is possible for the agent (or the environment or other agents) to modify $$\vec \theta$$, then we will want to index them by the timestep.

Consider an agent that can take actions, some of which affect its own reward function. This agent would (and should) wirehead if it attempts to maximize the discounted rewards as calculated by its future selves; i.e. at timestep $$n$$ it would choose actions to maximize \begin{eqnarray} U_n = \sum_{k\geq n} \gamma_k \vec{x}_k\cdot\vec{\theta}_k\end{eqnarray}

where $$\vec x_k$$ is the vector of goods gained at time $$k$$, $$\vec \theta_k$$ is the preference vector at timestep $$k$$, and $$\gamma_k$$ is the time discount factor at time $$k$$. (We will often use the case of an exponential discount $$\gamma^k$$ for $$0<\gamma<1$$.)

However, we might instead maximize the value of tomorrow’s actions in light of today’s reward function, \begin{eqnarray} V_n = \sum_{k\geq n} \gamma_k\vec{x}_k\cdot\vec{\theta}_{n} \end{eqnarray}

(the only difference being $$\vec \theta_n$$ rather than $$\vec \theta_k$$). Genuinely maximizing this should lead to more stable goals; concretely, we can consider environments that can offer “bribes” to self-modify, and a learner maximizing $$U_n$$ would generally accept such bribes, while a learner maximizing $$V_n$$ would be cautious about doing so.

So what do we see when we adapt existing RL algorithms to such problems? A distinction emerges between Q-learning and SARSA: Q-learning foolishly accepts bribes that SARSA passes on, and this seems to be the flip side of the concept of interruptibility!
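As a minimal sketch of the two objectives above (the trajectory of goods vectors, the drifting preference vectors, and the exponential discount are all my illustrative assumptions, not values from the post), $$U_n$$ and $$V_n$$ can be computed side by side:

```python
import numpy as np

# Illustrative trajectory: goods vectors x_k and preference vectors
# theta_k over three timesteps, with an exponential discount gamma^k.
gamma = 0.9
x = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([1.0, 1.0])]
theta = [np.array([1.0, 0.0]),   # original preferences
         np.array([0.5, 0.5]),   # after a partial self-modification
         np.array([0.0, 1.0])]   # after a further "bribe"

def U(n):
    # Discounted return as judged by each future self's preferences theta_k.
    return sum(gamma**k * x[k] @ theta[k] for k in range(n, len(x)))

def V(n):
    # Discounted return as judged by the current self's preferences theta_n.
    return sum(gamma**k * x[k] @ theta[n] for k in range(n, len(x)))

print(U(0))  # 2.71: the U-maximizer is content to let theta drift
print(V(0))  # 1.81: by today's preferences, the same trajectory scores lower
```

The gap between the two numbers is exactly the incentive to wirehead: a $$U_n$$-maximizer gains by letting $$\vec\theta$$ drift toward whatever its future actions happen to produce, while a $$V_n$$-maximizer evaluates the drifted trajectory by its current preferences and finds it worse.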

Online Learning 2: Exploration-only bandit learning with catastrophes
post by Ryan Carey 35 days ago | 5 comments

Note: This describes an idea of Jessica Taylor’s.

The usual training procedures for machine learning models are not always well-equipped to avoid rare catastrophes. In order to maintain the safety of powerful AI systems, it will be important to have training procedures that can efficiently learn from such events. [1]

We can model this situation with the problem of exploration-only online bandit learning. We will show that if agents allocate more of their attention to risky inputs, they can more efficiently achieve a low regret on this problem.
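One way to read the allocation idea is as a budget split: spend exploration queries in proportion to how risky each input looks, so potentially catastrophic inputs get checked more often. A hedged sketch (the risk scores and the proportional rule are my illustrative assumptions, not the post's algorithm):

```python
def allocate_exploration(budget, risk_score):
    """Split an exploration budget across inputs in proportion to their
    estimated riskiness (illustrative rule, not the post's algorithm)."""
    total = sum(risk_score.values())
    return {inp: int(budget * s / total) for inp, s in risk_score.items()}

# Two inputs believed safe, one believed risky (hypothetical scores).
plan = allocate_exploration(1000, {"safe_a": 0.1, "safe_b": 0.1, "risky": 0.8})
print(plan)  # {'safe_a': 100, 'safe_b': 100, 'risky': 800}
```

Under a uniform split each input would get about 333 pulls; weighting by risk concentrates 800 of the 1000 pulls on the input most likely to be catastrophic, which is the intuition behind achieving lower regret on the catastrophe-avoidance problem.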

 Counterfactual do-what-I-mean discussion post by Stuart Armstrong 37 days ago | 3 comments
 Training Garrabrant inductors to predict counterfactuals discussion post by Tsvi Benson-Tilsen 38 days ago | Jessica Taylor and Scott Garrabrant like this | discuss
 Desiderata for decision theory discussion post by Tsvi Benson-Tilsen 38 days ago | Jessica Taylor and Scott Garrabrant like this | 1 comment
 Transitive negotiations with counterfactual agents discussion post by Scott Garrabrant 44 days ago | Jessica Taylor, Patrick LaVictoire and Tsvi Benson-Tilsen like this | discuss
 Attacking the grain of truth problem using Bayes-Savage agents discussion post by Vadim Kosoy 44 days ago | Paul Christiano likes this | discuss
post by Ryan Carey 49 days ago | Vadim Kosoy, Nate Soares and Patrick LaVictoire like this | discuss

Note: This describes an idea of Jessica Taylor’s.

Control and security
post by Paul Christiano 49 days ago | Ryan Carey, Jessica Taylor and Vladimir Nesov like this | 7 comments

I used to think of AI security as largely unrelated to AI control, and my impression is that some people on this forum probably still do. I’ve recently shifted towards seeing control and security as basically the same, and thinking that security may often be a more appealing way to think and talk about control.

Online Learning 1: Bias-detecting online learners
post by Ryan Carey 58 days ago | Vadim Kosoy, Jessica Taylor, Nate Soares and Paul Christiano like this | 6 comments

Note: This describes an idea of Jessica Taylor’s, and is the first of several posts about aspects of online learning.

 Index of some decision theory posts discussion post by Tsvi Benson-Tilsen 58 days ago | Ryan Carey, Jack Gallagher, Jessica Taylor and Scott Garrabrant like this | discuss
Logical inductor limits are dense under pointwise convergence
post by Sam Eisenstat 59 days ago | Abram Demski, Patrick LaVictoire, Scott Garrabrant and Tsvi Benson-Tilsen like this | discuss

Logical inductors [1] are very complex objects, and even their limits are hard to get a handle on. In this post, I investigate the topological properties of the set of all limits of logical inductors.

The set of Logical Inductors is not Convex
post by Scott Garrabrant 68 days ago | Sam Eisenstat, Abram Demski and Patrick LaVictoire like this | 1 comment

Sam Eisenstat asked the following interesting question: Given two logical inductors over the same deductive process, is every (rational) convex combination of them also a logical inductor? Surprisingly, the answer is no! Here is my counterexample.

Logical Inductors contain Logical Inductors over other complexity classes
post by Scott Garrabrant 68 days ago | Jessica Taylor, Patrick LaVictoire and Tsvi Benson-Tilsen like this | discuss

In the Logical Induction paper, we give a definition of logical inductors over polynomial time traders. It is clear from our definition that the use of polynomial time is rather arbitrary, and we could define e.g. an exponential time logical inductor. However, it may be less clear that logical inductors over one complexity class actually contain logical inductors over other complexity classes within them.

 Learning doesn't solve philosophy of ethics discussion post by Stuart Armstrong 68 days ago | discuss
Model of human (ir)rationality
post by Stuart Armstrong 68 days ago | discuss

A putative new idea for AI control; index here.

This post is just an initial foray into modelling human irrationality, for the purpose of successful value learning. Its purpose is not to be a full model, but to have enough detail that various common situations can be successfully modelled. The important thing is to model humans in ways that humans can understand (as it’s our definition which determines what’s a bias and what’s a preference in humans).


### NEW DISCUSSION POSTS

Rewards and POMDP rather than
 by Stuart Armstrong on Reward/value learning for reinforcement learning | 2 likes

What are the main differences
 by Jessica Taylor on Reward/value learning for reinforcement learning | 0 likes

Nice! One thing that might be
 by Patrick LaVictoire on (Non-)Interruptibility of Sarsa(λ) and Q-Learning | 0 likes

The sentence ending the first
 by Abram Demski on Asymptotic Decision Theory | 0 likes

I agree with most of what you
 by Wei Dai on Desiderata for decision theory | 0 likes

Well, the time to take a
 by Vadim Kosoy on Updatelessness and Son of X | 0 likes

So my plan is to "solve" the
 by Scott Garrabrant on Updatelessness and Son of X | 0 likes

I may be misunderstanding
 by Wei Dai on Updatelessness and Son of X | 0 likes

>But we know that cooperation
 by Wei Dai on Updatelessness and Son of X | 1 like

This does seem to be the
 by Wei Dai on Updatelessness and Son of X | 3 likes

UDT, in its global policy
 by Vladimir Nesov on Updatelessness and Son of X | 1 like

From my perspective, the
 by Paul Christiano on A failed attempt at Updatelessness using Universal... | 0 likes

This is more or less what I
 by Vadim Kosoy on Updatelessness and Son of X | 0 likes

There is a decent-sized
 by Ryan Carey on Vector-Valued Reinforcement Learning | 0 likes

I still don't understand the
 by Jessica Taylor on Counterfactual do-what-I-mean | 0 likes