1.Quantilal control for finite MDPs
post by Vanessa Kosoy 443 days ago | Ryan Carey, Alex Appel and Abram Demski like this | discuss

We introduce a variant of the concept of a “quantilizer” for the setting of choosing a policy for a finite Markov decision process (MDP), where the generic unknown cost is replaced by an unknown penalty term in the reward function. This is essentially a generalization of quantilization in repeated games with a cost independence assumption. We show that the “quantilal” policy shares some properties with the ordinary optimal policy, namely that (i) it can always be chosen to be Markov (ii) it can be chosen to be stationary when time discount is geometric (iii) the “quantilum” value of an MDP with geometric time discount is a continuous piecewise rational function of the parameters, and it converges when the discount parameter $$\lambda$$ approaches 1. Finally, we demonstrate a polynomial-time algorithm for computing the quantilal policy, showing that quantilization is not qualitatively harder than ordinary optimization.

2.Humans can be assigned any values whatsoever...
post by Stuart Armstrong 612 days ago | Ryan Carey likes this | discuss

A putative new idea for AI control; index here. Crossposted at LessWrong 2.0. This post has nothing really new for this message board, but I’m posting it here because of the subsequent posts I’m intending to write.

Humans have no values… nor does any agent. Unless you make strong assumptions about their rationality. And depending on those assumptions, you get humans to have any values.

3.Autopoietic systems and difficulty of AGI alignment
post by Jessica Taylor 668 days ago | Ryan Carey, Owen Cotton-Barratt and Paul Christiano like this | 13 comments

I have recently come to the opinion that AGI alignment is probably extremely hard. But it’s not clear exactly what AGI or AGI alignment are. And there are some forms of alignment of “AI” systems that are easy. Here I operationalize “AGI” and “AGI alignment” in some different ways and evaluate their difficulties.

4.Current thoughts on Paul Christiano's research agenda
post by Jessica Taylor 700 days ago | Ryan Carey, Owen Cotton-Barratt, Sam Eisenstat, Paul Christiano, Stuart Armstrong and Wei Dai like this | 15 comments

This post summarizes my thoughts on Paul Christiano’s agenda in general and ALBA in particular.

5.AI safety: three human problems and one AI issue
post by Stuart Armstrong 759 days ago | Ryan Carey and Daniel Dewey like this | 2 comments

A putative new idea for AI control; index here.

There have been various attempts to classify the problems in AI safety research. These range from our old Oracle paper, which classified then-theoretical methods of control, to more recent classifications that grow out of modern, more concrete problems.

These all serve their purpose, but I think a more enlightening classification of the AI safety problems is to look at the issues we are trying to solve or avoid. And most of these issues are problems about humans.

6.Why I am not currently working on the AAMLS agenda
post by Jessica Taylor 765 days ago | Ryan Carey, Marcello Herreshoff, Sam Eisenstat, Abram Demski, Daniel Dewey, Scott Garrabrant and Stuart Armstrong like this | 2 comments

(note: this is not an official MIRI statement, this is a personal statement. I am not speaking for others who have been involved with the agenda.)

The AAMLS (Alignment for Advanced Machine Learning Systems) agenda is a project at MIRI that is about determining how to use hypothetical highly advanced machine learning systems safely. I was previously working on problems in this agenda and am currently not.

 7. Generalizing Foundations of Decision Theory discussion post by Abram Demski 841 days ago | Ryan Carey, Vanessa Kosoy, Jessica Taylor and Scott Garrabrant like this | 8 comments
 8. Maximally efficient agents will probably have an anti-daemon immune system discussion post by Jessica Taylor 844 days ago | Ryan Carey, Patrick LaVictoire and Scott Garrabrant like this | 1 comment
9.Emergency learning
post by Stuart Armstrong 870 days ago | Ryan Carey likes this | discuss

A putative new idea for AI control; index here.

Suppose that we knew that superintelligent AI was to be developed within six months. What would I do?

Well, drinking coffee by the barrel at MIRI’s emergency research retreat I’d… still probably spend a month looking at things from the meta level, and clarifying old ideas. But, assuming that didn’t reveal any new approaches, I’d try and get something like this working.

10.Thoughts on Quantilizers
post by Stuart Armstrong 870 days ago | Ryan Carey and Abram Demski like this | discuss

A putative new idea for AI control; index here.

This post will look at some of the properties of quantilizers, when they succeed and how they might fail.

Roughly speaking, let $$f$$ be some true objective function that we want to maximise. We haven’t been able to specify it fully, so we have instead a proxy function $$g$$. There is a cost function $$c=f-g$$ which measures how much $$g$$ falls short of $$f$$. Then a quantilizer will choose actions (or policies) randomly from the top $$n\%$$ of actions available, ranking those actions according to $$g$$.
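The selection rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the action set is a finite list, "randomly" is taken to mean uniformly at random, and the names `quantilize`, `proxy` and `top_fraction` are hypothetical, not taken from the post.

```python
import random

def quantilize(actions, proxy, top_fraction, rng=random):
    """Pick one action uniformly at random from the top `top_fraction`
    of `actions`, ranked by the proxy objective g (here `proxy`).

    Assumptions (not from the post): finite action list, uniform sampling.
    """
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    # Rank actions from best to worst according to the proxy g.
    ranked = sorted(actions, key=proxy, reverse=True)
    # Keep at least one action so the choice is always defined.
    cutoff = max(1, int(len(ranked) * top_fraction))
    return rng.choice(ranked[:cutoff])

# Example: ten actions scored by their own value, top 20% kept.
rng = random.Random(0)
choice = quantilize(list(range(10)), proxy=lambda a: a, top_fraction=0.2, rng=rng)
```

With `top_fraction` close to 0 this behaves like an ordinary maximiser of $$g$$, and with `top_fraction = 1` it ignores $$g$$ entirely and samples from all actions.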

11.On motivations for MIRI's highly reliable agent design research
post by Jessica Taylor 874 days ago | Ryan Carey, Sam Eisenstat, Daniel Dewey, Nate Soares, Patrick LaVictoire, Paul Christiano, Tsvi Benson-Tilsen and Vladimir Nesov like this | 10 comments

(this post came out of a conversation between me and Owen Cotton-Barratt, plus a follow-up conversation with Nate)

 12. Open problem: thin logical priors discussion post by Tsvi Benson-Tilsen 886 days ago | Ryan Carey, Jessica Taylor, Patrick LaVictoire and Scott Garrabrant like this | 2 comments
13.My current take on the Paul-MIRI disagreement on alignability of messy AI
post by Jessica Taylor 905 days ago | Ryan Carey, Vanessa Kosoy, Daniel Dewey, Patrick LaVictoire, Scott Garrabrant and Stuart Armstrong like this | 40 comments

Paul Christiano and “MIRI” have disagreed on an important research question for a long time: should we focus research on aligning “messy” AGI (e.g. one found through gradient descent or brute force search) with human values, or on developing “principled” AGI (based on theories similar to Bayesian probability theory)? I’m going to present my current model of this disagreement and additional thoughts about it.

14.Rigged reward learning
post by Stuart Armstrong 922 days ago | Ryan Carey likes this | discuss

A putative new idea for AI control; index here.

NOTE: What used to be called ‘bias’ is now called ‘rigging’, because ‘bias’ is very overloaded. The post has not yet been updated with the new terminology, however.

What are the biggest failure modes of reward learning agents?

The first failure mode is when the agent directly (or indirectly) chooses its reward function.

 15. The universal prior is malign link by Paul Christiano 928 days ago | Ryan Carey, Vanessa Kosoy, Jessica Taylor and Patrick LaVictoire like this | 4 comments
 16. My recent posts discussion post by Paul Christiano 930 days ago | Ryan Carey, Jessica Taylor, Patrick LaVictoire, Stuart Armstrong and Tsvi Benson-Tilsen like this | discuss
post by Jessica Taylor 931 days ago | Ryan Carey, Patrick LaVictoire and Paul Christiano like this | 1 comment

Summary: in approximating a scheme like HCH, we would like some notion of “the best the prediction can be given available AI capabilities”. There’s a natural notion of “the best prediction of a human we should expect to get”. In general this doesn’t yield good predictions of HCH, but it does yield an HCH-like computation model that seems useful.

 18. ALBA requires incremental design of good long-term memory systems discussion post by Jessica Taylor 931 days ago | Ryan Carey likes this | 1 comment
 19. An algorithm with preferences: from zero to one variable discussion post by Stuart Armstrong 944 days ago | Ryan Carey, Jessica Taylor and Patrick LaVictoire like this | discuss
20.postCDT: Decision Theory using post-selected Bayes nets
post by Scott Garrabrant 952 days ago | Ryan Carey, Patrick LaVictoire and Paul Christiano like this | 1 comment

The purpose of this post is to document a minor idea about a new type of decision theory that works using a Bayes net. This is not a concrete proposal, since I will give no insight on which Bayes net to use. I am not that excited by this proposal, but think it is worth writing up anyway.

21.Updatelessness and Son of X
post by Scott Garrabrant 954 days ago | Ryan Carey, Abram Demski and Jessica Taylor like this | 8 comments

The purpose of this post is to discuss the relationship between the concepts of Updatelessness and the “Son of” operator.

22.Vector-Valued Reinforcement Learning
post by Patrick LaVictoire 958 days ago | Ryan Carey and Jessica Taylor like this | 1 comment

In order to study algorithms that can modify their own reward functions, we can define vector-valued versions of reinforcement learning concepts.

Imagine that there are several different goods that we could care about; then a utility function is represented by a preference vector $$\vec\theta$$. Furthermore, if it is possible for the agent (or the environment or other agents) to modify $$\vec \theta$$, then we will want to index these preference vectors by the timestep.

Consider an agent that can take actions, some of which affect its own reward function. This agent would (and should) wirehead if it attempts to maximize the discounted rewards as calculated by its future selves; i.e. at timestep $$n$$ it would choose actions to maximize \begin{eqnarray} U_n = \sum_{k\geq n} \gamma_k \vec{x}_k\cdot\vec{\theta}_k\end{eqnarray}

where $$\vec x_k$$ is the vector of goods gained at time $$k$$, $$\vec \theta_k$$ is the preference vector at timestep $$k$$, and $$\gamma_k$$ is the time discount factor at time $$k$$. (We will often use the case of an exponential discount $$\gamma^k$$ for $$0<\gamma<1$$.)

However, we might instead maximize the value of tomorrow’s actions in light of today’s reward function, \begin{eqnarray} V_n = \sum_{k\geq n} \gamma_k\vec{x}_k\cdot\vec{\theta}_{n} \end{eqnarray}

(the only difference being $$\vec \theta_n$$ rather than $$\vec \theta_k$$). Genuinely maximizing this should lead to more stable goals; concretely, we can consider environments that can offer “bribes” to self-modify, and a learner maximizing $$U_n$$ would generally accept such bribes, while a learner maximizing $$V_n$$ would be cautious about doing so.
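To make the difference between $$U_n$$ and $$V_n$$ concrete, here is a small Python sketch. It assumes a finite horizon (the sums are truncated at the length of the trajectory), the exponential discount $$\gamma^k$$, and a made-up two-good “bribe” scenario; the function names and numbers are illustrative only.

```python
def dot(x, theta):
    """Inner product of a goods vector and a preference vector."""
    return sum(xi * ti for xi, ti in zip(x, theta))

def U(goods, thetas, gamma, n=0):
    """sum_{k>=n} gamma^k x_k . theta_k: each step judged by that step's preferences."""
    return sum(gamma**k * dot(goods[k], thetas[k]) for k in range(n, len(goods)))

def V(goods, thetas, gamma, n=0):
    """sum_{k>=n} gamma^k x_k . theta_n: every step judged by the preferences at step n."""
    return sum(gamma**k * dot(goods[k], thetas[n]) for k in range(n, len(goods)))

gamma = 0.5

# Refuse the bribe: keep caring about good 1 and keep collecting it.
refuse_goods = [(1, 0), (1, 0), (1, 0)]
refuse_thetas = [(1, 0), (1, 0), (1, 0)]

# Accept the bribe: preferences are rewritten to value good 2 highly,
# and from then on only good 2 is collected.
accept_goods = [(1, 0), (0, 1), (0, 1)]
accept_thetas = [(1, 0), (0, 10), (0, 10)]

u_refuse = U(refuse_goods, refuse_thetas, gamma)  # 1.75
u_accept = U(accept_goods, accept_thetas, gamma)  # 8.5
v_refuse = V(refuse_goods, refuse_thetas, gamma)  # 1.75
v_accept = V(accept_goods, accept_thetas, gamma)  # 1.0
```

In this toy case a $$U_0$$-maximizer prefers to accept the bribe (8.5 versus 1.75), while a $$V_0$$-maximizer prefers to refuse it (1.75 versus 1.0), because the goods collected after self-modification are worthless under today’s preference vector.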

So what do we see when we adapt existing RL algorithms to such problems? There’s then a distinction between Q-learning and SARSA, where Q-learning foolishly accepts bribes that SARSA passes on, and this seems to be the flip side of the concept of interruptibility!

23.Control and security
post by Paul Christiano 974 days ago | Ryan Carey, Jessica Taylor and Vladimir Nesov like this | 7 comments

I used to think of AI security as largely unrelated to AI control, and my impression is that some people on this forum probably still do. I’ve recently shifted towards seeing control and security as basically the same, and thinking that security may often be a more appealing way to think and talk about control.

 24. Index of some decision theory posts discussion post by Tsvi Benson-Tilsen 983 days ago | Ryan Carey, Jack Gallagher, Jessica Taylor and Scott Garrabrant like this | discuss
 25. The many counterfactuals of counterfactual mugging discussion post by Scott Garrabrant 1160 days ago | Ryan Carey and Tsvi Benson-Tilsen like this | 2 comments
