Indifference and compensatory rewards discussion post by Stuart Armstrong 4 days ago | discuss
 Are daemons a problem for ideal agents? discussion post by Jessica Taylor 9 days ago | 1 comment
Entangled Equilibria and the Twin Prisoners' Dilemma
post by Scott Garrabrant 9 days ago | Vadim Kosoy and Patrick LaVictoire like this | 2 comments

In this post, I present a generalization of Nash equilibria to non-CDT agents. I will use this formulation to model mutual cooperation in a twin prisoners’ dilemma, caused by the belief that the other player is similar to you, and not by mutual prediction. (This post came mostly out of a conversation with Sam Eisenstat, as well as contributions from Tsvi Benson-Tilsen and Jessica Taylor.)
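The core intuition can be sketched numerically: an agent that believes its opponent is its twin expects its own choice to be evidence about the twin's choice. The correlation model below (twin copies my action with probability p, otherwise plays the other action) and the payoff names R, S, T, P are my illustrative assumptions, not the post's formalism.

```python
def best_action(p, payoffs):
    """Best response for an agent that believes its twin plays the SAME
    action with probability p, and the opposite action otherwise.
    payoffs = (R, S, T, P): reward, sucker, temptation, punishment."""
    R, S, T, P = payoffs
    eu_coop = p * R + (1 - p) * S      # twin mirrors C with prob p
    eu_defect = p * P + (1 - p) * T    # twin mirrors D with prob p
    return "C" if eu_coop > eu_defect else "D"

# With the standard payoffs (3, 0, 5, 1), cooperation becomes the best
# response once p exceeds 5/7 ~ 0.714.
assert best_action(0.9, (3, 0, 5, 1)) == "C"
assert best_action(0.5, (3, 0, 5, 1)) == "D"
```

A CDT agent corresponds to p = 1/2 here (its action carries no information about the twin's), and it always defects; sufficiently high believed similarity flips the best response to cooperation without any mutual prediction.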

 How likely is a random AGI to be honest? discussion post by Jessica Taylor 10 days ago | 1 comment
 Minimizing Empowerment for Safety discussion post by David Krueger 11 days ago | 1 comment
True understanding comes from passing exams
post by Stuart Armstrong 13 days ago | 3 comments

I’ll try to clarify what I was doing with the AI truth setup in a previous post. First I’ll explain the nature of the challenge, and then how the setup tries to solve it.

The nature of the challenge is to have an AI give genuine understanding to a human. Getting the truth out of an AI or Oracle is not that hard, conceptually: you get the AI to report some formal property of its model. The problem is that this truth can be completely misleading or, more likely, incomprehensible.

 Does UDT *really* get counter-factually mugged? discussion post by David Krueger 15 days ago | 7 comments
 Learning Impact in RL discussion post by David Krueger 15 days ago | Daniel Dewey likes this | 5 comments
Humans as a truth channel
post by Stuart Armstrong 18 days ago | discuss

Defining truth and accuracy is tricky, so when I’ve proposed designs for things like Oracles, I’ve either used a very specific and formal question, or an indirect criterion for truth.

Here I’ll try to design a more direct system: one where the AI tells the human the truth about a question in a way the human genuinely understands.

 Hacking humans discussion post by Stuart Armstrong 18 days ago | discuss
 Censoring out-of-domain representations discussion post by Patrick LaVictoire 19 days ago | Jessica Taylor and Stuart Armstrong like this | 3 comments
Emergency learning
post by Stuart Armstrong 23 days ago | Ryan Carey likes this | discuss

Suppose we knew that superintelligent AI would be developed within six months. What would I do?

Well, drinking coffee by the barrel at MIRI’s emergency research retreat, I’d… still probably spend a month looking at things from the meta level and clarifying old ideas. But, assuming that didn’t reveal any new approaches, I’d try to get something like this working.

Thoughts on Quantilizers
post by Stuart Armstrong 23 days ago | Ryan Carey likes this | discuss

This post will look at some of the properties of quantilizers: when they succeed and how they might fail.

Roughly speaking, let $$f$$ be some true objective function that we want to maximise. We haven’t been able to specify it fully, so we have instead a proxy function $$g$$. There is a cost function $$c=f-g$$ which measures how much $$g$$ falls short of $$f$$. Then a quantilizer will choose actions (or policies) randomly from the top $$n\%$$ of actions available, ranking those actions according to $$g$$.
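The sampling rule above is simple to state in code. This is a minimal sketch over a finite action set; the names `quantilize`, `proxy`, and `q` are illustrative, not from the post.

```python
import random

def quantilize(actions, proxy, q=0.1):
    """Choose an action uniformly at random from the top fraction q of
    actions, ranked by the proxy utility g (here `proxy`)."""
    ranked = sorted(actions, key=proxy, reverse=True)
    k = max(1, int(len(ranked) * q))   # size of the top-q slice
    return random.choice(ranked[:k])

# Example: 10 actions scored by the identity proxy; with q=0.2 the
# quantilizer samples uniformly from the top 2 actions, i.e. 8 or 9.
choice = quantilize(list(range(10)), proxy=lambda a: a, q=0.2)
assert choice in (8, 9)
```

The point of randomising over the top $$n\%$$ rather than taking the argmax of $$g$$ is that the argmax is exactly where a misspecified proxy is most likely to diverge badly from $$f$$.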

The radioactive burrito and learning from positive examples
post by Stuart Armstrong 26 days ago | 2 comments

Jessica presented a system learning only from positive examples. Given examples of burritos, it computes a distribution $$b$$ over possible burritos. When it comes to creating its own burritos, however, it can only construct them from the feasible set $$f$$.
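One way to read this setup: the system restricts the learned distribution $$b$$ to the feasible set $$f$$ and renormalises before sampling. This is a minimal sketch under that reading; the function name, the dict representation of $$b$$, and the example burritos are my illustrative assumptions.

```python
import random

def sample_feasible(b, feasible):
    """Sample from distribution b (a dict mapping items to probabilities),
    restricted to the feasible set and renormalised."""
    restricted = {x: p for x, p in b.items() if x in feasible}
    if not restricted or sum(restricted.values()) == 0:
        raise ValueError("no feasible item has positive probability under b")
    items, weights = zip(*restricted.items())
    return random.choices(items, weights=weights)[0]

# Example: the learned distribution puts some mass outside the feasible
# set, but the constructed burrito always comes from inside it.
b = {"bean": 0.5, "beef": 0.3, "radioactive": 0.2}
assert sample_feasible(b, feasible={"bean", "beef"}) in {"bean", "beef"}
```

The interesting failure mode, which the post's title hints at, is what happens when the mass that $$b$$ assigns outside the feasible set is exactly the mass that was doing safety-relevant work.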

On motivations for MIRI's highly reliable agent design research
post by Jessica Taylor 27 days ago | Ryan Carey, Daniel Dewey, Nate Soares, Patrick LaVictoire, Paul Christiano, Tsvi Benson-Tilsen and Vladimir Nesov like this | 10 comments

(this post came out of a conversation between me and Owen Cotton-Barratt, plus a follow-up conversation with Nate)

Strategies for coalitions in unit-sum games
post by Jessica Taylor 28 days ago | Patrick LaVictoire and Stuart Armstrong like this | 3 comments

I’m going to formalize some ideas related to my previous post about pursuing convergent instrumental goals without good priors and prove theorems about how much power a coalition can guarantee. The upshot is that, while non-majority coalitions can’t guarantee controlling a non-negligible fraction of the expected power, majority coalitions can guarantee controlling a large fraction of the expected power.

 An impossibility result for doing without good priors discussion post by Jessica Taylor 31 days ago | Stuart Armstrong likes this | discuss
 Corrigibility thoughts III: manipulating versus deceiving discussion post by Stuart Armstrong 33 days ago | discuss
 Corrigibility thoughts II: the robot operator discussion post by Stuart Armstrong 33 days ago | 3 comments
Corrigibility thoughts I: caring about multiple things
post by Stuart Armstrong 33 days ago | discuss

This is the first of three articles about limitations and challenges in the concept of corrigibility (see articles 2 and 3).

The desiderata for corrigibility are:

 A note on misunderstanding the boundaries of models discussion post by Stuart Armstrong 33 days ago | discuss
 A measure-theoretic generalization of logical induction discussion post by Vadim Kosoy 35 days ago | Jessica Taylor and Scott Garrabrant like this | discuss
 Open problem: thin logical priors discussion post by Tsvi Benson-Tilsen 39 days ago | Ryan Carey, Jessica Taylor, Patrick LaVictoire and Scott Garrabrant like this | 2 comments
 Towards learning incomplete models using inner prediction markets discussion post by Vadim Kosoy 42 days ago | Jessica Taylor and Paul Christiano like this | 4 comments
 Subagent perfect minimax discussion post by Vadim Kosoy 44 days ago | discuss

### NEW DISCUSSION POSTS

Why wouldn't it work? The
 by Jessica Taylor on True understanding comes from passing exams | 0 likes

It would be weird if the
 by Jessica Taylor on Are daemons a problem for ideal agents? | 0 likes

The second AI doesn't get to
 by Stuart Armstrong on True understanding comes from passing exams | 0 likes

Fixed the $\varepsilon$,
 by Scott Garrabrant on Entangled Equilibria and the Twin Prisoners' Dilem... | 0 likes

I think you meant to divide
 by Vadim Kosoy on Entangled Equilibria and the Twin Prisoners' Dilem... | 0 likes

Yup, this isn't robust to
 by Patrick LaVictoire on Censoring out-of-domain representations | 0 likes

I don't think "honesty" is
 by Paul Christiano on How likely is a random AGI to be honest? | 2 likes

Discussed briefly in Concrete
 by Daniel Dewey on Minimizing Empowerment for Safety | 2 likes

I reason as follows: 1.
 by David Krueger on Does UDT *really* get counter-factually mugged? | 1 like

I agree... if there are
 by David Krueger on Censoring out-of-domain representations | 0 likes

Game-aligned agents aren't
 by Vladimir Nesov on Does UDT *really* get counter-factually mugged? | 0 likes

The issue in the OP is that
 by Vladimir Nesov on Does UDT *really* get counter-factually mugged? | 0 likes

This seems only loosely
 by David Krueger on Does UDT *really* get counter-factually mugged? | 0 likes

OK that makes sense, thanks.
 by David Krueger on Does UDT *really* get counter-factually mugged? | 0 likes

It's not the same (but
 by David Krueger on Learning Impact in RL | 0 likes