Intelligent Agent Foundations Forum
Indifference and compensatory rewards
discussion post by Stuart Armstrong 4 days ago | discuss
Are daemons a problem for ideal agents?
discussion post by Jessica Taylor 9 days ago | 1 comment
Entangled Equilibria and the Twin Prisoners' Dilemma
post by Scott Garrabrant 9 days ago | Vadim Kosoy and Patrick LaVictoire like this | 2 comments

In this post, I present a generalization of Nash equilibria to non-CDT agents. I will use this formulation to model mutual cooperation in a twin prisoners’ dilemma, caused by the belief that the other player is similar to you, and not by mutual prediction. (This post came mostly out of a conversation with Sam Eisenstat, with contributions from Tsvi Benson-Tilsen and Jessica Taylor.)
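
As a rough numerical illustration of that mechanism (this is the standard expected-utility calculation for a twin prisoners’ dilemma under a similarity belief, not the entangled-equilibria formalism of the post; the payoff values are assumed):

```python
# Assumed standard prisoners' dilemma payoffs (not taken from the post):
# temptation 5, mutual cooperation 3, mutual defection 1, sucker 0.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def expected_utility(my_move, p_same):
    """My expected payoff if I believe my twin plays my move with
    probability p_same and the opposite move otherwise."""
    other = "D" if my_move == "C" else "C"
    return (p_same * PAYOFF[(my_move, my_move)]
            + (1 - p_same) * PAYOFF[(my_move, other)])

# With these payoffs, cooperating beats defecting once p_same > 5/7:
for p in (0.5, 0.75):
    print(p, expected_utility("C", p), expected_utility("D", p))
```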

continue reading »
How likely is a random AGI to be honest?
discussion post by Jessica Taylor 10 days ago | 1 comment
Minimizing Empowerment for Safety
discussion post by David Krueger 11 days ago | 1 comment
True understanding comes from passing exams
post by Stuart Armstrong 13 days ago | 3 comments

I’ll try to clarify what I was doing with the AI truth setup in a previous post. First I’ll explain the nature of the challenge, and then how the setup tries to solve it.

The nature of the challenge is to have an AI give genuine understanding to a human. Getting the truth out of an AI or Oracle is not that hard, conceptually: you get the AI to report some formal property of its model. The problem is that this truth can be completely misleading or, more likely, incomprehensible.

continue reading »
Does UDT *really* get counter-factually mugged?
discussion post by David Krueger 15 days ago | 7 comments
Learning Impact in RL
discussion post by David Krueger 15 days ago | Daniel Dewey likes this | 5 comments
Humans as a truth channel
post by Stuart Armstrong 18 days ago | discuss

Defining truth and accuracy is tricky, so when I’ve proposed designs for things like Oracles, I’ve either used a very specific and formal question, or an indirect criterion for truth.

Here I’ll try to build a more direct system in which an AI tells the human the truth about a question, in a way the human understands.

continue reading »
Hacking humans
discussion post by Stuart Armstrong 18 days ago | discuss
Censoring out-of-domain representations
discussion post by Patrick LaVictoire 19 days ago | Jessica Taylor and Stuart Armstrong like this | 3 comments
Emergency learning
post by Stuart Armstrong 23 days ago | Ryan Carey likes this | discuss

Suppose we knew that superintelligent AI would be developed within six months. What would I do?

Well, drinking coffee by the barrel at MIRI’s emergency research retreat, I’d… still probably spend a month looking at things from the meta level and clarifying old ideas. But, assuming that didn’t reveal any new approaches, I’d try to get something like this working.

continue reading »
Thoughts on Quantilizers
post by Stuart Armstrong 23 days ago | Ryan Carey likes this | discuss

This post will look at some of the properties of quantilizers, when they succeed and how they might fail.

Roughly speaking, let \(f\) be some true objective function that we want to maximise. We haven’t been able to specify it fully, so we have instead a proxy function \(g\). There is a cost function \(c=f-g\) which measures how much \(g\) falls short of \(f\). Then a quantilizer will choose actions (or policies) randomly from the top \(n\%\) of actions available, ranking those actions according to \(g\).
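
A minimal Python sketch of that selection rule, under assumptions of my own (a finite action list, a callable proxy, and a quantile fraction `q`; none of these names come from the post):

```python
import random

def quantilize(actions, proxy_g, q=0.01):
    """Sample uniformly from the top fraction q of actions,
    ranked by the proxy objective g."""
    ranked = sorted(actions, key=proxy_g, reverse=True)
    k = max(1, int(len(ranked) * q))  # size of the top-q slice
    return random.choice(ranked[:k])

# Illustrative usage with a toy proxy that peaks at 42:
actions = list(range(100))
print(quantilize(actions, lambda a: -abs(a - 42), q=0.05))
```

The point of sampling rather than maximising is that a single action on which \(g\) wildly overestimates \(f\) only gets picked with small probability, instead of with certainty.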

continue reading »
The radioactive burrito and learning from positive examples
post by Stuart Armstrong 26 days ago | 2 comments

Jessica presented a system learning only from positive examples. Given examples of burritos, it computes a distribution \(b\) over possible burritos. When it comes to creating its own burritos, however, it can only construct them from the feasible set \(f\).
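
One natural reading of the construction step, as a hedged sketch (the representations here, a dict for \(b\) and a set for the feasible set, are assumptions for illustration, not the post’s formalism):

```python
import random

def make_burrito(b, feasible):
    """Sample a burrito from the learned distribution b,
    restricted and renormalized to the feasible set."""
    restricted = {x: p for x, p in b.items() if x in feasible}
    items = list(restricted)
    weights = [restricted[x] for x in items]
    return random.choices(items, weights=weights, k=1)[0]
```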

continue reading »
On motivations for MIRI's highly reliable agent design research
post by Jessica Taylor 27 days ago | Ryan Carey, Daniel Dewey, Nate Soares, Patrick LaVictoire, Paul Christiano, Tsvi Benson-Tilsen and Vladimir Nesov like this | 10 comments

(This post came out of a conversation between me and Owen Cotton-Barratt, plus a follow-up conversation with Nate.)

continue reading »
Strategies for coalitions in unit-sum games
post by Jessica Taylor 28 days ago | Patrick LaVictoire and Stuart Armstrong like this | 3 comments

I’m going to formalize some ideas related to my previous post about pursuing convergent instrumental goals without good priors and prove theorems about how much power a coalition can guarantee. The upshot is that, while non-majority coalitions can’t guarantee controlling a non-negligible fraction of the expected power, majority coalitions can guarantee controlling a large fraction of the expected power.

continue reading »
An impossibility result for doing without good priors
discussion post by Jessica Taylor 31 days ago | Stuart Armstrong likes this | discuss
Corrigibility thoughts III: manipulating versus deceiving
discussion post by Stuart Armstrong 33 days ago | discuss
Corrigibility thoughts II: the robot operator
discussion post by Stuart Armstrong 33 days ago | 3 comments
Corrigibility thoughts I: caring about multiple things
post by Stuart Armstrong 33 days ago | discuss

This is the first of three articles about limitations and challenges in the concept of corrigibility (see articles 2 and 3).

The desiderata for corrigibility are:

continue reading »
A note on misunderstanding the boundaries of models
discussion post by Stuart Armstrong 33 days ago | discuss
A measure-theoretic generalization of logical induction
discussion post by Vadim Kosoy 35 days ago | Jessica Taylor and Scott Garrabrant like this | discuss
Open problem: thin logical priors
discussion post by Tsvi Benson-Tilsen 39 days ago | Ryan Carey, Jessica Taylor, Patrick LaVictoire and Scott Garrabrant like this | 2 comments
Towards learning incomplete models using inner prediction markets
discussion post by Vadim Kosoy 42 days ago | Jessica Taylor and Paul Christiano like this | 4 comments
Subagent perfect minimax
discussion post by Vadim Kosoy 44 days ago | discuss