1. AI safety: three human problems and one AI issue
post by Stuart Armstrong 787 days ago | Ryan Carey and Daniel Dewey like this | 2 comments

A putative new idea for AI control; index here.

There have been various attempts to classify the problems in AI safety research, from our old Oracle paper, which classified then-theoretical methods of control, to more recent classifications that grow out of modern, more concrete problems.

These all serve their purpose, but I think a more enlightening way to classify the AI safety problems is to look at the issues we are trying to solve or avoid. And most of these issues are problems about humans.

2. Why I am not currently working on the AAMLS agenda
post by Jessica Taylor 794 days ago | Ryan Carey, Marcello Herreshoff, Sam Eisenstat, Abram Demski, Daniel Dewey, Scott Garrabrant and Stuart Armstrong like this | 2 comments

(Note: this is not an official MIRI statement; it is a personal statement, and I am not speaking for others who have been involved with the agenda.)

The AAMLS (Alignment for Advanced Machine Learning Systems) agenda is a project at MIRI aimed at determining how to use hypothetical highly advanced machine learning systems safely. I was previously working on problems in this agenda but am not currently.

3. Learning Impact in RL
discussion post by David Krueger 891 days ago | Daniel Dewey likes this | 6 comments
4. On motivations for MIRI's highly reliable agent design research
post by Jessica Taylor 903 days ago | Ryan Carey, Sam Eisenstat, Daniel Dewey, Nate Soares, Patrick LaVictoire, Paul Christiano, Tsvi Benson-Tilsen and Vladimir Nesov like this | 10 comments

(this post came out of a conversation between me and Owen Cotton-Barratt, plus a follow-up conversation with Nate)

5. Pursuing convergent instrumental subgoals on the user's behalf doesn't always require good priors
discussion post by Jessica Taylor 928 days ago | Daniel Dewey, Paul Christiano and Stuart Armstrong like this | 9 comments
6. My current take on the Paul-MIRI disagreement on alignability of messy AI
post by Jessica Taylor 934 days ago | Ryan Carey, Vanessa Kosoy, Daniel Dewey, Patrick LaVictoire, Scott Garrabrant and Stuart Armstrong like this | 40 comments

Paul Christiano and “MIRI” have disagreed on an important research question for a long time: should we focus research on aligning “messy” AGI (e.g. one found through gradient descent or brute force search) with human values, or on developing “principled” AGI (based on theories similar to Bayesian probability theory)? I’m going to present my current model of this disagreement and additional thoughts about it.

7. Working on a series of safety environments for OpenAI gym. Would love comments and ideas.
link by Rafael Cosman 1129 days ago | Daniel Dewey, Jessica Taylor, Patrick LaVictoire and Tsvi Benson-Tilsen like this | discuss
8. The overfitting utility problem for value learning AIs
discussion post by Stuart Armstrong 1129 days ago | Abram Demski, Daniel Dewey and Patrick LaVictoire like this | discuss
9. What does it mean for correct operation to rely on transfer learning?
post by Jessica Taylor 1229 days ago | Daniel Dewey, Patrick LaVictoire, Paul Christiano and Stuart Armstrong like this | discuss

Summary: Some approaches to AI value alignment rely on transfer learning. I attempt to explain this idea more clearly.

10. Notes from a conversation on act-based and goal-directed systems
discussion post by Jessica Taylor 1244 days ago | Vanessa Kosoy, Daniel Dewey and Patrick LaVictoire like this | 50 comments
11. A complexity theoretic approach to logical uncertainty (Draft)
link by Vanessa Kosoy 1526 days ago | Benja Fallenstein, Daniel Dewey, Jessica Taylor, Nate Soares and Patrick LaVictoire like this | 2 comments
12. Optimal and Causal Counterfactual Worlds
post by Scott Garrabrant 1530 days ago | Sam Eisenstat, Abram Demski, Daniel Dewey, Nate Soares and Patrick LaVictoire like this | 3 comments

Let $$L$$ denote the language of Peano arithmetic. A (counterfactual) world $$W$$ is any subset of $$L$$. These worlds need not be consistent. Let $$\mathcal{W}$$ denote the set of all worlds. The actual world $$W_\mathbb{N}\in\mathcal{W}$$ is the world consisting of all sentences that are true about $$\mathbb{N}$$.
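A minimal illustration of these definitions (the particular sentences are my own choice, not taken from the post):

```latex
% A world is just a set of sentences of L; it need not be consistent.
% For instance, the two-element set below is a perfectly good world:
$$W = \{\, \ulcorner 0 = 0 \urcorner,\ \ulcorner 0 = 1 \urcorner \,\} \in \mathcal{W}$$
% It contains a contradiction, but that is allowed. The actual world,
% by contrast, contains exactly the sentences true of the naturals:
$$\ulcorner 0 = 0 \urcorner \in W_{\mathbb{N}}, \qquad \ulcorner 0 = 1 \urcorner \notin W_{\mathbb{N}}.$$
```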

13. High impact from low impact
discussion post by Stuart Armstrong 1550 days ago | Daniel Dewey likes this | 2 comments
14. A toy model of a corrigibility problem
link by Patrick LaVictoire 1576 days ago | Benja Fallenstein, Daniel Dewey, Jessica Taylor and Nate Soares like this | discuss
15. Un-manipulable counterfactuals
discussion post by Stuart Armstrong 1620 days ago | Daniel Dewey likes this | 5 comments
16. Non-manipulative oracles
discussion post by Stuart Armstrong 1620 days ago | Daniel Dewey and Nate Soares like this | 1 comment
17. Probabilistic Oracle Machines and Nash Equilibria
discussion post by Jessica Taylor 1621 days ago | Daniel Dewey, Nate Soares and Stuart Armstrong like this | discuss
18. Why conditioning on "the agent takes action a" isn't enough
post by Nate Soares 1642 days ago | Ryan Carey, Benja Fallenstein, Daniel Dewey, Jessica Taylor, Patrick LaVictoire and Stuart Armstrong like this | discuss

This post expands a bit on a point that I didn’t have enough space to make in the paper Toward Idealized Decision Theory.

19. Model-free decisions
post by Paul Christiano 1686 days ago | Daniel Dewey, Jessica Taylor, Nate Soares and Stuart Armstrong like this | 3 comments

Much concern about AI comes down to the scariness of goal-oriented behavior. A common response to such concerns is "why would we give an AI goals anyway?" I think there are good reasons to expect goal-oriented behavior, and I've been on that side of a lot of arguments. But I don't think the issue is settled, and it might be possible to get better outcomes by directly specifying what actions are good. I flesh out one possible alternative here. (As an experiment I wrote the post on Medium, so that it is easier to provide sentence-level feedback, especially feedback on writing or low-level comments. Big-picture discussion should probably stay here.)

20. Trustworthy automated philosophy?
post by Benja Fallenstein 1698 days ago | Daniel Dewey, Jessica Taylor and Nate Soares like this | 3 comments

Paul’s post, and subsequent discussions with Eliezer and Nate, have made me update significantly towards the hypothesis that the best way to get a positive intelligence explosion might be to (1) create a seed agent that’s human-level at doing mathematical philosophy, (2) have this agent improve a small number of times (like “tens” of significant improvements), thereby making it significantly smarter-than-human (like 10x or 100x according to some relevant scale), and (3) have the resulting agent invent and approve a decision-making framework reliable enough to undergo an intelligence explosion.

My first reaction was that it seemed probably too difficult to create an initial agent which is capable of doing the necessary philosophy and whose decision-making system is (knowably) sufficiently reliable that we can trust it with our future. Subsequent discussions have made me revise this, and I expect that as a result I’ll shift somewhat in what problems I’ll be working on, but I’m still rather worried about this, and think that it is probably the core of remaining disagreement with Paul in this area.

(I apologize for taking so long to reply! I’ve been finding it really hard to articulate my thoughts while having my assumptions jumbled around.)

21. Stable self-improvement as a research problem
post by Paul Christiano 1703 days ago | Abram Demski, Benja Fallenstein, Daniel Dewey, Nate Soares and Stuart Armstrong like this | 6 comments

“Stable self-improvement” seems to be a primary focus of MIRI’s work. As I understand it, the problem is “How do we build an agent which rationally pursues some goal, is willing to modify itself, and with very high probability continues to pursue the same goal after modification?”

The key difficulty is that it is impossible for an agent to formally “trust” its own reasoning, i.e. to believe that “anything that I believe is true.” Indeed, even the natural concept of “truth” is logically problematic. But without such a notion of trust, why should an agent even believe that its own continued existence is valuable?
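The impossibility alluded to here is Löb's theorem; stating it makes the difficulty concrete (notation mine, matching the standard provability-logic presentation):

```latex
% Löb's theorem: for any theory T (strong enough to formalize its own
% provability predicate Box_T) and any sentence A,
$$\text{if } T \vdash \Box_T A \rightarrow A, \text{ then } T \vdash A.$$
```

So a consistent agent reasoning in $$T$$ cannot endorse the trust schema $$\Box_T A \rightarrow A$$ ("anything I can prove is true") for every sentence $$A$$: by Löb's theorem it would thereby prove every sentence, including false ones.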

I agree that there are open philosophical questions concerning reasoning under logical uncertainty, and that reflective reasoning highlights some of the difficulties. But I am not yet convinced that stable self-improvement is an especially important problem; I think it would be handled correctly by a human-level reasoner as a special case of decision-making under logical uncertainty. This suggests that (1) it will probably be resolved en route to human-level AI, and (2) it can probably be "safely" delegated to a human-level AI. I would prefer for energy to be used on other aspects of the AI safety problem.

22. Exploiting EDT
post by Benja Fallenstein 1710 days ago | Ryan Carey, Abram Demski, Daniel Dewey, Nate Soares, Patrick LaVictoire and Paul Christiano like this | 9 comments

The problem with EDT is, as David Lewis put it, its "irrational policy of managing the news" (Lewis, 1981): it chooses actions not only because of their effects on the world, but also because of what the fact that it's taking these actions tells it about events the agent can't affect at all. The canonical example is the smoking lesion problem.

I’ve long been uncomfortable with the smoking lesion problem as the case against EDT, because an AI system would know its own utility function, and would therefore know whether or not it values “smoking” (presumably in the AI case it would be a different goal), and if it updates on this fact it would behave correctly in the smoking lesion. (This is an AI-centric version of the “tickle defense” of EDT.) Nate and I have come up with a variant I find much more convincing: a way to get EDT agents to pay you for managing the news for them, which works by the same mechanism that makes these agents one-box in Newcomb’s problem. (It’s a variation of the thought experiment in my LessWrong post on “the sin of updating when you can change whether you exist”.)
