Causal graphs and counterfactuals discussion post by Stuart Armstrong 55 minutes ago | discuss
Simplified explanation of stratification
post by Stuart Armstrong 4 hours ago | discuss

A putative new idea for AI control; index here.

I’ve previously talked about stratified indifference/learning. In this short post, I’ll try to present the idea as simply and clearly as possible.

 Corrigibility through stratified indifference and learning discussion post by Stuart Armstrong 11 days ago | 9 comments
Modeling the capabilities of advanced AI systems as episodic reinforcement learning
post by Jessica Taylor 11 days ago | discuss

Here I’ll summarize the main abstraction I use for thinking about future AI systems. This is essentially the same model that Paul uses. I’m not actually introducing any new ideas in this post; mostly this is intended to summarize my current views.

 Can we hybridize Absent-Minded Driver with Death in Damascus? discussion post by Eliezer Yudkowsky 28 days ago | Patrick LaVictoire likes this | 1 comment
Learning (meta-)preferences
post by Stuart Armstrong 34 days ago | Patrick LaVictoire likes this | 2 comments

A putative new idea for AI control; index here.

There are various methods, such as Cooperative Inverse Reinforcement Learning (CIRL), that aim to have an AI deduce human preferences in some fashion.

The problem is that humans are not rational - citation certainly not needed. But, worse than that, they are not rational in ways that seriously complicate the task of fitting a reward or utility function to them. I presented one problem this entails in a previous post, which discussed the problems that emerge when an AI can influence a human’s preferences through the way it presents the issues.

What does an imperfect agent want?
post by Stuart Armstrong 34 days ago | Patrick LaVictoire likes this | discuss

A putative new idea for AI control; index here.

I’ll roughly divide ways of establishing human preferences into four categories:

1. Assume true
2. Best fit
3. Proxy measures
4. Modelled irrationality
Three Oracle designs
post by Stuart Armstrong 41 days ago | Patrick LaVictoire likes this | discuss

A putative new idea for AI control; index here.

An initial draft looking at three ways of getting information out of Oracles, information that’s useful and safe - in theory.

One thing I may need to do is find slightly better names for them ^_^

Good and safe uses of AI Oracles

Abstract:

Abstract model of human bias
post by Stuart Armstrong 55 days ago | 5 comments

A putative new idea for AI control; index here.

Any suggestions for refining this model are welcome!

Somewhat inspired by the previous post, this is a model of human bias that can be used to test theories that want to compute the “true” human preferences. The basic idea is to formalise the question:

• If the AI can make the human give any answer to any question, can it figure out what humans really want?
When the AI closes a door, it opens a window
post by Stuart Armstrong 55 days ago | discuss

A putative new idea for AI control; index here.

Some methods, such as Cooperative Inverse Reinforcement Learning, have the AI assume that humans have access to a true reward function, that the AI will then attempt to maximise. This post is an attempt to clarify a specific potential problem with these methods; it is related to the third problem described here, but hopefully makes it clearer.

 Generative adversarial models, informed by arguments discussion post by Jessica Taylor 63 days ago | discuss
 Simpler, cruder, virtual world AIs discussion post by Stuart Armstrong 65 days ago | Patrick LaVictoire likes this | discuss
 Questioning GLS-Coherence discussion post by Abram Demski 72 days ago | discuss
 Cooperative Inverse Reinforcement Learning vs. Irrational Human Preferences discussion post by Patrick LaVictoire 73 days ago | Jessica Taylor and Stuart Armstrong like this | discuss
Guarded learning
post by Stuart Armstrong 73 days ago | discuss

A putative new idea for AI control; index here.

“Guarded learning” is a model for unbiased learning, the kind of learning where the AI has an incentive to learn its values, but not to bias the direction in which it learns.

AIs in virtual worlds: discounted mixed utility/reward
post by Stuart Armstrong 74 days ago | discuss

A putative new idea for AI control; index here.

In a previous post on AIs in virtual worlds, I described the idea of a utility function that motivates the AI to operate in a virtual world with certain goals in mind, but to shut down immediately if it detects that the outside world – us – is having an impact in the virtual world.

This is one way to implement such a goal, given certain restrictions on the AI’s in-world utility. The restrictions are more natural for rewards than for utilities, so they will be phrased in those terms.

 General Cooperative Inverse RL Convergence discussion post by Jan Leike 74 days ago | Jessica Taylor likes this | discuss
 Conservation of Expected Ethics isn't enough discussion post by Stuart Armstrong 77 days ago | Jessica Taylor likes this | discuss
Learning values versus indifference
post by Stuart Armstrong 78 days ago | Patrick LaVictoire likes this | 3 comments

A putative new idea for AI control; index here.

Corrigibility should allow safe value or policy change. Indifference allows the agent to accept changes without objecting. However, an indifferent agent is similarly indifferent to the learning process.

Classical uncertainty over values has the opposite problem: the AI is motivated to learn more about its values (and preserve the learning process) BUT is also motivated to manipulate its values.

Both these effects can be illustrated on a single graph. Assume that the AI follows utility $$U$$, is uncertain between utilities $$v$$ and $$w$$, and has a probability $$p$$ that $$U=v$$.
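
As a minimal worked form of this setup (an illustrative assumption; the excerpt itself does not state the formula, and the policy symbol $$\pi$$ is introduced here only for illustration): under classical uncertainty, the AI would evaluate a policy $$\pi$$ by the mixture

$$\mathbb{E}[U \mid \pi] = p\,\mathbb{E}[v \mid \pi] + (1-p)\,\mathbb{E}[w \mid \pi].$$

If $$\pi$$ can also change $$p$$, the agent gains an incentive to push $$p$$ toward whichever of $$v$$ or $$w$$ is easier to score highly on, which is the manipulation problem described above. An indifferent agent has no such incentive, but equally no incentive to make $$p$$ more accurate.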

 The overfitting utility problem for value learning AIs discussion post by Stuart Armstrong 79 days ago | Abram Demski, Daniel Dewey and Patrick LaVictoire like this | discuss
In memoryless Cartesian environments, every UDT policy is a CDT+SIA policy
post by Jessica Taylor 80 days ago | Vadim Kosoy and Abram Demski like this | discuss

Summary: I define memoryless Cartesian environments (which can model many familiar decision problems), note the similarity to memoryless POMDPs, and define a local optimality condition for policies, which can be roughly stated as “the policy is consistent with maximizing expected utility using CDT and subjective probabilities derived from SIA”. I show that this local optimality condition is necessary but not sufficient for global optimality (UDT).

 Indifference utility functions discussion post by Stuart Armstrong 80 days ago | discuss
Confirmed Selective Oracle
post by Stuart Armstrong 80 days ago | discuss

A putative new idea for AI control; index here.

I originally came up with a method for – safely? – extracting high impact from low impact AIs.

But it just occurred to me that the idea could be used for standard Oracles, in a way that keeps them arguably safe, without needing to define truth and similar. So introducing:

The Confirmed Selective Oracle!

Cake or Death toy model for corrigibility
post by Stuart Armstrong 80 days ago | Patrick LaVictoire likes this | discuss

Let’s think of the classical Cake or Death problem from the point of view of corrigibility. The aim here is to construct a toy model sufficiently complex that it shows all the problems that derail classical value learning and corrigibility.

The alternate hypothesis for AIs in virtual worlds
post by Stuart Armstrong 80 days ago | discuss

A putative new idea for AI control; index here.

In the “AIs in virtual worlds” setup, the AI entertains two hypotheses: $$W$$, that it lives in a deterministic world which it knows about (including itself in the world), and $$W'$$, an alternate hypothesis that the world is “like” $$W$$ but with some stochastic effects that correspond to us messing up the world. If the AI ever believes enough in $$W'$$, it shuts down.
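
The shutdown rule described above can be sketched as a simple Bayesian update. This is a rough illustration, not the post's actual construction: the noise rate `EPS`, the prior, the threshold, and the function names are all hypothetical, and observations are reduced to whether they match the prediction of the deterministic world $$W$$.

```python
# Hypothetical parameters, chosen only for illustration.
EPS = 0.1          # assumed chance under W' that an observation deviates from W
PRIOR_W_PRIME = 0.01
THRESHOLD = 0.5    # assumed belief in W' at which the AI shuts down


def update_belief(belief_w_prime, matches_prediction):
    """Return P(W' | observation) by Bayes' rule.

    Under W (deterministic), a matching observation has likelihood 1 and a
    deviating one has likelihood 0. Under W', a deviation occurs with
    probability EPS.
    """
    like_w = 1.0 if matches_prediction else 0.0
    like_w_prime = (1.0 - EPS) if matches_prediction else EPS
    numerator = belief_w_prime * like_w_prime
    denominator = numerator + (1.0 - belief_w_prime) * like_w
    return numerator / denominator


def run(observations, prior=PRIOR_W_PRIME, threshold=THRESHOLD):
    """Process observations; return (shutdown_step or None, final belief in W')."""
    belief = prior
    for step, matches in enumerate(observations):
        belief = update_belief(belief, matches)
        if belief >= threshold:
            return step, belief  # the AI shuts down here
    return None, belief
```

In this toy version, every observation that matches $$W$$ slightly lowers the belief in $$W'$$, while a single deviation is impossible under $$W$$ and so drives the belief in $$W'$$ to 1, triggering shutdown.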
