Three Oracle designs
post by Stuart Armstrong 6 days ago | Patrick LaVictoire likes this | discuss

A putative new idea for AI control; index here.

An initial draft looking at three ways of getting information out of Oracles: information that is useful and safe, at least in theory.

One thing I may need to do is find slightly better names for them ^_^

Good and safe uses of AI Oracles

Abstract:

Abstract model of human bias
post by Stuart Armstrong 20 days ago | 5 comments

A putative new idea for AI control; index here.

Any suggestions for refining this model are welcome!

Somewhat inspired by the previous post, this is a model of human bias that can be used to test theories that want to compute the “true” human preferences. The basic idea is to formalise the question:

• If the AI can make the human give any answer to any question, can it figure out what humans really want?
When the AI closes a door, it opens a window
post by Stuart Armstrong 20 days ago | discuss

A putative new idea for AI control; index here.

Some methods, such as Cooperative Inverse Reinforcement Learning, have the AI assume that humans have access to a true reward function, that the AI will then attempt to maximise. This post is an attempt to clarify a specific potential problem with these methods; it is related to the third problem described here, but hopefully makes it clearer.

 Generative adversarial models, informed by arguments discussion post by Jessica Taylor 28 days ago | discuss
 Simpler, cruder, virtual world AIs discussion post by Stuart Armstrong 30 days ago | Patrick LaVictoire likes this | discuss
 Questioning GLS-Coherence discussion post by Abram Demski 37 days ago | discuss
 Cooperative Inverse Reinforcement Learning vs. Irrational Human Preferences discussion post by Patrick LaVictoire 38 days ago | Jessica Taylor and Stuart Armstrong like this | discuss
Guarded learning
post by Stuart Armstrong 38 days ago | discuss

A putative new idea for AI control; index here.

“Guarded learning” is a model for unbiased learning, the kind of learning where the AI has an incentive to learn its values, but not to bias the direction in which it learns.

AIs in virtual worlds: discounted mixed utility/reward
post by Stuart Armstrong 39 days ago | discuss

A putative new idea for AI control; index here.

In a previous post on AIs in virtual worlds, I described the idea of a utility function that motivates the AI to operate in a virtual world with certain goals in mind, but to shut down immediately if it detects that the outside world (that is, us) is having an impact on the virtual world.

This is one way to implement such a goal, given certain restrictions on the AI’s in-world utility. The restrictions are more natural for rewards rather than utilities, so they will be phrased in those terms.

 General Cooperative Inverse RL Convergence discussion post by Jan Leike 39 days ago | Jessica Taylor likes this | discuss
 Conservation of Expected Ethics isn't enough discussion post by Stuart Armstrong 42 days ago | Jessica Taylor likes this | discuss
Learning values versus indifference
post by Stuart Armstrong 43 days ago | Patrick LaVictoire likes this | 3 comments

A putative new idea for AI control; index here.

Corrigibility should allow safe value or policy change. Indifference allows the agent to accept changes without objecting. However, an indifferent agent is similarly indifferent to the learning process.

Classical uncertainty over values has the opposite problem: the AI is motivated to learn more about its values (and to preserve the learning process), but is also motivated to manipulate its values.

Both these effects can be illustrated on a single graph. Assume that the AI follows utility $$U$$, is uncertain between utilities $$v$$ and $$w$$, and assigns probability $$p$$ to $$U=v$$.
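The manipulation incentive in the classical-uncertainty case can be made concrete with a small sketch. This is my own illustration, not the post's model, and the payoff numbers are arbitrary assumptions: expected utility is the $$p$$-weighted mixture of $$v$$ and $$w$$, so an agent able to influence $$p$$ profits by pushing it toward whichever utility is easier to satisfy.

```python
# Toy sketch (illustrative numbers, not from the post): an agent
# uncertain between utility functions v and w, with P(U = v) = p.
# Expected utility is the p-weighted mixture, so an agent that can
# influence p is motivated to push it toward the utility it can
# satisfy more easily -- the manipulation incentive described above.

def expected_utility(p, v_payoff, w_payoff):
    """Expected utility of an outcome when U = v with probability p."""
    return p * v_payoff + (1 - p) * w_payoff

# Suppose the agent's best plan scores 10 under v but only 2 under w.
eu_honest = expected_utility(0.5, 10, 2)       # honest belief: 6.0

# If the agent can manipulate the evidence so that p rises to 0.9,
# its expected utility rises too -- classical value uncertainty
# rewards this, which is exactly the problem indifference avoids
# (at the cost of removing the incentive to learn at all).
eu_manipulated = expected_utility(0.9, 10, 2)  # manipulated: 9.2

print(eu_honest, eu_manipulated)
```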

 The overfitting utility problem for value learning AIs discussion post by Stuart Armstrong 44 days ago | Abram Demski, Daniel Dewey and Patrick LaVictoire like this | discuss
In memoryless Cartesian environments, every UDT policy is a CDT+SIA policy
post by Jessica Taylor 45 days ago | Vadim Kosoy and Abram Demski like this | discuss

Summary: I define memoryless Cartesian environments (which can model many familiar decision problems), note the similarity to memoryless POMDPs, and define a local optimality condition for policies, which can be roughly stated as “the policy is consistent with maximizing expected utility using CDT and subjective probabilities derived from SIA”. I show that this local optimality condition is necessary but not sufficient for global optimality (UDT).
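The post's formal setting isn't reproduced here, but the correspondence can be sketched on the classic absent-minded driver problem, a standard memoryless decision problem; the payoffs 0/4/1 below are the usual textbook choices, not taken from the post. The UDT-optimal policy turns out to satisfy exact CDT+SIA indifference between actions.

```python
# Sketch of the UDT => CDT+SIA direction on the absent-minded driver
# (standard payoffs, not from the post). The driver passes two
# indistinguishable intersections X then Y, continuing with
# probability p at each: exiting at X pays 0, exiting at Y pays 4,
# and driving past both pays 1.

def udt_value(p):
    """Ex-ante (UDT) expected utility of the policy 'continue w.p. p'."""
    return (1 - p) * 0 + p * (1 - p) * 4 + p * p * 1

# Globally (UDT-)optimal policy, found by grid search; analytically
# the maximum of 4p - 3p^2 is at p = 2/3.
p_star = max((i / 10000 for i in range(10001)), key=udt_value)

def cdt_sia_action_values(p):
    """CDT action values at one intersection, using SIA weights.

    SIA: X is always reached, Y is reached with probability p, so
    P(at X) = 1/(1+p) and P(at Y) = p/(1+p). A CDT deviation changes
    only the current instance; the other instance follows policy p.
    """
    at_x, at_y = 1 / (1 + p), p / (1 + p)
    exit_now = at_x * 0 + at_y * 4
    continue_now = at_x * ((1 - p) * 4 + p * 1) + at_y * 1
    return exit_now, continue_now

# At the UDT optimum the CDT+SIA agent is exactly indifferent between
# exiting and continuing: the local optimality condition holds.
exit_v, cont_v = cdt_sia_action_values(p_star)
print(p_star, exit_v, cont_v)
```

The converse fails, which is the "necessary but not sufficient" part of the summary: other policies can also satisfy a local indifference or first-order condition without being globally optimal.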

 Indifference utility functions discussion post by Stuart Armstrong 45 days ago | discuss
Confirmed Selective Oracle
post by Stuart Armstrong 45 days ago | discuss

A putative new idea for AI control; index here.

I originally came up with a method for – safely? – extracting high impact from low impact AIs.

But it just occurred to me that the idea could be used for standard Oracles, in a way that keeps them arguably safe, without needing to define truth or similar concepts. So introducing:

The Confirmed Selective Oracle!

Cake or Death toy model for corrigibility
post by Stuart Armstrong 45 days ago | Patrick LaVictoire likes this | discuss

Let’s think of the classical Cake or Death problem from the point of view of corrigibility. The aim here is to construct a toy model sufficiently complex that it shows all the problems that derail classical value learning and corrigibility.

The alternate hypothesis for AIs in virtual worlds
post by Stuart Armstrong 46 days ago | discuss

A putative new idea for AI control; index here.

In the “AIs in virtual worlds” setup, the AI entertains two hypotheses: one, $$W$$, that it lives in a deterministic world which it knows about (including itself in the world), and $$W'$$, an alternate hypothesis that the world is “like” $$W$$ but contains some stochastic effects that correspond to us messing with the world. If the AI ever believes strongly enough in $$W'$$, it shuts down.
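A minimal sketch of this shutdown rule, with illustrative numbers of my own (the prior, noise rate, and threshold are assumptions, not from the post): under $$W$$ every observation must match the deterministic prediction exactly, while under $$W'$$ observations occasionally deviate, so a single unpredicted observation sends the posterior on $$W'$$ to 1.

```python
# Toy sketch (illustrative parameters, not from the post) of the
# shutdown rule: track the posterior over W (deterministic, fully
# predicted world) versus W' (the same world plus outside
# interference), and shut down once P(W') crosses a threshold.

def posterior_w_prime(observations, predictions,
                      prior_w_prime=0.01, noise=0.05):
    """Return P(W') after comparing observations with W's predictions.

    Under W each observation must match its prediction exactly
    (likelihood 1 if it matches, 0 otherwise). Under W' each
    observation independently deviates with probability `noise`.
    """
    p_w, p_wp = 1 - prior_w_prime, prior_w_prime
    for obs, pred in zip(observations, predictions):
        match = (obs == pred)
        p_w *= 1.0 if match else 0.0
        p_wp *= (1 - noise) if match else noise
        total = p_w + p_wp
        p_w, p_wp = p_w / total, p_wp / total
    return p_wp

THRESHOLD = 0.5

# All predictions confirmed: belief in W' only shrinks; keep running.
print(posterior_w_prime([1, 2, 3], [1, 2, 3]) < THRESHOLD)   # True

# One unpredicted observation is impossible under deterministic W, so
# the posterior jumps to P(W') = 1 and the AI shuts down.
print(posterior_w_prime([1, 2, 9], [1, 2, 3]) >= THRESHOLD)  # True
```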

 Two problems with causal-counterfactual utility indifference discussion post by Jessica Taylor 61 days ago | Patrick LaVictoire, Stuart Armstrong and Vladimir Slepnev like this | discuss
 Is logic epistemically appropriate? discussion post by Abram Demski 62 days ago | Jessica Taylor likes this | discuss
 Removing interrupted histories doesn't debias discussion post by Stuart Armstrong 67 days ago | Patrick LaVictoire likes this | discuss
 You can't beat a troll by predicting it. discussion post by Abram Demski 67 days ago | discuss
 Example of double indifference discussion post by Stuart Armstrong 77 days ago | Patrick LaVictoire likes this | 2 comments
Anything you can do with n AIs, you can do with two (with directly opposed objectives)
post by Jessica Taylor 82 days ago | Patrick LaVictoire and Stuart Armstrong like this | 2 comments

Summary: For any normal-form game, it’s possible to cast the problem of finding a correlated equilibrium in this game as a 2-player zero-sum game. This seems useful because zero-sum games are easy to analyze and more resistant to collusion.
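The post's reduction itself is not reproduced here, but the "easy to analyze" point can be illustrated on the simplest case, with standard game-theory facts rather than the post's construction: a 2x2 zero-sum game with no saddle point has a closed-form minimax mixture that makes the opponent indifferent.

```python
# NOT the post's reduction -- just a small illustration that two-player
# zero-sum games are easy to analyze: for a 2x2 zero-sum game with no
# pure saddle point, the row player's minimax mixture has a closed
# form, chosen so the column player is indifferent between columns.

def solve_2x2_zero_sum(a, b, c, d):
    """Row payoffs [[a, b], [c, d]]; returns (P(row 1), game value).

    Assumes both players mix (no pure saddle point). The row mix p
    solves the indifference condition
        p*a + (1-p)*c == p*b + (1-p)*d.
    """
    denom = (a - b) - (c - d)
    p = (d - c) / denom
    value = p * a + (1 - p) * c
    return p, value

# Matching pennies: row wins on a match, column wins on a mismatch.
p, v = solve_2x2_zero_sum(1, -1, -1, 1)
print(p, v)  # 0.5 0.0 -- the familiar 50/50 equilibrium, value zero
```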

Corrigibility for AIXI via double indifference
post by Stuart Armstrong 83 days ago | discuss

A putative new idea for AI control; index here.

This post sketches out how one could extend corrigibility to AIXI, using both utility indifference and double indifference approaches.

The arguments are intended to be rigorous, but need to be checked, and convergence results are not proved. A full treatment of “probability estimators estimating probability estimators” will of course need the full machinery for logical uncertainty that MIRI is developing. I also feel the recursion formulas at the end could be simplified.
