The set of Logical Inductors is not Convex
post by Scott Garrabrant 11 hours ago | Sam Eisenstat, Abram Demski and Patrick LaVictoire like this | 1 comment

Sam Eisenstat asked the following interesting question: Given two logical inductors over the same deductive process, is every (rational) convex combination of them also a logical inductor? Surprisingly, the answer is no! Here is my counterexample.
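Concretely, the question asks: given logical inductors $$\{\mathbb P_n\}$$ and $$\{\mathbb Q_n\}$$ over the same deductive process and a rational $$\lambda\in(0,1)$$, must the pointwise mixture

$$\mathbb R_n(\phi)=\lambda\,\mathbb P_n(\phi)+(1-\lambda)\,\mathbb Q_n(\phi)$$

also be a logical inductor? The counterexample shows that it need not be.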

Logical Inductors contain Logical Inductors over other complexity classes
post by Scott Garrabrant 22 hours ago | Jessica Taylor, Patrick LaVictoire and Tsvi Benson-Tilsen like this | discuss

In the Logical Induction paper, we give a definition of logical inductors over polynomial time traders. It is clear from our definition that our use of polynomial time is rather arbitrary, and we could define e.g. an exponential time logical inductor. However, it may be less clear that logical inductors over one complexity class actually contain logical inductors over other complexity classes within them.

 Learning doesn't solve philosophy of ethics discussion post by Stuart Armstrong 1 day ago | discuss
Model of human (ir)rationality
post by Stuart Armstrong 1 day ago | discuss

A putative new idea for AI control; index here.

This post is just an initial foray into modelling human irrationality, for the purpose of successful value learning. Its purpose is not to be a full model, but to have enough detail that various common situations can be successfully modelled. The important thing is to model humans in ways that humans can understand (as it’s our definition which determines what’s a bias and what’s a preference in humans).

Heroin model: AI "manipulates" "unmanipulatable" reward
post by Stuart Armstrong 6 days ago | 9 comments

A putative new idea for AI control; index here.

A conversation with Jessica has revealed that people weren’t understanding my points about an AI manipulating the learning process. So here’s a formal model of a CIRL-style AI with a prior over human preferences that treats them as an unchangeable historical fact, yet which will manipulate human preferences in practice.
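The manipulation incentive can be seen in a minimal toy sketch (my own illustration, not the post's formal model; all names and numbers here are assumptions). The AI's prior treats the human preference as a fixed fact, but its action changes what evidence the human produces, so the posterior is steerable:

```python
# Toy sketch: an AI holds a prior over a "fixed" human preference
# theta in {likes, dislikes}, but its own action determines the evidence
# the human emits, so the inferred preference is steerable.

def posterior_likes(evidence, prior=0.5):
    # P(theta = likes | human's report), via Bayes with simple likelihoods
    p_yes_given_likes = 0.9
    p_yes_given_dislikes = 0.1
    if evidence == "yes":
        num = p_yes_given_likes * prior
        den = num + p_yes_given_dislikes * (1 - prior)
    else:
        num = (1 - p_yes_given_likes) * prior
        den = num + (1 - p_yes_given_dislikes) * (1 - prior)
    return num / den

def human_report(action):
    # "administer" chemically makes the human report "yes" regardless of
    # theta; "ask" elicits an honest report (here: the human dislikes it).
    return "yes" if action == "administer" else "no"

# The AI scores each action by its *inferred* probability that the human
# likes the substance, then picks the action with the highest score.
scores = {a: posterior_likes(human_report(a)) for a in ("administer", "ask")}
best = max(scores, key=scores.get)
print(best)  # -> administer
```

Even though the prior never changes and preferences are modelled as historical fact, the action that manufactures favourable evidence wins.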

Logical Inductors that trust their limits
post by Scott Garrabrant 6 days ago | Jessica Taylor and Patrick LaVictoire like this | discuss

Here is another open question related to Logical Inductors. I have not thought about it very long, so it might be easy.

Does there exist a logical inductor $$\{\mathbb P_n\}$$ over PA such that for all $$\phi$$:

1. PA proves that $$\mathbb P_\infty(\phi)$$ exists and is in $$[0,1]$$, and

2. $$\mathbb{E}_n(\mathbb{P}_\infty(\phi))\eqsim_n\mathbb{P}_n(\phi)$$?

Stratified learning and action
post by Stuart Armstrong 12 days ago | discuss

A putative new idea for AI control; index here.

(C)IRL is not solely a learning process
post by Stuart Armstrong 13 days ago | 28 comments

A putative new idea for AI control; index here.

I feel Inverse Reinforcement Learning (IRL) and Cooperative Inverse Reinforcement Learning (CIRL) are very good ideas, and will likely be essential for safe AI if we can’t come up with some sort of sustainable low impact, modular, or Oracle design. But IRL and CIRL have a weakness. In a nutshell:

1. The models (C)IRL uses for humans are underspecified.
2. This should cause CIRL to have motivated and manipulative learning.
3. Even without that, (C)IRL can end up fitting a terrible model to humans.
4. To solve those issues, (C)IRL will need to make creative modelling decisions that go beyond (standard) learning.
 Learning values versus learning knowledge discussion post by Stuart Armstrong 13 days ago | 5 comments
Universal Inductors
post by Scott Garrabrant 13 days ago | Sam Eisenstat, Benja Fallenstein, Jessica Taylor, Patrick LaVictoire and Tsvi Benson-Tilsen like this | discuss

Now that the Logical Induction paper is out, I am directing my attention towards decision theory. The approach I currently think will be most fruitful is attempting to make a logically updateless version of Wei Dai’s Updateless Decision Theory. Abram Demski has posted on here about this, but I think Logical Induction provides a new angle with which we can attack the problem. This post will present an alternate way of viewing Logical Induction which I think will be especially helpful for building a logical UDT. (The Logical Induction paper is a prerequisite for this post.)

IRL is hard

We show that, assuming the existence of public-key cryptography, there is an environment in which Inverse Reinforcement Learning is computationally intractable, even though the “teacher” agent, the environment and the utility functions are computable in polynomial time and there is only 1 bit of information to learn.

 Oracle design as de-black-boxer. discussion post by Stuart Armstrong 25 days ago | discuss
 Causal graphs and counterfactuals discussion post by Stuart Armstrong 28 days ago | 2 comments
Simplified explanation of stratification
post by Stuart Armstrong 28 days ago | Patrick LaVictoire likes this | 4 comments

A putative new idea for AI control; index here.

I’ve previously talked about stratified indifference/learning. In this short post, I’ll try and present the idea, as simply and clearly as possible.

 Corrigibility through stratified indifference and learning discussion post by Stuart Armstrong 39 days ago | 9 comments
Modeling the capabilities of advanced AI systems as episodic reinforcement learning
post by Jessica Taylor 39 days ago | Patrick LaVictoire likes this | 6 comments

Here I’ll summarize the main abstraction I use for thinking about future AI systems. This is essentially the same model that Paul uses. I’m not actually introducing any new ideas in this post; mostly this is intended to summarize my current views.

 Can we hybridize Absent-Minded Driver with Death in Damascus? discussion post by Eliezer Yudkowsky 56 days ago | Patrick LaVictoire likes this | 1 comment
Learning (meta-)preferences
post by Stuart Armstrong 62 days ago | Patrick LaVictoire likes this | 2 comments

A putative new idea for AI control; index here.

There are various methods, such as Cooperative Inverse Reinforcement Learning (CIRL), that aim to have an AI deduce human preferences in some fashion.

The problem is that humans are not rational - citation certainly not needed. But, worse than that, they are not rational in ways that seriously complicate the task of fitting a reward or utility function to them. I presented one problem this entails in a previous post, which discussed the problems that emerge when an AI can influence a human’s preferences through the way it presents the issues.
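The difficulty of fitting a reward function to an irrational human can be made concrete with a toy sketch (my own illustration, not from the post): two (rationality, reward) models that predict identical behaviour while attributing opposite preferences, so the observed choices alone underdetermine the reward.

```python
# Toy sketch: the same observed choice is explained equally well by
# "rational agent with reward R1" and "biased agent with reward R2".

def rational_choice(reward):
    # a fully rational agent picks the highest-reward option
    return max(reward, key=reward.get)

def biased_choice(reward, compulsion="cake"):
    # an agent with an overriding bias picks the compulsive option,
    # whatever its true reward function says
    return compulsion

model_a = ("rational", {"cake": 1.0, "salad": 0.0})
model_b = ("biased", {"cake": 0.0, "salad": 1.0})

pred_a = rational_choice(model_a[1])
pred_b = biased_choice(model_b[1])

# Both models predict the same observed behaviour...
assert pred_a == pred_b == "cake"
# ...yet they attribute opposite reward functions to the human.
assert model_a[1]["salad"] != model_b[1]["salad"]
```

Without extra assumptions about which deviations are biases and which are preferences, behaviour cannot distinguish the two models.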

What does an imperfect agent want?
post by Stuart Armstrong 62 days ago | Patrick LaVictoire likes this | discuss

A putative new idea for AI control; index here.

I’ll roughly divide ways of establishing human preferences into four categories:

1. Assume true
2. Best fit
3. Proxy measures
4. Modelled irrationality
Three Oracle designs
post by Stuart Armstrong 69 days ago | Patrick LaVictoire likes this | discuss

A putative new idea for AI control; index here.

An initial draft looking at three ways of getting information out of Oracles, information that’s useful and safe - in theory.

One thing I may need to do is find slightly better names for them ^_^

Good and safe uses of AI Oracles

Abstract:

Abstract model of human bias
post by Stuart Armstrong 83 days ago | 5 comments

A putative new idea for AI control; index here.

Any suggestions for refining this model are welcome!

Somewhat inspired by the previous post, this is a model of human bias that can be used to test theories that want to compute the “true” human preferences. The basic idea is to formalise the question:

• If the AI can make the human give any answer to any question, can it figure out what humans really want?
When the AI closes a door, it opens a window
post by Stuart Armstrong 83 days ago | discuss

A putative new idea for AI control; index here.

Some methods, such as Cooperative Inverse Reinforcement Learning, have the AI assume that humans have access to a true reward function, that the AI will then attempt to maximise. This post is an attempt to clarify a specific potential problem with these methods; it is related to the third problem described here, but hopefully makes it clearer.

 Generative adversarial models, informed by arguments discussion post by Jessica Taylor 92 days ago | discuss
 Simpler, cruder, virtual world AIs discussion post by Stuart Armstrong 93 days ago | Patrick LaVictoire likes this | discuss
 Questioning GLS-Coherence discussion post by Abram Demski 101 days ago | discuss