Counterfactual do-what-I-mean discussion post by Stuart Armstrong 18 hours ago | discuss
 Training Garrabrant inductors to predict counterfactuals discussion post by Tsvi Benson-Tilsen 1 day ago | Scott Garrabrant likes this | discuss
 Desiderata for decision theory discussion post by Tsvi Benson-Tilsen 1 day ago | Scott Garrabrant likes this | discuss
 Transitive negotiations with counterfactual agents discussion post by Scott Garrabrant 7 days ago | Patrick LaVictoire and Tsvi Benson-Tilsen like this | discuss
 Attacking the grain of truth problem using Bayes-Savage agents discussion post by Vadim Kosoy 7 days ago | Paul Christiano likes this | discuss
post by Ryan Carey 12 days ago | Vadim Kosoy and Patrick LaVictoire like this | discuss

Note: This describes an idea of Jessica Taylor’s.

Control and security
post by Paul Christiano 12 days ago | Jessica Taylor and Vladimir Nesov like this | 7 comments

I used to think of AI security as largely unrelated to AI control, and my impression is that some people on this forum probably still do. I’ve recently shifted towards seeing control and security as basically the same, and thinking that security may often be a more appealing way to think and talk about control.

Online Learning 1: Bias-detecting online learners
post by Ryan Carey 21 days ago | Vadim Kosoy, Jessica Taylor and Paul Christiano like this | 6 comments

Note: This describes an idea of Jessica Taylor’s, and is the first of several posts about aspects of online learning.

 Index of some decision theory posts discussion post by Tsvi Benson-Tilsen 21 days ago | Ryan Carey, Jack Gallagher, Jessica Taylor and Scott Garrabrant like this | discuss
Logical inductor limits are dense under pointwise convergence
post by Sam Eisenstat 22 days ago | Abram Demski, Patrick LaVictoire, Scott Garrabrant and Tsvi Benson-Tilsen like this | discuss

Logical inductors [1] are very complex objects, and even their limits are hard to get a handle on. In this post, I investigate the topological properties of the set of all limits of logical inductors.

The set of Logical Inductors is not Convex
post by Scott Garrabrant 30 days ago | Sam Eisenstat, Abram Demski and Patrick LaVictoire like this | 1 comment

Sam Eisenstat asked the following interesting question: Given two logical inductors over the same deductive process, is every (rational) convex combination of them also a logical inductor? Surprisingly, the answer is no! Here is my counterexample.
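As a toy illustration of the operation the question is about (not of the counterexample itself), a rational convex combination of two belief assignments simply mixes their prices pointwise; the post's result is that this pointwise mixing can fail to preserve the logical-inductor property. A minimal sketch, with made-up price functions standing in for two inductors at a fixed stage:

```python
from fractions import Fraction

def convex_combination(p1, p2, lam):
    """Pointwise mix of two price assignments with rational weight lam in [0, 1].

    p1, p2: functions mapping a sentence (here, any hashable label) to a
    price in [0, 1]. Returns the mixed assignment lam*p1 + (1 - lam)*p2.
    """
    return lambda phi: lam * p1(phi) + (1 - lam) * p2(phi)

# Two toy "inductors" that disagree on a sentence phi:
p1 = lambda phi: Fraction(1, 5)   # prices phi at 1/5
p2 = lambda phi: Fraction(4, 5)   # prices phi at 4/5

mix = convex_combination(p1, p2, Fraction(1, 2))
print(mix("phi"))  # -> 1/2
```

The counterexample in the post shows that even though each mixed price is a perfectly good probability, the traders that exploit the mixture need not be exploitable against either original inductor.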

Logical Inductors contain Logical Inductors over other complexity classes
post by Scott Garrabrant 31 days ago | Jessica Taylor, Patrick LaVictoire and Tsvi Benson-Tilsen like this | discuss

In the Logical Induction paper, we give a definition of logical inductors over polynomial time traders. It is clear from our definition that our use of polynomial time is rather arbitrary, and we could define e.g. an exponential time logical inductor. However, it may be less clear that logical inductors over one complexity class actually contain logical inductors over other complexity classes within them.

 Learning doesn't solve philosophy of ethics discussion post by Stuart Armstrong 31 days ago | discuss
Model of human (ir)rationality
post by Stuart Armstrong 31 days ago | discuss

A putative new idea for AI control; index here.

This post is just an initial foray into modelling human irrationality, for the purpose of successful value learning. Its purpose is not to be a full model, but to have enough detail that various common situations can be successfully modelled. The important thing is to model humans in ways that humans can understand (since it is our definition that determines what counts as a bias and what counts as a preference in humans).

Heroin model: AI "manipulates" "unmanipulatable" reward
post by Stuart Armstrong 36 days ago | 9 comments

A putative new idea for AI control; index here.

A conversation with Jessica has revealed that people weren’t understanding my points about the AI manipulating the learning process. So here’s a formal model of a CIRL-style AI whose prior treats human preferences as an unchangeable historical fact, yet which will manipulate those preferences in practice.

Logical Inductors that trust their limits
post by Scott Garrabrant 37 days ago | Jack Gallagher, Jessica Taylor and Patrick LaVictoire like this | 2 comments

Here is another open question related to Logical Inductors. I have not thought about it very long, so it might be easy.

Does there exist a logical inductor $$\{\mathbb P_n\}$$ over PA such that for all $$\phi$$:

1. PA proves that $$\mathbb P_\infty(\phi)$$ exists and is in $$[0,1]$$, and

2. $$\mathbb{E}_n(\mathbb{P}_\infty(\phi))\eqsim_n\mathbb{P}_n(\phi)$$?
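Here $$\eqsim_n$$ is, as I understand it, the asymptotic-equality notation of the Logical Induction paper:

```latex
a_n \eqsim_n b_n \quad\iff\quad \lim_{n\to\infty} \left( a_n - b_n \right) = 0,
```

so condition 2 asks that the inductor's stage-$$n$$ expectation of its own limiting probability of $$\phi$$ track its stage-$$n$$ probability of $$\phi$$.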

Stratified learning and action
post by Stuart Armstrong 42 days ago | discuss

A putative new idea for AI control; index here.

(C)IRL is not solely a learning process
post by Stuart Armstrong 43 days ago | 29 comments

A putative new idea for AI control; index here.

I feel Inverse Reinforcement Learning (IRL) and Cooperative Inverse Reinforcement Learning (CIRL) are very good ideas, and will likely be essential for safe AI if we can’t come up with some sort of sustainable low impact, modular, or Oracle design. But IRL and CIRL have a weakness. In a nutshell:

1. The models (C)IRL uses for humans are underspecified.
2. This should cause CIRL to have motivated and manipulative learning.
3. Even without that, (C)IRL can end up fitting a terrible model to humans.
4. To solve those issues, (C)IRL will need to make creative modelling decisions that go beyond (standard) learning.
 Learning values versus learning knowledge discussion post by Stuart Armstrong 43 days ago | 5 comments
Universal Inductors
post by Scott Garrabrant 44 days ago | Sam Eisenstat, Jack Gallagher, Benja Fallenstein, Jessica Taylor, Patrick LaVictoire and Tsvi Benson-Tilsen like this | discuss

Now that the Logical Induction paper is out, I am directing my attention towards decision theory. The approach I currently think will be most fruitful is attempting to make a logically updateless version of Wei Dai’s Updateless Decision Theory. Abram Demski has posted on here about this, but I think Logical Induction provides a new angle with which we can attack the problem. This post will present an alternate way of viewing Logical Induction which I think will be especially helpful for building a logical UDT. (The Logical Induction paper is a prerequisite for this post.)

IRL is hard

We show that, assuming the existence of public-key cryptography, there is an environment in which Inverse Reinforcement Learning is computationally intractable, even though the “teacher” agent, the environment and the utility functions are all computable in polynomial time and there is only 1 bit of information to learn.

 Oracle design as de-black-boxer. discussion post by Stuart Armstrong 55 days ago | discuss
 Causal graphs and counterfactuals discussion post by Stuart Armstrong 58 days ago | 2 comments
Simplified explanation of stratification
post by Stuart Armstrong 58 days ago | Patrick LaVictoire likes this | 5 comments

A putative new idea for AI control; index here.

I’ve previously talked about stratified indifference/learning. In this short post, I’ll try to present the idea as simply and clearly as possible.

 Corrigibility through stratified indifference and learning discussion post by Stuart Armstrong 69 days ago | 9 comments
