1. Oracle paper discussion post by Stuart Armstrong 35 days ago | Vladimir Slepnev likes this | discuss
 2. Stable agent, subagent-unstable discussion post by Stuart Armstrong 50 days ago | discuss
 3. Rationalising humans: another mugging, but not Pascal's discussion post by Stuart Armstrong 64 days ago | discuss
 4. Kolmogorov complexity makes reward learning worse discussion post by Stuart Armstrong 72 days ago | discuss
5. Reward learning summary
post by Stuart Armstrong 72 days ago | discuss

A putative new idea for AI control; index here.

I’ve been posting a lot on value/reward learning recently, and, as usual, the process of posting (and some feedback) means that those posts are partially superseded already - and some of them are overly complex.

So here I’ll try to summarise my current insights briefly, with links to the other posts where appropriate (each link covers all the points noted since the previous link):

 6. Our values are underdefined, changeable, and manipulable discussion post by Stuart Armstrong 76 days ago | discuss
 7. Normative assumptions: regret discussion post by Stuart Armstrong 78 days ago | discuss
 8. Bias in rationality is much worse than noise discussion post by Stuart Armstrong 78 days ago | discuss
 9. Learning values, or defining them? discussion post by Stuart Armstrong 78 days ago | discuss
10. Rationality and overriding human preferences: a combined model
post by Stuart Armstrong 89 days ago | discuss

A putative new idea for AI control; index here.

Previously, I presented a model in which a “rationality module” (now renamed rationality planning algorithm, or planner) kept track of two things: how well a human was maximising their actual reward, and whether their preferences had been overridden by AI action.

The second didn’t integrate well into the first, and was tracked by a clunky extra Boolean. Since the two didn’t fit together, I was going to separate the concepts, especially since the Boolean felt a bit too… Boolean, allowing no grading. But then I realised that they actually fit together completely naturally, without the need for arbitrary Booleans or other tricks.

11. Humans can be assigned any values whatsoever...
post by Stuart Armstrong 96 days ago | Ryan Carey likes this | discuss

A putative new idea for AI control; index here. Crossposted at LessWrong 2.0. This post has nothing really new for this message board, but I’m posting it here because of the subsequent posts I’m intending to write.

Humans have no values… nor does any agent. Unless you make strong assumptions about their rationality. And depending on those assumptions, you can get humans to have any values whatsoever.

 12. Should I post technical ideas here or on LessWrong 2.0? discussion post by Stuart Armstrong 105 days ago | Abram Demski likes this | 3 comments
13. Resolving human inconsistency in a simple model
post by Stuart Armstrong 105 days ago | Abram Demski likes this | 1 comment

A putative new idea for AI control; index here.

This post will present a simple model of an inconsistent human, and ponder how to resolve their inconsistency.

Let $$\bf{H}$$ be our agent, in a turn-based world. Let $$R^l$$ and $$R^s$$ be two simple reward functions at each turn. The reward $$R^l$$ is thought of as being a ‘long-term’ reward, while $$R^s$$ is a short-term one.
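As a hedged illustration of the kind of inconsistency meant (this sketch is my own, not taken from the post; the action names and reward values are hypothetical), consider an agent that greedily maximises $$R^s$$ each turn, even when $$R^l$$ would favour the other action:

```python
# Illustrative sketch (not the post's model): a turn-based agent H with a
# short-term reward R^s and a long-term reward R^l. The greedy agent picks
# the R^s-maximising action each turn; a consistent planner maximises the
# combined reward over the whole trajectory.

ACTIONS = ["indulge", "abstain"]  # hypothetical action names

def r_short(action):
    # R^s: immediate payoff each turn
    return 1.0 if action == "indulge" else 0.0

def r_long(action):
    # R^l: long-term payoff per turn
    return 0.0 if action == "indulge" else 2.0

def greedy_policy(turns):
    # H maximises R^s each turn, ignoring R^l
    return [max(ACTIONS, key=r_short) for _ in range(turns)]

def planner_policy(turns):
    # a consistent planner maximises R^l + R^s over the trajectory
    return [max(ACTIONS, key=lambda a: r_long(a) + r_short(a))
            for _ in range(turns)]

def total_reward(trajectory):
    return sum(r_long(a) + r_short(a) for a in trajectory)

# the greedy agent predictably underperforms its own combined reward
assert total_reward(greedy_policy(10)) < total_reward(planner_policy(10))
```

The inconsistency shows up as a stable gap between what the per-turn policy collects and what the trajectory-level policy would have collected.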

14. The Doomsday argument in anthropic decision theory
post by Stuart Armstrong 139 days ago | Abram Demski likes this | discuss

In Anthropic Decision Theory (ADT), behaviours that resemble the Self Sampling Assumption (SSA) derive from average utilitarian preferences (and from certain specific selfish preferences).

However, SSA implies the doomsday argument, and, to date, I hadn’t found a good way to express the doomsday argument within ADT.

This post will remedy that hole, by showing how there is a natural doomsday-like behaviour for average utilitarian agents within ADT.
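As an informal reconstruction of why this works (my own sketch, not text from the post): an average utilitarian’s utility over a population of size $$N$$ is $$U_{avg} = \frac{1}{N}\sum_{i=1}^N u_i$$, so a fixed individual benefit $$b$$ contributes only $$b/N$$. In a bet whose payoff arrives either in a small-population world ($$N_s$$ observers) or a large one ($$N_l \gg N_s$$), the same benefit counts $$N_l/N_s$$ times more in the small world, so the agent behaves as if weighting hypotheses towards fewer observers, which is the doomsday-like shift.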

15. "Like this world, but..."
post by Stuart Armstrong 187 days ago | discuss

A putative new idea for AI control; index here.

Pick a very unsafe goal: $$G=$$“AI, make this world richer and less unequal.” What does this mean as a goal, and can we make it safe?

I’ve started to sketch out how we can codify “human understanding” in terms of human ability to answer questions.

Here I’m investigating the reverse problem, to see whether the same idea can be used to give instructions to an AI.

16. Humans are not agents: short vs long term
post by Stuart Armstrong 222 days ago | 2 comments

A putative new idea for AI control; index here.

This is an example of humans not being (idealised) agents.

Imagine a human who has a preference not to live beyond a hundred years. However, they want to live to see next year, and it’s predictable that, every year they are alive, they will have the same desire to survive until the following year.
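The yearly-preference example can be sketched as code (an illustrative reconstruction; the function names and ages are my own, not from the post):

```python
# Hypothetical sketch: the human's global preference is to not live beyond
# 100, but their local, per-year preference is always to survive one more
# year. Following the local preference every year violates the global one.

MAX_PREFERRED_AGE = 100  # global preference: don't live past this age

def wants_another_year(age):
    # local preference: true at every age, predictably
    return True

def simulate(start_age, years):
    age = start_age
    for _ in range(years):
        if wants_another_year(age):
            age += 1
    return age

final_age = simulate(95, 10)
violates_global_preference = final_age > MAX_PREFERRED_AGE  # True
```

No single yearly choice looks irrational, yet the sequence of choices predictably contradicts the stated long-term preference: the human is not an (idealised) agent.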

 17. New circumstances, new values? discussion post by Stuart Armstrong 225 days ago | discuss
 18. Futarchy, Xrisks, and near misses discussion post by Stuart Armstrong 229 days ago | Abram Demski likes this | discuss
19. Divergent preferences and meta-preferences
post by Stuart Armstrong 233 days ago | discuss

A putative new idea for AI control; index here.

In simple graphical form, here is the problem of divergent human preferences:

 20. Optimisation in manipulating humans: engineered fanatics vs yes-men discussion post by Stuart Armstrong 237 days ago | discuss
21. Acausal trade: conclusion: theory vs practice
post by Stuart Armstrong 246 days ago | discuss

A putative new idea for AI control; index here.

When I started this dive into acausal trade, I expected to find subtle and interesting theoretical considerations. Instead, most of the issues are practical.

 22. Acausal trade: trade barriers discussion post by Stuart Armstrong 246 days ago | discuss
 23. Acausal trade: universal utility, or selling non-existence insurance too late discussion post by Stuart Armstrong 247 days ago | discuss
 24. Acausal trade: full decision algorithms discussion post by Stuart Armstrong 247 days ago | discuss
post by Stuart Armstrong 251 days ago | discuss

A putative new idea for AI control; index here.

I’ve never really understood acausal trade. So in a short series of posts, I’ll attempt to analyse the concept sufficiently that I can grasp it - and hopefully so others can grasp it as well.
