1. Using lying to detect human values link by Stuart Armstrong 457 days ago | discuss
 2. Intuitive examples of reward function learning? link by Stuart Armstrong 466 days ago | discuss
 3. Beyond algorithmic equivalence: self-modelling link by Stuart Armstrong 471 days ago | discuss
 4. Beyond algorithmic equivalence: algorithmic noise link by Stuart Armstrong 471 days ago | discuss
5. Why we want unbiased learning processes
post by Stuart Armstrong 480 days ago | discuss

Crossposted at Lesserwrong.

tl;dr: if an agent has a biased learning process, it may choose actions that are worse (with certainty) for every possible reward function it could be learning.
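
A toy illustration of how that can happen (my own numbers, not the post's): the agent is unsure between rewards $$R_0$$ and $$R_1$$, each with prior probability $$1/2$$, and its learning process can be influenced. A rigging action $$m$$ costs $$1$$ under both rewards but forces the agent to "learn" $$R_0$$ with certainty; a neutral action $$n$$ costs nothing and leaves the estimate at $$1/2$$ each. The agent then picks $$x_0$$ (worth $$10$$ under $$R_0$$, $$0$$ under $$R_1$$) or $$x_1$$ (the reverse). Judged by expected value under its final learned distribution, rigging is worth $$10-1=9$$ while not rigging is worth only $$0.5 \cdot 10 = 5$$, so the agent rigs. Yet rigging and then taking $$x_0$$ scores $$9$$ rather than $$10$$ under $$R_0$$, and $$-1$$ rather than $$0$$ under $$R_1$$: worse with certainty for every reward it could be learning.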

 6. Oracle paper discussion post by Stuart Armstrong 549 days ago | Vladimir Slepnev likes this | discuss
 7. Stable agent, subagent-unstable discussion post by Stuart Armstrong 564 days ago | discuss
 8. Rationalising humans: another mugging, but not Pascal's discussion post by Stuart Armstrong 578 days ago | discuss
 9. Kolmogorov complexity makes reward learning worse discussion post by Stuart Armstrong 586 days ago | discuss
10. Reward learning summary
post by Stuart Armstrong 586 days ago | discuss

A putative new idea for AI control; index here.

I’ve been posting a lot on value/reward learning recently, and, as usual, the process of posting (and some feedback) means that those posts are already partially superseded, and some of them are overly complex.

So here I’ll try to summarise my current insights briefly, with links to the other posts where appropriate (each link covers all the points noted since the previous link):

 11. Our values are underdefined, changeable, and manipulable discussion post by Stuart Armstrong 590 days ago | discuss
 12. Normative assumptions: regret discussion post by Stuart Armstrong 592 days ago | discuss
 13. Bias in rationality is much worse than noise discussion post by Stuart Armstrong 592 days ago | discuss
 14. Learning values, or defining them? discussion post by Stuart Armstrong 592 days ago | discuss
15. Rationality and overriding human preferences: a combined model
post by Stuart Armstrong 603 days ago | discuss

A putative new idea for AI control; index here.

Previously, I presented a model in which a “rationality module” (now renamed rationality planning algorithm, or planner) kept track of two things: how well a human was maximising their actual reward, and whether their preferences had been overridden by AI action.

The second didn’t integrate well into the first, and was tracked by a clunky extra Boolean. I was going to separate the two concepts, especially since the Boolean felt a bit too… Boolean, not allowing for any grading. But then I realised that they actually fit together completely naturally, without the need for arbitrary Booleans or other tricks.
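
In rough symbols (a sketch of the earlier setup only, using the planner notation of the later posts): if the planner $$p$$ maps a candidate reward $$R$$ to a policy $$p(R)$$, the first quantity is roughly how close the human policy $$\pi_H$$ comes to $$p(R)$$, while the second was a separate Boolean flag recording whether $$\pi_H$$ had instead been imposed by the AI.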

16. Humans can be assigned any values whatsoever...
post by Stuart Armstrong 610 days ago | Ryan Carey likes this | discuss

A putative new idea for AI control; index here. Crossposted at LessWrong 2.0. This post has nothing really new for this message board, but I’m posting it here because of the subsequent posts I’m intending to write.

Humans have no values… nor does any agent. Unless you make strong assumptions about their rationality. And depending on those assumptions, you can get humans to have any values whatsoever.
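
A sketch of the argument behind that claim, in planner/reward notation (the notation here is mine, drawn from the related posts above): write the observed human policy as $$\pi_H = p(R)$$ for some planner $$p$$ and reward $$R$$. Then a fully rational planner paired with $$R$$, a fully anti-rational planner paired with $$-R$$, and an indifferent planner that outputs $$\pi_H$$ whatever reward it is given (paired with, say, the zero reward) all reproduce exactly the same behaviour. Observation alone cannot separate them; only assumptions about the human's rationality, i.e. about $$p$$, can.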

 17. Should I post technical ideas here or on LessWrong 2.0? discussion post by Stuart Armstrong 619 days ago | Abram Demski likes this | 3 comments
18. Resolving human inconsistency in a simple model
post by Stuart Armstrong 619 days ago | Abram Demski likes this | 1 comment

A putative new idea for AI control; index here.

This post will present a simple model of an inconsistent human, and ponder how to resolve their inconsistency.

Let $$\mathbf{H}$$ be our agent, in a turn-based world. Let $$R^l$$ and $$R^s$$ be two simple reward functions at each turn. The reward $$R^l$$ is thought of as a ‘long-term’ reward, while $$R^s$$ is a short-term one.
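
As a purely illustrative instance of that setup (my own numbers, not the post's): suppose $$R^s$$ gives $$+1$$ on any turn the human eats cake, while $$R^l$$ gives $$+10$$ at the end of the game if cake was eaten on at most one turn. A human who acts on $$R^s$$ alone every turn behaves inconsistently with the $$R^l$$ they would endorse when planning the whole game, and resolving the inconsistency means settling which combination of the two rewards to treat as what they really want.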

19. The Doomsday argument in anthropic decision theory
post by Stuart Armstrong 653 days ago | Abram Demski likes this | discuss

In Anthropic Decision Theory (ADT), behaviours that resemble the Self Sampling Assumption (SSA) derive from average utilitarian preferences (and from certain specific selfish preferences).

However, SSA implies the doomsday argument, and, to date, I hadn’t found a good way to express the doomsday argument within ADT.

This post will fill that gap, by showing that there is a natural doomsday-like behaviour for average utilitarian agents within ADT.
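
To illustrate the flavour of that behaviour with a standard toy case (numbers mine): a coin toss creates one agent on heads and two on tails, and every created agent can pay $$x$$ for a ticket that pays $$1$$ if the coin landed tails. A total utilitarian (whose copies all decide alike) values accepting at $$\tfrac{1}{2}(-x) + \tfrac{1}{2}\cdot 2(1-x)$$, so accepts whenever $$x < 2/3$$: SIA-like betting. An average utilitarian divides each world's gain by its population, valuing accepting at $$\tfrac{1}{2}(-x) + \tfrac{1}{2}(1-x)$$, so accepts only when $$x < 1/2$$: it refuses to bet as if large populations were more likely, which is the SSA-like behaviour from which doomsday-style conclusions follow.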

20."Like this world, but..."
post by Stuart Armstrong 701 days ago | discuss

A putative new idea for AI control; index here.

Pick a very unsafe goal: $$G=$$“AI, make this world richer and less unequal.” What does this mean as a goal, and can we make it safe?

I’ve started to sketch out how we can codify “human understanding” in terms of human ability to answer questions.

Here I’m investigating the reverse problem, to see whether the same idea can be used to give instructions to an AI.
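
On one possible reading of that reversal (my gloss, not a quotation from the post): rather than handing the AI $$G$$ as a utility function, a world $$w$$ would count as satisfying $$G$$ exactly when a human, asked the relevant questions about $$w$$, would answer that it is like the current world, but richer and less unequal.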

21. Humans are not agents: short vs long term
post by Stuart Armstrong 736 days ago | 2 comments

A putative new idea for AI control; index here.

This is an example of humans not being (idealised) agents.

Imagine a human who has a preference not to live beyond a hundred years. However, they want to live to the next year, and it’s predictable that every year they are alive, they will have the same desire to survive until the next year.
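
Spelling out why no idealised agent behaves like this: at every year $$t$$ they are alive, the human prefers surviving to year $$t+1$$; acting on that preference each year carries them past a hundred, contradicting their standing preference for not doing so. No single preference ordering over lifespans can rank living to $$101$$ above living to $$100$$ (the year-$$100$$ desire) while also ranking lifespans of at most $$100$$ above longer ones.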

 22. New circumstances, new values? discussion post by Stuart Armstrong 739 days ago | discuss
 23. Futarchy, Xrisks, and near misses discussion post by Stuart Armstrong 743 days ago | Abram Demski likes this | discuss
24. Divergent preferences and meta-preferences
post by Stuart Armstrong 747 days ago | discuss

A putative new idea for AI control; index here.

In simple graphical form, here is the problem of divergent human preferences:

 25. Optimisation in manipulating humans: engineered fanatics vs yes-men discussion post by Stuart Armstrong 751 days ago | discuss
