Intelligent Agent Foundations Forum
1. Using lying to detect human values
link by Stuart Armstrong 187 days ago | discuss
2. Intuitive examples of reward function learning?
link by Stuart Armstrong 195 days ago | discuss
3. Beyond algorithmic equivalence: self-modelling
link by Stuart Armstrong 201 days ago | discuss
4. Beyond algorithmic equivalence: algorithmic noise
link by Stuart Armstrong 201 days ago | discuss
5. Why we want unbiased learning processes
post by Stuart Armstrong 210 days ago | discuss

Crossposted at Lesserwrong.

tl;dr: if an agent has a biased learning process, it may choose actions that are worse (with certainty) for every possible reward function it could be learning.
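As a rough illustration of this tl;dr (a toy sketch with made-up numbers and a hypothetical rigging action, not the post's own example), suppose the agent scores each action by the reward function it expects to have learned after taking that action:

```python
# Toy numbers: two candidate reward functions, evaluated on the outcome of each action.
R0 = {"a": 0.0, "b": -1.0}
R1 = {"a": 10.0, "b": 9.0}
# Action "b" is strictly worse than "a" under both R0 and R1.

# Posterior over {R0, R1} the agent expects to hold after taking each action.
unbiased = {"a": {"R0": 0.5, "R1": 0.5}, "b": {"R0": 0.5, "R1": 0.5}}
biased = {"a": {"R0": 0.5, "R1": 0.5}, "b": {"R0": 0.0, "R1": 1.0}}  # "b" rigs learning towards R1

def score(action, posterior):
    """Score an action by the reward the agent expects to have learned afterwards."""
    p = posterior[action]
    return p["R0"] * R0[action] + p["R1"] * R1[action]

for name, posterior in [("unbiased", unbiased), ("biased", biased)]:
    scores = {a: score(a, posterior) for a in ("a", "b")}
    print(name, scores, "-> picks", max(scores, key=scores.get))
# unbiased {'a': 5.0, 'b': 4.0} -> picks a
# biased   {'a': 5.0, 'b': 9.0} -> picks b, even though b is worse with
# certainty under every reward function the agent could be learning
```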

continue reading »
6. Oracle paper
discussion post by Stuart Armstrong 279 days ago | Vladimir Slepnev likes this | discuss
7. Stable agent, subagent-unstable
discussion post by Stuart Armstrong 294 days ago | discuss
8. Rationalising humans: another mugging, but not Pascal's
discussion post by Stuart Armstrong 308 days ago | discuss
9. Kolmogorov complexity makes reward learning worse
discussion post by Stuart Armstrong 316 days ago | discuss
10. Reward learning summary
post by Stuart Armstrong 316 days ago | discuss

A putative new idea for AI control; index here.

I’ve been posting a lot on value/reward learning recently, and, as usual, the process of posting (and some feedback) means that those posts are already partially superseded, and some of them are overly complex.

So here I’ll try to briefly summarise my current insights, with links to the other posts where appropriate (each link covers all the points noted since the previous one):

continue reading »
11. Our values are underdefined, changeable, and manipulable
discussion post by Stuart Armstrong 320 days ago | discuss
12. Normative assumptions: regret
discussion post by Stuart Armstrong 322 days ago | discuss
13. Bias in rationality is much worse than noise
discussion post by Stuart Armstrong 322 days ago | discuss
14. Learning values, or defining them?
discussion post by Stuart Armstrong 322 days ago | discuss
15. Rationality and overriding human preferences: a combined model
post by Stuart Armstrong 332 days ago | discuss

A putative new idea for AI control; index here.

Previously, I presented a model in which a “rationality module” (now renamed rationality planning algorithm, or planner) kept track of two things: how well a human was maximising their actual reward, and whether their preferences had been overridden by AI action.

The second didn’t integrate well into the first, and was tracked by a clunky extra Boolean. Since the two didn’t fit together, I was going to separate the two concepts, especially since the Boolean felt a bit too… Boolean, not allowing for grading. But then I realised that they actually fit together completely naturally, without the need for arbitrary Booleans or other tricks.
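As a very loose sketch (hypothetical names, and only one plausible reading of how the two fit together; the post itself works this out properly), the shift is from a rationality score plus a separate override flag to a single graded assessment:

```python
# Hypothetical names; a sketch of the quantities being tracked, not the post's formalism.
from dataclasses import dataclass

@dataclass
class OldPlannerReport:
    rationality: float  # how well the human is maximising their actual reward
    overridden: bool    # clunky extra flag: did AI action override their preferences?

@dataclass
class CombinedPlannerReport:
    # One plausible reading of the combined model: preference override becomes
    # just another (graded) way the policy can fail to track the actual reward,
    # so a single graded assessment replaces the Boolean.
    reward_tracking: float
```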

continue reading »
16. Humans can be assigned any values whatsoever...
post by Stuart Armstrong 340 days ago | Ryan Carey likes this | discuss

A putative new idea for AI control; index here. Crossposted at LessWrong 2.0. This post has nothing really new for this message board, but I’m posting it here because of the subsequent posts I’m intending to write.

Humans have no values… nor does any agent. Unless you make strong assumptions about their rationality. And depending on those assumptions, you can get humans to have any values whatsoever.
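A toy version of the argument (illustrative code, not the post's formal construction): the same observed behaviour is equally well explained by a rational planner paired with one reward and by an anti-rational planner paired with its negation.

```python
# Illustrative only: two (planner, reward) decompositions of the same behaviour.
actions = ["a", "b"]
observed_choice = "a"  # all we observe is the human's behaviour

R = {"a": 1.0, "b": 0.0}               # candidate reward: the human values "a"
neg_R = {k: -v for k, v in R.items()}  # the opposite values

def rational(reward):       # planner that maximises the given reward
    return max(actions, key=reward.get)

def anti_rational(reward):  # planner that minimises the given reward
    return min(actions, key=reward.get)

# Both decompositions reproduce the observed behaviour exactly:
assert rational(R) == observed_choice           # "rational human who values a"
assert anti_rational(neg_R) == observed_choice  # "anti-rational human who values b"
# Without assumptions about the planner (the human's rationality), behaviour
# alone cannot tell these apart, so almost any values can be attributed.
```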

continue reading »
17. Should I post technical ideas here or on LessWrong 2.0?
discussion post by Stuart Armstrong 349 days ago | Abram Demski likes this | 3 comments
18. Resolving human inconsistency in a simple model
post by Stuart Armstrong 349 days ago | Abram Demski likes this | 1 comment

A putative new idea for AI control; index here.

This post will present a simple model of an inconsistent human, and ponder how to resolve their inconsistency.

Let \(\mathbf{H}\) be our agent, in a turn-based world. Let \(R^l\) and \(R^s\) be two simple reward functions at each turn. The reward \(R^l\) is thought of as a ‘long-term’ reward, while \(R^s\) is a short-term one.
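Here is one way such a short-term/long-term conflict can play out (a toy sketch with assumed dynamics and numbers, not necessarily the post's model):

```python
# Assumed dynamics and numbers, purely for illustration. Each turn H chooses
# between "indulge" and "abstain"; R_s pays off per turn, R_l on the whole history.
TURNS = 5

def R_s(action):   # short-term reward, received this turn
    return 1.0 if action == "indulge" else 0.0

def R_l(history):  # long-term reward, received on the whole history
    return 10.0 if "indulge" not in history else 0.0

# H acts greedily on the short-term reward every single turn...
history = [max(["indulge", "abstain"], key=R_s) for _ in range(TURNS)]
print(history, sum(R_s(a) for a in history), R_l(history))  # 5 indulgences: 5.0 and 0.0

# ...while always abstaining would have scored 0.0 short-term and 10.0 long-term.
# The two rewards recommend incompatible behaviour; the question is how to
# resolve which of them the human "really" endorses.
```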

continue reading »
19. The Doomsday argument in anthropic decision theory
post by Stuart Armstrong 383 days ago | Abram Demski likes this | discuss

In Anthropic Decision Theory (ADT), behaviours that resemble the Self Sampling Assumption (SSA) derive from average utilitarian preferences (and from certain specific selfish preferences).

However, SSA implies the doomsday argument, and, until now, I hadn’t found a good way to express it within ADT.

This post fills that gap, by showing that there is a natural doomsday-like behaviour for average utilitarian agents within ADT.
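As a rough indication of the mechanism (a toy calculation with assumed numbers, not the post's own derivation): suppose there are two equally likely worlds, ‘doom soon’, containing only the \(n\) people currently alive, and ‘doom late’, containing \(N \gg n\) people in total. The \(n\) current people are offered a bet that costs \(c\) and pays \(g\) if doom is late. For an average utilitarian, accepting changes expected average utility by
\[\frac{1}{2}(-c) + \frac{1}{2}\cdot\frac{n(g-c)}{N},\]
which is positive only when \(g > c\left(1 + \tfrac{N}{n}\right)\). Demanding such long odds against ‘doom late’ is the behaviour of an agent assigning ‘doom soon’ a probability of roughly \(N/(N+n)\), which is the SSA-style doomsday shift.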

continue reading »
20."Like this world, but..."
post by Stuart Armstrong 430 days ago | discuss

A putative new idea for AI control; index here.

Pick a very unsafe goal: \(G=\)“AI, make this world richer and less unequal.” What does this mean as a goal, and can we make it safe?

I’ve started to sketch out how we can codify “human understanding” in terms of human ability to answer questions.

Here I’m investigating the reverse problem, to see whether the same idea can be used to give instructions to an AI.

continue reading »
21. Humans are not agents: short vs long term
post by Stuart Armstrong 466 days ago | 2 comments

A putative new idea for AI control; index here.

This is an example of humans not being (idealised) agents.

Imagine a human who has a preference not to live beyond a hundred years. However, they want to live to next year, and it’s predictable that, every year they are alive, they will have the same desire to survive till the next year.
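One way to make the inconsistency explicit (a toy check with hypothetical utility functions, purely for illustration): no single utility function over ‘age at death’ satisfies both the year-by-year preference and the stated cap.

```python
# Hypothetical utility functions over "age at death", purely for illustration.
def consistent(utility, max_age=120):
    # Year-by-year preference: at every age, surviving one more year is preferred.
    yearly = all(utility(age + 1) > utility(age) for age in range(max_age))
    # Stated preference: not living beyond a hundred years.
    cap = utility(100) > utility(101)
    return yearly and cap

# The yearly preferences chain into "longer is always better", which directly
# contradicts the cap, so no candidate utility satisfies both:
print(consistent(lambda age: age))              # yearly holds, cap fails -> False
print(consistent(lambda age: -abs(age - 100)))  # cap holds, yearly fails -> False
```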

continue reading »
22. New circumstances, new values?
discussion post by Stuart Armstrong 469 days ago | discuss
23. Futarchy, Xrisks, and near misses
discussion post by Stuart Armstrong 473 days ago | Abram Demski likes this | discuss
24. Divergent preferences and meta-preferences
post by Stuart Armstrong 477 days ago | discuss

A putative new idea for AI control; index here.

In simple graphical form, here is the problem of divergent human preferences:

continue reading »
25. Optimisation in manipulating humans: engineered fanatics vs yes-men
discussion post by Stuart Armstrong 481 days ago | discuss