Intelligent Agent Foundations Forum
1. Using lying to detect human values
link by Stuart Armstrong 97 days ago | discuss
2. Intuitive examples of reward function learning?
link by Stuart Armstrong 106 days ago | discuss
3. Beyond algorithmic equivalence: self-modelling
link by Stuart Armstrong 111 days ago | discuss
4. Beyond algorithmic equivalence: algorithmic noise
link by Stuart Armstrong 111 days ago | discuss
5. Why we want unbiased learning processes
post by Stuart Armstrong 120 days ago | discuss

Crossposted at Lesserwrong.

tl;dr: if an agent has a biased learning process, it may choose actions that are worse (with certainty) for every possible reward function it could be learning.
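As a minimal numeric sketch of how this can happen (illustrative numbers and a toy formalisation assumed here, in which the agent evaluates each action by the expected value of the reward function it will end up learning):

```python
# Minimal sketch (illustrative numbers, not from the post): the agent will end up
# learning either reward R_A or R_B, and evaluates each action by the expected
# value of whatever reward it ends up learning.

# Value of each action under each possible reward function.
reward = {
    "R_A": {"rig": 3.0, "wait": 4.0},
    "R_B": {"rig": 0.0, "wait": 1.0},
}
# Note that "rig" is strictly worse than "wait" under BOTH reward functions.

# Biased learning process: "rig" pushes the learned reward towards R_A.
p_learn = {
    "rig":  {"R_A": 1.0, "R_B": 0.0},   # biased by the agent's own action
    "wait": {"R_A": 0.5, "R_B": 0.5},   # matches the prior
}

def value(action):
    """Expected value of the reward the agent will have learned."""
    return sum(p_learn[action][r] * reward[r][action] for r in reward)

print(value("rig"))   # 3.0
print(value("wait"))  # 2.5  -> the biased learner prefers the dominated action

# With an unbiased (martingale) learning process, p_learn would equal the prior
# (0.5, 0.5) whatever the agent does, giving value("rig") = 1.5 < value("wait") = 2.5,
# and the dominated action would never be chosen.
```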

continue reading »
6. Oracle paper
discussion post by Stuart Armstrong 189 days ago | Vladimir Slepnev likes this | discuss
7. Stable agent, subagent-unstable
discussion post by Stuart Armstrong 204 days ago | discuss
8. Rationalising humans: another mugging, but not Pascal's
discussion post by Stuart Armstrong 218 days ago | discuss
9. Kolmogorov complexity makes reward learning worse
discussion post by Stuart Armstrong 226 days ago | discuss
10. Reward learning summary
post by Stuart Armstrong 226 days ago | discuss

A putative new idea for AI control; index here.

I’ve been posting a lot on value/reward learning recently, and, as usual, the process of posting (and some feedback) means that those posts are already partially superseded, and some of them are overly complex.

So here I’ll try to briefly summarise my current insights, with links to the other posts where appropriate (each link covers all the points noted since the previous link):

continue reading »
11. Our values are underdefined, changeable, and manipulable
discussion post by Stuart Armstrong 230 days ago | discuss
12. Normative assumptions: regret
discussion post by Stuart Armstrong 232 days ago | discuss
13. Bias in rationality is much worse than noise
discussion post by Stuart Armstrong 232 days ago | discuss
14. Learning values, or defining them?
discussion post by Stuart Armstrong 232 days ago | discuss
15. Rationality and overriding human preferences: a combined model
post by Stuart Armstrong 243 days ago | discuss

A putative new idea for AI control; index here.

Previously, I presented a model in which a “rationality module” (now renamed rationality planning algorithm, or planner) kept track of two things: how well a human was maximising their actual reward, and whether their preferences had been overridden by AI action.

The second didn’t integrate well into the first, and was tracked by a clunky extra Boolean. I was going to separate the two concepts, especially since the Boolean felt a bit too… Boolean, not allowing for any grading. But then I realised that they actually fit together completely naturally, without the need for arbitrary Booleans or other tricks.
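A hypothetical sketch of the data the planner is described as tracking, with the Boolean replaced by a graded quantity (the names and types are assumptions; the combined model itself is in the full post):

```python
from dataclasses import dataclass

@dataclass
class OldAssessment:
    reward_maximisation: float   # how well the human is maximising their actual reward
    overridden: bool             # clunky extra flag: were their preferences overridden by the AI?

@dataclass
class GradedAssessment:
    reward_maximisation: float   # as above
    override_degree: float       # in [0, 1]: allows grading rather than a bare Boolean
```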

continue reading »
16. Humans can be assigned any values whatsoever...
post by Stuart Armstrong 250 days ago | Ryan Carey likes this | discuss

A putative new idea for AI control; index here. Crossposted at LessWrong 2.0. This post has nothing really new for this message board, but I’m posting it here because of the subsequent posts I’m intending to write.

Humans have no values… nor does any agent. Unless you make strong assumptions about their rationality. And depending on those assumptions, you can get humans to have any values at all.
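A toy version of that degeneracy (a minimal formalisation assumed here, not the framework of the post): the same observed behaviour is consistent with rationally pursuing a reward R, anti-rationally pursuing -R, or ignoring rewards altogether.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3

# Some candidate reward R(state, action), and the human's observed policy pi.
R = rng.normal(size=(n_states, n_actions))
pi = R.argmax(axis=1)            # suppose the human happens to act greedily on R

def fully_rational(reward):      # planner: maximise the reward
    return reward.argmax(axis=1)

def fully_antirational(reward):  # planner: minimise the reward
    return reward.argmin(axis=1)

def stubborn(reward):            # planner: ignore the reward entirely
    return pi

# All three (planner, reward) pairs reproduce the same observed behaviour,
# so behaviour alone does not pin down the values:
assert (fully_rational(R) == pi).all()
assert (fully_antirational(-R) == pi).all()
assert (stubborn(np.zeros_like(R)) == pi).all()
```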

continue reading »
17. Should I post technical ideas here or on LessWrong 2.0?
discussion post by Stuart Armstrong 259 days ago | Abram Demski likes this | 3 comments
18. Resolving human inconsistency in a simple model
post by Stuart Armstrong 259 days ago | Abram Demski likes this | 1 comment

A putative new idea for AI control; index here.

This post will present a simple model of an inconsistent human, and ponder how to resolve their inconsistency.

Let \(\bf{H}\) be our agent, in a turn-based world. Let \(R^l\) and \(R^s\) be two simple reward functions at each turn. The reward \(R^l\) is thought of as a ‘long-term’ reward, while \(R^s\) is a short-term one.
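One hypothetical way to make the inconsistency concrete (the post's actual dynamics are behind the cut): suppose \(\bf{H}\) endorses \(R^l\) on reflection but acts on \(R^s\) each turn.

```python
# Hypothetical concretisation: R_l and R_s stand in for the per-turn rewards
# R^l and R^s above; the numbers are illustrative.

actions = ["work", "relax"]

def R_l(action):   # 'long-term' reward for this turn
    return {"work": 1.0, "relax": 0.0}[action]

def R_s(action):   # short-term reward for this turn
    return {"work": 0.0, "relax": 1.0}[action]

def endorsed_action():                     # what H says they value on reflection
    return max(actions, key=R_l)

def chosen_action():                       # what H actually does each turn
    return max(actions, key=R_s)

print(endorsed_action(), chosen_action())  # 'work' vs 'relax': the inconsistency
```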

continue reading »
19. The Doomsday argument in anthropic decision theory
post by Stuart Armstrong 293 days ago | Abram Demski likes this | discuss

In Anthropic Decision Theory (ADT), behaviours that resemble the Self Sampling Assumption (SSA) derive from average utilitarian preferences (and from certain specific selfish preferences).

However, SSA implies the doomsday argument, and, to date, I hadn’t found a good way to express the doomsday argument within ADT.

This post fills that gap, by showing that there is a natural doomsday-like behaviour for average utilitarian agents within ADT.
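A back-of-the-envelope illustration of where doomsday-like behaviour can come from (illustrative numbers and setup assumed here, not taken from the post): for an average utilitarian, a payoff to the current generation is diluted by the total population of the world it occurs in, so payoffs in a 'doom soon' world loom larger than the same payoffs in a 'doom late' world.

```python
# Illustrative numbers: an average utilitarian offered a bet on "doom soon"
# vs "doom late", paid out to the current generation only.

current = 1e10          # people alive now (same in both worlds)
N_soon  = 2e10          # total people ever, if doom comes soon
N_late  = 1e12          # total people ever, if doom comes late
prior   = 0.5           # objective probability of each world

stake = 1.0             # each current person gains `stake` if doom is soon,
                        # and loses `stake` if doom is late

# Change in *average* utility: total payoff divided by that world's total population.
gain_soon = stake * current / N_soon      # 0.5
loss_late = stake * current / N_late      # 0.01

expected_change = prior * gain_soon - prior * loss_late
print(expected_change)   # 0.245 > 0: the bet is accepted

# The agent keeps accepting such bets up to odds of roughly N_late / N_soon = 50 : 1
# against "doom soon", i.e. it behaves as if it had updated towards doom,
# which is the doomsday-argument-like behaviour.
```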

continue reading »
20. "Like this world, but..."
post by Stuart Armstrong 341 days ago | discuss

A putative new idea for AI control; index here.

Pick a very unsafe goal: \(G=\)“AI, make this world richer and less unequal.” What does this mean as a goal, and can we make it safe?

I’ve started to sketch out how we can codify “human understanding” in terms of human ability to answer questions.

Here I’m investigating the reverse problem, to see whether the same idea can be used to give instructions to an AI.

continue reading »
21. Humans are not agents: short vs long term
post by Stuart Armstrong 376 days ago | 2 comments

A putative new idea for AI control; index here.

This is an example of humans not being (idealised) agents.

Imagine a human who has a preference not to live beyond a hundred years. However, they want to live to see next year, and it’s predictable that, every year they are alive, they will have the same desire to survive until the next year.
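A tiny sketch of the conflict (framing assumed here): granting the yearly preference each year leads to an outcome the stated long-term preference rejects.

```python
LIMIT = 100                  # "prefer not to live beyond a hundred years"

age = 99
for _ in range(5):           # five more years of asking the same question
    wants_next_year = True   # every year alive, they prefer one more year
    if wants_next_year:
        age += 1

print(age)                   # 104: honouring each yearly preference violates
                             # the stated preference not to live beyond LIMIT
```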

continue reading »
22. New circumstances, new values?
discussion post by Stuart Armstrong 380 days ago | discuss
23. Futarchy, Xrisks, and near misses
discussion post by Stuart Armstrong 384 days ago | Abram Demski likes this | discuss
24. Divergent preferences and meta-preferences
post by Stuart Armstrong 388 days ago | discuss

A putative new idea for AI control; index here.

In simple graphical form, here is the problem of divergent human preferences:

continue reading »
25. Optimisation in manipulating humans: engineered fanatics vs yes-men
discussion post by Stuart Armstrong 391 days ago | discuss

RECENT COMMENTS

I found an improved version
by Alex Appel on A Loophole for Self-Applicative Soundness | 0 likes

I misunderstood your
by Sam Eisenstat on A Loophole for Self-Applicative Soundness | 0 likes

Caught a flaw with this
by Alex Appel on A Loophole for Self-Applicative Soundness | 0 likes

As you say, this isn't a
by Sam Eisenstat on A Loophole for Self-Applicative Soundness | 1 like

Note: I currently think that
by Jessica Taylor on Predicting HCH using expert advice | 0 likes

Counterfactual mugging
by Jessica Taylor on Doubts about Updatelessness | 0 likes

What do you mean by "in full
by David Krueger on Doubts about Updatelessness | 0 likes

It seems relatively plausible
by Paul Christiano on Maximally efficient agents will probably have an a... | 1 like

I think that in that case,
by Alex Appel on Smoking Lesion Steelman | 1 like

Two minor comments. First,
by Sam Eisenstat on No Constant Distribution Can be a Logical Inductor | 1 like

A: While that is a really
by Alex Appel on Musings on Exploration | 0 likes

> The true reason to do
by Jessica Taylor on Musings on Exploration | 0 likes

A few comments. Traps are
by Vadim Kosoy on Musings on Exploration | 1 like

I'm not convinced exploration
by Abram Demski on Musings on Exploration | 0 likes

Update: This isn't really an
by Alex Appel on A Difficulty With Density-Zero Exploration | 0 likes
