Intelligent Agent Foundations Forumsign up / log in
by Stuart Armstrong 340 days ago | link | parent

but the agent incorrectly observes the action

It’s a bit annoying that this has to rely on an incorrect observation. Why not replace the human action, in state \(s_2\), with a simple automated system that chooses \(a_1^H\)? It’s an easy mistake to make while programming, and the agent has no fundamental understanding of the difference between the human and an imperfect automated system.

Basically, if the human acts in perfect accordance with their preferences, and if the agent correctly observes and learns this, the agent will converge on the right values. You put wireheading by removing “correctly observes”, but I think removing “human acts in perfect accordance with their preferences” is a better example for wireheading.



by Tom Everitt 258 days ago | link

Adversarial examples for neural networks make situations where the agent misinterprets the human action seem plausible.

But it is true that the situation where the human acts irrationally in some state (e.g. because of drugs, propaganda) could be modeled in much the same way.

I preferred the sensory error since it doesn’t require a irrational human. Perhaps I should have been clearer that I’m interested in the agent wireheading itself (in some sense) rather than wireheading of the human.

(Sorry for being slow to reply – I didn’t get notified about the comments.)

reply



NEW LINKS

NEW POSTS

NEW DISCUSSION POSTS

RECENT COMMENTS

I think that in that case,
by Alex Appel on Smoking Lesion Steelman | 1 like

Two minor comments. First,
by Sam Eisenstat on No Constant Distribution Can be a Logical Inductor | 1 like

A: While that is a really
by Alex Appel on Musings on Exploration | 0 likes

> The true reason to do
by Jessica Taylor on Musings on Exploration | 0 likes

A few comments. Traps are
by Vadim Kosoy on Musings on Exploration | 1 like

I'm not convinced exploration
by Abram Demski on Musings on Exploration | 0 likes

Update: This isn't really an
by Alex Appel on A Difficulty With Density-Zero Exploration | 0 likes

If you drop the
by Alex Appel on Distributed Cooperation | 1 like

Cool! I'm happy to see this
by Abram Demski on Distributed Cooperation | 0 likes

Caveat: The version of EDT
by 258 on In memoryless Cartesian environments, every UDT po... | 2 likes

[Delegative Reinforcement
by Vadim Kosoy on Stable Pointers to Value II: Environmental Goals | 1 like

Intermediate update: The
by Alex Appel on Further Progress on a Bayesian Version of Logical ... | 0 likes

Since Briggs [1] shows that
by 258 on In memoryless Cartesian environments, every UDT po... | 2 likes

This doesn't quite work. The
by Nisan Stiennon on Logical counterfactuals and differential privacy | 0 likes

I at first didn't understand
by Sam Eisenstat on An Untrollable Mathematician | 1 like

RSS

Privacy & Terms