by Stuart Armstrong 401 days ago | link | parent

> but the agent incorrectly observes the action

It's a bit annoying that this has to rely on an incorrect observation. Why not replace the human action, in state $$s_2$$, with a simple automated system that chooses $$a_1^H$$? That is an easy mistake to make while programming, and the agent has no fundamental understanding of the difference between the human and an imperfect automated system.

Basically, if the human acts in perfect accordance with their preferences, and if the agent correctly observes and learns this, then the agent will converge on the right values. You get wireheading by removing "correctly observes", but I think removing "the human acts in perfect accordance with their preferences" is a better example of wireheading.
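To make the two failure modes concrete, here is a minimal sketch (all names and numbers are mine, purely illustrative) of a naive value learner that estimates the human's preference from observed actions. Dropping either assumption, "the human acts in accordance with their preferences" or "the agent correctly observes the action", breaks convergence in the same way:

```python
# Hypothetical toy model, not from the post under discussion.
# (a) perfect human + corrupted observation channel (sensory error), vs.
# (b) perfect observation + human replaced by an automated system
#     that always emits a fixed action (the replacement scenario above).
import random

TRUE_PREF = "a2"            # the action a rational human actually prefers
ACTIONS = ["a1", "a2"]

def human_action(rational=True):
    """The human's action in state s2; if not rational, a simple
    automated stand-in always picks a1."""
    return TRUE_PREF if rational else "a1"

def observe(action, error_rate=0.0):
    """The agent's sensor; with probability error_rate the observed
    action is flipped."""
    if random.random() < error_rate:
        return "a1" if action == "a2" else "a2"
    return action

def learn(n=10_000, rational=True, error_rate=0.0):
    """Naive learner: infer the human's preference from the majority
    of observed actions."""
    counts = {a: 0 for a in ACTIONS}
    for _ in range(n):
        counts[observe(human_action(rational), error_rate)] += 1
    return max(counts, key=counts.get)

random.seed(0)
print(learn())                      # both assumptions hold -> "a2"
print(learn(error_rate=0.9))        # misobservation        -> "a1"
print(learn(rational=False))        # human replaced        -> "a1"
```

From the agent's side the two corrupted runs are indistinguishable: in both it sees a stream of $$a_1^H$$ observations and converges on the wrong values.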

by Tom Everitt 318 days ago | link

Adversarial examples for neural networks make situations where the agent misinterprets the human action seem plausible. But it is true that the situation where the human acts irrationally in some states (e.g. because of drugs or propaganda) could be modeled in much the same way. I preferred the sensory error since it doesn't require an irrational human.

Perhaps I should have been clearer that I'm interested in the agent wireheading itself (in some sense) rather than in wireheading of the human.

(Sorry for being slow to reply – I didn't get notified about the comments.)

reply
