by Stuart Armstrong 388 days ago

How about something like this? I don’t expect this to work as stated, but it may suggest certain possibilities:

- There is a familiarity score $$F$$ which labels how close a situation is to one where humans have full and rapid understanding of what’s going on.
- In situations of high $$F$$, the human reward signals are taken as accurate.
- There are examples of situations of medium $$F$$ where humans, after careful deliberation, conclude that the reward signals were wrong.
- The prior is that for low $$F$$, there will be reward signals that are wrong but whose errors even careful human deliberation cannot discern.

The job of the learning algorithm is to deduce what these undetectable errors are by extending the results from medium $$F$$. This should not converge merely onto human approval, since human approval is explicitly modelled as sometimes wrong here.
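A toy illustration of the extrapolation step, offered only as a sketch under strong assumptions: the familiarity score is taken to be a number in [0, 1], each labelled example records whether deliberation found the reward signal wrong, and the error rate is assumed to follow a logistic curve in $$F$$. The names `fit_error_model` and `p_wrong` and all data values are hypothetical. The choice of functional form plays the role of the prior, and the sketch estimates only how often signals are wrong at low $$F$$, not which ones are wrong.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_error_model(examples, steps=5000, lr=0.5):
    """Fit P(reward signal wrong | F) = sigmoid(a + b * F) by gradient descent.

    examples: list of (F, was_wrong) pairs, with was_wrong in {0, 1}.
    The logistic form is an assumption of this sketch, not of the proposal.
    """
    a, b = 0.0, 0.0
    n = len(examples)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for f, wrong in examples:
            err = sigmoid(a + b * f) - wrong
            grad_a += err / n
            grad_b += err * f / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

def p_wrong(f, params):
    """Estimated probability that the reward signal is wrong at familiarity f."""
    a, b = params
    return sigmoid(a + b * f)

# Hypothetical data: high-F signals are taken as accurate (label 0);
# medium-F signals are sometimes judged wrong after deliberation (label 1).
examples = [
    (0.95, 0), (0.90, 0), (0.85, 0), (0.80, 0),
    (0.60, 0), (0.55, 1), (0.50, 0), (0.45, 1), (0.40, 1),
]

params = fit_error_model(examples)

# Extrapolate to a low-F situation where no deliberation labels exist.
low_f_error = p_wrong(0.1, params)
trust_in_signal = 1.0 - low_f_error
```

In this sketch the learner would discount human reward signals at low $$F$$ by `trust_in_signal` instead of treating approval as ground truth. How well the extrapolation generalizes depends entirely on the assumed curve.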

 by Jessica Taylor 387 days ago
 This seems pretty similar to this proposal; does that seem right to you? I think my main objection is the same as the main objection to the proposal I linked to: there has to be a good prior over “what the correct judgments are” such that, when this prior is updated on data, it correctly generalizes to cases where we can’t get feedback even in principle. It’s not even clear what “correct judgments” means (you can’t put a human in a box and have them think for 500 years).
 by Stuart Armstrong 387 days ago
 Not exactly that. What I’m trying to get at is that we know some of the features that failure would have (e.g. edge cases of utility maximalisation, seductive-seeming or seductively-presented answers), so we should be able to use that knowledge somehow.
