Generalizing Foundations of Decision Theory II   post by Abram Demski, 1 day ago
As promised in the previous post, I develop my formalism for justifying as many of the decision-theoretic axioms as possible with generalized Dutch-book arguments. (I'll use the term "generalized Dutch book" to refer to arguments with a family resemblance to Dutch-book or money-pump arguments.) The eventual goal is to relax these assumptions in a way which addresses bounded processing power, but for now the goal is to justify as much of classical decision theory as possible by a generalized Dutch book.
 
The Ubiquitous Converse Lawvere Problem   post by Scott Garrabrant, 13 days ago
 In this post, I give a stronger version of the open question presented here, and give a motivation for this stronger property. This came out of conversations with Marcello, Sam, and Tsvi.
Definition: A continuous function \(f:X\rightarrow Y\) is called ubiquitous if for every continuous function \(g:X\rightarrow Y\), there exists a point \(x\in X\) such that \(f(x)=g(x)\).
Open Problem: Does there exist a topological space \(X\) with a ubiquitous function \(f:X\rightarrow[0,1]^X\)?
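As a toy illustration of the "ubiquitous" property (my own example, not from the post): the identity map on \([0,1]\) is ubiquitous as a function into \([0,1]\), since by the intermediate value theorem any continuous \(g:[0,1]\rightarrow[0,1]\) must cross it. A minimal numeric sketch, with the function name and the sample \(g\) being my choices:

```python
import math

def agreement_point(g, lo=0.0, hi=1.0, tol=1e-12):
    """Find x in [lo, hi] with x == g(x), i.e. a point where the
    identity agrees with g, by bisection on h(x) = g(x) - x.

    Works for any continuous g: [0,1] -> [0,1], since h(0) >= 0 and
    h(1) <= 0, so the intermediate value theorem guarantees a root.
    """
    h = lambda x: g(x) - x
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if h(mid) >= 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Example: g = cos restricted to [0,1] (its image lies in [cos 1, 1]).
# The identity agrees with g at the fixed point of cos, roughly 0.739.
x = agreement_point(math.cos)
```

The open problem asks whether this phenomenon can happen one level up, with the codomain \([0,1]^X\) depending on the domain itself.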
 
Agents that don't become maximisers   post by Stuart Armstrong, 17 days ago
According to the basic AI drives thesis, (almost) any agent capable of self-modification will self-modify into an expected utility maximiser.
The typical examples are inconsistent utility maximisers, satisficers, and unexploitable agents, and it's easy to think that all agents fall roughly into these broad categories. There's also the observation that, when looking at full policies rather than individual actions, many biased agents become expected utility maximisers (unless they want to lose pointlessly).
Nevertheless… there is an entire category of agents that generically seem not to self-modify into maximisers. These are agents that attempt to maximise \(f(\mathbb{E}(U))\), where \(U\) is some utility function, \(\mathbb{E}(U)\) is its expectation, and \(f\) is a function that is neither wholly increasing nor decreasing.
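A minimal sketch of why such an agent resists becoming a maximiser (the lotteries, the particular \(f\), and all names are my illustrative choices, not from the post): with a non-monotonic \(f\), the agent prefers a middling expectation, so replacing itself with a plain \(\mathbb{E}(U)\) maximiser would change its choices, which it disprefers by its own lights.

```python
def expectation(lottery):
    """Lottery: list of (probability, utility) pairs."""
    return sum(p * u for p, u in lottery)

def f(x):
    # Non-monotonic: peaks at E(U) = 0.5 and falls off on either side.
    return -(x - 0.5) ** 2

lotteries = {
    "safe":   [(1.0, 0.5)],              # E(U) = 0.5
    "greedy": [(1.0, 1.0)],              # E(U) = 1.0
    "risky":  [(0.5, 0.0), (0.5, 1.0)],  # E(U) = 0.5
}

f_choice  = max(lotteries, key=lambda k: f(expectation(lotteries[k])))
eu_choice = max(lotteries, key=lambda k: expectation(lotteries[k]))
# The f(E(U)) agent picks a lottery with expectation 0.5, while the
# plain E(U) maximiser picks "greedy": their choices genuinely diverge.
```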
 
Understanding the important facts   post by Stuart Armstrong, 18 days ago
 I’ve got a partial design for motivating an AI to improve human understanding.
However, the AI is rewarded for generic human understanding of many variables, most of them quite pointless from our perspective. Can we motivate the AI to ensure our understanding of the variables we find important? The presence of free humans, say, rather than the air pressure in Antarctica?
 
Nearest unblocked strategy versus learning patches   post by Stuart Armstrong, 59 days ago
The nearest unblocked strategy (NUS) problem is the idea that if you program a restriction or a patch into an AI, the AI will often be motivated to pick a strategy that is as close as possible to the banned one: very similar in form, and maybe just as dangerous.
For instance, if the AI is maximising a reward \(R\), and does some behaviour \(B_i\) that we don't like, we can patch the AI's algorithm with patch \(P_i\) ('maximise \(R\) subject to these constraints…'), or modify \(R\) to \(R_i\) so that \(B_i\) doesn't come up. I'll focus more on the patching example, but the modified-reward one is similar.
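The dynamic can be sketched in a few lines (a toy model of my own: strategies are points on a line, reward rises toward the region we dislike, and each patch excludes only the single behaviour we noticed):

```python
def reward(s):
    # Reward increases with s; high-s behaviour is what we dislike.
    return s

strategies = [round(0.1 * i, 1) for i in range(11)]  # 0.0, 0.1, ..., 1.0
banned = set()

chosen = []
for _ in range(3):
    # The agent maximises reward over the strategies not yet patched out.
    best = max((s for s in strategies if s not in banned), key=reward)
    chosen.append(best)
    banned.add(best)  # patch P_i: ban the behaviour we just observed

# Each patch merely shifts the agent to the nearest unblocked strategy:
# barely different in form, and likely almost as bad.
```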
 
Entangled Equilibria and the Twin Prisoners' Dilemma   post by Scott Garrabrant, 72 days ago
In this post, I present a generalization of Nash equilibria to non-CDT agents. I will use this formulation to model mutual cooperation in a twin prisoners' dilemma, caused by the belief that the other player is similar to you, rather than by mutual prediction. (This post came mostly out of a conversation with Sam Eisenstat, with contributions from Tsvi Benson-Tilsen and Jessica Taylor.)
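The similarity-driven mechanism can be sketched numerically (a simplification of my own, not the post's formalism: you believe the twin mirrors your action with probability \(p\), and otherwise acts at random; the payoffs are the standard prisoners' dilemma values):

```python
PAYOFF = {  # (my move, their move) -> my payoff; C = cooperate, D = defect
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def expected_payoff(my_move, p):
    """p = probability the twin mirrors my move; otherwise uniform over C/D."""
    mirrored = PAYOFF[(my_move, my_move)]
    random_play = 0.5 * PAYOFF[(my_move, "C")] + 0.5 * PAYOFF[(my_move, "D")]
    return p * mirrored + (1 - p) * random_play

def best_move(p):
    return max(["C", "D"], key=lambda m: expected_payoff(m, p))

# E[C] = 3p + 1.5(1-p) and E[D] = p + 3(1-p), so cooperation wins
# whenever p > 3/7: a strong enough belief in similarity yields mutual
# cooperation with no prediction machinery at all.
```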
   