A putative new idea for AI control; index here.
A simple design for a \(u\)-maximising agent that turns into a \(u\)-minimising one.
Let \(X\) be some boolean random variable outside the agent’s control, that will be determined at some future time \(t\) (based on a cosmic event, maybe?). Set it up so that \(P(X=1)=\epsilon\), and for a given utility \(u\) consider the utility:
- \(u^\# = (2/\epsilon)Xu - u\).
Before \(t\), the expected value of \((2/\epsilon)X\) is \((2/\epsilon)\epsilon = 2\), so in expectation \(u^\# = 2u - u = u\). Hence the agent behaves as a \(u\)-maximiser. After \(t\), the most likely outcome is \(X=0\); once the agent's credence in \(X=1\) drops below \(\epsilon/2\), the expected coefficient on \(u\) turns negative, so a little evidence to that effect is enough to make \(u^\#\) into a \(u\)-minimiser.
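As a sanity check, here is a minimal numerical sketch (hypothetical names, not from the post) of the expected coefficient on \(u\) as a function of the agent's credence \(p = P(X=1)\):

```python
# Coefficient multiplying u in u# = (2/eps) * X * u - u,
# in expectation, given the agent's credence p = P(X = 1).
def u_sharp_coeff(p, eps):
    return (2.0 / eps) * p - 1.0

eps = 0.125  # P(X = 1); a power of two keeps the floats exact

# Before t the credence equals the prior, p = eps, so the
# coefficient is 2 - 1 = 1: u# behaves exactly like u.
assert u_sharp_coeff(eps, eps) == 1.0

# The sign flips once p drops below eps / 2, i.e. once the
# credence in X = 1 has halved relative to the prior.
assert u_sharp_coeff(eps / 2, eps) == 0.0
assert u_sharp_coeff(eps / 4, eps) < 0  # now a u-minimiser
```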
This isn’t perfect corrigibility — the agent would be willing to sacrifice a bit of \(u\)-value (before \(t\)) in order to maintain its flexibility after \(t\). To combat this effect, we could instead use:
- \(u^\# = \Omega(2/\epsilon)Xu - u\).
If \(\Omega\) is large, then the agent is willing to pay only very little \(u\)-value to maintain flexibility. However, the amount of evidence for \(X=0\) that it needs before becoming a \(u\)-minimiser also grows in proportion to \(\Omega\) (its credence in \(X=1\) must fall below \(\epsilon/(2\Omega)\)), so \(X\) had better be a clear and convincing event.
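The same sanity check for the second design (again with hypothetical names) shows both effects at once: the pre-\(t\) coefficient grows like \(2\Omega\), while the flip threshold shrinks like \(\epsilon/(2\Omega)\):

```python
# Expected coefficient on u in u# = Omega * (2/eps) * X * u - u,
# given the agent's credence p = P(X = 1).
def u_sharp_coeff(p, eps, omega):
    return omega * (2.0 / eps) * p - 1.0

eps, omega = 0.125, 64.0  # powers of two keep the floats exact

# Before t (p = eps) the coefficient is 2*Omega - 1: sacrificing a
# unit of u-value for post-t flexibility costs ~2*Omega units of u#,
# so the agent pays very little for flexibility.
assert u_sharp_coeff(eps, eps, omega) == 2 * omega - 1

# But the sign only flips once p < eps / (2 * Omega): the required
# evidence against X = 1 scales with Omega.
assert u_sharp_coeff(eps / (2 * omega), eps, omega) == 0.0
assert u_sharp_coeff(eps / (4 * omega), eps, omega) < 0
```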