|Nearest unblocked strategy versus learning patches|
| post by Stuart Armstrong 28 days ago | 9 comments|
The nearest unblocked strategy problem (NUS) is the idea that if you program a restriction or a patch into an AI, the AI will often be motivated to pick a strategy as close as possible to the banned one: very similar in form, and possibly just as dangerous.
For instance, if the AI is maximising a reward \(R\) and exhibits some behaviour \(B_i\) that we don’t like, we can patch the AI’s algorithm with patch \(P_i\) (‘maximise \(R\) subject to these constraints…’), or modify \(R\) to \(R_i\) so that \(B_i\) doesn’t come up. I’ll focus on the patching example, but the modified-reward case is similar.
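The patching dynamic can be sketched in a few lines of Python. This is a toy illustration under assumed names (`best_allowed`, a one-dimensional strategy space, an identity reward), not anything from the post itself: each patch bans one strategy, and the maximiser simply moves to the nearest remaining one.

```python
def best_allowed(strategies, R, patches):
    """Maximise reward R over strategies not excluded by any patch."""
    allowed = [s for s in strategies if not any(p(s) for p in patches)]
    return max(allowed, key=R)

# Hypothetical 1-D strategy space: higher values give more reward,
# but suppose everything near the top is behaviour we dislike.
strategies = [x / 10 for x in range(200)]   # 0.0, 0.1, ..., 19.9
R = lambda s: s

patches = []
print(best_allowed(strategies, R, patches))   # picks 19.9

patches.append(lambda s: s == 19.9)           # ban the offending strategy
print(best_allowed(strategies, R, patches))   # picks 19.8 -- nearly identical
```

After each patch the agent selects the next-closest unblocked strategy, which in this toy setting differs only marginally from the one we banned.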
|Entangled Equilibria and the Twin Prisoners' Dilemma|
| post by Scott Garrabrant 40 days ago | Vadim Kosoy and Patrick LaVictoire like this | 2 comments|
In this post, I present a generalization of Nash equilibria to non-CDT agents. I will use this formulation to model mutual cooperation in a twin prisoners’ dilemma, caused by the belief that the other player is similar to you, and not by mutual prediction. (This post came mostly out of a conversation with Sam Eisenstat, as well as contributions from Tsvi Benson-Tilsen and Jessica Taylor)
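The twin case can be illustrated concretely. The sketch below is my own toy example with standard prisoners’-dilemma payoffs (not code from the post): holding the opponent’s action fixed, defection dominates; but if you believe the other player is an exact copy whose action will equal yours, cooperation comes out on top.

```python
# Standard prisoners' dilemma payoffs for the row player
# (C = cooperate, D = defect).
payoff = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

# CDT-style reasoning: treat the opponent's action as fixed.
best_vs = {opp: max('CD', key=lambda a: payoff[(a, opp)]) for opp in 'CD'}
print(best_vs)   # {'C': 'D', 'D': 'D'} -- defection dominates

# Twin reasoning: the other player is a copy, so their action equals yours.
best_twin = max('CD', key=lambda a: payoff[(a, a)])
print(best_twin)  # 'C' -- mutual cooperation beats mutual defection
```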
|True understanding comes from passing exams|
| post by Stuart Armstrong 45 days ago | 5 comments|
I’ll try to clarify what I was doing with the AI truth setup in a previous post. First I’ll explain the nature of the challenge, and then how the setup tries to solve it.
The nature of the challenge is to have an AI give genuine understanding to a human. Getting the truth out of an AI or Oracle is not that hard, conceptually: you get the AI to report some formal property of its model. The problem is that this truth can be completely misleading or, more likely, incomprehensible.
|Humans as a truth channel|
| post by Stuart Armstrong 50 days ago | discuss|
Defining truth and accuracy is tricky, so when I’ve proposed designs for things like Oracles, I’ve either used a very specific and formal question, or an indirect criterion for truth.
Here I’ll try and get a more direct system so that an AI will tell the human the truth about a question, so that the human understands.
| post by Stuart Armstrong 54 days ago | Ryan Carey likes this | discuss|
Suppose we knew that superintelligent AI would be developed within six months. What would I do?
Well, drinking coffee by the barrel at MIRI’s emergency research retreat, I’d… still probably spend a month looking at things from the meta level and clarifying old ideas. But, assuming that didn’t reveal any new approaches, I’d try to get something like this working.
|Thoughts on Quantilizers|
| post by Stuart Armstrong 55 days ago | Ryan Carey and Abram Demski like this | discuss|
This post will look at some of the properties of quantilizers, when they succeed and how they might fail.
Roughly speaking, let \(f\) be some true objective function that we want to maximise. We haven’t been able to specify it fully, so we instead have a proxy function \(g\). There is a cost function \(c=f-g\) which measures how much \(g\) falls short of \(f\). A quantilizer then chooses actions (or policies) randomly from the top \(n\%\) of actions available, ranking those actions according to \(g\).
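The sampling rule just described can be written as a short sketch. This is a minimal illustration, assuming a finite action set and a hypothetical helper name `quantilize`; it ranks actions by the proxy \(g\) and samples uniformly from the top fraction rather than taking the single argmax.

```python
import random

def quantilize(actions, g, top_fraction=0.05):
    """Sample uniformly from the top fraction of actions, ranked by
    the proxy objective g (higher is better)."""
    ranked = sorted(actions, key=g, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return random.choice(ranked[:k])

# Toy usage: 100 actions, proxy g is the identity.
# A 10%-quantilizer returns some action scoring in the top ten,
# but not reliably the extreme maximiser that g might be misrating.
action = quantilize(list(range(100)), lambda a: a, top_fraction=0.1)
```

The point of randomising over the top quantile, rather than maximising \(g\) outright, is that the extreme maximisers of \(g\) are exactly where the cost \(c = f - g\) is most likely to be large.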