An Idea For Corrigible, Recursively Improving Math Oracles

post by Jim Babcock 1311 days ago | Vadim Kosoy, Abram Demski, Jessica Taylor, Nate Soares and Scott Garrabrant like this | 3 comments

Math oracles are a special type of AI which answers questions about math: proving theorems, or finding mathematical objects with desired properties and proving that they have them. While a superintelligent math oracle could easily be used to construct a dangerous agent-like AI (for example, by asking it to find an object which is a utility maximizer), it could also be used to empower an FAI research team by solving mathematically well-specified parts of the problem and proving that the AI has the properties it is supposed to.

There is one glaring problem with a superintelligent math oracle: safely constructing one. To be effective enough, a math oracle would need to identify and solve subproblems, and would benefit enormously from improving its own problem-solving capability. If constructed carelessly, this would lead to an AI which wanted to take over computers and other resources and turn them to studying the problem it's interested in.

I propose a structure for a math oracle which I claim can recursively self-improve, but which will not try to take over resources, trick its human operators, or improve itself faster than the intended rate, and which will switch itself off at each self-improvement step and at regular intervals. The overall structure is that you start with a math oracle and repeatedly ask "what would be a better math oracle", where "better" is defined in terms of what would happen when run on an idealized mathematical model of a computer (rather than on the physics-embedded computer that will actually be used).

$$M(p,q,t)$$ is the state of an idealized computer running program $$p$$ on question $$q$$ after $$t$$ steps. We say that a program shuts down at $$t$$ if $$M(p,q,t)=0$$ (the machine-state is halted).
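The idealized-machine definitions above can be sketched in code. This is a minimal sketch, not the post's formalism: the convention that the initial state pairs the program with its question, and the `step` transition function, are illustrative assumptions; only "state 0 means halted" comes from the post.

```python
HALTED = 0  # the distinguished halted machine-state, per the definition M(p,q,t)=0

def M(p, q, t, step):
    """State of an idealized computer running program p on question q
    after t steps; `step` is an assumed one-step transition function."""
    state = (p, q)  # assumed convention: initial state is the program with its question
    for _ in range(t):
        if state == HALTED:
            return HALTED  # a halted machine stays halted
        state = step(state)
    return state

def shuts_down(p, q, t, step):
    """A program shuts down at t if M(p, q, t) = 0 (the machine is halted)."""
    return M(p, q, t, step) == HALTED
```

Note that `shuts_down` is monotone once reached: because a halted machine stays halted, a program that shuts down at $$t$$ also shuts down at every later time.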
$$PV(thm,pf)$$ is a proof verifier which returns true if $$pf$$ is a valid proof of $$thm$$, and false otherwise.

Questions are properties for which you want to find an object exhibiting the property, together with a proof that it does. So $$(obj,pf)$$ is an answer to a question $$q$$ if $$PV(q(obj),pf)$$. An oracle is a program which attempts to answer questions. We say that an oracle program $$p$$ answers a question $$q$$ in time $$t$$ if $$M(p,q,t)=(obj,pf)$$ and $$PV(q(obj),pf)$$. An oracle program has a strength with respect to a probability distribution over questions $$Q$$ and an amount of time $$t$$, written $$str(p,Q,t)$$: the probability that it will answer a question drawn from $$Q$$ in time $$t$$.

With these definitions in place, we can ask an oracle $$p$$ with strength $$s$$ to make a better oracle with strength $$s'$$, and to guarantee that it shuts down when finished, by inputting $$q(x) = str(x,Q,t) \geq s' \land \forall q \in Q.\ M(x,q,t+1)=0$$.

This oracle, when asked to produce successors in this way, will not try to produce successors that probe for weaknesses to escape a box, because instead of optimizing for how well the successor would do in an actual physically embedded system, it is optimizing for how well the successor would do in a counterfactual idealized system. This is not a guarantee, however; it might be that it is easier to produce a system which does probe for such weaknesses, because this behavior arises out of some general principle. This can be somewhat improved by tweaking the strength metric to favor proofs arrived at earlier; this would incentivize it to make an oracle which does not waste time, and which does not make use of the extra time between when it finds a proof of something and when it reaches the deadline.
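The notions of answering, strength, and the successor question can be sketched as follows. This is a hedged illustration: $$M$$ and $$PV$$ are taken as given parameters, and estimating $$str$$ by averaging over a finite sample from $$Q$$ is an assumption for concreteness (the post defines it as an exact probability).

```python
def answers(p, q, t, M, PV):
    """p answers q in time t if M(p, q, t) = (obj, pf) and PV(q(obj), pf)."""
    result = M(p, q, t)
    if not (isinstance(result, tuple) and len(result) == 2):
        return False  # the machine-state is not an (obj, pf) pair
    obj, pf = result
    return PV(q(obj), pf)

def strength(p, Q_samples, t, M, PV):
    """Estimate str(p, Q, t): the probability that p answers a question
    drawn from Q in time t, averaged over a finite sample from Q."""
    return sum(answers(p, q, t, M, PV) for q in Q_samples) / len(Q_samples)

def successor_question(Q_samples, t, s_prime, M, PV):
    """The question q(x) = str(x, Q, t) >= s'  and  forall q in Q: M(x, q, t+1) = 0."""
    def q(x):
        return (strength(x, Q_samples, t, M, PV) >= s_prime
                and all(M(x, qi, t + 1) == 0 for qi in Q_samples))
    return q
```

The shutdown conjunct checks $$t+1$$ rather than $$t$$ because the answer itself is the machine-state at $$t$$; the successor must be halted one step after producing it.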

 by Vadim Kosoy 1311 days ago | Jim Babcock likes this | link This is in the vein of the approach I call "epistemic boxing". The probability distribution over questions corresponds to the external-inputs part of the stochastic model of the box. reply
 by Jim Babcock 1311 days ago | link Thanks to Anna Salamon for the idea of making an AI which cares about what happens in a counterfactual ideal world, rather than the real world with the transistors in it, as a corrigibility strategy. I haven't yet been able to find a way to make that idea work for an agent/utility maximizer, but it inspired the idea of doing the same thing in an oracle. reply
 by Stuart Armstrong 1282 days ago | link You could have an agent that cares about what an idealised counterfactual human would think about its decisions (if the idealised human had a huge amount of time to think them over). Compare with Paul Christiano’s ideas. Now, this isn’t safe, but it’s at least something you might be able to play with. reply
