Intelligent Agent Foundations Forumsign up / log in
An Idea For Corrigible, Recursively Improving Math Oracles
post by Jim Babcock 944 days ago | Vadim Kosoy, Abram Demski, Jessica Taylor, Nate Soares and Scott Garrabrant like this | 3 comments

Math oracles are a special type of AI which answers questions about math - proving theorems, or finding mathematical objects with desired properties and proving they have them. While a superintelligent math oracle could easily be used to construct a dangerous agent-like AI (such as by asking it to find an object which is a utility maximizer), it could also be used to empower an FAI research team by solving mathematically-well-specified parts of the problem and proving that the AI has the properties it is supposed to.

There is one glaring problem with a superintelligent math oracle, which is safely constructing one. To be effective enough, a math oracle would need to identify and solve subproblems, and would enormously benefit from improving its own problem-solving capability. If constructed carelessly, this would lead to an AI which anted to take over computers and other resources and turn them to studying the problem it’s interested in.

I propose a structure for a math oracle which I claim can recursively self-improve but which will not try to take over resources, trick its human operators, or improve itself at faster than the intended rate, and which will switch itself off at each self-improvement step and at regular intervals. The overall structure is that you start with a math oracle and repeatedly ask “what would be a better math oracle”, where “better” is defined in terms of what would happen when run on an idealized mathematical model of a computer (rather than on the physics-embedded computer that will actually be used).

\(M(p,q,t)\) is the state of an idealized computer running program \(p\) on question \(q\) after \(t\) steps

We say that a program shuts down at \(t\) if \(M(p,q,t)=0\) (the machine-state is halted).

\(PV(thm,pf)\) is a proof verifier which returns true if \(pf\) is a valid proof of \(thm\), false otherwise

Question are properties for which you want to find an object which exhibits the property, and a proof that it does. So an answer \((obj,pf)\) is an answer to a question \(q\) if \(PV(q(obj),pf)\).

An oracle is a program which attempts to answer questions. We say that an oracle program \(p\) answers a question in time \(t\) if \(M(p,q,t)=(obj,pf)\) and \(PV(q(obj),pf)\). An oracle program has a strength with respect to a probability distribution over questions \(Q\) and an amount of time, \(str(p,Q,t)\) which is the probability that it will answer a question drawn from \(q\) in time \(t\).

With these definitions in place, we can ask an oracle \(p\) with strength \(s\) to make a better oracle with strength \(s'\), and to guarantee that it shuts down when finished, by inputting \(q(x)=str(x,Q,t)>=s' \land \forall q \in Q . M(x,q,t+1)=0\).

This oracle, when asked to produce successors in this way, will not try to produce successors that probe for weaknesses to escape a box, because instead of optimizing for how well the successor would do in an actual physically embedded system, it is optimizing for how well the successor would do in a counterfactual idealized system. This is not a guarantee, however; it might be that it is easier to produce a system which does probe for such weaknesses, because this behavior arises out of some general principles. This can be somewhat improved by tweaking the strength metric by making it favor proofs arrived at earlier; this would incentivize it to make an oracle which does not waste time and which does not make use of extra time in between when it finds a proof of something and when it reaches the deadline.



by Vadim Kosoy 944 days ago | Jim Babcock likes this | link

This is in the vein of the approach I call “epsitemic boxing”. The probability distribution over questions corresponds to the external inputs part of the stochastic model of the box.

reply

by Jim Babcock 944 days ago | link

Thanks Anna Salamon for the idea of making an AI which cares about what happens in a counterfactual ideal world, rather than the real world world with the transistors in it, as a corrigibility strategy. I haven’t yet been able to find a way to make that idea work for an agent/utility maximizer, but it inspired the idea of doing the same thing in an oracle.

reply

by Stuart Armstrong 914 days ago | link

You could have an agent that cares about what an idealised counterfactual human would think about its decisions (if the idealised human had a huge amount of time to think them over). Compare with Paul Christiano’s ideas.

Now, this isn’t safe, but it’s at least something you might be able to play with.

reply



NEW LINKS

NEW POSTS

NEW DISCUSSION POSTS

RECENT COMMENTS

[Delegative Reinforcement
by Vadim Kosoy on Stable Pointers to Value II: Environmental Goals | 1 like

Intermediate update: The
by Alex Appel on Further Progress on a Bayesian Version of Logical ... | 0 likes

Since Briggs [1] shows that
by 258 on In memoryless Cartesian environments, every UDT po... | 2 likes

This doesn't quite work. The
by Nisan Stiennon on Logical counterfactuals and differential privacy | 0 likes

I at first didn't understand
by Sam Eisenstat on An Untrollable Mathematician | 1 like

This is somewhat related to
by Vadim Kosoy on The set of Logical Inductors is not Convex | 0 likes

This uses logical inductors
by Abram Demski on The set of Logical Inductors is not Convex | 0 likes

Nice writeup. Is one-boxing
by Tom Everitt on Smoking Lesion Steelman II | 0 likes

Hi Alex! The definition of
by Vadim Kosoy on Delegative Inverse Reinforcement Learning | 0 likes

A summary that might be
by Alex Appel on Delegative Inverse Reinforcement Learning | 1 like

I don't believe that
by Alex Appel on Delegative Inverse Reinforcement Learning | 0 likes

This is exactly the sort of
by Stuart Armstrong on Being legible to other agents by committing to usi... | 0 likes

When considering an embedder
by Jack Gallagher on Where does ADT Go Wrong? | 0 likes

The differences between this
by Abram Demski on Policy Selection Solves Most Problems | 1 like

Looking "at the very
by Abram Demski on Policy Selection Solves Most Problems | 0 likes

RSS

Privacy & Terms