Intelligent Agent Foundations Forum
A cheating approach to the tiling agents problem
post by Vladimir Slepnev 169 days ago | Alex Mennen, Vadim Kosoy and Abram Demski like this | 2 comments

(This post resulted from a conversation with Wei Dai.)

Formalizing the tiling agents problem is very delicate. In this post I’ll show a toy problem and a solution to it, which arguably meets all the desiderata stated before, but only by cheating in a new and unusual way.

Here’s a summary of the toy problem: we ask an agent to solve a difficult math question and also design a successor agent. Then the successor must solve another math question and design its own successor, and so on. The questions get harder each time, so they can’t all be solved in advance, and each of them requires believing in Peano arithmetic (PA). This goes on for a fixed number of rounds, and the final reward is the number of correct answers.

Moreover, we will demand that the agent must handle both subtasks (solving the math question and designing the successor) using the same logic. Finally, we will demand that the agent be able to reproduce itself on each round, not just design a custom-made successor that solves the math question with PA and reproduces itself by quining.


To nail down the formalization, we will use good old programs with proof limits, not oracles or modal formulas. The universe program U will contain the agent program A as a subprogram, and A will know the full source code of U (including A) by diagonalization. At each round, the agent will receive the text of a math question as input, and return a pair consisting of the answer to the math question and the source code of the successor agent.

The math questions will be of the form “you’re allowed to use N_i computation steps, here’s the source code of a program P_i, what does it return?” The programs P_i will be chosen so that N_i steps isn’t enough to run them to completion, but is more than enough to find a PA proof that P_i returns some particular value.
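
To fix ideas, here is a minimal Python sketch of the round structure. The names (U, run_agent, evaluate_answer, QUESTIONS) are placeholders of my own, not part of the problem statement, and the proof-search machinery is left as stubs.

    # Hypothetical sketch of the universe program U. Each question is a pair
    # (P_source, N): the source code of a program P_i and a step budget N_i.
    # The agent is given the question and must return (answer, successor_source).

    QUESTIONS = []  # (P_source, N) pairs, chosen so that N is too small to run P_i
                    # to completion but ample for finding a PA proof of its output

    def run_agent(agent_source, question):
        # Placeholder: execute the agent's source code on the question
        # and collect the (answer, successor_source) pair it returns.
        raise NotImplementedError

    def evaluate_answer(P_source, N, answer):
        # Placeholder: the three-step verification procedure described below.
        raise NotImplementedError

    def U(initial_agent_source):
        agent_source = initial_agent_source
        utility = 0
        for (P_source, N) in QUESTIONS:
            answer, successor_source = run_agent(agent_source, (P_source, N))
            if evaluate_answer(P_source, N, answer):
                utility += 1
            agent_source = successor_source   # the successor takes over next round
        return utility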

To give the agent some assurance of the above, the universe program will use the following transparent procedure to evaluate the agent’s answer:

  1. Run P_i to completion, and compare the result with the agent’s answer.

  2. Verify that a PA proof that P_i returns some particular value can be found in much fewer than N_i computation steps. Ignore the value given by the proof.

  3. Increment the agent’s utility if (1) succeeded or (2) failed.
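
Continuing the sketch from above, the check might look like this; run_to_completion and pa_proof_found_within are placeholder names for "run P_i with no step limit" and "bounded search for a PA proof that P_i returns some particular value".

    def run_to_completion(P_source):
        # Placeholder: run P_i with no step limit and return its output.
        raise NotImplementedError

    def pa_proof_found_within(P_source, step_budget):
        # Placeholder: search, within step_budget steps, for a PA proof that P_i
        # returns some particular value; report only whether a proof was found.
        raise NotImplementedError

    def evaluate_answer(P_source, N, answer):
        # 1. Run P_i to completion and compare the result with the agent's answer.
        answer_correct = (answer == run_to_completion(P_source))
        # 2. Check that a PA proof of P_i's output can be found in much fewer
        #    than N steps (the value in the proof itself is ignored).
        question_was_fair = pa_proof_found_within(P_source, N // 100)
        # 3. Give the agent the point if (1) succeeded or (2) failed.
        return answer_correct or not question_was_fair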

We will also assume that all math questions are so difficult, and all N_i are so large, that doing any fixed work up front is a rounding error by comparison.


The above sounds like a fair formulation of the tiling agents problem, right? The Vingean principle is satisfied because the initial agent doesn’t have enough time to solve all math questions up front. The tiling principle is satisfied because we demand a solution that will reproduce itself at each round. The naturalistic principle is satisfied because we demand a solution that will treat both subtasks uniformly (solve the math question and design a successor).

Now I will show a solution that seems to meet all desiderata. It will be based on the theory PA+Con(PA), which is the weakest theory we can get away with, because the problem requires creating successor agents that trust PA. But you will still say that I’m cheating.


Let the agent A be this program:

“Ignore my input at first. Using a small number of computation steps, and using PA+Con(PA) as the theory, search for programs B such that, if all instances of A internally delegate all input-output to B, then U will provably return some particular utility. Then internally delegate to the B that leads to the highest utility found.”
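
Rendered in the same sketchy Python, with provable_in, enumerate_short_programs, run_program and the constants all being placeholders of mine rather than anything from the problem statement:

    A_SOURCE = "<source code of A, available to A by diagonalization>"
    U_SOURCE = "<source code of U, including A>"
    NUM_ROUNDS = 10        # assumed known number of rounds
    SMALL_BUDGET = 10**6   # "a small number of computation steps"

    def enumerate_short_programs():
        # Placeholder: yield source code of candidate delegate programs B.
        return []

    def provable_in(theory, statement, step_budget):
        # Placeholder: bounded proof search for `statement` in `theory`.
        return False

    def run_program(source, inp):
        # Placeholder: execute `source` on input `inp` and return its output.
        raise NotImplementedError

    def A(question):
        # Ignore the input at first; search in PA+Con(PA) for delegates B.
        best = None   # (provable utility, B's source)
        for B_source in enumerate_short_programs():
            for u in range(NUM_ROUNDS, -1, -1):          # try high utilities first
                claim = ("if all instances of %s internally delegate their "
                         "input-output to %s, then %s returns utility %d"
                         % (A_SOURCE, B_source, U_SOURCE, u))
                if provable_in("PA + Con(PA)", claim, SMALL_BUDGET):
                    if best is None or u > best[0]:
                        best = (u, B_source)
                    break
        if best is None:
            raise NotImplementedError("placeholder proof search found nothing")
        # Internally delegate to the B with the highest provable utility.
        return run_program(best[1], question)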

Then A will find a short proof that delegating to the following program B leads to high utility (winning all rounds):

“Accept P_i and N_i as input. Using slightly fewer than N_i computation steps, and using PA as the theory, find the first proof of P_i returning anything. Then return the pair of that value and A’s source code.”

(To an outside observer, who doesn’t know about A’s private rituals, that looks like A correctly solving the math question and designating itself as successor. Also note that A’s algorithm has no mention of this particular B, nor of the two subtasks that B must solve.)
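
The corresponding sketch of B, with first_pa_value_proved a placeholder for the bounded proof search in plain PA:

    def first_pa_value_proved(P_source, step_budget):
        # Placeholder: within step_budget steps, find the first PA proof that
        # P_i returns some particular value, and return that value.
        raise NotImplementedError

    A_SOURCE = "<source code of A>"   # as in the sketch of A above

    def B(question):
        P_source, N = question
        # Use slightly fewer than N steps, and PA as the theory, to determine
        # what P_i returns.
        value = first_pa_value_proved(P_source, N - N // 100)
        # Return the answer, and name A (not B) as the successor agent.
        return value, A_SOURCE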

The proof is simple. If A internally delegates to B on the first round, then it will delegate to the same B on all subsequent rounds, because it ignores the input while choosing B. That fact is easily provable in PA. Then B goes on to handle the inputs on different rounds and wins all of them, which is easily provable in PA+Con(PA) due to the way the universe program verifies the answers. At each round B designates A as successor, which rubber-stamps B again, and the cycle continues.

You could say that A fails the Vingean principle, but that wouldn’t be fair. After all, A indeed can’t solve all questions in advance, because it doesn’t have enough time. It only solves the question “which B to run internally?” in advance, giving the same answer redundantly on each round, but that’s just an implementation detail. If you didn’t know how A worked internally, you wouldn’t object to it.

Of course it’s not guaranteed that A will reproduce itself. It knows that reproducing itself will lead to high utility, and we happen to know that no higher utility will be found, but the same utility could be achieved in other ways (e.g. by choosing a B that reproduces itself directly, instead of going through A). But that’s not a strong objection, because given the right tie-breaking, A will indeed reproduce itself.

Moreover, the approach of A looks quite general, without anything specific to this problem. It can also deal with all the usual UDT test cases, like Newcomb’s Problem, as well as all imaginable tiling problems that are amenable to quining. So someone optimistic could say that A is a version of UDT that solves the tiling agents problem!

That would be unsatisfactory though. It looks more like A has found a trick that leads to quining in a roundabout way. I feel that the people working on tiling agents wouldn’t accept my solution, but I can’t quite figure out which new principle would rule it out. Perhaps some kind of “no meta-quining principle”? Do humans obey that principle, or do they have intuitions like “solve decision theory in the abstract and delegate to that” which sound suspiciously like my agent? It’s tricky and my thinking is very confused right now.

What do you think?



by Alex Mennen 167 days ago | Abram Demski and Vladimir Slepnev like this | link

I think the general principle you’re taking advantage of is:

Do a small amount of very predictable reasoning in PA+Sound(PA). Then use PA for the rest of the reasoning you have to do. When reasoning about other instances of agents similar to you, simulate their use of Sound(PA), and trust the reasoning that they did while confining themselves to PA.

In your example, PA+Con(PA) sufficed, but PA+Sound(PA) is more flexible in general, in ways that might be important. This also seems to solve the closely related problem of how to trust statements if you know they were proven by an agent that reasons approximately the same way you do, for instance, statements proven by yourself in the past. If you proved X in the past, and you want to establish that that makes X true, you fully simulate everything your past self could have done in PA+Sound(PA)-mode before confining itself to PA, and then you can just trust that the reasoning you did afterwards in PA-mode was correct, so you don’t have to redo that part.
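
As a rough illustration of that discipline (all names here are placeholders of mine, and the Sound(PA)-mode steps are assumed to be few and cheap to re-run):

    def choose_policy_in_sound_mode():
        # Placeholder: the small, predictable amount of reasoning done in
        # PA + Sound(PA), e.g. picking which delegate/policy to trust.
        raise NotImplementedError

    def reason(task):
        # Mode 1: a little reasoning in PA + Sound(PA)...
        policy = choose_policy_in_sound_mode()
        # Mode 2: ...then everything else is done inside plain PA.
        return policy(task)

    def simulate_sound_mode_steps_of(other_instance):
        # Placeholder: re-run, step by step, the Sound(PA)-mode reasoning the
        # other instance did; this simulation itself only needs PA-mode.
        raise NotImplementedError

    def trust_past_conclusion(past_self, x_proved_by_past_self):
        simulate_sound_mode_steps_of(past_self)
        # The PA-mode reasoning the past self did afterwards can be trusted
        # without being redone.
        return x_proved_by_past_self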

This also might allow for an agent to make improvements in its own source code and still trust its future self, provided that the modification that it makes still only uses Sound(PA) in ways that it can predict and simulate. On the other hand, that condition might be a serious limitation to recursive self-improvement, since the successor agent would need to use Sound(PA) in order to pick its successor agent, and it can’t do so in ways that the predecessor agent can’t predict.

Perhaps it could be worse than that, and attempting to do anything nontrivial with this trick leads to a combinatorial explosion from every instance of the agent trying to simulate every other instance’s uses of Sound(PA). But I’m cautiously optimistic that it isn’t quite that bad, since simply simulating an agent invoking Sound(PA) does not require you to invoke Sound(PA) yourself, so these simulations can be run in PA-mode, and only the decision to run them needs to be made in PA+Sound(PA)-mode.


by Vladimir Slepnev 164 days ago | Alex Mennen likes this | link

I just realized that A will not only approve itself as successor, but also approve some limited self-modifications, like removing some inefficiency in choosing B that provably doesn’t affect the choice of B. Though it doesn’t matter much, because A might as well delete all code for choosing B and appoint a quining B as successor.

This suggests that the next version of the tiling agents problem should involve nontrivial self-improvement, not just self-reproduction. I have no idea how to formalize that though.



