Intelligent Agent Foundations Forum
An approach to the Agent Simulates Predictor problem
link by Alex Mennen 1196 days ago | Vanessa Kosoy, Abram Demski, Gary Drescher, Jessica Taylor and Patrick LaVictoire like this | 11 comments


by Gary Drescher 1184 days ago | Jessica Taylor likes this | link

Suppose we amend ASP to require the agent to output a full simulation of the predictor before saying “one box” or “two boxes” (or else the agent gets no payoff at all). Would that defeat UDT variants that depend on stopping the agent before it overthinks the problem?

(Or instead of requiring the agent to output the simulation, we could use the entire simulation, in some canonical form, as a cryptographic key to unlock an encrypted description of the problem itself. Prior to decrypting the description, the agent doesn’t even know what the rules are; the agent is told in advance only that the decryption will reveal the rules.)
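
For concreteness, here is a minimal sketch of the simulation-as-key setup in Python. The SHA-256 key derivation and the XOR keystream are toy stand-ins of my own choosing, not part of the proposal; any canonical encoding of the transcript plus any symmetric cipher would do.

```python
import hashlib

def derive_key(simulation_transcript: bytes) -> bytes:
    # Any canonical encoding of the full predictor simulation serves as key material.
    return hashlib.sha256(simulation_transcript).digest()

def xor_stream(data: bytes, key: bytes) -> bytes:
    # Toy symmetric cipher: XOR against a SHA-256-derived keystream.
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(b ^ s for b, s in zip(data, stream))

# The problem designer encrypts the rules under the correct transcript...
true_transcript = b"<canonical trace of the predictor's computation>"
rules = b"Box B contains $1M iff the predictor predicts one-boxing."
ciphertext = xor_stream(rules, derive_key(true_transcript))

# ...and the agent can read the rules only by reproducing that transcript exactly.
print(xor_stream(ciphertext, derive_key(true_transcript)).decode())
```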

reply

by Alex Mennen 1183 days ago | link

In the first problem, the agent could commit to one-boxing (through the mechanism I described in the link) and then finish simulating the predictor afterwards. Then the predictor would still be able to simulate the agent until it commits to one-boxing, and then prove that the agent will one-box no matter what it computes after that.

The second version of the problem seems more likely to cause problems, but it might work for the agent to restrict itself to not using the information it pre-computed for the purposes of modeling the predictor (even though it has to use that information for understanding the problem). If the predictor is capable of verifying or assuming that the agent will correctly simulate it, it could skip the impossible step of fully simulating the agent fully simulating it, and just simulate the agent on the decrypted problem.

reply

by Gary Drescher 1182 days ago | link

For the simulation-output variant of ASP, let’s say the agent’s possible actions/outputs consist of all possible simulations Si (up to some specified length), concatenated with “one box” or “two boxes”. To prove that any given action has utility greater than zero, the agent must prove that the associated simulation of the predictor is correct. Where does your algorithm have an opportunity to commit to one-boxing before completing the simulation, if it’s not yet aware that any of its available actions has nonzero utility? (Or would that commitment require a further modification to the algorithm?)
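
To spell out the payoff structure of that action space, a toy sketch (the name `true_transcript` and the standard Newcomb payoffs are my stand-ins for “the associated simulation of the predictor is correct”):

```python
def payoff(simulation: str, decision: str,
           true_transcript: str, predicted_one_box: bool) -> int:
    # The agent gets no payoff at all unless its output reproduces the predictor exactly.
    if simulation != true_transcript:
        return 0
    box_b = 1_000_000 if predicted_one_box else 0
    return box_b if decision == "one box" else box_b + 1_000

# The agent must exhibit a concrete (simulation, decision) pair and prove that
# this evaluates to something greater than zero.
```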

For the simulation-as-key variant of ASP, what principle would instruct a (modified) UDT algorithm to redact some of the inferences it has already derived?

reply

by Alex Mennen 1182 days ago | link

Simulation-output: It would require a modification to the algorithm. I don’t find this particularly alarming, though, since the algorithm was intended as a minimally complex solution that behaves correctly for good reasons, not as a final, fully general version. To do this, the agent would have to first (or at least, at some point soon enough for the predictor to simulate) look for ways to partition its output into pieces and consider choosing each piece separately. There would have to be some heuristic for deciding which partitionings of the output to consider and how much computational power to devote to each of them, and the partitioning that actually gets used would be whichever yields the highest expected utility. Come to think of it, this might be trickier than I was thinking, because you would run into self-trust issues if you need to prove that you will output the correct simulation of the predictor. This could be fixed by delegating the task of fully simulating the predictor to an easier-to-model subroutine, though that would require further modification to the algorithm.
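
Very roughly, the shape I have in mind (purely illustrative: the two-piece partition, the decision rule, and the delegated simulator are all placeholders rather than a worked-out algorithm):

```python
def choose_output(decisions, expected_utility, delegated_simulator):
    # Piece 1: the decision token, chosen by comparing expected utilities early
    # enough that the predictor can simulate this choice being made.
    decision = max(decisions, key=expected_utility)
    # Piece 2: the predictor's transcript, delegated to a simple subroutine that
    # the agent can easily prove will simulate the predictor correctly.
    transcript = delegated_simulator()
    return transcript + decision

# Toy usage with hand-picked stand-ins:
print(choose_output(["one box", "two boxes"],
                    lambda a: 1_000_000 if a == "one box" else 1_000,
                    lambda: "<predictor transcript> "))
```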

Simulation-as-key: I don’t have a good answer to that.

reply

by Vanessa Kosoy 1182 days ago | Abram Demski likes this | link

I think that the idea that contradictions should lead to infinite utility is probably something that doesn’t work for real models of logical uncertainty. Instead we can do pseudorandomization. That said, there might be some other approach that I’m missing.

Maximizing \(E[U \mid A() = a]\) is not EDT. In fact I believe it is the original formulation of UDT. The problems with EDT arise when you condition by indexical uncertainty. Instead, you should condition by logical uncertainty while fixing indexical uncertainty (see also this). I think that the correct decision theory has to look like evaluating some sort of expected values since the spirit of VNM theory should survive.
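
For concreteness, the rule in question is just \(\operatorname{argmax}_a \operatorname{E}[U \mid A() = a]\), with the conditioning done by the logical-uncertainty engine. A toy sketch, with that engine stubbed out as an explicit joint credence over (own action, utility) pairs:

```python
def udt1_choice(actions, joint):
    # `joint` maps (action, utility) pairs to the agent's logical credence that
    # its own program A() outputs that action and the world yields that utility.
    def conditional_eu(a):
        mass = sum(p for (act, _), p in joint.items() if act == a)
        if mass == 0:
            return float("-inf")   # the troublesome zero-probability case
        return sum(u * p for (act, u), p in joint.items() if act == a) / mass
    return max(actions, key=conditional_eu)

# Newcomb-style toy numbers (illustrative only):
joint = {("one box", 1_000_000): 0.9, ("two boxes", 1_000): 0.1}
print(udt1_choice(["one box", "two boxes"], joint))   # -> one box
```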

I think that the idea of gradually increasing the power of utility estimation corresponds precisely to going to higher metathreat levels, since the implementation of metathreats by optimal predictors involves pseudorandomizing each level while using more powerful predictors on the next level that are able to simulate the pseudorandomization. This should also solve logical counterfactual mugging, where the logical coin looks random on the lower levels, allowing pre-committing to cooperative behavior on the higher levels on which the logical coin looks deterministic.

reply

by Alex Mennen 1182 days ago | link

I think that the idea that contradictions should lead to infinite utility is probably something that doesn’t work for real models of logical uncertainty.

Why not?

In fact I believe it is the original formulation of UDT.

Huh, you’re right. I even knew that at one point. Specifically, what I proposed was UDT 1 (in this write-up’s terminology), and I concluded that what I described was not UDT because I found a case it gets wrong that I thought UDT should get right (and which UDT 1.1 does get right).

reply

by Patrick LaVictoire 1193 days ago | link

Nice!

Typo: in the first full paragraph of page 2, I assume you mean the agent will one-box, not two-box.

And I’m not sure the final algorithm necessarily one-boxes even if the logical uncertainty engine thinks the predictor’s (stronger) axioms are probably consistent; I think there might be a spurious counterfactual where the conditional utilities view the agent two-boxing as evidence that the predictor’s axioms must be inconsistent. Is there a clean proof that the algorithm does the correct thing in this case?

reply

by Alex Mennen 1191 days ago | Patrick LaVictoire likes this | link

Typo: in the first full paragraph of page 2, I assume you mean the agent will one-box, not two-box.

Yes, thanks for the correction. I’d fix it, but I don’t think it’s possible to edit a PDF in Google Drive, and it’s not worth re-uploading and posting a new link for a typo.

And I’m not sure the final algorithm necessarily one-boxes even if the logical uncertainty engine thinks the predictor’s (stronger) axioms are probably consistent; I think there might be a spurious counterfactual where the conditional utilities view the agent two-boxing as evidence that the predictor’s axioms must be inconsistent. Is there a clean proof that the algorithm does the correct thing in this case?

I don’t have such a proof. I mentioned that as a possible concern at the end of the second-to-last paragraph of the section on the predictor having stronger logic and more computing power. Reconsidering, though, this seems like a more serious concern than I initially imagined. It seems this will behave reasonably only when the agent does not trust itself too much, which would have terrible consequences for problems involving sequential decision-making.

Ideally, we’d want to replace the conditional expected value function with something of a more counterfactual nature to avoid these sorts of issues, but I don’t have a coherent way of specifying what that would even mean.

reply

by Vanessa Kosoy 1182 days ago | link

I think you mean “a spurious counterfactual where the conditional utilities view the agent one-boxing as evidence that the predictor’s axioms must be inconsistent”? That is, the agent correctly believes that the predictor’s axioms are likely to be consistent but also thinks that they would be inconsistent if it one-boxed, so it two-boxes?

reply

by Alex Mennen 1182 days ago | Patrick LaVictoire likes this | link

[Edit: this isn’t actually a spurious counterfactual.] The agent might reason: “If I two-box, then either it’s because I do something stupid (we can’t rule this out for Lobian reasons, but we should be able to assign it arbitrarily low probability), or, much more likely, the predictor’s reasoning is inconsistent. An inconsistent predictor would put $1M in box B no matter what my action is, so I can get $1,001,000 by two-boxing in this scenario. I am sufficiently confident in this model that my expected payoff conditional on me two-boxing is greater than $1M, whereas I can’t possibly get more than $1M if I one-box. Therefore I should two-box.” (This only happens if the predictor is implemented in such a way that it puts $1M in box B if it is inconsistent, of course.) If the agent reasons this way, it would be wrong to trust itself with high probability, but we’d want the agent to be able to trust itself with high probability without being wrong.
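
To put illustrative numbers on that reasoning (the 1e-9 credence is my own toy figure, not from the post):

```python
p_stupid = 1e-9   # credence that two-boxing would happen only via a "stupid" error
# Under the agent's model, a non-stupid two-boxing implies the predictor is
# inconsistent, and an inconsistent predictor fills box B anyway, paying $1,001,000;
# a stupid two-boxing against a consistent predictor pays only $1,000.
eu_two_box = (1 - p_stupid) * 1_001_000 + p_stupid * 1_000
eu_one_box = 1_000_000   # an upper bound; one-boxing can never pay more than $1M
print(eu_two_box, eu_two_box > eu_one_box)   # ~1,000,999.999, True: it two-boxes
```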

reply

by Vanessa Kosoy 1182 days ago | link

This is a reply to Alex’s comment 792, but I’m placing it here since for some weird reason the website doesn’t let me reply to 792.

I think that the idea that contradictions should lead to infinite utility is probably something that doesn’t work for real models of logical uncertainty.

Why not?

So, I started writing an explanation of why it doesn’t work, tried to anticipate the loopholes you would point out in this explanation, and ended up with the conclusion that it actually does work :)

First, note that in logical uncertainty the boolean divide between “contradiction” and “consistency” is replaced by a continuum. Logical conditional expectations become less and less stable as the probability of the condition goes to zero (see this; for a generalization to probabilistic algorithms see this). What we can do in the spirit of your proposal is e.g. maximize \(\operatorname{E}[U \mid A() = a] - T \log \Pr[A() = a]\) for some small constant \(T\) (in the optimal predictor formalism we probably want \(T\) to be a function of \(k\) that goes to 0 as \(k\) goes to infinity).
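
As a toy sketch of that objective (the constant \(T\) and the numbers are illustrative; a real implementation would get both the conditional expectations and the action probabilities from the logical-uncertainty machinery):

```python
import math

def regularized_choice(cond_eu, prob, T):
    # Maximize E[U | A() = a] - T * log Pr[A() = a] over actions a.
    # -log Pr[A() = a] is large for improbable actions, so the extra term keeps
    # every action's probability bounded away from zero, which is exactly the
    # regime where the logical conditional expectations become unstable.
    return max(cond_eu, key=lambda a: cond_eu[a] - T * math.log(prob[a]))

# Toy numbers (illustrative only); in the optimal-predictor formalism T would
# shrink with k rather than being a constant.
cond_eu = {"one box": 1_000_000.0, "two boxes": 1_000.0}
prob = {"one box": 0.999, "two boxes": 0.001}
print(regularized_choice(cond_eu, prob, T=0.01))   # -> one box
```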

The problem was that the self-referential nature of UDT requires optimal predictors for reflective systems, and the construction I knew for the latter yielded probabilistic optimal predictors, since it uses the Kakutani fixed point theorem and we need to form mixtures to apply it. With probabilistic optimal predictors things get hairy, since the stability condition “\(\Pr_{\text{logical}}[A() = a] > \epsilon\)” is replaced by the condition “lowest eigenvalue of \(\operatorname{E}_{\text{logical}}[\Pr_{\text{indexical}}[A() = a] \Pr_{\text{indexical}}[A() = b]] > \epsilon\)”. There seems to be no way to stabilize this new condition. There are superficially appealing analogues that in the degenerate case reduce to choosing the action that is most unlikely under the normal distribution with mean \(\operatorname{E}_{\text{logical}}[\Pr_{\text{indexical}}[A() = a]]\) and covariance \(\operatorname{E}_{\text{logical}}[\Pr_{\text{indexical}}[A() = a] \Pr_{\text{indexical}}[A() = b]]\). Unfortunately this doesn’t work, since there might be several actions with similar likelihoods that get chosen with different indexical probabilities consistently with the above mean and covariance. Indeed, it would be impossible for it to work, since in particular it would allow getting non-negligible logical variance for a quantity that depends on no parameters, which cannot happen (since it is always possible to hardcode such a quantity).

However, recently I discovered reflective systems that are deterministic (and which seem like the right thing to use for real agents, for independent reasons). For these systems the “naive” method works! This again cashes out into some sort of pseudorandomization, but this way the pseudorandomization arises naturally instead of having to insert an arbitrary pseudorandom function by hand. Moreover, it looks like it solves some issues with making the formalism truly “updateless” (i.e. dealing correctly with scenarios similar to counterfactual mugging).

Very pleased with this development!

reply


