Intelligent Agent Foundations Forum

I’ve kinda switched to the view that we should have the whole party happening on LW, so people from different backgrounds can mix and get exposed to each other’s interests. But yeah, it would be nice if new LW had tags, like old LW.


figure out what my values actually are / should be

I think many human ideas are like low resolution pictures. Sometimes they show simple things, like a circle, so we can make a higher resolution picture of the same circle. That’s known as formalizing an idea. But if the thing in the picture looks complicated, figuring out a higher resolution picture of it is an underspecified problem. I fear that figuring out my values over all possible futures might be that kind of problem.

So apart from hoping to define a “full resolution picture” of human values, either by ourselves or with the help of some AI or AI-human hybrid, it might be useful to come up with approaches that avoid defining it. That was my motivation for this post, which directly uses our “low resolution” ideas to describe some particular nice future without considering all possible ones. It’s certainly flawed, but there might be other similar ideas.

Does that make sense?


by Wei Dai 698 days ago | link

I think I understand what you’re saying, but my state of uncertainty is such that I put a lot of probability mass on possibilities that wouldn’t be well served by what you’re suggesting. For example, the possibility that we can achieve most value not through the consequences of our actions in this universe, but through their consequences in much larger (computationally richer) universes simulating this one. Or that spreading hedonium is actually the right thing to do and produces orders of magnitude more value than spreading anything that resembles human civilization. Or that value scales non-linearly with brain size so we should go for either very large or very small brains.

While discussing the VR utopia post, you wrote “I know you want to use philosophy to extend the domain, but I don’t trust our philosophical abilities to do that, because whatever mechanism created them could only test them on normal situations.” I have some hope that there is a minimal set of philosophical abilities that would allow us to eventually solve arbitrary philosophical problems, and we already have this. Otherwise it seems hard to explain the kinds of philosophical progress we’ve made, like realizing that other universes probably exist, and figuring out some ideas about how to make decisions when there are multiple copies of us in this universe and others.

Of course it’s also possible that’s not the case, and we can’t do better than to optimize the future using our current “low resolution” values, but until we’re a lot more certain of this, any attempt to do this seems to constitute a strong existential risk.


Yeah. An asymptotic thing like Solomonoff induction can still have all sorts of mathematical goodness, like multiple surprisingly equivalent definitions, uniqueness/optimality properties, etc. It doesn’t have to be immediately practical to be worth studying. I hope LI can also end up like that.


I just realized that A will not only approve itself as successor, but also approve some limited self-modifications, like removing some inefficiency in choosing B that provably doesn’t affect the choice of B. Though it doesn’t matter much, because A might as well delete all code for choosing B and appoint a quining B as successor.

This suggests that the next version of the tiling agents problem should involve nontrivial self-improvement, not just self-reproduction. I have no idea how to formalize that though.


Wei has proposed this program by email:

  1. Always say “yes” to chocolate. Accept Y as successor iff PA proves that Y always says “yes” to chocolate and Y accepts a subset of successors that X accepts.

Figuring out the relationship between 5 and 6 (do they accept each other, are they equivalent) is a really fun exercise. So far I’ve been able to find a program that’s accepted by 5 but not 6. I won’t spoil the pleasure of figuring it out for anyone :-)


It doesn’t mean computation steps. Losing in 1 step means you say “no” to chocolate, losing in 2 steps means you accept some program that says “no” to chocolate, and so on. Sorry, I thought that was the obvious interpretation, I’ll edit the post to make it clear.
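To make the "losing in k steps" reading concrete, here is a minimal sketch in Python. It is only a toy model, not the formal setting of the post: every program is assumed to be a pair of (does it say "yes" to chocolate, the names of the successors it accepts), and the names `Program`, `losing_steps`, and the example programs are all made up for illustration.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical toy model: (says "yes" to chocolate, names of accepted successors).
Program = Tuple[bool, Set[str]]


def losing_steps(name: str, programs: Dict[str, Program],
                 max_steps: int = 10) -> Optional[int]:
    """Smallest k such that `name` loses in k steps, or None if no such
    k <= max_steps exists in this finite toy model."""
    # Losing in 1 step: the program itself says "no" to chocolate.
    losers = {n for n, (says_yes, _) in programs.items() if not says_yes}
    if name in losers:
        return 1
    for k in range(2, max_steps + 1):
        # Losing in k steps: accepting some program that loses in k-1 steps.
        new = {n for n, (_, accepted) in programs.items()
               if n not in losers and accepted & losers}
        if not new:
            return None  # nothing new can lose, so `name` never loses
        if name in new:
            return k
        losers |= new
    return None


# Made-up example programs.
programs: Dict[str, Program] = {
    "refuser": (False, set()),          # says "no" to chocolate
    "careless": (True, {"refuser"}),    # accepts a program that says "no"
    "naive": (True, {"careless"}),      # accepts a program that accepts a refuser
    "quine": (True, {"quine"}),         # only ever accepts itself
}

steps = {n: losing_steps(n, programs) for n in programs}
```

In this toy example `refuser` loses in 1 step, `careless` in 2, `naive` in 3, and `quine` never loses.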


by Stuart Armstrong 754 days ago | link

Ah, thanks! That seems more sensible.


I’ll just note that in a modal logic or halting oracle setting you don’t need the chicken rule, as we found in this old post. So it seems like at least the first problem is about the approximation, not the thing being approximated.


by Sam Eisenstat 771 days ago | Abram Demski likes this | link

Yeah, the 5 and 10 problem in the post actually can be addressed using provability ideas, in a way that fits in pretty naturally with logical induction. The motivation here is to work with decision problems where you can’t prove statements \(A = a \to U = u\) for agent \(A\), utility function \(U\), action \(a\), and utility value \(u\), at least not with the amount of computing power provided, but you want to use inductive generalizations instead. That isn’t necessary in this example, so it’s more of an illustration.

To say a bit more, if you make logical inductors propositionally consistent, similarly to what is done in this post, and make them assign things that have been proven already probability 1, then they will work on the 5 and 10 problem in the post.

It would be interesting if there was more of an analogy to explore between the provability oracle setting and the inductive setting, and more ideas could be carried over from modal UDT, but it seems to me that this is a different kind of problem that will require new ideas.


Counterfactual mugging with a logical coin is a tricky problem. It might be easier to describe the problem with a “physical” coin first. We have two world programs, mutually quined:

  1. The agent decides whether to pay the predictor 10 dollars. The predictor doesn’t decide anything.

  2. The agent doesn’t decide anything. The predictor decides whether to pay the agent 100 dollars, depending on the agent’s decision in world 1.

By fiat, the agent cares about the two worlds equally, i.e. it maximizes the total sum of money it receives in both worlds. The usual UDT-ish solution can be crisply formulated in modal logic, PA or a bunch of other formalisms.
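As a rough illustration of the payoff structure only (not of the modal-logic or PA formulation), here is a small Python sketch. It assumes the predictor in world 2 pays exactly when the agent pays in world 1, which is the standard reading but is not spelled out above; the function name `total_payoff` and its parameters are made up for this example.

```python
def total_payoff(pays: bool, cost: int = 10, reward: int = 100) -> int:
    """Total money the agent receives across both worlds for a fixed policy."""
    # World 1: the agent decides whether to pay the predictor 10 dollars.
    world1 = -cost if pays else 0
    # World 2 (assumption): the predictor pays 100 dollars iff the agent
    # pays in world 1.
    world2 = reward if pays else 0
    # By fiat the agent cares about both worlds equally, so it sums them.
    return world1 + world2


# The policy that maximizes the total over both worlds.
best_policy = max([True, False], key=total_payoff)
```

Under these assumptions paying yields a total of 90 and refusing yields 0, so the sum-maximizing policy is to pay.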

Does that make sense?


by Scott Garrabrant 1188 days ago | Ryan Carey likes this | link

This makes sense. My main point is that the “care about the two worlds equally” part makes sense if it is part of the problem description, but otherwise we don’t know where that part comes from.

My logical example was supposed to illustrate that sometimes you should not care about them equally.


by Vladimir Slepnev 1236 days ago | Abram Demski likes this | link | parent | on: Naturalistic Logical Updates

Nice work! I’m not too attached to the particular prior described in the “UDT 1.5” post, and I’m very happy that you’re trying to design something better to solve the same problem. That said, it’s a bit suspicious that you’re conditioning on provability rather than truth, which leads you to use GLS etc. Maybe I just don’t understand the “non-naturalistic” objection?

Also I don’t understand why you have to use both propositional coherence and partitions of truth. Can’t you replace partitions of truth with arbitrary propositional combinations of sentences?


by Abram Demski 1199 days ago | link

(Sorry I didn’t notice this comment earlier.)

My intention wasn’t really to use both propositional coherence and partitions of truth – I switched from one to the other because they’re equivalent (at least, the way I did it). Probably would have been better to stick with one.

I do think this notion of ‘naturalistic’ is important. The idea is that if another computer implements the same logic as your internal theorem prover, and you know this, you should treat information coming from it just the same as you would your own. This seems like a desirable property, if you can get it.

I can understand being suspicious. I’m not claiming that using GLS gives some magical self-reference properties due to knowing what is true about provability. It’s more like using PA+CON(PA) to reason about PA. It’s much more restricted than that, though; it’s a probabilistic reasoning system that believes GLS, reasoning about PA and provability in PA. In any case, you won’t be automatically trusting external theorem provers in this “higher” system, only in PA. However, GLS is decidable, so trusting what external theorem provers claim about it is a non-issue.

What’s the use? Aside from giving a quite pleasing (to me at least) solution to the paradox of ignorance, this seems to me to be precisely what’s needed for impossible possible worlds. In order to be uncertain about logic itself, there needs to be some “structure” that we are uncertain about: something that is really chosen deterministically, according to real logic, but which we pretend is chosen randomly, in order to get a tractable distribution to reason about (which then includes the impossible possible worlds).


by Abram Demski 1198 days ago | link

The main reason to think it’s important to be “naturalistic” about the updates is reflective consistency. When there is negligible interaction between possible worlds, an updateless agent should behave like an updateful one. The updateless version of the agent would not endorse updating on \(\phi\) rather than \(\Box \phi\)! Since it does not trust what its future self proves, it would self-modify to remove such an update if it were hard-wired. So I think updating on provability rather than truth is really the right thing.


Well, Laplace’s rule of succession is a computable prior that will almost certainly converge to your uncomputable probability value, and I think the difference in log scores from the “true” prior will be finite too. Since Laplace’s rule is included in the Solomonoff mixture, I suspect that things should work out nicely. I don’t have a proof, though.
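For reference, Laplace’s rule of succession predicts the next bit to be 1 with probability \((s + 1)/(n + 2)\) after seeing \(s\) ones in \(n\) trials. The Python sketch below only illustrates the convergence of that estimate on simulated coin flips; the bias `p` is an ordinary float standing in for the uncomputable value, and the helper names are made up. It says nothing about the log-score difference or the Solomonoff mixture.

```python
import random


def laplace_estimate(successes: int, trials: int) -> float:
    """Laplace's rule of succession: P(next = 1) = (s + 1) / (n + 2)."""
    return (successes + 1) / (trials + 2)


def estimate_after(p: float, n: int, seed: int = 0) -> float:
    """Flip a coin with bias p (a stand-in for the true probability) n times
    and return the Laplace estimate afterwards."""
    rng = random.Random(seed)
    successes = sum(rng.random() < p for _ in range(n))
    return laplace_estimate(successes, n)


# With no data the rule predicts 1/2; with more data it approaches p.
prior_guess = laplace_estimate(0, 0)
late_guess = estimate_after(0.3, 100_000)
```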






