Some Criticisms of the Logical Induction paper link by Tarn Somervell Fletcher 24 days ago | Alex Mennen, Sam Eisenstat and Scott Garrabrant like this | 10 comments

A few thoughts:

I agree that the LI criterion is “pointwise” in the way that you describe, but I think that this is both pretty good and as much as could actually be asked. A single efficiently computable trader can do a lot. It can enforce coherence on a polynomially growing set of sentences, search for proofs using many different proof strategies, enforce a polynomially growing set of statistical patterns, enforce reflection properties on a polynomially large set of sentences, etc. So, eventually the market will not be exploitable on all these things simultaneously, which seems like a pretty good level of accurate beliefs to have.
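As an illustration of what one such trader looks like (my own sketch, not from the paper), here is the shape of a trader that enforces a single coherence relation, $$P(\phi) + P(\neg\phi) = 1$$, on a growing list of sentence pairs; whenever the prices violate it, the trades lock in a guaranteed profit:

```python
# Hypothetical sketch of one efficiently computable trader that enforces
# P(phi) + P(not-phi) = 1 on the first n sentence pairs of some fixed
# enumeration. `prices` maps sentences to market prices in [0, 1]; a trade
# is (sentence, shares), where one share of phi pays 1 if phi is true.
def coherence_trader(n, prices, sentence_pairs):
    trades = []
    for phi, not_phi in sentence_pairs[:n]:  # polynomially many pairs per day
        gap = 1.0 - (prices[phi] + prices[not_phi])
        if gap > 0:
            # Underpriced: buying one share of each costs less than 1 but
            # pays exactly 1 in every outcome -- guaranteed profit of `gap`.
            trades.append((phi, 1.0))
            trades.append((not_phi, 1.0))
        elif gap < 0:
            # Overpriced: short one share of each for a profit of -gap.
            trades.append((phi, -1.0))
            trades.append((not_phi, -1.0))
    return trades

# Example: "A" and "not A" priced at 0.3 and 0.4; buying both costs 0.7
# and pays 1 regardless of which is true.
trades = coherence_trader(1, {"A": 0.3, "not A": 0.4}, [("A", "not A")])
```

Any market exploitable this way leaks money to the trader forever, which is why the LI criterion forces the incoherence gaps to vanish.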

On the other side of things, it would be far too strong to ask for a uniform bound of the form “for every $$\varepsilon > 0$$, there is some day $$t$$ such that after step $$t$$, no trader can multiply its wealth by a factor more than $$1 + \varepsilon$$”. This is because a trader can be hardcoded with arbitrarily many hard-to-compute facts. For every $$\delta$$, there must eventually be a day $$t' > t$$ on which the beliefs of your logical inductor assign probability less than $$\delta$$ to some true statement, at which point a trader who has that statement hardcoded can multiply its wealth by $$1/\delta$$. (I can give a construction of such a sentence using self-reference if you want, but it’s also intuitively natural - just pick many mutually exclusive statements with nothing to break the symmetry.)
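The $$1/\delta$$ factor is just the payout arithmetic of buying a true sentence cheaply; a minimal sketch (my own illustration, names hypothetical):

```python
# A trader with one true statement phi hardcoded. If the market prices phi
# at delta, spending all wealth on shares of phi (each pays 1 once phi is
# listed as true by the deductive process) multiplies wealth by 1/delta.
def exploit_factor(wealth, delta):
    shares = wealth / delta  # buy at price delta
    payout = shares * 1.0    # each share pays 1 when phi is verified
    return payout / wealth   # wealth-multiplication factor: 1/delta

factor = exploit_factor(1.0, 0.125)  # delta = 1/8 -> factor 8.0
```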

Thus, I wouldn’t think of a trader profiting as a “mistake”, as you do in the post. A trader can gain money on the market if the market doesn’t already know all facts that will be listed by the deductive process, but that is a very high bar. Doing well against finitely many traders is already “pretty good”.

What you can ask for regarding uniformity is for some simple function $$f$$ such that any trader $$T$$ can multiply its wealth by at most a factor $$f(T)$$. This is basically the idea of the mistake bound model in learning theory; you bound how many mistakes happen rather than when they happen. This would let you say more than the one-trader properties I mentioned in my first paragraph. In fact, $$\tt{LIA}$$ has this property; $$f(T)$$ is just the initial wealth of the trader. You may therefore want to do something like setting traders’ initial wealths according to some measure of complexity. Admittedly this isn’t made explicit in the paper, but there’s not much additional that needs to be done to think in this way; it’s just the combination of the individual proofs in the paper with the explicit bounds you get from the initial wealths of the traders involved.
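The mistake-bound flavor of guarantee can be illustrated with the classic halving algorithm from learning theory (standard textbook material, not from the LI paper): the learner makes at most $$\log_2 |H|$$ mistakes in total, with no guarantee about *when* those mistakes occur.

```python
# Halving algorithm: predict with the majority vote of all hypotheses still
# consistent with the data. Every mistake eliminates at least half of them,
# so total mistakes <= log2(len(hypotheses)) -- a bound on how many, not on when.
import math

def run_halving(hypotheses, stream):
    """hypotheses: list of functions x -> 0/1; stream: list of (x, true_label)."""
    version_space = list(hypotheses)
    mistakes = 0
    for x, y in stream:
        votes = sum(h(x) for h in version_space)
        prediction = 1 if 2 * votes >= len(version_space) else 0
        if prediction != y:
            mistakes += 1
        version_space = [h for h in version_space if h(x) == y]
    return mistakes

# Target is "x >= 2" among threshold hypotheses "x >= t" for t in 0..7.
hypotheses = [lambda x, t=t: 1 if x >= t else 0 for t in range(8)]
stream = [(x, 1 if x >= 2 else 0) for x in [5, 0, 3, 1, 2, 6, 4, 7]]
m = run_halving(hypotheses, stream)
assert m <= math.log2(len(hypotheses))  # at most 3, whenever they happen
```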

I basically agree completely on your last few points. The traders are a model class, not an ensemble method in any substantive way, and it is just confusing to connect them to the papers on ensemble methods that the LI paper references. Also, while I use the idea of logical induction to do research that I hope will be relevant to practical algorithms, it seems unlikely that any practical algorithm will look much like an LI. For one thing, finding fixed points is really hard without some property stronger than continuity, and you’d need a pretty good reason to put it in the inner loop of anything.

by Paul Christiano 4 days ago | link | parent | on: Current thoughts on Paul Christano's research agen...

Why does Paul think that learning needs to be “aligned” as opposed to just well-understood and well-behaved, so that it can be safely used as part of a larger aligned AI design that includes search, logic, etc.?

I mostly think it should be benign / corrigible / something like that. I think you’d need something like that whether you want to apply learning directly or to apply it as part of a larger system.

If Paul does not think ALBA is a realistic design of an entire aligned AI (since it doesn’t include search/logic/etc.) what might a realistic design look like, roughly?

You can definitely make an entire AI out of learning alone (evolution / model-free RL), and I think that’s currently the single most likely possibility even though it’s not particularly likely.

The alternative design would integrate whatever other useful techniques are turned up by the community, which will depend on what those techniques are. One possibility is search/planning. This can be integrated in a straightforward way into ALBA; I think the main obstacle is security amplification, which needs to work for ALBA anyway and is closely related to empirical work on capability amplification. On the logic side it’s harder to say what a useful technique would look like other than “run your agent for a while,” which you can also do with ALBA (though it requires something like these ideas).

which makes it seem like his approach is an alternative to MIRI’s

My hope is to have safe and safely composable versions of each important AI ingredient. I would caricature the implicit MIRI view as “learning will lead to doom, so we need to develop an alternative approach that isn’t doomed,” which is a substitute in the sense that it’s also trying to route around the apparent doomedness of learning but in a quite different way.

Thanks, so to paraphrase your current position, you think once we have aligned learning it doesn’t seem as hard to integrate other AI components into the design, so aligning learning seems to be the hardest part. MIRI’s work might help with aligning other AI components and integrating them into something like ALBA, but you don’t see that as very hard anyway, so it perhaps has more value as a substitute than a complement. Is that about right?

One possibility is search/planning. This can be integrated in a straightforward way into ALBA

I don’t understand ALBA well enough to easily see extensions to the idea that are obvious to you, and I’m guessing others may be in a similar situation. (I’m guessing Jessica didn’t see it for example, or she wouldn’t have said “ALBA competes with adversaries who use only learning” without noting that there’s a straightforward extension that does more.) Can you write a post about this? (Or someone else please jump in if you do see what the “straightforward way” is.)

by Wei Dai 5 days ago | Vladimir Slepnev likes this | link | parent | on: Current thoughts on Paul Christano's research agen...

Thank you for writing this. I’m trying to better understand Paul’s ideas, and it really helps to see an explanation from a different perspective. Also, I was thinking of publicly complaining that I know at least four people who have objections to Paul’s approach that they haven’t published anywhere. Now that’s down to three. :)

I wonder if you can help answer some questions for me. (I’m directing these at Paul too, but since he’s very busy I can’t always expect an answer.)

Why does Paul think that learning needs to be “aligned” as opposed to just well-understood and well-behaved, so that it can be safely used as part of a larger aligned AI design that includes search, logic, etc.? He seems to be trying to design an entire aligned AI out of “learning”, which makes it seem like his approach is an alternative to MIRI’s (Daniel Dewey said this recently at EA Forums for example), while at the same time saying “But we can and should try to do the same for other AI components; I understand MIRI’s agent foundations agenda as (mostly) addressing the alignment of these other elements.” If he actually thinks that his approach and MIRI’s are complements, why didn’t he correct Daniel? I’m pretty confused here.

ETA: I found a partial answer to the above here. To express my understanding of it, Paul is trying to build an aligned AI out of only learning because that seems easier than building a realistic aligned AI and may give him insights into how to do the latter. If he interprets MIRI as doing the analogous thing starting with other AI components (as he seems to according to the quote in the above paragraph), then he surely ought to view the two approaches as complementary, which makes it a bigger puzzle why he didn’t contradict Daniel when Daniel said “if an approach along these lines is successful, it doesn’t seem to me that much room would be left for HRAD to help on the margin”. (Maybe he didn’t read that part, or his interpretation of what MIRI is doing has changed?)

If Paul does not think ALBA is a realistic design of an entire aligned AI (since it doesn’t include search/logic/etc.) what might a realistic design look like, roughly?

Why does Paul think learning “poses much harder safety problems than other AI techniques under discussion”?

Paul is beginning to do empirical work on capability amplification (as he told me recently via email). Do you think that’s a good alternative to trying to make further theoretical progress?


by Paul Christiano 6 days ago | Jessica Taylor and Wei Dai like this | link | parent | on: Current thoughts on Paul Christano's research agen...

I mostly agree with this post’s characterization of my position.

Places where I disagree with your characterization of my view:

• I don’t assume that powerful actors can’t coordinate, and I don’t think that assumption is necessary. I would describe the situation as: over the course of time influence will necessarily shift sometimes due to forces we endorse—like deliberation or reconciliation—and sometimes due to forces we don’t endorse—like compatibility with an uncontrolled expansion strategy. Even if powerful actors can form perfectly-coordinated coalitions, a “weak” actor positioned to benefit from competitive expansion would simply decline to participate in that coalition unless offered extremely generous terms. I don’t see how the situation changes unless the strong actors use force. I do think that’s reasonably likely; I would more describe alignment as a first line of defense or a complement to approaches like regulation. I generally agree that good coordination can substitute for technical weakness.
• I don’t think I rely on or even implicitly use a unidimensional model of power. I do use a concept like “total wealth” or “total influence,” which seems almost but not quite tautologically well-defined (as the output of a competitive/bargaining dynamic) and in particular is compatible with knowledge vs. resources vs. whatever. Being “competitive” seems to make sense in very complex worlds; when I say something like “win in a fistfight,” I mean to quantify over all possible fistfights (including science, economic competition, persuasion, war, etc. etc.).
• I have strong intuitions about my approach being workable, and either the approach will succeed or I at least will feel that I have learned something substantial. I expect many more pivots and adjustments to be necessary, but don’t expect to get stuck with plausibility arguments that are nearly as weak as the current arguments.

Place where I disagree with your view:

• I agree that there are many drivers of AI other than learning. However, I think that learning is (a) currently the dominant component of powerful AI and so both more urgent and easier to study, (b) poses much harder safety problems than other AI techniques under discussion, and (c) appears to be the “hard part” of analyzing procedures like evolution, fine-tuning brain-inspired architectures, or analyzing reasoning (it’s where I run into a wall when trying to analyze these other alternatives).
• I think that all of capability amplification, informed oversight, and red teams / adversarial training are amenable to theoretical analysis with realistic amounts of philosophical progress. For example, I think that it will be possible to analyze these schemes using only abstractions like optimization power, without digging into models of bounded rationality at all. I may have understated my optimism on this (for capability amplification in particular) in our last discussion—I do believe that we won’t have a formal argument, but I think we should aim for an argument that is based on plausible empirical assumptions plus very good evidence for those assumptions.
• Altruism as in “concern for future generations” does not fall out of coordination strategies, and seems more like a spandrel to me. But I do agree that many parts of altruism are more like coordination and this gives some prima facie reason for optimism about getting to some Pareto frontier.

Take-aways that I agree with:

• We will need to have a better understanding of deliberation in order to be confident in any alignment scheme. (I prefer a more surgical approach than most MIRI folk, trying to figure out exactly what we need to know rather than trying to have an expansive understanding of what good reasoning looks like.)
• It is valuable for people to step back from particular approaches to alignment and to try to form a clearer understanding of the problem, explore completely new approaches, etc.

Given that ALBA was not meant to be a realistic aligned AI design in and of itself, but just a way to get insights into how to build a realistic aligned AI (which I hadn’t entirely understood until now), I wonder if it makes sense to try to nail down all the details and arguments for it before checking to see if you generated any such insights. If we assume that aligned learning roughly looks like ALBA, what does that tell you about what a more realistic aligned AI looks like? It seems worth asking this, in case you, for example, spend a lot of time figuring out exactly how capability amplification could work, and then it ends up that capability amplification isn’t even used in the final aligned AI design, or in case designing aligned AI out of individual AI components doesn’t actually give you much insight into how to design more realistic aligned AI.

Current thoughts on Paul Christano's research agenda
post by Jessica Taylor 6 days ago | Sam Eisenstat, Paul Christiano and Wei Dai like this | 5 comments

This post summarizes my thoughts on Paul Christiano’s agenda in general and ALBA in particular.




Smoking Lesion Steelman
post by Abram Demski 21 days ago | Sam Eisenstat, Vadim Kosoy, Paul Christiano and Scott Garrabrant like this | 5 comments

It seems plausible to me that any example I’ve seen so far which seems to require causal/counterfactual reasoning is more properly solved by taking the right updateless perspective, and taking the action or policy which achieves maximum expected utility from that perspective. If this were the right view, then the aim would be to construct something like updateless EDT.

I give a variant of the smoking lesion problem which overcomes an objection to the classic smoking lesion, and which is solved correctly by CDT, but which is not solved by updateless EDT.

From my perspective, I don’t think it’s been adequately established that we should prefer updateless CDT to updateless EDT

I agree with this.

It would be nice to have an example which doesn’t arise from an obviously bad agent design, but I don’t have one.

I’d also be interested in finding such a problem.

I am not sure whether your smoking lesion steelman actually makes a decisive case against evidential decision theory. If an agent knows about their utility function on some level, but not on the epistemic level, then this can just as well be made into a counter-example to causal decision theory. For example, consider a decision problem with the following payoff matrix:

Smoke-lover:

• Smokes:
    • Killed: 10
    • Not killed: -90
• Doesn’t smoke:
    • Killed: 0
    • Not killed: 0

Non-smoke-lover:

• Smokes:
    • Killed: -100
    • Not killed: -100
• Doesn’t smoke:
    • Killed: 0
    • Not killed: 0

For some reason, the agent doesn’t care whether they live or die. Also, let’s say that smoking makes a smoke-lover happy, but afterwards, they get terribly sick and lose 100 utilons. So they would only smoke if they knew they were going to be killed afterwards. The non-smoke-lover doesn’t want to smoke in any case.

Now, smoke-loving evidential decision theorists rightly choose smoking: they know that robots with a non-smoke-loving utility function would never have any reason to smoke, no matter which probabilities they assign. So if they end up smoking, then this means they are certainly smoke-lovers. It follows that they will be killed, and conditional on that state, smoking gives 10 more utility than not smoking.

Causal decision theory, on the other hand, seems to recommend a suboptimal action. Let $$a_1$$ be smoking, $$a_2$$ not smoking, $$S_1$$ being a smoke-lover, and $$S_2$$ being a non-smoke-lover. Moreover, say the prior probability $$P(S_1)$$ is $$0.5$$. Then, for a smoke-loving CDT bot, the expected utility of smoking is just

$$\mathbb{E}[U|a_1]=P(S_1)\cdot U(S_1\wedge a_1)+P(S_2)\cdot U(S_2\wedge a_1)=0.5\cdot 10 + 0.5\cdot (-90) = -40$$,

which is less than the certain $$0$$ utilons for $$a_2$$. Assigning a credence of around $$1$$ to $$P(S_1|a_1)$$, a smoke-loving EDT bot calculates

$$\mathbb{E}[U|a_1]=P(S_1|a_1)\cdot U(S_1\wedge a_1)+P(S_2|a_1)\cdot U(S_2\wedge a_1)\approx 1 \cdot 10 + 0\cdot (-90) = 10$$,

which is higher than the expected utility of $$a_2$$.
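The two calculations can be checked numerically. A minimal sketch, taking from the example above the payoffs, the prior $$P(S_1) = 0.5$$, the credence $$P(S_1|a_1) \approx 1$$, and the implicit assumption that the smoke-lover ($$S_1$$) is the one who gets killed while the agent's payoffs always come from the smoke-lover's utility function:

```python
# Smoke-lover's utility as a function of action and whether they are killed.
def u_smoke_lover(action, killed):
    if action == "smoke":
        return 10 if killed else -90
    return 0  # not smoking is worth 0 either way

def expected_utility(p_s1, action):
    # p_s1 = credence in being a smoke-lover (and hence, here, in being killed)
    return (p_s1 * u_smoke_lover(action, True)
            + (1 - p_s1) * u_smoke_lover(action, False))

cdt_smoke = expected_utility(0.5, "smoke")  # prior: 0.5*10 + 0.5*(-90) = -40.0
edt_smoke = expected_utility(1.0, "smoke")  # conditioned on smoking: 10.0
refrain = expected_utility(0.5, "refrain")  # 0.0 either way
```

So the CDT bot sees $$-40 < 0$$ and refrains, while the EDT bot sees $$10 > 0$$ and smokes, matching the equations above.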

The reason CDT fails here doesn’t seem to lie in a mistaken causal structure. Also, I’m not sure whether the problem for EDT in the smoking lesion steelman is really that it can’t condition on all its inputs. If EDT can’t condition on something, then EDT doesn’t account for this information, but this doesn’t seem to be a problem per se.

In my opinion, the problem lies in an inconsistency in the expected utility equations. Smoke-loving EDT bots calculate the probability of being a non-smoke-lover, but then the utility they get is actually the one from being a smoke-lover. For this reason, they can get some “back-handed” information about their own utility function from their actions. The agents basically fail to condition two factors of the same product on the same knowledge.

Say we don’t know our own utility function on an epistemic level. Ordinarily, we would calculate the expected utility of an action, both as smoke-lovers and as non-smoke-lovers, as follows:

$$\mathbb{E}[U|a]=P(S_1|a)\cdot \mathbb{E}[U|S_1, a]+P(S_2|a)\cdot \mathbb{E}[U|S_2, a]$$,

where, if $$U_{1}$$ ($$U_{2}$$) is the utility function of a smoke-lover (non-smoke-lover), $$\mathbb{E}[U|S_i, a]$$ is equal to $$\mathbb{E}[U_{i}|a]$$. In this case, we don’t get any information about our utility function from our own action, and hence, no Newcomb-like problem arises.

I’m unsure whether there is any causal decision theory derivative that gets my case (or all other possible cases in this setting) right. It seems like as long as the agent isn’t certain to be a smoke-lover from the start, there are still payoffs for which CDT would (wrongly) choose not to smoke.

by Vadim Kosoy 14 days ago | link | parent | on: Some Criticisms of the Logical Induction paper

Replying to Rob.

Actually, I do know a stronger property of LI / dominant forecasting. In the notation of my paper, we have the following:

Let $$\{G^k\}$$ be a family of gamblers, $$\xi,\zeta$$ probability distributions supported on $$\mathbb{N}$$. Then, there exists a forecaster $$F$$ s.t. for any $$k,m,b \in \mathbb{N}$$ ($$b > 0$$) and $$x \in \mathcal{O}^\omega$$, if the first condition holds then the second condition holds:

$\inf_{n \leq m} {\operatorname{\Sigma V}_\min G^{kF}_{n}(x_{:n-1})} > -b$

$\sup_{n \leq m} {\operatorname{\Sigma V}_\max G^{kF}_{n}(x_{:n-1})} \leq \frac{\sum_{j,c} \xi(j) \zeta(c) c}{\xi(k)\sum_{c > b} \zeta(c)}$

The above is not hard to see from the proof of Theorem 2, if you keep track of the finite terms. Now, this is not really a bound on the time of convergence; it is something like a bound on the number of mistakes made, which is completely analogous to standard guarantees in online learning.

If you’re running LIA and stop enumerating traders at $$N$$, you will get the same dominance property, only for traders up to $$N$$.

The fact that VC dimension is non-decreasing for a family of nested classes is a tautology. There is nothing special about the order of the hypotheses in SRM: the order (and weighting) of hypotheses is external information that reflects your prior about the correct hypothesis. Similarly, in LIA we can order the traders by description complexity, same as in Solomonoff induction, because we expect simpler patterns to be more important than complex patterns. This is nothing but the usual Occam’s razor. Or, we can consider more specialized dominant forecasting, with gamblers and ordering selected according to domain-specific considerations.
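For concreteness (my own arithmetic, not from the paper), with the geometric weights $$\xi(j) = \zeta(j) = 2^{-j}$$ for $$j \geq 1$$, the right-hand side of the bound evaluates in closed form:

```latex
% Worked instance of the bound with \xi(j) = \zeta(j) = 2^{-j}, j \geq 1.
% Numerator: \sum_{j,c} \xi(j)\zeta(c)\,c
%   = \Big(\sum_{j \geq 1} 2^{-j}\Big)\Big(\sum_{c \geq 1} c\, 2^{-c}\Big)
%   = 1 \cdot 2 = 2.
% Denominator: \xi(k)\sum_{c > b} \zeta(c) = 2^{-k} \cdot 2^{-b}.
\frac{\sum_{j,c} \xi(j) \zeta(c)\, c}{\xi(k)\sum_{c > b} \zeta(c)}
  = \frac{2}{2^{-k}\cdot 2^{-b}} = 2^{\,k+b+1}
```

So the total winnings of gambler $$k$$ at drawdown budget $$b$$ are bounded by $$2^{k+b+1}$$: the bound degrades exponentially as the gambler sits further out in the "prior," but it is finite and uniform over time, which is exactly the mistake-bound flavor described above.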

I don’t think that stopping at $$N$$ does some kind of “disproportionate” damage to the LI. For example, Theorem 4.2.1 in the LI paper requires one trader for each $$\epsilon > 0$$ and sequence of theorems. If this trader is in the selection, then the probabilities of the theorems will converge to 1 within $$\epsilon$$. Similarly, in my paper you need the gambler $$\Gamma S^{Mk}$$ to ensure your forecaster converges to the incomplete model $$M$$ within $$1/k$$.

You can do SRM for any sequence of nested classes of finite VC dimension. For example, if you have a countable set of hypotheses $$\{h_n\}$$, you can take the classes to be $$H_n:=\{h_m\}_{m < n}$$. This is just as arbitrary as in LI. The thing is, the error bound that SRM satisfies depends on the actual class in which the reference hypothesis lies. So, compared to a hypothesis in a very high class, SRM can converge very slowly (require a very large sample size). This is completely analogous to the inequality I gave before, where $$\xi(k)$$ appears in the denominator of the bound, so gamblers that are “far away” in the “prior” can win a lot of bets before the forecaster catches up. SRM is useful iff you a priori expect “low” hypotheses to be good approximations. For example, suppose you want to fit a polynomial to some data but you don’t know what degree to use. SRM gives you a rule that determines the degree automatically, from the data itself. However, the reason this rule has good performance is because we expect most functions in the real world to be relatively “smooth” and therefore well-approximable by a low degree polynomial.
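The SRM rule can be sketched with the nested classes $$H_n = \{h_m\}_{m<n}$$ described above (my own illustration: threshold hypotheses stand in for polynomial degrees, and the penalty term is a schematic stand-in for the VC-based bound, not the exact constant):

```python
# SRM sketch: pick the hypothesis minimizing empirical risk plus a
# complexity penalty that grows with its index in the enumeration.
import math

def srm_select(hypotheses, sample):
    best_m, best_score = None, float("inf")
    for m, h in enumerate(hypotheses):
        emp_risk = sum(h(x) != y for x, y in sample) / len(sample)
        # Schematic penalty: grows with the index m, shrinks with sample size.
        penalty = math.sqrt(math.log(m + 2) / len(sample))
        score = emp_risk + penalty
        if score < best_score:
            best_m, best_score = m, score
    return best_m

# Hypotheses h_m(x) = [x >= m]; noiseless data generated by h_3.
hyps = [lambda x, m=m: int(x >= m) for m in range(10)]
sample = [(x, int(x >= 3)) for x in range(10)]
m_star = srm_select(hyps, sample)  # the penalty doesn't outweigh exact fit: m_star = 3
```

The trade-off in the text is visible here: a reference hypothesis with a high index pays a larger penalty, so SRM needs more data before it stops preferring cheaper, slightly-wrong hypotheses.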

Ordering by description complexity is perfectly computable in itself: it just means that we fix a UTM, represent traders by programs (on which we impose a time bound, otherwise it really is uncomputable), and weight each trader by $$2^{-\text{program length}}$$. It would be interesting to find some good property this thing has. If we don’t impose a time bound (so we are uncomputable), then the Bayesian analogue is Solomonoff induction, which has the nice property that it only “weakly” depends on the choice of UTM. Will different UTMs give “similar” LIs? Off the top of my head, I have no idea! When we add the time bound it gets more messy, since time is affected by translation between UTMs, so I’m not sure how to formulate an “invariance” property even in the Bayesian case. Looking through Schmidhuber’s paper on the speed prior, I see ey do have some kind of invariance (section 4) but I’m too lazy to understand the details right now.
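Concretely, the weighting described above looks like this (a toy sketch with hypothetical bit-string "programs" standing in for traders):

```python
# Weight each trader by 2^(-program length). For a prefix-free set of
# binary programs, the Kraft inequality guarantees the weights sum to <= 1,
# so they can serve directly as initial wealths within a unit budget.
traders = ["0", "10", "110", "1110"]  # hypothetical prefix-free trader codes
weights = {t: 2.0 ** (-len(t)) for t in traders}
total = sum(weights.values())  # 1/2 + 1/4 + 1/8 + 1/16 = 0.9375
```

Shorter (simpler) traders get larger initial wealth, so the bound they can extract from the market is larger, which is the Occam's-razor ordering in action.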

by Vadim Kosoy 15 days ago | link | parent | on: Some Criticisms of the Logical Induction paper

Replying to 240 (I can’t reply directly because of some quirk of the forum). (I’m not sure that I know your name, is it Robert?)

I’m actually not interested in discussing the verbal arguments in the paper. Who reads the verbal arguments anyway? I go straight for the equations ;) The verbal arguments might or might not misrepresent the results, I don’t care either way. I am interested in discussing the mathematical content of the LI paper and whether LI is a valuable mathematical discovery.

I agree that there is a relevant sense in which the dominance property in itself is too weak. The way I see it, LI (more generally, dominant forecasting, since I don’t care much about the application to formal logic in particular) is a generalization of Bayesian inference, and the dominance property is roughly analogous to merging of opinions (in my paper there is a more precise analogy, but it doesn’t matter). In the Bayesian case, we can also say that merging of opinions is too weak a property in itself. However:

• Proving this weak property is already non-trivial, so it is a valuable and necessary step towards proving stronger properties.
• Besides the property, we have an actual construction, and I expect this construction to have stronger properties (even though I don’t expect it to be a practical algorithm), like Bayesian inference has stronger properties than just merging of opinions.

Now, nobody currently has a proof of stronger properties or even a formulation (I think), but IMO this is not a reason to disregard all work done so far. Rome wasn’t built in a day :)

Actually, I do know a stronger property of LI / dominant forecasting. In the notation of my paper, we have the following:

Let $$\{G^k\}$$ be a family of gamblers, $$\xi,\zeta$$ probability distributions supported on $$\mathbb{N}$$. Then, there exists a forecaster $$F$$ s.t. for any $$k,m,b \in \mathbb{N}$$ ($$b > 0$$) and $$x \in \mathcal{O}^\omega$$, if the first condition holds then the second condition holds:

$\inf_{n \leq m} {\operatorname{\Sigma V}_\min G^{kF}_{n}(x_{:n-1})} > -b$

$\sup_{n \leq m} {\operatorname{\Sigma V}_\max G^{kF}_{n}(x_{:n-1})} \leq \frac{\sum_{j,c} \xi(j) \zeta(c) c}{\xi(k)\sum_{c > b} \zeta(c)}$

The above is not hard to see from the proof of Theorem 2, if you keep track of the finite terms. Now, this is not really a bound on the time of convergence; it is something like a bound on the number of mistakes made, which is completely analogous to standard guarantees in online learning.

If you’re running LIA and stop enumerating traders at $$N$$, you will get the same dominance property, only for traders up to $$N$$.

The fact that VC dimension is non-decreasing for a family of nested classes is a tautology. There is nothing special about the order of the hypotheses in SRM: the order (and weighting) of hypotheses is external information that reflects your prior about the correct hypothesis. Similarly, in LIA we can order the traders by description complexity, same as in Solomonoff induction, because we expect simpler patterns to be more important than complex patterns. This is nothing but the usual Occam’s razor. Or, we can consider more specialized dominant forecasting, with gamblers and ordering selected according to domain-specific considerations.

by Vadim Kosoy 15 days ago | link | parent | on: Some Criticisms of the Logical Induction paper

Replying to 240 (I can’t reply directly because of some quirk of the forum).

(I’m not sure that I know your name, is it Robert?)

I’m actually not interested in discussing the verbal arguments in the paper. Who reads the verbal arguments anyway? I go straight for the equations ;) The verbal arguments might or might not misrepresent the results, I don’t care either way.

I am interested in discussing the mathematical content of the LI paper and whether LI is a valuable mathematical discovery.

I agree that there is a relevant sense in which the dominance property in itself is too weak. The way I see it, LI (more generally, dominant forecasting, since I don’t care much about the application to formal logic in particular) is a generalization of Bayesian inference, and the dominance property is roughly analogous to merging of opinions (in my paper there is a more precise analogy, but it doesn’t matter). In the Bayesian case, we can also say that merging of opinions is too weak a property in itself. However:

• Proving this weak property is already non-trivial, so it is a valuable and necessary step towards proving stronger properties.

• Besides the property, we have an actual construction and I expect this construction to have stronger properties (even though I don’t expect it to be a practical algorithm), like Bayesian inference has stronger properties than just merging of opinions.

Now, nobody currently has a proof of stronger properties or even a formulation (I think), but IMO this is not a reason to disregard all work done so far. Rome wasn’t built in a day :)

by Abram Demski 15 days ago | Vadim Kosoy likes this | link | parent | on: Smoking Lesion Steelman
by Vadim Kosoy 15 days ago | Abram Demski likes this | link | on: Smoking Lesion Steelman

Yeah, you’re right. This setting is quite confusing :) In fact, if your agent doesn’t commit to a policy once and for all, things get pretty weird, because it doesn’t trust its future self.

by Vadim Kosoy 16 days ago | Abram Demski likes this | link | parent | on: Smoking Lesion Steelman
by Abram Demski 15 days ago | Vadim Kosoy likes this | link | on: Smoking Lesion Steelman

The non-smoke-loving agents think of themselves as having a negative incentive to switch to CDT in that case. They think that if they build a CDT agent with oracle access to their true reward function, they may smoke (since they don’t know what their true reward function is). So I don’t think there’s an equilibrium there. The non-smoke-lovers would prefer to explicitly give a CDT successor a non-smoke-loving utility function, if they wanted to switch to CDT. But then, this action itself would give evidence of their own true utility function, likely counter-balancing any reason to switch to CDT.

I was wondering about what happens if the agents try to write a strategy for switching between using such a utility oracle and a hand-written utility function (which would in fact be the same function, since they prefer their own utility function). But this probably doesn’t do anything nice either, since a useful choice of policy there would also reveal too much information about motives.

by Vadim Kosoy 16 days ago | Alex Mennen likes this | link | parent | on: Some Criticisms of the Logical Induction paper

First, yes, the convergence property is not uniform, but it neither should nor can be uniform. This is completely analogous to usual Bayesian inference, where the speed of convergence depends on the probability of the correct hypothesis within the prior, or to the Structural Risk Minimization principle in PAC learning. This is unavoidable: if two models only start to differ at a very late point in time, there is no way to distinguish between them before this point. In particular, human brains have the same limitation: it would take you more time to notice a complex pattern than a simple pattern.

Second, of course we can and should work on results about convergence rates; it is just that there is only so much ground you can reasonably cover in one paper. For example, the classical paper on merging of opinions by Blackwell and Dubins also doesn’t analyze convergence rates, and nevertheless it has 500+ citations. To make one concrete guess: if we consider a sequence of observations in $$\{0,1\}$$ and look at the deviation between the probability the incomplete-model forecaster assigns to the next observation and the probability interval that follows from a given correct incomplete model, then I expect some sort of cumulative bound on these deviations, analogous to standard regret bounds in online learning. Of course the bound will include some parameter related to the “prior probability” $$\xi(k)$$.
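The kind of cumulative guarantee alluded to here can be illustrated with a standard online-learning fact, not taken from either paper: a Bayes-mixture forecaster over $$N$$ experts incurs cumulative log loss at most $$\ln N$$ worse than the best expert, on any sequence. A minimal sketch with two made-up constant experts:

```python
import math

# Two hypothetical "experts": one always predicts p=0.9 for the next
# bit being 1, the other always predicts p=0.3. The Bayes mixture's
# cumulative log loss exceeds the best expert's by at most ln(N),
# uniformly over sequences -- a mistake-style bound, not a time bound.
experts = [0.9, 0.3]
weights = [0.5, 0.5]  # uniform prior over experts

seq = [1, 1, 0, 1, 1, 1, 0, 1]  # arbitrary observation sequence

def loglik(p, bit):
    return math.log(p if bit == 1 else 1.0 - p)

mixture_loss = 0.0
for bit in seq:
    # Mixture prediction: posterior-weighted average of expert predictions.
    p_mix = sum(w * p for w, p in zip(weights, experts)) / sum(weights)
    mixture_loss += -loglik(p_mix, bit)
    # Bayesian update: multiply each weight by that expert's likelihood.
    weights = [w * math.exp(loglik(p, bit)) for w, p in zip(weights, experts)]

best_expert_loss = min(-sum(loglik(p, b) for b in seq) for p in experts)

# Cumulative regret is bounded by ln(number of experts), here ln 2.
assert mixture_loss <= best_expert_loss + math.log(len(experts)) + 1e-9
```

The bound holds because the product of the mixture’s predictive probabilities telescopes to the prior-weighted marginal likelihood, which is at least the prior weight of the best expert times that expert’s likelihood.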

Third, the only reason the sum in equation (31) is finite is that I wanted to state Theorem 2 in greater generality, without requiring the gamblers to be bounded. However, the gamblers that I actually use (see section 5) are bounded, so for this purpose I might as well have taken an infinite sum over all gamblers.

by Vadim Kosoy 16 days ago | Alex Mennen likes this | link | parent | on: Some Criticisms of the Logical Induction paper

Clarification: I’m not the author of the linked post, I’m just linking it here on behalf of the author (who objects to signing up for things with facebook). I found the most interesting part to be the observation about pointwise convergence: because the LI framework proves its results pointwise, it cannot guarantee that each result proved about $$P_\infty$$ obtains at any finite time $$n$$; see also his example about preemptive learning, where for an infinite set of e.c. sequences there may be no finite $$n$$ that works for all of the set.
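The pointwise/uniform distinction here is the classic one from analysis; a standard illustration (unrelated to LI’s specifics) is $$f_n(x) = x^n$$ on $$[0,1)$$, which converges to 0 at every point but at no uniform rate:

```python
# Classic pointwise-but-not-uniform convergence: f_n(x) = x**n on [0, 1).
# For every fixed x < 1, f_n(x) -> 0, yet sup over x stays near 1, so no
# single n works for all x at once -- analogous to each trader eventually
# losing its advantage without any one day working uniformly for all.
n = 200

# Pointwise: at the fixed point x = 0.5, f_n is already tiny.
assert 0.5 ** n < 1e-9

# Not uniform: some x < 1 still has f_n(x) close to 1.
x_near_one = 0.9999
assert x_near_one ** n > 0.9
```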

by Alex Mennen 16 days ago | Abram Demski likes this | link | parent | on: Some Criticisms of the Logical Induction paper

Alex, the difference between your point of view and my own is that I don’t care that much about the applications of LI to formal logic. I’m much more interested in its applications to forecasting. If the author is worried about the same issues as you, IMO it is not very clear from ’eir essay. On the other hand, the original paper doesn’t talk much about applications outside of formal logic, and my countercriticism might be slightly unfair, since it’s grounded in a perspective that the author doesn’t know about.

by Vadim Kosoy 16 days ago | Patrick LaVictoire and Scott Garrabrant like this | link | parent | on: Some Criticisms of the Logical Induction paper

I think the asymptotic nature of LI is much more worrying than the things you compare it to. For an asymptotic result to be encouraging for practical applications, it needs to be the kind of result that, in most cases of interest where it holds, comes with good finite-time behavior as well. If you give an $$O(n^{12})$$ algorithm for some problem, this is an encouraging sign that maybe you can develop a practical algorithm. But if a problem is known not to be in $$P$$, and you then give an exponential-time algorithm for it and argue that it can probably be solved efficiently in practice because you’ve established an asymptotic bound, then no one will take this claim very seriously. It seems to me that this is roughly the position logical induction is in, but lifted from efficiency all the way up to computability.

The probabilities given by logical induction converge to coherent probabilities, but there is no computable function that will tell you how long you have to run logical induction to know the eventual probability of a certain sentence within a certain precision. We know logical induction doesn’t do this because it cannot be done; if it could be, then there would be a computable set containing all provable sentences and no disprovable sentences (since every sentence would eventually be shown either not to have probability 0 or not to have probability 1), but there is no such computable set. So we know for sure that in the naive sense, there is no way, even in principle, to run logical induction long enough that you have probabilities you can trust to be reasonably accurate. Maybe a better complexity theory analogy would be if you give a $$PP$$ algorithm for a problem known not to be in $$BPP$$ (although there would need to be exciting advancements in complexity theory before that could happen); this gives good reasons to believe that improvements to the algorithm would make it practical to run, and if you run it enough times the majority vote will probably be correct, but it will never be practical to run it enough times for that to happen. Likewise, efficient approximations to logical induction should be able to complete each step in a reasonable amount of time, but will not be able to complete enough steps to give you accurate probabilities.

In order to take the probabilities given by logical induction at very large finite steps seriously, you would need effective asymptotic results, and since this cannot be done for the probabilities converging to coherent probabilities, there would need to be subtler senses in which the probabilities given at finite times can be taken seriously even if they are not close to the limiting probabilities. Now, the logical induction paper does give interesting subtler senses in which the probabilities given by logical induction can be taken seriously before they approach their limiting values, but the fundamental problem is that all of those results say that some pattern will eventually hold, for the same sense of “eventually” in which the probabilities eventually converge to coherent probabilities, so this does not give strong evidence that those desirable patterns can be made to emerge in any reasonable amount of time.

by Vadim Kosoy 16 days ago | Patrick LaVictoire and Scott Garrabrant like this | link | parent | on: Some Criticisms of the Logical Induction paper

Yeah. An asymptotic thing like Solomonoff induction can still have all sorts of mathematical goodness, like multiple surprisingly equivalent definitions, uniqueness/optimality properties, etc. It doesn’t have to be immediately practical to be worth studying. I hope LI can also end up like that.

 Some Criticisms of the Logical Induction paper link by Tarn Somervell Fletcher 24 days ago | Alex Mennen, Sam Eisenstat and Scott Garrabrant like this | 10 comments

The substance of this criticism mostly reduces to “all results proven are only asymptotic.” This is not very useful criticism. Yes, the results are asymptotic, but this is true for half of theoretical computer science: complexity theory, statistical learning theory, you name it… The LI is obviously not a practical algorithm, but it is not supposed to be a practical algorithm. What it’s supposed to be, IMO, is an existence proof that certain (asymptotic!) desiderata can be satisfied.

To make an analogy, the fundamental theorem of statistical learning theory tells us that the ERM principle is guaranteed to converge to the right answer whenever the VC dimension is finite. Now, if we want to make few domain-specific assumptions, we want to take a hypothesis space as large as possible and, voila, feed-forward NNs have finite VC dimension but also allow approximating any Boolean circuit of a given size, so, as with Solomonoff induction, we can approximate anything computable. On the other hand, applying the ERM principle to NNs is NP-hard, so we do gradient descent instead and nobody knows why it works. If we live long enough, I believe we will eventually find a mathematical characterization of a large natural class of problems where gradient descent for ANNs provably approximates the global optimum. Similarly, Solomonoff induction and the LI (in the abstract setting, not necessarily in the formal logic setting) should also have efficient analogues that work for some large natural class of problems. We don’t know these analogues yet (which is probably a good thing), but this doesn’t mean they don’t exist.

Finally, the notion that “LIA is like an author-bot which generates all possible character strings in some arbitrary order” misses the point that we expect traders with low description complexity to be more useful, for the usual Occam’s razor reason.

So, to sum up, yes, the LI paper doesn’t solve all the problems in the world, but neither does it pretend to, and it seems a decent result by any realistic standard.
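To unpack the ERM analogy: empirical risk minimization simply picks, from a fixed hypothesis class, the hypothesis with the fewest mistakes on the sample. A toy sketch with a hypothetical class of threshold classifiers (VC dimension 1) and made-up data:

```python
# Toy ERM: hypothesis class = threshold classifiers h_t(x) = [x >= t].
# The samples below are made up for illustration. ERM picks the
# threshold with minimal empirical error; the fundamental theorem of
# statistical learning says this generalizes when VC dimension is
# finite (for thresholds on the line, VC dimension = 1).
samples = [(0.1, 0), (0.3, 0), (0.45, 0), (0.6, 1), (0.8, 1), (0.9, 1)]
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]

def empirical_risk(t):
    """Number of sample points misclassified by threshold t."""
    return sum(1 for x, y in samples if (x >= t) != bool(y))

erm = min(thresholds, key=empirical_risk)
assert empirical_risk(erm) == 0  # t = 0.5 separates this sample perfectly
```

Here exhaustive search over five thresholds is trivial; the NP-hardness the comment mentions is about doing the analogous minimization over neural-network weights.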

Smoking Lesion Steelman
post by Abram Demski 21 days ago | Sam Eisenstat, Vadim Kosoy, Paul Christiano and Scott Garrabrant like this | 5 comments

It seems plausible to me that any example I’ve seen so far which seems to require causal/counterfactual reasoning is more properly solved by taking the right updateless perspective, and taking the action or policy which achieves maximum expected utility from that perspective. If this were the right view, then the aim would be to construct something like updateless EDT.

I give a variant of the smoking lesion problem which overcomes an objection to the classic smoking lesion, and which is solved correctly by CDT, but which is not solved by updateless EDT.

by Vadim Kosoy 16 days ago | Abram Demski likes this | link | on: Smoking Lesion Steelman

The claim that “this isn’t changed at all by trying updateless reasoning” depends on the assumptions about updateless reasoning. If the agent chooses a policy in the form of a self-sufficient program, then you are right. On the other hand, if the agent chooses a policy in the form of a program with oracle access to the “utility estimator,” then there is an equilibrium where both smoke-lovers and non-smoke-lovers self-modify into CDT. Admittedly, there are also “bad” equilibria, e.g. non-smoke-lovers staying with EDT and smoke-lovers choosing between EDT and CDT with some probability. However, it seems arguable that the presence of bad equilibria is due to the “degenerate” property of the problem that one type of agent has incentives to move away from EDT whereas the other type has exactly zero such incentives.

Cooperative Oracles: Stratified Pareto Optima and Almost Stratified Pareto Optima
post by Scott Garrabrant 50 days ago | Vadim Kosoy, Patrick LaVictoire and Stuart Armstrong like this | 6 comments

In this post, we generalize the notions in Cooperative Oracles: Nonexploited Bargaining to deal with the possibility of introducing extra agents that have no control but have preferences. We further generalize this to infinitely many agents. (Part of the series started here.)

In the infinite game example, I think that something doesn’t add up in the definition of $$f$$. A single-valued Kakutani map into a compact space is just a continuous map, but $$f$$ is not continuous.

post by Scott Garrabrant 657 days ago | Sam Eisenstat, Vadim Kosoy and Jessica Taylor like this | 4 comments

In this post, I ask two questions about Solomonoff Induction. I am not sure if these questions are open or not. If you know the answer to either of them, please let me know. I think that the answers may be very relevant to stuff I am currently working on in Asymptotic Logical Uncertainty.

Universal Prediction of Selected Bits solves the related question of what happens if the odd bits are adversarial but the even bits copy the preceding odd bits. Basically, the universal semimeasure learns to do the right thing, but the exact sense in which the result is positive is subtle and has to do with the difference between measures and semimeasures. The methods may also be relevant to the questions here, though I don’t see a proof for either question yet.

AI safety: three human problems and one AI issue
post by Stuart Armstrong 64 days ago | Ryan Carey and Daniel Dewey like this | 2 comments

A putative new idea for AI control; index here.

There have been various attempts to classify the problems in AI safety research, from our old Oracle paper, which classified then-theoretical methods of control, to more recent classifications that grow out of modern, more concrete problems.

These all serve their purpose, but I think a more enlightening classification of the AI safety problems is to look at what issues we are trying to solve or avoid. And most of these issues are problems about humans.

It seems to me that “friendly AI” is a name for the entire field rather than a particular approach; otherwise, I don’t understand what you mean by “friendly AI”. More generally, it would be nice to provide a link for each of the approaches.

A cheating approach to the tiling agents problem
post by Vladimir Slepnev 22 days ago | Alex Mennen and Vadim Kosoy like this | 3 comments

(This post resulted from a conversation with Wei Dai.)

Formalizing the tiling agents problem is very delicate. In this post I’ll show a toy problem and a solution to it, which arguably meets all the desiderata stated before, but only by cheating in a new and unusual way.

Here’s a summary of the toy problem: we ask an agent to solve a difficult math question and also design a successor agent. Then the successor must solve another math question and design its own successor, and so on. The questions get harder each time, so they can’t all be solved in advance, and each of them requires believing in Peano arithmetic (PA). This goes on for a fixed number of rounds, and the final reward is the number of correct answers.

Moreover, we will demand that the agent must handle both subtasks (solving the math question and designing the successor) using the same logic. Finally, we will demand that the agent be able to reproduce itself on each round, not just design a custom-made successor that solves the math question with PA and reproduces itself by quining.

I just realized that A will not only approve itself as successor, but also approve some limited self-modifications, like removing some inefficiency in choosing B that provably doesn’t affect the choice of B. Though it doesn’t matter much, because A might as well delete all code for choosing B and appoint a quining B as successor.

This suggests that the next version of the tiling agents problem should involve nontrivial self-improvement, not just self-reproduction. I have no idea how to formalize that though.
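Since quining comes up twice here: a quine is a program that outputs its own source code, which is the trick a trivial successor agent could use to reproduce itself. A minimal Python example of the trick itself (not of the agent construction):

```python
# A minimal quine: the program prints its own source code exactly.
# The tiling-agent analogue would emit its own code as its successor
# instead of printing it.
src = 'src = %r\nprint(src %% src)'
print(src % src)
```

The `%r` inserts the string’s own repr (quotes and escapes included), and `%%` becomes a literal `%`, so the output reproduces both lines of the source verbatim.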


by Paul Christiano 18 days ago | Scott Garrabrant likes this | link | on: Smoking Lesion Steelman

I like this line of inquiry; it seems like being very careful about the justification for CDT will probably give a much clearer sense of what we actually want out of “causal” structure for logical facts.

by Alex Mennen 21 days ago | Abram Demski and Vladimir Slepnev like this | link | parent | on: A cheating approach to the tiling agents problem

I think the general principle you’re taking advantage of is: do a small amount of very predictable reasoning in PA+Sound(PA), then use PA for the rest of the reasoning you have to do. When reasoning about other instances of agents similar to you, simulate their use of Sound(PA), and trust the reasoning they did while confining themselves to PA. In your example, PA+Con(PA) sufficed, but PA+Sound(PA) is more flexible in general, in ways that might be important.

This also seems to solve the closely related problem of how to trust statements that you know were proven by an agent that reasons approximately the same way you do, for instance, statements proven by yourself in the past. If you proved X in the past, and you want to establish that this makes X true, you fully simulate everything your past self could have done in PA+Sound(PA)-mode before confining itself to PA, and then you can trust that the reasoning you did afterwards in PA-mode was correct, so you don’t have to redo that part.

This also might allow an agent to make improvements to its own source code and still trust its future self, provided that the modification it makes still only uses Sound(PA) in ways that it can predict and simulate. On the other hand, that condition might be a serious limitation on recursive self-improvement, since the successor agent would need to use Sound(PA) in order to pick its own successor, and it can’t do so in ways that the predecessor agent can’t predict. Perhaps it is even worse than that, and attempting to do anything nontrivial with this trick leads to a combinatorial explosion, with every instance of the agent trying to simulate every other instance’s uses of Sound(PA).

But I’m cautiously optimistic that it isn’t quite that bad, since simply simulating an agent invoking Sound(PA) does not itself require you to invoke Sound(PA), so these simulations can be run in PA-mode; only the decision to run them needs to be made in PA+Sound(PA)-mode.
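The replay idea in this comment can be sketched as follows. All names here are hypothetical stand-ins: `strong_mode_steps` represents the short, fully predictable PA+Sound(PA) phase, and `trust_instance` represents re-running another instance's strong-mode phase while accepting its PA-mode work wholesale:

```python
# Each agent instance does a short, deterministic "strong mode" phase
# (standing in for PA+Sound(PA)), then does all remaining work in "base
# mode" (standing in for PA). To trust a similar instance, we replay only
# its strong-mode steps and accept its base-mode reasoning without
# redoing it.

def strong_mode_steps(agent_id):
    # Must be short and deterministic, so other instances can replay it
    # exactly and predict its outcome.
    return [(agent_id, "accept base-mode proofs of similar agents")]

def trust_instance(other_id):
    # Replaying another instance's strong-mode steps does not itself
    # require strong mode; only the decision to replay them does.
    replayed = strong_mode_steps(other_id)
    predicted = [(other_id, "accept base-mode proofs of similar agents")]
    return replayed == predicted
```

The point of the sketch is the asymmetry: `strong_mode_steps` must be predictable enough that `trust_instance` can enumerate it in advance, which is exactly the constraint the comment identifies as a possible obstacle to nontrivial self-improvement.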

I’d like to understand what we want from self-modification informally, and refine that into a toy problem that the agent in the post can’t solve…

