Intelligent Agent Foundations Forumsign up / log in

Yeah, when I went back and patched up the framework of this post to be less logical-omniscence-y, I was able to get \(2\to 3\to 4\to 1\), but 2 is a bit too strong to be proved from 1, because my framing of 2 is just about probability disagreements in general, while 1 requires \(W\) to assign probability 1 to \(\phi\).


I found an improved version by Pavel, that gives a way to construct a proof of \(\phi\) from \(\Box_{n}\phi\) that has a length of \(\mathcal{O}(n)\). The improved version is here.

There are restrictions to this result, though. One is that the C-rule must apply to the logic. This is just the ability to go from \(\exists x:\phi(x)\) to instantiating a \(c\) such that \(\phi(c)\). Pretty much all reasonable theorem provers have this.

The second restriction is that the theory must be finitely axiomatizable. No axiom schemas allowed. Again, this isn’t much of a restriction in practice, because NBG set theory, which proves the consistency of ZFC, is finitely axiomatizable.

The proof strategy is basically as follows. It’s shown that the shortest proof of a statement with quantifier depth n must have a length of \(\Omega(n^2)\), if the maximum quantifier depth in the proof is \(2n\) or greater.

This can be flipped around to conclude that if there’s a length-n proof of \(\phi\), the maximum quantifier depth in the proof can be at most \(\mathcal{O}(\sqrt{n})\).

The second part of the proof involves constructing a bounded-quantifier version of a truth predicate. By Tarski’s undefinability of truth, a full truth predicate cannot be constructed, but it’s possible to exhibit a formula for which it’s provable that \[qd(\overline{\psi})\le n\to(Sat_{n}(\overline{\psi},x)\leftrightarrow\Sigma(Sat_{n},\overline{\psi},x))\] (\(\Sigma\) is the formula laying out Tarski’s conditions for something to be a truth predicate). Also, if \(n\ge\) quantifier depth of \(\psi\), there’s a proof of \[Sat_{n}(\overline{\psi},x)\leftrightarrow\psi[x]\] (\(\psi[x]\) is the sentence \(\psi\) with its free variables substituted for the elements enumerated in the list \(x\)) Also, there’s a proof that \(Sat_{n}\) is preserved under inference rules and logical axioms, as long as everything stays below a quantifier depth of \(n\).

All these proofs can be done in \(\mathcal{O}(n^2)\) lines. One factor of \(n\) comes from the formula abbreviated as \(Sat_{n}(x,y)\) getting longer at a linear rate, and the other factor comes from having to prove \(Sat_{n}\) for each \(n\) seperately as an ingredient for the next proof.

Combining the two parts, the \(\mathcal{O}(\sqrt{n})\) bound on the quantifier depth and the \(\mathcal{O}(n^2)\) bound on how long it takes to prove stuff about the truth predicate, make it take \(\mathcal{O}(n)\) steps to prove all the relevant theorems about a sufficiently large bounded quantifier depth truth predicate, and then you can just go “the statement that we are claiming to have been proved must have \(Sat_{n}\) apply to it, and we’ve proved this is equivalent to the statement itself”

As a further bonus, a single \(\mathcal{O}(n)\)-length proof can establish the consistency of the theory itself for all \(n\)-length proofs.

It seems like a useful project to develop a program that will automatically write a proof of this form, to assess whether abstract unpacking of bounded proofs is usable in practice, but it will require digging into a bunch of finicky details of exactly how to encode a math theory inside itself.


Caught a flaw with this proposal in the currently stated form, though it is probably patchable.

When unpacking a proof, at some point the sentence \(\Box_{1}\phi\) will be reached as a conclusion, which is a false statement.


by Sam Eisenstat 375 days ago | link

I misunderstood your proposal, but you don’t need to do this work to get what you want. You can just take each sentence \(\square_n \phi \to \phi\) as an axiom, but declare that this axiom takes \(n\) symbols to invoke. This could be done by changing the notion of length of a proof, or by taking axioms \(\psi_{\phi,n} \to (\square_n \phi \to \phi)\) and \(\psi_{\phi,n}\) with \(\psi_{\phi,n}\) very long.


by Alex Appel 422 days ago | Abram Demski likes this | link | parent | on: Smoking Lesion Steelman

I think that in that case, the agent shouldn’t smoke, and CDT is right, although there is side-channel information that can be used to come to the conclusion that the agent should smoke. Here’s a reframing of the provided payoff matrix that makes this argument clearer. (also, your problem as stated should have 0 utility for a nonsmoker imagining the situation where they smoke and get killed)

Let’s say that there is a kingdom which contains two types of people, good people and evil people, and a person doesn’t necessarily know which type they are. There is a magical sword enchanted with a heavenly aura, and if a good person wields the sword, it will guide them do heroic things, for +10 utility (according to a good person) and 0 utility (according to a bad person). However, if an evil person wields the sword, it will afflict them for the rest of their life with extreme itchiness, for -100 utility (according to everyone).

good person’s utility estimates:

  • takes sword

    • I’m good: 10

    • I’m evil: -90

  • don’t take sword: 0

evil person’s utility estimates:

  • takes sword

    • I’m good: 0

    • I’m evil: -100

  • don’t take sword: 0

As you can clearly see, this is the exact same payoff matrix as the previous example. However, now it’s clear that if a (secretly good) CDT agent believes that most of society is evil, then it’s a bad idea to pick up the sword, because the agent is probably evil (according to the info they have) and will be tormented with itchiness for the rest of their life, and if it believes that most of society is good, then it’s a good idea to pick up the sword. Further, this situation is intuitively clear enough to argue that CDT just straight-up gets the right answer in this case.

A human (with some degree of introspective power) in this case, could correctly reason “oh hey I just got a little warm fuzzy feeling upon thinking of the hypothetical where I wield the sword and it doesn’t curse me. This is evidence that I’m good, because an evil person would not have that response, so I can safely wield the sword and will do so”.

However, what the human is doing in this case is using side-channel information that isn’t present in the problem description. They’re directly experiencing sense data as a result of the utility calculation outputting 10 in that hypothetical, and updating on that. In a society where everyone was really terrible at introspection so the only access they had to their decision algorithm was seeing their actual decision, (and assuming no previous decision problems that good and evil people decide differently on so the good person could learn that they were good by their actions), it seems to me like there’s a very intuitively strong case for not picking up the sword/not smoking.


by Alex Appel 435 days ago | link | parent | on: Musings on Exploration

A: While that is a really interesting note that I hadn’t spotted before, the standard formulation of exploration steps in logical inductor decision theory involve infinite exploration steps over all time, so even though an agent of this type would be able to inductively learn from what other agents do in different decision problems in less time than it naively appears, that wouldn’t make it explore less.

B: What I intended with the remark about Thompson sampling was that troll bridge functions on there being two distinct causes of “attempting to cross the bridge”. One is crossing because you believe it to be the best action, and the other is crossing because an exploration step occurred, and Thompson sampling doesn’t have a split decision criterion like this. Although now that you point it out, it is possible to make a Thompson sampling variant where the troll blows up the bridge when “crossing the bridge” is not the highest-ranked action.


Update: This isn’t really an issue, you just need to impose an assumption that there is some function \(f\) such that \(f(n)>n\), and \(f(n)\) is computable in time polynomial in \(f(n)\), and you always find out whether exploration happened on turn \(f(n)\) after \(\mathcal{O}(f(n+1))\) days.

This is just the condition that there’s a subsequence where good feedback is possible, and is discussed significantly in section 4.3 of the logical induction paper.

If there’s a subsequence B (of your subsequence of interest, A) where you can get good feedback, then there’s infinite exploration steps on subsequence B (and also on A because it contains B)

This post is hereby deprecated. Still right, just not that relevant.


by Alex Appel 454 days ago | Abram Demski likes this | link | parent | on: Distributed Cooperation

If you drop the Pareto-improvement condition from the cell rank, and just have “everyone sorts things by their own utility”, then you won’t necessarily get a Pareto-optimal outcome (within the set of cell center-points), but you will at least get a point where there are no strict Pareto improvements (no points that leave everyone better off).

The difference between the two is… let’s say we’ve got a 2-player 2-move game that in utility-space, makes some sort of quadrilateral. If the top and right edges join at 90 degrees, the Pareto-frontier would be the point on the corner, and the set of “no strict Pareto improvements” would be the top and the right edges.

If that corner is obtuse, then both “Pareto frontier” and “no strict Pareto improvements” agree that both line edges are within the set, and if the corner is acute, then both “Pareto frontier” and “no strict Pareto improvements” agree that only the corner is within the set. It actually isn’t much of a difference, it only manifests when the utilities for a player are exactly equal, and is easily changed by a little bit of noise.

The utility-approximation issue you pointed out seems to be pointing towards the impossibility of guaranteeing limiting to a point on the Pareto frontier (when you make the cell size smaller and smaller), precisely because of that “this set is unstable under arbitrarily small noise” issue.

But, the “set of all points that have no strict Pareto improvements by more than \(\delta\) for all players”, ie, the \(\delta\)-fuzzed version of “set of points with no strict pareto improvement”, does seem to be robust against a little bit of noise, and doesn’t require the Pareto-improvement condition on everyone’s ranking of cells.

So I’m thinking that if that’s all we can attain (because of the complication you pointed out), then it lets us drop that inelegant Pareto-improvement condition.

I’ll work on the proof that for sufficiently small cell size \(\epsilon\), you can get an outcome within \(\delta\) of the set of “no strict Pareto improvements available”

Nice job spotting that flaw.


Intermediate update:

The handwavy argument about how you’d get propositional inconsistency in the limit of imposing the constraint of “the string cannot contain \(a\wedge b\wedge c...\to\neg\phi\) and \(a\) and \(b\) and… and \(\phi\)

is less clear than I thought. The problem is that, while the prior may learn that that constraint applies as it updates on more sentences, that particular constraint can get you into situations where adding either \(\phi\) or \(\neg\phi\) leads to a violation of the constraint.

So, running the prior far enough forward leads to the probability distribution being nearly certain that, while that particular constraint applied in the past, it will stop applying at some point in the future by vetoing both possible extensions of a string of sentences, and then less-constrained conditions will apply from that point forward.

On one hand, if you don’t have the computational resources to enforce full propositional consistency, it’s expected that most of the worlds you generate will be propositionally inconsistent, and midway through generating them you’ll realize that some of them are indeed propositionally inconsistent.

On the other hand, we want to be able to believe that constraints capable of painting themselves into a corner will apply to reality forevermore.

I’ll think about this a bit more. One possible line of attack is having \(\mathbb{P}(\phi)\) and \(\mathbb{P}(\neg\phi)\) not add up to one, because it’s possible that the sentence generating process will just stop cold before one of the two shows up, and renormalizing them to 1. But I’d have to check if it’s still possible to \(\varepsilon\)-approximate the distribution if we introduce this renormalization, and to be honest, I wouldn’t be surprised if there was a more elegant way around this.

EDIT: yes it’s still possible to \(\varepsilon\)-approximate the distribution in known time if you have \(\mathbb{P}(\phi)\) refer to \(\frac{probability to encounter \phi first}{1-probability to halt first}\), although the bounds are really loose. This is because if most of the execution paths involve halting before the sentence is sampled, \(\varepsilon\)-error in the probability of sampling \(\phi\) first will get blown up by the small denominator.

Will type up the proof later, but it basically proceeds by looking at the probability mass associated with “sample the trivial constraint that accepts everything, and sample it again on each successive round”, because this slice of probability mass has a near-guarantee of hitting \(\phi\), and then showing that even this tiny slice has substantially more probability mass than the cumulative probability of ever sampling a really rare sentence or not hitting any of \(\phi\), \(\neg\phi\), or the string terminating.


A summary that might be informative to other people: Where does the \(\omega(\frac{2}{3})\) requirement on the growth rate of the “rationality parameter” \(\beta\) come from?

Well, the expected loss of the agent comes from two sources. Making a suboptimal choice on its own, and incurring a loss from consulting a not-fully-rational advisor. The policy of the agent is basically “defer to the advisor when the expected loss over all time of acting (relative to the optimal move by an agent who knew the true environment) is too high”. Too high, in this case, cashes out as “higher than \(\beta(t)^{-1}t^{-1/x}\)”, where t is the time discount parameter and \(\beta\) is the level-of-rationality parameter. Note that as the operator gets more rational, the agent gets less reluctant about deferring. Also note that t is reversed from what you might think, high values of t mean that the agent has a very distant planning horizon, low values mean the agent is more present-oriented.

On most rounds, the agent acts on its own, so the expected all-time loss on a single round from taking suboptimal choices is on the order of \(\beta(t)^{-1}t^{-1/x}\), and also we’re summing up over about t rounds (technically exponential discount, but they’re similar enough). So the loss from acting on its own ends up being about \(\beta(t)^{-1}t^{(x-1)/x}\).

On the other hand, delegation will happen on at most ~\(t^{2/x}\) rounds, with a loss of \(\beta(t)^{-1}\) value, so the loss from delegation ends up being around \(\beta(t)^{-1}t^{2/x}\).

Setting these two losses equal to each other/minimizing the exponent on the t when they are smooshed together gets you x=3. And then \(\beta(t)\) must grow asymptotically faster than \(t^{2/3}\) to have the loss shrink to 0. So that’s basically where the 2/3 comes from, it comes from setting the delegation threshold to equalize long-term losses from the AI acting on its own, and the human picking bad choices, as the time horizon t goes to infinity.


I don’t believe that \(x_{:n}^{!k}\) was defined anywhere, but we “use the definition” in the proof of Lemma 1.

As far as I can tell, it’s a set of (j,y) pairs, where j is the index of a hypothesis, and y is an infinite history string, rather like the set \(h^{!k}\).

How do the definitions of \(h^{!k}\) and \(x^{!k}_{:n}\) differ?


by Vanessa Kosoy 535 days ago | link

Hi Alex!

The definition of \(h^{!k}\) makes sense for any \(h\), that is, the superscript \(!k\) in this context is a mapping from finite histories to sets of pairs as you said. In the line in question we just apply this mapping to \(x_{:n}\) where \(x\) is a bound variable coming from the expected value.

I hope this helps?







[Note: This comment is three
by Ryan Carey on A brief note on factoring out certain variables | 0 likes

There should be a chat icon
by Alex Mennen on Meta: IAFF vs LessWrong | 0 likes

Apparently "You must be
by Jessica Taylor on Meta: IAFF vs LessWrong | 1 like

There is a replacement for
by Alex Mennen on Meta: IAFF vs LessWrong | 1 like

Regarding the physical
by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think that we should expect
by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think I understand your
by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

This seems like a hack. The
by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

After thinking some more,
by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yes, I think that we're
by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

My intuition is that it must
by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

To first approximation, a
by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Actually, I *am* including
by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yeah, when I went back and
by Alex Appel on Optimal and Causal Counterfactual Worlds | 0 likes

> Well, we could give up on
by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes


Privacy & Terms