Intelligent Agent Foundations Forum
All Mathematicians are Trollable: Divergence of Naturalistic Logical Updates
post by Abram Demski 1138 days ago | Jessica Taylor, Patrick LaVictoire, Scott Garrabrant and Vladimir Slepnev like this | 1 comment

The post on naturalistic logical updates left open the question of whether the probability distribution converges as we condition on more logical information. Here, I show that this cannot always be the case: for any computable probability distribution with naturalistic logical updates, we can show it proofs in an order which will prevent convergence. In fact, at any time, we can drive the probability of \(x\) up or down as much as we like, for a wide variety of sentences \(x\).

As an aid to intuition, I describe the theorem informally as “all mathematicians are trollable”. I was once told that there was an “all mathematicians go to Bayesian hell” theorem, based on the fact that a computable probability distribution must suffer arbitrarily large log-loss when trying to model mathematics. The idea here is similar. We are representing the belief state of a mathematician with a computable probability distribution, and trying to manipulate that belief state by proving carefully-selected theorems to the mathematician.

(ETA: A way around this result has now been found, giving us an untrollable mathematician.)

\(\Box p\) represents “p is provable”, but we consider a class of proofs which the mathematician believes are sound; this justifies characterizing the mathematician’s beliefs with the provability logic \(GLS\) (as in naturalistic logical updates) rather than \(GL\), so the mathematician’s beliefs will obey the principle \(\Box p \rightarrow p\). For concreteness, we can suppose that the mathematician accepts proofs in \(PA\).

Definition. A probability assignment over a language \(L\) is a function from sentences of that language to values in the interval \([0,1]\).

This is, of course, an extremely weak definition. We will generally also want coherence with respect to some logic. Where a logic \(A\) is a set of axioms and inference rules over \(L\), we define coherence as usual:

Definition. A probability assignment \(P\) over \(L\) is coherent with respect to \(A\) if and only if:

  1. \(\vdash a\) implies \(P(a)=1\) and \(a \vdash \bot\) implies \(P(a)=0\),
  2. whenever \(a \vdash b\), we have \(P(a) \leq P(b)\), and
  3. \(P(a) + P(b) = P(a \vee b) + P(a \wedge b)\).

This is equivalent to a number of other common definitions of coherence. For a computable probability assignment, coherence with \(PA\) won’t be possible. Requirement (1) alone is impossible: the sentences which can be proven from \(PA\) and those which can be refuted are recursively inseparable sets.
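To make the three conditions concrete, here is a toy illustration of my own (purely propositional, nothing to do with \(PA\) or \(GLS\)): a probability assignment induced by a distribution over truth-value assignments automatically satisfies conditions (2) and (3).

```python
from itertools import product

# Toy illustration (my own): a probability assignment over propositional
# formulas in two letters, induced by a distribution over truth-value
# assignments ("worlds"). Formulas are nested tuples.

ATOMS = ["a", "b"]
WORLDS = list(product([False, True], repeat=len(ATOMS)))
WEIGHTS = dict(zip(WORLDS, [0.1, 0.2, 0.3, 0.4]))  # an arbitrary prior

def holds(formula, world):
    """Evaluate a formula at a world."""
    env = dict(zip(ATOMS, world))
    op = formula[0]
    if op == "atom":
        return env[formula[1]]
    if op == "not":
        return not holds(formula[1], world)
    if op == "and":
        return holds(formula[1], world) and holds(formula[2], world)
    if op == "or":
        return holds(formula[1], world) or holds(formula[2], world)
    raise ValueError(op)

def P(formula):
    """Probability = total weight of the worlds where the formula holds."""
    return sum(WEIGHTS[w] for w in WORLDS if holds(formula, w))

a, b = ("atom", "a"), ("atom", "b")

# Condition (3): P(a) + P(b) == P(a or b) + P(a and b)
assert abs((P(a) + P(b)) - (P(("or", a, b)) + P(("and", a, b)))) < 1e-12

# Condition (2): a and b entails b, so P(a and b) <= P(b)
assert P(("and", a, b)) <= P(b)
```

Condition (1) is where the trouble starts: checking \(\vdash a\) for \(PA\) is exactly what a computable assignment cannot do, per the inseparability argument above.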

We can, however, insist that the probability assignment is coherent with respect to \(GLS\). To be more precise, we can translate all the axioms and inference rules of \(GLS\) over to the language of \(PA\) by interpreting the modal provability operator \(\Box\) as a provability predicate in \(PA\). I will keep using the box-notation for simplicity, and will refer to \(GLS\)-derivability, \(GLS\)-coherence, and so on; I always mean it in this translated sense.1 Since the consequence relation of \(GLS\) is decidable, we don’t run into the same problem as we do for coherence with respect to \(PA\); I showed previously that this allows a probability distribution to be constructed. In most respects, this is much weaker than coherence with respect to \(PA\); however, it validates the soundness of proofs in \(PA\), which \(PA\) itself cannot.

The motivation for \(GLS\)-coherence in my previous post was reflective consistency: when a new mathematical fact is proven, we want to take it into account in a way which our past self would endorse. More precisely, I proposed that in the absence of interactions between impossible possible worlds (such as counterfactual mugging with a logical coin), the policy for responding to logical information which looks best before getting a proof one way or the other should still look like the best way of responding after a proof is found. I also argued something similar from logical Dutch Book arguments, though I hadn’t found the \(GLS\) recipe to satisfy the desiderata yet. These decision-theoretic arguments still need to be dealt with in more detail, but they suggested to me the goal of making a computable probability assignment in which Bayesian updates on proofs take the place of the “second update” usually associated with logical uncertainty. The \(GLS\)-coherent prior which I constructed was the result.

In order for this approach to be very appealing, it seems that updating on proofs in this way should have other good properties which we have established for logical uncertainty, such as uniform coherence. The assignment which was given last time does not achieve such good properties, but perhaps some \(GLS\)-coherent probability assignment will. The major point of this post is to show a difficulty in achieving uniform coherence with such an approach: it is not possible for \(GLS\)-coherent Bayesian updates on proofs to have convergence guarantees when proofs are presented in arbitrary order. This does not rule out convergence for “good” proof orderings (indeed, perhaps there is a large class of “good” orderings for which we can converge); however, it is a bad sign. If convergence turns out to be impossible or very fragile, uniform coherence would be as well.

Translating back to the story about “trolling mathematicians”, the idea is to show that if someone is choosing proofs adversarially, a mathematician with \(GLS\)-coherent beliefs can be made to form arbitrarily strong conjectural beliefs for or against a target proposition \(x\) (so long as this proposition has not yet been proved or disproved); these beliefs can be made to oscillate as desired. The adversary can do this without ever lying to the mathematician.

The initial idea of the proof is that because a computable probability assignment cannot be coherent with respect to \(PA\), any statement \(x\) has consequences in \(PA\) which are currently unrecognized. Showing the mathematician a proof of those unexpected consequences should reduce belief in \(x\), since it eliminates some of the ways \(x\) could be true (removing \(x\) as a possibility in situations where a consequence \(y\) fails). (A real-life example of this is that learning about the Banach-Tarski paradox may reduce belief in the axiom of choice.) Furthermore, since learning \(x \rightarrow y\) tends to decrease belief in \(x\) and increase it in \(y\), we can reduce the probability of a statement arbitrarily far by putting it as \(x\) and finding many \(y\) which it implies, or we can increase the probability of a statement by taking it to be \(y\) and finding many \(x\) which imply it.
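The mechanism in the last sentence can be checked on a toy example of my own (plain possible-world probabilities, not the \(GLS\) setting of this post): conditioning on an implication \(x \rightarrow y\) lowers \(P(x)\) and raises \(P(y)\), provided the prior puts some mass on the world where \(x\) holds and \(y\) fails.

```python
# Toy check (my own illustration, not the post's GLS machinery):
# a Bayesian update on "x -> y" lowers P(x) and raises P(y).

# Worlds as (x, y) truth-value pairs with a uniform prior.
prior = {
    (True, True): 0.25,
    (True, False): 0.25,   # the world ruled out by x -> y
    (False, True): 0.25,
    (False, False): 0.25,
}

def prob(dist, pred):
    """Total probability of the worlds satisfying pred."""
    return sum(p for w, p in dist.items() if pred(w))

# Update on "x -> y", i.e. on (not x) or y.
implication = lambda w: (not w[0]) or w[1]
z = prob(prior, implication)
posterior = {w: (p / z if implication(w) else 0.0) for w, p in prior.items()}

p_x_before, p_x_after = prob(prior, lambda w: w[0]), prob(posterior, lambda w: w[0])
p_y_before, p_y_after = prob(prior, lambda w: w[1]), prob(posterior, lambda w: w[1])

assert p_x_after < p_x_before   # belief in x drops: 1/2 -> 1/3
assert p_y_after > p_y_before   # belief in y rises: 1/2 -> 2/3
```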

The proof doesn’t quite go through that simply, however; we will actually manipulate the probability of the statement \(\neg \Box \neg x\), rather than manipulating \(x\) directly. \(\neg \Box \neg x\) is abbreviated as \(\Diamond x\). I also simplify the proof by choosing \(y\) to be a theorem of \(PA\), which makes \(x \rightarrow y\) a theorem automatically. Where \(\neg x\) is a theorem, I will call \(x\) an anti-theorem.

Lemma. Consider a computable probability assignment \(P\) over the sentences of \(PA\), such that \(P\) is coherent with respect to \(GLS\). For any \(x\) such that \(P(\Diamond x) \neq 0\), there exists a theorem of \(PA\), \(y\), such that \(P(\Diamond y | \Diamond x) \leq 1/2\).

Proof. The expression \(P(\Diamond y | \Diamond x) > 1/2\) cannot act as a separator between \(y\) which are theorems and \(y\) which are anti-theorems of \(PA\), since those sets are recursively inseparable. Due to \(GLS\)-coherence, and assuming \(P(\Diamond x) \neq 0\), there will already be some theorems of \(PA\) for which this expression has value 1 and anti-theorems for which it is 0; for example, \(P(\Diamond (a \vee \neg a) | \Diamond x)=1\) for any sentence \(a\), and \(P(\Diamond (a \wedge \neg a) | \Diamond x)=0\). Therefore, to prevent separation, there must be either an anti-theorem \(c\) with \(P(\Diamond c | \Diamond x) > 1/2\), or a theorem \(t\) with \(P(\Diamond t | \Diamond x) \leq 1/2\).

Suppose there’s an anti-theorem \(c\) with \(P(\Diamond c | \Diamond x) > 1/2\). Then \(P(\neg \Diamond c | \Diamond x) \leq 1/2\). \(\neg \Diamond c = \Box \neg c\), and \(\Box \neg c\) implies \(\neg \Box c\) in \(GLS\) (since \(\neg \Box \bot\) and also \(((\Box c) \wedge (\Box \neg c)) \rightarrow \Box \bot\)). \(\neg \Box c\) is \(GLS\)-equivalent to \(\Diamond \neg c\). By \(GLS\)-coherence, \(P(\Diamond \neg c | \Diamond x) \leq P(\neg \Diamond c | \Diamond x)\). But this means that \(\neg c\) is a \(PA\)-theorem with \(P(\Diamond \neg c | \Diamond x) \leq 1/2\).

Therefore, in either case, there is a \(PA\)-theorem \(y\) such that \(P(\Diamond y | \Diamond x) \leq 1/2\). \(\blacksquare\)

For the next theorem, I also need a condition which I will call weak non-dogmatism:

Definition. A probability assignment on the language of \(PA\) is weakly non-dogmatic if it does not assign probability zero to any theorem of \(PA\).

The \(GLS\)-coherent prior which I constructed previously is weakly non-dogmatic; I take this to be a very reasonable requirement which is not at all difficult to meet.

Note that weak non-dogmatism together with \(GLS\)-coherence further implies that theorems of the combined system \(PA+GLS\) have nonzero probability.

Theorem 1. For any computable probability assignment \(P\) over the sentences of \(PA\) which is \(GLS\)-coherent and weakly non-dogmatic, for any \(x\) such that \(1 > P(\Diamond x) > 0\), there exists a theorem \(t\) of \(PA\) such that \(P(\Diamond x | \Box t) / P(\neg \Diamond x | \Box t) \leq \frac{1}{2} P(\Diamond x) / P(\neg \Diamond x)\); that is, the odds of \(\Diamond x\) are at least halved given that \(t\) is a theorem.

Proof. We will show this with the theorem \(t = x \rightarrow y\), with \(y\) chosen via the lemma. \(x \rightarrow y\) is a theorem of \(PA\), since \(y\) is. So, we want to show \(P(\Diamond x | \Box (x \rightarrow y)) / P(\neg \Diamond x | \Box (x \rightarrow y)) \leq \frac{1}{2} P(\Diamond x) / P(\neg \Diamond x)\). The quantities are well-defined thanks to weak non-dogmatism and the assumption \(1 > P(\Diamond x) > 0\). Simplifying:

\(\Leftrightarrow P(\Diamond x \wedge \Box (x \rightarrow y)) / P(\neg \Diamond x \wedge \Box (x \rightarrow y)) \leq \frac{1}{2} P(\Diamond x) / P(\neg \Diamond x)\)

\(\Leftrightarrow P(\Diamond x \wedge \Box (x \rightarrow y)) / P(\neg \Diamond x) \leq \frac{1}{2} P(\Diamond x) / P(\neg \Diamond x)\) (The conjunct \(\Box (x \rightarrow y)\) in the denominator was redundant, since \(\neg \Diamond x\) means \(\Box \neg x\), which already implies \(\Box (x \rightarrow z)\) for any \(z\).)

\(\Leftrightarrow P(\Diamond x \wedge \Box (x \rightarrow y)) \leq \frac{1}{2} P(\Diamond x)\)

So, we want to show that \(\Diamond x \wedge \Box (x \rightarrow y)\) is at most half as probable as \(\Diamond x\).

But we already know \(P(\Diamond y | \Diamond x) \leq \frac{1}{2}\), which means \(P(\Diamond y \wedge \Diamond x) \leq \frac{1}{2} P(\Diamond x)\). \((\neg \Box \neg x \wedge \Box (x \rightarrow y))\) implies \(\neg \Box \neg y\), since if we could prove \(\neg y\), we could combine that with a proof of \(x \rightarrow y\) to prove \(\neg x\). So by \(GLS\) coherence, \(P(\Diamond x \wedge \Box (x \rightarrow y)) \leq P(\Diamond y \wedge \Diamond x)\). So we have \(P(\Diamond x \wedge \Box (x \rightarrow y)) \leq \frac{1}{2} P(\Diamond x)\), which is what was to be proved. \(\blacksquare\)

This theorem allows us to cut the odds of \(\Diamond x\) in half (or more) by a set formula. Since the result of updating on \(\Box (x \rightarrow y)\) will itself be a \(GLS\)-coherent probability assignment (thanks in part to weak non-dogmatism), we can keep doing this, cutting the probability of \(\Diamond x\) down as much as we like.
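The odds arithmetic in the proof can be sanity-checked numerically. The numbers below are toy values of my own: assume \(P(\Diamond x) = 0.8\), and suppose the lemma has supplied a theorem \(y\), with \(t = x \rightarrow y\), such that \(P(\Diamond x \wedge \Box t)\) sits exactly at the bound \(\frac{1}{2} P(\Diamond x)\).

```python
# Numeric sanity check (toy values, my own) of the odds bound in Theorem 1.

p_dx = 0.8                       # prior P(◇x), an assumed toy value
p_not_dx = 0.2                   # prior P(¬◇x)
p_dx_and_bt = 0.5 * p_dx         # P(◇x ∧ □t), at the lemma's bound
p_not_dx_and_bt = p_not_dx       # ¬◇x = □¬x already entails □(x -> y)

prior_odds = p_dx / p_not_dx                      # 4.0
posterior_odds = p_dx_and_bt / p_not_dx_and_bt    # 2.0

# Conditioning on □t at least halves the odds of ◇x.
assert posterior_odds <= 0.5 * prior_odds
```

The denominator is left untouched by the conditioning, which is exactly the simplification made in the proof; all of the probability-mass loss falls on \(\Diamond x\).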

Corollary 1. For \(x\) such that \(1 > P(\Diamond x) > 0\), we can drive \(P(x)\) as low as we like.

Proof. \(GLS\) proves \(\Box \neg x \rightarrow \neg x\), so it also proves \(x \rightarrow \neg \Box \neg x\). Since \(GLS\) has modus ponens as an inference rule, this means \(x \vdash \Diamond x\), so \(P(x) \leq P(\Diamond x)\). Since we can iterate the trick in Theorem 1 to drive down the probability of \(\Diamond x\) arbitrarily far, we can drive down \(P(x)\) as well. \(\blacksquare\)

Corollary 2. If we have \(1 > P(\Box x) > 0\), we can drive \(P(x)\) as high as we like.

Proof. By corollary 1, we can drive \(P(\neg x)\) down as far as we like in this case. \(\blacksquare\)

The two corollaries identify cases when a “troll” can make a mathematician’s belief arbitrarily low or high. The next theorem shows that we can combine these tricks to make a mathematician’s beliefs waver back and forth forever.

Theorem 2. Consider a sequence of weakly non-dogmatic \(GLS\)-coherent probability assignments \(P_1, P_2, P_3, ...\) such that \(P_{n+1}\) is a Bayesian update of \(P_n\) on some sentence \(\Box t\) with \(t\) being a theorem of \(PA\). We can insert additional updates on theorems, making a modified sequence \(P^*_n\), to ensure that for some \(x\), \(P^*_n(x)\) does not converge to any value.

Proof. There must be undecidable sentences \(u\) with both \(P(\Box u) > 0\) and \(P(\Box \neg u) > 0\). If there weren’t, then every undecidable sentence \(x\) would have either \(P(\Box x) = 0\) or \(P(\Box \neg x) = 0\). By weak non-dogmatism, theorems would have \(P(\Box x) > 0\), and anti-theorems would have \(P(\Box \neg x) > 0\). Sentences with both above zero would be guaranteed decidable one way or the other, so we could run a theorem prover on them and be sure that it would terminate. Sentences with both equal to zero would be definitely undecidable. We could separate the remaining theorems from anti-theorems with the predicate \(P(\Box x) > 0 \wedge P(\Box \neg x) = 0\). Thus, we would have a procedure for deciding whether a given \(x\) is a theorem or not in finite time, which is impossible.

Since \(\Box x\) implies \(\Diamond x\), we also have \(P(\Diamond u) > 0\), which means \(P(\Box \neg u) < 1\). Similarly, because \(\Box \neg x\) implies \(\Diamond \neg x\), we can infer \(P(\Diamond \neg u) > 0\) and hence \(P(\Box u) < 1\). So the sentences \(u\) meet the conditions of corollaries 1 and 2.

Furthermore, there will be some sentences which stay in this state forever. If not, then it would be the case that we eventually know at least one of \(\neg \Box u\) or \(\neg \Box \neg u\) for these \(u\). In fact, this means we eventually know one of those for any \(x\) whatsoever. But this would allow us to separate theorems and anti-theorems: simply wait for one of \(P(\Box x)\) or \(P(\Box \neg x)\) to become zero, grouping the first case with anti-theorems and the second with theorems. This procedure would always halt, and never put a theorem in the anti-theorem group or vice versa. This can’t happen, so there must be sentences whose state we stay uncertain about forever.

Therefore, we can cause a failure of convergence by driving the probability of these sentences up and down forever. Suppose, as in the theorem statement, that there is a sequence of weakly non-dogmatic \(GLS\)-coherent probability assignments \(P_1, P_2, P_3, ...\) such that \(P_{n+1}\) is a Bayesian update of \(P_n\) on some sentence \(\Box t\) with \(t\) being a theorem of \(PA\). Denote the sequence of theorems used as \(t_1, t_2, t_3, ...\). We can edit the sequence to ensure that there are sentences for which the probability assignment does not converge. Between \(t_{2^n}\) and \(t_{2^n+1}\) for each whole number \(n\), insert additional theorems as follows: find the first \(n\) sentences matching the conditions of both corollaries 1 and 2, use the strategy of corollary 1 to drive the odds of each of these sentences below \(2^{-n}\), and then apply the strategy of corollary 2 to drive their odds above \(2^n\).

This guarantees that some sentences will have probability assignments which go back and forth forever, coming arbitrarily close to both probability 1 and probability 0 as the probability assignment is conditioned on more information. \(\blacksquare\)
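The troll's schedule can be simulated abstractly. This sketch is my own: it tracks only the odds of a single target sentence, modeling each application of corollary 1 as a sequence of halvings and each application of corollary 2 as a sequence of doublings; the real construction realizes these multipliers by inserting proofs of carefully chosen theorems.

```python
# Abstract simulation (my own sketch) of the oscillation schedule in
# Theorem 2, tracking one target sentence's odds.

def troll_schedule(rounds):
    odds = 1.0
    lows, highs = [], []
    for n in range(1, rounds + 1):
        while odds > 2.0 ** (-n):   # corollary 1: repeatedly halve the odds
            odds /= 2
        lows.append(odds / (1 + odds))
        while odds < 2.0 ** n:      # corollary 2: repeatedly double the odds
            odds *= 2
        highs.append(odds / (1 + odds))
    return lows, highs

lows, highs = troll_schedule(10)
# The probability visits values arbitrarily close to 0 and to 1, so the
# belief in the target sentence cannot converge.
assert min(lows) < 0.01 and max(highs) > 0.99
```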

Are Naturalistic Updates to Blame?

I think some people might look at this and think that my unusual choice to update on \(\Box x\) rather than \(x\) and to use the provability logic \(GLS\) rather than the more usual \(GL\) might be the source of the difficulty. However, updating on theorems \(x\) directly rather than updating on their provability is just as bad (in fact a little worse). The following proof is due primarily to Scott Garrabrant (and was the seed idea for the proof of theorem 1):

Theorem 3. Consider a propositionally coherent probability assignment \(P\) to the language of \(PA\) which is computable and weakly non-dogmatic. For any sentence \(x\) with \(0<P(x)<1\), we can choose a sequence of theorems of \(PA\) to update on which send \(P(x)\) arbitrarily high or low.

Proof. There must be a theorem \(t\) of \(PA\) such that \(P(t|x) \leq \frac{1}{2}\), by the same separability argument used in the proof of the lemma. I want to establish that \(P(x | x \rightarrow t) / P( \neg x | x \rightarrow t) \leq \frac{1}{2} P(x) / P(\neg x)\). The expressions involved are well-defined thanks to weak non-dogmatism and the assumed range of \(P(x)\). Reducing, it suffices to show:

\(P(x \wedge (x \rightarrow t)) / P( \neg x \wedge ( x \rightarrow t)) \leq \frac{1}{2} P(x) / P(\neg x)\)

\(x \wedge (x \rightarrow t)\) is propositionally equivalent to \(x \wedge t\), and \(\neg x \wedge (x \rightarrow t)\) is propositionally equivalent to \(\neg x\), so it suffices to show:

\(P(x \wedge t) / P(\neg x) \leq \frac{1}{2} P(x) / P(\neg x)\)

\(\Leftrightarrow P(x \wedge t) \leq \frac{1}{2} P(x)\)

But we already know \(P(t|x) \leq \frac{1}{2}\), which is to say \(P(t \wedge x) \leq \frac{1}{2} P(x)\), completing the argument.

This allows us to cut the odds of \(x\) in half. A symmetric trick allows us to inflate the odds, using an anti-theorem \(c\) and updating on \(c \rightarrow x\). We can repeat both tricks as much as we like to drive the odds arbitrarily down or up. \(\blacksquare\)

So it is not exactly the formalism of naturalistic updates which creates the central problem; rather, it has more to do with representing a belief-state as a computable probability assignment and performing a Bayesian update to take new information into account. On the other hand, the details (such as updating on \(\Box t\) rather than \(t\)) do change the prognosis somewhat (changing the conditions under which we can manipulate a sentence’s probability); it might even be that there are mild variations which significantly improve things.


I’ve shown that logical uncertainty following the school of naturalistic logical updates is “trollable”, i.e., can be manipulated by another entity selectively revealing mathematical proofs. This makes convergence (at best) dependent on proof order, which seems like a bad sign: logical uncertainty results so far haven’t needed to make many assumptions about the theorem provers being used.

Although convergence/divergence is a nice formal property to check, one might argue that it doesn’t matter too much what we believe about undecidable sentences. Decidable sentences will eventually be proved or refuted, and so their probabilities will always converge. However, the proof also demonstrates that we can manipulate the probabilities of a large class of sentences before those sentences are proved or refuted. This suggests that adversarial proof orderings can cause a \(GLS\)-agent to be arbitrarily wrong about sentences which are later proved/refuted.

The main concern is not so much whether \(GLS\)-coherent mathematicians are trollable as whether they are trolling themselves. Vulnerability to an external agent is somewhat concerning, but the existence of misleading proof-orderings brings up the question: are there principles we need to follow when deciding what proofs to look at next, to avoid misleading ourselves?

How damning is this for naturalistic logical updates, as a line of research? Speaking for myself, I find it concerning but not yet totally damning. The way I see it, this puts untrollability at odds with (a specific form of) reflective coherence: an agent shifting its mathematical uncertainty according to something other than a Bayesian update on proofs would want to self-modify, but an agent which uses a Bayesian update is more vulnerable to trolls. (“Trollability is bad, but if I can get a higher expected utility by editing myself to be more vulnerable to trolls, who cares?”) I’m still somewhat hopeful that a probability assignment can converge on a broad class of proof-orderings. For example, it would be very nice if there were a \(GLS\)-coherent distribution which was not trollable by any proof-ordering within its own range of computational complexity (“successful trolls need more processing power than their targets”). I think a result like that is unlikely, at present, but I haven’t ruled it out yet. We could then guarantee convergence by allocating more computational resources to the logical uncertainty than to the theorem prover (or, something resembling this).

One comment I got about this, in a discussion at MIRI (I forget who made the remark) was that it’s a problem of updating on information without taking into account the source of the information. Just as you would draw different conclusions about the frequencies of colors in a ball pit if you knew someone was selecting balls at random to show you vs if you thought they were selecting by some other policy, you should be able to model the proof-selecting strategy of a troll; believing with high probability that they will be showing you more evidence against \(x\), you would be inoculated against needing to actually adjust beliefs (unless the evidence was surprisingly compelling). I haven’t figured out a good way to model this yet, though.

On the MIRIx slack, Miëtek Bak discussed explicit provability logic (papers: 1, 2). The idea is that reasoning about the existential statement “there exists a proof”, in the form of “\(\Box\)”, causes unnecessary problems in reflection due to its non-constructive nature. Instead, we can consider a system which has explicit terms for proofs, with terms \(p:s\) (“\(p\) is a proof of \(s\)”) taking the place of \(\Box s\). Like \(GLS\), the resulting system can conclude the raw statement from the statement of provability; it endorses all instances of \((p:s) \rightarrow s\). Like \(GL\), provability is reflective, representing provability in-this-very-logic. This may make explicit provability a good candidate for a variant of naturalistic logical updates. I think there may be reflective consistency considerations which point in this direction as well: my argument for naturalistic logical updates involves the agent considering the future, but makes a perhaps problematic leap from the agent observing a proof to the agent observing abstract provability, “\(\Box\)”. (This is especially questionable given that the agent is not supposed to be making first-order inferences like that automatically.)

Overall, naturalistic logical updates stand or fall with the strength of the reflective consistency argument which I’m currently claiming leads us to them. I haven’t articulated this well enough yet, I think. (But, see logical Dutch Book arguments, logical counterfactuals consistent under self-modification, and naturalistic logical updates for my arguments as they stand.)

  1. This still leaves details to the imagination, so to be yet more precise: consider a (computable) encoding of sentences \(s\) of the language of \(PA\) into numerals \(\ulcorner s \urcorner\), and another encoding \(\llcorner p \lrcorner\) of proofs \(p\) of \(PA\) as numerals. There exists a predicate in \(PA\) which checks whether \(p\) is a proof of \(s\); denote it as \(Proves(\llcorner p \lrcorner, \ulcorner s \urcorner)\). Also, we will use a function \(f()\) giving unique propositional sentence-letters of \(GLS\) for each expression of \(PA\); this function must have the property that syntactically distinct expressions always get distinct propositional sentence-letters (no “hash collisions”), while identical expressions are given the same sentence-letter. We encode statements from \(PA\) into \(GLS\) by a recursive translation \(Tr()\):
  • \(Tr[ \neg a ] = \neg Tr[a]\)
  • \(Tr[ a \wedge b ] = Tr[a] \wedge Tr[b]\)
  • \(Tr[ a \vee b ] = Tr[a] \vee Tr[b]\)
  • \(Tr[ \exists x . Proves(x, \ulcorner a \urcorner)] = \Box Tr[a]\)
  • Otherwise, \(Tr[a] = f(a)\)

The abuse of language whereby I talk about consequence and coherence “in \(GLS\)”, but concerning the sentences of \(PA\), can now be justified by translating \(PA\) sentences into \(GLS\) to check the consequence relation. Notice that since identical sub-expressions always get translated to the same sentence-letters in \(GLS\), we can validate \(GLS\) principles like \(a \vee \neg a\) and \(a \vdash a\) for all sentences \(a\) in \(PA\); however, we make quantifiers totally opaque, failing to recognize even simple equivalences such as \((\forall x . x=x) \leftrightarrow (\forall y . y=y)\).
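The translation \(Tr\) can be sketched in code. The tuple encoding of formulas and the letter-naming scheme below are my own assumptions; the dictionary plays the role of \(f()\), handing out distinct fresh sentence-letters to syntactically distinct untranslated expressions while identical expressions share one.

```python
# Sketch (my own encoding) of the footnote's translation Tr from PA
# sentences to GLS. Formulas are nested tuples; ("provable", a) stands
# for the PA formula "there exists x such that Proves(x, <a>)".

letters = {}  # plays the role of f(): expression -> sentence-letter

def fresh_letter(expr):
    """Distinct expressions get distinct letters; identical ones share."""
    if expr not in letters:
        letters[expr] = f"p{len(letters)}"
    return letters[expr]

def Tr(expr):
    op = expr[0]
    if op == "not":
        return ("not", Tr(expr[1]))
    if op in ("and", "or"):
        return (op, Tr(expr[1]), Tr(expr[2]))
    if op == "provable":
        return ("box", Tr(expr[1]))
    # Anything else -- atomic or quantified formulas -- is opaque.
    return ("atom", fresh_letter(expr))

# Identical subexpressions share a letter, so GLS validates a ∨ ¬a ...
a = ("forall", "x", ("eq", "x", "x"))
assert Tr(("or", a, ("not", a))) == ("or", ("atom", "p0"), ("not", ("atom", "p0")))
# ... but alphabetic variants get distinct letters: quantifiers are opaque.
b = ("forall", "y", ("eq", "y", "y"))
assert Tr(b) != Tr(a)
```

The last two assertions exhibit exactly the two properties noted above: propositional structure and the provability predicate are preserved, while even trivially equivalent quantified sentences like \((\forall x . x=x)\) and \((\forall y . y=y)\) are treated as unrelated letters.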

by Alex Appel 1091 days ago | link

Quick question: It is possible to drive the probability of x down arbitrarily far by finding a bunch of proofs of the form “x implies y” where y is a theorem. But the exact same argument applies to not x.

If the theorem-prover always finds a proof of the form “not x implies y” immediately afterwards, the probability wouldn’t converge, but it would fluctuate within a certain range, which looks good enough.

What, if any, conditions need to be imposed on the theorem prover to confine the probabilities assigned to an unprovable statement to a range that is narrower than (0, 1)?





