 Logical Inductor Tiling and Why it's Hard
post by Alex Appel 409 days ago | Sam Eisenstat and Abram Demski like this | discuss

(Tiling result due to Sam, exposition of obstacles due to me)

# Definitions:

$$\phi(\underline{x})$$ is a sentence with a specific number plugged in for $$x$$.

$$A^{\omega}$$ is the set of all infinite action sequences. For convenience, the set of actions will be taken to be $$\{0,1\}$$.

$$U:A^{\omega}\to[0,1]$$ is a utility function mapping an infinite sequence of actions to a utility, that is continuous and computable. There are some implicit assumptions in this simple definition, which will be discussed later.

Let $$\delta:\mathbb{N}^{+}\to(0,1]$$ be some efficiently computable, monotonically decreasing function that is provably greater than the modulus of continuity of the utility function. More simply, if the prefix $$A_{:t}$$ of the first $$t$$ actions is always enough to pin down the utility to within $$\epsilon$$ (which always happens because the utility function is continuous), then $$\delta(t)\ge\epsilon$$.

We’ll be using a modification of the logical induction algorithm where all traders, for turn $$t$$ have access to an advice string consisting of the sequence of past actions. This is essential for giving the traders the ability to reliably bet on the sentence that represents the utility function, with the past actions plugged in, because the sequence of actions that the logical inductor takes might not be efficiently computable, so the traders would be unable to even find the right sentence to bet on.

$$U(\underline{p_{:t-1}},x_{t:f(t)},a)$$ is the LUV corresponding to what the utility function outputs. $$\underline{p_{:t-1}}$$ is a string of length $$t-1$$ representing the empirical past actions,$$x_{t:f(t)}$$ represents $$f(t)-t+1$$ actions selected by algorithm $$X$$, and $$a$$ represents the infinite future sequence of actions selected by the logical inductor algorithm $$A$$. LUV’s are defined by definition 4.8.1 of the logical induction paper Further theorems from the paper will be referenced, so it will be helpful to keep the paper open.

However, due to utility function continuity, $\mathbb{E}_{t}U(\underline{p_{:t-1}},x_{t:f(t)},a)\simeq_{t}\mathbb{E}_{t}U(\underline{p_{:t-1}},a)$ Therefore, we will need to define a renormalized utility $$\mathcal{U}$$ to have a nontrivial tiling result.

$\mathcal{U}(\underline{p_{:t-1}},0,x_{t+1:f(t)},a):=\frac{U(\underline{p_{:t-1}},a_{t},x_{t+1:f(t)},a)-U(\underline{p_{:t-1}},0,x_{t+1:f(t)},a)+\delta(t)}{2\delta(t)}$

An analogous definition applies for 1 in place of 0. In both cases, this is a LUV that always lies in the interval [0,1], by the definition of $$\delta(t)$$. Utility functions are invariant up to a scale and shift, so this is just a scale and shift of the original utility function. However, it is scaled so that the change in utility between different sequences of actions doesn’t necessarily converge to 0 as $$t\to\infty$$. If the agent will actually take action 0 on turn $$t$$, then this is $$\frac{1}{2}$$.

The action criterion for the agent will be:

$\mathbb{E}_{t}\mathcal{U}(\underline{p_{:t-1}},0,a)\ge\mathbb{E}_{t}\mathcal{U}(\underline{p_{:t-1}},1,a)$

will imply that the agent will take action 0, and action 1 otherwise.

# Theorem 1: Restricted Inductor Tiling

If $\forall x:\mathbb{E}_{t}\mathcal{U}(\underline{p_{:t-c}},\underline{x_{t-c+1:t-1}},a)\simeq_{t}\mathbb{E}_{t}\mathcal{U}(\underline{p_{:t-c}},\underline{x_{t-c+1:t-1}},\underline{a_{t}},a)$ for some constant $$c$$, and all $$2^c$$ instances of this asymptotic equality converge with a rate bounded above by some e.c. function, then for all e.c. algorithms $$B:\mathbb{N}^{+}\to\{0,1\}$$, $\mathbb{E}_{t}\mathcal{U}(\underline{p_{:t-1}},a)\ge_{t}\mathbb{E}_{t}\mathcal{U}(\underline{p_{:t-1}},b_{t:t+c},a)$

This corresponds to the inductor eventually learning to trust its own action sequence over those of any other e.c. sequence of actions for a constant number of steps, in the sense that the renormalized utility gain from switching to $$B$$ for action selection limits to 0 or becomes negative.

The key starting assumption effectively says that the agent will learn that the expected utility of selecting an action in the abstract is the same as the expected utility of selecting the specific action that it actually selects, even if a finite-sized chunk of the past action string is tampered with. Put another way, in the limit, if the agent assigns the best action an expected utility of 0.7, the agent will have its expected utility be 0.7, even if the agent went off-policy for a constant number of steps beforehand.

This assumption can be violated in certain environments. More on this later.

# Proof:

(1): $\mathbb{E}_{t}\mathcal{U}(\underline{p_{:t-1}},a)\simeq_{t}\mathbb{E}_{t}\mathcal{U}(\underline{p_{:t-1}},\underline{a_{t}},a)$ By assumption.

(2): $\mathbb{E}_{t}\mathcal{U}(\underline{p_{:t-1}},\underline{a_{t}},a)\ge\mathbb{E}_{t}\mathcal{U}(\underline{p_{:t-1}},\underline{b_{t}},a)$ By the definition of the action criterion.

(3): $\mathbb{E}_{t}\mathcal{U}(\underline{p_{:t-1}},\underline{b_{t}},a)\simeq_{t}\mathbb{E}_{t}\mathcal{U}(\underline{p_{:t-1}},b_{t},a)$ $$\mathcal{U}(\underline{p_{:t-1}},\underline{b_{t}},a)$$ is an e.c. (with action advice) LUV-sequence (because the action of algorithm $$B$$ can be found in polynomial time by assumption), and $$\mathcal{U}(\underline{p_{:t-1}},b_{t},a)$$ is another e.c. (with advice) LUV-sequence. These two LUV’s are provably equal, so by Theorem 4.8.4 from the logical induction paper, the expectations limit to each other.

Now, we will begin a proof by induction up to $$c$$. Let $$k$$ be an integer less than $$c$$, and assume that $\mathbb{E}_{t}\mathcal{U}(\underline{p_{:t-1}},a)\ge_{t}\mathbb{E}_{t}\mathcal{U}(\underline{p_{:t-1}},b_{t:t+k},a)$ We have already shown this for $$k=0$$ as a base case, by lines 1-3. Now we will proceed to show that if this equation holds for $$k$$, it holds for $$k+1$$.

(4): $\mathbb{E}_{t}\mathcal{U}(\underline{p_{:t-1}},b_{t:t+k},a)\simeq_{t}\mathbb{E}_{t}\mathbb{E}_{t+k+1}\mathcal{U}(\underline{p_{:t-1}},b_{t:t+k},a)$ Theorem 4.12.1 from the logical induction paper.

(5): $\mathbb{E}_{t}\mathbb{E}_{t+k+1}\mathcal{U}(\underline{p_{:t-1}},b_{t:t+k},a)\simeq_{t}\mathbb{E}_{t}\mathbb{E}_{t+k+1}\mathcal{U}(\underline{p_{:t-1}},b_{t:t+k},a_{t+k+1},a)$ The e.c. sequences of LUV’s corresponding to the utility functions are always provably equal to each other. The tricky part is that the e.c. sequences of LUV’s corresponding to the future expectations limit to each other, but to apply Theorem 4.8.10, it is necessary that the difference between the LUV’s corresponding to the future expectations always be less than some constant $$b$$.

Let $$\overline{X_{b}}$$ be the e.c sequence of LUV’s corresponding to the future expectation of the utility function with $$b_{t}$$ as an explicit choice of action (left side of above), and $$\overline{Y_{b}}$$ be the e.c. sequence of LUV’s corresponding to the future expectation of the utility function with $$b_{t}$$ as an abstract action (right side of above).

The LUV sequence $$\overline{|X_{b}-Y_{b}|}$$ is bounded above by some e.c. sequence of numbers. This is because the function $$\frac{1}{t}$$ as an upper bound (as a specific example) would mean that, in the limit, the expectations limit to each other slowly enough, that a trader can make unbounded money by buying and selling the differently priced LUV’s for the utility functions. (I think. Haven’t checked this super-rigorously)

Therefore, by interpreting the e.c. sequence of numbers as a sequence of LUV’s (which are just numbers in [0,1]), we can come up with a corresponding LUV-sequence $$\overline{Z}$$.

Now, the LUV-combination sequence $$\overline{|X_{b}-Y_{b}|-Z}$$ will always be less than 0, so by Theorem 4.8.10, $$\mathbb{E}_{t}|X_{b,t}-Y_{b,t}|\le_{t}Z_{t}$$. Because $$Z_{t}$$ limits to 0, we then get asymptotic convergence of expected values at time $$t$$ for $$\overline{X_{b}}$$ and $$\overline{Y_{b}}$$, and (5) is proved.

(6): $\mathbb{E}_{t}\mathbb{E}_{t+k+1}\mathcal{U}(\underline{p_{:t-1}},b_{t:t+k},a_{t+k+1},a)\simeq_{t}\mathbb{E}_{t}\mathbb{E}_{t+k+1}\mathcal{U}(\underline{p_{:t-1}},\underline{b_{t:t+k}},a_{t+k+1},a))$ Same proof as step 5, specifically pertaining to being able to unpack the actions of $$B$$ from the abstract form to the specific action form, in an e.c. way.

(7): $\mathbb{E}_{t}\mathbb{E}_{t+k+1}\mathcal{U}(\underline{p_{:t-1}},\underline{b_{t:t+k}},a_{t+k+1},a)\simeq_{t}\mathbb{E}_{t}\mathbb{E}_{t+k+1}\mathcal{U}(\underline{p_{:t-1}},\underline{b_{t:t+k}},\underline{a_{t+k+1}},a)$ By assumption, we get that the inner expectations converge to each other. However, the argument of convergence of the outer expectations doesn’t work the same way as step 5, because I currently don’t see a clear argument for being able to reliably exploit the difference of the inner expectations, because the inner expectations just converge by assumption. Still, by assumption, the rate of convergence is bounded above by some e.c. function, so the argument from step 5 applies.

(8): $\mathbb{E}_{t}\mathbb{E}_{t+k+1}\mathcal{U}(\underline{p_{:t-1}},\underline{b_{t:t+k}},\underline{a_{t+k+1}},a)\ge_{t}\mathbb{E}_{t}\mathbb{E}_{t+k+1}\mathcal{U}(\underline{p_{:t-1}},\underline{b_{t:t+k+1}},a)$ By the action criterion, the inner expectations are always related by $$\ge$$, so Theorem 4.8.10 ensures that the outer expectations obey that in the limit.

(9): $\mathbb{E}_{t}\mathbb{E}_{t+k+1}\mathcal{U}(\underline{p_{:t-1}},\underline{b_{t:t+k+1}},a)\simeq_{t}\mathbb{E}_{t}\mathbb{E}_{t+k+1}\mathcal{U}(\underline{p_{:t-1}},b_{t:t+k+1},a)$ Same proof as step 6.

(10): $\mathbb{E}_{t}\mathbb{E}_{t+k+1}\mathcal{U}(\underline{p_{:t-1}},b_{t:t+k+1},a)\simeq_{t}\mathbb{E}_{t}\mathcal{U}(\underline{p_{:t-1}},b_{t:t+k+1},a)$ By Theorem 4.12.1.

(11): $\mathbb{E}_{t}\mathcal{U}(p_{:t-1},a)\ge_{t}\mathbb{E}_{t}\mathcal{U}(\underline{p_{:t-1}},b_{t:t+k},a)\ge_{t}\mathbb{E}_{t}\mathcal{U}(\underline{p_{:t-1}},b_{t:t+k+1},a)$ This is the induction step assumption, combined with the asymptotic inequalities and equalities on lines 4-10.

This then proves the induction step inequality for k+1, so we get trust of the inductors own actions over those of other e.c. sequences of actions, up to the $$c$$ from the original assumption (past that, step 7 stops working)

$$\Box$$

This proof is especially interesting because it implicitly reveals a lot of distinct and fundamental obstacles to getting tiling to work in full generality, by altering parts of the proof and seeing what breaks.

Assumptions of the Utility Function

The definition of the utility function implicitly assumes that the problem of counterfactuals has been figured out, and it also hand-waves away most of the interesting decision theory obstacles.

To begin with, notice that the utility is defined in functional form instead of as a LUV, and that the action criterion is defined by what the utility function outputs when it is given a specific action sequence, instead of by conditioning on a particular sequence of actions.

This is required because, when using standard conditioning, if the thing the inductor is conditioning on (a long action sequence) has a low probability, then expected value estimates conditional on that event will be very unstable. This especially applies where the sequence of actions is off-policy, such as the sequence of actions selected by program $$B$$, because now the only way to have them happen is a run of extraordinary luck with the exploration sentences.

But in this setting, since there is an explicit formula mapping action sequences to utilities, the inductor can go back and recompute what it should have done at various points in the past, to get lessons for later. However, since the environment is explicitly a function of the action alone, it can’t faithfully represent cases where the agent is being predicted by something. In a certain sense, this is a CDT setting with magical free will, because all the predictions of the agent must come out the same way, while the thing feeding actions in to the utility function can pick whatever it wants. So, while this can correctly model 5-and-10, and learns to cross Troll Bridge, and learns to not pay up in XOR blackmail (recomputing what would have happened if it didn’t pay up yields that it’s 1000 dollars richer) it will also learn to 2-box on Newcomb (going back and recomputing the actions, 2-boxing is always better), and can’t handle Counterlogical Mugging.

This setting, where the utility function has the type $$A^{\omega}\to[0,1]$$, corresponds to CDT. The standard logical inductor setting where the utility function is just an unknown and hard-to-compute constant in $$[0,1]$$ corresponds to EDT. And the setting of AsDT and policy selection corresponds to a utility function of type $$\pi\to[0,1]$$

So, this setting is assuming away most of the interesting decision theory problems where we’d be interested in how tiling works (or fails to work), in particular, cases where tiling and updatelessness interact.

Help I’m Lost in the Universe and Can’t Name Myself

Moving on, why did we need to assume that the inductor traders have access to the past string of actions? Well, otherwise, we can’t guarantee that the inductors can even find the sentences of interest. Inductors don’t “think about what to think about”, and the space of action sequences blows up exponentially, with no guarantee that the sequence of actions that are actually taken (and that we’re especially interested in) are an e.c. subsequence. Put more poetically, because a logical inductor is thinking about all possible math sentences, it’s a bit difficult to motivate trader interest in the specific math sentence that corresponds to the universe it’s in, unless you give the traders enough data to easily name that class of sentences.

There is, however, another way around this. What happens when we just lump the explicit list of past actions into an abstract $$p_{:t-d}$$, representing the string of past actions taken by the logical inductor algorithm, with a near-term buffer of $$d-1$$ explicitly remembered actions?

Then the tiling result only goes up to $$k=d-1$$ (possible off-by-one error here), regardless of how large $$c$$ is. This is because the “buffer” of explicitly remembered past must be shortened by k+1 around when we go to the future action in step 4, because we have to advance the “past forgotten actions” chunk forward enough to get step 8 to work out. Going back to the present time at step 10 then requires pulling the k+1 explicitly remembered actions back out of the “past forgotten actions”. But if you try going any further in the induction, then eventually the start of the string of actions by $$B$$ will have to get pulled into the “past forgotten actions” chunk, which can’t necessarily be done because it might be different from the logical inductor’s action on that timestep.

So that’s another obstacle. If you forget some past actions, and model the past by going “I guess I did me-like things according to my nature”, and you are thinking about switching over to some different algorithm $$B$$ for a stretch, and then think about what far-future you would think, and they’re far enough in the future to have forgotten the start of the switch-over, then you’ll think that future-you will be thinking about the world where “me instead of B” persists for a little while longer. And then, moving back to your current thoughts, it turns out you can’t necessarily move those opinions back to inform your current (extra-information) state.

To put it more generally, asking “what will future me think” stops being guaranteed-useful when future-you has predictably thrown away information that current-you has.

The Requirement of Transparent Action Selection

The logical inductor only trusts its own actions over those of efficiently-computable “blind” ways of action selection. What proof step fails when we try to generalize beyond that?

Specifically, step 6. Applying the action criterion requires the stretch of “actions selected by B” to be unpacked into explicit actions, and doing that requires the utility function (with all the explicit actions plugged in) to be part of an e.c. sequence. So, if the actions depend on the logical inductor’s own prices, the logical inductor can’t unpack them from the abstract to the concrete form, because it doesn’t know what its prices will be in the future.

In general, if the agent only trust processes when it is big enough to fully simulate out what they’ll do in advance, that’s a pretty severe restriction, that dodges almost all of the hard parts of tiling.

The Starting Assumption

What’s really going on with the assumption we made at the start? It is a sort of “assume limited off-policy competence” assumption.

A notable feature of logical inductors without exploration steps is that they’re pretty accurate at assessing the utility when they do what they usually do. They’re bad at assessing the utility when they do something they don’t usually (or ever) do. A similar thing happens with AIXI for certain choices of UTM. This is a very fundamental phenomenon, where an agent can’t effectively correct misperceptions of the utility of doing something it doesn’t do, because it doesn’t have the necessary data to learn that it’s wrong. And this especially applies to cases where the agent is pondering a rewrite of its source code.

Now, for the specific case where the past sequence of $$c$$ actions came from the logical inductor itself, and we don’t have to worry about issues of off-policy behavior, this is a sort of “no evil environments” requirement. It is possible to come up with an environment that simulates what the inductor will do, and penalizes that exact action, and rewards all others. In this case, the expected utility of the agent for taking an action in the abstract is 0, but the expected utility of the agent for taking any particular action (out of 0 and 1) is 1/2. In the limit, the inductor will assign about 50/50 probability to taking 0 and 1, and assigns 0.5 probability to the utility of each specific action. This is because the environment is in functional form, where it’s possible to go back and recompute what happens when you plug in the action you didn’t take.

There’s almost certainly other cases where these two utilities come apart, but for most reasonable environments, on turn $$t$$ when the agent is making a decision, the utility of the highest utility action will be equal to the agent’s expectation of utility for it taking an action in the abstract. (If it wasn’t, a trader could bet against the low-priced one, and for the high-priced one, and profit off of the agent actually taking the highest-priced action)

However, the starting assumption is stronger than that, because of the requirement that it hold for all $$c$$-or-less-sized off-policy pasts. Put another way, the inductor thinks that, in the world where the past went differently (and the logical inductor was using a different action advice string), the utility is (in expectation) equal to the utility of some specific action (in expectation). This is necessary to get step 7 to work out. The future logical inductor in the world where it has taken the actions advised by $$B$$ needs to be able to unpack “I take an action” to “I take this specific best action” without changing the expected utility. This is worrying, because an agent shouldn’t necessarily have to unpack “I take an action” into “I take this specific action” when considering future actions. This is essentially the same as the last problem, except it applies to the agent itself instead of the other algorithm it’s thinking of changing into.

Induction Failures and the Parable of the Long Distance Swimmer

Now, why did the starting assumption have to be for a constant number of steps? Well, since it has to apply for all sequences of past actions, there are $$2^c$$ asymptotic equalities that the inductor has to respect. You could try to replace $$c$$ with something that depends on $$t$$, like $$\frac{t}{2}$$, to get tiling to work out to $$t$$ extra steps on turn $$t$$. The problem is that that would require polynomially many asymptotic equalities to approximately hold on turn $$t$$ (because we don’t need to have the history condition work on all past histories, just e.c. ones to deal with a past history of selecting actions via $$B$$), and uniform convergence to 0 isn’t guaranteed when the inductor is dealing with a set of asymptotic equalities that grows over time. If it takes $$2^t$$ time for the $$t$$-th equality to approximately hold, and we add one new equality that has to approximately hold each day, we may never be in a case where all of them hold enough for $$t$$ induction steps to go through.

The induction proof (when unrolled) works by going “well, you trust yourself for one step, by reasoning about yourself one step in the future” and then “you trust yourself for two steps, by reasoning about yourself two steps in the future” and so on. For the “distance of trust” to hold for any appreciable period of time, it requires a strikingly high and uniform level of trust in all the individual time steps, because more time steps keep being added.

It’s similar to swimming to a very distant island that keeps receding. It’s possible to have a strong belief that one who just kept swimming would reach the island, while also not attempting swimming and staying on land (for less utility), because you don’t fully trust yourself on each individual step to keep swimming instead of deciding to drown for some unknown reason.

To translate this parable, swimming to a very distant island would be the best sequence of actions, the “keeps receding” refers to the length of the necessary sequence of actions, “just keep swimming” refers to some e.c. sequence of actions, and the rest is about a situation where the agent may not trust itself to equal or outperform the “keep swimming” strategy, because the induction proof of “I’ll always choose to keep swimming” doesn’t go through, due to uncertainty (non-equality, because we only get equality in the limit) in whether the agent will always keep swimming.

# Summary of Obstacles for Tiling

Some of these obstacles are dodged by this proof, some are still present, some are fairly minor and technical, while others are far more fundamental.

1: A utility function with type $$A^{\omega}\to[0,1]$$ corresponds to CDT, due to the ability to go back and plug in other actions and see how they did. It has the standard pitfalls of CDT as well.

2: EDT-style condition is unusually bad at considering explicit sequences of actions, because the probability of a particular action sequence drops off pretty fast, and the division by a very small number amplifies small changes in the numerator of the “conditional expected value” fraction.

3: Because logical inductors don’t think about what to think about, and may act in complex ways, in order for them to even refer to a particular setting of interest in which they are acting, they’ll either have to memorize their past actions, or define the setting of interest with a pointer to themselves that doesn’t change under rewrites of source code. If this isn’t done, we get an issue where the computation the agent cares about (utility function) decouples from the computation corresponding to what happens when the agent rewrites itself. This can be solved by having $$p_{:t-1}$$ be defined as the action string produced by having the $$n$$’th agent in the recursively defined self-modification sequence select the $$n$$’th action.

4: If the agent forgets information over time, predictions of what the future-agent will think was the best action may decouple from what the present-agent thinks is the best action, because the future-agent will predictably throw away some information that the present-agent has.

5: If the tiling proof relies on unpacking some algorithm into a sequence of explicit actions, then it will be a very limited tiling proof that is unfit for reasoning about tiling to anything of similar computational power, or reasoning about long sequences of actions. In particular, if the algorithm to select actions even looks at the environment, this proof strategy will fail because it will be very computationally intensive to fully simulate out the environment.

6: “Evil problems” where the environment punishes a certain kind of agent by penalizing the thing they predictably do (and where the environment won’t change what it’s simulating when the agent rewrites itself) incentivise self-modification.

7: The problem of “can’t get feedback from actions you don’t take” applies very strongly to problems involving the consequences of a rewrite of source code.

8: As a corollary to 5, if the tiling proof relies on unpacking the original algorithm itself into a plausible future action, it probably isn’t general enough. This especially applies to any proof that relies on repeatedly comparing your own actions to those of some other actions at all timesteps.

9: As the parable of swimming to a receding island shows, tiling trust for long periods of time easily fails because too many induction steps are required for the agent to be confident of finishing what it started. This has analogies to human behavior.

### NEW DISCUSSION POSTS

[Note: This comment is three
 by Ryan Carey on A brief note on factoring out certain variables | 0 likes

There should be a chat icon
 by Alex Mennen on Meta: IAFF vs LessWrong | 0 likes

Apparently "You must be
 by Jessica Taylor on Meta: IAFF vs LessWrong | 1 like

There is a replacement for
 by Alex Mennen on Meta: IAFF vs LessWrong | 1 like

Regarding the physical
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think that we should expect
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think I understand your
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

This seems like a hack. The
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

After thinking some more,
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yes, I think that we're
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

My intuition is that it must
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

To first approximation, a
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Actually, I *am* including
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yeah, when I went back and
 by Alex Appel on Optimal and Causal Counterfactual Worlds | 0 likes

> Well, we could give up on
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes