by Gordon Worley III 7 days ago | Alex Appel and Abram Demski like this | link | parent | on: Catastrophe Mitigation Using DRL

Maybe it’s just my browser, but it looks like it got cut off. Here’s the last of what it renders for me:

Averaging the previous inequality over $$k$$, we get

$$\frac{1}{N}\sum_{k=0}^{N-1}R^?_k \leq (1-\gamma^T)\sum_{n=0}^{\infty}\gamma^{nT}\,\mathrm{E}\left[\mathrm{E}[U^!_n \mid J^!_n=K,\ Z^!_{nT}]-\mathrm{E}[U^!_n \mid Z^!_{nT}]\right]+O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)$$

Indeed there is some kind of length limit in the website. I moved Appendices B and C to a separate post.

reply

Hyperreal Brouwer
post by Scott Garrabrant 49 days ago | Vadim Kosoy and Stuart Armstrong like this | 2 comments

This post explains how to view Kakutani’s fixed point theorem as a special case of Brouwer’s fixed point theorem with hyperreal numbers. This post is just math intuitions, but I found them useful in thinking about Kakutani’s fixed point theorem and many things in agent foundations. This came out of conversations with Sam Eisenstat.  continue reading »

by Vadim Kosoy 3 days ago | link | on: Hyperreal Brouwer

Very nice. I wonder whether this fixed point theorem also implies the various generalizations of Kakutani’s fixed point theorem in the literature, such as Lassonde’s theorem about compositions of Kakutani functions. It sounds like it should, because the composition of hypercontinuous functions is hypercontinuous, but I don’t see the formal argument immediately: if we have $$x \in *X,\ y \in *Y$$ with standard parts $$x_\omega,\ y_\omega$$ s.t. $$f(x)=y$$, and $$y' \in *Y,\ z \in *Z$$ with standard parts $$y'_\omega=y_\omega,\ z_\omega$$ s.t. $$g(y')=z$$, then it’s not clear why there should be $$x'\in *X,\ z'\in *Z$$ with standard parts $$x'_\omega=x_\omega,\ z'_\omega=z_\omega$$ s.t. $$g(f(x'))=z'$$.

reply

Resolving human inconsistency in a simple model
post by Stuart Armstrong 51 days ago | Abram Demski likes this | 1 comment

A putative new idea for AI control; index here. This post will present a simple model of an inconsistent human, and ponder how to resolve their inconsistency. Let $$\bf{H}$$ be our agent, in a turn-based world. Let $$R^l$$ and $$R^s$$ be two simple reward functions at each turn. The reward $$R^l$$ is thought of as being a ‘long-term’ reward, while $$R^s$$ is a short-term one.  continue reading »

Freezing the reward seems like the correct answer by definition, since if I am an agent following the utility function $$R$$ and I have to design a new agent now, then it is rational for me to design the new agent to follow the utility function I am following now (i.e. this action is usually rated as the best according to my current utility function).

reply
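The freezing argument can be made concrete in a few lines (the utilities, worlds, and names here are made up for illustration): a designer that scores successor designs by its current utility function rates "successor that keeps my current utility" at least as high as any drifted alternative.

```python
# Toy sketch of the "freeze the reward" argument (all numbers invented):
# a designer with current utility R evaluates two successor designs,
# one that keeps R and one that follows a drifted utility R_other.

worlds = ["invest", "consume"]
R = {"invest": 10, "consume": 3}        # the designer's current utility
R_other = {"invest": 2, "consume": 8}   # a drifted future utility

def successor_outcome(utility):
    # A successor maximizes whatever utility function it was given.
    return max(worlds, key=utility.get)

candidates = {"freeze_R": R, "drift": R_other}
# The designer scores each candidate by its *current* utility R.
scores = {name: R[successor_outcome(u)] for name, u in candidates.items()}
assert scores["freeze_R"] >= scores["drift"]  # freezing wins by R's lights
```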

Unfortunately, it’s not just your browser. The website truncates the document for some reason. I emailed Matthew about it and ey are looking into it.

The Happy Dance Problem
post by Abram Demski 7 days ago | Scott Garrabrant and Stuart Armstrong like this | 1 comment

Since the invention of logical induction, people have been trying to figure out what logically updateless reasoning could be. This is motivated by the idea that, in the realm of Bayesian uncertainty (IE, empirical uncertainty), updateless decision theory is the simple solution to the problem of reflective consistency. Naturally, we’d like to import this success to logically uncertain decision theory.

At a research retreat during the summer, we realized that updateless decision theory wasn’t so easy to define even in the seemingly simple Bayesian case. A possible solution was written up in Conditioning on Conditionals. However, that didn’t end up being especially satisfying.

Here, I introduce the happy dance problem, which more clearly illustrates the difficulty in defining updateless reasoning in the Bayesian case. I also outline Scott’s current thoughts about the correct way of reasoning about this problem.

by Wei Dai 7 days ago | Scott Garrabrant likes this | link | on: The Happy Dance Problem

We can solve the problem in what seems like the right way by introducing a basic notion of counterfactual, which I’ll write □→. This is supposed to represent “what the agent’s code will do on different inputs”. The idea is that if we have the policy of dancing when we see the money, M□→H is true even in the world where we don’t see any money.

(I’m confused about why this notation needs to be introduced. I haven’t been following all the DT discussions super closely, so I’d appreciate if someone could catch me up. Or, since I’m visiting MIRI soon, perhaps someone could catch me up in person.)

In the language of my original UDT post, I would have written this as S(‘M’)=‘H’, where S is the agent’s code (M and H in quotes here to denote that they’re input/output strings rather than events). This is a logical statement about the output of S given ‘M’ as input, which I had conjectured could be conditioned on the same way we’d condition on any other logical statement (once we have a solution to logical uncertainty). Of course, issues like Agent Simulates Predictor have since come up, so is this new idea/notation an attempt to solve some of those issues? Can you explain what advantages this notation has over the S(‘M’)=‘H’ type of notation?

It’s not clear where the beliefs about this correlation come from, so these counterfactuals are still almost as mysterious as explicitly giving conditional probabilities for everything given different policies.

Intuitively, it comes from the fact that there’s a chunk of computation in Omega that’s analyzing S, which should be logically correlated with S’s actual output. Again, this was a guess of what a correct solution to logical uncertainty would say when you run the math. (Now that we have logical induction, can we tell if it actually says this?)

Catastrophe Mitigation Using DRL

Previously we derived a regret bound for DRL which assumed the advisor is “locally sane.” Such an advisor can only take actions that don’t lose any value in the long term. In particular, if the environment contains a latent catastrophe that manifests at a certain rate (such as the possibility of a UFAI), a locally sane advisor has to take the optimal course of action to mitigate it, since every delay yields a positive probability of the catastrophe manifesting and leading to permanent loss of value. This state of affairs is unsatisfactory, since we would like to have performance guarantees for an AI that can mitigate catastrophes that the human operator cannot mitigate on their own. To address this problem, we introduce a new form of DRL where in every hypothetical environment the set of uncorrupted states is divided into “dangerous” (impending catastrophe) and “safe” (catastrophe was mitigated). The advisor is then only required to be locally sane in safe states, whereas in dangerous states certain “leaking” of long-term value is allowed. We derive a regret bound in this setting as a function of the time discount factor, the expected value of catastrophe mitigation time for the optimal policy, and the “value leak” rate (i.e. essentially the rate of catastrophe occurrence). The form of this regret bound implies that in certain asymptotic regimes, the agent attains near-optimal expected utility (and in particular mitigates the catastrophe with probability close to 1), whereas the advisor on its own fails to mitigate the catastrophe with probability close to 1.
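The asymptotic claim can be checked numerically. Below is a minimal sketch assuming the regret-bound error term is, up to constants, $$\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}$$; the function names and the particular schedule for $$T$$ are my own choices, not from the post:

```python
import math

def regret_error_term(gamma, T, eta, tau_bar):
    """The O(.) error term from the regret bound, up to constants:
    (1 - gamma^T) / eta^2  +  tau_bar * (1 - gamma) / (1 - gamma^T)."""
    g_T = gamma ** T
    return (1 - g_T) / eta**2 + tau_bar * (1 - gamma) / (1 - g_T)

def balanced_T(gamma):
    """One illustrative schedule: choose T so that 1 - gamma^T is about
    sqrt(1 - gamma), balancing the two summands."""
    return max(1, round(math.log(1 - math.sqrt(1 - gamma)) / math.log(gamma)))

eta, tau_bar = 0.1, 5.0
for gamma in (0.9, 0.99, 0.999, 0.9999):
    T = balanced_T(gamma)
    print(gamma, T, regret_error_term(gamma, T, eta, tau_bar))
```

With $$\eta$$ and $$\bar{\tau}$$ fixed, this schedule sends both summands to zero as $$\gamma \to 1$$, which is the kind of asymptotic regime in which the agent attains near-optimal expected utility.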


Looking for Recommendations RE UDT vs. bounded computation / meta-reasoning / opportunity cost?
discussion post by David Krueger 16 days ago | 1 comment

by Abram Demski | link | on: Looking for Recommendations RE UDT vs. bounded computation / meta-reasoning / opportunity cost?

At present, I think the main problem of logical updatelessness is something like: how can we make a principled trade-off between thinking longer to make a better decision, vs. thinking for less time so that we exert more logical control over the environment?

For example, in Agent Simulates Predictor, an agent who thinks for a short amount of time and then decides on a policy for how to respond to any conclusions which it comes to after thinking longer can decide “If I think longer, and see a proof that the predictor thinks I two-box, I can invalidate that proof by one-boxing. Adopting this policy makes the predictor less likely to find such a proof.” (I’m speculating; I haven’t actually written up a thing which does this, yet, but I think it would work.) An agent who thinks longer before making a decision can’t see this possibility because it has already proved that the predictor predicts two-boxing, so from the perspective of having thought longer, there doesn’t appear to be a way to invalidate the prediction – being predicted to two-box is just a fact, not a thing the agent has control over.
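The Agent Simulates Predictor dynamic described above can be caricatured in a few lines. This is a hypothetical toy model, not the post's formalism; the predictor's simulation budget and the payoffs are made up:

```python
# Hypothetical toy model of Agent Simulates Predictor: the predictor can
# only simulate the agent's deliberation for PREDICTOR_BUDGET steps, so
# only policies fixed within that budget are visible to it.

PREDICTOR_BUDGET = 10  # steps of the agent the predictor can afford to simulate

def payoff(decision_time, action):
    # The predictor sees the decision only if it was made within its
    # budget; otherwise it conservatively predicts two-boxing and
    # leaves the big box empty.
    predicted_one_box = decision_time <= PREDICTOR_BUDGET and action == "one-box"
    box = 1_000_000 if predicted_one_box else 0
    return box if action == "one-box" else box + 1_000

early_committer = payoff(decision_time=5, action="one-box")  # visible commitment
long_thinker = payoff(decision_time=50, action="two-box")    # opaque, so box is empty

assert early_committer == 1_000_000 and long_thinker == 1_000
```

The agent that fixes its policy while still transparent to the predictor does better than the one that thinks past the predictor's budget before acting.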

Similarly, in Prisoner’s Dilemma, an agent who hasn’t thought too long can adopt a strategy of first thinking longer and then doing whatever it predicts the other agent to do. This is a pretty good strategy, because it makes it so that the other agent’s best strategy is to cooperate. However, you have to think long enough to find this particular strategy, but stop early enough that the hypotheticals which show that the strategy is a good idea aren’t closed off yet.
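The "think longer, then do whatever you predict the other agent does" strategy can be sketched as a toy computation. Everything here (the payoff matrix, function names) is illustrative: the point is just that a visible mirror policy leaves the opponent choosing between mutual cooperation and mutual defection.

```python
# Toy sketch: an agent whose fixed, visible policy is "copy whatever I
# predict my opponent does" turns the opponent's choice into a choice
# between (C, C) and (D, D).  Standard PD payoffs, (agent, opponent).

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def mirror_agent(predicted_opponent_action):
    return predicted_opponent_action  # copy the opponent's (predicted) move

def best_response_to_mirror():
    # The opponent, seeing the mirror policy, compares its own payoff
    # when it cooperates (and is met with C) vs defects (met with D).
    options = {a: PAYOFF[(mirror_agent(a), a)][1] for a in ("C", "D")}
    return max(options, key=options.get)

opp = best_response_to_mirror()
assert opp == "C" and PAYOFF[(mirror_agent(opp), opp)] == (3, 3)
```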

So, I think there is less conflict between UDT and bounded reasoning than you are implying. However, it’s far from clear how to negotiate the trade-offs sanely.

(However, in both cases, you still want to spend as long a time thinking as you can afford; it’s just that you want to make the policy decision, about how to use the conclusions of that thinking, as early as it can sensibly be made.)

 Funding opportunity for AI alignment research link by Paul Christiano 89 days ago | Vadim Kosoy likes this | 3 comments

In the first round I’m planning to pay:

• $10k to Ryan Carey
• $10k to Chris Pasek
• $20k to Peter Scheyer

I’m excited to see what comes of this! Within a few months I’ll do another round of advertising + making decisions. I want to emphasize that given the evaluation process, this definitely shouldn’t be read as a strong negative judgment (or endorsement) of anyone’s application.

reply

by David Krueger 34 days ago | link | parent | on: Funding opportunity for AI alignment research

Paul - how widely do you want this shared?

Fine with it being shared broadly.

reply

by Stuart Armstrong 28 days ago | link | parent | on: Predictable Exploration

If the other players can see what action you’ll take, then they may simply exploit you. Isn’t this a variant of the “agent simulates predictor” problem (with you playing the role of the predictor)? Thus any agent capable of exploiting you has to prove to you that it won’t, in order to get anything from you. That’s kind of what happens with your Nicerbots; even if perfectly predictable, they’re not really exploitable in any strong sense (they won’t cooperate with a defector).

by Abram Demski 26 days ago | link | on: Predictable Exploration

I think the point I was making here was a bit less clear than I wanted it to be. I was saying that, if you use predictable exploration on actions rather than policies, then you only get to see what happens when you predictably take a certain action. This is good for learning pure equilibria in games, but doesn’t give information which would help the agent reach the right mixed equilibria when randomized actions should be preferred; and indeed, it doesn’t seem like such an agent would reach the right mixed equilibria. I believe the “predictable exploration on policies” approach solves agent-simulates-predictor just fine, along with other problems (including counterfactual mugging) which require “some degree of updatelessness” without requiring the full reflective stability which we want from updatelessness.
reply

The Three Levels of Goodhart's Curse
post by Scott Garrabrant 100 days ago | Vadim Kosoy, Abram Demski and Paul Christiano like this | 2 comments

Goodhart’s curse is a neologism by Eliezer Yudkowsky stating that “neutrally optimizing a proxy measure U of V seeks out upward divergence of U from V.” It is related to many nearby concepts (e.g. the tails come apart, winner’s curse, optimizer’s curse, regression to the mean, overfitting, edge instantiation, Goodhart’s law). I claim that there are three main mechanisms through which Goodhart’s curse operates.  continue reading »

by Sören Mindermann | link | on: The Three Levels of Goodhart's Curse

(also x-posted from https://arbital.com/p/goodharts_curse/#subpage-8s5)

Another, speculative point: If $$V$$ and $$U$$ were my utility function and my friend’s, my intuition is that an agent that optimizes the wrong function would act more robustly. If true, this may support the theory that Goodhart’s curse for AI alignment would be to a large extent a problem of defending against adversarial examples by learning robust features similar to human ones. Namely, the robust response may be because me and my friend have learned similar robust, high-level features; we just give them different importance.

reply

The Three Levels of Goodhart's Curse
post by Scott Garrabrant 100 days ago | Vadim Kosoy, Abram Demski and Paul Christiano like this | 2 comments

Goodhart’s curse is a neologism by Eliezer Yudkowsky stating that “neutrally optimizing a proxy measure U of V seeks out upward divergence of U from V.” It is related to many nearby concepts (e.g. the tails come apart, winner’s curse, optimizer’s curse, regression to the mean, overfitting, edge instantiation, Goodhart’s law). I claim that there are three main mechanisms through which Goodhart’s curse operates.  continue reading »

by Sören Mindermann | link | on: The Three Levels of Goodhart's Curse

(x-posted from Arbital ==> Goodhart’s curse)

On “Conditions for Goodhart’s curse”: It seems like with AI alignment the curse happens mostly when V is defined in terms of some high-level features of the state, which are normally not easily maximized. I.e., V is something like a neural network $$V: s \mapsto V(s)$$ where $$s$$ is the state. Now suppose U’ is a neural network which outputs the AI’s estimate of these features. The AI can then manipulate the state/input to maximize these features. That’s just the standard problem of adversarial examples. So it seems like the conditions we’re looking for are generally met in the common setting where adversarial examples do work to maximize some loss function. One requirement there is that the input space is high-dimensional.

So why doesn’t the 2D Gaussian example go wrong? [This is about the example from Arbital ==> Goodhart’s Curse where there is no bound $$\sqrt{n}$$ on $$V$$ and $$U$$.] There are no high-level features to optimize by using the flexibility of the input space.

On the other hand, you don’t need a flexible input space to fall prey to the winner’s curse. Instead of using the high flexibility of the input space you use the ‘high flexibility’ of the noise if you have many data points. The noise will take any possible value with enough data, causing the winner’s curse. If you care about a feature that is bounded under the real-world distribution but noise is unbounded, you will find that the most promising-looking data points are actually maximizing the noise.

There’s a noise-free (i.e. no measurement errors) variant of the winner’s curse which suggests another connection to adversarial examples. If you simply have $$n$$ data points and pick the one that maximizes some outcome measure, you can conceptualize this as evolutionary optimization in the input space. Usually, adversarial examples are generated by following the gradient in the input space.
Instead, the winner’s curse uses evolutionary optimization.

reply

Predictable Exploration
discussion post by Abram Demski 30 days ago | 5 comments

If the other players can see what action you’ll take, then they may simply exploit you. Isn’t this a variant of the “agent simulates predictor” problem (with you playing the role of the predictor)? Thus any agent capable of exploiting you has to prove to you that it won’t, in order to get anything from you. That’s kind of what happens with your Nicerbots; even if perfectly predictable, they’re not really exploitable in any strong sense (they won’t cooperate with a defector).

reply

by Alex Appel 30 days ago | Abram Demski likes this | link | parent | on: Predictable Exploration

Hm, I got the same result from a different direction.

(Probably very confused/not-even-wrong thoughts ahead.)

It’s possible to view a policy of the form “I’ll compute X and respond based on what X outputs” as… tying your output to X, in a sense. Logical link formation, if you will.

And policies of the form “I’ll compute X and respond in a way that makes that output of X impossible/improbable” (can’t always do this) correspond to logical link cutting.

And with this, we see what the chicken rule in MUDT/exploration in LIDT is doing. It’s systematically cutting all the logical links it can, and going “well, if the statement remains correlated with me despite me trying my best to shake off anything that predicts me too well, I guess I ‘cause’ it.”

But some potentially-useful links were cut by this process, such as “having short abstract reasoning available that lets others predict what you will do” (a partner in a prisoner’s dilemma, the troll in troll bridge, etc.).

At the same time, some links should be cut by a policy that diagonalizes against predictions/calls upon an unpredictable process (anything that can be used to predict your behavior in matching pennies, evading Death when Death can’t crack your random number generator, etc.).

So I wound up with “predictable policy selection that forms links to stuff that would be useful to correlate with yourself, and cuts links to stuff that would be detrimental to have correlated with yourself”.

Predictably choosing an easy-to-predict policy is easy-to-predict; predictably choosing a hard-to-predict policy is hard-to-predict.

This runs directly into problem 1 of “how do you make sure you have good counterfactuals of what would happen if you had a certain pattern of logical links, if you aren’t acting unpredictably”, and maybe some other problems as well, but it feels philosophically appealing.

by Abram Demski 29 days ago | link | on: Predictable Exploration

Thinking about this more, I think there’s an important disanalogy between trying to make policy decisions with earlier market states vs smaller proof-searches. In Agent Simulates Predictor, we can use an earlier market state to decide our policy, because the earlier market state can trust the predictor to make the right predictions, even if the predictor is using a more powerful logic (since logical inductors can learn to boundedly trust more powerful logics). However, with proof-based DTs, no analogous move is possible. Consider a version of Agent Simulates Predictor in which Omega searches for a proof that you one-box in PA+Con(PA); if one is found, Omega fills the $1M box. Otherwise, not. Omega has $$T_1$$ time to think. The agent has $$T_2$$ time to think, $$T_2 \gg T_1$$. The agent reasons in PA.

If the agent refused to use all its time, and only ran for $$T_0 \ll T_1$$ time, but still had enough time to find interesting proofs, then it could reason as follows: “If I one-box, then there is a short proof that I one-box which Omega can find. So I get $1M.” It may not know if PA+Con(PA) is sound, but that doesn’t matter; the agent just has to ensure that there is a proof which Omega will find. It wouldn’t find any proofs leading to higher utility than this, so it would one-box and get $1M.

Unfortunately, I don’t see any way to harness the shorter proof-search to choose a policy which would get the $1M in this case but choose to think longer in other cases where that’s beneficial. We might want the agent to reason: “If I stop and one-box right now, Omega will be able to prove that I one-box, and I’ll get $1M. If I wait longer, Omega won’t be able to prove what I do, so I’ll at most be able to get $100. So, I’ll stop now and one-box.” However, this reasoning would have to take place at a proof-length in which several things hold at once:

• The agent can prove that it’s still “early” enough that its action would be provable to Omega if it acted now.
• It’s “late” enough that the agent can see that Omega’s predictions are sound (IE, it can check that Omega doesn’t reach false results in the limited time it has). This allows the agent to see that it’ll never get money from both boxes.

It seems very unlikely that there is a proof length where these can both be true, due to bounded Löb.

For logical induction, on the other hand, there’s quite likely to be a window with analogous properties.

by Abram Demski 30 days ago | link | on: Predictable Exploration

So I wound up with “predictable policy selection that forms links to stuff that would be useful to correlate with yourself, and cuts links to stuff that would be detrimental to have correlated with yourself”.

Agreed!

I’m reading this as “You want to make decisions as early as you can, because when you decide one of the things you can do is decide to put the decision off for later; but when you make a decision later, you can’t decide to put it earlier.”

And “logical time” here determines whether others can see your move when they decide to make theirs. You place yourself upstream of more things if you think less before deciding.

This runs directly into problem 1 of “how do you make sure you have good counterfactuals of what would happen if you had a certain pattern of logical links, if you aren’t acting unpredictably”, and maybe some other problems as well, but it feels philosophically appealing.

Here’s where I’m saying “just use the chicken rule again, in this stepped-back reasoning”. It likely re-introduces versions of the same problems at the higher level, but perhaps iterating this process as many times as we can afford is in some sense the best we can do.

 Predictable Exploration discussion post by Abram Demski 30 days ago | 5 comments
by Alex Appel 30 days ago | Abram Demski likes this | link | on: Predictable Exploration

Hm, I got the same result from a different direction.

It’s possible to view a policy of the form “I’ll compute X and respond based on what X outputs” as… tying your output to X, in a sense. Logical link formation, if you will.

And policies of the form “I’ll compute X and respond in a way that makes that output of X impossible/improbable” (can’t always do this) correspond to logical link cutting.

And with this, we see what the chicken rule in MUDT/exploration in LIDT is doing. It’s systematically cutting all the logical links it can, and going “well, if the statement remains correlated with me despite me trying my best to shake off anything that predicts me too well, I guess I ‘cause’ it.”

But some potentially-useful links were cut by this process, such as “having short abstract reasoning available that lets others predict what you will do” (a partner in a prisoner’s dilemma, the troll in troll bridge, etc.).

At the same time, some links should be cut by a policy that diagonalizes against predictions/calls upon an unpredictable process (anything that can be used to predict your behavior in matching pennies, evading Death when Death can’t crack your random number generator, etc…)

So I wound up with “predictable policy selection that forms links to stuff that would be useful to correlate with yourself, and cuts links to stuff that would be detrimental to have correlated with yourself”.

Predictably choosing an easy-to-predict policy is easy-to-predict; predictably choosing a hard-to-predict policy is hard-to-predict.

This runs directly into problem 1 of “how do you make sure you have good counterfactuals of what would happen if you had a certain pattern of logical links, if you aren’t acting unpredictably”, and maybe some other problems as well, but it feels philosophically appealing.
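One way to make the easy-to-predict vs. hard-to-predict point concrete is a toy matching-pennies simulation (my construction, not from the comment): a predictor that can rerun the agent's code matches a deterministic policy perfectly, but gains nothing against a policy that calls on a random process the predictor can't crack.

```python
# Toy matching pennies: the "predictor" reruns the agent's policy.  A
# deterministic policy is perfectly predictable and never wins; a policy
# that draws fresh randomness on each call (modeling an RNG the predictor
# can't crack) wins about half the rounds.
import random

random.seed(0)

def play(agent_policy, rounds=10_000):
    mismatches = 0
    for t in range(rounds):
        action = agent_policy(t)
        guess = agent_policy(t)  # predictor reruns the same policy
        mismatches += action != guess  # the agent wins on a mismatch
    return mismatches / rounds

deterministic = lambda t: "H" if t % 2 == 0 else "T"  # fully predictable
randomized = lambda t: random.choice("HT")            # rerun draws independently

assert play(deterministic) == 0.0
assert 0.45 < play(randomized) < 0.55
```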


by Alex Appel 40 days ago | link | parent | on: Smoking Lesion Steelman III: Revenge of the Tickle... What does the Law of Logical Causality say about CON(PA) in Sam’s probabilistic version of the troll bridge? My intuition is that in that case, the agent would think CON(PA) would be causally downstream of itself, because the distributions of actions conditional on CON(PA) and $$\neg$$CON(PA) are different. Can we come up with any example where the agent thinking it can control CON(PA) (or any other thing that enables accurate predictions of its actions) actually gets it into trouble?

I agree, my intuition is that LLC asserts that the troll, and even CON(PA), is downstream. And, it seems to get into trouble because it treats it as downstream.

I also suspect that Troll Bridge will end up formally outside the realm where LLC can be justified by the desire to make ratifiability imply CDT=EDT. (I’m working on another post which will go into that more.)

Smoking Lesion Steelman III: Revenge of the Tickle Defense
post by Abram Demski 50 days ago | Scott Garrabrant likes this | 2 comments

I improve the theory I put forward last time a bit, locate it in the literature, and discuss conditions when this approach unifies CDT and EDT.

What does the Law of Logical Causality say about CON(PA) in Sam’s probabilistic version of the troll bridge?

My intuition is that in that case, the agent would think CON(PA) would be causally downstream of itself, because the distributions of actions conditional on CON(PA) and $$\neg$$CON(PA) are different.

Can we come up with any example where the agent thinking it can control CON(PA) (or any other thing that enables accurate predictions of its actions) actually gets it into trouble?

Hyperreal Brouwer
post by Scott Garrabrant 49 days ago | Vadim Kosoy and Stuart Armstrong like this | 2 comments

This post explains how to view Kakutani’s fixed point theorem as a special case of Brouwer’s fixed point theorem with hyperreal numbers. This post is just math intuitions, but I found them useful in thinking about Kakutani’s fixed point theorem and many things in agent foundations. This came out of conversations with Sam Eisenstat.
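As a cheap numerical illustration of the intuition (my own example, not from the post): take the Kakutani-style jump map $$F(x)=\{1\}$$ for $$x<0.7$$, $$F(x)=\{0\}$$ for $$x>0.7$$, and $$F(0.7)=[0,1]$$, whose Kakutani fixed point is $$0.7$$. Approximating it by ever steeper continuous maps, Brouwer's theorem gives a fixed point of each approximation, and as the steepness grows (think of an infinite hypernatural steepness) the fixed points converge to the Kakutani one, just as taking a standard part would.

```python
# Steep continuous approximations to a discontinuous Kakutani-style map.
# Each approximation is a continuous map [0,1] -> [0,1], so Brouwer
# guarantees a fixed point; as k grows, those fixed points approach 0.7.
import math

def steep(k):
    # Continuous, decreasing approximation of the jump at x = 0.7.
    return lambda x: 1 / (1 + math.exp(k * (x - 0.7)))

def brouwer_fixed_point(f, lo=0.0, hi=1.0, iters=100):
    # f is continuous and decreasing on [0,1], so f(x) - x has a unique
    # root, which bisection finds.
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) > mid else (lo, mid)
    return (lo + hi) / 2

for k in (10, 100, 1000):
    print(k, brouwer_fixed_point(steep(k)))
```

The printed fixed points approach 0.7 as the slope parameter grows, mirroring how the standard part of a fixed point of the hyperreal-steep map recovers the Kakutani fixed point.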

by Stuart Armstrong 49 days ago | link | on: Hyperreal Brouwer

To quote the straw Vulcan: Fascinating.

 Should I post technical ideas here or on LessWrong 2.0? discussion post by Stuart Armstrong 51 days ago | Abram Demski likes this | 3 comments

I intend to cross-post often.

 Should I post technical ideas here or on LessWrong 2.0? discussion post by Stuart Armstrong 51 days ago | Abram Demski likes this | 3 comments

I think technical research should be posted here. Moreover, I think that merging IAFF and LW is a bad idea. We should be striving to attract people from mainstream academia / AI research groups rather than making ourselves seem even more eccentric / esoteric.

 Should I post technical ideas here or on LessWrong 2.0? discussion post by Stuart Armstrong 51 days ago | Abram Demski likes this | 3 comments

I am much more likely to miss things posted to LessWrong 2.0. I eagerly await this forum’s incorporation into LW. Until then, I’m also conflicted about where to post.

 Open Problems Regarding Counterfactuals: An Introduction For Beginners link by Alex Appel 129 days ago | Vadim Kosoy, Tsvi Benson-Tilsen, Vladimir Nesov and Wei Dai like this | 2 comments

Note that the problem with exploration already arises in ordinary reinforcement learning, without going into “exotic” decision theories. Regarding the question of why humans don’t seem to have this problem, I think it is a combination of

• The universe is regular (which is related to what you said about “we can’t see any plausible causal way it could happen”), so a Bayes-optimal policy with a simplicity prior has something going for it. On the other hand, sometimes you do need to experiment, so this can’t be the only explanation.

• Any individual human has parents that teach em things, including things like “touching a hot stove is dangerous.” Later in life, ey can draw on much of the knowledge accumulated by human civilization. This tunnels the exploration into safe channels, analogously to the role of the advisor in my recent posts.

• One may say that the previous point only passes the recursive buck, since we can consider all of humanity to be the “agent”. From this perspective, it seems that the universe just happens to be relatively safe, in the sense that it’s pretty hard for an individual human to do something that will irreparably damage all of humanity… or at least it was the case during most of human history.

• In addition, we have some useful instincts baked in by evolution (e.g. probably some notion of existing in a three dimensional space with objects that interact mechanically). Again, you could zoom further out and say evolution works because it’s hard to create a species that will wipe out all life.
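The advisor bullet above, exploration tunneled into safe channels, can be sketched with a toy simulation (the "hot stove" setup and all names are my own, purely illustrative):

```python
# Toy "hot stove" world: uniformly random exploration almost surely hits
# the trap action at some point, while exploration restricted to
# advisor-approved actions (the advisor only vetoes the trap) stays safe.
import random

random.seed(0)
TRAP, GOAL, STEPS = "touch_stove", "step_forward", 20

def run(policy):
    for _ in range(STEPS):
        if policy() == TRAP:
            return "dead"
    return "alive"

unguided = lambda: random.choice([GOAL, TRAP])
advised = lambda: random.choice([a for a in [GOAL, TRAP] if a != TRAP])

trials = 1000
dead_unguided = sum(run(unguided) == "dead" for _ in range(trials))
dead_advised = sum(run(advised) == "dead" for _ in range(trials))
assert dead_advised == 0 and dead_unguided > 990  # the trap is almost surely hit
```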

 Open Problems Regarding Counterfactuals: An Introduction For Beginners link by Alex Appel 129 days ago | Vadim Kosoy, Tsvi Benson-Tilsen, Vladimir Nesov and Wei Dai like this | 2 comments

Typos on page 5:

• “random explanation” should be “random exploration”
• “Alpa” should be “Alpha”
