Intelligent Agent Foundations Forum
Why I am not currently working on the AAMLS agenda
post by Jessica Taylor 16 days ago | Ryan Carey, Marcello Herreshoff, Sam Eisenstat, Abram Demski, Daniel Dewey, Scott Garrabrant and Stuart Armstrong like this | 1 comment

(note: this is not an official MIRI statement, this is a personal statement. I am not speaking for others who have been involved with the agenda.)

The AAMLS (Alignment for Advanced Machine Learning Systems) agenda is a project at MIRI about determining how to use hypothetical, highly advanced machine learning systems safely. I was previously working on problems in this agenda, but am not currently doing so.

continue reading »

The “benign induction problem” link is broken.


Formal Open Problem in Decision Theory
post by Scott Garrabrant 59 days ago | Marcello Herreshoff, Sam Eisenstat, Vadim Kosoy, Jessica Taylor, Patrick LaVictoire and Stuart Armstrong like this | 13 comments

In this post, I present a new formal open problem. A positive answer would be valuable for decision theory research. A negative answer would be helpful, mostly for figuring out how close we can get to a positive answer. I also give some motivation for the problem, and some partial progress.

Open Problem: Does there exist a topological space \(X\) (in some convenient category of topological spaces) such that there exists a continuous surjection from \(X\) to the space \([0,1]^X\) (of continuous functions from \(X\) to \([0,1]\))?

continue reading »
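To connect the problem statement with the theorems summarized in the comment below, here is a reformulation (mine, using the exponential adjunction available in a convenient, i.e. cartesian closed, category of spaces): a continuous map \(\hat f:X\rightarrow[0,1]^X\) corresponds to a continuous map \(f:X\times X\rightarrow[0,1]\) via

\[ \hat f(y)(x) = f(x,y), \]

and \(\hat f\) is surjective exactly when every continuous \(g:X\rightarrow[0,1]\) is of the form \(g=f(\cdot,y)\) for some \(y\in X\), i.e. a "fiber" of \(f\) in the sense used below.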

This comment is to explain partial results obtained by David Simmons in this thread, since the results and their proofs are difficult to follow as a result of being distributed across multiple comments, with many errors and corrections. The proofs given here are due to David Simmons.

Statements

Theorem 1: There is no metric space \(X\) and uniformly continuous function \(f:X\times X\rightarrow[0,1]\) such that every uniformly continuous function \(X\rightarrow[0,1]\) is a fiber of \(f\).

Theorem 2: There is no metric space \(X\) and function \(f:X\times X\rightarrow[0,1]\) that is uniformly continuous on bounded sets, such that every function \(X\rightarrow[0,1]\) which is uniformly continuous on bounded sets is a fiber of \(f\).

Theorem 3: There is no locally compact Polish space \(X\) and continuous function \(f:X\times X\rightarrow[0,1]\) such that every continuous function \(X\rightarrow[0,1]\) is a fiber of \(f\).

Commentary

Theorem 1 says that there is no solution to the version of Scott’s problem with continuity in topological spaces replaced with uniform continuity in metric spaces. A plausible reason that this version of the problem might be important is that if an agent has to be able to compute what its policy says to do to any desired precision using an amount of computational resources that does not depend on the input, then its policy must be uniformly continuous. The gist of the proof is that for any uniformly continuous \(f:X\times X\rightarrow[0,1]\), it is possible to construct a uniformly continuous function \(g:X\rightarrow[0,1]\) that requires greater precision in its input than \(f\) does to determine the output to any desired precision. I suspect it might be possible to adapt the proof to work in uniform spaces rather than just metric spaces, but I have not studied uniform spaces.

Theorem 2 is similar to theorem 1, but with uniform continuity replaced with uniform continuity on bounded subsets. I was not convinced that this is an important notion in its own right, but theorem 2 is useful as a lemma for theorem 3. See the thread under David Simmons’s comment for more discussion about what sort of continuity assumption is appropriate. The gist of the proof is to apply the proof of theorem 1 on a bounded set. The function \(g\) constructed will be uniformly continuous everywhere, so the proof actually shows a stronger result that unifies both theorems 1 and 2: there is no \(f:X\times X\rightarrow[0,1]\) that is uniformly continuous on bounded sets and admits every uniformly continuous function \(X\rightarrow[0,1]\) as a fiber.

Theorem 3 almost says that Scott’s problem cannot be solved with Polish spaces. It doesn’t quite say that, because there are Polish spaces that are not locally compact. However, non-locally-compact Polish spaces are not exponentiable, so in the version of the problem in which we want a surjection \(X\rightarrow[0,1]^X\), it isn’t even clear whether there exists an exponential object \([0,1]^X\), which may mean that non-exponentiable spaces are not promising, although I’m not sure. A reason to restrict attention to Polish spaces is that effective Polish spaces provide a topological setting in which there is a good notion of computability, so the nonexistence of a solution in Polish spaces would make it impossible to provide a computable solution in that sense. That said, there may be other notions of computability in topological spaces (domain theory?), which I am unfamiliar with. The gist of the proof is to find a metric in which bounded sets are compact, and apply theorem 2.

Proofs

Proof of theorem 1: Let \(X\) be a metric space, and \(f:X\times X\rightarrow[0,1]\) be uniformly continuous. If \(X\) is uniformly discrete, then all functions from \(X\) are uniformly continuous, so there is a uniformly continuous function \(X\rightarrow[0,1]\) that is not a fiber of \(f\) by Cantor’s theorem.
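To unpack the Cantor-style diagonal step (this elaboration is mine): given \(f\), define

\[ g(x) := \begin{cases} 0 & \text{if } f(x,x)\geq 1/2,\\ 1 & \text{otherwise.} \end{cases} \]

Then \(|g(x)-f(x,x)|\geq 1/2\) for every \(x\), so \(g(y)\neq f(y,y)\) for every \(y\), and hence \(g\) is not of the form \(f(\cdot,y)\); uniform discreteness of \(X\) makes \(g\), like every function on \(X\), uniformly continuous.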

So assume that \(X\) is not uniformly discrete. Then we can choose a sequence \((x_n,y_n)_{n\in\mathbb{N}}\) such that \(0<d(x_{k+1},y_{k+1})\leq\frac{1}{6}d(x_k,y_k)\) for all \(k\). Note that for all \(k\) and \(\ell>k\), either (A) \(d(x_k,y_\ell)\geq\frac{1}{3}d(x_k,y_k)\) and \(d(x_k,x_\ell)\geq\frac{1}{3}d(x_k,y_k)\), or (B) \(d(y_k,y_\ell)\geq\frac{1}{3}d(x_k,y_k)\) and \(d(y_k,x_\ell)\geq\frac{1}{3}d(x_k,y_k)\). By extracting a subsequence we can assume that which of (A) or (B) holds depends only on \(k\) and not on \(\ell\). By swapping \(x_k\) and \(y_k\) if necessary we can assume that case (A) always holds.
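To check the dichotomy (this verification is mine): write \(D:=d(x_k,y_k)\), so \(d(x_\ell,y_\ell)\leq D/6\) for \(\ell>k\). If (A) fails, then one of \(x_\ell,y_\ell\) lies within \(D/3\) of \(x_k\), and hence both lie within \(D/3+D/6=D/2\) of \(x_k\); the triangle inequality then gives

\[ d(y_k,y_\ell)\geq d(x_k,y_k)-d(x_k,y_\ell)\geq D-\tfrac{D}{2}\geq\tfrac{D}{3}, \qquad d(y_k,x_\ell)\geq d(x_k,y_k)-d(x_k,x_\ell)\geq D-\tfrac{D}{2}\geq\tfrac{D}{3}, \]

which is exactly case (B).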

For each \(z\) there is at most one \(k\) such that \(d(z,x_k)<\frac{1}{6}d(x_k,y_k)\), because if \(d(z,x_k)<\frac{1}{6}d(x_k,y_k)\) and \(d(z,x_\ell)<\frac{1}{6}d(x_\ell,y_\ell)\) with \(\ell>k\), then \(d(x_k,x_\ell)<\frac{1}{3}d(x_k,y_k)\), a contradiction.

It follows that by extracting a further subsequence we can assume that \(d(y_k,x_\ell)\geq\frac{1}{6}d(x_\ell,y_\ell)\) for all \(\ell>k\).

Since \(f\) is uniformly continuous, there is a function \(\delta:(0,\infty)\rightarrow(0,\infty)\) such that \(\forall\varepsilon>0\;d((u,v),(w,z))<\delta(\varepsilon)\rightarrow|f(u,v)-f(w,z)|<\varepsilon\). By extracting a further subsequence, we can assume that \(d(x_k,y_k)<\delta(2^{-k})\) for all \(k\). Let \(j:[0,\infty)\rightarrow[0,\infty)\) be an increasing uniformly continuous function such that \(j(0)=0\) and \(j(\frac{1}{6}d(x_k,y_k))>2^{-k}\) for all \(k\). Finally, let \(g(z):=\inf_k j(d(z,y_k))\). Then for all \(k\) we have \(g(y_k)=0\). On the other hand, for all \(\ell>k\) we have \(d(x_k,y_\ell)\geq\frac{1}{3}d(x_k,y_k)\), for \(\ell=k\) we have \(d(x_k,y_\ell)=d(x_k,y_k)\), and for \(\ell<k\) we have \(d(x_k,y_\ell)\geq\frac{1}{6}d(x_k,y_k)\). Thus \(g(x_k)=\inf_\ell j(d(x_k,y_\ell))\geq j(\frac{1}{6}d(x_k,y_k))>2^{-k}\). Clearly, \(g\) cannot be a fiber of \(f\). Moreover, since \(j\) is increasing and continuous, \(g=j\circ(z\mapsto\inf_k d(z,y_k))\), and since \(j\) and \(z\mapsto\inf_k d(z,y_k)\) are both uniformly continuous, so is \(g\). \(\square\)
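Spelling out why \(g\) cannot be a fiber of \(f\) (my unpacking): if \(g=f(\cdot,w)\) for some \(w\in X\) (the other slice convention is symmetric), then for every \(k\)

\[ |f(x_k,w)-f(y_k,w)| = |g(x_k)-g(y_k)| = g(x_k) > 2^{-k}, \]

while \(d((x_k,w),(y_k,w))=d(x_k,y_k)<\delta(2^{-k})\) (say with the max metric on \(X\times X\)), so uniform continuity of \(f\) would force \(|f(x_k,w)-f(y_k,w)|<2^{-k}\), a contradiction.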

Proof of theorem 2: Let \(X\) be a metric space, and \(f:X\times X\rightarrow[0,1]\) be uniformly continuous on bounded sets. If all bounded subsets of \(X\) are uniformly discrete, then all functions from \(X\) are uniformly continuous on bounded sets, so there is a function \(X\rightarrow[0,1]\) that is uniformly continuous on bounded sets but not a fiber of \(f\) by Cantor’s theorem. Otherwise, let \(B\subseteq X\) be a bounded set that is not uniformly discrete, take a sequence \((x_n,y_n)_{n\in\mathbb{N}}\) in \(B\) as in the proof of theorem 1, and a function \(\delta:(0,\infty)\rightarrow(0,\infty)\) such that \(\forall\varepsilon>0\;d((u,v),(w,z))<\delta(\varepsilon)\rightarrow|f(u,v)-f(w,z)|<\varepsilon\) for \((u,v),(w,z)\in B\times B\), and define \(j\) and \(g\) as in the proof of theorem 1. \(g\) is uniformly continuous, but not a fiber of \(f\). \(\square\)

Proof of theorem 3: Let \(d\) be a metric for \(X\), and let \(h(x):=\sup\{r\mid B(x,r)\text{ is compact}\}\), where \(B(x,r)\) is the closed ball around \(x\) of radius \(r\). Local compactness gives \(h(x)>0\), and since \(B(y,r-d(x,y))\subseteq B(x,r)\) whenever \(r>d(x,y)\), we also have \(h(y)\geq h(x)-d(x,y)\), so \(h\) is 1-Lipschitz. Let \(F(x,y):=h(x)/[h(x)-d(x,y)]\) if \(h(x)>d(x,y)\), and \(F(x,y):=\infty\) otherwise. Next, let \(g(y):=\min_n[n+F(x_n,y)]\), where \((x_n)_{n\in\mathbb{N}}\) enumerates a countable dense set. Then \(g\) is continuous and everywhere finite: for any \(y\), density gives some \(x_n\) with \(d(x_n,y)<h(y)/2\), and then \(h(x_n)\geq h(y)-d(x_n,y)>d(x_n,y)\), so \(F(x_n,y)<\infty\). Moreover, \(g^{-1}([0,N])\subseteq\bigcup_{n\leq N}B(x_n,(1-\frac{1}{N})h(x_n))\), and is thus compact. It follows that in the metric \(d'(x,y):=d(x,y)+|g(x)-g(y)|\), which induces the same topology, closed bounded sets are compact; in particular, continuous \([0,1]\)-valued functions on \((X,d')\) and on \((X,d')\times(X,d')\) are uniformly continuous on bounded sets. Thus theorem 3 follows from theorem 2 applied to \((X,d')\). \(\square\)
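Unpacking the inclusion \(g^{-1}([0,N])\subseteq\bigcup_{n\leq N}B(x_n,(1-\frac{1}{N})h(x_n))\) (my derivation): if \(g(y)\leq N\), then some \(n\) attains the minimum, so \(n+F(x_n,y)\leq N\), hence \(n\leq N\) and \(F(x_n,y)\leq N\). The latter means \(h(x_n)>d(x_n,y)\) and \(h(x_n)\leq N[h(x_n)-d(x_n,y)]\), which rearranges to

\[ d(x_n,y)\leq\left(1-\tfrac{1}{N}\right)h(x_n). \]

Each such ball is compact because its radius is strictly less than \(h(x_n)\), and a finite union of compact sets is compact; \(g^{-1}([0,N])\) is a closed subset of this union, hence compact.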


AI safety: three human problems and one AI issue
post by Stuart Armstrong 9 days ago | Ryan Carey and Daniel Dewey like this | 1 comment

There have been various attempts to classify the problems in AI safety research, ranging from our old Oracle paper, which classified then-theoretical methods of control, to more recent classifications that grow out of modern, more concrete problems.

These all serve their purpose, but I think a more enlightening way to classify AI safety problems is to look at the issues we are trying to solve or avoid. And most of these issues are problems about humans.

continue reading »

Thanks for writing this – I think it’s a helpful kind of reflection for people to do!



I think the double decrease effect kicks in with uncertainty, but not with confident expectation of a smaller network.


I think it does do the double decrease for the known smaller network.

Take three agents \(A_1\), \(A_2\), and \(A_3\), with utilities \(u_1\), \(u_2\), and \(u_3\). Assume the indices \(i\), \(j\), and \(k\) are always distinct.

Each \(A_i\) can boost \(u_j\), at the cost in terms of \(u_i\) described above.

What I haven’t really specified is the three-way synergy: can \(A_i\) boost \(u_j+u_k\) more efficiently than simply boosting \(u_j\) and \(u_k\) independently? In general yes (the two utilities \(u_j\) and \(u_k\) are synergistic with each other, after all), but let’s first assume there is zero three-way synergy.

Then each agent \(A_i\) will sacrifice \(1/2+1/2=1\) in \(u_i\) to boost \(u_j\) and \(u_k\) each by \(1\). Overall, each utility function goes up by \(1+1-1=1\). This scales linearly with the size of the trade network each agent sees (excluding themselves): if there were two agents total, each utility would go up by \(1/2\), as in the top post example. And if there were \(n+1\) agents, each utility would go up by \(n/2\).
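As a quick sanity check on this arithmetic, here is a small script (mine; the half-unit sacrifice per unit of boost is taken from the setup above) computing each agent's net utility change in a fully connected trade network with no higher-order synergies:

```python
def net_utility_change(num_agents, boost=1.0, cost_per_boost=0.5):
    """Net change in one agent's utility in a symmetric trade network with
    no higher-order synergies: each agent pays cost_per_boost of its own
    utility to raise each other agent's utility by boost."""
    others = num_agents - 1
    gains = others * boost               # boosts received from the others
    sacrifice = others * cost_per_boost  # cost of boosting each of the others
    return gains - sacrifice

# Matches the comment: 2 agents -> 1/2, 3 agents -> 1, n+1 agents -> n/2.
for agents in (2, 3, 11):
    print(agents, net_utility_change(agents))
```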

However, if there are any three-way, four-way,…, or \(n\)-way synergies, then the trade network is more efficient than that. So there is a double decrease (or double increase, from the other perspective), as long as there are higher-order synergies between the utilities.


CIRL Wireheading
post by Tom Everitt 21 days ago | Abram Demski and Stuart Armstrong like this | 1 comment

Cooperative inverse reinforcement learning (CIRL) generated a lot of attention last year, as it seemed to do a good job aligning an agent’s incentives with its human supervisor’s. Notably, it led to an elegant solution to the shutdown problem.

continue reading »
by Stuart Armstrong 11 days ago | on: CIRL Wireheading

but the agent incorrectly observes the action

It’s a bit annoying that this has to rely on an incorrect observation. Why not replace the human action, in state \(s_2\), with a simple automated system that chooses \(a_1^H\)? It’s an easy mistake to make while programming, and the agent has no fundamental understanding of the difference between the human and an imperfect automated system.

Basically, if the human acts in perfect accordance with their preferences, and if the agent correctly observes and learns this, the agent will converge on the right values. You get wireheading by removing “correctly observes”, but I think removing “human acts in perfect accordance with their preferences” is a better example of wireheading.


Acausal trade: double decrease
discussion post by Stuart Armstrong 17 days ago | 2 comments

I think the double decrease effect kicks in with uncertainty, but not with confident expectation of a smaller network.



Every point in this set is a Pareto optimum, so the outcome is based on how we choose the Pareto optimum, which is underspecified.

Recurse, using the agents’ concepts of fairness and the new Pareto set as the set of possibilities? And continue until the set stabilises? Then the agents will be truly indifferent between the outcomes.


The problem is that our original set was a product of the actions available to the players, so they were able to cut things off using their own actions. When you restrict to the Pareto frontier, this is no longer the case.



This seems a proper version of what I was trying to do here: https://agentfoundations.org/item?id=513


Yeah. The original generator of these ideas was that we were trying to find (or prove impossible) an improvement on NicerBot: an agent with reflective oracles that cooperates with itself (regardless of which reflective oracle is chosen), but is never exploited in expectation (even by epsilon).



Does this generalise to giving gradations of fairness, rather than boolean fair/unfair set memberships?


I don’t see how it would. The closest thing I can think of is letting agents choose randomly between different fair sets, but I don’t see what that would buy you.


Cooperative Oracles: Nonexploited Bargaining
post by Scott Garrabrant 16 days ago | Jessica Taylor, Patrick LaVictoire and Stuart Armstrong like this | 6 comments

In this post, we formalize and generalize the phenomenon described in the Eliezer Yudkowsky post Cooperating with agents with different ideas of fairness, while resisting exploitation.

continue reading »

Does this generalise to giving gradations of fairness, rather than boolean fair/unfair set memberships?



Every point in this set is a Pareto optimum, so the outcome is based on how we choose the Pareto optimum, which is underspecified.

Recurse, using the agents’ concepts of fairness and the new Pareto set as the set of possibilities? And continue until the set stabilises? Then the agents will be truly indifferent between the outcomes.



This seems a proper version of what I was trying to do here: https://agentfoundations.org/item?id=513


Generalizing Foundations of Decision Theory II
post by Abram Demski 35 days ago | Sam Eisenstat, Vadim Kosoy, Jessica Taylor and Patrick LaVictoire like this | 4 comments

As promised in the previous post, I develop my formalism for justifying as many of the decision-theoretic axioms as possible with generalized dutch-book arguments. (I’ll use the term “generalized dutch-book” to refer to arguments with a family resemblance to dutch-book or money-pump.) The eventual goal is to relax these assumptions in a way which addresses bounded processing power, but for now the goal is to get as much of classical decision theory as possible justified by a generalized dutch-book.

continue reading »

Regret Theory with General Choice Sets by John Quiggin is a generalization of decision theory more along the lines of what I was initially hoping to produce. It doesn’t try to justify probability theory (it assumes it). Like me, it considers sets of options rather than only binary choices. Unlike me, it requires that if the bookie makes sequential offers, the bookie must keep all previously-given offers in the set (where I only require the bookie to keep the option which the agent chooses). This blocks the money-pump argument for transitivity, but still allows significant constraints on preferences to be argued by money-pump.

The result of this modification to the setup is that the agent can have many different utility functions, used for different choice sets. The condition is that the utility function must stay the same whenever the “best achievable outcome” is the same (see the paper for that notion).


Infinite ethics comparisons
post by Stuart Armstrong 24 days ago | 1 comment

Work done with Amanda Askell; the errors are mine.

It’s very difficult to compare utilities across worlds with infinite populations. For instance, it seems clear that world \(w_1\) is better than \(w_2\), if the numbers indicate the utilities of the various agents:

  • \(w_1 = 1,0,1,0,1,0,1,0,1,0, \ldots\)
  • \(w_2 = 1,0,1,0,0,1,0,0,0,1, \ldots\)
continue reading »
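One way to make the intuition concrete (this illustration is mine, not from the post, and limiting average utility is only one possible way to compare such worlds): the running average utility in \(w_1\) tends to \(1/2\), while in \(w_2\), where each \(1\) is followed by one more \(0\) than the last, it tends to \(0\).

```python
from itertools import count, islice

def w1():
    # 1, 0, 1, 0, 1, 0, ...
    for i in count():
        yield 1 - (i % 2)

def w2():
    # 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, ...: each 1 followed by one more 0
    for gap in count(1):
        yield 1
        for _ in range(gap):
            yield 0

def running_average(world, n):
    return sum(islice(world, n)) / n

for n in (10, 1_000, 100_000):
    print(n, running_average(w1(), n), running_average(w2(), n))
```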

It’s not clear whether we have or need to have preferences over worlds with countably infinitely many symmetric individuals.

At a minimum, it’s also worth noting that anthropic reasoning is problematic in such worlds (I guess equivalently). A framework which answers questions of the form “what do you expect to happen?” could also probably be used to answer ethical questions.

(This kind of exercise seems potentially worthwhile anyway. I’m not so sure whether it is particularly relevant to AI alignment, for orthogonal reasons—hoping to figure out our values in advance seems to be giving up the game, as does making the kind of structural assumptions about value that could come back to bite us if we were wrong about infinite ethics.)



For strategies: This ties back in to the situation where there’s an observable event \(X\) that you can condition your strategy on, and the strategy space has a product structure \(\mathbb{S}=\mathbb{S}_X\times\mathbb{S}_{\neg X}\). This product structure seems important, since you should generally expect utility functions \(u\) to factor in the sense that \(u(s,t)=qu_X(s)+(1-q)u_{\neg X}(t)\) for some functions \(u_X\) and \(u_{\neg X}\), where \(q\) is the probability of \(X\) (I think for the relevance section, you want to assume that whenever there is such a product structure, \(p\) is supported on utility functions that factor, and you can define conditional utility for such functions). Arbitrary permutations of \(\mathbb{S}\) that do not preserve the product structure don’t seem like true symmetries, and I don’t think it should be expected that an aggregation rule should be invariant under them. In the real world, there are many observations that people can and do take into account when deciding what to do, so a good model of strategy-space should have a very rich structure.

For outcomes, which is what utility functions should be defined on anyway: Outcomes differ in terms of how achievable they are. I have an intuition that if an outcome is impossible, then removing it from the model shouldn’t have much effect. Like, you shouldn’t be able to rig the aggregator function in favor of moral theory 1 as opposed to moral theory 2 by having the model take into account all the possible outcomes that could realistically be achieved, and also a bunch of impossible outcomes that theory 2 thinks are either really good or really bad, and theory 1 thinks are close to neutral. A natural counter-argument is that before you know which outcomes are impossible, any Pareto-optimal way of aggregating your possible preference functions must not change based on what turns out to be achievable; I’ll have to think about that more. Also, approximate symmetries between peoples’ preferences seem relevant to interpersonal utility comparison in practice, in the sense that two peoples’ preferences tend to look fairly similar to each other in structure, but with each person’s utility function centered largely around what happens to themselves instead of the other person, and this seems to help us make comparisons of the form “the difference between outcomes 1 and 2 is more important for person A than for person B”; I’m not sure if this way of describing it is making sense.


OK, got a better formalism: https://agentfoundations.org/item?id=1449





I think I’ve got something that works; I’ll post it tomorrow.



Ok, I chose the picture proof because it was a particularly simple example of symmetry. What kind of internal structure are you thinking of?


For strategies: This ties back in to the situation where there’s an observable event \(X\) that you can condition your strategy on, and the strategy space has a product structure \(\mathbb{S}=\mathbb{S}_X\times\mathbb{S}_{\neg X}\). This product structure seems important, since you should generally expect utility functions \(u\) to factor in the sense that \(u(s,t)=qu_X(s)+(1-q)u_{\neg X}(t)\) for some functions \(u_X\) and \(u_{\neg X}\), where \(q\) is the probability of \(X\) (I think for the relevance section, you want to assume that whenever there is such a product structure, \(p\) is supported on utility functions that factor, and you can define conditional utility for such functions). Arbitrary permutations of \(\mathbb{S}\) that do not preserve the product structure don’t seem like true symmetries, and I don’t think it should be expected that an aggregation rule should be invariant under them. In the real world, there are many observations that people can and do take into account when deciding what to do, so a good model of strategy-space should have a very rich structure.

For outcomes, which is what utility functions should be defined on anyway: Outcomes differ in terms of how achievable they are. I have an intuition that if an outcome is impossible, then removing it from the model shouldn’t have much effect. Like, you shouldn’t be able to rig the aggregator function in favor of moral theory 1 as opposed to moral theory 2 by having the model take into account all the possible outcomes that could realistically be achieved, and also a bunch of impossible outcomes that theory 2 thinks are either really good or really bad, and theory 1 thinks are close to neutral. A natural counter-argument is that before you know which outcomes are impossible, any Pareto-optimal way of aggregating your possible preference functions must not change based on what turns out to be achievable; I’ll have to think about that more. Also, approximate symmetries between peoples’ preferences seem relevant to interpersonal utility comparison in practice, in the sense that two peoples’ preferences tend to look fairly similar to each other in structure, but with each person’s utility function centered largely around what happens to themselves instead of the other person, and this seems to help us make comparisons of the form “the difference between outcomes 1 and 2 is more important for person A than for person B”; I’m not sure if this way of describing it is making sense.
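A minimal sketch of the product structure and factoring described above (all names and numbers here are mine, purely illustrative): strategies are pairs in \(\mathbb{S}_X\times\mathbb{S}_{\neg X}\), and a utility that factors as \(u(s,t)=qu_X(s)+(1-q)u_{\neg X}(t)\) can be optimized coordinate by coordinate; this is precisely the structure that an arbitrary permutation of the strategy space would destroy.

```python
import itertools

# Toy product strategy space S = S_X x S_notX for an observable event X.
S_X = ["hedge", "invest"]    # what to do if X happens
S_notX = ["save", "spend"]   # what to do if X does not happen
q = 0.3                      # probability of X

u_X = {"hedge": 0.9, "invest": 0.4}    # component utilities (illustrative)
u_notX = {"save": 0.2, "spend": 0.7}

def u(s, t):
    """A utility function that factors across the event X."""
    return q * u_X[s] + (1 - q) * u_notX[t]

# Optimizing jointly over the product space...
best_joint = max(itertools.product(S_X, S_notX), key=lambda st: u(*st))
# ...agrees with optimizing each factor separately, because u factors.
best_factored = (max(S_X, key=u_X.get), max(S_notX, key=u_notX.get))
assert best_joint == best_factored
print(best_joint, u(*best_joint))
```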



Your picture proof looks correct, but it relies on symmetry, and I was saying that I prefer IIA instead of symmetry. I’m not particularly confident in my endorsement of IIA, but I am fairly confident in my non-endorsement of symmetry. In real situations, strategies/outcomes have a significant amount of internal structure which seems relevant and is not preserved by arbitrary permutations.

You’ve just replaced a type error with another type error. Elements of \(\mathbb{U}\) are just (equivalence classes of) functions \(\mathbb{S}\rightarrow\mathbb{R}\). Conditioning like that isn’t a supported operation.


You’re right. I’ve drawn the set of utility functions too broadly. I’ll attempt to fix this in the post.





Ok, I chose the picture proof because it was a particularly simple example of symmetry. What kind of internal structure are you thinking of?


Two Major Obstacles for Logical Inductor Decision Theory
post by Scott Garrabrant 41 days ago | Alex Mennen, Sam Eisenstat, Abram Demski, Jessica Taylor, Patrick LaVictoire and Tsvi Benson-Tilsen like this | 1 comment

In this post, I describe two major obstacles for logical inductor decision theory: untaken actions are not observable, and there is no updatelessness for computations. I will concretely describe both of these problems in a logical inductor framework, but I believe that both issues are general enough to transcend that framework.

continue reading »

I’ll just note that in a modal logic or halting oracle setting you don’t need the chicken rule, as we found in this old post: https://agentfoundations.org/item?id=4 So it seems like at least the first problem is about the approximation, not the thing being approximated.

