The Doomsday argument in anthropic decision theory
post by Stuart Armstrong 19 days ago | Abram Demski likes this | discuss

In Anthropic Decision Theory (ADT), behaviours that resemble the Self Sampling Assumption (SSA) derive from average utilitarian preferences (and from certain specific selfish preferences).

However, SSA implies the doomsday argument, and, to date, I hadn’t found a good way to express the doomsday argument within ADT.

This post will remedy that hole, by showing how there is a natural doomsday-like behaviour for average utilitarian agents within ADT.
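
For readers who want the SSA side of this made concrete, here is a minimal sketch of the standard doomsday update. It is not the ADT construction from the post, which this summary does not spell out. The function name `ssa_posterior`, the two population totals, and the birth rank are all hypothetical values chosen for illustration.

```python
def ssa_posterior(prior, totals, rank):
    """Posterior over hypotheses about the total number of observers,
    given one's own birth rank, under the Self Sampling Assumption."""
    # Under SSA an observer is equally likely to hold any of the N birth
    # ranks, so P(rank | N) = 1/N if rank <= N and 0 otherwise.
    weights = [p * (1.0 / n if rank <= n else 0.0) for p, n in zip(prior, totals)]
    z = sum(weights)
    return [w / z for w in weights]


# Hypothetical numbers: a "doom soon" world with 200 billion humans in total
# versus a "doom late" world with 200 trillion, at equal prior probability,
# for an observer whose birth rank is 100 billion.
posterior = ssa_posterior(prior=[0.5, 0.5], totals=[200e9, 200e12], rank=100e9)
print(posterior)
```

Under these assumed numbers almost all of the posterior mass moves to the smaller world, which is the doomsday shift that SSA-like behaviour has to reproduce.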

Delegative Reinforcement Learning with a Merely Sane Advisor
post by Vadim Kosoy 20 days ago | discuss

Previously, we defined a setting called “Delegative Inverse Reinforcement Learning” (DIRL) in which the agent can delegate actions to an “advisor” and, in addition, the reward is visible only to the advisor. We proved a sublinear regret bound (converted to traditional normalization in online learning, the bound is $$O(n^{2/3})$$) for one-shot DIRL (as opposed to standard regret bounds in RL, which are only applicable in the episodic setting). However, this required a rather strong assumption about the advisor: in particular, the advisor had to choose the optimal action with maximal likelihood. Here, we consider “Delegative Reinforcement Learning” (DRL), i.e. a similar setting in which the reward is directly observable by the agent. We also restrict our attention to finite MDP environments (we believe these results can be generalized to a much larger class of environments, but not to arbitrary environments). On the other hand, the assumption about the advisor is much weaker: the advisor is only required to avoid catastrophic actions (i.e. actions that lose value to zeroth order in the interest rate) and assign some positive probability to a nearly optimal action. As before, we prove a one-shot regret bound (in traditional normalization, $$O(n^{3/4})$$). Analogously to before, we allow for “corrupt” states in which both the advisor and the reward signal stop being reliable.

Autopoietic systems and difficulty of AGI alignment
post by Jessica Taylor 33 days ago | Ryan Carey, Owen Cotton-Barratt and Paul Christiano like this | 13 comments

I have recently come to the opinion that AGI alignment is probably extremely hard. But it’s not clear exactly what AGI or AGI alignment are. And there are some forms of alignment of “AI” systems that are easy. Here I operationalize “AGI” and “AGI alignment” in some different ways and evaluate their difficulties.

Density Zero Exploration
post by Alex Mennen 34 days ago | Abram Demski and Scott Garrabrant like this | discuss

The idea here is due to Scott Garrabrant. All I did was write it.

Logical Induction with incomputable sequences
post by Alex Mennen 34 days ago | Abram Demski and Scott Garrabrant like this | discuss

In the definition of a logical inductor, the deductive process is required to be computable. This, of course, does not allow the logical inductor to use randomness, or predict uncomputable sequences. The way traders were defined in the logical induction paper, this was necessary, because the traders were not given access to the output of the deductive process.

Stable Pointers to Value: An Agent Embedded in Its Own Utility Function
discussion post by Abram Demski 34 days ago | Tom Everitt, Scott Garrabrant and Vladimir Slepnev like this | discuss

Conditioning on Conditionals
post by Scott Garrabrant 34 days ago | Abram Demski likes this | discuss

(From conversations with Sam, Abram, Tsvi, Marcello, and Ashwin Sah) A basic EDT agent starts with a prior, updates on a bunch of observations, and then has a choice between various actions. It conditions on each possible action it could take, and takes the action for which this conditional leads to the highest expected utility. An updateless (but non-policy selection) EDT agent has a problem here. It wants to not update on the observations, but it wants to condition on the fact that it takes a specific action given its observations. It is not obvious what this conditional should look like. In this post, I argue for a particular way to interpret conditioning on this conditional (of taking a specific action given a specific observation).

The Three Levels of Goodhart's Curse
post by Scott Garrabrant 34 days ago | Vadim Kosoy, Abram Demski and Paul Christiano like this | discuss

Goodhart’s curse is a neologism by Eliezer Yudkowsky stating that “neutrally optimizing a proxy measure U of V seeks out upward divergence of U from V.” It is related to many nearby concepts (e.g. the tails come apart, winner’s curse, optimizer’s curse, regression to the mean, overfitting, edge instantiation, Goodhart’s law). I claim that there are three main mechanisms through which Goodhart’s curse operates.
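
The quoted statement can be checked numerically in its simplest form. The sketch below is only an illustration of the optimizer's-curse and regression-to-the-mean flavour of the effect, not of the three mechanisms the post goes on to describe. It assumes a true value V drawn from a standard normal and a proxy U equal to V plus independent Gaussian noise; the function name `selected_gap` and all parameters are hypothetical.

```python
import random


def selected_gap(n_options, noise_sd, trials, seed=0):
    """Average of U - V at the option that maximizes the proxy U."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        best_u, best_v = float("-inf"), 0.0
        for _ in range(n_options):
            v = rng.gauss(0.0, 1.0)            # true value V
            u = v + rng.gauss(0.0, noise_sd)   # noisy proxy U of V
            if u > best_u:
                best_u, best_v = u, v
        total += best_u - best_v
    return total / trials


# With a single option there is no selection, so the gap averages out near zero.
print(selected_gap(n_options=1, noise_sd=1.0, trials=2000))
# Selecting the best of many options by U makes U overshoot V on average.
print(selected_gap(n_options=100, noise_sd=1.0, trials=2000))
```

The proxy is unbiased for every individual option, yet the option picked by maximizing U has U above V on average: selection itself produces the upward divergence.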

On the computational feasibility of forecasting using gamblers
discussion post by Vadim Kosoy 63 days ago | discuss

Current thoughts on Paul Christiano's research agenda
post by Jessica Taylor 65 days ago | Ryan Carey, Owen Cotton-Barratt, Sam Eisenstat, Paul Christiano, Stuart Armstrong and Wei Dai like this | 15 comments

This post summarizes my thoughts on Paul Christiano’s agenda in general and ALBA in particular.

"Like this world, but..."
post by Stuart Armstrong 67 days ago | discuss

A putative new idea for AI control; index here.

Pick a very unsafe goal: $$G=$$“AI, make this world richer and less unequal.” What does this mean as a goal, and can we make it safe?

I’ve started to sketch out how we can codify “human understanding” in terms of human ability to answer questions.

Here I’m investigating the reverse problem, to see whether the same idea can be used to give instructions to an AI.

Improved formalism for corruption in DIRL
discussion post by Vadim Kosoy 69 days ago | discuss

Smoking Lesion Steelman
post by Abram Demski 79 days ago | Tom Everitt, Sam Eisenstat, Vadim Kosoy, Paul Christiano and Scott Garrabrant like this | 8 comments

It seems plausible to me that any example I’ve seen so far which seems to require causal/counterfactual reasoning is more properly solved by taking the right updateless perspective, and taking the action or policy which achieves maximum expected utility from that perspective. If this were the right view, then the aim would be to construct something like updateless EDT.

I give a variant of the smoking lesion problem which overcomes an objection to the classic smoking lesion, and which is solved correctly by CDT, but which is not solved by updateless EDT.

Delegative Inverse Reinforcement Learning

We introduce a reinforcement-like learning setting we call Delegative Inverse Reinforcement Learning (DIRL). In DIRL, the agent can, at any point of time, delegate the choice of action to an “advisor”. The agent knows neither the environment nor the reward function, whereas the advisor knows both. Thus, DIRL can be regarded as a special case of CIRL. A similar setting was studied in Clouse 1997, but as far as we can tell, the relevant literature offers few theoretical results and virtually all researchers focus on the MDP case (please correct me if I’m wrong). On the other hand, we consider general environments (not necessarily MDP or even POMDP) and prove a natural performance guarantee.

A cheating approach to the tiling agents problem
post by Vladimir Slepnev 81 days ago | Alex Mennen and Vadim Kosoy like this | 2 comments

(This post resulted from a conversation with Wei Dai.)

Formalizing the tiling agents problem is very delicate. In this post I’ll show a toy problem and a solution to it, which arguably meets all the desiderata stated before, but only by cheating in a new and unusual way.

Here’s a summary of the toy problem: we ask an agent to solve a difficult math question and also design a successor agent. Then the successor must solve another math question and design its own successor, and so on. The questions get harder each time, so they can’t all be solved in advance, and each of them requires believing in Peano arithmetic (PA). This goes on for a fixed number of rounds, and the final reward is the number of correct answers.

Moreover, we will demand that the agent must handle both subtasks (solving the math question and designing the successor) using the same logic. Finally, we will demand that the agent be able to reproduce itself on each round, not just design a custom-made successor that solves the math question with PA and reproduces itself by quining.

Loebian cooperation in the tiling agents problem
post by Vladimir Slepnev 87 days ago | Alex Mennen, Vadim Kosoy, Abram Demski, Patrick LaVictoire and Stuart Armstrong like this | 4 comments

The tiling agents problem is about formalizing how AIs can create successor AIs that are at least as smart. Here’s a toy model I came up with, which is similar to Benya’s old model but simpler. A computer program X is asked one of two questions:

• Would you like some chocolate?

• Here’s the source code of another program Y. Do you accept it as your successor?

Humans are not agents: short vs long term
post by Stuart Armstrong 102 days ago | 2 comments

A putative new idea for AI control; index here.

This is an example of humans not being (idealised) agents.

Imagine a human who has a preference to not live beyond a hundred years. However, they want to live to next year, and it’s predictable that every year they are alive, they will have the same desire to survive till the next year.

New circumstances, new values?
discussion post by Stuart Armstrong 105 days ago | discuss

Cooperative Oracles: Stratified Pareto Optima and Almost Stratified Pareto Optima
post by Scott Garrabrant 109 days ago | Vadim Kosoy, Patrick LaVictoire and Stuart Armstrong like this | 8 comments

In this post, we generalize the notions in Cooperative Oracles: Nonexploited Bargaining to deal with the possibility of introducing extra agents that have no control but have preferences. We further generalize this to infinitely many agents. (Part of the series started here.)

Futarchy, Xrisks, and near misses
discussion post by Stuart Armstrong 109 days ago | Abram Demski likes this | discuss

Futarchy Fix
post by Abram Demski 112 days ago | Scott Garrabrant and Stuart Armstrong like this | 9 comments

Robin Hanson’s Futarchy is a proposal to let prediction markets make governmental decisions. We can view an operating Futarchy as an agent, and ask if it is aligned with the interests of its constituents. I am aware of two main failures of alignment: (1) since predicting rare events is rewarded in proportion to their rareness, prediction markets heavily incentivise causing rare events to happen (I’ll call this the entropy-market problem); (2) it seems prediction markets would not be able to assign probability to existential risk, since you can’t collect on bets after everyone’s dead (I’ll call this the existential risk problem). I provide three formulations of (1) and solve two of them, and make some comments on (2). (Thanks to Scott for pointing out the second of these problems to me; I don’t remember who originally told me about the first problem, but also thanks.)
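
Problem (1) can be made concrete with a small sketch. It assumes the market pays traders by a logarithmic scoring rule, which is an assumption here (the summary does not say which rule is used), and the function name `log_score_gain` is hypothetical. Under that rule, a trader who moves the market probability of an event from `p_before` to `p_after` earns `log(p_after / p_before)` if the event occurs.

```python
import math


def log_score_gain(p_before, p_after):
    """Payout, if the event occurs, to a trader who moved the market
    probability from p_before to p_after under a log scoring rule."""
    return math.log(p_after / p_before)


# A trader who knows the event will happen (for instance because they will
# cause it) pushes the price to 0.99. The payout grows as the event was
# considered rarer beforehand.
for p in (0.5, 0.1, 0.01, 0.001):
    print(p, log_score_gain(p, 0.99))
```

Under this assumed rule the payout is unbounded as the prior probability goes to zero, so the largest rewards go to whoever can make the least expected events happen.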

Divergent preferences and meta-preferences
post by Stuart Armstrong 113 days ago | discuss

A putative new idea for AI control; index here.

In simple graphical form, here is the problem of divergent human preferences:

Optimisation in manipulating humans: engineered fanatics vs yes-men
discussion post by Stuart Armstrong 117 days ago | discuss

An Approach to Logically Updateless Decisions
discussion post by Abram Demski 122 days ago | Sam Eisenstat, Jack Gallagher and Scott Garrabrant like this | 4 comments

AI safety: three human problems and one AI issue
post by Stuart Armstrong 123 days ago | Ryan Carey and Daniel Dewey like this | 2 comments

A putative new idea for AI control; index here.

There have been various attempts to classify the problems in AI safety research. These range from our old Oracle paper, which classified then-theoretical methods of control, to more recent classifications that grow out of modern, more concrete problems.

These all serve their purpose, but I think a more enlightening classification of the AI safety problems is to look at the issues we are trying to solve or avoid. And most of these issues are problems about humans.
