Intelligent Agent Foundations Forum: http://agentfoundations.org/

  • Comment on Some Criticisms of the Logical Induction paper (Sam Eisenstat): http://agentfoundations.org/item?id=1596
  • Comment on Current thoughts on Paul Christiano's research agenda (Wei Dai): http://agentfoundations.org/item?id=1595
  • Comment on Current thoughts on Paul Christiano's research agenda (Paul Christiano): http://agentfoundations.org/item?id=1594
  • On the computational feasibility of forecasting using gamblers (Vadim Kosoy): http://agentfoundations.org/item?id=1593
  • Comment on Current thoughts on Paul Christiano's research agenda (Wei Dai): http://agentfoundations.org/item?id=1592
  • Open Problems Regarding Counterfactuals: An Introduction For Beginners (Alex Appel): http://agentfoundations.org/item?id=1591
  • Comment on Current thoughts on Paul Christiano's research agenda (Wei Dai): http://agentfoundations.org/item?id=1590
  • Comment on Current thoughts on Paul Christiano's research agenda (Paul Christiano): http://agentfoundations.org/item?id=1589

Current thoughts on Paul Christiano's research agenda (Jessica Taylor): http://agentfoundations.org/item?id=1534

This post summarizes my thoughts on Paul Christiano’s agenda in general and ALBA in particular.

"Like this world, but..."http://agentfoundations.org/item?id=1527Stuart Armstrong

A putative new idea for AI control; index here.

Pick a very unsafe goal: \(G=\)“AI, make this world richer and less unequal.” What does this mean as a goal, and can we make it safe?

I’ve started to sketch out how we can codify “human understanding” in terms of human ability to answer questions.

Here I’m investigating the reverse problem, to see whether the same idea can be used to give instructions to an AI.

  • Improved formalism for corruption in DIRL (Vadim Kosoy): http://agentfoundations.org/item?id=1587
  • Comment on Smoking Lesion Steelman (Johannes Treutlein): http://agentfoundations.org/item?id=1585
  • Comment on Some Criticisms of the Logical Induction paper (Vadim Kosoy): http://agentfoundations.org/item?id=1579
  • Comment on Some Criticisms of the Logical Induction paper (Vadim Kosoy): http://agentfoundations.org/item?id=1577
  • Comment on Some Criticisms of the Logical Induction paper (Vadim Kosoy): http://agentfoundations.org/item?id=1575
  • Comment on Smoking Lesion Steelman (Vadim Kosoy): http://agentfoundations.org/item?id=1572
  • Comment on Smoking Lesion Steelman (Abram Demski): http://agentfoundations.org/item?id=1571
  • Comment on Some Criticisms of the Logical Induction paper (Vadim Kosoy): http://agentfoundations.org/item?id=1570
  • Comment on Some Criticisms of the Logical Induction paper (Vadim Kosoy): http://agentfoundations.org/item?id=1564
  • Comment on Some Criticisms of the Logical Induction paper (Alex Mennen): http://agentfoundations.org/item?id=1563
  • Comment on Some Criticisms of the Logical Induction paper (Vladimir Slepnev): http://agentfoundations.org/item?id=1562
  • Comment on Some Criticisms of the Logical Induction paper (Vadim Kosoy): http://agentfoundations.org/item?id=1561
  • Comment on Smoking Lesion Steelman (Vadim Kosoy): http://agentfoundations.org/item?id=1560
  • Comment on Cooperative Oracles: Stratified Pareto Optima and Almost Stratified Pareto Optima (Vadim Kosoy): http://agentfoundations.org/item?id=1557
  • Comment on Two Questions about Solomonoff Induction (Sam Eisenstat): http://agentfoundations.org/item?id=1556
  • Comment on AI safety: three human problems and one AI issue (Vadim Kosoy): http://agentfoundations.org/item?id=1555
  • Comment on A cheating approach to the tiling agents problem (Vladimir Slepnev): http://agentfoundations.org/item?id=1554
  • Comment on Smoking Lesion Steelman (Paul Christiano): http://agentfoundations.org/item?id=1553
  • Comment on A cheating approach to the tiling agents problem (Vladimir Slepnev): http://agentfoundations.org/item?id=1552

Smoking Lesion Steelman (Abram Demski): http://agentfoundations.org/item?id=1525

It seems plausible to me that every example I’ve seen so far which seems to require causal/counterfactual reasoning is more properly solved by adopting the right updateless perspective and taking the action or policy which achieves maximum expected utility from that perspective. If this were the right view, then the aim would be to construct something like updateless EDT.

I give a variant of the smoking lesion problem which overcomes an objection to the classic smoking lesion, and which is solved correctly by CDT, but which is not solved by updateless EDT.
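
As a concrete illustration of how the two decision theories can come apart, here is a minimal Python sketch of the classic smoking lesion (not the steelmanned variant from the post) with made-up numbers: naive EDT conditions on the action and refuses to smoke, while CDT treats the action as an intervention and smokes.

    # Classic smoking lesion with illustrative (made-up) numbers: a lesion
    # causes both the urge to smoke and cancer; smoking itself is harmless.
    P_LESION = 0.5
    P_SMOKE_GIVEN_LESION = 0.9       # the lesion makes smoking likely
    P_SMOKE_GIVEN_NO_LESION = 0.1
    P_CANCER_GIVEN_LESION = 0.9      # cancer depends only on the lesion
    P_CANCER_GIVEN_NO_LESION = 0.01
    U_SMOKE = 10.0                   # enjoyment of smoking
    U_CANCER = -100.0

    def edt_value(smoke):
        """Naive EDT: update the lesion probability on the observed action."""
        p_act_lesion = P_SMOKE_GIVEN_LESION if smoke else 1 - P_SMOKE_GIVEN_LESION
        p_act_no_lesion = P_SMOKE_GIVEN_NO_LESION if smoke else 1 - P_SMOKE_GIVEN_NO_LESION
        p_act = p_act_lesion * P_LESION + p_act_no_lesion * (1 - P_LESION)
        p_lesion_given_act = p_act_lesion * P_LESION / p_act
        p_cancer = (p_lesion_given_act * P_CANCER_GIVEN_LESION
                    + (1 - p_lesion_given_act) * P_CANCER_GIVEN_NO_LESION)
        return (U_SMOKE if smoke else 0.0) + p_cancer * U_CANCER

    def cdt_value(smoke):
        """CDT: intervening on smoking leaves the lesion probability unchanged."""
        p_cancer = P_LESION * P_CANCER_GIVEN_LESION + (1 - P_LESION) * P_CANCER_GIVEN_NO_LESION
        return (U_SMOKE if smoke else 0.0) + p_cancer * U_CANCER

    print("EDT:", "smoke" if edt_value(True) > edt_value(False) else "don't smoke")  # don't smoke
    print("CDT:", "smoke" if cdt_value(True) > cdt_value(False) else "don't smoke")  # smoke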

  • Comment on A cheating approach to the tiling agents problem (Alex Mennen): http://agentfoundations.org/item?id=1551

Delegative Inverse Reinforcement Learning (Vadim Kosoy): http://agentfoundations.org/item?id=1550

We introduce a reinforcement-learning-like setting we call Delegative Inverse Reinforcement Learning (DIRL). In DIRL, the agent can, at any point in time, delegate the choice of action to an “advisor”. The agent knows neither the environment nor the reward function, whereas the advisor knows both. Thus, DIRL can be regarded as a special case of CIRL. A similar setting was studied in Clouse 1997, but as far as we can tell, the relevant literature offers few theoretical results and virtually all researchers focus on the MDP case (please correct me if I’m wrong). On the other hand, we consider general environments (not necessarily MDP or even POMDP) and prove a natural performance guarantee.

The use of an advisor allows us to kill two birds with one stone: learning the reward function and safe exploration (i.e. avoiding both the Scylla of “Bayesian paranoia” and the Charybdis of falling into traps). We prove that, given certain assumptions about the advisor, a Bayesian DIRL agent (whose prior is supported on some countable set of hypotheses) is guaranteed to attain most of the value in the limit of slowly falling time discount (i.e. long-term planning), assuming one of the hypotheses in the prior is true. The assumptions about the advisor are quite strong, but the advisor is not required to be fully optimal: a “soft maximizer” satisfies the conditions. Moreover, we allow for the existence of “corrupt states” in which the advisor stops being a relevant signal, thus demonstrating that this approach can deal with wireheading and avoid manipulating the advisor, at least in principle (the assumptions about the advisor are still unrealistically strong). Finally, we consider advisors that don’t know the environment but have some beliefs about it, and show that in this case the agent converges to Bayes-optimality w.r.t. the advisor’s beliefs, which is arguably the best we can expect.
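
For intuition about the setting (not the results), here is a minimal Python sketch of the interaction protocol as I read it; the `env`, `agent`, and `advisor` objects and their method names are hypothetical stand-ins, and none of the Bayesian machinery or the performance guarantee is reproduced.

    # Hypothetical interface sketch of one DIRL episode: at every step the
    # agent may either act itself or delegate the choice to the advisor,
    # who (unlike the agent) knows the environment and the reward function.
    def dirl_episode(env, agent, advisor, horizon):
        observation = env.reset()
        history = []
        for _ in range(horizon):
            if agent.wants_to_delegate(history, observation):
                action = advisor.choose_action(observation)        # advisor acts on the agent's behalf
            else:
                action = agent.choose_action(history, observation)
            observation = env.step(action)                         # the agent never observes rewards directly
            history.append((action, observation))
        return history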

A cheating approach to the tiling agents problem (Vladimir Slepnev): http://agentfoundations.org/item?id=1547

(This post resulted from a conversation with Wei Dai.)

Formalizing the tiling agents problem is very delicate. In this post I’ll show a toy problem and a solution to it, which arguably meets all the desiderata stated before, but only by cheating in a new and unusual way.

Here’s a summary of the toy problem: we ask an agent to solve a difficult math question and also design a successor agent. Then the successor must solve another math question and design its own successor, and so on. The questions get harder each time, so they can’t all be solved in advance, and each of them requires believing in Peano arithmetic (PA). This goes on for a fixed number of rounds, and the final reward is the number of correct answers.

Moreover, we will demand that the agent handle both subtasks (solving the math question and designing the successor) using the same logic. Finally, we will demand that the agent be able to reproduce itself on each round, not just design a custom-made successor that solves the math question with PA and reproduces itself by quining.
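
To pin down the shape of the toy problem (not its solution), here is a minimal Python sketch of the evaluation loop, assuming a hypothetical `build_agent` constructor and an agent interface with a single `act` method that returns both the answer and the successor's source.

    # Hypothetical evaluation loop for the toy problem: on each round the
    # current agent must both answer a math question and emit a successor,
    # and the final reward is the number of correct answers.
    def run_rounds(initial_source, questions_and_answers, build_agent):
        source = initial_source
        reward = 0
        for question, correct_answer in questions_and_answers:
            agent = build_agent(source)                      # instantiate the current agent from its source
            answer, successor_source = agent.act(question)   # one call handles both subtasks
            if answer == correct_answer:
                reward += 1
            source = successor_source                        # the successor takes over next round
        return reward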

  • Some Criticisms of the Logical Induction paper (Tarn Somervell Fletcher): http://agentfoundations.org/item?id=1544

Loebian cooperation in the tiling agents problem (Vladimir Slepnev): http://agentfoundations.org/item?id=1532

The tiling agents problem is about formalizing how AIs can create successor AIs that are at least as smart. Here’s a toy model I came up with, which is similar to Benya’s old model but simpler. A computer program X is asked one of two questions (see the sketch after the list):

  • Would you like some chocolate?

  • Here’s the source code of another program Y. Do you accept it as your successor?
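
The following Python sketch is my own crude stand-in for this interface, not the post's construction: the real model hinges on X searching for proofs (in PA) about Y, which is replaced here by simply running the candidate on the first question, so it illustrates only the two-question interface, not the Löbian reasoning.

    # Crude stand-in: proof search about the successor is replaced by
    # directly running the candidate, which only works in this toy interface.
    def X(question, candidate=None):
        if question == "chocolate":
            return "yes"                                  # X accepts the chocolate
        if question == "successor":
            # Stand-in for "find a proof that the candidate behaves acceptably":
            # here we just check that it also accepts chocolate.
            return "accept" if candidate("chocolate") == "yes" else "reject"

    assert X("chocolate") == "yes"
    assert X("successor", candidate=X) == "accept"        # X accepts an exact copy of itself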

Humans are not agents: short vs long term (Stuart Armstrong): http://agentfoundations.org/item?id=1515

A putative new idea for AI control; index here.

This is an example of humans not being (idealised) agents.

Imagine a human who has a preference to not live beyond a hundred years. However, they want to live to next year, and it’s predictable that every year they are alive, they will have the same desire to survive till the next year.

  • New circumstances, new values? (Stuart Armstrong): http://agentfoundations.org/item?id=1506

Cooperative Oracles: Stratified Pareto Optima and Almost Stratified Pareto Optima (Scott Garrabrant): http://agentfoundations.org/item?id=1508

In this post, we generalize the notions in Cooperative Oracles: Nonexploited Bargaining to deal with the possibility of introducing extra agents that have no control but have preferences. We further generalize this to infinitely many agents. (Part of the series started here.)

  • Futarchy, Xrisks, and near misses (Stuart Armstrong): http://agentfoundations.org/item?id=1505

Futarchy Fix (Abram Demski): http://agentfoundations.org/item?id=1493

Robin Hanson’s Futarchy is a proposal to let prediction markets make governmental decisions. We can view an operating Futarchy as an agent, and ask if it is aligned with the interests of its constituents. I am aware of two main failures of alignment: (1) since predicting rare events is rewarded in proportion to their rareness, prediction markets heavily incentivise causing rare events to happen (I’ll call this the entropy-market problem); (2) it seems prediction markets would not be able to assign probability to existential risk, since you can’t collect on bets after everyone’s dead (I’ll call this the existential risk problem). I provide three formulations of (1) and solve two of them, and make some comments on (2). (Thanks to Scott for pointing out the second of these problems to me; I don’t remember who originally told me about the first problem, but also thanks.)
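
To make failure (1) concrete, here is a bit of illustrative arithmetic of my own (not from the post): a contract that pays 1 if an event happens and trades at price p returns (1 - p)/p per dollar staked, so a bettor who can cheaply cause the event profits more the rarer the market believes it to be.

    # Illustrative arithmetic for the incentive problem: return per dollar
    # staked on "event happens" if the bettor then causes the event.
    def return_per_dollar(price):
        return (1.0 - price) / price

    for price in (0.10, 0.01, 0.001):
        print(f"market price {price:>5}: {return_per_dollar(price):>7.1f}x return if the event is caused")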

Divergent preferences and meta-preferences (Stuart Armstrong): http://agentfoundations.org/item?id=1492

A putative new idea for AI control; index here.

In simple graphical form (the diagram is in the linked post), here is the problem of divergent human preferences.

  • Optimisation in manipulating humans: engineered fanatics vs yes-men (Stuart Armstrong): http://agentfoundations.org/item?id=1489
  • An Approach to Logically Updateless Decisions (Abram Demski): http://agentfoundations.org/item?id=1472

Acausal trade: conclusion: theory vs practice (Stuart Armstrong): http://agentfoundations.org/item?id=1482

A putative new idea for AI control; index here.

When I started this dive into acausal trade, I expected to find subtle and interesting theoretical considerations. Instead, most of the issues are practical.

  • Acausal trade: trade barriers (Stuart Armstrong): http://agentfoundations.org/item?id=1480
  • Value Learning for Irrational Toy Models (Patrick LaVictoire): http://agentfoundations.org/item?id=1467
  • Acausal trade: universal utility, or selling non-existence insurance too late (Stuart Armstrong): http://agentfoundations.org/item?id=1471

Why I am not currently working on the AAMLS agenda (Jessica Taylor): http://agentfoundations.org/item?id=1470

(Note: this is a personal statement, not an official MIRI statement. I am not speaking for others who have been involved with the agenda.)

The AAMLS (Alignment for Advanced Machine Learning Systems) agenda is a MIRI project on determining how to use hypothetical highly advanced machine learning systems safely. I was previously working on problems in this agenda and am currently not.

Cooperative Oracles: Nonexploited Bargaining (Scott Garrabrant): http://agentfoundations.org/item?id=1469

In this post, we formalize and generalize the phenomenon described in the Eliezer Yudkowsky post Cooperating with agents with different ideas of fairness, while resisting exploitation. (Part of the series started here.)

Cooperative Oracles: Introduction (Scott Garrabrant): http://agentfoundations.org/item?id=1468

This is the first in a series of posts introducing a new tool called a Cooperative Oracle. All of these posts are joint work with Sam Eisenstat, Tsvi Benson-Tilsen, and Nisan Stiennon.

Here is my plan for posts in this sequence. I will update this as I go.

  1. Introduction
  2. Nonexploited Bargaining
  3. Stratified Pareto Optima and Almost Stratified Pareto Optima
  4. Definition and Existence Proof
  5. Alternate Notions of Dependency