Intelligent Agent Foundations Forum: http://agentfoundations.org/

Comment on Catastrophe Mitigation Using DRL (Vadim Kosoy): http://agentfoundations.org/item?id=1719
Where does ADT Go Wrong? (Abram Demski): http://agentfoundations.org/item?id=1717
Comment on The Happy Dance Problem (Wei Dai): http://agentfoundations.org/item?id=1718
Comment on Catastrophe Mitigation Using DRL (Gordon Worley III): http://agentfoundations.org/item?id=1716
Catastrophe Mitigation Using DRL (Vadim Kosoy): http://agentfoundations.org/item?id=1715

Previously we derived a regret bound for DRL which assumed the advisor is “locally sane.” Such an advisor can only take actions that don’t lose any value in the long term. In particular, if the environment contains a latent catastrophe that manifests at a certain rate (such as the possibility of a UFAI), a locally sane advisor has to take the optimal course of action to mitigate it, since every delay yields a positive probability of the catastrophe manifesting and leading to permanent loss of value. This state of affairs is unsatisfactory, since we would like to have performance guarantees for an AI that can mitigate catastrophes the human operator cannot mitigate on their own. To address this problem, we introduce a new form of DRL where, in every hypothetical environment, the set of uncorrupted states is divided into “dangerous” (impending catastrophe) and “safe” (catastrophe was mitigated). The advisor is then only required to be locally sane in safe states, whereas in dangerous states a certain “leaking” of long-term value is allowed. We derive a regret bound in this setting as a function of the time discount factor, the expected catastrophe mitigation time for the optimal policy, and the “value leak” rate (essentially the rate of catastrophe occurrence). The form of this regret bound implies that in certain asymptotic regimes the agent attains near-optimal expected utility (and in particular mitigates the catastrophe with probability close to 1), whereas the advisor on its own fails to mitigate the catastrophe with probability close to 1.

Catastrophe Mitigation Using DRL (Vadim Kosoy): http://agentfoundations.org/item?id=1714
The Happy Dance Problem (Abram Demski): http://agentfoundations.org/item?id=1713

Since the invention of logical induction, people have been trying to figure out what logically updateless reasoning could be. This is motivated by the idea that, in the realm of Bayesian uncertainty (i.e., empirical uncertainty), updateless decision theory is the simple solution to the problem of reflective consistency. Naturally, we’d like to import this success to logically uncertain decision theory.

At a research retreat during the summer, we realized that updateless decision theory wasn’t so easy to define even in the seemingly simple Bayesian case. A possible solution was written up in Conditioning on Conditionals. However, that didn’t end up being especially satisfying.

Here, I introduce the happy dance problem, which more clearly illustrates the difficulty in defining updateless reasoning in the Bayesian case. I also outline Scott’s current thoughts about the correct way of reasoning about this problem.

Reflective oracles as a solution to the converse Lawvere problem (Sam Eisenstat): http://agentfoundations.org/item?id=1712

1 Introduction

Before the work of Turing, one could justifiably be skeptical of the idea of a universal computable function. After all, there is no computable function \(f\colon\mathbb{N}\times\mathbb{N}\to\mathbb{N}\) such that for all computable \(g\colon\mathbb{N}\to\mathbb{N}\) there is some index \(i_{g}\) such that \(f\left(i_{g},n\right)=g\left(n\right)\) for all \(n\). If there were, we could pick \(g\left(n\right)=f\left(n,n\right)+1\), and then \[g\left(i_{g}\right)=f\left(i_{g},i_{g}\right)+1=g\left(i_{g}\right)+1,\] a contradiction. Of course, universal Turing machines don’t run into this obstacle; as Gödel put it, “By a kind of miracle it is not necessary to distinguish orders, and the diagonal procedure does not lead outside the defined notion.” [1]

The miracle of Turing machines is that there is a partial computable function \(f\colon\mathbb{N}\times\mathbb{N}\to\mathbb{N}\cup\left\{ \bot\right\}\) such that for all partial computable \(g\colon\mathbb{N}\to\mathbb{N}\cup\left\{ \bot\right\}\) there is an index \(i\) such that \(f\left(i,n\right)=g\left(n\right)\) for all \(n\). Here, we look at a different “miracle”, that of reflective oracles [2,3]. As we will see in Theorem 1, given a reflective oracle \(O\), there is a (stochastic) \(O\)-computable function \(f\colon\mathbb{N}\times\mathbb{N}\to\left\{ 0,1\right\}\) such that for any (stochastic) \(O\)-computable function \(g\colon\mathbb{N}\to\left\{ 0,1\right\}\), there is some index \(i\) such that \(f\left(i,n\right)\) and \(g\left(n\right)\) have the same distribution for all \(n\). This existence theorem seems to skirt even closer to the contradiction mentioned above.
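
To make the contrast with the total case concrete, here is a minimal Python sketch (my own illustration, not taken from the paper): programs are Python source strings defining a function g, the universal partial function f runs a program on an input, and the diagonal program applied to itself simply fails to halt rather than producing a contradiction. The names f, g, and diag_src are purely illustrative.

def f(src, n):
    """Universal partial function: run the program `src` (a source string
    defining g) on input n. The call may fail to halt, which is exactly
    what makes f partial rather than total."""
    scope = {}
    exec(src, scope)      # define g in a fresh namespace
    return scope["g"](n)  # may diverge

# The diagonal program: g(n) = f(n, n) + 1, where the input n is itself a
# program source string.
diag_src = """
def g(n):
    scope = {}
    exec(n, scope)
    return scope["g"](n) + 1
"""

# Applying the diagonal program to an ordinary program behaves as expected:
print(f(diag_src, "def g(n): return 41"))  # prints 42

# But f(diag_src, diag_src) never halts (in practice it exhausts Python's
# recursion limit): g(i_g) = g(i_g) + 1 is not a contradiction when both
# sides are undefined. Uncomment to see for yourself:
# f(diag_src, diag_src)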

We use this idea to answer “in spirit” the converse Lawvere problem posed in [4]. These methods also generalize to prove a similar analogue of the ubiquitous converse Lawvere problem from [5]. The original questions, stated in terms of topology, remain open, but I find that the model proposed here, using computability, is equally satisfying from the point of view of studying reflective agents. Those references can be consulted for more motivation on these problems from the perspective of reflective agency.

Section 3 proves the main lemma and the converse Lawvere theorem for reflective oracles. In Section 4, we use that theorem to give a (circular) proof of Brouwer’s fixed point theorem, as mentioned in [4]. In Section 5, we prove the ubiquitous converse Lawvere theorem for reflective oracles.

XOR Blackmail & Causality (Abram Demski): http://agentfoundations.org/item?id=1710
Rationalising humans: another mugging, but not Pascal's (Stuart Armstrong): http://agentfoundations.org/item?id=1708
Comment on Looking for Recommendations RE UDT vs. bounded computation / meta-reasoning / opportunity cost? (Abram Demski): http://agentfoundations.org/item?id=1706
Looking for Recommendations RE UDT vs. bounded computation / meta-reasoning / opportunity cost? (David Krueger): http://agentfoundations.org/item?id=1705
Reward learning summary (Stuart Armstrong): http://agentfoundations.org/item?id=1701

A putative new idea for AI control; index here.

I’ve been posting a lot on value/reward learning recently, and, as usual, the process of posting (and some feedback) means that those posts are partially superseded already - and some of them are overly complex.

So here I’ll try to briefly summarise my current insights, with links to the other posts where appropriate (a link will cover all the points noted since the previous link):

Kolmogorov complexity makes reward learning worse (Stuart Armstrong): http://agentfoundations.org/item?id=1702
Announcing the AI Alignment Prize (Vladimir Slepnev): http://agentfoundations.org/item?id=1700
Comment on Funding opportunity for AI alignment research (Paul Christiano): http://agentfoundations.org/item?id=1699
Our values are underdefined, changeable, and manipulable (Stuart Armstrong): http://agentfoundations.org/item?id=1698
Normative assumptions: regret (Stuart Armstrong): http://agentfoundations.org/item?id=1697
Bias in rationality is much worse than noise (Stuart Armstrong): http://agentfoundations.org/item?id=1696
Learning values, or defining them? (Stuart Armstrong): http://agentfoundations.org/item?id=1695
Mixed-Strategy Ratifiability Implies CDT=EDT (Abram Demski): http://agentfoundations.org/item?id=1690

I provide conditions under which CDT=EDT in Bayes-net causal models.

Comment on Funding opportunity for AI alignment research (Paul Christiano): http://agentfoundations.org/item?id=1694
Comment on Predictable Exploration (Abram Demski): http://agentfoundations.org/item?id=1693
Comment on The Three Levels of Goodhart's Curse (Sören Mindermann): http://agentfoundations.org/item?id=1692
Comment on The Three Levels of Goodhart's Curse (Sören Mindermann): http://agentfoundations.org/item?id=1691
Logical Updatelessness as a Subagent Alignment Problem (Scott Garrabrant): http://agentfoundations.org/item?id=1689

(Cross-posted at Less Wrong)

Comment on Predictable Exploration (Stuart Armstrong): http://agentfoundations.org/item?id=1688
Comment on Predictable Exploration (Abram Demski): http://agentfoundations.org/item?id=1687
Comment on Predictable Exploration (Abram Demski): http://agentfoundations.org/item?id=1685
Comment on Predictable Exploration (Alex Appel): http://agentfoundations.org/item?id=1684
Predictable Exploration (Abram Demski): http://agentfoundations.org/item?id=1683
Comment on Funding opportunity for AI alignment research (David Krueger): http://agentfoundations.org/item?id=1680
Rationality and overriding human preferences: a combined model (Stuart Armstrong): http://agentfoundations.org/item?id=1678

A putative new idea for AI control; index here.

Previously, I presented a model in which a “rationality module” (now renamed rationality planning algorithm, or planner) kept track of two things: how well a human was maximising their actual reward, and whether their preferences had been overridden by AI action.

The second didn’t integrate well with the first, and was tracked by a clunky extra Boolean. Since the two didn’t fit together, I was going to separate the concepts, especially since the Boolean felt a bit too… Boolean, not allowing for grading. But then I realised that they actually fit together completely naturally, without the need for arbitrary Booleans or other tricks.

Comment on Smoking Lesion Steelman III: Revenge of the Tickle Defense (Abram Demski): http://agentfoundations.org/item?id=1677
Comment on Smoking Lesion Steelman III: Revenge of the Tickle Defense (Alex Appel): http://agentfoundations.org/item?id=1676
Humans can be assigned any values whatsoever... (Stuart Armstrong): http://agentfoundations.org/item?id=1675

A putative new idea for AI control; index here. Crossposted at LessWrong 2.0. This post has nothing really new for this message board, but I’m posting it here because of the subsequent posts I’m intending to write.

Humans have no values… nor does any agent. Unless you make strong assumptions about their rationality. And depending on those assumptions, you can get humans to have any values whatsoever.

Comment on Hyperreal Brouwer (Stuart Armstrong): http://agentfoundations.org/item?id=1673
Hyperreal Brouwer (Scott Garrabrant): http://agentfoundations.org/item?id=1671

This post explains how to view Kakutani’s fixed point theorem as a special case of Brouwer’s fixed point theorem with hyperreal numbers. This post is just math intuitions, but I found them useful in thinking about Kakutani’s fixed point theorem and many things in agent foundations. This came out of conversations with Sam Eisenstat.

Comment on Should I post technical ideas here or on LessWrong 2.0? (Scott Garrabrant): http://agentfoundations.org/item?id=1672
Comment on Should I post technical ideas here or on LessWrong 2.0? (Vadim Kosoy): http://agentfoundations.org/item?id=1670
Comment on Should I post technical ideas here or on LessWrong 2.0? (Abram Demski): http://agentfoundations.org/item?id=1669
Smoking Lesion Steelman III: Revenge of the Tickle Defense (Abram Demski): http://agentfoundations.org/item?id=1663

I improve the theory I put forward last time a bit, locate it in the literature, and discuss conditions under which this approach unifies CDT and EDT.

Should I post technical ideas here or on LessWrong 2.0? (Stuart Armstrong): http://agentfoundations.org/item?id=1665
Resolving human inconsistency in a simple model (Stuart Armstrong): http://agentfoundations.org/item?id=1664

A putative new idea for AI control; index here.

This post will present a simple model of an inconsistent human, and ponder how to resolve their inconsistency.

Let \(\mathbf{H}\) be our agent, in a turn-based world. Let \(R^l\) and \(R^s\) be two simple reward functions at each turn. The reward \(R^l\) is thought of as a ‘long-term’ reward, while \(R^s\) is a short-term one.
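
As a concrete, purely illustrative reading of this setup, here is a minimal Python sketch; the InconsistentHuman class, the temptation parameter, and the toy rewards are my own assumptions for the example, not taken from the post. It shows an agent that, turn by turn, sometimes acts on the short-term reward \(R^s\) and sometimes on the long-term reward \(R^l\), which is one simple way the inconsistency can manifest.

import random
from dataclasses import dataclass
from typing import Callable, List

Action = str
State = int

@dataclass
class InconsistentHuman:
    R_l: Callable[[State, Action], float]  # 'long-term' per-turn reward R^l
    R_s: Callable[[State, Action], float]  # 'short-term' per-turn reward R^s
    temptation: float                      # chance of acting on R^s this turn

    def act(self, state: State, actions: List[Action], rng: random.Random) -> Action:
        # With probability `temptation` the human greedily maximises the
        # short-term reward; otherwise they maximise the long-term one.
        reward = self.R_s if rng.random() < self.temptation else self.R_l
        return max(actions, key=lambda a: reward(state, a))

# Toy example: "work" pays off under R^l, "play" under R^s.
H = InconsistentHuman(
    R_l=lambda s, a: 1.0 if a == "work" else 0.0,
    R_s=lambda s, a: 1.0 if a == "play" else 0.0,
    temptation=0.7,
)
rng = random.Random(0)
print([H.act(0, ["work", "play"], rng) for _ in range(6)])

In this toy picture, resolving the inconsistency would amount to deciding what combination of \(R^l\) and \(R^s\) the agent should be treated as actually pursuing.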

Smoking Lesion Steelman II (Abram Demski): http://agentfoundations.org/item?id=1662

After Johannes Treutlein’s comment on Smoking Lesion Steelman, and a number of other considerations, I had almost entirely given up on CDT. However, there were still nagging questions about whether the kind of self-ignorance needed in Smoking Lesion Steelman could arise naturally, how it should be dealt with if so, and what role counterfactuals ought to play in decision theory if CDT-like behavior is incorrect. Today I sat down to collect all the arguments which have been rolling around in my head on this and related issues, and arrived at a place much closer to CDT than I expected.

Comment on Open Problems Regarding Counterfactuals: An Introduction For Beginners (Vadim Kosoy): http://agentfoundations.org/item?id=1658
Comment on Open Problems Regarding Counterfactuals: An Introduction For Beginners (Vadim Kosoy): http://agentfoundations.org/item?id=1657
Comment on Funding opportunity for AI alignment research (Vladimir Slepnev): http://agentfoundations.org/item?id=1654
Comment on Smoking Lesion Steelman (Abram Demski): http://agentfoundations.org/item?id=1652
Comment on Autopoietic systems and difficulty of AGI alignment (Jessica Taylor): http://agentfoundations.org/item?id=1651