Intelligent Agent Foundations Forum
http://agentfoundations.org/

  • Two Types of Updatelessness, by Abram Demski: http://agentfoundations.org/item?id=1765
  • Comment on Stable Pointers to Value II: Environmental Goals, by Vadim Kosoy: http://agentfoundations.org/item?id=1763
  • Stable Pointers to Value II: Environmental Goals, by Abram Demski: http://agentfoundations.org/item?id=1762
  • Comment on Further Progress on a Bayesian Version of Logical Uncertainty, by Alex Appel: http://agentfoundations.org/item?id=1761

Further Progress on a Bayesian Version of Logical Uncertainty, by Alex Appel
http://agentfoundations.org/item?id=1760

I’d like to credit Daniel Demski for helpful discussion.

  • Strategy Nonconvexity Induced by a Choice of Potential Oracles, by Alex Appel: http://agentfoundations.org/item?id=1759
  • Comment on In memoryless Cartesian environments, every UDT policy is a CDT+SIA policy, by 258: http://agentfoundations.org/item?id=1758
  • Comment on Logical counterfactuals and differential privacy, by Nisan Stiennon: http://agentfoundations.org/item?id=1752
  • Comment on An Untrollable Mathematician, by Sam Eisenstat: http://agentfoundations.org/item?id=1751

An Untrollable Mathematician, by Abram Demski
http://agentfoundations.org/item?id=1750

Follow-up to All Mathematicians are Trollable.

It is relatively easy to see that no computable Bayesian prior on logic can converge to a single coherent probability distribution as we update it on logical statements. Furthermore, the non-convergence behavior is about as bad as could be: someone selecting the ordering of provable statements to update on can drive the Bayesian’s beliefs arbitrarily up or down, arbitrarily many times, despite only saying true things. I called this wild non-convergence behavior “trollability”. Previously, I showed that if the Bayesian updates on the provability of a sentence rather than updating on the sentence itself, it is still trollable. I left open the question of whether some other side information could save us. Sam Eisenstat has closed this question, providing a simple logical prior and a way of doing a Bayesian update on it which (1) cannot be trolled, and (2) converges to a coherent distribution.
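To make the trolling phenomenon concrete, here is a toy illustration (entirely my own setup, not Sam’s construction and not the general theorem): a naive Bayesian reasoner keeps a uniform distribution over the remaining truth-value assignments to a handful of propositional atoms, and a troll who knows the true assignment reveals only true two-literal disjunctions, choosing each one to push the probability of a target sentence up or down in alternation. In this finite toy the oscillation must eventually stop; over full first-order logic it can continue forever, which is the trollability result.

    import itertools

    ATOMS = 6
    WORLDS = set(itertools.product([0, 1], repeat=ATOMS))
    TRUE_WORLD = (1, 0, 1, 1, 0, 0)      # the assignment the troll privately knows

    def target(w):
        return w[0] == 1                 # the sentence "atom 0", to be manipulated

    def p_target(beliefs):
        # probability of the target under a uniform distribution on the remaining worlds
        return sum(target(w) for w in beliefs) / len(beliefs)

    def clauses():
        # all two-literal disjunctions, encoded as (atom, value, atom, value)
        for i, j in itertools.combinations(range(ATOMS), 2):
            for si, sj in itertools.product([0, 1], repeat=2):
                yield (i, si, j, sj)

    def holds(clause, w):
        i, si, j, sj = clause
        return w[i] == si or w[j] == sj

    beliefs = set(WORLDS)
    for step in range(6):
        push_up = (step % 2 == 0)
        candidates = []
        for c in clauses():
            if not holds(c, TRUE_WORLD):
                continue                              # the troll only says true things
            updated = {w for w in beliefs if holds(c, w)}
            if updated and updated != beliefs:
                candidates.append((p_target(updated), updated))
        if not candidates:
            break                                     # nothing informative left to say
        _, beliefs = (max if push_up else min)(candidates, key=lambda t: t[0])
        print(f"step {step}: P(target) = {p_target(beliefs):.3f}")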

Logical counterfactuals and differential privacy, by Nisan Stiennon
http://agentfoundations.org/item?id=1749

Edit: This article has major flaws. See my comment below.

This idea was informed by discussions with Abram Demski, Scott Garrabrant, and the MIRIchi discussion group.

  • Comment on The set of Logical Inductors is not Convex, by Vadim Kosoy: http://agentfoundations.org/item?id=1748
  • Comment on The set of Logical Inductors is not Convex, by Abram Demski: http://agentfoundations.org/item?id=1747
  • Comment on Smoking Lesion Steelman II, by Tom Everitt: http://agentfoundations.org/item?id=1746
  • Goodhart Taxonomy, by Scott Garrabrant: http://agentfoundations.org/item?id=1744
  • Comment on Delegative Inverse Reinforcement Learning, by Vadim Kosoy: http://agentfoundations.org/item?id=1743
  • Comment on Delegative Inverse Reinforcement Learning, by Alex Appel: http://agentfoundations.org/item?id=1742
  • Comment on Delegative Inverse Reinforcement Learning, by Alex Appel: http://agentfoundations.org/item?id=1741
  • Comment on Delegative Inverse Reinforcement Learning, by Alex Appel: http://agentfoundations.org/item?id=1740

More precise regret bound for DRL, by Vadim Kosoy
http://agentfoundations.org/item?id=1739

We derive a regret bound for DRL reflecting dependence on:

  • The number of hypotheses

  • The mixing time of the MDP hypotheses

  • The probability with which the advisor takes optimal actions

That is, the regret bound we get is fully explicit up to a multiplicative constant (which can also be made explicit). Currently we focus on plain (as opposed to catastrophe) and uniform (finite number of hypotheses, uniform prior) DRL, although this result can and should be extended to the catastrophe and/or non-uniform settings.

  • Comment on Delegative Inverse Reinforcement Learning, by Alex Appel: http://agentfoundations.org/item?id=1738
  • Value learning subproblem: learning goals of simple agents, by Alex Mennen: http://agentfoundations.org/item?id=1737
  • Comment on Being legible to other agents by committing to using weaker reasoning systems, by Stuart Armstrong: http://agentfoundations.org/item?id=1736
  • Oracle paper, by Stuart Armstrong: http://agentfoundations.org/item?id=1735

Being legible to other agents by committing to using weaker reasoning systems, by Alex Mennen
http://agentfoundations.org/item?id=1734

Suppose that an agent \(A_{1}\) reasons in a sound theory \(T_{1}\), and an agent \(A_{2}\) reasons in a theory \(T_{2}\), such that \(T_{1}\) proves that \(T_{2}\) is sound. Now suppose \(A_{1}\) is trying to reason in a way that is legible to \(A_{2}\), in the sense that \(A_{2}\) can rely on \(A_{1}\) to reach correct conclusions. One way of doing this is for \(A_{1}\) to restrict itself to some weaker theory \(T_{3}\), which \(T_{2}\) proves is sound, for the purposes of any reasoning that it wants to be legible to \(A_{2}\). Of course, in order for this to work, not only would \(A_{1}\) have to restrict itself to using \(T_{3}\), but \(A_{2}\) would also have to trust that \(A_{1}\) had done so. A plausible way for that to happen is for \(A_{1}\) to reach the decision quickly enough that \(A_{2}\) can simulate \(A_{1}\) making the decision to restrict itself to using \(T_{3}\).
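A compact restatement of the setup (this only re-expresses the post’s conditions in symbols; the soundness predicate \(\mathrm{Sound}(\cdot)\) is my shorthand, not notation from the post):

\[
T_{1}\vdash\mathrm{Sound}(T_{2}),\qquad T_{2}\vdash\mathrm{Sound}(T_{3}),\qquad A_{1}\text{ restricts its legible reasoning to }T_{3}.
\]

Then whenever \(A_{1}\) announces a conclusion \(\varphi\) with \(T_{3}\vdash\varphi\), agent \(A_{2}\), reasoning in \(T_{2}\), can conclude that \(\varphi\) is true, provided it can also verify (for instance, by simulating \(A_{1}\)’s quick commitment step) that \(A_{1}\) really did restrict itself to \(T_{3}\).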

  • Comment on Where does ADT Go Wrong?, by Jack Gallagher: http://agentfoundations.org/item?id=1733
  • Comment on Policy Selection Solves Most Problems, by Abram Demski: http://agentfoundations.org/item?id=1732
  • Comment on Policy Selection Solves Most Problems, by Abram Demski: http://agentfoundations.org/item?id=1731
  • Why DRL doesn't work for arbitrary environments, by Vadim Kosoy: http://agentfoundations.org/item?id=1730
  • Comment on Policy Selection Solves Most Problems, by Paul Christiano: http://agentfoundations.org/item?id=1729
  • Comment on Policy Selection Solves Most Problems, by Stuart Armstrong: http://agentfoundations.org/item?id=1728
  • Stable agent, subagent-unstable, by Stuart Armstrong: http://agentfoundations.org/item?id=1725

Policy Selection Solves Most Problems, by Abram Demski
http://agentfoundations.org/item?id=1711

It seems like logically updateless reasoning is what we would want in order to solve many decision-theory problems. I show that several of the problems which seem to require updateless reasoning can instead be solved by selecting a policy with a logical inductor that has been run for only a short time. The policy specifies how to make use of knowledge from a logical inductor which is run longer. This addresses, in a fairly direct manner, the difficulties which seem to block logically updateless decision theory. On the other hand, it doesn’t seem to hold much promise for the kind of insight we would want from a real solution.
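A minimal sketch of how I read the construction (the inductor interface and the function names here are invented for illustration; this is not the post’s actual formalism):

    # Hypothetical interface, assumed for illustration only:
    #   inductor(n)                    -> belief state of a fixed logical inductor at stage n
    #   expected_utility(beliefs, pi)  -> estimated utility of committing to policy pi,
    #                                     according to the given belief state
    def select_policy_then_act(inductor, policies, expected_utility, t=100, T=10**6):
        early = inductor(t)                 # cheap, early beliefs play the "updateless" role
        # choose the policy that looks best to the early-stage beliefs
        policy = max(policies, key=lambda pi: expected_utility(early, pi))
        late = inductor(T)                  # well-informed, late-stage beliefs
        return policy(late)                 # the chosen policy decides how to use them

The point of the split is that the commitment is made from the early epistemic state, while the action itself can still exploit everything the longer-running inductor has learned.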

  • Comment on Catastrophe Mitigation Using DRL, by Vadim Kosoy: http://agentfoundations.org/item?id=1724
  • Catastrophe Mitigation Using DRL (Appendices), by Vadim Kosoy: http://agentfoundations.org/item?id=1723
  • Comment on Hyperreal Brouwer, by Vadim Kosoy: http://agentfoundations.org/item?id=1722
  • Comment on Resolving human inconsistency in a simple model, by Vadim Kosoy: http://agentfoundations.org/item?id=1721
  • Comment on Catastrophe Mitigation Using DRL, by Vadim Kosoy: http://agentfoundations.org/item?id=1719
  • Where does ADT Go Wrong?, by Abram Demski: http://agentfoundations.org/item?id=1717
  • Comment on The Happy Dance Problem, by Wei Dai: http://agentfoundations.org/item?id=1718
  • Comment on Catastrophe Mitigation Using DRL, by Gordon Worley III: http://agentfoundations.org/item?id=1716

Catastrophe Mitigation Using DRL, by Vadim Kosoy
http://agentfoundations.org/item?id=1715

Previously we derived a regret bound for DRL which assumed the advisor is “locally sane.” Such an advisor can only take actions that don’t lose any value in the long term. In particular, if the environment contains a latent catastrophe that manifests at a certain rate (such as the possibility of a UFAI), a locally sane advisor has to take the optimal course of action to mitigate it, since every delay yields a positive probability of the catastrophe manifesting and leading to permanent loss of value. This state of affairs is unsatisfactory, since we would like to have performance guarantees for an AI that can mitigate catastrophes that the human operator cannot mitigate on their own. To address this problem, we introduce a new form of DRL in which, in every hypothetical environment, the set of uncorrupted states is divided into “dangerous” (impending catastrophe) and “safe” (catastrophe was mitigated). The advisor is then only required to be locally sane in safe states, whereas in dangerous states a certain “leaking” of long-term value is allowed. We derive a regret bound in this setting as a function of the time discount factor, the expected catastrophe mitigation time for the optimal policy, and the “value leak” rate (i.e., essentially the rate of catastrophe occurrence). The form of this regret bound implies that in certain asymptotic regimes, the agent attains near-optimal expected utility (and in particular mitigates the catastrophe with probability close to 1), whereas the advisor on its own fails to mitigate the catastrophe with probability close to 1.
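For intuition, here is a small sketch of the state classification this setting assumes (the names are mine and purely illustrative, not the notation of the post):

    from enum import Enum

    class StateClass(Enum):
        SAFE = "catastrophe mitigated"
        DANGEROUS = "catastrophe impending"
        CORRUPTED = "outside the safe/dangerous partition"

    def advisor_requirement(s: StateClass) -> str:
        # the demand the regret bound places on the advisor in each class, per the post
        if s is StateClass.SAFE:
            return "locally sane: never take an action that loses long-term value"
        if s is StateClass.DANGEROUS:
            return "a bounded 'leak' of long-term value is tolerated"
        return "not covered by this classification"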

  • Catastrophe Mitigation Using DRL, by Vadim Kosoy: http://agentfoundations.org/item?id=1714

The Happy Dance Problem, by Abram Demski
http://agentfoundations.org/item?id=1713

Since the invention of logical induction, people have been trying to figure out what logically updateless reasoning could be. This is motivated by the idea that, in the realm of Bayesian uncertainty (i.e., empirical uncertainty), updateless decision theory is the simple solution to the problem of reflective consistency. Naturally, we’d like to import this success to logically uncertain decision theory.

At a research retreat during the summer, we realized that updateless decision theory wasn’t so easy to define even in the seemingly simple Bayesian case. A possible solution was written up in Conditioning on Conditionals. However, that didn’t end up being especially satisfying.

Here, I introduce the happy dance problem, which more clearly illustrates the difficulty in defining updateless reasoning in the Bayesian case. I also outline Scott’s current thoughts about the correct way of reasoning about this problem.

Reflective oracles as a solution to the converse Lawvere problem, by Sam Eisenstat
http://agentfoundations.org/item?id=1712

1 Introduction

Before the work of Turing, one could justifiably be skeptical of the idea of a universal computable function. After all, there is no computable function \(f\colon\mathbb{N}\times\mathbb{N}\to\mathbb{N}\) such that for all computable \(g\colon\mathbb{N}\to\mathbb{N}\) there is some index \(i_{g}\) such that \(f\left(i_{g},n\right)=g\left(n\right)\) for all \(n\). If there were, we could pick \(g\left(n\right)=f\left(n,n\right)+1\), and then \[g\left(i_{g}\right)=f\left(i_{g},i_{g}\right)+1=g\left(i_{g}\right)+1,\] a contradiction. Of course, universal Turing machines don’t run into this obstacle; as Gödel put it, “By a kind of miracle it is not necessary to distinguish orders, and the diagonal procedure does not lead outside the defined notion.” [1]
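The diagonal step can be checked mechanically. Below is a minimal Lean 4 sketch (assuming Mathlib’s tactics are available); note that it drops the computability qualifiers entirely, so it only formalizes the arithmetic of the diagonal trick rather than the precise statement in the text.

    -- No (curried) f : ℕ → ℕ → ℕ can index every g : ℕ → ℕ; the witness is
    -- exactly the diagonal function g(n) = f(n, n) + 1 from the text.
    theorem no_universal_total (f : Nat → Nat → Nat) :
        ¬ ∀ g : Nat → Nat, ∃ i, ∀ n, f i n = g n := by
      intro h
      obtain ⟨i, hi⟩ := h (fun n => f n n + 1)
      have hdiag : f i i = f i i + 1 := hi i
      omega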

The miracle of Turing machines is that there is a partial computable function \(f\colon\mathbb{N}\times\mathbb{N}\to\mathbb{N}\cup\left\{ \bot\right\}\) such that for all partial computable \(g\colon\mathbb{N}\to\mathbb{N}\cup\left\{ \bot\right\}\) there is an index \(i\) such that \(f\left(i,n\right)=g\left(n\right)\) for all \(n\). Here, we look at a different “miracle”, that of reflective oracles [2,3]. As we will see in Theorem 1, given a reflective oracle \(O\), there is a (stochastic) \(O\)-computable function \(f\colon\mathbb{N}\times\mathbb{N}\to\left\{ 0,1\right\}\) such that for any (stochastic) \(O\)-computable function \(g\colon\mathbb{N}\to\left\{ 0,1\right\}\), there is some index \(i\) such that \(f\left(i,n\right)\) and \(g\left(n\right)\) have the same distribution for all \(n\). This existence theorem seems to skirt even closer to the contradiction mentioned above.

We use this idea to answer “in spirit” the converse Lawvere problem posed in [4]. These methods also generalize to prove a similar analogue of the ubiquitous converse Lawvere problem from [5]. The original questions, stated in terms of topology, remain open, but I find that the model proposed here, using computability, is equally satisfying from the point of view of studying reflective agents. Those references can be consulted for more motivation on these problems from the perspective of reflective agency.

Section 3 proves the main lemma and the converse Lawvere theorem for reflective oracles. In Section 4, we use this to give a (circular) proof of Brouwer’s fixed point theorem, as mentioned in [4]. In Section 5, we prove the ubiquitous converse Lawvere theorem for reflective oracles.

  • XOR Blackmail & Causality, by Abram Demski: http://agentfoundations.org/item?id=1710
  • Rationalising humans: another mugging, but not Pascal's, by Stuart Armstrong: http://agentfoundations.org/item?id=1708
  • Looking for Recommendations RE UDT vs. bounded computation / meta-reasoning / opportunity cost?, by David Krueger: http://agentfoundations.org/item?id=1705

Reward learning summary, by Stuart Armstrong
http://agentfoundations.org/item?id=1701

A putative new idea for AI control; index here.

I’ve been posting a lot on value/reward learning recently, and, as usual, the process of posting (and some feedback) means that those posts are already partially superseded, and some of them are overly complex.

So here I’ll try and briefly summarise my current insights, with links to the other posts if appropriate (a link will cover all the points noted since the previous link):

Kolmogorov complexity makes reward learning worse, by Stuart Armstrong
http://agentfoundations.org/item?id=1702