Being legible to other agents by committing to using weaker reasoning systems
post by Alex Mennen 8 days ago | Vladimir Slepnev likes this | discuss

Suppose that an agent $$A_{1}$$ reasons in a sound theory $$T_{1}$$, and an agent $$A_{2}$$ reasons in a theory $$T_{2}$$, such that $$T_{1}$$ proves that $$T_{2}$$ is sound. Now suppose $$A_{1}$$ is trying to reason in a way that is legible to $$A_{2}$$, in the sense that $$A_{2}$$ can rely on $$A_{1}$$ to reach correct conclusions. One way of doing this is for $$A_{1}$$ to restrict itself to some weaker theory $$T_{3}$$, which $$T_{2}$$ proves is sound, for the purposes of any reasoning that it wants to be legible to $$A_{2}$$. Of course, in order for this to work, not only would $$A_{1}$$ have to restrict itself to using $$T_{3}$$, but $$A_{2}$$ would have to trust that $$A_{1}$$ had done so. A plausible way for that to happen is for $$A_{1}$$ to reach the decision quickly enough that $$A_{2}$$ can simulate $$A_{1}$$ making the decision to restrict itself to using $$T_{3}$$.
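The chain of trust described above can be written out as a short derivation. This is only a sketch: the provability predicate $$\mathrm{Prov}_{T}$$, the reading of "sound" as a reflection schema, and the assumption that $$T_{2}$$ is strong enough to verify concrete $$T_{3}$$-proofs are notation and assumptions introduced here, not taken from the post.

```latex
% Assumed notation: Prov_T is a standard provability predicate for T,
% and "T_2 proves T_3 is sound" is read as the reflection schema (2).
\begin{align*}
  &\text{(1)}\quad T_{3} \vdash \varphi
    && \text{$A_{1}$ derives $\varphi$ using only $T_{3}$} \\
  &\text{(2)}\quad T_{2} \vdash \mathrm{Prov}_{T_{3}}(\ulcorner\varphi\urcorner) \rightarrow \varphi
    && \text{$T_{2}$ proves $T_{3}$ is sound} \\
  &\text{(3)}\quad T_{2} \vdash \mathrm{Prov}_{T_{3}}(\ulcorner\varphi\urcorner)
    && \text{$A_{2}$ checks the proof, or simulates $A_{1}$} \\
  &\text{(4)}\quad T_{2} \vdash \varphi
    && \text{from (2) and (3) by modus ponens}
\end{align*}
```

Step (3) is where the legibility requirement enters: $$A_{2}$$ needs to establish that the conclusion really came from $$T_{3}$$, which is why it matters that $$A_{1}$$'s decision to restrict itself can be simulated.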

 Why DRL doesn't work for arbitrary environments discussion post by Vadim Kosoy 10 days ago | discuss
 Stable agent, subagent-unstable discussion post by Stuart Armstrong 12 days ago | discuss
Policy Selection Solves Most Problems
post by Abram Demski 12 days ago | Vladimir Slepnev likes this | 4 comments

It seems like logically updateless reasoning is what we would want in order to solve many decision-theory problems. I show that several of the problems which seem to require updateless reasoning can instead be solved by selecting a policy with a logical inductor that’s run a small amount of time. The policy specifies how to make use of knowledge from a logical inductor which is run longer. This addresses the difficulties which seem to block logically updateless decision theory in a fairly direct manner. On the other hand, it doesn’t seem to hold much promise for the kind of insights which we would want from a real solution.

 Catastrophe Mitigation Using DRL (Appendices) discussion post by Vadim Kosoy 19 days ago | discuss
 Where does ADT Go Wrong? discussion post by Abram Demski 23 days ago | Jack Gallagher and Jessica Taylor like this | 1 comment
Catastrophe Mitigation Using DRL

Previously we derived a regret bound for DRL which assumed the advisor is “locally sane.” Such an advisor can only take actions that don’t lose any value in the long term. In particular, if the environment contains a latent catastrophe that manifests with a certain rate (such as the possibility of an UFAI), a locally sane advisor has to take the optimal course of action to mitigate it, since every delay yields a positive probability of the catastrophe manifesting and leading to permanent loss of value. This state of affairs is unsatisfactory, since we would like to have performance guarantees for an AI that can mitigate catastrophes that the human operator cannot mitigate on their own. To address this problem, we introduce a new form of DRL where in every hypothetical environment the set of uncorrupted states is divided into “dangerous” (impending catastrophe) and “safe” (catastrophe was mitigated). The advisor is then only required to be locally sane in safe states, whereas in dangerous states certain “leaking” of long-term value is allowed. We derive a regret bound in this setting as a function of the time discount factor, the expected value of catastrophe mitigation time for the optimal policy, and the “value leak” rate (i.e. essentially the rate of catastrophe occurrence). The form of this regret bound implies that in certain asymptotic regimes, the agent attains near-optimal expected utility (and in particular mitigates the catastrophe with probability close to 1), whereas the advisor on its own fails to mitigate the catastrophe with probability close to 1.

The Happy Dance Problem
post by Abram Demski 24 days ago | Scott Garrabrant and Stuart Armstrong like this | 1 comment

Since the invention of logical induction, people have been trying to figure out what logically updateless reasoning could be. This is motivated by the idea that, in the realm of Bayesian uncertainty (IE, empirical uncertainty), updateless decision theory is the simple solution to the problem of reflective consistency. Naturally, we’d like to import this success to logically uncertain decision theory.

At a research retreat during the summer, we realized that updateless decision theory wasn’t so easy to define even in the seemingly simple Bayesian case. A possible solution was written up in Conditioning on Conditionals. However, that didn’t end up being especially satisfying.

Here, I introduce the happy dance problem, which more clearly illustrates the difficulty in defining updateless reasoning in the Bayesian case. I also outline Scott’s current thoughts about the correct way of reasoning about this problem.

Reflective oracles as a solution to the converse Lawvere problem
post by Sam Eisenstat 25 days ago | Alex Mennen, Alex Appel, Vadim Kosoy, Abram Demski, Jessica Taylor, Scott Garrabrant and Vladimir Slepnev like this | discuss

1 Introduction

Before the work of Turing, one could justifiably be skeptical of the idea of a universal computable function. After all, there is no computable function $$f\colon\mathbb{N}\times\mathbb{N}\to\mathbb{N}$$ such that for all computable $$g\colon\mathbb{N}\to\mathbb{N}$$ there is some index $$i_{g}$$ such that $$f\left(i_{g},n\right)=g\left(n\right)$$ for all $$n$$. If there were, we could pick $$g\left(n\right)=f\left(n,n\right)+1$$, and then $$g\left(i_{g}\right)=f\left(i_{g},i_{g}\right)+1=g\left(i_{g}\right)+1$$, a contradiction. Of course, universal Turing machines don’t run into this obstacle; as Gödel put it, “By a kind of miracle it is not necessary to distinguish orders, and the diagonal procedure does not lead outside the defined notion.” [1]
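The diagonal construction can be seen concretely on a finite table of total functions. The sketch below is an illustration only: the names `make_universal`, `diagonal`, and `table` are hypothetical, and a finite list stands in for the (impossible) enumeration of all total computable functions.

```python
from typing import Callable, List

def make_universal(table: List[Callable[[int], int]]) -> Callable[[int, int], int]:
    """Candidate 'universal' function: f(i, n) = table[i](n)."""
    def f(i: int, n: int) -> int:
        return table[i](n)
    return f

def diagonal(f: Callable[[int, int], int]) -> Callable[[int], int]:
    """Diagonal function g(n) = f(n, n) + 1."""
    def g(n: int) -> int:
        return f(n, n) + 1
    return g

# A small, arbitrary table of total functions (an assumption for illustration).
table = [
    lambda n: 0,
    lambda n: n,
    lambda n: n * n,
    lambda n: 2 * n + 1,
]

f = make_universal(table)
g = diagonal(f)

# g differs from row i at input i, so g cannot equal any row of the table.
for i in range(len(table)):
    print(i, f(i, i), g(i))
```

Whatever table is chosen, `g` disagrees with row `i` at input `i`, so `g` is never one of the rows. For partial functions the same construction is harmless, since `f(i, i)` may simply fail to halt.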

The miracle of Turing machines is that there is a partial computable function $$f\colon\mathbb{N}\times\mathbb{N}\to\mathbb{N}\cup\left\{ \bot\right\}$$ such that for all partial computable $$g\colon\mathbb{N}\to\mathbb{N}\cup\left\{ \bot\right\}$$ there is an index $$i$$ such that $$f\left(i,n\right)=g\left(n\right)$$ for all $$n$$. Here, we look at a different “miracle”, that of reflective oracles [2,3]. As we will see in Theorem 1, given a reflective oracle $$O$$, there is a (stochastic) $$O$$-computable function $$f\colon\mathbb{N}\times\mathbb{N}\to\left\{ 0,1\right\}$$ such that for any (stochastic) $$O$$-computable function $$g\colon\mathbb{N}\to\left\{ 0,1\right\}$$, there is some index $$i$$ such that $$f\left(i,n\right)$$ and $$g\left(n\right)$$ have the same distribution for all $$n$$. This existence theorem seems to skirt even closer to the contradiction mentioned above.
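A short calculation suggests why the stochastic setting leaves room for such an $$f$$. It is an illustration under assumptions made here (the particular diagonal function $$g(n)=1-f(n,n)$$ and the symbol $$p$$), not the argument of the post.

```latex
% Assumed for illustration: g(n) = 1 - f(n, n), and i is an index with
% f(i, n) distributed as g(n) for all n. Write p = Pr[f(i, i) = 1].
\begin{align*}
  \Pr[g(i) = 1] &= \Pr[f(i, i) = 0] = 1 - p, \\
  \Pr[f(i, i) = 1] &= \Pr[g(i) = 1]
    \;\Longrightarrow\; p = 1 - p
    \;\Longrightarrow\; p = \tfrac{1}{2}.
\end{align*}
```

In the deterministic case the analogous equation has no solution, whereas here it is satisfied by a fair coin flip, so this diagonal argument does not produce a contradiction.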

We use this idea to answer “in spirit” the converse Lawvere problem posed in [4]. These methods also generalize to prove a similar analogue of the ubiquitous converse Lawvere problem from [5]. The original questions, stated in terms of topology, remain open, but I find that the model proposed here, using computability, is equally satisfying from the point of view of studying reflective agents. Those references can be consulted for more motivation on these problems from the perspective of reflective agency.

Section 3 proves the main lemma, and proves the converse Lawvere theorem for reflective oracles. In section 4, we use that to give a (circular) proof of Brouwer’s fixed point theorem, as mentioned in [4]. In section 5, we prove the ubiquitous converse Lawvere theorem for reflective oracles.

 XOR Blackmail & Causality discussion post by Abram Demski 26 days ago | discuss
 Rationalising humans: another mugging, but not Pascal's discussion post by Stuart Armstrong 26 days ago | discuss
 Looking for Recommendations RE UDT vs. bounded computation / meta-reasoning / opportunity cost? discussion post by David Krueger 32 days ago | 1 comment
Reward learning summary
post by Stuart Armstrong 34 days ago | discuss

A putative new idea for AI control; index here.

I’ve been posting a lot on value/reward learning recently, and, as usual, the process of posting (and some feedback) means that those posts are partially superseded already - and some of them are overly complex.

So here I’ll try and briefly summarise my current insights, with links to the other posts if appropriate (a link will cover all the points noted since the previous link):

 Kolmogorov complexity makes reward learning worse discussion post by Stuart Armstrong 34 days ago | discuss
 Our values are underdefined, changeable, and manipulable discussion post by Stuart Armstrong 38 days ago | discuss
 Normative assumptions: regret discussion post by Stuart Armstrong 40 days ago | discuss
 Bias in rationality is much worse than noise discussion post by Stuart Armstrong 40 days ago | discuss
 Learning values, or defining them? discussion post by Stuart Armstrong 40 days ago | discuss
Mixed-Strategy Ratifiability Implies CDT=EDT
post by Abram Demski 41 days ago | discuss

I provide conditions under which CDT=EDT in Bayes-net causal models.

Logical Updatelessness as a Robust Delegation Problem
post by Scott Garrabrant 44 days ago | discuss

(Cross-posted on Less Wrong)

 Predictable Exploration discussion post by Abram Demski 47 days ago | 5 comments
Comparing LICDT and LIEDT
post by Abram Demski 50 days ago | Alex Appel likes this | discuss

Attempted versions of CDT and EDT can be constructed using logical inductors, called LICDT and LIEDT. It is shown, however, that LICDT fails XOR Blackmail, and LIEDT fails Newcomb. One interpretation of this is that LICDT and LIEDT do not implement CDT and EDT very well. I argue that they are indeed forms of CDT and EDT, but stray from expectations because they also implement the ratifiability condition I discussed previously. Continuing the line of thinking from that post, I discuss conditions in which LICDT=LIEDT, and try to draw out broader implications for decision theory.

Rationality and overriding human preferences: a combined model
post by Stuart Armstrong 51 days ago | discuss

A putative new idea for AI control; index here.

Previously, I presented a model in which a “rationality module” (now renamed rationality planning algorithm, or planner) kept track of two things: how well a human was maximising their actual reward, and whether their preferences had been overridden by AI action.

The second didn’t integrate well into the first, and was tracked by a clunky extra Boolean. Since the two didn’t fit together, I was going to separate the two concepts, especially since the Boolean felt a bit too… Boolean, not allowing for grading. But then I realised that they actually fit together completely naturally, without the need for arbitrary Booleans or other tricks.

Humans can be assigned any values whatsoever...
post by Stuart Armstrong 58 days ago | Ryan Carey likes this | discuss

A putative new idea for AI control; index here. Crossposted at LessWrong 2.0. This post has nothing really new for this message board, but I’m posting it here because of the subsequent posts I’m intending to write.

Humans have no values… nor does any agent, unless you make strong assumptions about their rationality. And depending on those assumptions, you can get humans to have any values.

Hyperreal Brouwer
post by Scott Garrabrant 66 days ago | Vadim Kosoy and Stuart Armstrong like this | 2 comments

This post explains how to view Kakutani’s fixed point theorem as a special case of Brouwer’s fixed point theorem with hyperreal numbers. This post is just math intuitions, but I found them useful in thinking about Kakutani’s fixed point theorem and many things in agent foundations. This came out of conversations with Sam Eisenstat.
