Intelligent Agent Foundations Forum
1. An Untrollable Mathematician
post by Abram Demski 425 days ago | Alex Appel, Sam Eisenstat, Vadim Kosoy, Jack Gallagher, Jessica Taylor, Paul Christiano, Scott Garrabrant and Vladimir Slepnev like this | 1 comment

Follow-up to All Mathematicians are Trollable.

It is relatively easy to see that no computable Bayesian prior on logic can converge to a single coherent probability distribution as we update it on logical statements. Furthermore, the non-convergence behavior is about as bad as it could be: someone selecting the ordering of provable statements to update on can drive the Bayesian’s beliefs arbitrarily up or down, arbitrarily many times, despite only saying true things. I called this wild non-convergence behavior “trollability”. Previously, I showed that if the Bayesian updates on the provability of a sentence rather than updating on the sentence itself, it is still trollable. I left open the question of whether some other side information could save us. Sam Eisenstat has closed this question, providing a simple logical prior and a way of doing a Bayesian update on it which (1) cannot be trolled, and (2) converges to a coherent distribution.
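
To see the trolling mechanism concretely, here is a minimal propositional toy model (my illustration, not Eisenstat’s construction): a Bayesian with a uniform prior over truth assignments is fed only true statements, yet the order in which they arrive swings its credence in a target sentence \(G\) up and down. In the true world every \(A_i\) is false and every \(B_j\) is true, so every implication used below is true regardless of \(G\).

```python
# Hedged toy sketch: an adversarial ordering of TRUE updates swings P(G).
from itertools import product

N = 5  # five "up" atoms A_i and five "down" atoms B_j

# A world assigns truth values to (A_1..A_N, B_1..B_N, G).
worlds = list(product([False, True], repeat=2 * N + 1))
weight = {w: 1.0 for w in worlds}  # uniform prior

def prob_G():
    total = sum(weight.values())
    return sum(p for w, p in weight.items() if w[-1]) / total

def condition(stmt):
    """Bayesian update: zero out worlds where the (true) statement fails."""
    for w in worlds:
        if not stmt(w):
            weight[w] = 0.0

print("prior:          P(G) = %.3f" % prob_G())
for i in range(N):                                   # A_i -> G, i.e. ~A_i v G
    condition(lambda w, i=i: (not w[i]) or w[-1])
print("after A_i -> G: P(G) = %.3f" % prob_G())      # driven up (~0.97)
for j in range(N):                                   # G -> B_j, i.e. ~G v B_j
    condition(lambda w, j=j: (not w[-1]) or w[N + j])
print("after G -> B_j: P(G) = %.3f" % prob_G())      # driven back down (~0.50)
```

With fresh atoms the troll can repeat this cycle indefinitely, which is the non-convergence described above; Sam’s prior is built so that no ordering of true updates has this effect.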

continue reading »
2. Autopoietic systems and difficulty of AGI alignment
post by Jessica Taylor 584 days ago | Ryan Carey, Owen Cotton-Barratt and Paul Christiano like this | 13 comments

I have recently come to the opinion that AGI alignment is probably extremely hard. But it’s not clear exactly what AGI or AGI alignment are. And some forms of alignment of “AI” systems are easy. Here I operationalize “AGI” and “AGI alignment” in some different ways and evaluate their difficulties.

continue reading »
3. Density Zero Exploration
post by Alex Mennen 584 days ago | Abram Demski, Paul Christiano and Scott Garrabrant like this | discuss

The idea here is due to Scott Garrabrant. All I did was write it.

continue reading »
4. Logical Induction with incomputable sequences
post by Alex Mennen 584 days ago | Abram Demski, Paul Christiano and Scott Garrabrant like this | discuss

In the definition of a logical inductor, the deductive process is required to be computable. This, of course, does not allow the logical inductor to use randomness or to predict uncomputable sequences. Given the way traders were defined in the logical induction paper, this restriction was necessary, because traders were not given access to the output of the deductive process.
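
A hedged sketch of the fix this suggests (type signatures only, simplified, in my notation rather than the paper’s formalism): hand traders the deductive process’s output as an extra input, so each trader remains computable relative to \(D\) even when \(D\) itself is not.

```python
# Illustrative type sketch (names are mine, not the paper's notation).
from typing import Callable, Dict, FrozenSet

Sentence = str
Prices = Dict[Sentence, float]        # the market's prices on day n
DeductiveState = FrozenSet[Sentence]  # sentences output by the deductive process D so far
Trade = Dict[Sentence, float]         # shares bought per sentence (negative = sold)

# Original setting: a trader sees only the day and the market prices,
# so the deductive process D must be computable.
ClassicTrader = Callable[[int, Prices], Trade]

# Modified setting: the trader also receives D's output as an input/oracle,
# so it stays computable relative to D even if D is incomputable.
OracleTrader = Callable[[int, Prices, DeductiveState], Trade]

def example_trader(day: int, prices: Prices, deduced: DeductiveState) -> Trade:
    """Buy sentences D has already output but the market still underprices."""
    return {s: 1.0 for s in deduced if prices.get(s, 0.0) < 0.99}
```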

continue reading »
5. The Three Levels of Goodhart's Curse
post by Scott Garrabrant 585 days ago | Vadim Kosoy, Abram Demski and Paul Christiano like this | 2 comments

Note: I now consider this post deprecated and instead recommend this updated version.

Goodhart’s curse is a term coined by Eliezer Yudkowsky for the claim that “neutrally optimizing a proxy measure U of V seeks out upward divergence of U from V.” It is related to many nearby concepts (e.g. the tails come apart, winner’s curse, optimizer’s curse, regression to the mean, overfitting, edge instantiation, Goodhart’s law). I claim that there are three main mechanisms through which Goodhart’s curse operates.
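
The quoted divergence can be reproduced in a few lines (a hedged illustration of the underlying statistical mechanism, with made-up Gaussian assumptions, not an analysis from the post): the harder we optimize a noisy proxy \(U = V + \text{noise}\), the more the selected point’s proxy value overshoots its true value.

```python
# Minimal simulation: optimizing U = V + noise selects points where the
# noise is large and positive, so U diverges upward from V.
import random
random.seed(0)

def average_overshoot(n_options, trials=2000):
    """Mean of U - V at the option with the highest proxy value U."""
    total = 0.0
    for _ in range(trials):
        vs = [random.gauss(0, 1) for _ in range(n_options)]  # true values V
        us = [v + random.gauss(0, 1) for v in vs]            # proxy U = V + noise
        best = max(range(n_options), key=lambda i: us[i])    # optimize the proxy
        total += us[best] - vs[best]
    return total / trials

for n in (1, 10, 100, 1000):
    print(f"{n:5d} options: average U - V at selected point = {average_overshoot(n):+.2f}")
# The overshoot grows with n: more optimization pressure, more upward divergence.
```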

continue reading »
6. Current thoughts on Paul Christiano's research agenda
post by Jessica Taylor 616 days ago | Ryan Carey, Owen Cotton-Barratt, Sam Eisenstat, Paul Christiano, Stuart Armstrong and Wei Dai like this | 15 comments

This post summarizes my thoughts on Paul Christiano’s agenda in general and ALBA in particular.

continue reading »
7. Smoking Lesion Steelman
post by Abram Demski 630 days ago | Tom Everitt, Sam Eisenstat, Vadim Kosoy, Paul Christiano and Scott Garrabrant like this | 10 comments

It seems plausible to me that every example I’ve seen so far which appears to require causal/counterfactual reasoning is more properly solved by taking the right updateless perspective, and taking the action or policy which achieves maximum expected utility from that perspective. If this were the right view, then the aim would be to construct something like updateless EDT.

I give a variant of the smoking lesion problem which overcomes an objection to the classic smoking lesion, and which is solved correctly by CDT, but which is not solved by updateless EDT.
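
For readers who want the baseline in front of them, here is the classic smoking lesion (not the steelmanned variant of this post, and with illustrative numbers I chose): a lesion causes both the taste for smoking and cancer, smoking itself is harmless, and EDT and CDT come apart.

```python
# Classic smoking lesion, illustrative probabilities (not from the post).
P_L = 0.5                           # prior probability of the lesion L
P_SMOKE = {True: 0.9, False: 0.1}   # P(smoke | L), P(smoke | not-L)
P_CANCER = {True: 0.8, False: 0.1}  # P(cancer | L), P(cancer | not-L)
U_SMOKE, U_CANCER = 10, -100        # utilities

def edt_value(smoke: bool) -> float:
    """EDT conditions on the action as evidence about the lesion."""
    joint = {
        L: (P_L if L else 1 - P_L) * (P_SMOKE[L] if smoke else 1 - P_SMOKE[L])
        for L in (True, False)
    }
    p_L = joint[True] / (joint[True] + joint[False])   # Bayes: P(L | action)
    p_cancer = p_L * P_CANCER[True] + (1 - p_L) * P_CANCER[False]
    return (U_SMOKE if smoke else 0) + U_CANCER * p_cancer

def cdt_value(smoke: bool) -> float:
    """CDT intervenes: the action cannot change the lesion, so P(L) is fixed."""
    p_cancer = P_L * P_CANCER[True] + (1 - P_L) * P_CANCER[False]
    return (U_SMOKE if smoke else 0) + U_CANCER * p_cancer

print("EDT:", "smoke" if edt_value(True) > edt_value(False) else "abstain")  # abstain
print("CDT:", "smoke" if cdt_value(True) > cdt_value(False) else "abstain")  # smoke
```

Here CDT smokes while EDT abstains, “managing the news” about the lesion; the variant in the post is constructed so that CDT still answers correctly while even updateless EDT does not.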

continue reading »
8. Where's the first benign agent?
link by Jacob Kopczynski 708 days ago | Patrick LaVictoire and Paul Christiano like this | 15 comments
9. On motivations for MIRI's highly reliable agent design research
post by Jessica Taylor 789 days ago | Ryan Carey, Sam Eisenstat, Daniel Dewey, Nate Soares, Patrick LaVictoire, Paul Christiano, Tsvi Benson-Tilsen and Vladimir Nesov like this | 10 comments

(this post came out of a conversation between me and Owen Cotton-Barratt, plus a follow-up conversation with Nate)

continue reading »
10. Towards learning incomplete models using inner prediction markets
discussion post by Vadim Kosoy 805 days ago | Jessica Taylor and Paul Christiano like this | 4 comments
11. Pursuing convergent instrumental subgoals on the user's behalf doesn't always require good priors
discussion post by Jessica Taylor 815 days ago | Daniel Dewey, Paul Christiano and Stuart Armstrong like this | 9 comments
12. Predicting HCH using expert advice
post by Jessica Taylor 846 days ago | Ryan Carey, Patrick LaVictoire and Paul Christiano like this | 1 comment

Summary: in approximating a scheme like HCH, we would like some notion of “the best prediction we can get given available AI capabilities”. There is a natural notion of “the best prediction of a human we should expect to get”. In general this doesn’t yield good predictions of HCH, but it does yield an HCH-like computation model that seems useful.
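
For context, “expert advice” here refers to the standard online-learning tool; below is a minimal sketch of the exponentially weighted average forecaster (the textbook algorithm, not the post’s HCH-specific construction).

```python
# Standard exponentially weighted average forecaster (illustrative sketch).
import math

def exp_weights(experts, outcomes, eta=0.5):
    """experts: list of prediction sequences in [0, 1]; outcomes: true values.
    Returns the learner's predictions, competitive with the best expert."""
    w = [1.0] * len(experts)
    preds = []
    for t, y in enumerate(outcomes):
        total = sum(w)
        p = sum(wi * e[t] for wi, e in zip(w, experts)) / total  # weighted vote
        preds.append(p)
        for i, e in enumerate(experts):
            w[i] *= math.exp(-eta * (e[t] - y) ** 2)  # penalize squared loss
    return preds

# Two constant "experts"; weight shifts toward the reliable one, so the
# learner's predictions climb from 0.5 toward 1.
print(exp_weights([[1, 1, 1, 1], [0, 0, 0, 0]], [1, 1, 1, 1]))
```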

continue reading »
13. postCDT: Decision Theory using post-selected Bayes nets
post by Scott Garrabrant 868 days ago | Ryan Carey, Patrick LaVictoire and Paul Christiano like this | 1 comment

The purpose of this post is to document a minor idea about a new type of decision theory that works using a Bayes net. This is not a concrete proposal, since I will give no insight on which Bayes net to use. I am not that excited by this proposal, but think it is worth writing up anyway.

continue reading »
14. Attacking the grain of truth problem using Bayes-Savage agents
discussion post by Vadim Kosoy 885 days ago | Paul Christiano likes this | discuss
15. Asymptotic Decision Theory
link by Jack Gallagher 890 days ago | Abram Demski, Jessica Taylor, Patrick LaVictoire, Paul Christiano and Tsvi Benson-Tilsen like this | 2 comments
16. Online Learning 1: Bias-detecting online learners
post by Ryan Carey 898 days ago | Vadim Kosoy, Jessica Taylor, Nate Soares and Paul Christiano like this | 7 comments

Note: This describes an idea of Jessica Taylor’s, and is the first of several posts about aspects of online learning.

continue reading »
17. What does it mean for correct operation to rely on transfer learning?
post by Jessica Taylor 1116 days ago | Daniel Dewey, Patrick LaVictoire, Paul Christiano and Stuart Armstrong like this | discuss

Summary: Some approaches to AI value alignment rely on transfer learning. I attempt to explain this idea more clearly.

continue reading »
18. A possible training procedure for human-imitators
discussion post by Jessica Taylor 1132 days ago | Patrick LaVictoire and Paul Christiano like this | 4 comments
19. Some work on connecting UDT and Reinforcement Learning
link by David Krueger 1193 days ago | Patrick LaVictoire and Paul Christiano like this | 5 comments
20. Existence of distributions that are expectation-reflective and know it
post by Tsvi Benson-Tilsen 1201 days ago | Kaya Stechly, Abram Demski, Jessica Taylor, Nate Soares and Paul Christiano like this | discuss

We prove the existence of a probability distribution over a theory \({T}\) with the property that, for certain definable quantities \({\varphi}\), the value \({E}[{\ulcorner {\varphi}\urcorner}]\) assigned by a definable expectation function \({E}\) is accurate, i.e. it equals the actual expectation of \({\varphi}\); and with the property that it assigns probability 1 to \({E}\) behaving this way. This may be useful for self-verification, by allowing an agent to satisfy a reflective consistency property and at the same time believe that it and similar agents satisfy that property. Thanks to Sam Eisenstat for listening to an earlier version of this proof and pointing out a significant gap in the argument. The proof presented here has not been vetted yet.
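
Schematically, and eliding the definability conditions that the post spells out, the two properties are (notation adapted from the summary above, with \(\mathbb{P}\) the distribution over completions of \({T}\) and \(\mathbb{E}_{\mathbb{P}}\) the true expectation it induces):

\[
E[\ulcorner \varphi \urcorner] \;=\; \mathbb{E}_{\mathbb{P}}[\varphi]
\qquad\text{and}\qquad
\mathbb{P}\Bigl(\ulcorner E[\ulcorner \varphi \urcorner] = \mathbb{E}[\varphi] \urcorner\Bigr) \;=\; 1.
\]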

continue reading »
21. A limit-computable, self-reflective distribution
post by Tsvi Benson-Tilsen 1225 days ago | Sam Eisenstat, Vadim Kosoy, Abram Demski, Jessica Taylor, Nate Soares, Patrick LaVictoire, Paul Christiano and Scott Garrabrant like this | 1 comment

We present a \(\Delta_2\)-definable probability distribution \({\Psi}\) that satisfies Christiano’s reflection schema for its own defining formula. The strategy is analogous to the chicken step employed by modal decision theory to obfuscate itself from the eyes of \({\mathsf{PA}}\); we will prevent the base theory \({T}\) from knowing much about \({\Psi}\), so that \({\Psi}\) can be coherent over \({T}\) and also consistently believe in reflection statements. So, the method used here is technical and not fundamental, but it does at least show that limit-computable and reflective distributions exist. These results are due to Sam Eisenstat and me, and this post benefited greatly from extensive notes from Sam; any remaining errors are probably mine.

Prerequisites: we assume familiarity with Christiano’s original result and the methods used there. In particular, we will freely use Kakutani’s fixed point theorem. See Christiano et al.’s paper.
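
For reference, the reflection schema in question (stated schematically here, in the notation of this post) says that \({\Psi}\) is almost sure of its own probabilities to any rational precision:

\[
\forall \varphi \;\; \forall a, b \in \mathbb{Q}:\quad
a < {\Psi}(\ulcorner \varphi \urcorner) < b
\;\Longrightarrow\;
{\Psi}\bigl(\ulcorner a < {\Psi}(\ulcorner \varphi \urcorner) < b \urcorner\bigr) = 1.
\]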

continue reading »
22. Multibit reflective oracles
discussion post by Benja Fallenstein 1521 days ago | Jessica Taylor, Nate Soares and Paul Christiano like this | discuss
23. Improving the modal UDT optimality result
discussion post by Benja Fallenstein 1582 days ago | Patrick LaVictoire and Paul Christiano like this | 2 comments
24. Exploiting EDT
post by Benja Fallenstein 1596 days ago | Ryan Carey, Abram Demski, Daniel Dewey, Nate Soares, Patrick LaVictoire and Paul Christiano like this | 9 comments

The problem with EDT is, as David Lewis put it, its “irrational policy of managing the news” (Lewis, 1981): it chooses actions not only because of their effects on the world, but also because of what the fact that it’s taking these actions tells it about events the agent can’t affect at all. The canonical example is the smoking lesion problem.

I’ve long been uncomfortable with the smoking lesion problem as the case against EDT, because an AI system would know its own utility function, and would therefore know whether or not it values “smoking” (presumably in the AI case it would be a different goal), and if it updates on this fact it would behave correctly in the smoking lesion. (This is an AI-centric version of the “tickle defense” of EDT.) Nate and I have come up with a variant I find much more convincing: a way to get EDT agents to pay you for managing the news for them, which works by the same mechanism that makes these agents one-box in Newcomb’s problem. (It’s a variation of the thought experiment in my LessWrong post on “the sin of updating when you can change whether you exist”.)
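
The one-boxing mechanism referenced here is easy to exhibit numerically (standard Newcomb payoffs, illustrative predictor accuracy; this is not the new thought experiment, which is in the full post):

```python
# Newcomb's problem: a predictor with accuracy p fills the opaque box with
# $1M iff it predicted one-boxing; the transparent box always holds $1K.
p = 0.99                       # assumed predictor accuracy
M, K = 1_000_000, 1_000

# EDT treats the action as evidence about the (correlated) prediction:
edt_one_box = p * M            # box is probably full
edt_two_box = (1 - p) * M + K  # box is probably empty
print("EDT one-boxes:", edt_one_box > edt_two_box)   # True

# CDT holds the already-fixed box contents constant; for any P(full) = q,
# two-boxing causally dominates by exactly $1K:
q = p
cdt_one_box = q * M
cdt_two_box = q * M + K
print("CDT two-boxes:", cdt_two_box > cdt_one_box)   # True
```

The thought experiment turns exactly this evidential correlation into a way to charge an EDT agent for favorable-looking news.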

continue reading »
