1. Value Learning for Irrational Toy Models discussion post by Patrick LaVictoire 797 days ago | discuss
 2. HCH as a measure of manipulation discussion post by Patrick LaVictoire 862 days ago | 6 comments
 3. Censoring out-of-domain representations discussion post by Patrick LaVictoire 900 days ago | Jessica Taylor and Stuart Armstrong like this | 3 comments
4. Vector-Valued Reinforcement Learning
post by Patrick LaVictoire 993 days ago | Ryan Carey and Jessica Taylor like this | 1 comment

In order to study algorithms that can modify their own reward functions, we can define vector-valued versions of reinforcement learning concepts.

Imagine that there are several different goods that we could care about; then a utility function is represented by a preference vector $$\vec\theta$$. Furthermore, if it is possible for the agent (or the environment, or other agents) to modify $$\vec \theta$$, then we will want to index the preference vector by timestep.

Consider an agent that can take actions, some of which affect its own reward function. Such an agent would (and should) wirehead if it attempted to maximize the discounted rewards as calculated by its future selves; i.e., at timestep $$n$$ it would choose actions to maximize \begin{eqnarray} U_n = \sum_{k\geq n} \gamma_k \vec{x}_k\cdot\vec{\theta}_k\end{eqnarray}

where $$\vec x_k$$ is the vector of goods gained at time $$k$$, $$\vec \theta_k$$ is the preference vector at timestep $$k$$, and $$\gamma_k$$ is the time discount factor at time $$k$$. (We will often use the case of an exponential discount $$\gamma^k$$ for $$0<\gamma<1$$.)

However, we might instead maximize the value of tomorrow’s actions in light of today’s reward function, \begin{eqnarray} V_n = \sum_{k\geq n} \gamma_k\vec{x}_k\cdot\vec{\theta}_{n} \end{eqnarray}

(the only difference being $$\vec \theta_n$$ rather than $$\vec \theta_k$$). Genuinely maximizing this should lead to more stable goals. Concretely, consider environments that offer “bribes” to self-modify: a learner maximizing $$U_n$$ would generally accept such bribes, while a learner maximizing $$V_n$$ would be cautious about doing so.
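
To make the difference concrete, here is a minimal numeric sketch. The goods, bribe size, and discount factor are invented for illustration: the agent starts out caring only about good A, and the environment offers a stream of good B in exchange for switching the agent's preference vector.

```python
import numpy as np

gamma = 0.9  # exponential discount factor

# Preference vectors over two goods (A, B).
theta_0 = np.array([1.0, 0.0])       # current self: cares only about good A
theta_bribed = np.array([0.0, 1.0])  # post-modification: cares only about good B

# Option 1: decline the bribe. Receive 1 unit of A per step; theta is unchanged.
decline_goods = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]
decline_thetas = [theta_0, theta_0]

# Option 2: accept the bribe. Receive 5 units of B per step, but the
# preference vector is switched to theta_bribed.
accept_goods = [np.array([0.0, 5.0]), np.array([0.0, 5.0])]
accept_thetas = [theta_bribed, theta_bribed]

def U(goods, thetas):
    """Discounted reward as judged by each future self (theta_k varies)."""
    return sum(gamma**k * x @ th for k, (x, th) in enumerate(zip(goods, thetas)))

def V(goods, theta_now):
    """Discounted reward as judged by the current self (theta_n fixed)."""
    return sum(gamma**k * x @ theta_now for k, x in enumerate(goods))

print(U(accept_goods, accept_thetas), U(decline_goods, decline_thetas))  # 9.5 > 1.9
print(V(accept_goods, theta_0), V(decline_goods, theta_0))               # 0.0 < 1.9
```

The $$U_n$$-maximizer accepts, since its future selves will enjoy the bribed goods; the $$V_n$$-maximizer declines, since good B is worthless under $$\vec\theta_n$$.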

So what do we see when we adapt existing RL algorithms to such problems? A distinction emerges between Q-learning and SARSA: Q-learning foolishly accepts bribes that SARSA passes on, and this seems to be the flip side of the concept of interruptibility!

 5. Cooperative Inverse Reinforcement Learning vs. Irrational Human Preferences discussion post by Patrick LaVictoire 1128 days ago | Jessica Taylor and Stuart Armstrong like this | discuss
6. Proof Length and Logical Counterfactuals Revisited
post by Patrick LaVictoire 1415 days ago | Sam Eisenstat, Jessica Taylor and Scott Garrabrant like this | 5 comments

Update: This version of the Trolljecture fails too; see the counterexample due to Sam.

In An Informal Conjecture on Proof Length and Logical Counterfactuals, Scott discussed a “trolljecture” from a MIRI workshop, which attempted to justify (some) logical counterfactuals based on the lengths of proofs of various implications. Then Sam produced a counterexample, and Benja pointed to another counterexample.

But at the most recent MIRI workshop, I talked with Jonathan Lee and Holger Dell about a related way of evaluating logical counterfactuals, and we came away with a revived trolljecture!

7. A simple model of the Löbstacle
post by Patrick LaVictoire 1501 days ago | Abram Demski and Jessica Taylor like this | discuss

The idea of the Löbstacle is that basic trust in yourself and your successors is necessary but tricky: necessary, because naively modeling your successor’s decisions cannot rule out them making a bad decision, unless they are in some sense less intelligent than you; tricky, because the strongest patches of this problem lead to inconsistency, and weaker patches can lead to indefinite procrastination (because you always trust your successors to do the thing you are now putting off). (For a less handwavy explanation, see the technical agenda document on Vingean reflection.)

It is difficult to specify the circumstances under which this kind of self-trust succeeds or fails. Here is one simple example in which it can succeed, but for rather fragile reasons.
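
For reference, the self-trust difficulties above trace back to Löb’s Theorem, which can be stated compactly in provability logic (with $$\Box$$ denoting provability in the system under discussion): \begin{eqnarray} \Box(\Box P \rightarrow P) \rightarrow \Box P \end{eqnarray}

That is, if a system proves that a proof of $$P$$ would make $$P$$ true, then it already proves $$P$$. So a system that asserts the soundness schema $$\Box P \rightarrow P$$ for every sentence $$P$$ thereby proves every sentence, including false ones; this is why the strongest self-trust patches collapse into inconsistency.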

 8. Agent Simulates Predictor using Second-Level Oracles discussion post by Patrick LaVictoire 1506 days ago | Jessica Taylor and Nate Soares like this | discuss
9. Agents that can predict their Newcomb predictor
post by Patrick LaVictoire 1537 days ago | Jessica Taylor likes this | 4 comments

There’s a certain type of problem where it appears that having more computing power hurts you: the “agent simulates predictor” version of Newcomb’s Problem.

This version poses the same sort of challenge to UDT that comes up in some multi-agent/game-theoretic scenarios.

Suppose:

• The predictor does not run a detailed simulation of the agent, but relies instead on a high-level understanding of the agent’s decision theory and computational power.
• The agent runs UDT, and has the ability to fully simulate the predictor.

Since the agent can deduce (by low-level simulation) what the predictor will do, the agent does not regard the prediction outcome as contingent on the agent’s computation. Instead, either predict-onebox or predict-twobox has a probability of 1 (since one or the other of those is deducible), and a probability of 1 remains the same regardless of what we condition on. The agent will then calculate greater utility for two-boxing than for one-boxing.

Meanwhile, the predictor, knowing that the agent runs UDT and will fully simulate the predictor, can reason as in the preceding paragraph, and thus deduce that the agent will two-box. So the large box is left empty and the agent two-boxes (and the agent’s detailed simulation of the predictor correctly shows the predictor predicting two-boxing).

The agent would be better off, though, running a different decision theory that does not two-box here, and that the predictor can deduce does not two-box.
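
As a toy illustration (the payoffs and function names are mine, not from the post), the self-confirming fixed point described above can be sketched in a few lines: the predictor reasons abstractly about the agent’s decision procedure, while the agent settles the prediction by simulation and then treats it as a constant.

```python
PAYOFFS = {
    # (prediction, action) -> payoff in dollars
    ("one-box", "one-box"): 1_000_000,
    ("one-box", "two-box"): 1_001_000,
    ("two-box", "one-box"): 0,
    ("two-box", "two-box"): 1_000,
}

def predictor():
    # High-level reasoning: an agent that fully simulates the predictor will
    # treat the prediction as a fixed fact and therefore two-box.
    return "two-box"

def agent():
    prediction = predictor()  # full low-level simulation of the predictor
    # The prediction is now a settled fact, not contingent on the agent's own
    # computation, so the agent conditions on it; two-boxing then dominates
    # by $1000 either way.
    return max(("one-box", "two-box"),
               key=lambda a: PAYOFFS[(prediction, a)])

print(agent())  # "two-box": the prediction is correct, and the agent gets only $1000
```

A decision theory that the predictor could deduce would one-box here would walk away with $1,000,000 instead.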

EDITED 5/19/15: There’s a formal model of this due to Vladimir Slepnev in which the agent and the predictor have different kinds of predictive power, such that in some sense each knows how the other will act in this universe. We’ll write this out, along with another case where things work out properly.

(One algorithm has more computing power, but the other has stronger axioms: in particular, strong enough to prove that the other formal system is sound, as ZFC proves that PA is sound.)

In one of the following cases, proof-based UDT one-boxes for correct reasons; in the other case, it two-boxes analogously to the reasoning above.

10. Modal Bargaining Agents
post by Patrick LaVictoire 1557 days ago | Benja Fallenstein, Jessica Taylor and Nate Soares like this | 18 comments

Summary: Bargaining problems are interesting in the case of Löbian cooperation; Eliezer suggested a geometric algorithm for resolving bargaining conflicts by leaving the Pareto frontier, and this algorithm can be made into a modal agent, given an additional suggestion by Benja.

 11. A toy model of a corrigibility problem link by Patrick LaVictoire 1582 days ago | Benja Fallenstein, Daniel Dewey, Jessica Taylor and Nate Soares like this | discuss
12. Forum Digest: Updateless Decision Theory
post by Patrick LaVictoire 1585 days ago | Abram Demski, Benja Fallenstein, Jessica Taylor, Luke Muehlhauser and Nate Soares like this | discuss

Summary: This is a quick expository recap, with links, of the posts on this forum on the topic of updateless decision theory, through 3/19/15. Read this if you want to learn more about UDT, or if you’re curious about what we’ve been working on lately!

13. Welcome, new contributors!
post by Patrick LaVictoire 1585 days ago | Benja Fallenstein, Jessica Taylor, Luke Muehlhauser and Nate Soares like this | 1 comment

Today is the day; we’re opening up this forum to allow contributions from more people! See our How to Contribute page for the details.

14. Meta: the goals of this forum
post by Patrick LaVictoire 1594 days ago | Benja Fallenstein, Jessica Taylor and Luke Muehlhauser like this | 1 comment

Summary: We’re planning to publicize and open up the forum very soon, and so it’s a good time to discuss what we would like this forum to achieve, how we plan for moderation to work, and what discussions are on-topic.

Currently, this forum is read-only for everyone except for a few veterans of the mailing list it replaces. In a few days, we’re planning to open up posting (in a tiered way, similar in spirit to the tiered privileges of MathOverflow), and the comments and Likes of the full members will play a material role in moderating the community. So it’s a good time for those of us who are already here to discuss our goals for the forum, so that we stand a better chance of coordinating.

15. Proposal: Modeling goal stability in machine learning
post by Patrick LaVictoire 1601 days ago | Nate Soares likes this | 2 comments

Summary: We might learn some interesting things if we can construct a model of goal stability, wireheading, and corrigibility in a present-day machine learning algorithm. I outline one way that we could potentially do this, and ask if you’d like to help!

 16. Obstacle to modal optimality when you're being modalized discussion post by Patrick LaVictoire 1641 days ago | Benja Fallenstein, Jessica Taylor and Nate Soares like this | discuss
17. An Introduction to Löb's Theorem in MIRI Research
post by Patrick LaVictoire 1641 days ago | Luke Muehlhauser and Nate Soares like this | discuss

At a recent MIRIx workshop, I gave an introductory talk about the surprising number of times that MIRI applied Löb’s Theorem in their research papers. It was well-received, so I wrote up and expanded my notes into a primer for new researchers. Any comments appreciated!
