by Daniel Dewey 791 days ago | Ryan Carey likes this | on: AI safety: three human problems and one AI issue

Thanks for writing this – I think it’s a helpful kind of reflection for people to do!
by Wei Dai 816 days ago

> (e.g. almost no naturally occurring single pages of text are value-corrupting in an hour)

I don’t see what “naturally occurring” here could mean (even informally) that would both make this statement true and make it useful to try to design a system that could safely process “naturally occurring single pages of text”. And how would a system like this know whether a given input is “naturally occurring” and hence safe to process? Please explain?
by Daniel Dewey 816 days ago

“Naturally occurring” means “could be inputs to this AI system from the rest of the world”; naturally occurring inputs don’t need to be recognized, they’re here as a base case for the induction. Does that make sense?

If there are other really powerful reasoners in the world, then they could produce value-corrupting single pages of text (and I would then worry about Soms becoming corrupted). If there aren’t, I’d guess that possible input single pages of text aren’t value-corrupting in an hour. (I would certainly want a much better answer than “I guess it’s fine” if we were really running something like this.)

To clarify my intent here, I wanted to show a possible structure of an argument that could make us confident that value drift wasn’t going to kill us. If you think it’s really unlikely that any argument of this inductive form could be run, I’d be interested in that (or if Paul or someone else thought I was on the wrong track / making some kind of fundamental mistake).
by Wei Dai 816 days ago | Patrick LaVictoire and Vladimir Nesov like this

Yes, that clarifies what you were intending to say. Paul typically assumes a need to compete with other AIs with comparable resources, so I wasn’t expecting “naturally occurring” to mean coming from an environment with no other powerful reasoners.

I think if we’re in an actual non-competitive scenario (and hence can tolerate large inefficiencies in our AI design), then some sort of argument like this can possibly be made to work, but it will probably be much trickier than you seem to suggest. Here are some problems I can see, aside from the ones you’ve already acknowledged:

1. We probably don’t need “powerful reasoners” to produce value-corrupting single pages of text, just a model of Som and relatively crude optimization techniques. In other words, for this system to be safe you probably need more of a tech lead/monopoly than you had in mind. Exactly how much is needed seems hard to know, so how do you achieve high confidence of safety?

2. Presumably you’re building this system to be at least as capable as a human but at less risk of value drift. In order to do work comparable to a human contemplating the input query over a few days or years, the system needs some way of transmitting information between the short-lived Soms. Given that, what is preventing the overall system from undergoing value drift because, for example, one of the Soms has an (incorrect and value-corrupting) philosophical epiphany and transmits it to other Soms through this communications channel? Nothing in your argument seems to adequately deal with this. If this happens by pure chance then you can try to wash it out by averaging over different “noise seeds”, but what if Som has a tendency towards certain incorrect lines of argument, either naturally or because you placed him in a weird environment like this one?

3. The output of this system isn’t “naturally occurring”, so subsequent inputs to it won’t be either. If we’re to use this system a second time in a way that preserves your “non-corrupting invariant”, we have to either prevent all information flow from the first output to the second input, or have another argument for why this system, along with whatever part of human civilization the information flows through, preserves “non-corrupting” as a whole. Otherwise, someone could for example submit a bunch of queries to the system that would help them eventually craft a value-corrupting single page of text, and disguise these queries as innocent queries.
by Daniel Dewey 816 days ago

These objections are all reasonable, and 3 is especially interesting to me – it seems like the biggest objection to the structure of the argument I gave. Thanks.

I’m afraid that the point I was trying to make didn’t come across, or that I’m not understanding how your response bears on it. Basically, I thought the post was prematurely assuming that schemes like Paul’s are not amenable to any kind of argument for confidence, and we will only ever be able to say “well, I ran out of ideas for how to break it”, so I wanted to sketch an argument structure to explain why I thought we might be able to make positive arguments for safety.

Do you think it’s unlikely that we’ll be able to make positive arguments for the safety of schemes like Paul’s? If so, I’d be really interested in why – apologies if you’ve already tried to explain this and I just haven’t figured that out.
by Wei Dai 815 days ago

> I’m afraid that the point I was trying to make didn’t come across, or that I’m not understanding how your response bears on it.

I guess my point was that any argument for confidence will likely be subject to the kinds of problems I listed, and I don’t see a realistic plan on Paul’s (or anyone else’s) part to deal with them.

> Do you think it’s unlikely that we’ll be able to make positive arguments for the safety of schemes like Paul’s?

It seems to me that in order to be able to make rigorous arguments about systems that are potentially subject to value drift, we have to understand metaphilosophy at a deep level. Without that, I don’t see how we can reason about a system that can encounter philosophical arguments, and make strong conclusions about whether it’s able to process them correctly. This seems intuitively obvious to me, but I don’t totally rule out that there is some sort of counterintuitive approach that could somehow work out.
by Daniel Dewey 815 days ago

Ah, gotcha. I’ll think about those points – I don’t have a good response. (Actually adding “think about” + (link to this discussion) to my todo list.)

> It seems to me that in order to be able to make rigorous arguments about systems that are potentially subject to value drift, we have to understand metaphilosophy at a deep level.

Do you have a current best guess at an architecture that will be most amenable to us applying metaphilosophical insights to avoid value drift?
by Wei Dai 814 days ago | Daniel Dewey likes this

> Do you have a current best guess at an architecture that will be most amenable to us applying metaphilosophical insights to avoid value drift?

Interesting question. I guess it depends on the form that the metaphilosophical knowledge arrives in, but it’s currently hard to see what that could be. I can only think of two possibilities, but neither seems highly plausible.

1. It comes as a set of instructions that humans (or emulations/models of humans) can use to safely and correctly process philosophical arguments, along with justifications for why those instructions are safe/correct. Kind of like a detailed design for meta-execution, along with theory/evidence for why it works. But natural language is fuzzy and imprecise, and humans are full of unknown security holes, so it’s hard to see how such instructions could possibly make us safe/correct, or what kind of information could possibly make us confident of that.

2. It comes as a set of algorithms for reasoning about philosophical problems in a formal language, along with instructions/algorithms for how to translate natural-language philosophical problems/arguments into this formal language, and justifications for why these are all safe/correct. But this kind of result seems very far from any of our current knowledge bases, nor does it seem compatible with any of the current trends in AI design (including things like deep learning, decision-theory-based ideas, and Paul’s kinds of designs).

So I’m not very optimistic that a metaphilosophical approach will succeed either. If it ultimately does, it seems like maybe there will have to be some future insights whose form I can’t foresee. (Edit: Either that, or a lot of time free from arms-race pressure to develop the necessary knowledge base and compatible AI design for 2.)
by Jacob Kopczynski 815 days ago

Your point 2 is an excellent summary of my reasons for being skeptical of relying on human reasoning. (I also expect more outlandish means of transmissible value-corruption would show up, on the principles that edge cases are hard to predict and that we don’t really understand our minds.)
by Daniel Dewey 818 days ago | on: Where's the first benign agent?

I also commented there last week and am awaiting moderation. Maybe we should post our replies here soon?
by Daniel Dewey 827 days ago | on: ALBA: can you be "aligned" at increased "capacity"...

FWIW, this also reminded me of some discussion in Paul’s post on capability amplification, where Paul asks whether we can even define good behavior in some parts of capability-space, e.g.:

> The next step would be to ask: can we sensibly define “good behavior” for policies in the inaccessible part H? I suspect this will help focus our attention on the most philosophically fraught aspects of value alignment.

I’m not sure if that’s relevant to your point, but it seemed like you might be interested.
by Daniel Dewey 827 days ago | on: ALBA: can you be "aligned" at increased "capacity"...

I’m not sure you’ve gotten quite ALBA right here, and I think that causes a problem for your objection. Relevant writeups: most recent and original ALBA.

As I understand it, ALBA proposes the following process:

1. H trains A to choose actions that would get the best immediate feedback from H. A is benign (assuming that H could give not-catastrophic immediate feedback for all actions and that the learning process is robust). H defines the feedback, and so A doesn’t make decisions that are more effective at anything than H is; A is just faster.

2. A (and possibly H) is used to define a slow process A+ that makes “better” decisions than A or H would. (“Better” is in quotes because we don’t have a definition of better; the best anyone knows how to do right now is look at the amplification process and say “yep, that should do better.”) Maybe H uses A as an assistant, maybe a copy of A breaks down a decision problem into parts and hands them off to other copies of A, maybe A makes decisions that guide a much larger cognitive process.

3. The whole loop starts over with A+ used as H.

The claim is that step 2 produces a system that is able to give “better” feedback than the human could – feedback that considers more consequences more accurately in more complex decision situations, that has spent more effort introspecting, etc. This should make it able to handle circumstances further and further outside human-ordinary, eventually scaling up to extraordinary circumstances. So, while you say that the best case to hope for is $$r_i\rightarrow r$$, it seems like ALBA is claiming to do more.

A second objection is that while you call each $$r_i$$ a “reward function”, each system is only trained to take actions that maximize the very next reward it gets (not the sum of future rewards). This means that each system is only effective at anything insofar as the feedback function it’s maximizing at each step considers the long-term consequences of each action. So, if $$r_i\rightarrow r$$, we don’t have reason to think that the system will be competent at anything outside of the “normal circumstances + a few exceptions” you describe – all of its planning power comes from $$r_i$$, so we should expect it to be basically incompetent where $$r_i$$ is incompetent.
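The loop described above can be sketched as toy code. Everything here is a stand-in for illustration only, not Paul’s actual construction: an “overseer” (H) is just a scoring function over a tiny action set, “training” produces an agent (A) that picks H’s top-scoring action, and “amplification” builds a new scorer (A+) that also consults the agent.

```python
# Toy schematic of the ALBA-style loop in steps 1-3 above.
ACTIONS = ["a", "b", "c"]

def train(overseer):
    # Step 1: A learns to take the action the overseer rates highest,
    # maximizing only the very next reward (no sum over future rewards).
    cached = {act: overseer(act) for act in ACTIONS}
    return lambda: max(ACTIONS, key=lambda act: cached[act])

def amplify(agent, overseer):
    # Step 2: define a slower process A+ intended to give "better"
    # feedback than the overseer alone; here it just combines the
    # overseer's score with a consultation of A.
    def amplified(action):
        return overseer(action) + (1.0 if action == agent() else 0.0)
    return amplified

def alba(initial_overseer, rounds=3):
    overseer = initial_overseer
    for _ in range(rounds):
        agent = train(overseer)              # step 1
        overseer = amplify(agent, overseer)  # step 2
        # step 3: repeat, with A+ playing the role of H
    return overseer

final = alba(lambda act: {"a": 0.1, "b": 0.5, "c": 0.2}[act])
print(max(ACTIONS, key=final))  # prints "b"
```

The toy `amplify` obviously doesn’t make decisions “better” in any interesting sense; it only shows the shape of the iteration, where each round’s amplified process becomes the next round’s source of feedback.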
by Stuart Armstrong 827 days ago

> that is able to give “better” feedback than the human could – feedback that considers more consequences more accurately in more complex decision situations, that has spent more effort introspecting

This is roughly how I would run ALBA in practice, and why I said it was better in practice than in theory. I’d be working with considerations I mentioned in this post and try to formalise how to extend utilities/rewards to new settings.
by Daniel Dewey 826 days ago

If I read Paul’s post correctly, ALBA is supposed to do this in theory – I don’t understand the theory/practice distinction you’re making.
by Stuart Armstrong 824 days ago

I disagree. I’m arguing that the concept of “aligned at a certain capacity” makes little sense, and this is key to ALBA in theory.
by Daniel Dewey 891 days ago | David Krueger and Patrick LaVictoire like this | on: Minimizing Empowerment for Safety

Discussed briefly in Concrete Problems, FYI: https://arxiv.org/pdf/1606.06565.pdf
by Daniel Dewey 895 days ago | on: Learning Impact in RL

This is a neat idea! I’d be interested to hear why you don’t think it’s satisfying from a safety point of view, if you have thoughts on that.
by Owen Cotton-Barratt 895 days ago | Jessica Taylor and Patrick LaVictoire like this

Seems to me like there are a bunch of challenges. For example, you need extra structure on your space to add things or tell what’s small; and you really want to keep track of long-term impact, not just impact at the next time-step. The long-term one in particular seems thorny (for low impact in general, not just for this). Nevertheless, I think this idea looks promising enough to explore further; I would also like to hear David’s reasons.
by David Krueger 891 days ago

Yes, as Owen points out, there are general problems with reduced impact that apply to this idea, i.e. measuring long-term impacts.
by Daniel Dewey 938 days ago | on: My current take on the Paul-MIRI disagreement on a...

Thanks for writing this, Jessica – I expect to find it helpful when I read it more carefully!
by Daniel Dewey 1232 days ago | on: World-models containing self-models

Thanks Jessica. This was helpful, and I think I see more what the problem is.

Re point 1: I see what you mean. The intuition behind my post is that it seems like it should be possible to make a bounded system that can eventually come to hold any computable hypothesis given enough evidence, including a hypothesis including a model of itself of arbitrary precision (which is different from Solomonoff, which can clearly never think about systems like itself). It’s clearly not possible for the system to hold and update infinitely many hypotheses the way Solomonoff does, and a system would need some kind of logical uncertainty or other magic to evaluate complex or self-referential hypotheses, but it seems like these hypotheses should be “in its class”. Does this make sense, or do you think there is a mistake there?

Re point 2: I’m not confident that’s an accurate summary; I’m precisely proposing that the agent learn a model of the world containing a model of the agent (approximate or precise). I agree that evaluating this kind of model will require logical uncertainty or similar magic, since it will be expensive and possibly self-referential.

Re point 3: I see what you mean, though for self-modeling the agent being predicted should only be as smart as the agent doing the prediction. It seems like approximation and logical uncertainty are the main ingredients needed here. Are there particular parts of the unbounded problem that are not solved by reflective oracles?
by Jessica Taylor 1232 days ago | Patrick LaVictoire likes this

Re point 1: Suppose the agent considers all hypotheses of length up to $$l$$ bits that run in up to $$t$$ time. Then the agent takes $$2^l t$$ time to run. For an individual hypothesis to reason about the agent, it must use $$t$$ computation time to reason about a computation of size $$2^l t$$. A theoretical understanding of how this works would solve a large part of the logical uncertainty / naturalized induction / Vingean reflection problem. Maybe it’s possible for this to work without having a theoretical understanding of why it works, but the theoretical understanding is useful too (it seems like you agree with this). I think there are some indications that naive solutions won’t automatically work; see e.g. this post.

Re point 2: It seems like this is learning a model from the state and action to state, and a model from state to state that ignores the agent. But it isn’t learning a model that e.g. reasons about the agent’s source code to predict the next state. An integrated model should be able to do reasoning like this.

Re point 3: I think you still have a Vingean reflection problem if a hypothesis that runs in $$t$$ time predicts a computation of size $$2^l t$$. Reflective Solomonoff induction solves a problem with an unrealistic computation model, and doesn’t translate to a solution with a finite (but large) amount of computing resources. The main part not solved is the general issue of predicting aspects of large computations using a small amount of computing power.
by Daniel Dewey 1231 days ago

Thanks. I agree that these are problems. It seems to me that the root of these problems is logical uncertainty / Vingean reflection (which seem like two sides of the same coin); I find myself less confused when I think about self-modeling as being basically an application of “figuring out how to think about big / self-like hypotheses”. Is that how you think of it, or are there aspects of the problem that you think are missed by this framing?
by Jessica Taylor 1231 days ago | Daniel Dewey likes this

Yes, this is also how I think about it. I don’t know anything specific that doesn’t fit into this framing.
by Daniel Dewey 1236 days ago | on: Notes from a conversation on act-based and goal-di...

> Additionally, the fact that the predictor uses consequentialist reasoning indicates that you probably need to understand consequentialist reasoning to build the predictor in the first place.

I’ve had this conversation with Nate before, and I don’t understand why I should think it’s true. Presumably we think we will eventually be able to make predictors that predict a wide variety of systems without us understanding every interesting subset ahead of time, right? Why are consequentialists different?
by Daniel Dewey 1232 days ago

Thanks, Paul – I missed this response earlier, and I think you’ve pointed out some of the major disagreements here. I agree that there’s something somewhat consequentialist going on during all kinds of complex computation. I’m skeptical that we need better decision theory to do this reliably – are there reasons or intuition-pumps you know of that have a bearing on this?
by Paul Christiano 1232 days ago | Daniel Dewey likes this

I mentioned two (which I don’t find persuasive):

1. Different decision theories / priors / etc. are reflectively consistent, so you may want to make sure to choose the right ones the first time. (I think that the act-based approach basically avoids this.)

2. We have encountered some surprising possible failure modes, like blackmail by distant superintelligences, and might be concerned that we will run into new surprises if we don’t understand consequentialism well.

I guess there is one more:

3. If we want to understand what our agents are doing, we need to have a pretty good understanding of how effective decision-making ought to work. Otherwise algorithms whose consequentialism we understand will tend to be beaten out by algorithms whose consequentialism we don’t understand. This may make alignment way harder.
by Jessica Taylor 1235 days ago

Here’s the argument as I understand it (paraphrasing Nate). If we have a system predict a human making plans, then we need some story for why it can do this effectively. One story is that, like Solomonoff induction, it’s learning a physical model of the human and simulating the human this way. However, in practice, this is unlikely to be the reason an actual prediction engine predicts humans well (it’s too computationally difficult). So we need some other argument for why the predictor might work.

Here’s one argument: perhaps it’s looking at a human making plans, figuring out what humans are planning towards, and using its own planning capabilities (towards the same goal) to predict what plans the human will make. But it seems like, to be confident that this will work, you need to have some understanding of how the predictor’s planning capabilities work. In particular, humans trying to study correct planning run into some theoretical problems (including e.g. decision theory and logical uncertainty), and it seems like a system would need to answer some of these same questions in order to predict humans well.

I’m not sure what to think of this argument. Paul’s current proposal contains reinforcement learning agents who plan towards an objective defined by a more powerful agent, so it is leaning on the reinforcement learner’s ability to plan towards desirable goals. Rather than understand how the reinforcement learner works internally, Paul proposes giving the reinforcement learner a good enough objective (defined by the powerful agent) such that optimizing this objective is equivalent to optimizing what the humans want. This raises some problems, so probably some additional ingredient is necessary. I suspect I’ll have better opinions on this after thinking about the informed oversight problem some more.

I also asked Nate about the analogy between computer vision and learning to predict a human making plans. It seems like computer vision is an easier problem for a few reasons: it doesn’t require serial thought (so it can be done by e.g. a neural network with a fixed number of layers), humans solve the problem using something similar to neural networks anyway, and planning towards the wrong goal is much more dangerous than recognizing objects incorrectly.
by Daniel Dewey 1235 days ago

Thanks, Jessica. This argument still doesn’t seem right to me – let me try to explain why.

It seems to me like something more tractable than Solomonoff induction, like an approximate cognitive-level model of a human or the other kinds of models that are being produced now (or will be produced in the future) in machine learning (neural nets, NTMs, etc.), could be used to approximately predict the actions of humans making plans. This is how I expect most kinds of modeling and inference to work, about humans and about other systems of interest in the world, and it seems like most of my behaviors are approximately predictable using a model of me that falls far short of modeling my full brain.

This makes me think that an AI won’t need to have hand-made planning faculties to learn to predict planners (human or otherwise), any more than it’ll need weather faculties to predict weather or physics faculties to predict physical systems. Does that make sense? (I think the analogy to computer vision points toward the learnability of planning; humans use neural nets to plan, after all!)
by Daniel Dewey 1234 days ago

I agree with paragraphs 1, 2, and 3. To recap, the question we’re discussing is “do you need to understand consequentialist reasoning to build a predictor that can predict consequentialist reasoners?”

A couple of notes on paragraph 4:

- I’m not claiming that neural nets or NTMs are sufficient, just that they represent the kind of thing I expect to increasingly succeed at modeling human decisions (and many other things of interest): model classes that are efficiently learnable, and that don’t include built-in planning faculties.
- You are bringing up understandability of an NTM-based human-decision-predictor. I think that’s a fine thing to talk about, but it’s different from the question we were talking about.
- You’re also bringing up the danger of consequentialist hypotheses hijacking the overall system. This is fine to talk about as well, but it is also different from the question we were talking about.

In paragraph 5, you seem to be proposing that to make any competent predictor, we’ll need to understand planning. This is a broader assertion, and the argument in favor of it is different from the original argument (“predicting planners requires planning faculties so that you can emulate the planner” vs “predicting anything requires some amount of prioritization and decision-making”). In these cases, I’m more skeptical that a deep theoretical understanding of decision-making is important, but I’m open to talking about it – it just seems different from the original question.

Overall, I feel like this response is out-of-scope for the current question – does that make sense, or do I seem off-base?
by Jessica Taylor 1234 days ago

Regarding paragraph 4: I see more now what you’re saying about NTMs. In some sense NTMs don’t have “built-in” planning capabilities; to the extent that they plan well, it’s because they learned that transition functions that make plans work better to predict some things. I think it’s likely that you can get planning capabilities in this manner, without actually understanding how the planning works internally. So it seems like there isn’t actually disagreement on this point (sorry for misinterpreting the question). The more controversial point is that you need to understand planning to train safe predictors of humans making plans.

I don’t think I was bringing up consequentialist hypotheses hijacking the system in this paragraph. I was noting the danger of having a system (which is in some sense just trying to predict humans well) output a plan it thinks a human would produce after thinking a very long time, given that it is good at predicting plans toward an objective but bad at predicting the humans’ objective.

Regarding paragraph 5: I was trying to say that you probably only need primitive planning abilities for a lot of prediction tasks, in some cases ones we already understand today. For example, you might use a deep neural net for deciding which weather simulations are worth running, and reinforce the deep neural net on the extent to which running the weather simulation changed the system’s accuracy. This is probably sufficient for a lot of applications.
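The weather-simulation example above is essentially a bandit problem, and the "primitive planning" it needs can be sketched with a standard epsilon-greedy selector. Everything concrete here is invented for illustration (the simulation names and their average accuracy gains are made-up numbers, and a real system would use a learned model rather than a lookup table):

```python
# Toy epsilon-greedy selector: learn which candidate simulation tends to
# improve prediction accuracy the most, reinforced only by observed
# accuracy changes.
import random

random.seed(0)

# Hypothetical average accuracy improvement from running each simulation.
TRUE_GAIN = {"coarse": 0.01, "regional": 0.05, "fine": 0.03}

estimates = {name: 0.0 for name in TRUE_GAIN}
counts = {name: 0 for name in TRUE_GAIN}

def pick(eps=0.1):
    # Mostly exploit the current best estimate; occasionally explore.
    if random.random() < eps:
        return random.choice(list(TRUE_GAIN))
    return max(estimates, key=lambda name: estimates[name])

for _ in range(2000):
    sim = pick()
    reward = TRUE_GAIN[sim] + random.gauss(0, 0.02)  # noisy accuracy change
    counts[sim] += 1
    estimates[sim] += (reward - estimates[sim]) / counts[sim]  # running mean

print(max(estimates, key=lambda name: estimates[name]))  # best-estimated simulation
```

This kind of selector has no model of why a simulation helps; it only tracks which choices paid off, which is the sense in which the planning involved is "primitive".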
by Daniel Dewey 1233 days ago

Thanks Jessica – sorry I misunderstood about hijacking. A couple of questions:

Is there a difference between “safe” and “accurate” predictors? I’m now thinking that you’re worried about NTMs basically making inaccurate predictions, and that accurate predictors of planning will require us to understand planning.

My feeling is that today’s understanding of planning – if I run this computation, I will get the result, and if I run it again, I’ll get the same one – is sufficient for harder prediction tasks. Are there particular aspects of planning that we don’t yet understand well that you expect to be important for planning computation during prediction?
by Paul Christiano 1233 days ago

(The discussion seems to apply without modification to any predictor.)

It seems like “gets the wrong decision theory” is a really mild failure mode. If you can’t cope with that, there is no way you are going to cope with actually malignant failures. Maybe the designer wasn’t counting on dealing with malignant failures at all, and this is an extra reminder that there can be subtle errors that don’t manifest most of the time. But I don’t think it’s much of an argument for understanding philosophy in particular.
by Jessica Taylor 1233 days ago

I agree with this.
by Paul Christiano 1233 days ago

If the NTMs get to look at the predictions of the other NTMs when making their own predictions (there’s probably a fixed-point way to do this), then maybe there’s one out there that copies one of the versions of 3 but makes adjustments for 3’s bad decision theory.

Why not say “If X is a model using a bad decision theory, there is a closely related model X’ that uses a better decision theory and makes better predictions. So once we have some examples that distinguish the two cases, we will use X’ rather than X.”?

Sometimes this kind of argument doesn’t work and you can get tighter guarantees by considering the space of modifications (by coincidence, this exact situation arises here), but I don’t see why this case in particular would bring up that issue.
by Jessica Taylor 1233 days ago

Suppose there are N binary dimensions that predictors can vary on. Then we’d need $$2^N$$ predictors to cover every possibility. On the other hand, we would only need to consider N possible modifications to a predictor. Of course, if the dimensions factor that nicely, then you can probably make enough assumptions about the hypothesis class that you can learn from the $$2^N$$ experts efficiently.

Overall it seems nicer to have a guarantee of the form “if there is a predictable bias in the predictions, then the system will correct this bias” rather than “if there is a strictly better predictor than a bad predictor, then the system will listen to the good predictor”, since it allows capabilities to be distributed among predictors instead of needing to be concentrated in a single predictor. But maybe things work anyway for the reason you gave.
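The counting point above can be made concrete with a few lines of code. The "dimensions" here are abstract placeholders (bit positions, not any real predictor properties): covering every combination of N binary design dimensions takes 2**N distinct experts, while considering single-dimension modifications of one base predictor takes only N.

```python
# Worked version of the 2**N-experts vs. N-modifications count.
from itertools import product

N = 10

all_predictors = list(product([0, 1], repeat=N))  # one expert per combination
base = (0,) * N
modifications = [base[:i] + (1,) + base[i + 1:] for i in range(N)]  # flip one bit

print(len(all_predictors), len(modifications))  # 1024 10
```

The gap is exponential in N, which is why learning over the full expert set needs extra structural assumptions while searching over single modifications stays cheap.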
by Daniel Dewey 1232 days ago

Thanks Jessica, I think we’re on similar pages – I’m also interested in how to ensure that predictions of humans are accurate and non-adversarial, and I think there are probably a lot of interesting problems there.