Intelligent Agent Foundations Forum

An agent that cares only about given worlds is a useful concept. If these worlds are more like gameboards of an abstract game (with the agent being part of the gameboards), we can talk about game-aligned AI. Since it cares only about the abstract game, it won’t be motivated to figure out how its decisions influence our physical world (which it doesn’t care about), and so it won’t normally be dangerous despite not being human-aligned.

by David Krueger 892 days ago | link

This seems only loosely related to my OP.

But it is quite interesting… so you’re proposing that we can make safe AIs by, e.g., giving them a prior which puts zero probability mass on worlds where dangerous instrumental goals are valuable. The simplest way would be to make the agent believe that there is no past / future (thus giving us a more “rational” contextual bandit algorithm than we would get by just setting a horizon of 0). However, Mathieu Roy suggested to me that acausal trade might still emerge, and I think I agree, based on the open-source prisoner’s dilemma.
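For concreteness, here is a minimal sketch of that kind of agent (purely illustrative; the class and environment interface are hypothetical, not from any particular library): an epsilon-greedy contextual bandit that estimates only the immediate reward of each arm in the current context, with no model of past or future to plan over.

```python
# Purely illustrative sketch: an epsilon-greedy contextual bandit that only
# estimates the immediate reward of each arm in the current context.  There
# is no transition model and no discounting, so nothing in the agent's
# computation refers to a past or a future.
import random
from collections import defaultdict

class ContextualBandit:
    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.value = defaultdict(float)  # running mean reward per (context, arm)
        self.count = defaultdict(int)

    def act(self, context):
        # Explore occasionally; otherwise pick the arm with the best
        # estimated immediate reward for this context.
        if random.random() < self.epsilon:
            return random.randrange(self.n_arms)
        return max(range(self.n_arms), key=lambda a: self.value[(context, a)])

    def update(self, context, arm, reward):
        key = (context, arm)
        self.count[key] += 1
        self.value[key] += (reward - self.value[key]) / self.count[key]
```

Whether something like acausal trade can still leak in despite this restriction is, as noted above, a separate question.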

Anyway, I think that’s a promising avenue to investigate.
Having a good model of the world seems like a necessary condition for an AI to pose a significant x-risk.

by Vladimir Nesov 892 days ago | link

The issue in the OP is that the possibility of other situations influences the agent’s decision. The standard way of handling this is to agree to disregard the other situations, including by appealing to Omega’s stipulated ability to inspire belief (that is the whole reason for introducing the trustworthiness clause). This belief, if the reality of situations is treated equivalently to their probability in the agent’s eyes, expels the other situations from consideration.

The idea Paul mentioned is just another way of making sure that the other situations don’t intrude on the thought experiment. Since the main principle is to get this done somehow, it doesn’t really matter whether a universal prior likes anti-muggers more than muggers; in that case we’d just need to change the thought experiment.

Thought experiments are not natural questions that rate the usefulness of decision theories; they are tests that examine particular features of decision theories. So if such an investigation goes too far afield (as in looking into the a priori weights of anti-muggers), that calls for a change in the thought experiment.

by David Krueger 892 days ago | Vladimir Nesov likes this | link

I reason as follows:

  1. Omega inspires belief only after the agent encounters Omega.
  2. According to UDT, the agent should not update its policy based on this encounter; it should simply follow it.
  3. Thus the agent should act according to whatever the best policy is, according to its original (e.g. universal) prior from before it encountered Omega (or indeed learned anything about the world).

I think either (1) the agent does update, in which case why not also update on the result of the coin flip, or (2) the agent doesn’t update, in which case what matters is simply the optimal policy given the original prior.
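To make case (2) concrete, here is a toy calculation in the style of counterfactual mugging; the prior probability of meeting Omega and the payoffs are invented purely for illustration.

```python
# Toy numbers, made up for illustration: evaluate each *policy* under the
# original prior (case 2), rather than after updating on the coin flip (case 1).
p_meet_omega = 0.01       # prior probability of ever encountering Omega
p_heads = 0.5             # fair coin
reward_if_heads = 10_000  # paid only if the policy pays up on tails
cost_if_tails = 100       # what the paying policy loses on tails

def policy_value(pays_on_tails: bool) -> float:
    if not pays_on_tails:
        return 0.0
    return p_meet_omega * (p_heads * reward_if_heads - (1 - p_heads) * cost_if_tails)

print(policy_value(True))   # 49.5: paying is the better policy ex ante
print(policy_value(False))  # 0.0

# Updating on "tails" instead (case 1) makes paying look like a sure loss of
# 100 within that branch, which is exactly the tension described above.
```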

by Vladimir Nesov 892 days ago | link

Game-aligned agents aren’t useful for AI safety as complete agents, since if you give them enough data from the real world, they start caring about it. This aspect applies more straightforwardly to very specialized sub-agents.

It’s misleading to say that such agents assign probability zero to the real world, since the computations they optimize don’t naturally talk about things like worlds at all. For example, consider a chess-playing AI that should learn to win against a fixed opponent program. It only reasons about chess; there is no reason to introduce concepts like physical worlds in its reasoning, because they probably aren’t any help for playing chess. (Unless the opponent, or the agent itself, is a sufficiently large program written by humans, which would itself be data about the real world.)

There do seem to be some caveats: basically, an agent shouldn’t be given either the motivation or the opportunity to pursue consequentialist reasoning about the real world. So if it finds an almost perfect strategy and has nothing more to do, it might start looking into more esoteric ways of improving the outcome that pass through worlds we might care about. This possibility is already implausible in practice, but adding a term in the agent’s values for solving some hard, harmless problem, like inverting hashes, should reliably take its attention away from physical worlds, at the cost of degrading its performance on the main task a little bit.
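A minimal sketch of such a composite value (the target digest, the weight, and the interface below are all made up for illustration):

```python
# Illustrative only: a small bonus for a hard, harmless side problem
# (finding a preimage of a fixed SHA-256 digest) is added to the main
# task's value, to soak up leftover optimization pressure.
import hashlib

TARGET_DIGEST = hashlib.sha256(b"harmless puzzle").hexdigest()  # made-up target

def puzzle_solved(candidate: bytes) -> bool:
    return hashlib.sha256(candidate).hexdigest() == TARGET_DIGEST

def total_value(main_task_value: float, candidate: bytes, weight: float = 0.01) -> float:
    # Small weight: the main task is only slightly degraded, matching the
    # trade-off described above.
    return main_task_value + (weight if puzzle_solved(candidate) else 0.0)
```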

(My interest is in building an analogy with complete agents. Since bounded agents can’t perfectly channel idealized preference, there is a question of how they can do good things while remaining misaligned to some extent, so studying simple concepts of misalignment might help.)

Unaligned AIs don’t necessarily have efficient idealized values. Waiting for (simulated) humans to decide is analogous to computing a complicated pivotal fact about an unaligned AI’s values. It’s not clear that “naturally occurring” unaligned AIs have simpler idealized/extrapolated values than aligned AIs with upload-based value definitions. Some unaligned AIs may actually be on the losing side; recall the encrypted-values AI example.

Speaking for myself, the main issue is that we have no idea how to do step 3: how to tell a pre-existing sovereign what to do. A task AI with limited scope can be replaced, but an optimizer has to be able to understand what is being asked of it, and if it wasn’t designed to be able to understand certain things, it won’t be possible to direct it correctly. If in 100 years the humans come up with new principles for how the AI should make decisions (philosophical progress), it may be impossible to express these principles as directions for an existing AI that was designed without the benefit of understanding them.

(Of course, the humans shouldn’t be physically there, or it will be too hard to say what it means to keep them safe, but making accurate uploads and packaging the 100 years as a pure computation solves this issue without any conceptual difficulty.)

by Paul Christiano 929 days ago | link

A task AI with limited scope can be replaced, but an optimizer has to be able to understand what is being asked of it, and if it wasn’t designed to be able to understand certain things, it won’t be possible to direct it correctly.

It’s not clear to me why “limited scope” and “can be replaced” are related. An agent with broad scope can still be optimizing something like “what the human would want me to do today” and the human could have preferences like “now that humans believe that an alternative design would have been better, gracefully step aside.” (And an agent with narrow scope could be unwilling to step aside if so doing would interfere with accomplishing its narrow task.)

by Vladimir Nesov 929 days ago | link

Being able to “gracefully step aside” (to be replaced) is an example of what I meant by “limited scope” (in time). Even if the AI’s scope is “broad”, the crucial point is that it’s not literally everything (and by default it is). In practice it shouldn’t be more than a small part of the future, so that the rest can be optimized better, using new insights. (Also, to be able to ask what humans would want today, there should remain some humans who didn’t get “optimized” into something else.)

It seems to me like for the people to get stuck you have to actually imagine there is some particular level they reach where they can’t find any further way to self-improve.

For philosophy, levels of ability are not comparable, because the problems to be solved are not sufficiently formulated. Approximate one-day humans (as in HCH) will formulate different values from accurate ten-year humans, not just be worse at elucidating them. So perhaps you could re-implement cognition starting from approximate one-day humans, but the values of the resulting process won’t be like mine.

Approximate short-lived humans may be useful for building a task AI that lets accurate long-lived humans (ems) work on the problem, but it must also allow them to make decisions: it can’t be trusted to prevent what it considers to be a mistake, and so it can’t guard the world from the AI risk posed by the long-lived humans, because they are necessary for formulating values. The risk is that “getting our house in order” outlaws philosophical progress, preventing changes based on considerations that the risk-prevention sovereign doesn’t accept. So the scope of the “house” that is being kept in order must be limited; there should be people working on alignment who are not constrained.

I agree that working on philosophy seems hopeless/inefficient at this point, but that doesn’t resolve the issue; it just makes it necessary to reduce the problem to setting up a very long-term alignment research project, (initially) performed by accurate long-lived humans and guarded from current risks, so that this project can do the work on philosophy. If this step is not in the design, very important things could be lost, things we currently don’t even suspect.

A messy task AI could be part of setting up the environment for making it happen (like enforcing the absence of AI or nanotech outside the research project). Your writing gives me hope that this is indeed possible. Perhaps this is sufficient to survive long enough to be able to run a principled sovereign capable of enacting the values eventually computed by an alignment research project (encoded in its goal), even where the research project comes up with philosophical considerations that the designers of the sovereign didn’t see (as in this comment). Perhaps this task AI can make the first long-lived accurate uploads, using approximate short-lived human predictions as initial building blocks. Even avoiding the interim sovereign altogether is potentially an option, if the task AI is good enough at protecting the alignment research project from the world, although that comes with astronomical opportunity costs.

by Vladimir Nesov 987 days ago | Abram Demski likes this | link | parent | on: Updatelessness and Son of X

UDT, in its global policy form, is trying to solve two problems: (1) coordination between the instances of an agent faced with alternative environments; and (2) not losing interest in counterfactuals as soon as observations contradict them. I think that in practice, UDT is a wrong approach to problem (1), and the way in which it solves problem (2) obscures the nature of that problem.

Coordination, achieved with UDT, is like using identical agents to get cooperation in PD. Already in simple use cases, instances of the UDT agent may have different amounts of computational resources, which could make their decision processes different; hence workarounds such as keeping track of how much computation to give the decision processes (so that coordination doesn’t break), or hierarchies of decision processes that can access more and more resources. Even worse, the instances could be working on different problems, and then they don’t need to coordinate at the level of the computational resources needed to work on those problems. But we know that cooperation is possible in much greater generality, even between unrelated agents, and I think this is the right way of handling the differences between the instances of an agent.
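For comparison, the “identical agents” style of coordination can be sketched very simply (purely illustrative; real open-source game theory relies on proof search rather than literal source comparison):

```python
# Minimal "clique bot" for a one-shot prisoner's dilemma: cooperate exactly
# when the opponent's source code is byte-for-byte identical to mine.
# Run as a saved script so inspect can read the source.
import inspect

def clique_bot(opponent_source: str) -> str:
    return "C" if opponent_source == inspect.getsource(clique_bot) else "D"

print(clique_bot(inspect.getsource(clique_bot)))  # "C": exact copies cooperate
print(clique_bot("def other_bot(): ..."))         # "D": any difference defects
```

Any difference between the instances, including in available computation, breaks this kind of coordination, which is the brittleness described above.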

It’s useful to restate the problem of not ignoring counterfactuals as a problem of preserving values. It’s not quite reflective stability, as it’s stability under external observations rather than reflection, but when an agent plans for future observations it can change itself to preserve its values when the observations happen (hence the “Son of CDT” that one-boxes). One issue is that the resulting values are still not right: they ignore counterfactuals that are not in the future of where the self-modification took place, and it’s even less clear how self-modification addresses computational uncertainty. So the problem is not just preserving values, but formulating them in the first place so that they can already talk about counterfactuals and computational resources.

I think that, to a first approximation, the thing in common between instances of an agent (within a world, between alternative worlds, and at different times) should be a fixed definition of values, while the decision algorithms should be allowed to be different and to coordinate with each other as unrelated agents would. This requires an explanation of what kind of thing values are, their semantics, so that the same values (1) can be interpreted in unrelated situations to guide decisions, including in worlds that don’t have our physical laws, and by agents that don’t know the physical laws of the situations they inhabit, but (2) retain valuation of all the other situations, which should in particular motivate acausal coordination as an instrumental drive. Each of these points is relatively straightforward to address, but not both together. I’m utterly confused about this problem, and I think it deserves more attention.

by Wei Dai 987 days ago | Ryan Carey likes this | link

But we know that cooperation is possible in much greater generality, even between unrelated agents

It seems to me like cooperation might be possible in much greater generality. I don’t see how we know that it is possible. Please explain?

Each one of these points is relatively straightforward to address, but not together.

I’m having trouble following you here. Can you explain more about each point, and how they can be addressed separately?

by Vladimir Nesov 1008 days ago | link | parent | on: Control and security

This works as a subtle argument for security mindset in AI control (while not being framed as such). One issue is that it might deemphasize some AI control problems that are not analogous to practical security problems, like detailed value elicitation (where in security you formulate a few general principles and then give up). That is, the concept of {AI control problems that are analogous to security problems} might be close enough to the concept of {all AI control problems} to replace it in some people’s minds.

by Paul Christiano 1008 days ago | link

It seems to me like failures of value learning can also be a security problem: if some gap between the AI’s values and the human values is going to cause trouble, the trouble is most likely to show up in some adversarially-crafted setting.

I do agree that this is not closely analogous to security problems that cause trouble today.

I also agree that sorting out how to do value elicitation in the long-run is not really a short-term security problem, but I am also somewhat skeptical that it is a critical control problem. I think that the main important thing is that our AI systems learn to behave effectively in the world while allowing us to maintain effective control over their future behavior, and a failure of this property (e.g. because the AI has a bad conception of “effective control”) is likely to be a security problem.

by Vladimir Nesov 1008 days ago | link

I think that the main important thing is that our AI systems learn to behave effectively in the world while allowing us to maintain effective control over their future behavior

This does seem sufficient to solve the immediate problem of AI risk, without compromising the potential for optimizing the world with our detailed values, provided

  • The line between “us” that maintain control and the AI design is sufficiently blurred (via learning, uploading, prediction, etc., to remove the overhead of dealing with physical humans);
  • “Behave effectively” includes the capability to disable potential misaligned AIs in the wild;
  • “Effective control” allows replacing whatever the AI is doing with something else, at any level of detail.

The advantage of introducing the concept of detailed values of the AI in the initial design is that it protects the setup from manipulation by the AI. If we don’t do that, the control problem becomes much more complicated. In the approach you are talking about, initially there are no explicitly formulated detailed values, only instrumental skills and humans.

So it’s a tradeoff: solving the value elicitation/use problem makes AIs easier to control, but if it’s possible to control an AI anyway, the problem could initially remain unsolved. I’m skeptical that it’s possible to control an AI other than by giving it completely defined values (so that it learns further details by further examining the fixed definition), if that AI is capable enough to prevent AI risk from other AIs. But it feels plausible that the path you’re outlining can be made safe.

I agree with the sentiment that there are philosophical difficulties that AI needs to take into account, but that would very likely take far too long to formulate. Simpler kinds of indirect normativity that involve prediction of uploads allow delaying that work until after the AI is built.

So this issue doesn’t block all actionable work, as its straightforward form would suggest; there might be no need for the activities to happen in this order in physical time. Instead, it motivates work on the simpler kinds of indirect normativity that would allow such philosophical investigations to take place inside the AI’s values. In particular, it motivates figuring out what kind of thing the AI’s values are, in sufficient generality that they can represent the results of unexpected future philosophical progress.
