Intelligent Agent Foundations Forum
http://agentfoundations.org/

Comment on Control and security (Vladimir Nesov): http://agentfoundations.org/item?id=1043
Comment on Control and security (Paul Christiano): http://agentfoundations.org/item?id=1040
Comment on Control and security (Vladimir Nesov): http://agentfoundations.org/item?id=1039
Comment on Online Learning 1: Bias-detecting online learners (Jessica Taylor): http://agentfoundations.org/item?id=1037

Equilibria in Adversarial Supervised Learning (Ryan Carey): http://agentfoundations.org/item?id=1036

Note: This describes an idea of Jessica Taylor’s.

Comment on Control and security (Jessica Taylor): http://agentfoundations.org/item?id=1035
Comment on Online Learning 1: Bias-detecting online learners (Paul Christiano): http://agentfoundations.org/item?id=1034
Comment on Online Learning 1: Bias-detecting online learners (Paul Christiano): http://agentfoundations.org/item?id=1033

Control and security (Paul Christiano): http://agentfoundations.org/item?id=1032

I used to think of AI security as largely unrelated to AI control, and my impression is that some people on this forum probably still do. I’ve recently shifted towards seeing control and security as basically the same, and thinking that security may often be a more appealing way to think and talk about control.

Asymptotic Decision Theory (Jack Gallagher): http://agentfoundations.org/item?id=1031
Comment on Two Questions about Solomonoff Induction (Vadim Kosoy): http://agentfoundations.org/item?id=1030
Comment on Online Learning 1: Bias-detecting online learners (Vadim Kosoy): http://agentfoundations.org/item?id=1029
Comment on Online Learning 1: Bias-detecting online learners (Paul Christiano): http://agentfoundations.org/item?id=1028
Comment on Online Learning 1: Bias-detecting online learners (Ryan Carey): http://agentfoundations.org/item?id=1027

Online Learning 1: Bias-detecting online learners (Ryan Carey): http://agentfoundations.org/item?id=1025

Note: This describes an idea of Jessica Taylor’s, and is the first of several posts about aspects of online learning.

Index of some decision theory posts (Tsvi Benson-Tilsen): http://agentfoundations.org/item?id=1026

Logical inductor limits are dense under pointwise convergence (Sam Eisenstat): http://agentfoundations.org/item?id=1024

Logical inductors [1] are very complex objects, and even their limits are hard to get a handle on. In this post, I investigate the topological properties of the set of all limits of logical inductors.

Comment on Logical Inductors that trust their limits (Devi Borg): http://agentfoundations.org/item?id=1017
Comment on Logical Inductors that trust their limits (Devi Borg): http://agentfoundations.org/item?id=1016
Comment on Variations of the Garrabrant-inductor (Sune Kristian Jakobsen): http://agentfoundations.org/item?id=1015
Comment on Heroin model: AI "manipulates" "unmanipulatable" reward (Jessica Taylor): http://agentfoundations.org/item?id=1014
Comment on (C)IRL is not solely a learning process (Patrick LaVictoire): http://agentfoundations.org/item?id=1013
Comment on The set of Logical Inductors is not Convex (Sam Eisenstat): http://agentfoundations.org/item?id=1012

The set of Logical Inductors is not Convex (Scott Garrabrant): http://agentfoundations.org/item?id=1011

Sam Eisenstat asked the following interesting question: Given two logical inductors over the same deductive process, is every (rational) convex combination of them also a logical inductor? Surprisingly, the answer is no! Here is my counterexample.
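To fix notation (this restates only the definition used in the question, not the counterexample itself): given logical inductors \(\{\mathbb{P}^1_n\}\) and \(\{\mathbb{P}^2_n\}\) over the same deductive process and a rational \(\lambda \in [0,1]\), the convex combination is the pricing-by-pricing mixture

```latex
% Convex combination of two logical inductors, taken pricing-by-pricing:
\mathbb{P}_n(\phi) \;=\; \lambda\,\mathbb{P}^1_n(\phi) \;+\; (1-\lambda)\,\mathbb{P}^2_n(\phi)
\qquad \text{for every sentence } \phi \text{ and every stage } n,
```

and the question is whether this mixed sequence always satisfies the logical induction criterion, i.e. remains inexploitable by any efficiently computable trader.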

Logical Inductors contain Logical Inductors over other complexity classes (Scott Garrabrant): http://agentfoundations.org/item?id=1010

In the Logical Induction paper, we define logical inductors over polynomial-time traders. It is clear from our definition that the choice of polynomial time is somewhat arbitrary: one could just as well define, e.g., an exponential-time logical inductor. What may be less clear is that a logical inductor over one complexity class actually contains logical inductors over other complexity classes within it.

Learning doesn't solve philosophy of ethics (Stuart Armstrong): http://agentfoundations.org/item?id=1008

Model of human (ir)rationality (Stuart Armstrong): http://agentfoundations.org/item?id=1001

A putative new idea for AI control; index here.

This post is an initial foray into modelling human irrationality, for the purpose of successful value learning. Its purpose is not to be a full model, but to have enough detail that various common situations can be successfully modelled. The important thing is to model humans in ways that humans can understand (since it is our definition that determines what counts as a bias and what counts as a preference in humans).

Comment on Heroin model: AI "manipulates" "unmanipulatable" reward (Stuart Armstrong): http://agentfoundations.org/item?id=1007
Variations of the Garrabrant-inductor (Sune Kristian Jakobsen): http://agentfoundations.org/item?id=1006
Test link (Malo Bourgon): http://agentfoundations.org/item?id=1005
Comment on Heroin model: AI "manipulates" "unmanipulatable" reward (Jessica Taylor): http://agentfoundations.org/item?id=1004
Comment on Heroin model: AI "manipulates" "unmanipulatable" reward (Stuart Armstrong): http://agentfoundations.org/item?id=1003
Comment on Heroin model: AI "manipulates" "unmanipulatable" reward (Jessica Taylor): http://agentfoundations.org/item?id=1002
Comment on Heroin model: AI "manipulates" "unmanipulatable" reward (Stuart Armstrong): http://agentfoundations.org/item?id=1000
Comment on (C)IRL is not solely a learning process (Wei Dai): http://agentfoundations.org/item?id=999
Comment on Heroin model: AI "manipulates" "unmanipulatable" reward (Jessica Taylor): http://agentfoundations.org/item?id=998
Comment on Heroin model: AI "manipulates" "unmanipulatable" reward (Stuart Armstrong): http://agentfoundations.org/item?id=997

Heroin model: AI "manipulates" "unmanipulatable" reward (Stuart Armstrong): http://agentfoundations.org/item?id=994

A putative new idea for AI control; index here.

A conversation with Jessica revealed that people weren't understanding my points about AI manipulating the learning process. So here is a formal model of a CIRL-style AI with a prior over human preferences that treats them as an unchangeable historical fact, yet which will manipulate human preferences in practice.

Logical Inductors that trust their limits (Scott Garrabrant): http://agentfoundations.org/item?id=989

Here is another open question related to Logical Inductors. I have not thought about it very long, so it might be easy.

Does there exist a logical inductor \(\{\mathbb P_n\}\) over PA such that for all \(\phi\):

  1. PA proves that \(\mathbb P_\infty(\phi)\) exists and is in \([0,1]\), and

  2. \(\mathbb{E}_n(\mathbb{P}_\infty(\phi))\eqsim_n\mathbb{P}_n(\phi)\)?
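For readers without the paper at hand: \(\eqsim_n\) is the asymptotic-equality notation of the Logical Induction paper, under which condition 2 unpacks as

```latex
% a_n \eqsim_n b_n means the two sequences converge to each other:
\mathbb{E}_n\big(\mathbb{P}_\infty(\phi)\big) \eqsim_n \mathbb{P}_n(\phi)
\quad\Longleftrightarrow\quad
\lim_{n \to \infty} \Big( \mathbb{E}_n\big(\mathbb{P}_\infty(\phi)\big) - \mathbb{P}_n(\phi) \Big) = 0,
```

i.e. the inductor's current price for \(\phi\) tracks its own current expectation of its limiting price.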

Stratified learning and action (Stuart Armstrong): http://agentfoundations.org/item?id=947

A putative new idea for AI control; index here.

(C)IRL is not solely a learning process (Stuart Armstrong): http://agentfoundations.org/item?id=945

A putative new idea for AI control; index here.

I feel Inverse Reinforcement Learning (IRL) and Cooperative Inverse Reinforcement Learning (CIRL) are very good ideas, and will likely be essential for safe AI if we can’t come up with some sort of sustainable low impact, modular, or Oracle design. But IRL and CIRL have a weakness. In a nutshell:

  1. The models (C)IRL uses for humans are underspecified.
  2. This should cause CIRL to have motivated and manipulative learning.
  3. Even without that, (C)IRL can end up fitting a terrible model to humans.
  4. To solve those issues, (C)IRL will need to make creative modelling decisions that go beyond (standard) learning.

Learning values versus learning knowledge (Stuart Armstrong): http://agentfoundations.org/item?id=946

Universal Inductors (Scott Garrabrant): http://agentfoundations.org/item?id=941

Now that the Logical Induction paper is out, I am directing my attention towards decision theory. The approach I currently think will be most fruitful is attempting to make a logically updateless version of Wei Dai’s Updateless Decision Theory. Abram Demski has posted on here about this, but I think Logical Induction provides a new angle with which we can attack the problem. This post will present an alternate way of viewing Logical Induction which I think will be especially helpful for building a logical UDT. (The Logical Induction paper is a prerequisite for this post.)

IRL is hard (Vadim Kosoy): http://agentfoundations.org/item?id=940

We show that, assuming the existence of public-key cryptography, there is an environment in which Inverse Reinforcement Learning is computationally intractable, even though the "teacher" agent, the environment, and the utility functions are all computable in polynomial time and there is only one bit of information to learn.

Oracle design as de-black-boxer (Stuart Armstrong): http://agentfoundations.org/item?id=937

Causal graphs and counterfactuals (Stuart Armstrong): http://agentfoundations.org/item?id=930

Simplified explanation of stratification (Stuart Armstrong): http://agentfoundations.org/item?id=927

A putative new idea for AI control; index here.

I’ve previously talked about stratified indifference/learning. In this short post, I’ll try to present the idea as simply and clearly as possible.

The non-indifferent behaviour of stratified indifference? (Stuart Armstrong): http://agentfoundations.org/item?id=914
Corrigibility through stratified indifference and learning (Stuart Armstrong): http://agentfoundations.org/item?id=911

Modeling the capabilities of advanced AI systems as episodic reinforcement learning (Jessica Taylor): http://agentfoundations.org/item?id=910

Here I’ll summarize the main abstraction I use for thinking about future AI systems. This is essentially the same model that Paul uses. I’m not actually introducing any new ideas in this post; mostly this is intended to summarize my current views.