Intelligent Agent Foundations Forum
Learning incomplete models using dominant markets
post by Vadim Kosoy 6 days ago | Jessica Taylor likes this | discuss

This post is a formal treatment of the idea outlined here.

Given a countable set of incomplete models, we define a forecasting function that converges in the Kantorovich-Rubinstein metric with probability 1 to every one of the models which is satisfied by the true environment. This is analogous to Blackwell-Dubins merging of opinions for complete models, except that Kantorovich-Rubinstein convergence is weaker than convergence in total variation. The forecasting function is a dominant stochastic market for a suitably constructed set of traders.
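For reference, a standard statement of the Kantorovich-Rubinstein (Wasserstein-1) distance between probability measures \(\mu\) and \(\nu\) on a metric space \((X,d)\), in its dual form (notation is mine, not taken from the post):

\[ W_1(\mu, \nu) = \sup_{\operatorname{Lip}(h) \le 1} \left| \int_X h \, d\mu - \int_X h \, d\nu \right|. \]

Convergence in this metric only requires expectations of 1-Lipschitz test functions to agree in the limit, which is why it is weaker than convergence in total variation, the sense in which Blackwell-Dubins merging holds for complete models.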

continue reading »
HCH as a measure of manipulation
discussion post by Patrick LaVictoire 12 days ago | 6 comments
Dominant stochastic markets
post by Vadim Kosoy 13 days ago | discuss

We generalize the formalism of dominant markets to account for stochastic “deductive processes,” and prove a theorem regarding the asymptotic behavior of such markets. In a following post, we will show how to use these tools to formalize the ideas outlined here.

continue reading »
Modal Combat for games other than the prisoner's dilemma
post by Alex Mennen 25 days ago | Patrick LaVictoire and Scott Garrabrant like this | 1 comment
continue reading »
Generalizing Foundations of Decision Theory
discussion post by Abram Demski 25 days ago | Ryan Carey, Vadim Kosoy, Jessica Taylor and Scott Garrabrant like this | 8 comments
Translation "counterfactual"
post by Stuart Armstrong 27 days ago | discuss

In a previous post, I briefly mentioned translations as one of three possible counterfactuals for indifference. Here I want to clarify what I meant there, because the idea is interesting.

continue reading »
Nearest unblocked strategy versus learning patches
post by Stuart Armstrong 28 days ago | 9 comments

The nearest unblocked strategy problem (NUS) is the idea that if you program a restriction or a patch into an AI, then the AI will often be motivated to pick a strategy that is as close as possible to the banned strategy, very similar in form, and maybe just as dangerous.

For instance, if the AI is maximising a reward \(R\), and does some behaviour \(B_i\) that we don’t like, we can patch the AI’s algorithm with patch \(P_i\) (‘maximise \(R_0\) subject to these constraints…’), or modify \(R\) to \(R_i\) so that \(B_i\) doesn’t come up. I’ll focus more on the patching example, but the modified reward one is similar.
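As a minimal sketch of why this fails (my own illustration, not code from the post): if the agent simply picks the highest-proxy-reward action that is not explicitly banned, blacklisting one bad behaviour just shifts the choice to a near-identical neighbour.

    # Hypothetical sketch of the nearest-unblocked-strategy problem.
    # "Patches" are modelled as a blacklist of banned actions; the proxy
    # reward R itself is left unchanged.

    def choose_action(actions, R, banned):
        """Pick the highest-R action that is not explicitly banned."""
        allowed = [a for a in actions if a not in banned]
        return max(allowed, key=R)

    # Toy action space: a family of nearly identical exploit strategies
    # plus the behaviour we actually intended.
    actions = [("exploit", k) for k in range(5)] + [("intended", 0)]
    R = lambda a: 10 - a[1] if a[0] == "exploit" else 5

    banned = set()
    print(choose_action(actions, R, banned))   # ('exploit', 0) -- the bad behaviour B_i
    banned.add(("exploit", 0))                 # patch P_i: ban that one strategy
    print(choose_action(actions, R, banned))   # ('exploit', 1) -- nearest unblocked variant

Each patch removes one point from the strategy space while leaving the incentive intact, which is the sense in which the patched agent can remain just as dangerous.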

continue reading »
Some problems with making induction benign, and approaches to them
post by Jessica Taylor 28 days ago | Nate Soares, Patrick LaVictoire and Stuart Armstrong like this | 4 comments

The universal prior is malign. I’ll talk about a sequence of problems that cause it to be malign, and possible solutions.

continue reading »
Maximally efficient agents will probably have an anti-daemon immune system
discussion post by Jessica Taylor 29 days ago | Ryan Carey, Patrick LaVictoire and Scott Garrabrant like this | discuss
All the indifference designs
discussion post by Stuart Armstrong 29 days ago | Patrick LaVictoire likes this | 1 comment
Prediction Based Robust Cooperation
post by Scott Garrabrant 29 days ago | Patrick LaVictoire likes this | 1 comment

In this post, we present a new approach to robust cooperation, as an alternative to the “modal combat” framework. This post is very hand-wavy. If someone would like to work on making it better, let me know.

continue reading »
Counterfactually uninfluenceable agents
post by Stuart Armstrong 30 days ago | discuss

Techniques used to counter agents making biased decisions do not, by themselves, produce uninfluenceable agents.

However, using counterfactual tools, we can construct uninfluenceable \(\widehat{P}\) and \(P\), starting from biased and influenceable ones.

continue reading »
Indifference and compensatory rewards
discussion post by Stuart Armstrong 36 days ago | discuss
Are daemons a problem for ideal agents?
discussion post by Jessica Taylor 40 days ago | 1 comment
Entangled Equilibria and the Twin Prisoners' Dilemma
post by Scott Garrabrant 40 days ago | Vadim Kosoy and Patrick LaVictoire like this | 2 comments

In this post, I present a generalization of Nash equilibria to non-CDT agents. I will use this formulation to model mutual cooperation in a twin prisoners’ dilemma, caused by the belief that the other player is similar to you, and not by mutual prediction. (This post came mostly out of a conversation with Sam Eisenstat, with contributions from Tsvi Benson-Tilsen and Jessica Taylor.)
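As a concrete instance (the payoff values are standard prisoners’ dilemma numbers chosen by me for illustration, not taken from the post): if a player is certain the twin will play the same move, the only attainable outcomes are mutual cooperation and mutual defection, so cooperation is preferred whenever the \((C,C)\) payoff exceeds the \((D,D)\) payoff, even though defection strictly dominates in the usual CDT analysis.

\[
\begin{array}{c|cc}
 & C & D \\ \hline
C & (2,2) & (0,3) \\
D & (3,0) & (1,1)
\end{array}
\qquad
\text{twin constraint: compare } (C,C) \text{ with } (D,D), \text{ and } 2 > 1.
\]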

continue reading »
How likely is a random AGI to be honest?
discussion post by Jessica Taylor 41 days ago | 1 comment
Minimizing Empowerment for Safety
discussion post by David Krueger 42 days ago | 2 comments
True understanding comes from passing exams
post by Stuart Armstrong 45 days ago | 5 comments

I’ll try to clarify what I was doing with the AI truth setup in a previous post. First I’ll explain the nature of the challenge, and then how the setup tries to solve it.

The nature of the challenge is to have an AI give genuine understanding to a human. Getting the truth out of an AI or Oracle is not that hard, conceptually: you get the AI to report some formal property of its model. The problem is that that truth can be completely misleading, or, more likely, incomprehensible.

continue reading »
Does UDT *really* get counter-factually mugged?
discussion post by David Krueger 47 days ago | 7 comments
Learning Impact in RL
discussion post by David Krueger 47 days ago | Daniel Dewey likes this | 6 comments
Humans as a truth channel
post by Stuart Armstrong 50 days ago | discuss

Defining truth and accuracy is tricky, so when I’ve proposed designs for things like Oracles, I’ve either used a very specific and formal question, or an indirect criterion for truth.

Here I’ll try to get a more direct system, so that the AI tells the human the truth about a question in a way the human understands.

continue reading »
Hacking humans
discussion post by Stuart Armstrong 50 days ago | discuss
Censoring out-of-domain representations
discussion post by Patrick LaVictoire 50 days ago | Jessica Taylor and Stuart Armstrong like this | 3 comments
Emergency learning
post by Stuart Armstrong 54 days ago | Ryan Carey likes this | discuss

Suppose we knew that superintelligent AI was going to be developed within six months. What would I do?

Well, drinking coffee by the barrel at MIRI’s emergency research retreat, I’d… still probably spend a month looking at things from the meta level and clarifying old ideas. But, assuming that didn’t reveal any new approaches, I’d try to get something like this working.

continue reading »
Thoughts on Quantilizers
post by Stuart Armstrong 55 days ago | Ryan Carey and Abram Demski like this | discuss

This post will look at some of the properties of quantilizers: when they succeed and how they might fail.

Roughly speaking, let \(f\) be some true objective function that we want to maximise. We haven’t been able to specify it fully, so we have instead a proxy function \(g\). There is a cost function \(c=f-g\) which measures how much \(g\) falls short of \(f\). Then a quantilizer will choose actions (or policies) randomly from the top \(n\%\) of actions available, ranking those actions according to \(g\).
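A minimal sketch of this selection rule (my own illustration; the names and numbers are not from the post): score the candidate actions with the proxy \(g\), keep the top \(n\%\), and sample uniformly from that set instead of taking the argmax.

    import random

    def quantilize(actions, g, top_fraction=0.05):
        """Sample uniformly from the top `top_fraction` of actions, ranked by the proxy g."""
        ranked = sorted(actions, key=g, reverse=True)
        k = max(1, int(len(ranked) * top_fraction))
        return random.choice(ranked[:k])

    # Example: 1000 candidate actions scored by a proxy reward g.
    actions = range(1000)
    g = lambda a: -abs(a - 700)           # proxy reward peaks at action 700
    print(quantilize(actions, g, 0.05))   # some action near 700, not necessarily the argmax

Randomising over the top \(n\%\) rather than maximising limits how aggressively the agent can exploit the gap \(c = f - g\) between the proxy and the true objective.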

continue reading »