Intelligent Agent Foundations Forum

I like this suggestion of a more feasible form of steganography for NNs to figure out! But I think you’d need further advances in transparency to get useful informed oversight capabilities from (transformed or not) copies of the predictive network.

Re #2, I think this is an important objection to low-impact-via-regularization-penalty in general.

Re #1, an obvious set of questions to include in \(q\) is the set of approval questions about various aspects of the AI’s policy. (In particular, if we want the AI to later calculate a human’s HCH and ask it for guidance, then we would like to be sure that HCH’s answer to that question is not manipulated.)

There’s the additional objection of “if you’re doing this, why not just have the AI ask HCH what to do?”

Overall, I’m hoping that it could be easier for an AI to robustly conclude that a certain plan only changes a human’s HCH via certain informational content, than for the AI to reliably calculate the human’s HCH. But I don’t have strong arguments for this intuition.

by Jessica Taylor 854 days ago

“Having a well-calibrated estimate of HCH” is the condition you want, not “being able to reliably calculate HCH”.

by Patrick LaVictoire 854 days ago

I should have said “reliably estimate HCH”; I’d also want quite a lot of precision in addition to calibration before I trust it.
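
(To illustrate the distinction with a toy example of my own, not anything from the discussion above: a predictor that always reports the base rate is perfectly calibrated but has essentially no precision, and a proper scoring rule like the Brier score picks up the difference.)

```python
import random

random.seed(0)
outcomes = [1 if random.random() < 0.5 else 0 for _ in range(10_000)]

# Predictor A always reports the base rate: well calibrated (its "50%"
# events happen about half the time) but with no precision at all.
preds_a = [0.5] * len(outcomes)

# Predictor B is also calibrated (its "90%" calls come true ~90% of the
# time) but far more precise, committing to 0.9 or 0.1 on each case.
preds_b = [0.9 if (y == 1) == (random.random() < 0.9) else 0.1
           for y in outcomes]

def brier(preds, ys):
    """Mean squared error of probabilistic forecasts (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

print("calibrated but imprecise:", brier(preds_a, outcomes))  # about 0.25
print("calibrated and precise:  ", brier(preds_b, outcomes))  # about 0.09
```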

Question that I haven’t seen addressed (and haven’t worked out myself): which of these indifference methods are reflectively stable, in the sense that the AI would not push a button to remove them (or switch to a different indifference method)?

This is a lot of good work! In my opinion, though, modal combat is increasingly deprecated, for reasons like the ones you noted in this post, in favor of studying decision theory with logical inductors; so I’m not sure this is worth developing further.

Yup, this isn’t robust to extremely capable systems; it’s a quantitative shift in how promising it looks to the agent to learn about external affairs, not a qualitative one.

(In the example with the agent doing engineering in a sandbox that doesn’t include humans or general computing devices, there could be a strong internal gradient to learn obvious details about the things immediately outside its sandbox, and a weaker gradient for learning more distant or subtle things before it knows the nearby obvious ones.)

A whitelisting variant would be way more reliable than a blacklisting one, clearly.
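
(As a generic sketch of why, nothing specific to the proposal itself: a blacklist fails open on effects nobody thought to forbid, while a whitelist fails closed on anything not explicitly vetted. The effect labels below are hypothetical.)

```python
# Hypothetical effect labels, purely for illustration.
FORBIDDEN = {"break_vase", "harm_human"}
ALLOWED = {"move_box", "weld_joint"}

def blacklist_ok(effect: str) -> bool:
    # Fails open: a side effect nobody anticipated slips through.
    return effect not in FORBIDDEN

def whitelist_ok(effect: str) -> bool:
    # Fails closed: anything not explicitly vetted is rejected.
    return effect in ALLOWED

print(blacklist_ok("melt_server_room"))  # True  -- the dangerous default
print(whitelist_ok("melt_server_room"))  # False
```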

Nice! One thing that might be useful for context: what’s the theoretically correct amount of time that you would expect an algorithm to spend on the right vs. the left if the session gets interrupted each time it goes 1 unit to the right? (I feel like there should be a pretty straightforward way to calculate the heuristic version where the movement is just Brownian motion that gets interrupted early if it hits +1.)
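
(For the heuristic version, a quick Monte Carlo sketch would look something like the following; the step size, horizon, and trial count are arbitrary choices of mine, and some finite horizon is needed because the hitting time of +1 has infinite expectation.)

```python
import random

def simulate_walk(step=0.05, max_steps=50_000):
    """One walk, interrupted once it reaches +1 (or at the horizon);
    returns (steps spent with x > 0, steps spent with x < 0)."""
    x = 0.0
    right = left = 0
    for _ in range(max_steps):
        x += step if random.random() < 0.5 else -step
        if x > 0:
            right += 1
        elif x < 0:
            left += 1
        if x >= 1.0:  # session interrupted 1 unit to the right
            break
    return right, left

def right_fraction(trials=200):
    """Fraction of total time spent on the right, pooled over all trials."""
    results = [simulate_walk() for _ in range(trials)]
    right = sum(r for r, _ in results)
    left = sum(l for _, l in results)
    return right / (right + left)

print("fraction of time spent on the right:", right_fraction())
```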

Typo: The statement of Theorem 4.1 omits the word “continuous”.

Stuart did make it easier for many of us to read his recent ideas by crossposting them here. I’d like there to be some central repository for the current set of AI control work, and I’m hoping that the forum could serve as that.

Is there functionality that, if added here, would make it trivial to crosspost when you’ve written something of note?
