by Patrick LaVictoire 822 days ago | on: Proposal for an Implementable Toy Model of Informe...
I like this suggestion of a more feasible form of steganography for NNs to figure out! But I think you'd need further advances in transparency to get useful informed oversight capabilities from (transformed or not) copies of the predictive network.
by Patrick LaVictoire 854 days ago | on: HCH as a measure of manipulation
Re #2, I think this is an important objection to low-impact-via-regularization-penalty in general.
by Patrick LaVictoire 854 days ago | on: HCH as a measure of manipulation
Re #1, an obvious set of questions to include in $$q$$ is the set of approval questions about various aspects of the AI's policy. (In particular, if we want the AI to later calculate a human's HCH and ask it for guidance, then we would like to be sure that HCH's answer to that question is not manipulated.)
by Patrick LaVictoire 854 days ago | on: HCH as a measure of manipulation
There's the additional objection of "if you're doing this, why not just have the AI ask HCH what to do?" Overall, I'm hoping that it could be easier for an AI to robustly conclude that a certain plan changes a human's HCH only via certain informational content than for the AI to reliably calculate the human's HCH. But I don't have strong arguments for this intuition.
by Jessica Taylor 854 days ago
"Having a well-calibrated estimate of HCH" is the condition you want, not "being able to reliably calculate HCH".
by Patrick LaVictoire 854 days ago
I should have said "reliably estimate HCH"; I'd also want quite a lot of precision in addition to calibration before I trust it.
by Patrick LaVictoire 857 days ago | on: All the indifference designs
Question that I haven't seen addressed (and haven't worked out myself): which of these indifference methods are reflectively stable, in the sense that the AI would not push a button to remove them (or switch to a different indifference method)?
by Patrick LaVictoire 861 days ago | on: Modal Combat for games other than the prisoner's d...
This is a lot of good work! That said, modal combat is (in my opinion) increasingly deprecated compared to studying decision theory with logical inductors, for reasons like the ones you noted in this post; so I'm not sure this is worth developing further.
by Patrick LaVictoire 885 days ago | on: Censoring out-of-domain representations
Yup, this isn't robust to extremely capable systems; it's a quantitative shift in how promising it looks to the agent to learn about external affairs, not a qualitative one. (In the example with the agent doing engineering in a sandbox that doesn't include humans or general computing devices, there could be a strong internal gradient to learn obvious details about the things immediately outside its sandbox, and a weaker gradient for learning more distant or subtle things before it knows the nearby obvious ones.) A whitelisting variant would clearly be more reliable than a blacklisting one.
by Patrick LaVictoire 959 days ago | on: (Non-)Interruptibility of Sarsa(λ) and Q-Learning
Nice! One thing that might be useful for context: what's the theoretically correct amount of time that you would expect an algorithm to spend on the right vs. the left if the session gets interrupted each time it goes 1 unit to the right? (I feel like there should be a pretty straightforward way to calculate the heuristic version where the movement is just Brownian motion that gets interrupted early if it hits +1.)
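The heuristic version in the parenthetical can at least be estimated numerically. Below is a minimal Monte Carlo sketch: a symmetric random walk stands in for Brownian motion, and the function name and parameter values are hypothetical choices, not anything from the original experiments.

```python
import random

def interrupted_walk_times(n_runs=500, step=0.05, max_steps=5000, seed=0):
    """Estimate time spent right vs. left of the origin for a symmetric
    random walk (a crude stand-in for Brownian motion) whose session is
    interrupted the first time it reaches +1."""
    rng = random.Random(seed)
    time_right = 0
    time_left = 0
    for _ in range(n_runs):
        x = 0.0
        for _ in range(max_steps):
            x += step if rng.random() < 0.5 else -step
            if x >= 1.0:  # session interrupted at +1
                break
            if x > 0:
                time_right += 1
            elif x < 0:  # x == 0.0 exactly counts for neither side
                time_left += 1
    return time_right, time_left

right, left = interrupted_walk_times()
print(right, left, right / (right + left))
```

Since excursions to the right are truncated at +1 while leftward excursions run uninterrupted, the time spent on the left should dominate; comparing that imbalance to the learners' observed behavior would give the "uninfluenced" baseline the question asks about.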
by Patrick LaVictoire 1001 days ago | on: Asymptotic Decision Theory
Typo: the statement of Theorem 4.1 omits the word "continuous".
by Patrick LaVictoire 1021 days ago | on: (C)IRL is not solely a learning process
Stuart did make it easier for many of us to read his recent ideas by crossposting them here. I'd like there to be some central repository for the current set of AI control work, and I'm hoping that the forum could serve as that. Is there functionality that, if added here, would make it trivial to crosspost when you write something of note?