by Tom Everitt 307 days ago

My confusion is the following. Premises (*) and inferences (=>):

* The primary way for the agent to avoid traps is to delegate to a soft-maximiser.
* Any action with boundedly negative utility, a soft-maximiser will take with positive probability.
* Actions leading to traps do not have infinitely negative utility.

=> The agent will fall into traps with positive probability.

* If the agent falls into a trap with positive probability, then it will have linear regret.

=> The agent will have linear regret.

So when you say in the beginning of the post "a Bayesian DIRL agent is guaranteed to attain most of the value", you must mean that in a different sense than a regret sense?

by Vadim Kosoy 306 days ago

Your confusion is because you are thinking about regret in an anytime setting. In an anytime setting, there is a fixed policy $$\pi$$; we measure the expected reward of $$\pi$$ over a time interval $$t$$ and compare it to the optimal expected reward over the same time interval. If $$\pi$$ has probability $$p > 0$$ of walking into a trap, regret has the linear lower bound $$\Omega(pt)$$.

On the other hand, I am talking about policies $$\pi_t$$ that explicitly depend on the parameter $$t$$ (I call this a "metapolicy"). Both the advisor and the agent policies are like that. As $$t$$ goes to $$\infty$$, the probability $$p(t)$$ of walking into a trap goes to $$0$$, so $$p(t)t$$ is a sublinear function of $$t$$.

A second difference from the usual definition of regret is that I use an infinite sum of rewards with geometric time discount $$e^{-1/t}$$ instead of a step-function time discount that cuts off at $$t$$. However, this second difference is entirely inessential, and all the theorems work about the same with step-function time discount.
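The distinction above can be made concrete with a small numerical sketch (my own illustration, not from the post): a fixed policy with constant trap probability `p` accumulates regret linearly, while a metapolicy whose trap probability shrinks with the horizon, e.g. the assumed schedule `p(t) = 1/sqrt(t)`, gives regret `p(t)*t = sqrt(t)`, which is sublinear.

```python
import math

def anytime_regret(p, t):
    """Fixed policy: trap probability p does not depend on the
    horizon, so the regret lower bound Omega(p * t) is linear in t."""
    return p * t

def metapolicy_regret(t):
    """Metapolicy pi_t: the policy depends on t, and the trap
    probability p(t) = 1/sqrt(t) (an illustrative choice) goes to 0,
    so regret p(t) * t = sqrt(t) grows sublinearly."""
    p_t = 1.0 / math.sqrt(t)
    return p_t * t

for t in [100, 10_000, 1_000_000]:
    # Regret per time step: constant for the fixed policy,
    # vanishing for the metapolicy as t grows.
    print(t, anytime_regret(0.01, t) / t, metapolicy_regret(t) / t)
```

The per-step regret of the fixed policy stays at `p`, while the metapolicy's per-step regret `p(t)` tends to zero, which is what lets the agent "attain most of the value" in the limit despite a positive trap probability at every finite `t`.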
