The true reason to explore seems to be that the agent believes the action it is taking will not lead to an irreversible trap, and that the action will reveal information about the true environment that enables a better policy later on, a gain which, in expectation up to the time horizon, outweighs the temporary loss incurred by exploring.
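To make that trade-off concrete, here is a toy calculation. It is only a sketch under strong simplifying assumptions: the function name and all of its parameters (`p_trap`, `trap_loss`, `explore_cost`, `gain_per_step`, `steps_left`) are hypothetical, and a real agent would not have these quantities handed to it.

```python
def exploration_pays_off(p_trap, trap_loss, explore_cost, gain_per_step, steps_left):
    """Toy expected-value check for a single exploratory action.

    All parameters are hypothetical inputs chosen for illustration.
    """
    # If the action is safe, the agent pays explore_cost once and then earns
    # gain_per_step on each remaining step thanks to the improved policy.
    safe_value = gain_per_step * steps_left - explore_cost
    # With probability p_trap the action is an irreversible trap costing trap_loss.
    expected_net = (1 - p_trap) * safe_value - p_trap * trap_loss
    return expected_net > 0, expected_net


# No trap risk: a small cost now is repaid over the remaining horizon.
print(exploration_pays_off(0.0, 100.0, 1.0, 0.5, 10))
# Substantial trap risk: the same action is no longer worth taking.
print(exploration_pays_off(0.5, 100.0, 1.0, 0.5, 10))
```

The point of the sketch is just that both beliefs matter: a longer horizon raises the value of the information, while even a moderate trap probability can swamp it.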

My understanding of logical inductor exploration (e.g. in asymptotic decision theory) is that the exploration steps the agent learns from mostly don’t happen in its own lifetime; rather, they happen in the lifetimes of similar but simpler agents. This allows exploration to work for single-shot problems such as 5 and 10. Intuitively, if you are in a 5 and 10 problem and your brain has size 10^1000, then you can simulate someone whose brain has size 10^999 doing a 5 and 10 problem, and thereby learn the relation between that agent’s action and how much utility it gets. So each particular agent has some chance of exploring irrecoverably, but in aggregate not many of them will (and it’s hard to predict which will and which won’t).

As far as I can tell, the only strategy that doesn’t have some sort of targetable exploration behavior is Thompson sampling.

Thompson sampling still randomizes (it randomizes its belief about the world it’s in) and is therefore vulnerable to troll bridge.
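For readers who want the mechanics, here is a minimal sketch of Thompson sampling on a Bernoulli bandit, assuming independent Beta(1, 1) priors on each arm. The bandit setting, the function names (`thompson_choose`, `run`), and the payoff probabilities are illustrative assumptions, not part of the troll bridge setup. The randomization lives entirely in the sampled belief; given that sample, the agent acts greedily.

```python
import random


def thompson_choose(successes, failures, rng):
    """Sample one plausible world from the posterior, then act greedily in it."""
    # One Beta posterior draw per arm, starting from a Beta(1, 1) prior.
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)


def run(true_probs, steps, seed=0):
    """Play a Bernoulli bandit for `steps` rounds; return per-arm success/failure counts."""
    rng = random.Random(seed)
    n = len(true_probs)
    successes, failures = [0] * n, [0] * n
    for _ in range(steps):
        arm = thompson_choose(successes, failures, rng)
        if rng.random() < true_probs[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


successes, failures = run([0.2, 0.8], steps=500)
pulls = [s + f for s, f in zip(successes, failures)]
print(pulls)  # how often each arm was tried
```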

A: While that is a really interesting note that I hadn’t spotted before, the standard formulation of exploration steps in logical inductor decision theory involves infinitely many exploration steps over all time, so even though an agent of this type would be able to inductively learn from what other agents do in different decision problems in less time than it naively appears, that wouldn’t make it explore less.

B: What I intended with the remark about Thompson sampling was that troll bridge depends on there being two distinct causes of “attempting to cross the bridge”. One is crossing because you believe it to be the best action, and the other is crossing because an exploration step occurred. Thompson sampling doesn’t have a split decision criterion like this. Although now that you point it out, it is possible to make a Thompson sampling variant where the troll blows up the bridge when “crossing the bridge” is not the highest-ranked action.
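To illustrate what a split decision criterion looks like, here is a sketch of an epsilon-exploring agent that reports which of the two causes produced its action. The function name `epsilon_explorer`, the cause labels, and the utility numbers are hypothetical; this is not the logical inductor construction, just the simplest agent with the same two-branch shape.

```python
import random


def epsilon_explorer(expected_utility, epsilon, rng):
    """Return (action, cause) for an agent with two distinct reasons to act."""
    if rng.random() < epsilon:
        # Cause 1: an exploration step fired, regardless of beliefs.
        return rng.choice(list(expected_utility)), "exploration"
    # Cause 2: the action is the one the agent believes to be best.
    best = max(expected_utility, key=expected_utility.get)
    return best, "believed_best"


rng = random.Random(0)
beliefs = {"cross": 1.0, "stay": 0.5}  # hypothetical expected utilities
print(epsilon_explorer(beliefs, epsilon=0.1, rng=rng))
```

A troll can condition on the returned cause here. In plain Thompson sampling there is no such label to condition on, which is why the variant has to key on the action’s rank instead.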






