Training Garrabrant inductors to predict counterfactuals
discussion post by Tsvi Benson-Tilsen

The ideas in this post are due to Scott, me, and possibly others. Thanks to Nisan Stiennon for working through the details of an earlier version of this post with me.

Github pdf:

We will use the notation and definitions given in […]. Let \({{\overline{{\mathbb{P}}}}}\) be a universal Garrabrant inductor and let \({{\overline{U}}}: {\mathbb{N}}^+ \to {\textrm{Expr}}(2^\omega \to {\mathbb{R}})\) be a sequence of utility function machines. We will define an agent schema \(({A^{U_{n}}_{n}})\).

We give a schema where each agent selects a single action with no observations. Roughly, \({A^{U_{n}}_{n}}\) learns how to get what it wants by computing what the \({A^{U_{i}}_{i}}\) with \(i < n\) did, and also what various traders predicted would happen, given each action that the \({A^{U_{i}}_{i}}\) could have taken. The traders are rewarded for predicting what (counterfactually) would be the case in terms of bitstrings, and then their predictions are used to evaluate expected utilities of actions currently under consideration. This requires modifying our UGI and the traders involved to take a possible action as input, so that we get a prediction (a “counterfactual distribution over worlds”) for each action.

More precisely, define \[\begin{aligned} {A^{U_{n}}_{n}} := &\textrm{ let } {\hat{{\mathbb{P}}}}_n := {\textrm{Counterfactuals}}(n)\\ &\;{\textrm{return}}\;{\operatorname{arg\,max}}_{a \in {\textrm{Act}}} {\hat{{\mathbb{E}}}}_n[a](U_n)\end{aligned}\]

where \[{\hat{{\mathbb{E}}}}_n[a](U_n):= \sum_{\sigma \in 2^n} {\hat{{\mathbb{P}}}}_n[a](\sigma) \cdot U_n(\sigma) .\] Here \({\hat{{\mathbb{P}}}}_n\) is a dictionary of belief states, one for each action, defined by the function \({\textrm{Counterfactuals}}: {\mathbb{N}}^+ \to ({\textrm{Act}}\to \Delta({2^\omega}))\) using recursion as follows:

input: \(n \in {\mathbb{N}}^+\)

output: A dictionary of belief states \({\mathbb{P}}: {\textrm{Act}}\to \Delta({2^\omega})\)

initialize: \({\textrm{hist}}_{n-1} {\leftarrow}\) array of belief states of length \(n-1\)

for \(i \leq n-1\):
    \({\hat{{\mathbb{P}}}}_i {\leftarrow}{\textrm{Counterfactuals}}(i)\)
    \(a_i {\leftarrow}{\operatorname{arg\,max}}_{a \in {\textrm{Act}}} \sum_{\sigma \in 2^i} {\hat{{\mathbb{P}}}}_i[a](\sigma) \cdot U_i(\sigma)\)
    \({\textrm{hist}}_{n-1}[i] {\leftarrow}{\hat{{\mathbb{P}}}}_i[a_i]\)

for \((a : {\textrm{Act}})\):
    \({\mathbb{P}}[a] {\leftarrow}{\textrm{MarketMaker}}({\textrm{hist}}_{n-1}, {\textrm{TradingFirm}}'(a, a_{\leq n-1}, {\textrm{hist}}_{n-1}))\)

return \({\mathbb{P}}\)
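The whole construction can be summarized in a minimal Python sketch. This is an illustrative reading of the definitions above, not part of the original construction: belief states and utility functions are represented as plain callables on length-\(n\) bitstrings, \({\textrm{MarketMaker}}\) and \({\textrm{TradingFirm}}'\) are treated as black boxes with the signatures used in the pseudocode, and the repeated recomputation of \({\textrm{Counterfactuals}}(i)\) is kept for fidelity to the definition rather than efficiency (in practice one would memoize).

    from itertools import product

    def expected_utility(belief, utility, n):
        # E_n[a](U_n) = sum over sigma in 2^n of P_n[a](sigma) * U_n(sigma)
        return sum(belief(sigma) * utility(sigma)
                   for sigma in product((0, 1), repeat=n))

    def counterfactuals(n, actions, utilities, market_maker, trading_firm):
        # Replay the previous agents to reconstruct hist_{n-1}: the belief states
        # actually used on days 1, ..., n-1, together with the actions a_1, ..., a_{n-1}.
        hist, past_actions = [], []
        for i in range(1, n):
            P_i = counterfactuals(i, actions, utilities, market_maker, trading_firm)
            a_i = max(actions, key=lambda a: expected_utility(P_i[a], utilities[i], i))
            hist.append(P_i[a_i])
            past_actions.append(a_i)

        # For each action the day-n agent might take, run the market against the
        # trading strategies computed under that hypothetical action.
        return {a: market_maker(hist, trading_firm(a, past_actions, hist))
                for a in actions}

    def agent(n, actions, utilities, market_maker, trading_firm):
        # A^{U_n}_n: pick the action whose counterfactual distribution looks best.
        P_hat = counterfactuals(n, actions, utilities, market_maker, trading_firm)
        return max(actions, key=lambda a: expected_utility(P_hat[a], utilities[n], n))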

Here, we use a modified form of traders and of the \({\textrm{TradingFirm}}'\) function from the \(LIA\) algorithm given in the logical induction paper. In detail, let traders have the type \[{\mathbb{N}}^+ \times {\textrm{Act}}\to \textrm{trading strategy}.\] On day \(n\), traders are passed a possible action \(a \in {\textrm{Act}}\), which we interpret as “an action that \({A^{U_{n}}_{n}}\) might take”. Then each trader returns a trading strategy, and those trading strategies are used as usual to construct a belief state \({\mathbb{P}}[a]\). We pass to \({\textrm{TradingFirm}}'\) the full history \(a_{\leq n-1}\) of the actions taken by the previous \({A^{U_{i}}_{i}}\), since \({\textrm{TradingFirm}}'\) calls the Budgeter function; that function requires computing the traders’ previous trading strategies, which in turn requires passing the \(a_i\) as arguments.
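In the same sketch notation as above, the modified signatures might be written with placeholder type aliases (these names are illustrative, not the paper's):

    from typing import Callable, List, Tuple

    Action = str                                # placeholder representation of Act
    Bitstring = Tuple[int, ...]
    BeliefState = Callable[[Bitstring], float]  # sigma -> price/probability
    TradingStrategy = object                    # as defined in the logical induction paper

    # A modified trader: (day n, hypothetical action a) -> trading strategy.
    Trader = Callable[[int, Action], TradingStrategy]

    # TradingFirm' additionally receives the realized action history a_{<= n-1},
    # since Budgeter must recompute each trader's past strategies, and those
    # now take the actions a_i as inputs.
    TradingFirmPrime = Callable[[Action, List[Action], List[BeliefState]], TradingStrategy]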

Thus, traders are evaluated based on the predictions they made about logic when given the actual action \(a_n\) as input. In particular, the sequence \(({\mathbb{P}}_n[a_n])\) is a UGI over the class of efficient traders given access to the actual actions taken by the agent \({A^{U_{n}}_{n}}\).

This scheme probably suffers from spurious counterfactuals, but feels like a natural baseline proposal.




