Logical counterfactuals and differential privacy
post by Nisan Stiennon 27 days ago | Abram Demski and Scott Garrabrant like this | 1 comment

Edit: This article has major flaws. See my comment below.

This idea was informed by discussions with Abram Demski, Scott Garrabrant, and the MIRIchi discussion group.

Summary

For a logical inductor \(\mathbb{P}\), define logical counterfactuals by

\[ \mathbb{P}_n(\phi | \psi) := \sum_y \mathbb{P}_k(\phi | \psi \wedge Y = y) \, \mathbb{P}_n(Y = y) \]

for a suitable \(k < n\) and a random variable \(Y\) independent of \(\psi\) with respect to \(\mathbb{P}_k\). Using this definition, one can construct agents that perform well in ASP-like problems.
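As a minimal sketch of this definition (the dictionary encoding and the function name are mine; the stage-\(k\) conditional probabilities and the stage-\(n\) distribution over \(Y\) are assumed to be supplied by the logical inductor):

```python
def counterfactual(P_k_phi_given_psi_y, P_n_y):
    """Mixture counterfactual from the summary:
       P_n(phi | psi) := sum_y P_k(phi | psi & Y = y) * P_n(Y = y).

    P_k_phi_given_psi_y: dict y -> P_k(phi | psi & Y = y), the early belief state
    P_n_y:               dict y -> P_n(Y = y), the current belief state
    """
    return sum(P_k_phi_given_psi_y[y] * P_n_y[y] for y in P_n_y)
```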

Motivation

Recall the Agent Simulates Predictor problem, in which the predictor's credence is the earlier belief state \(\mathbb{P}_{n-1}\), \(A_n = 1\) means one-boxing, and \(A_n = 2\) means two-boxing:

\[ U_n = 10^6 \mathbb{P}_{n-1}(A_n = 1) + 10^3 \mathbb{1}(A_n = 2) \]
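As a concrete reading of this payoff (a sketch; the predictor's credence \(\mathbb{P}_{n-1}(A_n = 1)\) is passed in as a plain number, and the function name is mine):

```python
def asp_utility(predictor_credence_one_box, action):
    """U_n = 10^6 * P_{n-1}(A_n = 1) + 10^3 * 1(A_n = 2),
    where action 1 is one-boxing and action 2 is two-boxing."""
    return 1e6 * predictor_credence_one_box + 1e3 * (action == 2)

# An accurate predictor makes one-boxing worth ~10^6 and two-boxing worth ~10^3:
print(asp_utility(1.0, 1), asp_utility(0.0, 2))  # 1000000.0 1000.0
```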

Naively, we want to solve this by argmaxing:

\[ A_n = \operatorname{argmax}_a \mathbb{E}_n[U_n | A_n = a] \]

Hopefully, \(\mathbb{P}_n(A_n = 1) \approx 1\), \(\mathbb{P}_{n-1}(A_n = 1) \approx 1\), and \(\mathbb{E}_n[U_n | A_n = 1] \approx 10^6\). Also, two-boxing should be less attractive than one-boxing:

\[ \mathbb{E}_n[U_n | A_n = 2] \approx 10^3 \]

However, if we make this well-defined with \(\varepsilon\)-exploration, we’ll get

\[ \mathbb{E}_n[U_n | A_n = 2] \approx 10^6 + 10^3 \]

and then the agent will two-box, which is a contradiction. Instead, we'd like to use predictable exploration and set

\[ \mathbb{E}_n[U_n | A_n = 2] := \mathbb{E}_k[U_n | A_n = 2] \]

for \(k\) small enough that the right-hand side is sensible. Let’s see how.

Predictable exploration

Choose \(k \ll n\) so that \(\mathbb{P}_k(A_n = 2) \gg 0\). Our agent decides whether to explore at stage \(k\), and uses its beliefs at stage \(k\) as a substitute for counterfactuals:

\begin{align*}
\operatorname{explore}_0 &:= \mathbb{P}_k(\operatorname{explore}_0) < \varepsilon \\
\operatorname{explore}_1 &:= \left\{ \begin{array}{ll} 1 & \mbox{if } \mathbb{P}_k(\operatorname{explore}_1 = 1) < \frac{1}{2} \\ 2 & \mbox{otherwise} \end{array} \right. \\
\forall a \, \mathbb{E}_n[\phi | A = a] &:= \mathbb{E}_k[\phi | A = a \wedge \operatorname{explore}_0] \mbox{ if } \mathbb{P}_n(A = a) < \delta \\
A_n &:= \left\{ \begin{array}{ll} \operatorname{explore}_1 & \mbox{if } \operatorname{explore}_0 \\ \operatorname{argmax}_a \mathbb{E}_n[U_n | A_n = a] & \mbox{otherwise} \end{array} \right.
\end{align*}

Here \(\varepsilon, \delta\) are small positive numbers. It's easy to see that, under reasonable assumptions, this agent one-boxes on Agent Simulates Predictor. But it can't use the full strength of \(\mathbb{P}_n\) in its counterfactual reasoning, and this is a problem.
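Here is a sketch of that decision rule in Python. The belief states are passed in as callables, the self-referential definitions of \(\operatorname{explore}_0\) and \(\operatorname{explore}_1\) are assumed to have already been resolved to a boolean and an action, and all names are mine.

```python
def predictable_exploration_act(explore_0, explore_1, actions,
                                E_n, E_k_given_explore, P_n_action, delta):
    """Predictable-exploration agent:
      explore_0            -- bool, whether this round is an exploration round
      explore_1            -- the action to take when exploring
      E_n(a)               -- stage-n estimate of E_n[U_n | A_n = a]
      E_k_given_explore(a) -- stage-k estimate of E_k[U_n | A_n = a & explore_0]
      P_n_action(a)        -- stage-n probability P_n(A_n = a)
      delta                -- threshold below which the stage-k counterfactual is used
    """
    if explore_0:
        return explore_1

    def value(a):
        # When stage n is nearly certain the action won't be taken, its own
        # conditional expectation is unreliable, so substitute the stage-k
        # counterfactual conditioned on exploration.
        if P_n_action(a) < delta:
            return E_k_given_explore(a)
        return E_n(a)

    return max(actions, key=value)
```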

Differential privacy

To illustrate the problem, add a term to the utility function that sometimes rewards two-boxing:

\begin{align*}
U_n &= 10^6 \mathbb{P}_{n-1}(A_n = 1) + 10^3 \mathbb{1}(A_n = 2) + 10^6 \mathbb{1}(A_n = 2 \wedge X_{n-1}) \\
X_{n-1} &:= \mathbb{P}_{n-1}(X_{n-1}) < \frac{1}{2}
\end{align*}

The agent should two-box if and only if \(X_{n-1}\). Assuming that's the case, and that \(\mathbb{P}_{n-1}\) knows it, we have:

\begin{align*}
\mathbb{P}_{n-1}(A_n = 1) &= \frac{1}{2} \\
\neg X_{n-1} \to \mathbb{E}_n[U_n | A_n = 1] &= \frac{1}{2} 10^6 \\
\neg X_{n-1} \to \mathbb{E}_n[U_n | A_n = 2] &= \mathbb{E}_k[U_n | A_n = 2 \wedge \operatorname{explore}_0] \\
&= 10^3 + \frac{1}{2} 10^6
\end{align*}

So if \(\neg X_{n-1}\), two-boxing is the more attractive option, which is a contradiction. (I’m rounding \(\varepsilon\) to zero for simplicity.)

The problem is that the counterfactual has to rely on \(\mathbb{P}_k\)’s imperfect knowledge of \(X_{n-1}\). We want to combine \(\mathbb{P}_k\)’s ignorance of \(\operatorname{explore}_0\) with \(\mathbb{P}_n\)’s knowledge of \(X_{n-1}\).

If \(X\) is independent of \(A\) conditioned on \(\operatorname{explore}_0\) with respect to \(\mathbb{P}_k\), then we can do this:

\begin{align*}
\mathbb{E}_k[U | A = a \wedge \operatorname{explore}_0] &= \sum_x \mathbb{E}_k[U | A = a \wedge \operatorname{explore}_0 \wedge X = x] \, \mathbb{P}_k(X = x | A = a \wedge \operatorname{explore}_0) \\
&= \sum_x \mathbb{E}_k[U | A = a \wedge \operatorname{explore}_0 \wedge X = x] \, \mathbb{P}_k(X = x | \operatorname{explore}_0)
\end{align*}

Then replace \(\mathbb{P}_k(X = x | \operatorname{explore}_0)\) with \(\mathbb{P}_n(X = x | \operatorname{explore}_0)\):

\[ \mathbb{E}_n[U | A = a \wedge \operatorname{explore}_0] := \sum_x \mathbb{E}_k[U | A = a \wedge \operatorname{explore}_0 \wedge X = x] \mathbb{P}_n(X = x | \operatorname{explore}_0) \]

This is more accurate than the all-stage-\(k\) estimate \(\mathbb{E}_k[U | A = a \wedge \operatorname{explore}_0]\), and it is unbiased.
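A toy numeric check of the substitution, reusing the two-boxing figures from this section (the dictionary encoding is mine, and the \(10^6 \mathbb{P}_{n-1}(A_n = 1)\) term is taken to contribute roughly zero under conditioning on \(A_n = 2\), matching the calculations above):

```python
# E_k[U | A=2 & explore_0 & X=x]: roughly 10^3 + 10^6 when X holds, 10^3 when it doesn't.
E_k = {False: 1e3, True: 1e3 + 1e6}

P_k_X = {False: 0.5, True: 0.5}  # P_k(X = x | explore_0): stage k is ignorant of X
P_n_X = {False: 1.0, True: 0.0}  # P_n(X = x | explore_0): stage n has learned not-X

def mixture(P_X):
    return sum(E_k[x] * P_X[x] for x in P_X)

print(mixture(P_k_X))  # 501000.0 -- the all-stage-k estimate that drove the contradiction
print(mixture(P_n_X))  # 1000.0   -- the substituted estimate, using stage-n knowledge of X
```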

If \(X\) is not independent of \(A\) conditional on \(\operatorname{explore}_0\), we can introduce an auxiliary variable and construct a version of \(X\) that is independent. This construction is a solution to the following differential privacy problem: Make a random variable \(Y\) that is a function of \(X\) and independent randomness, maximizing the conditional mutual information \(H(X;Y | A)\), subject to the constraint that \(A\) is independent of \(Y\). Using the identity

\[ H(X|A) = H(X;Y|A) + H(X|AY) \]

we see that the maximum is attained when \(H(X|AY) = 0\), which means that \(X\) is a function of \(A\) and \(Y\).
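For reference, this identity is just the definition of conditional mutual information (written below in the more common \(I(\cdot\,;\cdot)\) notation; the post's \(H(X;Y|A)\) denotes the same quantity):

\begin{align*}
I(X;Y|A) &= H(X|A) - H(X|A,Y) \\
\mbox{so } \quad H(X|A) &= I(X;Y|A) + H(X|A,Y) \quad \mbox{and} \quad I(X;Y|A) \le H(X|A),
\end{align*}

with equality exactly when \(H(X|A,Y) = 0\), i.e. when \(X\) is determined by \(A\) and \(Y\).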

Now here’s the construction of \(Y\):

Let \(\mathcal{X}\) be the finite set of possible values of \(X\), and let \(\mathcal{A}\) be the finite set of possible values of \(A\). We’ll iteratively construct a set \(\mathcal{Y}\) and define a random variable \(Y\) taking values in \(\mathcal{Y}\). To start with, let \(\mathcal{Y} = \emptyset\).

Now choose

\[ (a, x) := \operatorname{argmin}_{ \begin{array}{c} a \in \mathcal{A} \\ x \in \mathcal{X} \\ \mathbb{P}(X = x, Y \notin \mathcal{Y} | A = a) > 0 \end{array} } \mathbb{P}(X = x, Y \notin \mathcal{Y} | A = a) \]

and for each \(a' \in \mathcal{A} \backslash \{a\}\), choose some \(f(a') \in \mathcal{X}\) such that \(\mathbb{P}(X = f(a'), Y \notin \mathcal{Y} | A = a') > 0\). Then make a random binary variable \(T_{a'}\) such that

\[ \mathbb{P}(T_{a'} \wedge X = f(a') \wedge Y \notin \mathcal{Y} | A = a') = \mathbb{P}(X = x \wedge Y \notin \mathcal{Y}| A = a) \]

Then let \(y\) be the event defined by

\[ (X = x \wedge Y \notin \mathcal{Y} \wedge A = a) \vee \bigvee_{a' \in \mathcal{A} \backslash \{a\}} (T_{a'} \wedge X = f(a') \wedge Y \notin \mathcal{Y} \wedge A = a') \]

and add \(y\) to \(\mathcal{Y}\). After at most \(|\mathcal{X}||\mathcal{A}|\) repetitions of this process, all of the probability mass is accounted for and we are done.
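Here is a sketch of that iterative construction in ordinary probabilistic terms, with the conditionals \(\mathbb{P}(X = x | A = a)\) supplied as plain Python dictionaries (the function and variable names are mine, and the small tolerance constant only absorbs floating-point noise):

```python
def construct_Y(p_x_given_a):
    """Build Y so that Y is independent of A and X is a function of (A, Y).

    p_x_given_a: dict a -> dict x -> P(X = x | A = a).
    Returns a list of (mass, assignment) pairs, one per value y of Y:
      mass          = P(Y = y | A = a), identical for every a,
      assignment[a] = the value X takes when A = a and Y = y.
    """
    residual = {a: dict(dist) for a, dist in p_x_given_a.items()}
    ys = []
    while True:
        positive = [(a, x) for a, dist in residual.items()
                    for x, p in dist.items() if p > 1e-12]
        if not positive:
            break
        # The smallest remaining cell; every other positive cell has at least
        # this much mass left, so the subtractions below never go negative.
        a0, x0 = min(positive, key=lambda ax: residual[ax[0]][ax[1]])
        mass = residual[a0][x0]
        assignment = {a0: x0}
        for a in residual:
            if a != a0:
                # f(a'): any value of X with positive residual mass under a'.
                # One exists because every row has the same total mass left.
                assignment[a] = next(x for x, p in residual[a].items() if p > 1e-12)
        for a, x in assignment.items():
            residual[a][x] -= mass
        ys.append((mass, assignment))
    return ys


# A correlated toy example: X tends to agree with A.
p = {1: {1: 0.8, 2: 0.2}, 2: {1: 0.3, 2: 0.7}}
for mass, assignment in construct_Y(p):
    print(round(mass, 3), assignment)
# The masses are the same whichever A we condition on, so Y reveals nothing
# about A, while knowing (A, Y) pins down X exactly.
```

Each pass peels the smallest remaining cell of probability mass off some row and matches it with an equal amount of mass under every other action, so \(\mathbb{P}(Y = y | A = a)\) comes out the same for every \(a\) while \((A, Y)\) determines \(X\).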

We can do this with a logical inductor as well. In general, to get a sentence \(T\) such that \(\mathbb{P}_k(T \wedge B | C) \approx p\), take \(T := (\mathbb{P}_k(T \wedge B | C) < p) \wedge B \wedge C\).

Now given random variables \(U\) and \(A\), and some informative sentences \(\phi_1, \dots, \phi_\ell\), let \(X \in \{T, F\}^\ell\) be the random variable encoding the truth values of \(\phi_1, \dots, \phi_\ell\). The above construction, carried out approximately and conditional on \(\operatorname{explore}_0\), gives us a random variable \(Y\) that is approximately independent of \(A\) conditional on \(\operatorname{explore}_0\) with respect to \(\mathbb{P}_k\). Now we define

\[ \mathbb{E}_n[U | A = a] := \sum_y \mathbb{E}_k[U | A = a \wedge \operatorname{explore}_0 \wedge Y = y] \mathbb{P}_n(Y = y | \operatorname{explore}_0) \]

whenever \(\mathbb{P}_n(A = a) < \delta\).
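A sketch of how this counterfactual is evaluated once a suitable \(Y\) is in hand (again with invented names; `E_k_given_a_y` stands for the stage-\(k\) conditional expectations and `P_n_y` for the stage-\(n\) distribution of \(Y\) given \(\operatorname{explore}_0\), e.g. indexed by the values produced by a construction like `construct_Y` above):

```python
def counterfactual_utility(action, E_k_given_a_y, P_n_y):
    """E_n[U | A = a] := sum_y E_k[U | A=a & explore_0 & Y=y] * P_n(Y=y | explore_0).

    E_k_given_a_y: dict (a, y) -> E_k[U | A = a & explore_0 & Y = y]
    P_n_y:         dict y -> P_n(Y = y | explore_0)
    """
    return sum(E_k_given_a_y[(action, y)] * P_n_y[y] for y in P_n_y)
```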

This succeeds on the problem at the beginning of this section: Assume \(A_n = 2 \leftrightarrow X_{n-1}\), and assume that \(\mathbb{P}_{n-1}\) knows this. Then:

\begin{align*}
\mathbb{P}_{n-1}(A_n = 1) &= \frac{1}{2} \\
\neg X_{n-1} \to \mathbb{E}_n[U_n | A_n = 1] &= \frac{1}{2} 10^6 \\
\neg X_{n-1} \to \mathbb{E}_n[U_n | A_n = 2] &= \mathbb{E}_k[U_n | A_n = 2 \wedge \operatorname{explore}_0 \wedge \neg X_{n-1}] \\
&= 10^3 \\
X_{n-1} \to \mathbb{E}_n[U_n | A_n = 2] &= \frac{1}{2} 10^6 + 10^3 + 10^6 \\
X_{n-1} \to \mathbb{E}_n[U_n | A_n = 1] &= \mathbb{E}_k[U_n | A_n = 1 \wedge \operatorname{explore}_0 \wedge X_{n-1}] \\
&= 10^6
\end{align*}

which does not lead to contradiction. In fact, there are agents like this that do at least as well as any constant agent:

Theorem

Let \(U_n(\mathbb{P}, A)\) be a utility function defined with metasyntactic variables \(n\), \(\mathbb{P}\), and \(A\). It must be computable in polynomial time as a function of \(A\), the values \(\mathbb{P}_{f_i(n)}(A = a)\), and sentences of the form \(X := \mathbb{P}_{f_i(n)}(X) < p\), where the \(f_i\) can be any polytime functions that don't grow too slowly and satisfy \(f_i(n) < n\). Then there exists a logical inductor \(\mathbb{P}\) such that for every \(n\), there exist \(k < n\), \(\varepsilon, \delta > 0\), and a pseudorandom variable \(Y\) such that the agent \(A\) defined below performs at least as well on \(U_n\) as any constant agent, up to a margin of error that approaches \(0\) as \(n \to \infty\):

\begin{align*}
\operatorname{explore}_0 &:= \mathbb{P}_k(\operatorname{explore}_0) < \varepsilon \\
\operatorname{explore}_1 &:= \left\{ \begin{array}{ll} a_1 & \mbox{if } \mathbb{P}_k(\operatorname{explore}_1 = a_1) < \frac{1}{\ell} \mbox{ ; else} \\ a_2 & \mbox{if } \mathbb{P}_k(\operatorname{explore}_1 = a_2) < \frac{1}{\ell} \mbox{ ; else} \\ \vdots & \\ a_\ell & \mbox{otherwise} \end{array} \right. \\
\forall a \, \mathbb{E}_n[\phi | A = a] &:= \sum_y \mathbb{E}_k[\phi | A = a \wedge \operatorname{explore}_0 \wedge Y = y] \, \mathbb{P}_n(Y = y | \operatorname{explore}_0) \mbox{ if } \mathbb{P}_n(A = a) < \delta \\
A_n &:= \left\{ \begin{array}{ll} \operatorname{explore}_1 & \mbox{if } \operatorname{explore}_0 \\ \operatorname{argmax}_a \mathbb{E}_n[U_n | A_n = a] & \mbox{otherwise} \end{array} \right.
\end{align*}

Proof sketch

Choose \(k\) smaller than the strength parameter of the weakest predictor in \(U_n\). If \(a_n\) is the best constant policy for \(U_n\), assume \(A_n = a_n\). Since \(\mathbb{P}_n\) can compute \(U_n\), our agent’s factual estimate \(\mathbb{E}_n[U_n | A_n = a_n]\) is accurate, and the counterfactual estimate \(\mathbb{E}_n[U_n | A_n = a']\) for \(a' \neq a_n\) is an accurate estimate of the utility assigned to the constant policy \(a'\), as long as we make \(Y\) rich enough. So the agent will choose \(a_n\). Thus we have an implication of the form “if \(\mathbb{P}\) believes \(A_n = a_n\), then \(A_n = a_n\) is true”, and so we can create a logical inductor \(\mathbb{P}\) that always believes that \(A_n = a_n\) for every \(n\) by adding a trader with a large budget that bids up the price of \(A_n = a_n\).

Isn’t this just UDTv2?

This is much less general than UDTv2. If you like, you can think of this as an agent that at time \(k\) chooses a program to run, and then runs that program at time \(n\), except the program always happens to be “argmax over this kind of counterfactual”.

Also, it doesn’t do policy selection.

Next steps

Instead of handing the agent a pseudorandom variable \(Y\) that captures everything important, I’d like to have traders inside a logical inductor figure out what \(Y\) should be on their own.

Also, I’d rather not have to hand the agent an optimal value of \(k\).

Also, I hope that these counterfactuals can be used to do policy selection and win at counterfactual mugging.



comment by Nisan Stiennon 25 days ago

This doesn’t quite work. The theorem and examples only work if you maximize the unconditional mutual information, \(H(X;Y)\), not \(H(X;Y|A)\). And the choice of \(X\) is doing a lot of work — it’s not enough to make it “sufficiently rich”.



