Logical counterfactuals and differential privacy
post by Nisan Stiennon 367 days ago

This idea was informed by discussions with Abram Demski, Scott Garrabrant, and the MIRIchi discussion group.

# Summary

For a logical inductor $$\mathbb{P}$$, define logical counterfactuals by

$\mathbb{P}_n(\phi | \psi) := \sum_y \mathbb{P}_k(\phi | \psi \wedge Y = y) \mathbb{P}_n(Y = y)$

for a suitable $$k < n$$ and a random variable $$Y$$ independent of $$\psi$$ with respect to $$\mathbb{P}_k$$. Using this definition, one can construct agents that perform well in ASP-like problems.
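As a minimal numerical sketch of this definition (the names `counterfactual`, `p_k_cond`, and `p_n_y` are stand-ins I've made up; in the real construction these numbers would be the inductor's prices), the counterfactual is just a mixture of stage-$$k$$ conditionals weighted by stage-$$n$$ beliefs about $$Y$$:

```python
# Minimal sketch (hypothetical names): the counterfactual P_n(phi | psi) as a
# mixture of stage-k conditionals, weighted by stage-n beliefs about Y.
def counterfactual(p_k_cond, p_n_y):
    # p_k_cond[y] = P_k(phi | psi and Y = y), p_n_y[y] = P_n(Y = y)
    return sum(p_k_cond[y] * p_n_y[y] for y in p_n_y)

# Toy example with a binary Y:
p_k_cond = {0: 0.2, 1: 0.9}
p_n_y = {0: 0.1, 1: 0.9}
print(counterfactual(p_k_cond, p_n_y))  # 0.2*0.1 + 0.9*0.9 ≈ 0.83
```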

# Motivation

Recall the Agent Simulates Predictor problem:

$U_n = 10^6 \mathbb{P}_{n-1}(A_n = 1) + 10^3 \mathbb{1}(A_n = 2)$

Naively, we want to solve this by argmaxing:

$A_n = \operatorname{argmax}_a \mathbb{E}_n[U_n | A_n = a]$

Hopefully, $$\mathbb{P}_n(A_n = 1) \approx 1$$, $$\mathbb{P}_{n-1}(A_n = 1) \approx 1$$, and $$\mathbb{E}_n[U_n | A_n = 1] \approx 10^6$$. Also, two-boxing should be less attractive than one-boxing:

$\mathbb{E}_n[U_n | A_n = 2] \approx 10^3$

However, if we make this well-defined with $$\varepsilon$$-exploration, we’ll get

$\mathbb{E}_n[U_n | A_n = 2] \approx 10^6 + 10^3$

and then the agent will two-box, contradiction. Instead we’d like to use predictable exploration and set

$\mathbb{E}_n[U_n | A_n = 2] := \mathbb{E}_k[U_n | A_n = 2]$

for $$k$$ small enough that the right-hand side is sensible. Let’s see how.

# Predictable exploration

Choose $$k \ll n$$ so that $$\mathbb{P}_k(A_n = 2) \gg 0$$. Our agent decides whether to explore at stage $$k$$, and uses its beliefs at stage $$k$$ as a substitute for counterfactuals:

\begin{align*} \operatorname{explore}_0 &:= \mathbb{P}_k(\operatorname{explore}_0) < \varepsilon \\ \operatorname{explore}_1 &:= \left\{ \begin{array}{ll} 1 & \mathbb{P}_k(\operatorname{explore}_1 = 1) < \frac{1}{2} \\ 2 & \mbox{otherwise} \end{array} \right. \\ \forall a \, \mathbb{E}_n(\phi | A = a) &:= \mathbb{E}_k(\phi | A = a \wedge \operatorname{explore}_0) \mbox{ if } \mathbb{P}_n(A = a) < \delta \\ A_n &:= \left\{ \begin{array}{ll} \operatorname{explore}_1 & \mbox{if } \operatorname{explore}_0 \\ \operatorname{argmax}_a \mathbb{E}_n[U_n | A_n = a] & \mbox{otherwise} \end{array} \right. \end{align*}

Here $$\varepsilon, \delta$$ are small positive numbers. It’s easy to see that, under reasonable assumptions, this agent one-boxes on Agent Simulates Predictor. But it can’t use the full strength of $$\mathbb{P}_n$$ in its counterfactual reasoning, and this is a problem.
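For concreteness, here is a minimal sketch of that decision rule with the two belief stages stubbed out as plain functions; `P_k`, `P_n`, `E_k`, `E_n` and the query strings are hypothetical stand-ins, not part of the logical inductor formalism:

```python
# Sketch of the predictable-exploration decision rule (hypothetical stand-ins:
# P_k, P_n, E_k, E_n would really be the inductor's prices at stages k and n).
EPS, DELTA = 0.01, 0.01
ACTIONS = [1, 2]

def act(P_k, P_n, E_k, E_n):
    explore_0 = P_k("explore_0") < EPS               # explore iff stage k finds it unlikely
    explore_1 = 1 if P_k("explore_1 = 1") < 0.5 else 2
    if explore_0:
        return explore_1

    def expected_utility(a):
        if P_n(f"A = {a}") < DELTA:
            # Rarely-taken action: fall back to the stage-k conditional
            # expectation given exploration, as in the definition above.
            return E_k(f"U | A = {a} and explore_0")
        return E_n(f"U | A = {a}")

    return max(ACTIONS, key=expected_utility)
```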

# Differential privacy

To illustrate the problem, add a term to the utility function that sometimes rewards two-boxing:

\begin{align*} U_n &= 10^6 \mathbb{P}_{n-1}(A_n = 1) + 10^3 \mathbb{1}(A_n = 2) + 10^6 \mathbb{1}(A_n = 2 \wedge X_{n-1}) \\ X_{n-1} &:= \mathbb{P}_{n-1}(X_{n-1}) < \frac{1}{2} \end{align*}

The agent should two-box if and only if $$X_{n-1}$$. Assuming that’s the case, and that $$\mathbb{P}_{n-1}$$ knows this, we have:

\begin{align*} \mathbb{P}_{n-1}(A_n = 1) &= \frac{1}{2} \\ \neg X_{n-1} \rightarrow \mathbb{E}_n[U_n | A_n = 1] &= \frac{1}{2} 10^6 \\ \neg X_{n-1} \rightarrow \mathbb{E}_n[U_n | A_n = 2] &= \mathbb{E}_k[U_n | A_n = 2 \wedge \operatorname{explore}_0] \\ &= 10^3 + \frac{1}{2}10^6 \end{align*}

So if $$\neg X_{n-1}$$, two-boxing is the more attractive option, which is a contradiction. (I’m rounding $$\varepsilon$$ to zero for simplicity.)

The problem is that the counterfactual has to rely on $$\mathbb{P}_k$$’s imperfect knowledge of $$X_{n-1}$$. We want to combine $$\mathbb{P}_k$$’s ignorance of $$\operatorname{explore}_0$$ with $$\mathbb{P}_n$$’s knowledge of $$X_{n-1}$$.

If $$X$$ is independent of $$A$$ conditioned on $$\operatorname{explore}_0$$ with respect to $$\mathbb{P}_k$$, then we can do this:

\begin{align*} \mathbb{E}_k[U | A = a \wedge \operatorname{explore}_0] &= \sum_x \mathbb{E}_k[U | A = a \wedge \operatorname{explore}_0 \wedge X = x] \mathbb{P}_k(X = x | A = a \wedge \operatorname{explore}_0) \\ &= \sum_x \mathbb{E}_k[U | A = a \wedge \operatorname{explore}_0 \wedge X = x] \mathbb{P}_k(X = x | \operatorname{explore}_0) \end{align*}

Then replace $$\mathbb{P}_k(X = x | \operatorname{explore}_0)$$ with $$\mathbb{P}_n(X = x | \operatorname{explore}_0)$$:

$\mathbb{E}_n[U | A = a \wedge \operatorname{explore}_0] := \sum_x \mathbb{E}_k[U | A = a \wedge \operatorname{explore}_0 \wedge X = x] \mathbb{P}_n(X = x | \operatorname{explore}_0)$

This is more accurate than $$\mathbb{E}_k[U | A = a \wedge \operatorname{explore}_0]$$, and unbiased.
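A toy numerical sketch of this mixed estimate, using the numbers from the example above for the action $$A = 2$$ (the names `mixed_estimate`, `e_k_cond`, `p_n_x` are hypothetical):

```python
# Toy sketch of the mixed estimate (hypothetical names):
# e_k_cond[x] = E_k[U | A = a, explore_0, X = x], p_n_x[x] = P_n(X = x | explore_0).
def mixed_estimate(e_k_cond, p_n_x):
    return sum(e_k_cond[x] * p_n_x[x] for x in p_n_x)

# Numbers from the example above, conditional on explore_0 and A = 2 (two-boxing):
e_k_cond = {False: 1e3, True: 1e3 + 1e6}
print(mixed_estimate(e_k_cond, {False: 0.5, True: 0.5}))  # P_k's ignorance of X: 501000.0
print(mixed_estimate(e_k_cond, {False: 1.0, True: 0.0}))  # P_n knows not-X: 1000.0
```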

If $$X$$ is not independent of $$A$$ conditional on $$\operatorname{explore}_0$$, we can introduce an auxiliary variable and construct a version of $$X$$ that is independent. This construction is a solution to the following differential privacy problem: Make a random variable $$Y$$ that is a function of $$X$$ and independent randomness, maximizing the conditional mutual information $$H(X;Y | A)$$, subject to the constraint that $$A$$ is independent of $$Y$$. Using the identity

$H(X|A) = H(X;Y|A) + H(X|AY)$

we see that the maximum is attained when $$H(X|AY) = 0$$, which means that $$X$$ is a function of $$A$$ and $$Y$$.
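As a quick numeric sanity check of that identity (this is ordinary information theory on an arbitrary toy joint distribution, nothing specific to the construction here; the code and its names are a sketch of mine):

```python
from math import log2
from collections import defaultdict

# Toy numeric check of H(X|A) = H(X;Y|A) + H(X|AY), where H(X;Y|A) is the
# conditional mutual information in the post's notation.
joint = {  # joint[(a, x, y)] = P(A=a, X=x, Y=y)
    (0, 0, 0): 0.20, (0, 0, 1): 0.05, (0, 1, 0): 0.05, (0, 1, 1): 0.20,
    (1, 0, 0): 0.10, (1, 0, 1): 0.15, (1, 1, 0): 0.15, (1, 1, 1): 0.10,
}

def H(keep):
    """Joint entropy of the coordinates named in `keep` (a subset of "axy")."""
    m = defaultdict(float)
    for (a, x, y), p in joint.items():
        m[tuple(v for name, v in zip("axy", (a, x, y)) if name in keep)] += p
    return -sum(p * log2(p) for p in m.values() if p > 0)

H_X_given_A = H("ax") - H("a")
H_X_given_AY = H("axy") - H("ay")
I_XY_given_A = H("ax") + H("ay") - H("axy") - H("a")  # conditional mutual information
print(H_X_given_A, I_XY_given_A + H_X_given_AY)        # the two sides agree
```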

Now here’s the construction of $$Y$$:

Let $$\mathcal{X}$$ be the finite set of possible values of $$X$$, and let $$\mathcal{A}$$ be the finite set of possible values of $$A$$. We’ll iteratively construct a set $$\mathcal{Y}$$ and define a random variable $$Y$$ taking values in $$\mathcal{Y}$$. To start with, let $$\mathcal{Y} = \emptyset$$.

Now choose

$(a, x) := \operatorname{argmin}_{ \begin{array}{c} a \in \mathcal{A} \\ x \in \mathcal{X} \\ \mathbb{P}(X = x, Y \notin \mathcal{Y} | A = a) > 0 \end{array} } \mathbb{P}(X = x, Y \notin \mathcal{Y} | A = a)$

and for each $$a' \in \mathcal{A} \backslash \{a\}$$, choose some $$f(a') \in \mathcal{X}$$ such that $$\mathbb{P}(X = f(a'), Y \notin \mathcal{Y} | A = a') > 0$$. Then make a random binary variable $$T_{a'}$$ such that

$\mathbb{P}(T_{a'} \wedge X = f(a') \wedge Y \notin \mathcal{Y} | A = a') = \mathbb{P}(X = x \wedge Y \notin \mathcal{Y}| A = a)$

Then let $$y$$ be the event defined by

$(X = x \wedge Y \notin \mathcal{Y} \wedge A = a) \vee \bigvee_{a' \in \mathcal{A} \backslash \{a\}} (T_{a'} \wedge X = f(a') \wedge Y \notin \mathcal{Y} \wedge A = a')$

and add $$y$$ to $$\mathcal{Y}$$. After repeating this process at most $$|\mathcal{X}||\mathcal{A}|$$ times, we are done.
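Here is a minimal sketch of that iterative construction on ordinary finite probability tables (no logical inductor involved; `cond` and `build_Y` are names I've made up, with `cond[a][x]` standing in for $$\mathbb{P}(X = x | A = a)$$). It makes it easy to check that each piece gets the same mass under every value of $$A$$:

```python
# Sketch of the iterative construction of Y (hypothetical names).
def build_Y(cond, tol=1e-12):
    # residual[a][x] = P(X = x, Y not yet assigned | A = a)
    residual = {a: dict(px) for a, px in cond.items()}
    pieces = []  # pieces[y][a] = (x, mass): the event Y = y, for each value of A
    while True:
        positive = [(p, a, x) for a, px in residual.items()
                    for x, p in px.items() if p > tol]
        if not positive:
            break
        p_star, a_star, x_star = min(positive)  # the argmin cell (a, x)
        piece = {}
        for a in residual:
            # For a_star take x_star; for every other a, any cell with positive
            # residual works, since every positive residual is at least p_star.
            x = x_star if a == a_star else next(
                x for x, p in residual[a].items() if p > tol)
            residual[a][x] -= p_star
            piece[a] = (x, p_star)
        pieces.append(piece)
    return pieces

# Toy example where X and A are correlated:
cond = {1: {False: 0.7, True: 0.3}, 2: {False: 0.2, True: 0.8}}
for y, piece in enumerate(build_Y(cond)):
    print(y, piece)
# Each y gets the same mass under every value of A (so Y is independent of A),
# and fixes a single x for each a (so X is a function of A and Y).
```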

We can do this with a logical inductor as well. In general, to get a sentence $$T$$ such that $$\mathbb{P}_k(T \wedge B | C) \approx p$$, take $$T := (\mathbb{P}_k(T \wedge B | C) < p) \wedge B \wedge C$$.
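A rough sanity check on why this diagonal trick should work, assuming $$\mathbb{P}_k(B | C) \geq p$$ and that the inductor has accurate beliefs about its own stage-$$k$$ prices: either possible price regime pushes the price of $$T \wedge B$$ given $$C$$ back toward $$p$$,

\begin{align*} \mathbb{P}_k(T \wedge B | C) < p &\implies T \leftrightarrow B \wedge C \implies \mathbb{P}_k(T \wedge B | C) \approx \mathbb{P}_k(B | C) \geq p \\ \mathbb{P}_k(T \wedge B | C) \geq p &\implies \neg T \implies \mathbb{P}_k(T \wedge B | C) \approx 0 < p \end{align*}

so a well-calibrated inductor should keep the price near $$p$$ (more generally, near $$\min(p, \mathbb{P}_k(B | C))$$).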

Now given random variables $$U$$ and $$A$$, and some informative sentences $$\phi_1, \dots, \phi_\ell$$, let $$X \in \{T, F\}^\ell$$ be the random variable encoding the truth values of $$\phi_1, \dots, \phi_\ell$$. The above construction, carried out approximately and conditional on $$\operatorname{explore}_0$$, gives us a random variable $$Y$$ that is approximately independent of $$A$$ conditional on $$\operatorname{explore}_0$$ with respect to $$\mathbb{P}_k$$. Now we define

$\mathbb{E}_n[U | A = a] := \sum_y \mathbb{E}_k[U | A = a \wedge \operatorname{explore}_0 \wedge Y = y] \mathbb{P}_n(Y = y | \operatorname{explore}_0)$

whenever $$\mathbb{P}_n(A = a) < \delta$$.

This succeeds on the problem at the beginning of this section: Assume $$A_n = 2 \leftrightarrow X_{n-1}$$, and assume that $$\mathbb{P}_{n-1}$$ knows this. Then:

\begin{align*} \mathbb{P}_{n-1}(A_n = 1) &= \frac{1}{2} \\ \neg X_{n-1} \rightarrow \mathbb{E}_n[U_n | A_n = 1] &= \frac{1}{2} 10^6 \\ \neg X_{n-1} \rightarrow \mathbb{E}_n[U_n | A_n = 2] &= \mathbb{E}_k[U_n | A_n = 2 \wedge \operatorname{explore}_0 \wedge \neg X_{n-1}] \\ &= 10^3 \\ X_{n-1} \rightarrow \mathbb{E}_n[U_n | A_n = 2] &= \frac{1}{2} 10^6 + 10^3 + 10^6 \\ X_{n-1} \rightarrow \mathbb{E}_n[U_n | A_n = 1] &= \mathbb{E}_k[U_n | A_n = 1 \wedge \operatorname{explore}_0 \wedge X_{n-1}] \\ &= 10^6 \end{align*}

which does not lead to contradiction. In fact, there are agents like this that do at least as well as any constant agent:

## Theorem

Let $$U_n(\mathbb{P}, A)$$ be a utility function defined with metasyntactic variables $$n$$, $$\mathbb{P}$$, and $$A$$. It must be computable in polynomial time as a function of $$A$$, of probabilities $$\mathbb{P}_{f_i(n)}(A = a)$$, and of sentences of the form $$X := \mathbb{P}_{f_i(n)}(X) < p$$, where the $$f_i$$ can be any polytime functions that don’t grow too slowly and satisfy $$f_i(n) < n$$. Then there exists a logical inductor $$\mathbb{P}$$ such that for every $$n$$, there exist $$k < n$$, $$\varepsilon, \delta > 0$$, and a pseudorandom variable $$Y$$ such that the agent $$A$$ defined below performs at least as well on $$U_n$$ as any constant agent, up to a margin of error that approaches $$0$$ as $$n \to \infty$$:

\begin{align*} \operatorname{explore}_0 &:= \mathbb{P}_k(\operatorname{explore}_0) < \varepsilon \\ \operatorname{explore}_1 &:= \left\{ \begin{array}{ll} a_1 & \mbox{if } \mathbb{P}_k(\operatorname{explore}_1 = a_1) < \frac{1}{\ell} \mbox{ ; else} \\ a_2 & \mbox{if } \mathbb{P}_k(\operatorname{explore}_1 = a_2) < \frac{1}{\ell} \mbox{ ; else} \\ \vdots &\\ a_\ell & \mbox{otherwise} \end{array} \right. \\ \forall a \, \mathbb{E}_n(\phi | A = a) &:= \sum_y \mathbb{E}_k(\phi | A = a \wedge \operatorname{explore}_0 \wedge Y = y) \mathbb{P}_n(Y = y | \operatorname{explore}_0) \mbox{ if } \mathbb{P}_n(A = a) < \delta \\ A_n &:= \left\{ \begin{array}{ll} \operatorname{explore}_1 & \mbox{if } \operatorname{explore}_0 \\ \operatorname{argmax}_a \mathbb{E}_n[U_n | A_n = a] & \mbox{otherwise} \end{array} \right. \end{align*}
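For concreteness, here is a consolidated sketch of this agent in the same stubbed-out style as before; all names (`P_k`, `P_n`, `E_k`, `E_n`, `y_values`, the query strings) are hypothetical stand-ins rather than part of the formalism:

```python
# Consolidated sketch of the theorem's agent, with the inductor stages stubbed out.
EPS, DELTA = 0.01, 0.01

def counterfactual_eu(a, y_values, P_n, E_k):
    # sum_y E_k[U | A = a, explore_0, Y = y] * P_n(Y = y | explore_0)
    return sum(E_k(f"U | A = {a} and explore_0 and Y = {y}")
               * P_n(f"Y = {y} | explore_0") for y in y_values)

def act(actions, y_values, P_k, P_n, E_k, E_n):
    explore_0 = P_k("explore_0") < EPS
    # explore_1: the first action whose stage-k probability is below 1/l, else the last.
    explore_1 = next((a for a in actions
                      if P_k(f"explore_1 = {a}") < 1 / len(actions)), actions[-1])
    if explore_0:
        return explore_1

    def eu(a):
        if P_n(f"A = {a}") < DELTA:       # a rarely-taken action gets the Y-mixture
            return counterfactual_eu(a, y_values, P_n, E_k)
        return E_n(f"U | A = {a}")        # otherwise use the factual stage-n estimate
    return max(actions, key=eu)
```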

## Proof sketch

Choose $$k$$ smaller than the strength parameter of the weakest predictor in $$U_n$$. If $$a_n$$ is the best constant policy for $$U_n$$, assume $$A_n = a_n$$. Since $$\mathbb{P}_n$$ can compute $$U_n$$, our agent’s factual estimate $$\mathbb{E}_n[U_n | A_n = a_n]$$ is accurate, and the counterfactual estimate $$\mathbb{E}_n[U_n | A_n = a']$$ for $$a' \neq a_n$$ is an accurate estimate of the utility assigned to the constant policy $$a'$$, as long as we make $$Y$$ rich enough. So the agent will choose $$a_n$$. Thus we have an implication of the form “if $$\mathbb{P}$$ believes $$A_n = a_n$$, then $$A_n = a_n$$ is true”, and so we can create a logical inductor $$\mathbb{P}$$ that always believes that $$A_n = a_n$$ for every $$n$$ by adding a trader with a large budget that bids up the price of $$A_n = a_n$$.

# Isn’t this just UDTv2?

This is much less general than UDTv2. If you like, you can think of this as an agent that at time $$k$$ chooses a program to run, and then runs that program at time $$n$$, except the program always happens to be “argmax over this kind of counterfactual”.

Also, it doesn’t do policy selection.

# Next steps

Instead of handing the agent a pseudorandom variable $$Y$$ that captures everything important, I’d like to have traders inside a logical inductor figure out what $$Y$$ should be on their own.

Also, I’d rather not have to hand the agent an optimal value of $$k$$.

Also, I hope that these counterfactuals can be used to do policy selection and win at counterfactual mugging.

Comment by Nisan Stiennon 365 days ago:

This doesn’t quite work. The theorem and examples only work if you maximize the unconditional mutual information, $$H(X;Y)$$, not $$H(X;Y|A)$$. And the choice of $$X$$ is doing a lot of work: it’s not enough to make it “sufficiently rich”.
