Summary: It’s possible to set up an zerosum game between two agents so that, in any Nash equilibrium, one agent picks a policy to optimize a particular objective subject to some constraints on expected features of the state resulting from the policy. This seems potentially useful for getting an approximate agent to maximize some objective subject to constraints.
Consider the following optimization problem:
Maximize \(f(\pi)\) subject to \(\mathbb{E}[\mathbf{a}(X)  do(\pi)] \succeq \mathbf{b}\).
where \(\pi\) is a policy, \(f\) maps policies to reals, \(X \in \mathcal{X}\) is a random variable representing the state of the world, \(\mathbf{a} \in \mathcal{X} \rightarrow \mathbb{R}^n\) is a feature vector, and \(\mathbf{b} \in \mathbb{R}^n\) is a vector. This objective instructs the agent to maximize some objective \(f(\pi)\) while ensuring that random variables \(a_1(X), ..., a_n(X)\) have expectations at least \(b_1, ..., b_n\) respectively.
How might we design a agent that solves the constrained optimization problem, if all we have available are agents that optimize unconstrained objectives? Consider a game between two agents, an actor and a critic:
 The critic picks Lagrange multipliers \(\mathbf{\lambda} \in \mathbb{R}^n, \mathbf{\lambda} \succeq \mathbf{0}\).
 The actor picks a policy \(\pi\), which results in world state \(X\).
 The actor gets utility \(f(\pi) + \mathbf{\lambda} \cdot (\mathbf{a}(X)  \mathbf{b})\). The critic gets the negation of this utility.
Let’s assume the critic and actor pick their \(\lambda\) and \(\pi\) simultaneously, and attempt to find the Nash equilibrium. Since the critic has an infinite action space, there is not guaranteed to be a Nash equilibrium.
To make the problem easier, let’s assume that the set of allowed policies is convex, and \(f\) is concave (that is, for any two policies \(\pi_1\) and \(\pi_2\), and \(\theta \in (0, 1)\), then if \(\pi_3\) is the policy formed by first picking \(\pi_1\) with probability \(\theta\) or \(\pi_2\) with probability \(1  \theta\) and then running that policy, \(f(\pi_3) \geq \theta f(\pi_1) + (1  \theta) f(\pi_2)\)).
We will prove:
Theorem: in any Nash equilibrium, the actor’s policy solves the original optimization problem.
This is due to Lagrangian duality; the critic is picking Lagrange multipliers. It’s possible that a setup like this results in reasonable behavior if we use approximate agents instead of optimal agents (and perhaps place some high absolute limit on \(\mathbf{\lambda}\)), which would be the main practical application of this idea, though I’m not sure if it does.
Proof of theorem
The actor’s expected utility is \(g_{\mathbf{\lambda}}(\pi) := f(\pi) + \mathbf{\lambda} \cdot (\mathbb{E}[\mathbf{a}(X)  do(\pi)]  \mathbf{b})\).
With the assumptions above, we need only consider pure Nash equilibria. The critic’s utility is linear in \(\mathbf{\lambda}\) and the actor’s utility is concave in \(\mathbf{\pi}\); thus a pure strategy \(\mathbf{\lambda}\) weakly dominates any mixed strategy with mean \(\mathbf{\lambda}\), and a policy \(\pi\) weakly dominates any distribution over policies whose mean is \(\pi\).
First consider the game from the critic’s perspective. Consider a few different cases:
 Suppose \(\mathbb{E}[a_i(X)  do(\pi)] < b_i\) for some \(i\). Then the critic gets more expected utility the higher \(\lambda_i\) is; thus there is no bestresponse \(\lambda_i\) value. Therefore, no Nash equilibria have this property.
 Suppose \(\mathbb{E}[a_i(X)  do(\pi)] = b_i\) for some \(i\). Then the critic is indifferent about \(\lambda_i\) and may set it to anything.
 Suppose \(\mathbb{E}[a_i(X)  do(\pi)] > b_i\) for some \(i\). Then the critic must set \(\lambda_i = 0\) to maximize expected utility.
So in any Nash equilibrium we must have:
Property 1: \(\mathbb{E}[\mathbf{a}(X)  do(\pi)] \succeq \mathbf{b}\)
Property 2: \(\mathbf{\lambda} \cdot (\mathbf{E}[\mathbf{a}(X)  do(\pi)]  \mathbf{b}) = 0\); equivalently, \(g_{\mathbf{\lambda}}(\pi) = f(\pi)\)
These properties correspond to the primal feasibility and complementary slackness KKT conditions.
Now let’s consider the game from the actor’s perspective. The actor will pick \(\pi\) to maximize
\[g_{\mathbf{\lambda}}(\pi) := f(\pi) + \mathbf{\lambda} \cdot (\mathbb{E}[\mathbf{a}(X)  do(\pi)]  \mathbf{b})\]
Let \(\pi^*\) solve the original optimization problem: it maximizes \(f(\pi^*)\) subject to \(\mathbb{E}[\mathbf{a}(X)  do(\pi^*)] \succeq \mathbf{b}\). Now observe that, in any Nash equilibrium:
 \(f(\pi) \leq f(\pi^*)\). This is because \(\pi\) satisfies the constraints of the original optimization problem and \(\pi^*\) maximizes \(f\) subject to the constraints.
 \(f(\pi) \geq f(\pi^*)\). This is because we have \(f(\pi') \leq g_{\mathbf{\lambda}}(\pi')\) for all \(\pi'\) satisfying the constraints; in particular, \(f(\pi^*) \leq g_{\mathbf{\lambda}}(\pi^*) \leq g_{\mathbf{\lambda}}(\pi) = f(\pi)\).
So in any Nash equilibrium, we have
Property 3: \(f(\pi) = f(\pi^*)\)
Properties 1 and 3 together imply the theorem:
Theorem: in any Nash equilibrium, the actor’s policy solves the original optimization problem.
When does a Nash equilibrium exist?
We’ve shown a property that all Nash equilibria of this game satisfy, but is a Nash equilibrium guaranteed to exist? Whenever some policy satisfies the constraints of the original optimization problem, I think one does, but I haven’t been able to prove this.
There’s no Nash equilibrium when no \(\pi\) satisfies the constraints; in that case the critic would always want to set the \(\lambda_i\) value corresponding to a violated constraint to an extremeley high value.
Example: Secondary objectives
Consider the following optimization problem:
Maximize \(f(\pi)\) subject to \(\mathbb{E}[a(X)  do(\pi)] \geq b\).
In effect, an agent solving this optimization problem will ensure that it solves the primary objective of getting a highenough expected \(a(X)\) value, and will otherwise attempt to optimize the secondary objective \(f(\pi)\).
In the game corresponding to this optimization problem, the critic picks \(\lambda\) to be the exchange rate between the primary and secondary objectives that causes the actor to maximize the secondary objective subject to getting enough of the primary objective.
It’s possible that something like this could be used for mild optimization, though I haven’t worked out the details. We would want to set \(f\) to be something simple that we’re pretty comfortable with an AI approximately maximizing under the constraint, and I haven’t found anything of this form yet.
Example: Generative adversarial exponential families
(note: this probably isn’t actually useful)
Suppose we want to approximately sample from an exponential family \(p_{\mathbf{\eta}}(x) = \exp(\mathbf{\eta} \cdot \mathbf{\phi}(x)  g(\mathbf{\eta}))\) with \(\mathbf{\eta}\) set to maximize likelihood of the training data \(x_1, ..., x_m\). Define \(\mathbf{\mu}(q) := \mathbb{E}_{X \sim q}[\mathbf{\phi}(X)]\), and \(\mathbf{\mu}_{data} := \frac{1}{m} \sum_{i=1}^m \mathbf{\phi}(x_i)\).
We would like to obtain a distribution \(q\) that maximizes \(H(q)\) subject to \(\mathbf{\mu}(q) = \mathbf{\mu}_{data}\). By the maximum entropy theorem, the optimal \(q\) will be \(p_{\mathbf{\eta}*}\), where \(\mathbf{\eta}^*\) is chosen to maximize the likelihood of the data according to \(p_{\eta^*}\).
Consider the following game, which resembles generative adversarial networks:
 The critic picks a natural parameter \(\mathbf{\eta} \in \mathbb{R}^n\).
 The actor picks a distribution \(q : \Delta \mathcal{X}\)
 We take a sample \(X \sim q\).
 The actor gets utility \(\log q(X) + \mathbf{\eta} \cdot (\mathbf{\phi}(X)  \mathbf{\mu}_{data})\); the critic gets the negation of this utility.
Note that \( \log q(X)\) is an unbiased estimate of \(H(q)\), so we can consider \(H(q)\) to be the objective to be optimized subject to the constraints \(\mathbf{\mu}(q) = \mathbf{\mu}_{data}\).
At any Nash equilibrium, we will have:
 The actor picks \(q := p_{\mathbf{\eta*}}\)
 The critic picks \(\mathbf{\eta} := \mathbf{\eta}^*\)
The first fact is proven as a special case of the original theorem (plus the maximum entropy theorem); the second fact is not difficult to show given the first fact.
I believe this approach is similar to contrastive divergence. To recover the power of the original generative adversarial networks, perhaps \(\phi\) ought to be features output by a neural net. An advantage of this approach over generative adversarial networks is that the actor is penalized for differing from the training distribution in a hardtodetect way (such as by plagiarizing from training data, or encoding steganographic messages; this is similar to the informed oversight problem), because the exponential family distribution \(p_{\mathbf{\eta}^*}\) is the uniquely optimal solution (although this doesn’t guarantee that the approximate solution will be very close to \(p_{\mathbf{\eta}^*}\)).
Unfortunately, this approach requires estimating \(q(X)\), so \(q\) must be represented as a variational autoencoder or something similar. But if we’re using variational autoencoders, then we might as well directly train a generative model. So the approach I’m sketching here is probably not directly useful on its own (though I can see it being inspiration for a more useful approach to training generative models in a safely scalable way).
