AI alignment researcher supported by MIRI and LTFF. Working on the learning-theoretic agenda. Based in Israel. See also LinkedIn.
E-mail: vanessa DOT kosoy AT {the thing reverse stupidity is not} DOT org
Is it possible to replace the maximin decision rule in infra-Bayesianism with a different decision rule? One surprisingly strong desideratum for such decision rules is the learnability of some natural hypothesis classes.
In the following, all infradistributions are crisp.
Fix finite action set and finite observation set . For any and , let
be defined by
In other words, this kernel samples a time step out of the geometric distribution with parameter , and then produces the sequence of length that appears in the destiny starting at .
For any continuous[1] function , we get a decision rule. Namely, this rule says that, given infra-Bayesian law and discount parameter , the optimal policy is
The usual maximin is recovered when we have some reward function and corresponding to it is
Given a set of laws, it is said to be learnable w.r.t. when there is a family of policies such that for any
For we know that e.g. the set of all communicating[2] finite infra-RDPs is learnable. More generally, for any we have the learnable decision rule
This is the "mesomism" I taked about before.
Also, any monotonically increasing seems to be learnable, i.e. any s.t. for we have . For such decision rules, you can essentially assume that "nature" (i.e. whatever resolves the ambiguity of the infradistributions) is collaborative with the agent. These rules are not very interesting.
On the other hand, decision rules of the form are not learnable in general, and so are decision rules of the form for monotonically increasing.
Open Problem: Are there any learnable decision rules that are not mesomism or monotonically increasing?
A positive answer to the above would provide interesting generalizations of infra-Bayesianism. A negative answer to the above would provide an interesting novel justification of the maximin. Indeed, learnability is not a criterion that was ever used in axiomatic constructions of decision theory[3], AFAIK.
We can try considering discontinuous functions as well, but it seems natural to start with continuous. If we want the optimal policy to exist, we usually need to be at least upper semicontinuous.
There are weaker conditions than "communicating" that are sufficient, e.g. "resettable" (meaning that the agent can always force returning to the initial state), and some even weaker conditions that I will not spell out here.
I mean theorems like VNM, Savage etc.
Intuitively, it feels that there is something special about mathematical knowledge from a learning-theoretic perspective. Mathematics seems infinitely rich: no matter how much we learn, there is always more interesting structure to be discovered. Impossibility results like the halting problem and Godel incompleteness lend some credence to this intuition, but are insufficient to fully formalize it.
Here is my proposal for how to formulate a theorem that would make this idea rigorous.
Fix some natural hypothesis class for mathematical knowledge, such as some variety of tree automata. Each such hypothesis represents an infradistribution over : the "space of counterpossible computational universes". We can say that is a "true hypothesis" when there is some in the credal set (a distribution over ) s.t. the ground truth "looks" as if it's sampled from . The latter should be formalizable via something like a computationally bounded version of Marin-Lof randomness.
We can now try to say that is "rich" if for any true hypothesis , there is a refinement which is also a true hypothesis and "knows" at least one bit of information that doesn't, in some sense. This is clearly true, since there can be no automaton or even any computable hypothesis which fully describes . But, it's also completely boring: the required can be constructed by "hardcoding" an additional fact into . This doesn't look like "discovering interesting structure", but rather just like brute-force memorization.
What if instead we require that knows infinitely many bits of information that doesn't? This is already more interesting. Imagine that instead of metacognition / mathematics, we would be talking about ordinary sequence prediction. In this case it is indeed an interesting non-trivial condition that the sequence contains infinitely many regularities, s.t. each of them can be expressed by a finite automaton but their conjunction cannot. For example, maybe the -th bit in the sequence depends only the largest s.t. divides , but the dependence on is already uncomputable (or at least inexpressible by a finite automaton).
However, for our original application, this is entirely insufficient. This is because in the formal language we use to define (e.g. combinator calculus) has some "easy" equivalence relations. For example, consider the family of programs of the form "if 2+2=4 then output 0, otherwise...". All of those programs would output 0, which is obvious once you know that 2+2=4. Therefore, once your automaton is able to check some such easy equivalence relations, hardcoding a single new fact (in the example, 2+2=4) generates infinitely many "new" bits of information. Once again, we are left with brute-force memorization.
Here's the improved condition: For any true hypothesis , there is a true refinement s.t. conditioning on any finite set of observations cannot produce a refinement of .
There is a technicality here, because we're talking about infradistributions, so what is "conditioning" exactly? For credal sets, I think it is sufficient to allow two types of "conditioning":
This rules-out the counterexample from before: the easy equivalence relation can be represented inside , and then the entire sequence of "novel" bits can be generated by a conditioning.
Alright, so does actually satisfy this condition? I think it's very probable, but I haven't proved it yet.
I wrote a review here. There, I identify the main generators of Christiano's disagreement with Yudkowsky[1] and add some critical commentary. I also frame it in terms of a broader debate in the AI alignment community.
I divide those into "takeoff speeds", "attitude towards prosaic alignment" and "the metadebate" (the last one is about what kind of debate norms should we have about this or what kind of arguments should we listen to.)
Yes, this is an important point, of which I am well aware. This is why I expect unbounded-ADAM to only be a toy model. A more realistic ADAM would use a complexity measure that takes computational complexity into account instead of . For example, you can look at the measure I defined here. More realistically, this measure should be based on the frugal universal prior.
Thank you for the clarification.
How do you expect augmented humanity will solve the problem? Will it be something other than "guessing it with some safe weak lesser tries / clever theory"?
in each of the 50 different subject areas that we tested it on, it's as good as the best expert humans in those areas
That sounds like an incredibly strong claim, but I suspect that the phrasing is very misleading. What kind of tests is Hassabis talking about here? Maybe those are tests that rely on remembering known facts much more than on making novel inferences? Surely Gemini is not (say) as good as the best mathematicians at solving open problems in mathematics?
Here is a way to construct many learnable undogmatic ontologies, including such with finite state spaces.
A deterministic partial environment (DPE) over action set and observation set is a pair where and s.t.
DPEs are equipped with a natural partial order. Namely, when and .
Let be a strong upwards antichain in the DPE poset which doesn't contain the bottom DPE (i.e. the DPE with ). Then, it naturally induces an infra-POMDP. Specifically:
If is non-empty for all and , this is a learnable undogmatic ontology.
Any yields an example . Namely, iff and for any it holds that:
I think that for any continuous some non-trivial hidden reward functions over such an ontology, the class of communicating RUMDPs is learnable. If the hidden reward function doesn't depend on the action argument, it's equivalent to some instrumental reward function.
...the problem of how to choose one's IBH prior. (If the solution was something like "it's subjective/arbitrary" that would be pretty unsatisfying from my perspective.)
It seems clear to me that the prior is subjective. Like with Solomonoff induction, I expect there to exist something like the right asymptotic for the prior (i.e. an equivalence class of priors under the equivalence relation where and are equivalent when there exists some s.t. and ), but not a unique correct prior, just like there is no unique correct UTM. In fact, my arguments about IBH already rely on the asymptotic of the prior to some extent.
One way to view the non-uniqueness of the prior is through an evolutionary perspective: agents with prior are likely to evolve/flourish in universes sampled from prior , while agents with prior are likely to evolve/flourish in universes sampled from prior . No prior is superior across all universes: there's no free lunch.
For the purpose of AI alignment, the solution is some combination of (i) learn the user's prior and (ii) choose some intuitively appealing measure of description complexity, e.g. length of lambda-term (i is insufficient in itself because you need some ur-prior to learn the user's prior). The claim is, different reasonable choices in ii will lead to similar results.
Given all that, I'm not sure what's still unsatisfying. Is there any reason to believe something is missing in this picture?
...I'm still comfortable sticking with "most are wide open".
Allow me to rephrase. The problems are open, that's fair enough. But, the gist of your post seems to be: "Since coming up with UDT, we ran into these problems, made no progress, and are apparently at a dead end. Therefore, UDT might have been the wrong turn entirely." On the other hand, my view is: Since coming up with those problems, we made a lot of progress on agent theory within the LTA, which has implications on those problems among other things, and so far this progress seems to only reinforce the idea that UDT is "morally" correct. That is, not that any of the old attempted formalizations of UDT is correct, but that the intuition behind UDT, and its recommendation in many specific scenarios, are largely justified.
ETA: Oh, I think you're saying that the CDT agent could turn into a IBH agent but with a different prior from the other IBH agents, that ends up allowing it to still play D while the other two still play C, so it's not made worse off by switching to IBH. Can you walk this through in more detail? How does the CDT agent choose what prior to use when switching to IBH, and how do the different priors actual imply a CCD outcome in the end?
While writing this part, I realized that some of my thinking about IBH was confused, and some of my previous claims were wrong. This is what happens when I'm overeager to share something half-baked. I apologize. In the following, I try to answer the question while also setting the record straight.
An IBH agent considers different infra-Bayesian hypotheses starting from the most optimistic ones (i.e. those that allow guaranteeing the most expected utility) and working its way down, until it finds something that works[1]. Such algorithms are known as "upper confidence bound" (UCB) in learning theory. When multiple IBH agents interact, they start with each trying to achieve its best possible payoff in the game[2], and gradually relax their demands, until some coalition reaches a payoff vector which is admissible for it to guarantee. This coalition then "locks" its strategy, while other agents continue lowering their demands until there is a new coalition among them, and so on.
Notice that the pace at which agents lower their demands might depend on their priors (by affecting how many hypotheses they have to cull at each level), their time discounts and maaaybe also other parameters of the learning algorithm. Some properties this process has:
We can operationalize "CDT agent" as e.g. a learning algorithm satisfying an internal regret bound (see sections 4.4 and 7.4 in Cesa-Bianchi and Lugosi) and the process of self-modification as learning on two different timescales: a slow outer loop that chooses a learning algorithm for a quick inner loop (this is simplistic, but IMO still instructive). Such an agent would indeed choose IBH over CDT if playing a Prisoner's Dilemma (and would prefer an IBH variant that lowers its demands slowly enough to get more of the gains-of-trade but quickly enough to actually converge), whereas in the 3-player Prisoner's Dilemma there is at least some IBH variant that would be no worse than CDT.
If all players have metalearning in the outer loop, then we get dynamics similar to Chicken [version in which both swerving is better than flipping a coin[4]], where hard-bargaining (slower to lower demands) IBH corresponds to "straight" and soft-bargaining (quick to lower demands) IBH corresponds to "swerve". Chicken [this version] between two identical IBH agents results in both swerving. Chicken beween hard-IBH and soft-IBH results in hard-IBH getting a higher probability of going straight[5]. Chicken between two CDTs results in a correlated equilibrium, which might have some probability of crashing. Chicken between IBH and CDT... I'm actually not sure what happens off the top of my head, the analysis is not that trivial.
This is pretty similar to "modal UDT" (going from optimistic to pessimistic outcomes until you find a proof that some action can guarantee that outcome). I think that the analogy can be made stronger if the modal agent uses an increasingly strong proof system during the search, which IIRC was also considered before. The strength of the proof system then plays the role of "logical time", and the pacing of increasing the strength is analogous to the (inverse function of the) temporal pacing in which an IBH agent lowers its target payoff.
Assuming that they start out already knowing the rules of the game. Otherwise, they might start from trying to achieve payoffs which are impossible even with the cooperation of other players. So, this is a good model if learning the rules is much faster than learning anything to do with the behavior of other players, which seems like a reasonable assumption in many cases.
It is not that surprising that two sufficiently dissimilar agents can defect. After all, the original argument for superrational cooperation was: "if the other agent is similar to you, then it cooperates iff you cooperate".
I wish we had good names for the two version of Chicken.
This seems nicely reflectively consistent: soft/hard-IBH in the outer loop produces soft/hard-IBH respectively in the inner loop. However, two hard hard-IBH agents in the outer loop produce two soft-IBH agents in the inner loop. On the other hand, comparing absolute hardness between outer and inner loop seems not very meaningful, whereas comparing relative-between-players hardness between outer and inner loop is meaningful.
Sort of obvious but good to keep in mind: Metacognitive regret bounds are not easily reducible to "plain" IBRL regret bounds when we consider the core and the envelope as the "inside" of the agent.
Assume that the action and observation sets factor as A=A0×A1 and O=O0×O1, where (A0,O0) is the interface with the external environment and (A1,O1) is the interface with the envelope.
Let Λ:Π→□(Γ×(A×O)ω) be a metalaw. Then, there are two natural ways to reduce it to an ordinary law:
However, requiring low regret w.r.t. neither of these is equivalent to low regret w.r.t Λ:
Therefore, metacognitive regret bounds hit a "sweep spot" of stength vs. feasibility which produces a genuinely more powerful agents than IBRL[1].
More precisely, more powerful than IBRL with the usual sort of hypothesis classes (e.g. nicely structured crisp infra-RDP). In principle, we can reduce metacognitive regret bounds to IBRL regret bounds using non-crsip laws, since there's a very general theorem for representing desiderata as laws. But, these laws would have a very peculiar form that seems impossible to guess without starting with metacognitive agents.