Most work on AI safety starts with a broad, vague problem (“How can we make an AI do good things?”) and relatively quickly moves to a narrow, precise problem (e.g. “What kind of reasoning process trusts itself?”).
Precision facilitates progress, and many serious thinkers are rightly skeptical of imprecision. But in narrowing a broad problem to a precise one, we do much of the work, and introduce most of the opportunity for error.
I am interested in more precise discussion of the big-picture problem of AI control. Such discussion could improve our understanding of AI control, help us choose the right narrow questions, and be a better starting point for engaging other researchers. To that end, consider the following problem:
The steering problem: Using black-box access to human-level cognitive abilities, can we write a program that is as useful as a well-motivated human with those abilities?
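To make the shape of the problem concrete, here is a minimal sketch (in Python, with hypothetical names of my own choosing, not the paper's formalization) that treats “black-box access” as a function we may call but not inspect:

```python
from typing import Callable

# Hypothetical illustration: "black-box access to human-level cognitive
# abilities" modeled as an oracle mapping a task description to a
# human-quality answer. The names here are illustrative assumptions.
HumanLevelOracle = Callable[[str], str]

def steered_program(oracle: HumanLevelOracle, task: str) -> str:
    """The steering problem asks for a program like this: it may only
    interact with the cognitive abilities through the black box
    `oracle`, yet its output should be as useful as that of a
    well-motivated human who has those same abilities."""
    # A trivial (and almost certainly inadequate) strategy: pass the
    # task straight through. A real solution must ensure the answers
    # are reliably directed at the user's goals.
    return oracle(task)
```

The point of the sketch is only the type signature: a solution gets the abilities as an opaque resource and must supply the motivation itself.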
I recently wrote this document, which defines this problem much more precisely (in section 2) and considers a few possible approaches (in section 4). As usual, I appreciate thoughts and criticism. I apologize for the proliferation of nomenclature, but I couldn’t get by without a new name.