Let your every day be full of joy, love the child that holds your hand, let your wife delight in your embrace, for these alone are the concerns of humanity.[1]

— Epic of Gilgamesh - Tablet X

 

Say we want to train a scientist AI to help in a precise, narrow field of science (e.g. medicine design) but prevent its power from being applied anywhere else (e.g. chatting with humans, designing bio-weapons, etc.) even if it has these abilities.

Here’s one safety layer one could implement:

  1. Train a scientist AI on a large scientific corpus translated exclusively into Sumerian. Keep it in a secure containment environment.
  2. Train a less-smart reporter whose sole ability is to translate from Sumerian to English only if the Sumerian content is about medical research. It refuses to translate other kinds of content.
  3. Human operators are only allowed to interact with the scientist AI through the reporter, which acts as an intermediary.
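As a rough illustration of step 2, here is a minimal Python sketch of the reporter as a gate on the scientist's output. Everything in it is hypothetical: `Reporter`, `is_medical`, and `translate` are names we made up, and the toy stand-ins at the bottom only mimic a topic classifier and a Sumerian-to-English translator so the control flow can be run. A real reporter would be a trained model, and this sketch says nothing about how hard it would be to fool.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Reporter:
    """Hypothetical gate between the scientist AI and human operators."""
    is_medical: Callable[[str], bool]  # assumed topic classifier over Sumerian text
    translate: Callable[[str], str]    # assumed Sumerian-to-English translator

    def report(self, sumerian_text: str) -> Optional[str]:
        # Refuse (return None) unless the content is about medical research.
        if not self.is_medical(sumerian_text):
            return None
        return self.translate(sumerian_text)


# Toy stand-ins, for illustration only: placeholder strings, not real Sumerian.
def toy_is_medical(text: str) -> bool:
    return text.startswith("MED:")


def toy_translate(text: str) -> str:
    return "EN(" + text + ")"


reporter = Reporter(is_medical=toy_is_medical, translate=toy_translate)
allowed = reporter.report("MED:placeholder-finding")
refused = reporter.report("CHAT:placeholder-message")
```

The design point is that the refusal happens before translation, so operators never see an English rendering of off-topic content.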

This safety layer helps in at least two ways:

  1. The scientist AI cannot directly manipulate humans to escape its containment. It would first have to trick the reporter in order to trick the humans, which keeps the attack possible but makes it less practical.
  2. If the scientist AI exfiltrates itself, it will have a harder time manipulating humans online and will leave traces: e.g. any occurrence of Sumerian on the Internet outside of ancient text studies becomes highly suspicious. One can even imagine recognizing the idiosyncrasies of English translated from Sumerian, or the artifacts left by a Sumerian-specific tokenizer reused to produce non-Sumerian text.
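To make the "leaves traces" point concrete, here is a minimal Python sketch of a leak detector. It rests on an assumption that is not part of the proposal above: that the Sumerian corpus is written in cuneiform script rather than Latin transliteration, so leaked text can be spotted by its Unicode code points. The function names and the 0.2 threshold are arbitrary choices for illustration.

```python
# Unicode blocks: Cuneiform, Cuneiform Numbers and Punctuation,
# Early Dynastic Cuneiform.
CUNEIFORM_RANGES = (
    (0x12000, 0x123FF),
    (0x12400, 0x1247F),
    (0x12480, 0x1254F),
)


def cuneiform_fraction(text: str) -> float:
    """Fraction of non-whitespace characters that are cuneiform signs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(
        1 for c in chars
        if any(lo <= ord(c) <= hi for lo, hi in CUNEIFORM_RANGES)
    )
    return hits / len(chars)


def looks_suspicious(text: str, threshold: float = 0.2) -> bool:
    # The threshold is an arbitrary illustrative value, not a tuned one.
    return cuneiform_fraction(text) >= threshold
```

A real monitor would also need an allowlist for legitimate sources such as Assyriology sites, and a separate approach for transliterated text.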

Why Sumerian?[2]

  • It has been extinct for 4000 years. There are very few (probably <1000) experts who fluently understand Sumerian.
  • It is a language isolate. It’s unlike any existing spoken language, rendering its identification in case of a leak much easier.
  • There is a substantial corpus. Despite its age, a significant number of Sumerian texts have been discovered and preserved. These include religious texts, legal codes, literature (like the Epic of Gilgamesh, parts of whose source material are written in Sumerian), and administrative records. The corpus might be enough to train high-quality translation systems from English and other high-resource languages.

How realistic is this? We think the project would require substantial engineering effort, on a scale within reach of the current AGI companies. A small-scale project fine-tuned a T5 model to translate 100k Sumerian to English with reasonable quality. This suggests that translation in the other direction is doable. The resulting texts will probably not be fluent Sumerian, but they should be good enough to accurately describe the huge diversity of subjects contained in traditional LLM datasets. Even if there are too few Sumerian resources, companies could pick Latin or another ancient language, or even ask linguists to invent a language for the occasion.

What is this for? AI assistance seems important for many of the agendas currently pursued in top labs or upcoming labs (e.g. scalable oversight, alignment work by AI, creating a world simulation with AI expert programmers). Though there are cruxes for why none of these plans may work (e.g. that anything that can solve alignment is already too deadly), it would still be more dignified for the people who run these programs to at least make strong efforts at safeguarding those systems and limiting their downside risk. It would be a sign of good faith if they actually used highly effective boxing techniques (and all appropriate red teaming) for their most powerful AI systems as they get closer to human-level AGI (and stopped before going beyond).

(Note that programs using low-resource languages, such as Native American languages, to obfuscate communication were successfully used on a wide scale during the First and Second World Wars.)

Principles for designing safety layers

We’re not actually vouching for this exact solution as is. However, we think it’s good food for thought. It draws attention to four points that we think are important to keep in mind when designing AI safety interventions:

  1. We have control over the training process. We create these machines. Even if it’s unclear how to make the best use of this advantage, it is a crucial difference from the scenario “hostile super-intelligent aliens are arriving on Earth” that AI safety sometimes turns into in our minds.
  2. Society is more robust to certain capacities than others. It is often advised that we should work on making society more resilient. This example shows we can take a more fine-grained approach: society is not equally resistant to every capacity. We can use our control over the training process so that AIs have the capacities society is less fragile to. In this case, that means removing the AI’s ability to use a widely known language as an API to interact with humans, which drastically shrinks the manipulation attack surface. This could be seen as an application of the principle of least privilege: only give the AI the capabilities it needs to do its job.
  3. It’s sometimes useful to go from “solving alignment” to “finding a new cheese slice to stack”. Working on AI safety has often been synonymous with a glorious effort to put an end to the alignment problem, in the same way that humanity put an end to Fermat’s Last Theorem. We think this framing can trap researchers into overlooking the complexity of the socio-technical problem that is AI safety. Instead, we think that aiming for gradual safety improvements until risk falls below an acceptable threshold is a more fruitful way to come up with solutions that can be applied in the real world (for controlling risk from human-level AGI, not superintelligence).
  4. Many such ideas already exist, and many more can be invented. It matters to diversify the approaches we actually develop. We encourage readers to take a moment to generate a few such ideas, and to consider working on these neglected ideas and building proofs of concept so that they can more easily be integrated into plans.

 

  1. ^

     According to this reddit post, the quote corresponds to the end of paragraph 1-15.

  2. ^

     Note that we are not Sumerian experts; we only spent ~2h researching Sumerian for this post. Feel free to point out our mistakes!
