I (and many others) did not realize this before, but: text-davinci-002 and text-davinci-001, the InstructGPT models on the OpenAI API, were not trained with RLHF (reinforcement learning from human feedback) as described in the InstructGPT paper, but with a "similar but slightly different"[1] method that uses the same human feedback data. Apparently, this other method is not technically RLHF.

Since this update has potentially nontrivial implications for interpreting the phenomena exhibited by text-davinci-002 described in Mysteries of mode collapse (formerly titled "Mysteries of mode collapse due to RLHF"), I'm making this separate post for a signal boost.

I have not corrected the original text of "Mysteries of mode collapse due to RLHF", but I've added a section at the beginning with further details on this update, copied here:

I have received evidence from multiple credible sources that text-davinci-002 was not trained with RLHF.

The rest of this post has not been corrected to reflect this update. Not much besides the title (formerly "Mysteries of mode collapse due to RLHF") is affected: just mentally substitute "mystery method" every time "RLHF" is invoked as the training method of text-davinci-002. The observations of its behavior otherwise stand alone.

This is kind of fascinating from an epistemological standpoint. I was quite surprised to learn that text-davinci-002 was probably not trained with RLHF. I don't remember exactly how "text-davinci-002 is RLHF" got elevated to an unquestioned assumption in my mind. I might have mistaken not being contradicted by people I assumed were in the know for confirmation. I certainly did not expect to talk for months to dozens of people about odd behaviors I've observed in a well-known model "due to RLHF" without being contradicted in a world where the model in question wasn't trained with RLHF, but that's what happened.[2] It wasn't just me either: the assumption that text-davinci-002(/text-davinci-001) is InstructGPT is RLHF seems ambient (e.g. search "text-davinci-002 rlhf" on Twitter, this LW post, this article, and many others). I contributed to perpetuating this misinformation cascade, and for that I apologize.

text-davinci-002's behaviors described in this post also contributed to my confidence because RLHF seemed to be a likely and potentially satisfying explanation. Its apparently unsubstantiated confidence in very specific outcomes seems antithetical to the outer objective of self-supervised learning, which is optimized by epistemic calibration, meaning the model's entropy should be as high as possible while fitting the data. In contrast, as several comments have pointed out, it makes sense that RL kills entropy. The presence of "attractors" made me additionally suspect that optimization from non-myopic outcome-supervision was formative to text-davinci-002's psyche.
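To make the entropy contrast concrete, here is a minimal sketch in Python. The two distributions are made-up illustrations over four hypothetical candidate tokens, not measurements from any model, and shannon_entropy is a helper name introduced only for this example.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token distributions over four candidate continuations.
# These numbers are illustrative assumptions, not model outputs.
spread_out = [0.25, 0.25, 0.25, 0.25]   # no candidate is favored
collapsed = [0.97, 0.01, 0.01, 0.01]    # near-certainty in one specific outcome

print(f"spread out: {shannon_entropy(spread_out):.3f} bits")
print(f"collapsed:  {shannon_entropy(collapsed):.3f} bits")
```

If the data genuinely leave the four continuations equally plausible, the first distribution is the calibrated one, and the second corresponds to the kind of unsubstantiated confidence described above.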

Mode collapse and attractors do seem to also be caused by RLHF (see Dumbass policy pls halp and Inescapable wedding parties). So the update is that some other training method also gives rise to these phenomena, as they are manifested by text-davinci-002.

Whether and how speculations concerning the causes of mode collapse/attractors should be affected depends on how text-davinci-002's training method differs from RLHF.


What is known about text-davinci-002's training method

Publicly available information suggests that the mystery method may not be so different from RLHF. Just today I discovered this sidenote in OpenAI's blog post Aligning Language Models to Follow Instructions:

The InstructGPT models deployed in the API are updated versions trained using the same human feedback data. They use a similar but slightly different training method that we will describe in a forthcoming publication.  

AFAIK, this is all that OpenAI has published about the RLHF/mystery method diff. It says that the InstructGPT models (text-davinci-001 and text-davinci-002) were trained using the same human feedback data as the method described in OpenAI's RLHF paper.[3] But this "similar but slightly different" method is apparently sufficiently different to not qualify as RLHF!

Pending further revelations, I suppose the lesson here was that I should have sustained more entropy in my belief state given the partial information I had. But what a demanding thing to ask! So much easier to promote an attractive hypothesis to the status of decisive fact and collapse the remainder than to hold a superposition in the mind.

  1. Sidenote on OpenAI's blog post, Aligning Language Models to Follow Instructions

  2. the lack of epistemic vigilantes attacking an unsubstantiated assumption in the very title of this post on LessWrong is truly unbelievable!

  3. which seems to confirm my suspicion about outcome-supervision

7 comments

OpenAI has just released a description of how their models work here.

text-davinci-002 is trained with "FeedME" and text-davinci-003 is trained with RLHF (PPO). 

"FeedME" is what they call supervised fine-tuning on human-written demonstrations or model samples rated 7/7 by human labelers. So basically fine-tuning on high-quality data.

I think your findings are still very interesting, because they imply that even further finetuning changes the distribution significantly! Given all this information one could now actually run a systematic comparison of davinci, text-davinci-002 (finetuning), and text-davinci-003 (RLHF) and see how the distribution changes on various tasks.

Let me know if you want help on this, I'm interested in this myself.
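A rough sketch of what the per-prompt part of such a comparison could look like, in Python. It assumes you have already collected, for the same prompt, a mapping from each top candidate token to its log-probability under each model; how those are obtained is left out. The names model_a and model_b and all numbers below are placeholders, not measurements of davinci, text-davinci-002, or text-davinci-003.

```python
import math

def normalize_top_logprobs(top_logprobs):
    """Convert {token: logprob} for the top-k tokens into probabilities that sum to 1."""
    probs = {tok: math.exp(lp) for tok, lp in top_logprobs.items()}
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}

def entropy_bits(probs):
    """Shannon entropy in bits of a {token: probability} distribution."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def kl_bits(p, q, floor=1e-9):
    """KL(p || q) in bits; tokens absent from q are given a small floor probability."""
    return sum(
        pv * math.log2(pv / max(q.get(tok, 0.0), floor))
        for tok, pv in p.items()
        if pv > 0
    )

# Placeholder top-k log-probabilities for one prompt (assumed values, not real data).
model_a = {" 42": math.log(0.30), " 7": math.log(0.25), " 13": math.log(0.25), " 97": math.log(0.20)}
model_b = {" 42": math.log(0.94), " 7": math.log(0.03), " 13": math.log(0.02), " 97": math.log(0.01)}

p_a = normalize_top_logprobs(model_a)
p_b = normalize_top_logprobs(model_b)

print(f"entropy model_a: {entropy_bits(p_a):.3f} bits")
print(f"entropy model_b: {entropy_bits(p_b):.3f} bits")
print(f"KL(model_b || model_a): {kl_bits(p_b, p_a):.3f} bits")
```

Renormalizing over the top-k tokens ignores the tail of the distribution, so the entropies are lower bounds on the full-vocabulary values; averaging these quantities over many prompts per task would give the systematic picture.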

Very useful update, thanks.

Though I notice they don't say anything about how ada and text-ada-* models were trained.

In this thread, I asked Jan Leike what kind of model generates the samples that go into the training data if rated 7/7, and he answered "A mix of previously trained models. Probably very few samples from base models if any" (emphasis mine).

I'm curious to know whether/which of the behaviors described in this post appear in the models that generated the samples vs emerge at the supervised finetuning step. 

Hypothetically, if a model trained with RLHF generates the samples and that model has the same modes/attractors, it probably makes sense to say that RLHF was responsible for shaping those behaviors and finetuning only "cloned" them.

(Note that it wasn't specified how the previously trained non-base models were actually trained, leaving open the possibilities of RLHF models, models fine-tuned on only human data, earlier iterations of FeedME models, or something entirely different.)

Thanks for catching this and spreading the word!

Do we know if the following other models from OpenAI use true RLHF or also use this RLHF-like mystery method? (or something else!)

  • text-curie-001
  • text-babbage-001
  • text-ada-001

The new model index from OpenAI contains most of the answers to this. Jérémy linked to it in another comment on this post. However, the model index doesn't give info on ada and text-ada-001 yet: https://beta.openai.com/docs/model-index-for-researchers

I don't know :(

There was a recent Twitter thread about this. See here and here.