The blog post says ChatGPT is trained with proximal policy optimization. This documentation says text-davinci-003 was trained with PPO, but not text-davinci-002.
However, what you're saying about the request payloads is interesting, because it seems to contradict this. So I'm not quite sure anymore. It's possible that ChatGPT was trained with PPO on top of the non-PPO text-davinci-002.