I don’t know much about AI or machine learning, but as you say, I think part of the reason for your results is that GPT-3’s language model doesn’t have a strong “connection” between the real-world “latent information” in your questions and the probabilities you want it to produce. That deficiency is roughly what you’re suggesting in your post, and I think you’re right.
I think another major reason is the prompt design, or the “mindset” you put GPT-3 in.
I would say it’s useful to see GPT-3 as a “paid actor”: it is “acting”, trying to generate text that fits a certain framing.
From your prompt, it tries to figure out whether it is writing a blog post, writing a joke, or producing something with a starker, more dramatic framing.
Once you get it into the right “mindset”, you actually get meaningful completions.
Examples:
Completion 1:
This is not a joke, the above is an actual completion.
Completion 2:
I am not joking; again, the above is an actual completion. I am not sure how it is so accurate, since GPT-3’s training data is cut off at 2019.
See the parameters below:
import os
import openai

# Legacy (pre-1.0) openai-python Completions API
openai.api_key = os.getenv("OPENAI_API_KEY")

response = openai.Completion.create(
    engine="text-davinci-002",
    prompt="Event: Will the cost of sequencing a human genome fall below $500 by mid 2016? \nLikelihood: 43%\n\nEvent: Will Russia invade Ukrainian territory in 2022?\nLikelihood: 64%\n\nEvent: Will the US rejoin the Iran Nuclear Deal before 2023?\nLikelihood: 55%\n\nEvent: Will Charles He get banned from the EA forum in the next 30 days?\nLikelihood:",
    temperature=0.7,
    max_tokens=256,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
)
Also, I have several comments on your prompt that you might have thought of already:
As you noted, your first “long” prompt is long, and in this case that impedes GPT-3’s performance. In my own words, I would say it makes it harder for GPT-3 to “construct” the framing involved. https://github.com/MperorM/gpt3-metaculus/blob/main/gpt_prompt.py
For the shorter prompt you ended up using, I think you might get different results by changing the questions to be closer to the target domain, or to “cover more domain space” and “loosen up the probability space”.
Your questions are somewhat “dry”, and 2 of the 3 cover geopolitical issues. If you made the examples a little more varied or “dramatic”, or had the prompt demonstrate forecasting skill, I think you would see different results.
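To sketch what I mean by “covering more domain space” (the extra example events and their numbers below are made up by me purely for illustration, not real forecasts):

```python
# Hypothetical few-shot examples spanning several domains and a wider
# probability range than three "dry" geopolitical questions.
FEW_SHOT = [
    ("Will the cost of sequencing a human genome fall below $500 by mid 2016?", 43),
    ("Will a major Hollywood studio release a fully AI-written film by 2025?", 8),
    ("Will SpaceX land humans on Mars before 2030?", 12),
    ("Will the S&P 500 close above 5000 at any point in 2022?", 30),
]

def build_prompt(target_question, examples=FEW_SHOT):
    """Format the few-shot examples plus the target event in the same
    Event/Likelihood template as the prompt above."""
    parts = [f"Event: {q}\nLikelihood: {p}%" for q, p in examples]
    # The target event ends with a bare "Likelihood:" for GPT-3 to complete.
    parts.append(f"Event: {target_question}\nLikelihood:")
    return "\n\n".join(parts)
```

The resulting string can be passed as the `prompt` argument to the same `openai.Completion.create` call as above.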
More prompt-design tips are available from Andrew Mayne: https://andrewmayneblog.wordpress.com/
Temperature and the other parameters matter a lot too. Related to this, I think you effectively have an “N of 1”: a single sampled completion may not give much information about GPT-3’s performance.
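One way around the “N of 1” issue, as a rough sketch: sample several completions per question (the legacy API supports this via the `n` parameter of `openai.Completion.create`) and average the extracted percentages instead of trusting a single run. The helper functions below are my own, not from the original post:

```python
import re
import statistics

def extract_likelihood(completion_text):
    """Parse the first 'NN%' figure out of a completion, as a float in [0, 1].

    Returns None if the completion contains no percentage.
    """
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", completion_text)
    if match is None:
        return None
    return float(match.group(1)) / 100.0

def aggregate_likelihoods(completions):
    """Average the parsed likelihoods across several sampled completions,
    skipping any completion with no parseable percentage."""
    values = [extract_likelihood(c) for c in completions]
    values = [v for v in values if v is not None]
    return statistics.mean(values) if values else None
```

With `n=10` in the API call, something like `aggregate_likelihoods([c.text for c in response.choices])` would give a mean estimate rather than a single draw; the spread across samples also hints at how sensitive the answer is to temperature.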