I’m not an expert on most of the evidence in this post, but I’m extremely suspicious of the claim that GPT-4 represents AI that is “~ human level at language”, unless you mean something by this that is very different from what most people would expect.
Technically, GPT-4 is superhuman at language, because whatever task you give it is in English, and the median human’s English proficiency is roughly nil. But a more commonsense interpretation of the statement is that a prompt-engineered AI and a trained human can do a given task roughly equally well.
What you link to shows how GPT-4 performs on a bunch of different exams. This doesn’t really reflect how language is used in the real world, especially since the exams closely match past exams that were in the training data. GPT-4 is good at some of them, but also extremely bad at others (AP English Literature and Codeforces in particular), which is a problem if the claim is that it’s roughly human level.
Furthermore, language isn’t just putting words together in the right order with the right inflection. It also includes semantic information (what the sentences actually mean) and pragmatic information (whether the language conveys what the speaker intends, not just the literal meaning; e.g. recognizing that “Can you pass the salt?” is a request, not a question about ability). I’m not sure whether pragmatics in particular is relevant for AI risk, but the fact that anecdotally even GPT-4 is pretty bad at pragmatics rules out a literal interpretation of your statement.
In my opinion, the best evidence that GPT-4 is not human level at language is that, in the real world, GPT-4 is much cheaper than a human yet consistently unable to outcompete humans. News organizations have a strong incentive to overhype GPT-caused automation, but the examples they’ve found are mostly of people saying that either GPT-4 or GPT-3 (it’s not always clear which) did their job much worse than they could, but well enough for clients. Take https://www.washingtonpost.com/technology/2023/06/02/ai-taking-jobs/ as a typical story.
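To make “much cheaper than a human” concrete, here’s a rough back-of-the-envelope sketch in Python. The API prices are GPT-4’s launch pricing for the 8K-context model ($0.03 per 1K prompt tokens, $0.06 per 1K completion tokens); the token counts and the freelance rate are illustrative assumptions of mine, not figures from the article.

```python
# Back-of-the-envelope: drafting a ~1,000-word article with GPT-4
# vs. hiring a freelance writer.

PROMPT_PRICE = 0.03 / 1000      # dollars per prompt token (GPT-4 8K, launch pricing)
COMPLETION_PRICE = 0.06 / 1000  # dollars per completion token

prompt_tokens = 500       # a detailed brief (rule of thumb: ~1.3 tokens per word)
completion_tokens = 1300  # ~1,000 words of output

gpt4_cost = prompt_tokens * PROMPT_PRICE + completion_tokens * COMPLETION_PRICE

# Illustrative assumption: a freelance writer charging $0.10/word.
human_cost = 1000 * 0.10

print(f"GPT-4: ${gpt4_cost:.2f}")  # ~$0.09
print(f"Human: ${human_cost:.2f}")  # $100.00
```

Even if every number here is off by an order of magnitude, the gap is roughly a thousandfold, which is exactly what makes “cheaper but still not outcompeting humans” informative about quality.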
Exams aren’t exactly the real world, but the popular example of GPT-4 doing well on exams is https://www.slowboring.com/p/chatgpt-goes-to-harvard. That piece both ignores citations (a very important part of college writing, one that GPT-3 couldn’t handle at all and that GPT-4 still does noticeably worse than I would expect from a human) and relies on the false belief that Harvard is a hard school to do well at (grade inflation!).
I still agree with two big takeaways of your post, that an AI pause would be good and that we don’t necessarily need AGI for a good future, but that’s more because those conclusions are robust to a lot of different beliefs about AI than because I agree with the evidence provided. Again, a lot of the evidence is stuff I don’t feel particularly knowledgeable about; I picked this claim because I’ve had to think about it before and because it just feels false from my experience using GPT-4.
the median human’s English proficiency is roughly nil.
GPT-4 is also proficient at many other languages, so I don’t think English is the appropriate benchmark! Is GPT-4 as good as the median human at language in general? I think yes. In fact it’s probably quite a lot better.
anecdotally even GPT-4 is pretty bad at pragmatics
Can you link to examples? Most examples I’ve seen on X are people criticising ChatGPT-3.5 (or other models), and then someone comes along and shows ChatGPT-4 getting it right!
GPT-4 or GPT-3 (it’s not always clear which)
It’s nearly always GPT-3 (or 3.5). We only need to be concerned about the best AI models, not the lower tiers! I’ve heard anecdotes, in real life, of people who are using GPT-4 to do parts of their jobs, e.g. writing long emails that their boss was impressed with (they didn’t tell the boss it was ChatGPT!)
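For what it’s worth, the workflow in these anecdotes takes only a few lines. A minimal sketch, using the ChatCompletion interface the openai Python client offered when GPT-4 launched (the library has since changed); the email brief is invented for illustration:

```python
import openai  # pip install openai (the 0.x client, current as of GPT-4's launch)

openai.api_key = "sk-..."  # your API key

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You draft polite, professional workplace emails."},
        {"role": "user",
         "content": "Write a status update to my boss on the Q3 report: "
                    "data collection done, analysis ~80% complete, "
                    "full draft expected Friday."},
    ],
)
print(response.choices[0].message.content)  # the drafted email
```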
false belief that Harvard is a hard school to do well at
Harvard is one of the best schools in the world. The average human is quite far from being smart enough to get into it. I don’t think saying this helps the credibility of your argument! It seems a lot like goalpost moving.
I still agree with two big takeaways of your post, that an AI pause would be good and that we don’t necessarily need AGI for a good future
Thanks, good to know :)
GPT-4 is clearly above the median human when it comes to a range of exams. Do we have examples of GPT-4’s performance compared to the median human in non-exam-like conditions?