Transcribe human speech with a variety of accents in a noisy environment as well as a typical human can.
As a data point, it seems to me that OpenAI’s Whisper large model is probably above typical human transcription quality for standard accents in non-noisy environments. E.g. it correctly transcribes “Hyderabad” from here (while YouTube transcribes it as “hyper bus and”).[1]
For “noisy environments with a variety of accents”, it was surprisingly hard to find a sample. From this, it generates this, which does seem worse than a typical human, so I would also resolve this as “false” if OpenAI’s Whisper is the state of the art, though it does seem fairly close.
I counted “translate as well as bilingual humans” as true based on a few quick tests of ChatGPT; I’m curious if you have some specific source for why it’s false.
As another data point, for English <-> Italian it’s usually better than me. But it really struggles with things like idioms.
Here’s the full transcription of that talk. (It does transcribe “Jacy” as “JC”, but I still think the typical human would have made more mistakes, or at the very least it seems close.)
I think the question says: