My experience is similar. LLMs are powerful search engines but nearly completely incapable of thinking for themselves. I use these custom instructions for ChatGPT to make it much more useful for my purposes:
When asked for information, focus on citing sources, providing links, and giving direct quotes. Avoid editorializing, doing original synthesis, or giving opinions. Act like a search engine. Act like Google.
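For anyone who wants the same behavior through the API rather than ChatGPT's custom-instructions setting, here is a minimal sketch that passes search-engine-style instructions as a system prompt via the OpenAI Python SDK. The model name and exact prompt wording are illustrative assumptions, not the setup described above.

```python
# Minimal sketch: applying "act like a search engine" instructions as a system
# prompt via the OpenAI Python SDK. The model name and prompt wording are
# illustrative assumptions, not the commenter's exact ChatGPT configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SEARCH_ENGINE_INSTRUCTIONS = (
    "When asked for information, focus on citing sources, providing links, "
    "and giving direct quotes. Avoid editorializing, doing original synthesis, "
    "or giving opinions. Act like a search engine. Act like Google."
)

def search_style_answer(question: str) -> str:
    """Ask a question with the search-engine-style system prompt applied."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; substitute whichever you use
        messages=[
            {"role": "system", "content": SEARCH_ENGINE_INSTRUCTIONS},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(search_style_answer("What does the published literature say about X?"))
```

Putting the instructions in the system role, rather than repeating them in every user message, keeps each question in the conversation subject to the same "cite and quote, don't editorialize" constraint.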
There are still limitations:
You still have to manually check the cited links to verify the information yourself.
ChatGPT is, for some reason, really bad at actually linking to the correct webpage it's quoting from. This wastes time and is frustrating.
ChatGPT is limited to short quotes and often gives even shorter quotes than necessary, which is annoying. This often makes it hard to understand what the quote actually says, which almost defeats the purpose.
It's common for ChatGPT to misunderstand what it's quoting and take something out of context, or to quote something inapplicable. This often isn't obvious until you actually go check the source (especially with the truncated quotes). You can get tricked by ChatGPT this way.
Every once in a while, ChatGPT completely fabricates or hallucinates a quote or a source.
The most one-to-one analogy for LLMs in this use case is Google. Google is amazingly useful for finding webpages. But when you Google something (or search on Google Scholar), you get a list of results, many of which are not what you're looking for, and you have to pick which results to click on. And then, of course, you actually have to read the webpages or PDFs. Google doesn't think for you; it's just an intermediary between you and the sources.
I call LLMs "SuperGoogle" because they can do semantic search on hundreds of webpages and PDFs in a few minutes while you're doing something else. Using LLMs as search engines is a genuine innovation.
On the other hand, when I've asked LLMs to respond to the reasoning or argument in a piece of writing, or even just do proofreading, they have given incoherent responses, e.g. making hallucinatory "corrections" to words or sentences that aren't in the text they've been asked to review. Run the same text by the same LLM twice and it will often give the opposite opinion of the reasoning or argument. The output is also often self-contradictory, incoherent, incomprehensibly vague, or absurd.
Thanks for your comment!
I would agree that LLMs are much stronger at finding and summarizing information than at original thought (which is a big limitation for red teaming). However, we've gotten a lot of utility out of having "SuperGoogle" research a topic and then look for ways that our intervention reports differ from the published literature. You could argue this is still SuperGoogle behavior (search + comparison) rather than genuine critical thinking, but for our purposes, that's been enough to surface a handful of worthwhile leads per intervention.
This is why we've found that AI red teaming works best for interventions that are well covered in the published literature but where we at GiveWell haven't done as much research ourselves (like syphilis). It doesn't work as well for interventions where we've done a lot of research (like insecticide-treated bed nets) or that are relatively new and don't have as much published literature (like malaria vaccines).