I tried the Anthropic model on this dataset with roughly your prompt, and it’s much better in terms of KL divergence between its predictions and the Manifold probabilities. Putting 10 web search results in the prompt improves performance further. But the no search → search gain is smaller than the GPT-3 → Anthropic gain, I’d say mainly because many of the search results are unhelpful.
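For anyone wanting to reproduce the comparison, here is a minimal sketch of the metric as I read it: each question is treated as a binary outcome and the score is the mean KL divergence from the market probability to the model's probability. The direction KL(market || model), the clipping constant `eps`, and the function names `bernoulli_kl` and `mean_kl` are my assumptions, not something fixed by the original setup.

```python
import math

def bernoulli_kl(p, q, eps=1e-6):
    """KL(Bernoulli(p) || Bernoulli(q)) in nats.

    Assumption: p is the market probability, q is the model's prediction.
    Both are clipped to [eps, 1 - eps] so the logs stay finite.
    """
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def mean_kl(market_probs, model_probs):
    """Average per-question KL divergence over a dataset (lower is better)."""
    if len(market_probs) != len(model_probs):
        raise ValueError("need one model prediction per market probability")
    if not market_probs:
        raise ValueError("empty dataset")
    total = sum(bernoulli_kl(p, q) for p, q in zip(market_probs, model_probs))
    return total / len(market_probs)
```

Computing `mean_kl` once per condition (GPT-3, Anthropic without search, Anthropic with search) against the same market probabilities gives numbers that can be compared directly. Swapping the arguments gives the other KL direction, which penalizes overconfident predictions differently.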