LLMs seem more like low-level tools to me than direct human interfaces.
Current models suffer from hallucinations, sycophancy, and numerous errors, but can be extremely useful when integrated into systems with redundancy and verification.
We’re in a strange stage now where LLMs are powerful enough to be useful, but too expensive and slow to wrap in rich scaffolding and redundancy. So we bring this error-prone low-level tool straight to the user, for the moment, while waiting for the technology to improve.
Using today’s LLM interfaces feels like writing SQL commands directly instead of using a polished web application. It’s functional if that’s all you have, but it’s probably temporary.
Imagine what might happen if/when LLMs are 1000x faster and cheaper.
Then, answering a question might involve (see the rough sketch after this list):
Running ~100 parallel LLM calls with various models and prompts
Using aggregation layers to compare responses and resolve contradictions
Identifying subtasks and handling them with specialized LLM batches and other software
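Here is a minimal sketch of that fan-out/aggregate pattern, assuming a hypothetical call_model wrapper around whatever LLM API you use, and using simple majority voting as the aggregation step purely for illustration; a real system would likely use further LLM passes to reconcile contradictions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def call_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; returns the model's answer text."""
    raise NotImplementedError("plug in your provider's client here")

def answer_question(question: str, models: list[str], samples_per_model: int = 10) -> str:
    # Fan out: many parallel calls across several models (and, in a fuller
    # version, varied prompts as well).
    tasks = [(m, f"Answer concisely: {question}")
             for m in models for _ in range(samples_per_model)]
    with ThreadPoolExecutor(max_workers=32) as pool:
        answers = list(pool.map(lambda t: call_model(*t), tasks))

    # Aggregate: take the most common answer. Ties or contradictions would go
    # to a second "judge" pass; subtasks would be routed to specialized batches.
    return Counter(answers).most_common(1)[0][0]
```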
Big picture, I think researchers might focus less on making sure any one LLM call is great, and more on making sure these broader setups work effectively.
(I realize this has some similarities to Mixture of Experts)