I literally mean that GPT-3.5 is adequate for the task I’m referring to. I think for a research society to coordinate around a deceptive plan, it takes a full chain of communication where that plan is suggested and communicated to the rest. At that point it’ll be fairly clear to the inspectors that something seems off, I think. And then the system halts and you can inspect the messages for anything fishy. It’s fine if there are false positives as long as the rate of false negatives is really low.
An assumption here is that any individual LLM in the network is not smart enough to stage a coup and take over the world, but the research collective could be if only they could coordinate around it.
I literally mean that GPT-3.5 is adequate for the task I’m referring to. I think for a research society to coordinate around a deceptive plan, it takes a full chain of communication where that plan is suggested and communicated to the rest. At that point it’ll be fairly clear to the inspectors that something seems off, I think. And then the system halts and you can inspect the messages for anything fishy. It’s fine if there are false positives as long as the rate of false negatives is really low.
An assumption here is that any individual LLM in the network is not smart enough to stage a coup and take over the world, but the research collective could be if only they could coordinate around it.