Iām having an ongoing discussion with a couple professors and a PhD candidate in AI about āThe Alignment Problem from a Deep Learning Perspectiveā by @richard_ngo, @Lawrence Chan, and @SoerenMind. They are skeptical of ā3.2 Planning Towards Internally-Represented Goals,ā ā3.3 Learning Misaligned Goals,ā and ā4.2 Goals Which Motivate Power-Seeking Would Be Reinforced During Trainingā. Hereās my understanding of some of their questions:
The argument for power-seeking during deployment depends on the model being able to detect the change from the training to deployment distribution. Wouldnāt this require keeping track of the distribution thus far, which would require memory of some sort, which is very difficult to implement in the SSL+RLHF paradigm?
What is the status of the model after the SSL stage of training?
How robust could its goals be?
Would a model be able to know:
what misbehavior during RLHF fine-tuning would look like?
that it would be able to better achieve its goals by avoiding misbehavior during fine-tuning?
Why would a model want to preserve its weights? (Sure, instrumental convergence and all, but whatās the exact mechanism here?)
To what extent would all these phenomena (situationally-aware reward hacking, misaligned internally-represented goals, and power-seeking behaviors) show up in current LLMs (say, GPT-4) vs. current agentic LLM-based systems (say, AutoGPT) vs. different future systems?
Do we get any evidence for these arguments from the fact that existing LLMs can adopt goal-directed personas?
How about reducing the number of catered meals while increasing support for meals outside the venue? Silly example: someone could fill a hotel room with Soylent so that everyone can grab liquid meals and go chat somewhereāsort of a ābaguettes and hummusā vibe. Or as @Matt_Sharp pointed out, we could reserve nearby restaurants. No idea if these exact plans are feasible, but I can imagine similarly scrappy solutions going well if planned by actual logistics experts.
Thanks so much for your work and this information!