Thanks. I didn’t understand all of this. Long reply with my reactions incoming, in the spirit of Socratic Grilling.
1. They may imitate the behavior of a consequentialist.
This implies a jump by the language model from outputting text to having behavior. (A jump from imitating verbal behavior to imitating other behavior.) It’s that very jump that I’m trying to pin down and understand.
2. They may be used to predict which actions would have given consequences, decision-transformer style (“At 8 pm X happened, because at 7 pm ____”).
I can see how this could produce an oracle for an actor in the world (such as a company or person), but not how the model itself would become such an actor. Still, having an oracle would be dangerous, even if not as dangerous as having an oracle that itself takes actions. (Ah—but this makes sense in conjunction with number 5, the ‘outer loop’.)
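To check that I’m picturing the ‘decision-transformer style’ use correctly, here’s a toy sketch of what I have in mind: condition on the stated outcome and ask which earlier action the model finds most plausible. (GPT-2, the prompt, and the candidate actions are arbitrary placeholders I made up.)

```python
# Sketch: query a causal LM "decision-transformer style" -- condition on a
# desired outcome and rank candidate earlier actions by how likely the model
# finds the resulting "outcome, because action" sentence. GPT-2 is used here
# purely as a stand-in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

outcome = "At 8 pm the outage was resolved, because at 7 pm"
candidate_actions = [
    " the on-call engineer restarted the database.",
    " everyone went home for the day.",
    " the team ordered pizza.",
]

def sequence_log_likelihood(text: str) -> float:
    """Average per-token log-likelihood the LM assigns to the whole sentence."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return -out.loss.item()  # loss is mean negative log-likelihood per token

scores = {a: sequence_log_likelihood(outcome + a) for a in candidate_actions}
best = max(scores, key=scores.get)
print(f"Model's preferred 'action given outcome': {best!r}")
```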
3. A sufficiently powerful language model is expected to engage in some consequentialist cognition in order to make better predictions, and this may generalize in unpredictable ways.
‘reasoning about how one’s actions affect future world states’ - is that an OK gloss of ‘consequentialist cognition’? See comments from others attempting to decipher quite what this phrase means.
Interesting to posit a link from CC ⇒ making better predictions. I can see how that’s one step closer to optimizing over future world states. The other steps seem missing—I take it they are meant to be covered by ‘generalizing in unpredictable ways’?
Or did you mean something stronger by CC: goal-directed behaviour? In other words, that a very, very powerful language model would have learned from its training to take real-world actions in service of the goal of next-token prediction? This makes sense to me (though as you say it’s speculative).
4. You can fine-tune language models with RL to accomplish a goal, which may end up selecting and emphasizing one of the behaviors above (e.g. the consequentialism of the model is redirected from next-word prediction to reward maximization; or the model shifts into a mode of imitating a consequentialist who would get a particularly high reward). It could also create consequentialist behavior from scratch.
I’d probably need more background knowledge to understand this. Namely, some examples of LMs being fine-tuned to act in service of goals. That sounds like it would cut the Gordian knot of my question by simply providing an existence proof rather than answering the question with arguments.
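In the meantime, here’s the kind of toy sketch I have in mind when I read point 4, in case it helps pin down what I’m asking for a real example of: a bare-bones REINFORCE loop that nudges GPT-2 toward a reward I invented. (Real setups like RLHF use learned reward models and PPO; GPT-2, the prompt, and the “profit” reward are all placeholders, just meant to show the shape of “LM + reward ⇒ goal-directed fine-tuning”.)

```python
# Hedged sketch of RL fine-tuning a language model toward a toy goal:
# REINFORCE on GPT-2 with a made-up reward that pays off when the sampled
# continuation mentions the word "profit". Illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

prompt = "Quarterly report: the company"
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids

def toy_reward(text: str) -> float:
    return 1.0 if "profit" in text.lower() else 0.0

for step in range(10):  # tiny loop, illustration only
    # Sample a continuation from the current policy (the LM itself).
    sample = model.generate(
        prompt_ids, do_sample=True, max_new_tokens=20,
        pad_token_id=tokenizer.eos_token_id,
    )
    continuation = sample[0][prompt_ids.shape[1]:]
    reward = toy_reward(tokenizer.decode(continuation))

    # REINFORCE: scale the log-probability of the sampled tokens by the reward.
    logits = model(sample).logits[:, :-1, :]
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, sample[:, 1:].unsqueeze(-1)).squeeze(-1)
    continuation_log_prob = token_log_probs[:, prompt_ids.shape[1] - 1:].sum()

    loss = -reward * continuation_log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```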
5. An outer loop could use language models to predict the consequences of many different actions and then select actions based on their consequences.
This one is easy to understand :)
Where should I look for these?
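As a sanity check on my understanding of the outer loop, here’s roughly how I picture it: the model only predicts consequences, and a small wrapper script does the selecting. (GPT-2 and the keyword scorer are placeholders I chose; the consequentialism lives in the loop, not in the model.)

```python
# Rough sketch of the 'outer loop': the language model predicts consequences
# of candidate actions; an external loop scores the predictions and selects
# an action. GPT-2 and the toy scorer are stand-ins for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

candidate_actions = [
    "raise prices by 10%",
    "launch a new advertising campaign",
    "lay off half the engineering team",
]

def predict_consequence(action: str) -> str:
    """Use the LM purely as a predictor: 'if we do X, then ...'."""
    prompt = f"If the company decides to {action}, the most likely result is that"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, do_sample=True, max_new_tokens=30,
                             pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0][ids.shape[1]:], skip_special_tokens=True)

def score(consequence: str) -> float:
    """Toy utility function living entirely outside the model."""
    good, bad = ["profit", "growth", "revenue"], ["lawsuit", "backlash", "loss"]
    text = consequence.lower()
    return sum(w in text for w in good) - sum(w in text for w in bad)

# The outer loop: predict, score, select.
predictions = {a: predict_consequence(a) for a in candidate_actions}
chosen = max(predictions, key=lambda a: score(predictions[a]))
print("Chosen action:", chosen)
```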
Bump — It’s been a few months since this was written, but I think I’d benefit greatly from a response and have revisited this post a few times hoping someone would follow up to David’s question, specifically:
“This implies a jump by the language model from outputting text to having behavior. (A jump from imitating verbal behavior to imitating other behavior.) It’s that very jump that I’m trying to pin down and understand.”
(or if anyone knows a different place where I might find something similar, links are super appreciated too!)