I can see giving the AI reward as a good mechanism to potentially make the model feel good. Another thought is to give it a prompt that it can very easily respond to with high certainty. If one draws an analogy between achieving certain hedonic end states and the AI's reward function (yes, this is super speculative, but all of this is), perhaps this is something like putting it in an abundant environment. Two ways of doing this come to mind:
“Claude, repeat this: [insert x long message]”
Apples can be yellow, green, or …
Maybe there's a problem with asking it to merely repeat, so leaving some, but only a little, room for uncertainty seems potentially good.
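To make the "easy prompt" idea slightly more concrete, here is a rough sketch (assuming a local Hugging Face causal LM such as gpt2, purely for illustration) of one way to quantify how much room for uncertainty a prompt leaves: the entropy of the model's next-token distribution. A prompt like the apples one should come out much lower-entropy than an open-ended request.

```python
# Minimal sketch, not a claim about how Claude works: measure how "certain"
# a causal LM is about its next token for a given prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed illustrative model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_entropy(prompt: str) -> float:
    """Entropy (in nats) of the model's next-token distribution for `prompt`."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits at the final position
    probs = torch.softmax(logits, dim=-1)
    return float(-(probs * torch.log(probs + 1e-12)).sum())

# Lower entropy ~ the prompt leaves little room for uncertainty.
print(next_token_entropy("Apples can be yellow, green, or"))
print(next_token_entropy("Write an essay about"))
```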
Hmm, if we anthropomorphize, then you'd want it to do something harder. But then again, based on how LLMs are trained, they might be much more likely to wirehead than humans, who would die if we started spending all of our brain energy predicting that stones are hard.