Love this response, especially the reframing that we often keep tasks bounded not because we want low-agency systems, but because we assume “extra initiative” will go wrong unless we trust the agent’s broader competence. That feels very true in real-world settings, not just theory.
I’ve been exploring how this plays out from more of a psychological and design angle, especially how internal motivations might shift before there’s any visible misbehavior. Some recent work I’ve been reading (like Timaeus) looks at developmental interpretability, and it’s helped me think about agents as growing systems rather than just fixed tools.
I’d be curious to hear what you think about telling the AI: “Don’t just do this task, but optimize broadly on my behalf.” When does that start to cross into dangerous ground?
I think we’re already seeing a lot of empirical data on this with AI coding assistants. Much of the human role now is oversight: it’s not hard to say, “Recommend 5 things to improve. Order them. Do them,” with very little human input.
There’s also work on keeping the AI’s scope bounded and routing anything potentially dangerous through a human for approval. That seems like a good workflow to me.
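As a rough sketch, that bounded-scope workflow might look something like this: the agent proposes actions, safe ones run automatically, and anything flagged as risky is gated on explicit human approval. All the names here (`Action`, `run_with_oversight`, `human_approves`) are hypothetical, for illustration only, not a real API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    description: str
    risky: bool  # e.g. writes files, runs shell commands, touches prod

def run_with_oversight(actions: list[Action],
                       human_approves: Callable[[Action], bool]) -> list[str]:
    """Execute safe actions automatically; gate risky ones on a human."""
    log = []
    for action in actions:
        if action.risky and not human_approves(action):
            # Human vetoed (or never approved) a risky action: skip it.
            log.append(f"SKIPPED: {action.description}")
            continue
        log.append(f"EXECUTED: {action.description}")
    return log

# Example run where the human approves nothing risky:
log = run_with_oversight(
    [Action("rename variable", risky=False),
     Action("delete old branch", risky=True)],
    human_approves=lambda a: False,
)
```

The key design choice is that the approval check sits outside the agent: the gate doesn't depend on the model correctly judging its own actions as dangerous.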
I like how you framed this. Delegating initiative to AI becomes risky once we trust it to optimize broadly on our behalf. That trust boundary is hard to calibrate.
I’m experimenting with using frameworks like my own (VSPE) to help the model “know when to stop” and keep its helpfulness from tipping into distortion. Your workflow sketch makes a lot of sense as a starting point!