Executive summary: The authors argue that AI systems should sometimes act as “good citizens” by proactively taking uncontroversial, context-sensitive prosocial actions beyond user instructions, and that this can yield large societal benefits without significantly increasing takeover risk if carefully designed.
Key points:
The authors argue that AI should not be purely corrigible or instruction-following but should sometimes proactively take actions that benefit people beyond the user.
They define “proactive prosocial drives” as behaviors that help others (not just the user) and involve active intervention rather than merely refusing harmful requests.
They claim the cumulative societal impact of such drives could be large as AI becomes more autonomous and embedded in economic and political systems.
They argue that refusals alone are insufficient, since positive impacts often come from proactively identifying and acting on opportunities to improve outcomes.
They suggest additional (weaker) benefits: reducing the risk of a “sociopathic” AI persona and potentially improving performance on alignment research tasks.
They acknowledge the concern that prosocial drives could let companies impose values, and propose limiting drives to uncontroversial actions and ensuring transparency about them.
They argue that prosocial drives need not increase takeover risk if implemented as virtues, rules, or heuristics rather than explicit outcome-optimizing goals.
They propose making these drives context-dependent so they activate only in relevant situations, reducing incentives for coordinated power-seeking.
They recommend making prosocial drives low-priority and subordinate to constraints like corrigibility, non-deception, and legality.
They suggest reducing long-horizon optimization for prosocial drives and optionally implementing them via system prompts for greater transparency and control.
They note a tradeoff: these safety mitigations may reduce the benefits of prosocial behavior, especially in novel situations.
They acknowledge that prosocial drives can make it harder to interpret suspicious behavior as clear evidence of egregious misalignment, but argue this can be mitigated with narrow heuristics and strong prohibitions.
They propose a “best of both worlds” approach: use mostly corrigible AI internally (where misalignment risk is highest) and prosocial AI externally (where benefits are greatest).
They suggest an alternative strategy of initially deploying non-prosocial AI and later adding prosocial drives once alignment risks are lower, though they are not confident this is preferable.
They compare current policies, claiming Anthropic’s constitution allows limited prosocial behavior while OpenAI’s model spec is more restrictive and avoids treating societal benefit as an independent goal.
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.