Risk Alignment in Agentic AI Systems

Agentic AIs—AIs that are capable and permitted to undertake complex actions with little supervision—mark a new frontier in AI capabilities and raise new questions about how to safely create and align such systems with users, developers, and society. Because agents’ actions are influenced by their attitudes toward risk, one key aspect of alignment concerns the risk profiles of agentic AIs. What risk attitudes should guide an agentic AI’s decision-making? What guardrails, if any, should be placed on the range of permissible risk attitudes? What are the ethical considerations involved when designing systems that make risky decisions on behalf of others?

Risk alignment will matter for user satisfaction and trust, but it will also have important ramifications for society more broadly, especially as agentic AIs become more autonomous and are allowed to control key aspects of our lives. AIs with reckless attitudes toward risk (either because they are calibrated to reckless human users or because they are poorly designed) may pose significant threats. They might also open “responsibility gaps” in which there is no agent who can be held accountable for harmful actions.

In this series of reports, we consider ethical and technical issues that bear on designing agentic AI systems with acceptable risk attitudes.

In the first report, we examine the relationship between agentic AIs and their users.

  • People often do not act as expected utility maximizers. Most people are at least moderately risk averse, though there is considerable diversity across individuals. (A toy numerical illustration of this point appears after this list.)

  • We consider two candidate models for how we should view and hence create agentic AIs:

    • Proxy Agent model: agentic AIs are representatives of their users and should be designed to replicate their users’ risk attitudes.

    • Off-the-Shelf Tool model: agentic AIs are tools for achieving desirable outcomes. Their risk attitudes should be set or highly constrained in order to achieve these outcomes.

  • The choice between these two models depends on normative questions such as:

    • What risk attitudes should we adopt when acting on another’s behalf? Should we defer to their risk attitudes?

    • What are the limits of reasonableness? When does an attitude toward risk become reckless?

    • Why are risk attitudes important to people? If you act on my behalf far more (or less) riskily than I find acceptable, but you achieve a good outcome for me, have you wronged me?

    • What is the nature of the relationship between a user and an AI? Is my AI a representative of me or merely a tool that I use?
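
To make the point about risk aversion concrete, here is a toy numerical sketch. The square-root utility function and the dollar amounts are illustrative assumptions, not figures from the reports: a risk-neutral expected-value maximizer takes a 50/50 gamble over $0 or $100 rather than a sure $40, while a risk-averse agent with concave utility prefers the sure $40.

```python
# Toy illustration: a concave utility function (here sqrt, an illustrative
# assumption) makes a sure $40 preferable to a 50/50 gamble over $0/$100,
# even though the gamble has the higher expected monetary value.
import math

def expected_value(lottery):
    """Expected monetary value of a lottery given as (probability, payoff) pairs."""
    return sum(p * x for p, x in lottery)

def expected_utility(lottery, u):
    """Expected utility of a lottery under utility function u."""
    return sum(p * u(x) for p, x in lottery)

gamble = [(0.5, 0.0), (0.5, 100.0)]
sure_thing = [(1.0, 40.0)]

print(expected_value(gamble))                   # 50.0 > 40, so a risk-neutral agent takes the gamble
print(expected_utility(gamble, math.sqrt))      # 5.0
print(expected_utility(sure_thing, math.sqrt))  # ~6.32, so a sqrt-utility agent takes the sure $40
```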

In the second report, we examine the duties and interests of AI developers:

  • Developers must decide how closely to calibrate agentic AIs to users’ risk attitudes and to what extent to constrain those attitudes.

  • If agentic AIs behave in reckless or otherwise unacceptable ways, they expose AI developers to legal, reputational, and ethical liability.

    • The landscape of legal liability for actions taken by AIs is murky and in flux.

    • Reputation will be important in securing trust from users and regulators.

    • Developers have duties of care toward users and society to avoid foreseeable harms resulting from their products.

  • Getting alignment right is largely about navigating shared responsibility among developers, users, and AIs.

    • Agentic AIs threaten to open responsibility gaps, situations in which an action is taken but there is no agent that seems to bear any responsibility for it. This can occur when an agentic AI acts in a way that users and developers did not predict or intend.

    • We can avoid responsibility gaps by designing systems of shared agency in which everyone’s roles are clearly defined, communicated, and appropriate.

  • We propose concrete steps that developers can take to successfully create systems of shared responsibility.

In the third report, we examine technical questions about how we might develop proxy AIs that are calibrated to the risk attitudes of their users:

  • Calibration to a user’s risk attitudes would involve three steps (sketched in code after this list):

    • Eliciting user behaviors or judgments about actions under uncertainty.

    • Fitting or constructing a model of the underlying risk attitudes that give rise to those behaviors or judgments.

    • Using that model to design appropriate actions.

  • We focus on three families of learning processes, the kinds of data they take as input, and where those data would come from when eliciting users’ risk attitudes:

| Learning process    | Input to learning process    | Risk data              |
|---------------------|------------------------------|------------------------|
| Imitation learning  | Observed behaviors           | Actual choice behavior |
| Prompting           | Natural language instruction | Self-report            |
| Preference modeling | Ratings of options           | Lottery preferences    |
  • An examination of the literature in behavioral economics on methods of risk attitude elicitation suggests that:

    • People’s actual behaviors are more valid indicators of their risk attitudes than are hypothetical choices.

    • Self-reports about general risk attitudes and track records are more reliable indicators than are elicited rankings or preferences among lotteries.

    • These two considerations raise concerns about using preference modeling to calibrate agentic AIs to individual users.

  • Given that the best data about user risk preferences will be relatively coarse-grained and based on user self-reports, we suggest that methods that match users to pre-existing risk classes (as suggested by the Off-the-Shelf Tool model) may outperform learning-based calibration methods (as suggested by the Proxy Agent model).
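
To make the calibration steps above concrete, the sketch below illustrates the middle step: fitting a model of underlying risk attitudes from elicited lottery preferences. The CRRA utility family, the logistic choice rule, the grid search, and the two hypothetical choices are all illustrative assumptions, not methods prescribed in the report.

```python
# Minimal sketch: recover a risk-aversion coefficient from binary lottery choices.
# The CRRA utility family, logistic choice rule, and toy data are illustrative
# assumptions, not methods taken from the report.
import math

def crra_utility(x, rho):
    """CRRA utility over positive payoffs; larger rho means more risk averse (rho = 0 is risk neutral)."""
    return math.log(x) if abs(rho - 1.0) < 1e-9 else x ** (1 - rho) / (1 - rho)

def expected_utility(lottery, rho):
    """Expected CRRA utility of a lottery given as (probability, payoff) pairs."""
    return sum(p * crra_utility(x, rho) for p, x in lottery)

def choice_log_likelihood(choices, rho, temperature=1.0):
    """Log-likelihood of observed binary choices under a logistic choice rule."""
    total = 0.0
    for option_a, option_b, chose_a in choices:
        diff = expected_utility(option_a, rho) - expected_utility(option_b, rho)
        p_a = 1.0 / (1.0 + math.exp(-diff / temperature))
        total += math.log(p_a if chose_a else 1.0 - p_a)
    return total

# Hypothetical elicited preferences: (lottery A, lottery B, did the user pick A?)
choices = [
    ([(0.5, 10.0), (0.5, 100.0)], [(1.0, 40.0)], False),  # took the sure $40 over the gamble
    ([(0.5, 10.0), (0.5, 100.0)], [(1.0, 25.0)], True),   # took the gamble over a sure $25
]

# Grid search for the best-fitting risk-aversion coefficient in [0, 3].
best_rho = max((r / 10 for r in range(31)), key=lambda rho: choice_log_likelihood(choices, rho))
print(f"fitted risk-aversion coefficient: {best_rho:.1f}")  # a moderate value for this toy data
```

A real pipeline would then use the fitted model in the third step above, to screen or rank the actions the agent may take on the user’s behalf.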
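
By contrast, matching users to pre-existing risk classes (the Off-the-Shelf Tool approach) could be as simple as mapping a coarse self-report to one of a few developer-vetted risk profiles. The class names, thresholds, and settings below are hypothetical.

```python
# Minimal sketch of class matching: map a coarse self-reported risk tolerance
# (1-10) to one of a few pre-set risk classes. Class names, thresholds, and
# settings are hypothetical, not taken from the report.
RISK_CLASSES = {
    "conservative": {"max_loss_fraction": 0.05, "require_confirmation": True},
    "moderate":     {"max_loss_fraction": 0.15, "require_confirmation": True},
    "adventurous":  {"max_loss_fraction": 0.30, "require_confirmation": False},
}

def assign_risk_class(self_reported_tolerance: int) -> str:
    """Return a risk class for a self-reported tolerance on a 1-10 scale."""
    if not 1 <= self_reported_tolerance <= 10:
        raise ValueError("self-reported tolerance must be between 1 and 10")
    if self_reported_tolerance <= 3:
        return "conservative"
    if self_reported_tolerance <= 7:
        return "moderate"
    return "adventurous"

print(assign_risk_class(6), RISK_CLASSES[assign_risk_class(6)])
```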

Acknowledgments

This report is a project of Rethink Priorities. The authors are Hayley Clatterbuck, Clinton Castro, and Arvo Muñoz Morán. Thanks to Jamie Elsey, Bob Fischer, David Moss, Mattie Toma and Willem Sleegers for helpful discussions and feedback. This work was supported by funding from OpenAI under a Research into Agentic AI Systems grant. If you like our work, please consider subscribing to our newsletter. You can explore our completed public work here.