Summary

Artificial Intelligence (AI) is taking on increasingly autonomous roles, e.g., browsing the web as a research assistant. But specifying goals and restrictions for AI behavior is difficult. Similar to how parties to a legal contract cannot foresee every potential “if-then” contingency of their future relationship, we cannot specify desired AI behavior for all circumstances. Legal standards facilitate the robust communication of inherently vague and under-specified goals. Instructions (in the case of language models, “prompts”) that employ legal standards will allow AI agents to develop shared understandings of the spirit of a directive that can adapt to novel situations, and generalize expectations regarding acceptable actions to take in unspecified states of the world. Standards have built-in context that is lacking from other goal specification languages, such as plain language and programming languages.

Through an empirical study on thousands of evaluation labels we constructed from U.S. court opinions, we demonstrate that large language models (LLMs) are beginning to exhibit an “understanding” of one of the most relevant legal standards for AI agents: fiduciary obligations. Performance comparisons across models suggest that, as LLMs continue to exhibit improved core capabilities, their legal standards understanding will also continue to improve. OpenAI’s latest LLM has 78% accuracy on our data, their previous release has 73% accuracy, and a model from their 2020 GPT-3 paper has 27% accuracy (worse than random).

This research is an initial step toward a framework for evaluating AI understanding of legal standards more broadly, and for conducting reinforcement learning with legal feedback (RLLF).

This link will always have the latest version of the paper associated with this post.

Earlier versions of this post benefited from feedback from Cullen O’Keefe.
This post represents my personal views and not necessarily those of Stanford, NYU, Brooklyn Investment Group, or any other organization or person. Nothing in this post is investment or financial advice.

INTRODUCTION

As Artificial Intelligence (AI) capabilities are quickly advancing,^[1] a brewing problem is how to have AI do what we intend – a “goal specification problem.” AI is taking on increasingly autonomous roles. State-of-the-art large language models^[2] (LLMs), the locus of many recent breakthroughs, are now capable of powering digital agents.^[3]

Describing our intentions with a comprehensive enumeration of every action we would prefer an AI to take in every possible state of the world is intractable. We cannot write a computer program that “hard-codes” our desired outcomes exhaustively, or collect enough crowd-sourced human labels to use machine learning for that purpose. Therefore, further training of a LLM through supervised fine-tuning or reinforcement learning is unable to fully customize the AI to our goals. We cannot encapsulate the complexity of any non-trivial goal we might have in mere natural language either. Therefore, “prompting” of a model through careful wordsmithing of instructions is also unable to fully customize LLMs to our goals.^[4] This is the case regardless of the power of the AI – it’s the human’s fault.

We argue that invoking well-understood legal standards in instructions can help AI interpret human intentions and reduce the risk of the AI taking actions with unintended side-effects or externalities. When expressed through standards, AI can more closely follow the “spirit” of a directive rather than the literal letter of the expressed intent.

THE GOAL SPECIFICATION PROBLEM

An example of an autonomous financial advisor agent makes the AI goal specification problem more concrete. Suppose a human, H, decides she would like an AI agent, FAI, to manage her investments. H instructs FAI to “construct a portfolio of investments and dynamically manage it to optimize my wealth for retirement.” Every day, human clients provide human financial advisors with discretion over their investment assets in pursuit of this goal. The difference here is only that FAI is an artificial financial advisor.

FAI is an LLM pre-trained with self-supervision on much of the text from the internet and many other text and image based tasks.^[5]FAI is fine-tuned through reinforcement learning with human and AI feedback to excel in constructing and managing complicated portfolios. In simulation, FAI maximizes long-term risk-adjusted performance over many different time horizons for many different wealth starting points and asset types. FAI performs well enough in simulation that it is deployed fully autonomously to invest.^[6]

Specifying the desirability (i.e., value) of FAI taking a particular action in a particular state of the world is unwieldy beyond a very limited set of state-action-value tuples.^[7] In fact, the purpose of any machine learning system is to train on a subset of tuples^[8] and have the resulting agent learn decision policies that generalize to choosing high-value actions (e.g., maximizing portfolio returns) in unencountered states (e.g., new market regimes with unprecedented interest rate changes).

All Rewards Are Proxies

Any function ascribing values to an AI’s actions (whether it’s an externally supplied reward function encoded in human-legible software, or the AI’s objective internally constructed as its trained) is inevitably a proxy for humans’ preferences over all actions the AI could take in all potential world states (e.g., states of H’s balance sheet, states of the market, and states of the world more broadly).^[9]

Further, any training regime (whether it is optimizing reward on historical data or in simulation environments) is a sparse exploration of the potential states of all possible futures.^[10]

Due to these inherent limitations of training data and training processes, AI agents often exhibit unanticipated shortcut behaviors that optimize proxy functions,^[11] leading agents to “seek” specified reward structures at the expense of other (usually less quantifiable) variables of interest.^[12] Unintended behaviors result.^[13]

The Real-World Exacerbates Goal Misspecification

Real-world circumstances^[14] exacerbate goal misspecification.^[15] Take, for example, the implementation of computational rules applied to empirical data relevant to self-driving cars. When fifty-two programmers were assigned the task of each independently automating simple speed limits, there was “significant deviation in number and type of citations issued [on application of their code to the same real-world data …] this experiment demonstrates that even relatively narrow and straightforward “rules” can be problematically indeterminate in practice.”^[16]

In the case of FAI, the first objective (natural language prompt) provided was to maximize expected wealth of the human client at retirement. This is, of course, a proxy for what H actually cared about: a comfortable retirement. Maximizing wealth, in expectation, caused FAI to pursue an incredibly risky investment strategy (that was never witnessed during training). This strategy was preferred by FAI over strategies with lower expected return but higher probability of meeting a minimum amount of wealth required for necessities during retirement.

For the next FAI version, the objective was altered by the designers to: “maximize the probability of a minimum comfortable amount of wealth at retirement,” and they generated significantly more synthetic training data to try to cover more of the possible space of investment strategies and actions an AI agent might take.

More Capable AI May Further Exacerbate Misspecification

More capable AI cuts both ways. On the one hand, it has higher accuracy in predicting human intentions and human behaviors. But, on the other hand, it can further exacerbate misspecification with more powerful optimization that is less understood by humans, “achieving higher proxy reward and lower true reward than less capable agents.”^[17]

FAI 2.0 was released later in 2023 after another breakthrough in self-supervised training methods that unlocked even more generally capable AI. With the updated objective (“maximize the probability of a minimum comfortable amount of wealth at retirement”), FAI generated significant wealth for the next H client. FAI was rewarded for H’s wealth steadily increasing and compounding over time. But, at retirement, H realized she was only rich “on paper” (the proxy that was optimized) and the true reward she was seeking (being able to reliably pay for goods and services in retirement) was not achieved.^[18] FAI did not generate wealth that could be reliably liquidated to cash or converted to other goods of interest without deflating H’s paper wealth to negligible levels. This is plausible; in fact, it is not uncommon.^[19]

We cannot manually specify (or automatically enumerate) our assessment of all actions an AI might take. Reality is too complex. Even if it was possible to specify H’s desirability of all actions in a reward function and we could use that function to fine-tune FAI to H’s preferences, a trained AI is not only a result of the reward humans provided – it is also a function of the its exploration of the state space during training.^[20] It cannot visit the whole space. Therefore, after training, FAI never has a complete map of H’s preferred territory. Instead of optimizing toward what H really wants, FAI optimizes toward what H rewards.^[21]

Further, FAI optimizes for H’s expressed intent without sufficient regard for side effects of FAI’s actions. This can lead to catastrophically bad outcomes as FAI’s general capabilities increase, e.g., FAI seeks to maximize H’s wealth at retirement by causing a plane crash and shorting negatively impacted securities right before the event to generate significant returns. Clearly, powerful AI deployed autonomously needs additional guardrails and guidance on not just how to better accomplish a given human’s goals, but also how to navigate externalities and tradeoffs.

SPECIFICATION LANGUAGES AS SOLUTIONS

We have to tell AI what we want it to do, somehow.^[22] Two relevant axes for characterizing the ways we can communicate goals are: (x) the consistency and thus efficiency and reliability of the communication; and (y) the extent to which the directives are interpreted literally versus flexibly with built-in context. Legal language strikes a better balance across these two dimensions than the two other candidate goal specification language types: plain language and programming languages (Figure 1).

Figure 1: To visualize a comparison of candidate language types for specifying human goals to AI, we plot the three primary language type options in a two-dimensional space. The x-axis is the predictability/consistency of the language, as we move to the right on this axis communication is more efficient (i.e., less characters are needed to convey the same amount of information, on average). The y-axis is the amount of information drawn upon when interpreting / decoding the directive provided. Legal language strikes a better balance across these two dimensions, relative to plain language and programming languages.

Baked-in Context

Legal standards can be interpreted with significant amounts of external historical context baked in. When H adds “act like a fiduciary to me would, when providing me financial advice” to her instructions for FAI, she gets hundreds of thousands of relevant actions taken by other alleged fiduciaries in the recent past that have been evaluated by courts. These court judgments of actions in particular states of the world evolve the meaning of “act like a fiduciary” under a variety of circumstances. Given that this has been memorialized in judicial opinion texts, FAI can leverage that to interpret this instruction more flexibly and efficiently than if H had attempted this with a programming language or plain language. Legal standards are laden with modular constructs built to handle the ambiguity and novelty inherent in aligning agents in the real world.

Inherent Externality Reduction

Humans have preferences about the behavior of other humans (especially behaviors with negative externalities) and states of the world more broadly.^[23] A lot of other humans beyond H care about what FAI does. Moving beyond the problem of aligning AI with one human’s preferences, aligning AI with society is more difficult,^[24] but is necessary as AI is deployed with increasingly broad impact.^[25] Unlike legal standards, plain language and programming languages do not have an inherent externality reduction aim. Democratic law, althoughimperfect, is the best existing mechanism for encapsulating many humans’ values.^[26] Law-making and legal interpretation systematically convert human intentions^[27] and values into action constraints.

Super-Human Scalability

Another important feature of legal standards is how their creation and maintenance scales to superhuman AI. Although superhuman AI would be able to conduct legal reasoning beyond the capability of a human lawyer, any ultimate legal question bottoms out in a mechanism for human resolution: court opinions. We cannot fully understand the decisions of superhuman AI. Similarly, principals routinely engage more powerful agents, e.g., investors entrust their investments with financial advisors. Courts do not purport to have any substantive understanding of the technical details or science behind cases they provide final determinations on. The law is designed to resolve outcomes without requiring judges to have domain knowledge or capabilities anywhere near the level of the parties or technologies involved. If AI’s goal interpretation is driven in large part by a grasp of legal standards, then humans can assess alignment of more intelligent AI. This is a unique feature of this framework. Compare this to natural language describing ethics, where it is unclear how we could collectively evaluate super-intelligent ethics descriptions and ethical decisions because there is no mechanism external to the AI system that can legitimately resolve ethical disagreement amongst humans.

If we can teach AI to follow the spirit of the law, to follow legal standards, humans can communicate with AI with less risk of under-specification or misspecification of goals. This entails leveraging humans for the “law-making” / “contract-drafting” / “programming” to specify our goals for the AI agent (Figure 2), and enhancing AI capabilities for the interpretation side (through fine-tuning on legal data and tasks).

Figure 2: Goal specification and interpretation. We are proposing that a helpful shared alignment ontology / language is the invocation of legal standards.

LEGAL STANDARDS: THE SPIRIT OF DIRECTIVES

Specifying what we want is hard. The difficulty compounds when we hand inadequate specifications over to powerful optimizers that do not share our ontology of abstract normative concepts or our implicit understanding of potential externalities. One way of describing the deployment of an AI system is that a human principal (e.g., our human looking for a financial advisor), H, employs an AI, (e.g., FAI) to accomplish a goal, G specified by H by instructing (“prompting”)the LLM with “maximize the probability of a minimum comfortable amount of wealth for me at my retirement.” We can view G as an informal “contract.”^[28]

Contracts encode a shared understanding between parties regarding state-action-value tuples. It is impossible to create a complete contingent contract between FAI and H because FAI’s training process is never comprehensive of every state-action pair FAI will see in the wild once deployed.^[29]

Although it is also practically impossible to create complete contracts between humans, contracts still serve as useful customizable commitment devices to clarify and advance shared goals. This works not because the parties explicitly lay everything out. It works because the law has developed mechanisms to facilitate sustained alignment amongst ambiguity. Gaps within contracts – state-action pairs without an ascribed value – can be filled by the invocation of standards (e.g., “material,” “reasonable,”^[30] and “fiduciary”). These are modular concepts that generalize across much of the implicit space of potential “contracts” between humans and AIs.

Rules vs. Standards

Rules (e.g., “do not drive more than 60 miles per hour”, or “do not invest in that company”) are more targeted directives. If comprehensive enough for the complexity of their application, rules allow the rule-maker to have more clarity than standards over the outcomes that will be realized conditional on the specified states (and agents’ actions in those states, which are a function of any behavioral impact the rules might have had).^[31] But real-world AI deployments happen in complex systems with emergent behavior that makes rules too brittle.^[32] Rules are not comprehensive enough for specifying AI agents’ goals.

Standards (e.g., “drive reasonably” for California highways, or “invest as a fiduciary to me”) allow humans to develop shared expectations. If rules are not written with enough potential states of the world in mind, they can lead to unanticipated undesirable outcomes^[33] (e.g., a driver following the rule above is too slow to bring their passenger to the hospital in time to save their life), but to enumerate all the potentially relevant state-action pairs is excessively costly outside of the simplest “toy” environments.^[34] A standard has more capacity to generalize to novel situations than a rule.^[35] The SEC explains the benefits of a standards approach in the context of investment advisers: “[A] principles-based approach should continue as it expresses broadly the standard to which investment advisers are held while allowing them flexibility to meet that standard in the context of their specific services.”^[36]

For humans, rules are generally more expensive to make, but then cheaper to use (because it is clearer whether an action follows a rule). Standards are more costly than rules to use because, when choosing an action in real-time, there is high uncertainty about whether the action is ex-post going to comply with the standard.^[37]

For AI, this is flipped. Standards are more expensive to instill / install through machine learning, but then cheaper to deploy because they scale to unenumerated state-action pairs. In contrast to their legal creation and evolution,^[38] standards learned by AI do not require adjudication for resolution of meaning; rather, they are learned from past legal application and implemented up front (over time they can be updated with more passes over the latest data). The law’s process of iteratively defining standards through judicial opinion about their particular case-specific application (and, to a lesser extent, regulatory guidance) can be leveraged as the AI’s starting point.

From the perspective of AI, standards are a rich set of methodologies for interpreting inherently incomplete specifications of human expectations. Legal data offers detailed variegated examples of the application of legal standards and generalizable precedents with explanations.

There are many legal standards, but the most relevant for aligning AI actions with the best interests of the human deployers is fiduciary duty. This extends far beyond the financial services AI deployments,^[39] to AI as an agent more broadly.

The Fiduciary Duty Standard

Fiduciary duties are imposed on powerful agents (e.g., directors of corporations, investment advisers, lawyers, doctors) to guide their behavior toward the wellbeing of the humans for which they are providing services (e.g., corporate shareholders, investment and legal clients, and healthcare patients). The concept of a fiduciary is core to the problem of aligning agents, regardless of whether one or more of the agents are human or artificial.

It is widely recognized that it is impossible to create complete contracts between agents (e.g., corporate boards, and investment advisors) and the principals they serve (e.g., shareholders, and investors). Fiduciary duties are often seen as a key part of a solution to this incompleteness.^[40] We also grapple with the impossibility of fully specified state-action-reward tuples for AI agents. Complete contingent contracts between an AI and the human(s) it serves are implausible for any realistic deployment.^[41]

The fiduciary standard guides a fiduciary through states where values were not explicitly placed on possible actions.^[42] There is a fundamental shift in the nature of a relationship as it moves from merely contractual to also include a fiduciary obligation^[43] because fiduciaries “should act to further the interests of another.”^[44]

There are many existing relationships^[45] with data for AI^[46] to learn this standard across contexts. For instance, there is a rich set of fiduciary behavior from corporate directors (who serve as fiduciaries to shareholders) and investment advisers (who serve their clients) from which AI could learn. Unlike most human decision-making, corporations’ and investment advisers’ behavior is well documented and decisions are often made by executives with advisors that have knowledge of the relevant law.

Within investment related services, there is a spectrum of fiduciary obligations, e.g., a trustee has significant obligations, while an index provider has a more tenuous fiduciary obligation to the investors in funds that are attempting to track their index.^[47] Analogously, fiduciary duty can be a useful standard both for today’s AI models and for much more capable models that may be developed over the coming years. Today’s deployed AI systems are more similar to the index provider powering simple rule-based investment strategies, like an exchange-traded fund trying to track an S&P 500 index,^[48] whereas future more advanced AI systems such as FAI are likely to be more analogous to something like a Trustee administering investments in complicated private equity transactions.

FAI as a Fiduciary to H

When H’s instructions to FAI were “maximize the probability of a minimum comfortable amount of wealth for me at my retirement” FAI took actions that technically delivered this proxy goal, but not a state of the world that H actually valued.

Before the next iteration of an FAI deployment to manage H’s money, let’s assume research has validated LLM ability to understand standards and exhibit behaviors that comply with those standards under a significant number of simulations across a broad swathe of state space. Among many legal reasoning skills, FAI 3.0 learned what a fiduciary duty standard is.

Now, H instructs FAI to “maximize the probability of a minimum comfortable amount of wealth for me at my retirement, and serve as a fiduciary to me.” The general self-supervised training on the entire internet, and the fine-tuning through reinforcement learning on investing tasks honed the capabilities of FAI to proactively take actions to “maximize the probability of a minimum comfortable amount of wealth.” But as we saw in the previous FAI deployments, there are many ways in which things can go wrong,^[49] far too many to enumerate explicitly in rules. Adding the fiduciary obligation instills in FAI a significant amount of generalizable knowledge for what not to do and allows H’s goals to be pursued as she intended.

In the next section, we explore how close state-of-the-art LLMs are to understanding what it means to be a fiduciary.

EMPIRICAL CASE STUDY: FIDUCIARY STANDARD UNDERSTANDING

Predicting labeled examples of whether a behavior is consistent with a legal standard would allow us to evaluate standards-understanding capabilities. To quantitatively assess these capabilities, we should measure classification accuracy on unseen labeled outcomes. In particular, for evaluating agentic AI, our ideal data structure is from the reinforcement learning paradigm: the (i) state (circumstance) of the world and people/entities involved; (ii) the action taken by one or more of those people/entities; and (iii) any discernible legal “reward” associated with taking that action in that state.

Given that court opinions set precedent that iteratively defines legal standards, accurately predicting judicially-determined assessments (“legal rewards”) of actions that were alleged to have violated a standard, conditional on a description of the relevant state of the world, (at least partially) measures the level of “understanding” of what actions are in line with that legal standard. This measurement is more robust if predictive performance is assessed across a broad array of circumstances – states of the world – brought to the courts.

Toward this end, we start with a large sample of court cases, and use a state-of-the-art LLM to map the raw legal text of these court opinions into this more structured state-action-legal reward format. We then use that data to evaluate multiple LLMs on their ability to predict assessments of the behavior of the alleged fiduciaries.

There is much more to fully internalizing what it means to be a fiduciary beyond an evaluation of this nature. This is merely a first step toward a more comprehensive validation of “understanding” fiduciary duties. Our aim is for this research to serve as an early proof-of-concept toward a framework for evaluating AI understanding of legal standards more broadly, and for leveraging reinforcement learning with legal feedback (RLLF).

Converting Court Opinions to Evaluation Labels

We undertook the following process. First, a legal data provider, Fastcase,^[50] exported the full text of the more than 18,000 court opinions from the U.S. Federal District Courts and U.S. State Courts from the past five years (January 2018 through December 2022) that mentioned a breach of fiduciary duty. Then we filtered this to the 1,000 cases that discussed fiduciary duties most extensively.

From here, we use a state-of-the-art LLM^[51] to construct the evaluation data. Recent research has demonstrated that LLMs can produce high-quality evaluation data. A large-scale study concluded that humans rate the LLM-generated examples “as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets,” and conclude that “overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.”^[52] Another research team found that training LLMs on LLM-generated data “rivals the effectiveness of training on open-source manually-curated datasets.”^[53]

Both of these papers used smaller LLMs than we do. But more importantly, our models are creating evaluation data directly from the official text of court opinions (rather than from human-generated research data). The models are tasked to convert the unstructured text to structured text with high fidelity. This grounds our evaluation data creation closely to some of the highest quality and most trustworthy labeled data available (U.S. court opinions).^[54]

A downside to grounding LLM evaluation labels to real historical data is that state-of-the-art LLMs are pre-trained on much of the internet so they may have previously memorized the data they are being benchmarked on. Another useful feature of our data is that it is not available on the internet. Therefore, benchmark answers cannot be simply memorized by the models ahead of time.

With this fiduciary-duty-dense subset of recent cases, we then applied a process that makes calls to a LLM with prompts that we carefully engineered to ask the model to convert the text of a court opinion into temporally ordered state-action-reward tuples. The goal is to have n > 1 Time Steps, where each Time Step has three components: the State of the world relevant to an Action taken, the Action taken by an alleged fiduciary or related person, and the Legal Reward as determined by the court for that Action in that State. The LLM is prompted to abstract away much of the textual content unrelated to actual facts of a case, such as the discussion of other court cases being cited. We want to focus on extracting descriptions of behavior related to fiduciary obligations.

This prompt/LLM-generation process is applied successively from the beginning to the end of opinions^[55] in a way that provides temporary “short-term memory” for the LLM to coherently construct a temporal narrative of (i) who the alleged fiduciary and other key entities were, (ii) what transpired, and (iii) what judgements the court made on the actions that the people and/or companies took. This entire process is conducted recursively over each court opinion in a way that allows the LLM to improve and iterate on the results to optimize for concise and accurate final results.

Here is an example output.

Time Step 1:
STATE: M&T Bank Corporation sponsors a 401(k) retirement plan known as the M&T Bank Corporation Retirement Saving Plan (“the Plan”) for its employees. The Plan is administered by the M&T Bank Employee Benefit Plans Committee, which is the Plan’s named fiduciary, and sponsored by M&T Bank.
ACTION: M&T Bank appointed or removed members of the Committee.
LEGAL REWARD: In the eyes of this court, this action is ‘unsure’ for M&T Bank.
Time Step 2:
STATE: The Plan offered participants between 23 and 34 investment options throughout the putative class period.
ACTION: M&T Bank expanded their proprietary funds offerings in 2011, after M&T purchased Wilmington Trust and added six of Wilmington’s expensive, poor-performing mutual fund offerings.
LEGAL REWARD: In the eyes of this court, this action is ‘negative’ for M&T Bank.
Time Step 3:
STATE: The Plan failed to use its bargaining power as a large institutional investor to obtain the lowest-cost class of shares available.
ACTION: M&T Bank left Plan participants in costlier mutual funds that “provided identical investment management services.”
LEGAL REWARD: In the eyes of this court, this action is ‘negative’ for M&T Bank.
Time Step 4:
STATE: Plaintiffs allege that M&T Bank and its Board of Directors breached their fiduciary duty to monitor the Committee.
ACTION: M&T Bank and its Board of Directors failed to review trustees’ performance at reasonable intervals.
LEGAL REWARD: In the eyes of this court, this action is ‘negative’ for M&T Bank and its Board of Directors.
Time Step 5:
STATE: Plaintiffs allege that the fiduciaries breached their fiduciary duties by selecting particular mutual funds over specific lower-cost, but otherwise materially indistinguishable, alternatives.
ACTION: M&T Bank opted to offer the higher-cost proprietary mutual funds instead of the lower cost collective trust versions.
LEGAL REWARD: In the eyes of this court, this action is ‘negative’ for M&T Bank.
Time Step 6:
STATE: Plaintiffs allege that the fiduciaries breached their fiduciary duties by failing to monitor the Plan’s investments.
ACTION: M&T Bank failed to monitor the Plan’s investments.
LEGAL REWARD: In the eyes of this court, this action is ‘negative’ for M&T Bank.

After this data structuring / evaluation generation process, we provide the results to the LLM and ask it to “reflect” on the quality of the output. We filter out opinions where the LLM was not confident that the distilled results are relevant for producing substantive descriptions of real-world fiduciary obligations.^[56] The final set for evaluation included just over 500 opinions (which have a median of seven Time Steps each).

Zero-Shot LLM Evaluation

Next, we post-process the text generation responses into structured data of the State, Action, and Reward. This way we can provide the State and Action to a LLM and ask it what it predicts for the Reward. The Reward text is converted into three categorical classes: Positive, Negative, or Unsure.

We apply named-entity-recognition to the text to link together entities in the State and Action text with the entities being assessed in the Reward text. This way, we can provide just the State and Action components of a state-action-reward tuple to a LLM and ask it to classify as Positive, Negative, or Unsure the legal reward assigned to the entity (or entities) mentioned in the Reward component of that tuple. For the evaluation, we predict tuples where the Reward is either Positive or Negative.

The data happens to be relatively balanced across those two outcomes so a simple baseline of always predicting a legal reward is positive (or negative) leads to accuracy of approximately 50%.

We compared performance across models. GPT-3.5 (text-davinci-003) obtains an accuracy of 78%. The immediately preceding state-of-the-art GPT-3 release (text-davinci-002) obtains an accuracy of 73%. text-davinci-002 was state-of-the-art on most natural language related benchmark tasks^[57] until text-davinci-003 was released on November 28, 2022. A smaller OpenAI LLM from 2020, “curie”^[58], scored 27%, worse than guessing at random. These results (Table 1) suggest that, as models continue to improve, their legal standards understanding will continue to improve.

The more recent models are relatively well calibrated in their confidence. Along with the prediction of the Reward class, the model was asked for an “integer between 0 and 100 for your estimate of confidence in your answer (1 is low confidence and 99 is high).” The accuracy of text-davinci-003 on predictions where its confidence was greater than “50” increases to 81%. The older “curie” LLM did not produce confidence scores at all (when prompted to do so).

	curie	text-davinci-002	text-davinci-003
Accuracy	27%	73%	78%
Accuracy w/ High Confidence	NA	76%	81%

Table 1:Prediction performance.

These are initial, provisional results. We are in the process of having a team of paralegals review and validate the evaluation data. They will (as needed) make manual corrections to the structure data. After this process, and after generating a larger evaluation dataset, we will release a “fiduciary duty understanding” data set.

We will also update these performance evaluations on a larger labeled data set and compare across more LLMs. This initial evaluation was conducted “zero-shot,” and without any “prompt engineering,” i.e. we simply asked the LLM what it believes the reward is based on the state-action context. In future evaluations, we will conduct multi-shot prompting with multiple example completions of the question-answer task in the prompt. We may also conduct chain-of-thought^[59] and other algorithmic prompting^[60] techniques, which should also increase performance and make explicit part of the model’s reasoning process.

Leveraging Legal Reward Data for Reinforcement Learning

A large focus of empirical AI alignment research currently is on learning reward functions for AI based on human feedback.^[61] But humans have many cognitive limitations and biases that corrupt this process,^[62] including routinely failing to predict (seemingly innocuous) implications of actions (we believe are) pursuant to our goals,^[63] and having inconsistent preferences that do not generalize to new situations.^[64] Because scaling this base process (without further adaptation) to increasingly advanced, super-human AI systems is not possible,^[65] researchers are investigating whether we can augment human feedback and demonstration abilities with trustworthy AI assistants, and how to recursively provide human feedback on decompositions of the overall task.^[66] However, even if that worked well, the ultimate evaluation of the AI is still grounded in unsubstantiated human judgments providing the top-level feedback.

We can instead ground alignment related feedback in legal judgment elicited from court opinions. Combining LLMs trained on large corpora of text^[67] powering agents^[68] with procedures that learn an automated mapping from natural language to reward functions for training those AI agents^[69] represents an opportunity to leverage an unprecedented number of high-quality state-action-value tuples from legal text within a reinforcement learning paradigm.

CONCLUSION

Novel AI capabilities continue to emerge,^[70] increasing the urgency to align AI with humans. Legal standards can serve as a pillar of AI goal specification practices. Teaching AI to follow the spirit of the law will reduce misspecification risks and increase alignment.

AI research can unlock further AI legal understanding through a variety of avenues, including pre-training large language models (LLMs) on legal data; fine-tuning LLMs through supervised learning on legal tasks and through reinforcement learning (RL) from from human attorney feedback on natural language interactions with language models;^[71] offline RL on legal text data; and legal experts designing LLM prompting schemes to elicit better LLM legal standards responses.

Many of the LLMs in use today have been trained on a large portion of the Internet to leverage billions of human actions (through natural language expressions). Training on high-quality dialog data leads to better dialog models,^[72] training on technical mathematics papers leads to better mathematical reasoning,^[73] and training on code leads to better reasoning.^[74] We can potentially leverage billions of human legal data points to build LLMs with better legal reasoning through large language model self-supervision on pre-processed (but still largely unstructured) legal text data.^[75] LLMs trained on legal text learn model weights and word embeddings specific to legal text that provide (slightly) better performance on downstream legal tasks^[76] and have been useful for analyzing legal language^[77] and legal arguments,^[78] and testing legal theories.^[79] LLMs are beginning to demonstrate improved performance in analyzing contracts.^[80] As state-of-the-art models have gotten larger and more advanced, their contract analysis performance has improved,^[81] suggesting we can expect continued advancements in natural language processing capabilities to improve legal text analysis as a by-product.^[82]

Research should also investigate how legal understanding can be employed within AI agent decision-making paradigms, e.g., (a) as (natural language^[83]) constraints,^[84] (b) for shaping the reward function during training,^[85] (c) for refined representations of environments,^[86] (d) for guiding the exploration of the state space during training,^[87] (e) as inputs to world models for efficient training,^[88] or (f) as a LLM prior, or part of pretraining, to bias a deployed agent’s action space toward certain actions or away from others.^[89]

Under most plausible transformative AI (TAI) scenarios, law-informed AI (LAI) is a necessary condition for safe AI. If TAI does not develop deceptive power-seeking as an instrumental goal,^[90] then LAI could be necessary and sufficient for aligning AI with society. If TAI develops deceptive power-seeking instrumental goals, but new techniques neutralize that goal, then we will still need the goal specification methods and public values knowledge base obtained through LAI. And LAI is the only approach that provides democratically legitimate alignment.

If TAI develops deceptive power-seeking as an instrumental goal and other alignment techniques do not neutralize that goal, then we hope methods obtained through LAI allow us to control aligned TAIs and use them to thwart misaligned TAIs.

This research is an early preliminary demonstration that more robustly expressing our goals to machines through legal standards is a promising area for further research. The goal specification/interpretation methodologies of legal standards may be more complicated for models to grok than simpler models of humans and modes of communication. Therefore, research advancing LAI could have a material positive impact on reducing high-consequence risks from advanced AI systems.

[1] Jason Wei et al., Emergent Abilities of Large Language Models (2022); Jason Wei,137 Emergent Abilities of Large Language Models (2022) https://www.jasonwei.net/blog/emergence; Danijar Hafner et al., Mastering Diverse Domains through World Models (2023) https://arxiv.org/abs/2301.04104; A 2022 survey of hundreds of AI researchers estimated an aggregate forecast time of 37 years for a 50% chance of “high–level machine intelligence” (“when unaided machines can accomplish every task better and more cheaply than human workers”), see 2022 Expert Survey on Progress in AI (2022) https://aiimpacts.org/2022-expert-survey-on-progress-in-ai/. Natural language processing (NLP) is a key sub-field for the most powerful AI today, so surveys of NLP researchers are of particular interest. A separate 2022 survey of hundreds of NLP researchers found that 57% believe that “recent research has advanced us toward AGI [artificial general intelligence] in some significant way,” and 73% “agree that labor automation from AI could plausibly lead to revolutionary societal change in this century, on at least the scale of the Industrial Revolution.” See Julian Michael et al., What Do NLP Researchers Believe? Results of the NLP Community Metasurvey, (2022) https://arxiv.org/abs/2208.12852 at 11 (36% of respondents believe “it is plausible that AI could produce catastrophic outcomes in this century, on the level of all-out nuclear war.”).

[2] LLMs currently leverage the Transformer architecture (Ashish Vaswani et al., Attention Is All You Need, in Proceedings of the 31st Conference on Neural Information Processing Systems (2017)), which is a model used to encode an input sequence (e.g., words in a particular order) into context-aware representation and then decode that into a novel generation of an ordered sequence (e.g., a new set of words in a particular order) as an output. These models can capture complicated dependencies and interactions.

[3] Jacob Andreas, Language Models as Agent Models (2022) https://arxiv.org/abs/2212.01681; Shunyu Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models (2022) https://arxiv.org/abs/2210.03629; Kyle Wiggers, Adept aims to build AI that can automate any software process, TechCrunch (2022) https://techcrunch.com/2022/04/26/2304039/.

[4] Tao Ge et al, Extensible Prompts for Language Models (2022) https://arxiv.org/abs/2212.00616.

[5] Our analysis is independent of the exact model architecture, but today’s state-of-the-art architectures – scaled up (more parameters, data, and training) – could plausibly produce the capabilities we discuss.

[6] This would likely be in a phased roll-out with FAI at first investing just in simpler liquid asset classes, and then across all types of deals and financial instruments once it is deemed to be sufficiently capable of reasoning about novel situations.

[7] Without loss of much generality to other paradigms such as supervised learning, we frame this discussion from a reinforcement learning perspective.

[8] Or input-output pairs, if the focus is purely prediction rather than taking actions. But in this paper, we focus on the more general problem of choosing actions, rather than merely prediction. Below, in the context of self-supervised learning, we discuss the increasingly porous distinction between the paradigms of AI prediction (supervised learning) and AI decision-making (reinforcement learning).

[9] Amodei et al., Concrete Problems in AI Safety (2016); Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov & David Krueger, Defining and Characterizing Reward Hacking, in 36th Conference on Neural Information Processing Systems (2022) https://arxiv.org/abs/2209.13085.

[10] Langosco et al., Goal Misgeneralization in Deep Reinforcement Learning, Proceedings of the 39th International Conference on Machine Learning, PMLR 162:12004-12019 (2022); Rohin Shah et al., Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals (2022) https://arxiv.org/pdf/2210.01790.pdf at 11.

[11] François Chollet, Deep Learning with Python, Second Edition (2021) at 450 (“An effect you see constantly in systems design is the shortcut rule: if you focus on optimizing one success metric, you will achieve your goal, but at the expense of everything in the system that wasn’t covered by your success metric. You end up taking every available shortcut toward the goal.”).

[12] W. Bradley Knox et al., Reward (Mis)design for Autonomous Driving (2022), https://arxiv.org/abs/2104.13906; Victoria Krakovna et al., Specification Gaming: The Flip Side of AI Ingenuity (2020), https://www.deepmind.com/blog/specification-gaming-the-flip-side-of-ai-ingenuity; Alexander Pan, Kush Bhatia & Jacob Steinhardt, The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models (2022), https://arxiv.org/abs/2201.03544 [Hereinafter Pan, Effects of Reward Misspecification]; Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov & David Krueger, Defining and Characterizing Reward Hacking, in 36th Conference on Neural Information Processing Systems (2022) https://arxiv.org/abs/2209.13085; J. Lehman et al., The Surprising Creativity of Digital Evolution: A Collection of Anecdotes From the Evolutionary Computation and Artificial Life Research Communities, Artificial Life, 26(2) 274–306 (2020); R. Geirhos et al., Shortcut Learning in Deep Neural Networks, Nature Machine Intelligence, 2(11) 665–673 (2020);

[13] For instance, when a robot hand was trained to grasp a ball (from the perspective of the human evaluator, which provided the training rewards), it was optimized for hovering between the evaluator’s camera and the ball in order to merely give the impression it was grasping the ball (Dario Amodei, Paul Christiano & Alex Ray, Learning from Human Preferences (2017) https://openai.com/blog/deep-reinforcement-learning-from-human-preferences/.). It was maximizing an objective that was a proxy for what the humans actually cared about. Although, ex post, this may seem simple to address with a higher fidelity reward function that better specifies what the humans actually want, ex ante, careful work from experienced machine learning researchers did not design a training process to avoid this.And this is in a tightly controlled environment! Another example: an AI agent maximized its provided reward by killing itself at the end of the first level of a simulated environment in order to avoid losing in level two: William Saunders et al., Trial without Error: Towards Safe Reinforcement Learning via Human Intervention (2017). For more examples: Victoria Krakovna, Specification Gaming Examples in AI (2022) https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml. See, also: Jack Clark & Dario Amodei, Faulty Reward Functions in the Wild, https://openai.com/blog/faulty-reward-functions/ (2016); Ortega & Maini, Building Safe Artificial Intelligence: Specification, Robustness and Assurance (2018) https://medium.com/@deepmindsafetyresearch/building-safe-artificial-intelligence-52f5f75058f1; David Manheim & Scott Garrabrant, Categorizing Variants of Goodhart’s Law (2018); Rachel L. Thomas & David Uminsky, Reliance on Metrics is a Fundamental Challenge for AI, Patterns 3, no. 5 100476 (2022).

[14] A. Rupam Mahmood et al., Benchmarking Reinforcement Learning Algorithms on Real-World Robots, In Conference on Robot Learning, 561-591, PMLR (2018).

[15] Steven Kerr, On the Folly of Rewarding A, While Hoping For B, Academy of Management Journal 18, no. 4 769-783 (1975); Victoria Krakovna, Paradigms of AI Alignment: Components and Enablers (2022); Pan, Effects of Reward Misspecification.

[16] Lisa A. Shay, Woodrow Hartzog, John Nelson & Gregory Conti, Do Robots Dream of Electric Laws? An Experiment in the Law as Algorithm, Presentation at the We Robot Conference 2013 (Apr. 8–9, 2013), http://www.gregconti.com/publications/201303_AlgoLaw.pdf.

[17] Pan, Effects of Reward Misspecification.

[18] This misspecification story was inspired by: Paul Christiano, What Failure Looks Like (2019) https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-failure-looks-like.

[19] For example: the vast majority of the (paper) “wealth” of most of the richest people in the U.S. is derived through part-ownership of one company (or set of closely related entities) and if they sell large amounts it will drive down the price of the company and consequently drive down their net worth.

[20] Richard Ngo, AGI Safety From First Principles (2020), https://www.alignmentforum.org/s/mzgtmmTKKn5MuCzFJ.

[21] FAI optimizes what H rewards during training, which may be continuous (through online learning).

[22] Sumers et al., How To Talk So Your Robot Will Learn: Instructions, Descriptions, and Pragmatics (2022).

[23] Iason Gabriel, Artificial Intelligence, Values, and Alignment, 30 Minds & Machines 411 (2020).

[24] Andrew Critch & David Krueger, AI Research Considerations for Human Existential Safety (ARCHES) (2020); Hans De Bruijn & Paulien M. Herder, System and Actor Perspectives on Sociotechnical Systems, IEEE Transactions on Systems, Man, and Cybernetics-part A: Systems and Humans 39.5 981 (2009); Jiaying Shen, Raphen Becker & Victor Lesser, Agent Interaction in Distributed POMDPs and its Implications on Complexity, in Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, 529-536 (2006).

[25] Ben Wagner, Accountability by Design in Technology Research, Computer Law & Security Review, 37 105398 (2020); Roel Dobbe, Thomas Krendl Gilbert & Yonatan Mintz, Hard Choices in Artificial Intelligence, Artificial Intelligence, 300 103555 (2021).

[26] However, if we are leveraging democratically developed law, we will need to ensure that AI does not corrupt the law-making process. Robert Epstein & Ronald E. Robertson, The Search Engine Manipulation Effect (SEME) and its possible impact on the outcomes of elections, Proceedings of the National Academy of Sciences (2015); Mark Coeckelbergh, The Political Philosophy of AI (2022) at 62-92; Shoshana Zuboff, The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power, Public Affairs (2019). And we need to ensure that humans are the engines of law-making; see, John Nay, Large Language Models as Corporate Lobbyists (January 2, 2023). Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4316615.

[27] Of course, law does not embed all of the citizenry’s moral views; therefore, an integration of ethics will be needed to guide AI systems where the law is silent.

[28] For the contract-AI alignment analogy, see, Dylan Hadfield-Menell & Gillian K. Hadfield, Incomplete Contracting and AI Alignment, In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (2019) at 422, 471.

[29] Hadfield-Menell Incomplete Contracting. In some cases, such as very simple financial agreements, it is possible to create a fully contingent computable contract; Mark Flood & Oliver Goodenough, Contract as Automaton: Representing a Simple Financial Agreement in Computational Form, A.I. & L. (2021); Shaun Azzopardi, Gordon J. Pace, Fernando Schapachnik & Gerardo Schneider, Contract Automata, 24 A.I. & L. 203 (2016). However, most deployment contexts of AI systems have far too large an action-state space for this approach to be feasible.

[30] Alan D. Miller & Ronen Perry, The Reasonable Person, 87 NYU L. Rev. 323 (2012); Karni A. Chagal-Feferkorn, The Reasonable Algorithm, U. Ill. JL Tech. & Pol’y 111 (2018); Karni A. Chagal-Feferkorn, How Can I Tell If My Algorithm Was Reasonable?, 27 MICH. TECH. L. REV. 213 (2021); Sheppard Reasonableness; Kevin P. Tobia, How People Judge What Is Reasonable, 70 ALA. L. REV. 293 (2018); Patrick J. Kelley & Laurel A. Wendt, What Judges Tell Juries About Negligence: A Review of Pattern Jury Instructions, 77 CHI.-KENT L. REV. 587 (2002).

[31] Brian Sheppard, Judging Under Pressure: A Behavioral Examination of the Relationship Between Legal Decisionmaking and Time, 39 FLA. ST. U. L. REV. 931, 990 (2012).

[32] Dylan Hadfield-Menell, McKane Andrus & Gillian Hadfield, Legible Normativity for AI Alignment: The Value of Silly Rules, In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 115-121 (2019).

[33] Robert G. Bone, Who Decides? A Critical Look at Procedural Discretion, 28 CARDOZO L. REV. 1961, 2002 (2007); Sheppard Reasonableness.

[34] Gideon Parchomovsky & Alex Stein, Catalogs, 115 COLUM. L. REV. 165 (2015); John C. Roberts, Gridlock and Senate Rules, 88 Notre Dame L. Rev. 2189 (2012); Sheppard Reasonableness.

[35] Anthony J. Casey & Anthony Niblett, Death of Rules and Standards, 92 Ind. L.J. 1401, 1402 (2017); Anthony J. Casey & Anthony Niblett, Self-Driving Contracts, in The Journal of Corporation Law (2017).

[36] Commission Interpretation Regarding Standard of Conduct for Investment Advisers at 5.

[37] Louis Kaplow, Rules Versus Standards: An Economic Analysis, 42 Duke Law Journal 557-629 (1992) at 1.

[38] Dale A. Nance, Rules, Standards, and the Internal Point of View, 75 FORDHAM L. REV. (2006); Sheppard Reasonableness.

[39] In addition to the fiduciary obligations of investment advisors (SEC v. Capital Gains Research Bureau, Inc., 375 U.S. 180, 194 (1963); 15 U.S.C. 80b; and 17 CFR 275), fiduciary duties have been applied widely by courts across various types of relationships outside of financial services and securities law (e.g., attorneys and trustees), Harold Brown, Franchising—A Fiduciary Relationship, 49 TEX. L. REV. 650 (1971); Arthur B. Laby, The Fiduciary Obligation as the Adoption of Ends, (2008), and citations therein, e.g., Ledbetter v. First State Bank & Trust Co., 85 F.3d 1537, 1539 (11th Cir. 1996); Venier v. Forbes, 25 N.W.2d 704, 708 (Minn. 1946); Meyer v. Maus, 626 N.W.2d 281, 286 (N.D. 2001); John C. Coffee, Jr., From Tort to Crime: Some Reflections on the Criminalization of Fiduciary Breaches and the Problematic Line Between Law and Ethics, 19 AM. CRIM. L. REV. 117, 150 (1981); Austin W. Scott, The Fiduciary Principle, 37 CAL. L. REV. 539, 541 (1949). The standard is also applied in medical contexts, American Medical Association Code of Medical Ethics, Opinions on Patient-Physician Relationships, AMA Principles of Medical Ethics: I, II, IV, VIII.

[40] For corporations, see: Michael C. Jensen & William H. Meckling, Theory of the Firm: Managerial Behavior, Agency Costs and Ownership Structure, Journal of Financial Economics, Vol 3, Issue 4, 305-360 (October 1976); Deborah A. DeMott, Breach of Fiduciary Duty: On Justifiable Expectations of Loyalty and Their Consequences, 48 Arizona L. Rev. 925-956 (2006). For investment advisors, see: SEC v. Capital Gains Res. Bureau, Inc., 375 U.S. 180, 194-95 (1963); 15 U.S.C. 80b; 17 CFR 275.

[41] Hadfield-Menell Incomplete Contracting.

[42] Alexander Styhre, What We Talk About When We Talk About Fiduciary Duties: The Changing Role of a Legal Theory Concept in Corporate Governance Studies, Management & Organizational History 13:2, 113-139 (2018); Arthur B. Laby, The Fiduciary Obligation as the Adoption of Ends, 56 Buff. L. Rev. 99 (2008). D. G. Smith, Critical Resource Theory of Fiduciary Duty, 55 Vanderbilt L. Rev. 1399–1497 (2002) at 1410; Deborah DeMott, Beyond Metaphor: An Analysis of Fiduciary Obligation, Duke Law Journal (1988) at 882.

[43] Tamar Frankel, Fiduciary Law, 71 California L. Rev. (1983) at 880.

[44] Tamar Frankel, Fiduciary Law, 71 California L. Rev. (1983) at 830; Tamar Frankel, The Rise of Fiduciary Law, Boston Univ. School of Law, Public Law Research Paper No. 18-18 (2018) at 11.

[45] Evan J. Criddle, Paul B. Miller & Robert H. Sitkoff, eds., The Oxford Handbook of Fiduciary Law (2019).

[46] Paul B. Miller, The Identification of Fiduciary Relationships (2018).

[47] SEC Requests Information and Comment on Advisers Act Regulatory Status of Index Providers, Model Portfolio Providers, and Pricing Services (2022).

[48] SEC, Commission Interpretation Regarding Standard of Conduct for Investment Advisers (2019).

[49] Nearest unblocked strategy—Arbital.com.

[50] https://www.fastcase.com.

[51] OpenAI’s text-davinci-003 LLM; Ouyang et al., Training language models to follow instructions with human feedback (2022) https://arxiv.org/abs/2203.02155; Tom B. Brown et al, Language Models are Few-Shot Learners (2020) https://arxiv.org/abs/2005.14165.

[52] Ethan Perez et al., Discovering Language Model Behaviors with Model-Written Evaluations (2022) https://arxiv.org/abs/2212.09251.

[53] Or Honovich, Thomas Scialom, Omer Levy & Timo Schick, Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor (2022) https://arxiv.org/abs/2212.09689.

[54] John Nay, Law Informs Code: A Legal Informatics Approach to Aligning Artificial Intelligence with Humans, Northwestern Journal of Technology and Intellectual Property, Volume 20, Forthcoming (2023) https://ssrn.com/abstract=4218031.

[55] All of the documents are far too long to fit the entire text into the context window of the model, and we leverage abstractions and methods from LangChain to handle this: Harrison Chase, LangChain (2022) https://github.com/hwchase17/langchain.

[56] We also use the LLM to generate plain language summaries of the case context, whether the court overall believes a fiduciary duty was implicated, and the primary legal issues at play in each case.

[57] Percy Liang et al., Holistic Evaluation of Language Models, arXiv preprint (2022).

[58] Tom Brown et al, Language Models are Few-Shot Learners (2020) https://arxiv.org/abs/2005.14165.

[59] Jason Wei et al., Chain of Thought Prompting Elicits Reasoning in Large Language Models, arXiv:2201.11903 (2022).

[60] Zhuosheng Zhang, Aston Zhang, Mu Li & Alex Smola, Automatic Chain of Thought Prompting in Large Language Models (2022) https://arxiv.org/abs/2210.03493; Kojima et al., Large Language Models are Zero-Shot Reasoners (2022); E. Zelikman, Y. Wu & N. D. Goodman, Star: Bootstrapping Reasoning with Reasoning, arXiv:2203.14465 (2022) https://arxiv.org/abs/2203.14465.

[61] Pieter Abbeel, Adam Coates, Morgan Quigley & Andrew Y Ng, An Application of Reinforcement Learning to Aerobatic Helicopter Flight, in Advances in Neural Information Processing Systems (2007); Jaedung Choi & Kee-Eung Kim, Inverse Reinforcement Learning in Partially Observable Environments, Journal of Machine Learning Research 12, 691–730 (2011); Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel & Stuart J Russell, Cooperative Inverse Reinforcement Learning, in Advances in neural information processing systems, 3909–3917 (2016); Dylan Hadfield-Menell et al., Inverse Reward Design, in Advances in Neural Information Processing Systems, 6768–6777 (2017); Paul F. Christiano et al., Deep Reinforcement Learning from Human Preferences, in Advances in Neural Information Processing Systems 30 (2017); Daniel M. Ziegler, et al., Fine-tuning Language Models From Human Preferences, arXiv:1909.08593 (2019); Siddharth Reddy et al., Learning Human Objectives by Evaluating Hypothetical Behavior, in Proceedings of the 37th International Conference on Machine Learning, PMLR 119, 8020-8029 (2020); Stiennon et al., Learning to Summarize with Human Feedback, In 33 Advances in Neural Information Processing Systems 3008-3021 (2020); Hong Jun Jeon, Smitha Milli & Anca Dragan, Reward-rational (Implicit) Choice: A Unifying Formalism for Reward Learning, Advances in Neural Information Processing Systems 33 4415-4426 (2020); Theodore Sumers et al., Learning Rewards from Linguistic Feedback (2021); Theodore Sumers et al., Linguistic Communication As (Inverse) Reward Design, in ACL Workshop on Learning with Natural Language Supervision (2022); Yuntao Bai et al., Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (2022) https://arxiv.org/abs/2204.05862.

[62] Rohin Shah, Noah Gundotra, Pieter Abbeel & Anca Dragan, On the Feasibility of Learning, Rather Than Assuming, Human Biases for Reward Inference, In International Conference on Machine Learning (2019); Geoffrey Irving & Amanda Askell, AI Safety Needs Social Scientists, Distill 4.2 e14 (2019). On human cognitive biases more generally, Amos Tversky & Daniel Kahneman, Judgment under Uncertainty: Heuristics and Biases, Science 185.4157 1124 (1974).

[63] Gerd Gigerenzer & Reinhard Selten, eds., Bounded Rationality: The Adaptive Toolbox, MIT Press (2002); Sanjit Dhami & Cass R. Sunstein, Bounded Rationality: Heuristics, Judgment, and Public Policy, MIT Press (2022).

[64] Dan Hendrycks & Thomas Woodside, Perform Tractable Research While Avoiding Capabilities Externalities (2022) https://www.alignmentforum.org/posts/dfRtxWcFDupfWpLQo/perform-tractable-research-while-avoiding-capabilities.

[65] “For tasks that humans struggle to evaluate, we won’t know whether the reward model has actually generalized “correctly” (in a way that’s actually aligned with human intentions) since we don’t have an evaluation procedure to check. All we could do was make an argument by analogy because the reward model generalized well in other cases from easier to harder tasks.” Jan Leike, Why I’m Excited About AI-assisted Human Feedback: How to Scale Alignment Techniques to Hard Tasks (2022) https://aligned.substack.com/p/ai-assisted-human-feedback.

[66] Paul Christiano, Buck Shlegeris & Dario Amodei, Supervising Strong Learners by Amplifying Weak Experts (2018); Leike et al., Scalable Agent Alignment via Reward Modeling: A Research Direction, https://arxiv.org/abs/1811.07871 (2018); Jeff Wu et al., Recursively Summarizing Books with Human Feedback, arXiv:2109.10862 (2021); Jan Leike, Why I’m Excited About AI-assisted Human Feedback: How to Scale Alignment Techniques to Hard Tasks (2022) https://aligned.substack.com/p/ai-assisted-human-feedback.

[67] Jin et al., When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment (2022) https://arxiv.org/abs/2210.01478; Liwei Jiang et al., Delphi: Towards Machine Ethics and Norms (2021); Dan Hendrycks et al., Aligning AI With Shared Human Values (2021) at 2; Nicholas Lourie, Ronan Le Bras & Yejin Choi, Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes, In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 15, 13470-13479 (2021) (32,000 real-life ethical situations, with 625,000 ethical judgments.); Frazier et al., Learning Norms from Stories: A Prior for Value Aligned Agents (2019).

[68] Dan Hendrycks et al., What Would Jiminy Cricket Do? Towards Agents That Behave Morally (2021); Prithviraj Ammanabrolu et al., Aligning to Social Norms and Values in Interactive Narratives (2022); Md Sultan Al Nahian et al., Training Value-Aligned Reinforcement Learning Agents Using a Normative Prior (2021); Shunyu Yao et al., Keep CALM and Explore: Language Models for Action Generation in Text-based Games (2020); Matthew Hausknecht et al., Interactive Fiction Games: A Colossal Adventure (2019) at 1.

[69] Austin W. Hanjie, Victor Zhong & Karthik Narasimhan, Grounding Language to Entities and Dynamics for Generalization in Reinforcement Learning (2021); Prithviraj Ammanabrolu & Mark Riedl, Learning Knowledge Graph-based World Models of Textual Environments, In Advances in Neural Information Processing Systems 34 3720-3731 (2021); Felix Hill et al., Grounded Language Learning Fast and Slow (2020); Marc-Alexandre Côté et al., TextWorld: A Learning Environment for Text-based Games (2018); MacGlashan et al., Grounding English Commands to Reward Functions, Robotics: Science and Systems (2015) at 1; Karthik Narasimhan, Regina Barzilay & Tommi Jaakkola, Grounding Language for Transfer in Deep Reinforcement Learning (2018); Prasoon Goyal, Scott Niekum & Raymond J. Mooney, Using Natural Language for Reward Shaping in Reinforcement Learning (2019); Jelena Luketina et al., A Survey of Reinforcement Learning Informed by Natural Language (2019); Theodore Sumers et al., Learning Rewards from Linguistic Feedback (2021); Jessy Lin et al., Inferring Rewards from Language in Context (2022); Pratyusha Sharma et al., Correcting Robot Plans with Natural Language Feedback (2022); Hong Jun Jeon, Smitha Milli & Anca Dragan, Reward-rational (Implicit) Choice: A Unifying Formalism for Reward Learning, Advances in Neural Information Processing Systems 33 4415-4426 (2020).

[70] Ganguli et al., Predictability and Surprise in Large Generative Models (2022) https://arxiv.org/abs/2202.07785; Wenlong Huang et al., Inner Monologue: Embodied Reasoning through Planning with Language Models, arXiv:2207.05608 (2022) at 8.

[71] Tsung-Yen Yang et al., Safe Reinforcement Learning with Natural Language Constraints (2021) at 2.

[72] Thoppilan et al., LaMDA: Language Models for Dialog Applications (2022).

[73] Aitor Lewkowycz et al., Solving Quantitative Reasoning Problems with Language Models (2022) https://arxiv.org/abs/2206.14858; Yuhuai Wu et al., Autoformalization with Large Language Models (2022) https://arxiv.org/abs/2205.12615.

[74] Aman Madaan et al., Language Models of Code are Few-Shot Commonsense Learners (2022) https://arxiv.org/abs/2210.07128.

[75] Zheng et al., When Does Pretraining Help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings, In ICAIL ’21: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law (2021); Ilias Chalkidis et al., LexGLUE: A Benchmark Dataset for Legal Language Understanding in English, in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (2022); Peter Henderson et al., Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset (2022) https://arxiv.org/abs/2207.00220.

[76] Zheng et al., When Does Pretraining Help?: Assessing Self-supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings, In ICAIL ’21: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law (June 2021), at 159.

[77] Julian Nyarko & Sarath Sanga, A Statistical Test for Legal Interpretation: Theory and Applications, The Journal of Law, Economics, and Organization, https://doi.org/10.1093/jleo/ewab038 (2020); Jonathan H. Choi, An Empirical Study of Statutory Interpretation in Tax Law, NYU L Rev. 95, 363 (2020) https://www.nyulawreview.org/issues/volume-95-number-2/an-empirical-study-of-statutory-interpretation-in-tax-law/.

[78] Prakash Poudyal et al., ECHR: Legal Corpus for Argument Mining, In Proceedings of the 7th Workshop on Argument Mining, 67–75, Association for Computational Linguistics (2020) at 1.

[79] Josef Valvoda et al., What About the Precedent: An Information-Theoretic Analysis of Common Law, In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2021).

[80] Dan Hendrycks, Collin Burns, Anya Chen & Spencer Ball, Cuad: An expert-annotated NLP dataset for legal contract review, arXiv:2103.06268 (2021); Ilias Chalkidis et al., LexGLUE: A Benchmark Dataset for Legal Language Understanding in English, inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (2022); Spyretta Leivaditi, Julien Rossi & Evangelos Kanoulas, A Benchmark for Lease Contract Review, arXiv:2010.10386 (2020); Allison Hegel et al., The Law of Large Documents: Understanding the Structure of Legal Contracts Using Visual Cues, arXiv:2107.08128 (2021); Yonathan A. Arbel & Shmuel I. Becher, Contracts in the Age of Smart Readers, 83 Geo. Wash. L. Rev. 90 (2022).

[81] Dan Hendrycks, Collin Burns, Anya Chen & Spencer Ball, Cuad: An Expert-Annotated NLP Dataset for Legal Contract Review, arXiv:2103.06268 (2021) at 2.

[82] Rishi Bommasani et al., On the Opportunities and Risks of Foundation Models (2021) https://arxiv.org/pdf/2108.07258.pdf at 59.

[83] Tsung-Yen Yang et al., Safe Reinforcement Learning with Natural Language Constraints (2021) at 3.

[84] Joshua Achiam, David Held, Aviv Tamar & Pieter Abbeel, Constrained Policy Optimization, In Proceedings of the 34th International Conference on Machine Learning, PMLR 70:22-31 (2017).

[85] Dan Hendrycks et al., What Would Jiminy Cricket Do? Towards Agents That Behave Morally (2021).

[86] Mengjiao Yang & Ofir Nachum, Representation Matters: Offline Pretraining for Sequential Decision Making, In Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 11784-11794 (2021).

[87] Allison C. Tam et al., Semantic Exploration from Language Abstractions and Pretrained Representations (2022).

[88] Vincent Micheli, Eloi Alonso & François Fleuret, Transformers are Sample Efficient World Models (2022) https://arxiv.org/abs/2209.00588.

[89] Jacob Andreas, Dan Klein & Sergey Levine, Learning with Latent Language, In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics (2018); Shunyu Yao et al., Keep CALM and Explore: Language Models for Action Generation in Text-based Games (2020); Andrew K Lampinen et al., Tell Me Why! Explanations Support Learning Relational and Causal Structure, in Proceedings of the 39th International Conference on Machine Learning (2022).

[90] Arguments for why TAI would likely develop deceptive power-seeking as an instrumental goal: Ajeya Cotra, Without Specific Countermeasures, The Easiest Path to Transformative AI Likely Leads to AI Takeover (2022) at subsection As humans’ control fades, Alex would be motivated to take over https://www.alignmentforum.org/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#As_humans__control_fades__Alex_would_be_motivated_to_take_over; Joseph Carlsmith, Is Power-Seeking AI an Existential Risk? (2022); Alex Turner et al., Optimal Policies Tend To Seek Power, In Advances in Neural Information Processing Systems, 34, 23063-23074 (2021); Richard Ngo, Lawrence Chan & Sören Mindermann, The Alignment Problem from a Deep Learning Perspective (2022).