Old: “The techniques discussed this week showcase a tradeoff between power and alignment: behavioural cloning provides the fewest incentives for misbehaviour, but is also hardest to use to go beyond human-level ability. Whereas reward modelling can reward agents for unexpected behaviour that leads to good outcomes (as long as humans can recognise them) - but this also means that those agents might find and be rewarded for manipulative or deceptive actions. Christiano et al. (2017) provide an example of an agent learning to deceive the human evaluator; and Stiennon et al. (2020) provide an example of an agent learning to “deceive” its reward model. Lastly, while IRL could in theory be used even for tasks that humans can’t evaluate, it relies most heavily on assumptions about human rationality in order to align agents.”
New: “The techniques discussed this week showcase a tradeoff between power and alignment: behavioural cloning provides the fewest incentives for misbehaviour, but is also hardest to use to go beyond human-level ability. Reward modelling, by contrast, can reward agents for unexpected behaviour that leads to good outcomes—but also rewards agents for manipulative or deceptive actions. (Although deliberate deception is likely beyond the capabilities of current agents, there are examples of simpler behaviours having a similar effect: Christiano et al. (2017) describe an agent learning behaviour which misled the human evaluator; and Stiennon et al. (2020) describe an agent learning behaviour which was misclassified by its reward model.) Lastly, while IRL can potentially be used even for tasks that humans can’t evaluate, the theoretical justification for why this should work relies on implausibly strong assumptions about human rationality.”