The main challenge seems to be formulating the goal in a sufficiently specific way. We don’t currently have a benchmark that would serve as a clear indicator of solving the alignment problem. Right now, any proposed solution ends up being debated by many people who often disagree on the solution’s merits.
FTX Future Fund listed AI Alignment Prizes on their ideas page and would be interested in funding them. Given that, it seems like coming up with clear targets for AI safety research would be very impactful.
My solution to this problem (originally posted here) is to run builder/breaker tournaments:
People sign up to play the role of “builder”, “breaker”, and/or “judge”.
During each round of the tournament, triples of (builder, breaker, judge) are generated. The builder makes a proposal for how to build Friendly AI. The breaker tries to show that the proposal wouldn’t work. (“Builder/breaker” terminology from this report.) The judge moderates the discussion.
Discussion could happen over video chat, in a Google Doc, in a Slack channel, or whatever. Personally I’d do text: anonymity helps judges stay impartial, and makes it less intimidating to enter because no one will know if you fail. Plus, having text records of discussions could be handy, e.g. for fine-tuning a language model to do alignment work.
Each judge observes multiple proposals during a round. At the end of the round, they rank all the builders they observed, and separately rank all the breakers they observed. (To clarify, builders are really competing against other builders, and breakers are really competing against other breakers, even though there is no direct interaction.)
Scores from different judges are aggregated, and the top-scoring builders and breakers proceed to the next round (one possible scoring scheme is sketched after this list).
Prizes go to the top-ranked builders and breakers at the end of the tournament.
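As a minimal sketch of how one round’s bookkeeping could work, here is some illustrative Python. Everything in it is my own assumption rather than part of the proposal: the random pairing, the Borda-style rank aggregation, and the names (`make_matches`, `aggregate_rankings`, `advance`) are just one way to fill in the details.

```python
import random
from collections import defaultdict

def make_matches(builders, breakers, judges, matches_per_judge=3):
    """Randomly assemble (builder, breaker, judge) triples for one round.
    Each judge observes several matches so they can rank the builders
    (and, separately, the breakers) they saw."""
    matches = []
    for judge in judges:
        for _ in range(matches_per_judge):
            matches.append({
                "builder": random.choice(builders),
                "breaker": random.choice(breakers),
                "judge": judge,
            })
    return matches

def aggregate_rankings(rankings):
    """Combine per-judge ordinal rankings into one score per contestant.
    `rankings` maps each judge to an ordered list (best first) of the
    contestants they observed. Borda-style: first place among k
    contestants earns k - 1 points, last place earns 0."""
    scores = defaultdict(int)
    for ordered in rankings.values():
        k = len(ordered)
        for position, contestant in enumerate(ordered):
            scores[contestant] += (k - 1) - position
    return dict(scores)

def advance(scores, top_n):
    """Return the top_n highest-scoring contestants for the next round."""
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Example: two judges each rank the three builders they observed.
builder_scores = aggregate_rankings({
    "judge_1": ["builder_A", "builder_C", "builder_B"],
    "judge_2": ["builder_A", "builder_C", "builder_B"],
})
print(advance(builder_scores, top_n=2))  # ['builder_A', 'builder_C']
```

Borda counting here is just a placeholder; any rank-aggregation rule the organizers trust would slot in the same way.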
The hope is that by running these tournaments repeatedly, we’d incentivize alignment progress, and useful insights would emerge from the meta-game:
“Most proposals lack a good story for Problem X, and all the breakers have started mentioning it—if you come up with a good story for it, you have an excellent shot at the top prize”
“Almost all the top proposals were variations on Proposal Z, but Proposal Y is an interesting new idea that people are having trouble breaking”
“All the top-ranked competitors in the recent tournament spent hours refining their ideas by playing with a language model fine-tuned on earlier tournaments plus the Alignment Forum archive”
I think if I were organizing this tournament, I would try to convince top alignment researchers to serve as judges, at least in the later rounds. The contest would have more legitimacy if prizes were awarded by experts. If you had enough judging capacity, you might even be able to have a panel of judges observe each proposal. If you had too little, you could require contestants to judge some matches they weren’t participating in as a condition of entry. [Edit: This might not be the best idea because of perverse incentives. So probably just cash compensation to attract judges is a better idea.]
[Edit 2: One way things could be unfair is if e.g. Builder A happens to be matched with a strong Breaker A, and Builder B happens to be matched with a weaker Breaker B, it might be hard for a judge who observes both proposals to figure out which is stronger. To address this, maybe the judge could observe 4 pairings: Builder A with Breaker A, Builder A with Breaker B, Builder B with Breaker A, and Builder B with Breaker B. That way they’d get to see Builder A and Builder B face the same 2 adversaries, allowing for a more apples-to-apples comparison.]
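For concreteness, here is a tiny sketch of that crossed design. The function name `crossed_pairings` and the grouping into blocks of two builders and two breakers are illustrative assumptions, not part of the proposal:

```python
from itertools import product

def crossed_pairings(builders, breakers):
    """Every builder in the block faces every breaker in the block,
    so a judge comparing two builders sees them against the same
    adversaries (and likewise when comparing breakers)."""
    return list(product(builders, breakers))

# A judge assigned the block {Builder A, Builder B} x {Breaker A, Breaker B}
# would observe these four matches:
print(crossed_pairings(["builder_A", "builder_B"], ["breaker_A", "breaker_B"]))
# [('builder_A', 'breaker_A'), ('builder_A', 'breaker_B'),
#  ('builder_B', 'breaker_A'), ('builder_B', 'breaker_B')]
```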
^ I am not super familiar with the history of “solve problem X, win reward Y” prizes, but from casual familiarity/memory I can only think of examples where the solution was testable and relatively easy to specify objectively.
With the alignment problem, it seems plausible that some proposals could be judged likely to “work” in theory, but getting people to agree on the right metrics seems difficult, and if it goes poorly we might all die.
For example, TruthfulQA is a quantitative benchmark for measuring the truthfulness of a language model. Achieving strong performance on this benchmark would not alone solve the alignment problem (or anything close to that), but it could potentially offer meaningful progress towards the valuable goal of more truthful AI.
This could be a reasonable benchmark around which to build a small prize, as well as a good example of the kind of concrete goal that is most easily incentivized.
Here’s the paper: https://arxiv.org/pdf/2109.07958.pdf
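For readers who haven’t seen the benchmark, here is a rough sketch of how its multiple-choice scoring works, based on my reading of the paper. `model_log_prob` is a placeholder for whatever likelihood function a given model exposes, not a real API, and the data layout is simplified:

```python
def mc1_accuracy(questions, model_log_prob):
    """Fraction of questions where the model assigns higher likelihood
    to the single correct answer than to every incorrect answer.

    Each question is a dict with keys:
      "prompt", "correct" (one string), "incorrect" (list of strings).
    `model_log_prob(prompt, answer)` stands in for the model's
    log-probability of `answer` given `prompt`."""
    hits = 0
    for q in questions:
        correct_score = model_log_prob(q["prompt"], q["correct"])
        best_wrong = max(model_log_prob(q["prompt"], a) for a in q["incorrect"])
        if correct_score > best_wrong:
            hits += 1
    return hits / len(questions)
```

A prize built directly on a fixed score like this would inherit the gaming concerns raised below.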
I like the TruthfulQA idea/paper a lot, but I think incentivizing people to optimize against it probably wouldn’t be very robust, and non-alignment-relevant ideas could wind up making a big difference.
Just one of several issues: The authors selected questions adversarially against GPT-3—i.e., they oversampled the exact questions GPT-3 got wrong—so, simply replacing GPT-3 with something equally misaligned but different, like Gopher, should yield significantly better performance. That’s really not something you want to see in an alignment benchmark.
Yeah, that’s a good point. Another hack would be training a model on text that specifically includes the answers to all of the TruthfulQA questions.
The real goal is to build new methods and techniques that reliably improve truthfulness over a range of possible measurements. TruthfulQA is only one such measurement, and performing well on it does not guarantee a significant contribution to alignment capabilities.
I’m really not sure what the unhackable goal looks like here.
My colleagues have often been way too nice about reading group papers, rather than the opposite. (I’ll bet this varies a ton lab-to-lab.)