My solution to this problem (originally posted here) is to run builder/breaker tournaments:
People sign up to play the role of “builder”, “breaker”, and/or “judge”.
During each round of the tournament, triples of (builder, breaker, judge) are generated. The builder makes a proposal for how to build Friendly AI. The breaker tries to show that the proposal wouldn’t work. (“Builder/breaker” terminology from this report.) The judge moderates the discussion.
Discussion could happen over video chat, in a Google Doc, in a Slack channel, or whatever. Personally, I'd go with text: anonymity helps judges stay impartial and makes entering less intimidating, since no one will know if you fail. Plus, having text records of discussions could be handy, e.g. for fine-tuning a language model to do alignment work.
Each judge observes multiple proposals during a round. At the end of the round, they rank all the builders they observed, and separately rank all the breakers they observed. (To clarify, builders are really competing against other builders, and breakers are really competing against other breakers, even though there is no direct interaction.)
Scores from different judges are aggregated, and the top-scoring builders and breakers proceed to the next round. (A rough sketch of how the pairing and score aggregation might work appears after these steps.)
Prizes go to the top-ranked builders and breakers at the end of the tournament.
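To make the mechanics above concrete, here is a minimal sketch of how one might generate (builder, breaker, judge) triples and aggregate per-judge rankings. The uniform-random matching, the Borda-style scoring, and the function names are all my own illustrative assumptions, not part of the proposal itself:

```python
import random
from collections import defaultdict

def make_triples(builders, breakers, judges, per_judge=3):
    """Randomly assign (builder, breaker, judge) triples for one round.

    Each judge observes up to `per_judge` matches. The uniform-random
    pairing here is just an illustrative assumption; a real scheduler
    could balance by past scores, availability, etc.
    """
    builders, breakers = builders[:], breakers[:]
    random.shuffle(builders)
    random.shuffle(breakers)
    triples = []
    for i, (builder, breaker) in enumerate(zip(builders, breakers)):
        judge = judges[(i // per_judge) % len(judges)]
        triples.append((builder, breaker, judge))
    return triples

def aggregate_rankings(rankings):
    """Combine per-judge rankings into one score per contestant.

    `rankings` is a list of ordered lists (best first), one per judge.
    Uses a Borda-style count normalized to [0, 1] per judge. A real
    scheme would also need to handle contestants who are ranked by
    different numbers of judges.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for place, name in enumerate(ranking):
            scores[name] += (n - 1 - place) / max(n - 1, 1)
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Example: two judges each rank the builders they observed.
builder_rankings = [["Alice", "Bob", "Carol"], ["Bob", "Alice"]]
print(aggregate_rankings(builder_rankings))
# The top-scoring builders by aggregate score would proceed to the next round.
```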
The hope is that by running these tournaments repeatedly, we’d incentivize alignment progress, and useful insights would emerge from the meta-game:
“Most proposals lack a good story for Problem X, and all the breakers have started mentioning it—if you come up with a good story for it, you have an excellent shot at the top prize”
“Almost all the top proposals were variations on Proposal Z, but Proposal Y is an interesting new idea that people are having trouble breaking”
“All the top-ranked competitors in the recent tournament spent hours refining their ideas by playing with a language model fine-tuned on earlier tournaments plus the Alignment Forum archive”
I think if I were organizing this tournament, I would try to convince top alignment researchers to serve as judges, at least in the later rounds. The contest will have more legitimacy if prizes are awarded by experts. If you had enough judging capacity, you might even be able to have a panel of judges observe each proposal. If you had too little, you could require contestants to judge some matches they weren't participating in as a condition of entry. [Edit: This might not be the best idea because of the perverse incentives it creates, so cash compensation to attract judges is probably a better approach.]
[Edit 2: One way things could be unfair: if, say, Builder A happens to be matched with a strong Breaker A while Builder B is matched with a weaker Breaker B, it might be hard for a judge who observes both proposals to figure out which is stronger. To address this, maybe the judge could observe four pairings: Builder A with Breaker A, Builder A with Breaker B, Builder B with Breaker A, and Builder B with Breaker B. That way they'd see Builder A and Builder B face the same two adversaries, allowing a more apples-to-apples comparison.]
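A rough sketch of that block design, where each judge's block is a small round-robin so every builder in the block faces the same breakers. The block size of 2 and the even-division assumption are mine, purely for illustration:

```python
import random

def blocked_pairings(builders, breakers, block_size=2):
    """Group builders and breakers into blocks and pair everyone within
    a block, so the judge assigned to that block sees each builder face
    the same set of breakers.

    Assumes len(builders) == len(breakers) and both divide evenly into
    blocks; a real scheduler would need to handle leftovers.
    """
    builders, breakers = builders[:], breakers[:]
    random.shuffle(builders)
    random.shuffle(breakers)
    blocks = []
    for i in range(0, len(builders), block_size):
        b_block = builders[i:i + block_size]
        k_block = breakers[i:i + block_size]
        # Every builder in the block meets every breaker in the block.
        blocks.append([(b, k) for b in b_block for k in k_block])
    return blocks

# With block_size=2 each block yields the four pairings described above:
# (Builder A, Breaker A), (Builder A, Breaker B),
# (Builder B, Breaker A), (Builder B, Breaker B).
for block in blocked_pairings(["Builder A", "Builder B"], ["Breaker A", "Breaker B"]):
    print(block)
```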