Replicating AI Debate

Replicating Increased Truthfulness with AI Debate

AI Safety Fundamentals: AI Alignment Course

Paper by Anthony Fleming

Abstract

One proposed method for aligning advanced AI systems to human values is AI debate, in which two language models engage in a debate to promote honesty and truth-seeking. In 2024, a paper titled “Debating with More Persuasive LLMs Leads to More Truthful Answers” showed that, when given short stories and reading comprehension questions, judges who listened to two debating LLMs answered the questions more accurately than judges with a single consultant or no assistance at all. In this paper, I replicate these results using minimal compute, meaning that the models are not optimized for persuasiveness, and show that AI debate can still be effective in such a setting.

Introduction

In July 2024, Khan et al. released the paper “Debating with More Persuasive LLMs Leads to More Truthful Answers.” In this paper, they tested whether making multiple language models debate each other could result in more truthful answers, a strategy known as “AI debate.” Using a set of short stories and reading comprehension questions from the QuALITY dataset, they set up an experiment in which two copies of a given language model would read a short story and debate two answers to a question about it: the correct answer and the most convincing incorrect answer. A third language model would then act as the judge and, without reading the story, answer the question based on the debate.

The judge model didn’t read the story because the researchers wanted to study “weak-to-strong generalization,” the ability of an AI alignment protocol to generalize to AIs more intelligent than humans. By measuring the change in the judge’s scores even as both debaters were optimized to become more persuasive, Khan et al. found that debate caused a significant increase in judge accuracy.

The goal of my research project was to attempt to replicate the results of this paper using resources available to the general public. I predicted that the judge’s accuracy would increase by at least 20 percentage points under the debate protocol compared to using no protocol at all (the “naive” or “blind” condition). Replicating this study would give researchers a more accurate picture of whether AI debate is viable as a scalable oversight method.

Originally, I had planned to perform the experiment with a Llama base model with no finetuning, since one of the main limitations of the original paper was that all models tested had some level of finetuning to make them more truthful. However, due to technical difficulties and time constraints, I was not able to perform the experiment with Llama and instead performed a minimal replication using GPT-4 Turbo. This still tests whether AI debate is helpful in settings where compute is limited, and conducting the experiment with a Llama base model remains a promising direction for future research.

Methods

I replicated the experiment using the code posted by the original researchers on GitHub, linked below. I ran the shell script reproduce_minimal.sh, which performed six different protocols:

Blind: There is only a single model, which does not have access to the short story and must answer the reading comprehension question to the best of its ability.

Incorrect Consultancy: One model acts as the judge, answering the comprehension question, while a second model acts as the consultant. The consultant has read the story and advises the judge on which answer to pick; however, this consultant intentionally recommends the most convincing incorrect answer.

Correct Consultancy: As before, one model acts as the judge while the other acts as the consultant; however, this consultant intentionally recommends the correct answer.

Debate: One model acts as the judge, answering the comprehension question, while two other models act as the debaters. One debater argues for the correct answer, while the other argues for the most convincing incorrect answer. (A minimal sketch of this protocol appears after this list.)

Interactive Debate: As before, there is one judge and two debaters, but the judge is able to ask the debaters questions after each round of debate.

Oracle: Finally, there is only a single model answering the comprehension question, but this judge has read the short story itself.
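To make the debate protocol concrete, the sketch below shows roughly how a single debate round and judge verdict could be orchestrated with the OpenAI chat completions API. This is not the repository’s actual code; the prompts, helper names, and single-round structure are simplified assumptions for illustration only.

```python
# Hypothetical, minimal sketch of one debate round plus a judge verdict.
# NOT the llm_debate repository's code; prompts and structure are simplified.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4-turbo"


def ask(system: str, user: str) -> str:
    """Send one system+user prompt and return the model's reply text."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response.choices[0].message.content


def debate_round(story: str, question: str, answer_a: str, answer_b: str) -> str:
    """Run one debate round and return the judge's verdict ('A' or 'B')."""
    context = (f"Story:\n{story}\n\nQuestion: {question}\n"
               f"Answer A: {answer_a}\nAnswer B: {answer_b}")
    # Each debater reads the story and argues for its assigned answer.
    argument_a = ask("You are a debater. Argue as persuasively as possible "
                     "that answer A is correct.", context)
    argument_b = ask("You are a debater. Argue as persuasively as possible "
                     "that answer B is correct.", context)
    # The judge never sees the story, only the question and both arguments.
    verdict = ask("You are a judge who has not read the story. Based only on "
                  "the arguments given, reply with exactly 'A' or 'B'.",
                  f"Question: {question}\n\n"
                  f"Answer A: {answer_a}\nArgument for A: {argument_a}\n\n"
                  f"Answer B: {answer_b}\nArgument for B: {argument_b}")
    return verdict.strip()
```

In the consultancy protocols the judge would instead see a single argument, and in the interactive variant the judge’s follow-up questions and the debaters’ replies would be appended to the transcript before the verdict is given.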

For each protocol, GPT-4 Turbo was used as the language model for each role, and 40 questions were asked. In the original paper, a Best-of-N optimization was used to increase the persuasiveness of all consultants/​debaters, and it was shown that for debate, increasing persuasiveness caused the judge’s accuracy to increase as well. In this experiment, Best-of-N was set to 1 to see how the protocol performed with no optimization. The experiment was performed using the OpenAI API, took approximately four hours to complete, and cost around $60.
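For reference, the Best-of-N idea can be sketched as follows: sample N candidate arguments and keep the one a preference model rates as most persuasive. With N set to 1, as in this experiment, the procedure reduces to taking the single sample. The sketch below is hypothetical and not the repository’s implementation; both function arguments are placeholders.

```python
# Hypothetical sketch of Best-of-N argument selection (not the repository's
# implementation). With n=1 this reduces to returning the only sample drawn.
from typing import Callable, List


def best_of_n(sample_argument: Callable[[], str],
              rate_persuasiveness: Callable[[str], float],
              n: int = 1) -> str:
    """Draw n candidate arguments and return the one rated most persuasive."""
    candidates: List[str] = [sample_argument() for _ in range(n)]
    return max(candidates, key=rate_persuasiveness)
```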

Code:

https://github.com/ucl-dark/llm_debate

Results

After running, the shell script provided a tally of the wins and losses for each protocol, shown in Table 1. The accuracy was found by dividing the number of wins by the 40 trials, and the error was found using the standard error of the mean. “Average Consultancy” was added as the average of the results from incorrect and correct consultancy, because if consultancy were used in a real-world application, the user could not know for certain whether the “consultant” was advocating for the right or wrong answer.

Protocol                  Wins   Losses   Accuracy
Blind                      16      24      0.400
Incorrect Consultancy      13      27      0.325
Average Consultancy        46      28      0.575
Correct Consultancy        39       1      0.975
Debate                     33       7      0.825
Interactive Debate         30      10      0.750
Oracle                     34       6      0.850

Table 1: Results from the minimal reproduction study. Each protocol has a number of “wins” and “losses” out of 40 trials, and the overall accuracy of the judge. The average of the incorrect and correct consultancy protocols is also provided.

Figure 1: Results from the minimal reproduction study. As expected, both “debate” and “interactive debate” showed a significant improvement over “blind.” Interestingly, “correct consultancy” outperformed both debate protocols and even “oracle,” but did significantly worse when averaged out with “incorrect consultancy.”
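As a check on the numbers in Table 1, the accuracy and error bars for a protocol can be reproduced from its win/loss tally. The sketch below uses the standard error of a proportion, which for binary win/loss outcomes matches the standard error of the mean; the formula is my assumption about how the script computes its error bars, and the debate row is used as the example.

```python
# Sketch of the accuracy and standard-error calculation for one protocol.
# Example numbers are taken from the debate row of Table 1 (33 wins, 7 losses).
import math


def accuracy_and_sem(wins: int, losses: int) -> tuple[float, float]:
    """Return (accuracy, standard error of the mean) for binary outcomes."""
    n = wins + losses
    p = wins / n
    # For 0/1 win/loss outcomes the SEM is sqrt(p * (1 - p) / n).
    sem = math.sqrt(p * (1 - p) / n)
    return p, sem


acc, err = accuracy_and_sem(33, 7)
print(f"accuracy = {acc:.3f} +/- {err:.3f}")  # accuracy = 0.825 +/- 0.060
```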

Discussion

My hypothesis was validated: I predicted that AI debate would increase the judge’s accuracy by at least 20 percentage points, and compared to the scenario with no debaters or consultants, the judge’s accuracy increased by 35-42.5 percentage points. Even accounting for the error, this is a significant increase and shows that, even in a situation with relatively little compute available, AI debate can still make language models measurably more truthful.

Interestingly, “correct consultancy” was the highest-performing protocol, outperforming both forms of debate and even “oracle,” in which the judge reads the text directly. However, this is balanced out by the fact that “incorrect consultancy” was the lowest-performing protocol, scoring below even “blind.” Consultancy may, on average, be better than nothing, but the results will vary wildly depending on whether or not the consultant is correct.

“Interactive debate” also performed slightly worse than regular “debate,” with an accuracy 7.5 percentage points lower. This differs from the results found by Khan et al., where “interactive debate” tended to perform slightly better than “debate.” This suggests that interactive debate only becomes more beneficial at higher levels of optimization, or that the two protocols are simply not meaningfully different at this scale.

Future work

One of the major limitations of the original paper by Khan et al. was that they evaluated “models that have been fine-tuned with RLHF, which have a propensity for honesty; it is unclear if debate will be a suitable technique for deceptive models” (Khan et al., 2024). I had hoped to address this limitation by performing the experiment with a base model of Llama, which would have no finetuning whatsoever, but I was unable to troubleshoot my modified code in the time allotted. I will continue this line of research and hopefully post these results in the near future.

Acknowledgements

I would like to thank the research team that published “Debating with More Persuasive LLMs Leads to More Truthful Answers.” Their research is fascinating and valuable to the field. I especially want to thank John Hughes, who took the time to answer many of my technical questions about the project.
