Meta AI announces Cicero: Human-Level Diplomacy play (with dialogue)
Abstract
Despite much progress in training AI systems to imitate human language, building agents that use language to communicate intentionally with humans in interactive environments remains a major challenge. We introduce Cicero, the first AI agent to achieve human-level performance in Diplomacy, a strategy game involving both cooperation and competition that emphasizes natural language negotiation and tactical coordination between seven players. Cicero integrates a language model with planning and reinforcement learning algorithms by inferring players’ beliefs and intentions from its conversations and generating dialogue in pursuit of its plans. Across 40 games of an anonymous online Diplomacy league, Cicero achieved more than double the average score of the human players and ranked in the top 10% of participants who played more than one game.
Meta Fundamental AI Research Diplomacy Team (FAIR)†, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, et al. 2022. “Human-Level Play in the Game of Diplomacy by Combining Language Models with Strategic Reasoning.” Science, November, eade9097. https://doi.org/10.1126/science.ade9097.
I think it’s very impressive! It’s worth noting that this agent has already won a small-scale press Diplomacy tournament: https://www.thenadf.org/tournament/captain-meme-runs-first-blitzcon/ (playing under the name Franz Broseph), and there’s also commentated footage of a human vs. all-Cicero bot game here:
That being said, it’s worth noting that they built quite a complicated, specialized AI system (i.e., they did not take an LLM and finetune a generalist agent that can also play Diplomacy):
First, they train a dialogue-conditional action model by behavioral cloning on human data to predict what other players will do.
Then they do joint RL planning to get action intentions for the AI and the other players, using the outputs of the conditional action model and a learned dialogue-free value model. (They also regularize this plan with a KL penalty toward the output of the action model; a minimal sketch of this step is below.)
They also train a conditional dialogue model by finetuning a small LM (a 2.7B BART) to map intents + game history → messages.
They train a set of filters to remove hallucinations, inconsistencies, toxicity, etc. from the output messages before sending them to other players.
The intents are updated after every message. At the end of each turn, they output the final intent as the action.
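A minimal sketch of that KL-regularized planning step, assuming a finite set of candidate order-sets with Q-values from the value model and anchor probabilities from the imitation (action) model; the function name and the toy numbers below are mine, not from the paper. Maximizing expected value minus λ·KL(π‖π_imitation) has the closed-form solution π(a) ∝ π_imitation(a)·exp(Q(a)/λ):

```python
import numpy as np

def kl_regularized_policy(q_values, anchor_probs, lam):
    """Solve: maximize E_pi[Q(a)] - lam * KL(pi || pi_anchor).

    Closed form: pi(a) proportional to pi_anchor(a) * exp(Q(a) / lam).
    Small lam gives nearly greedy, value-maximizing play; large lam keeps
    the policy close to the human-imitation (behavioral-cloning) anchor.
    """
    logits = np.log(np.asarray(anchor_probs) + 1e-12) + np.asarray(q_values) / lam
    logits -= logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy example: three candidate order-sets for one power.
anchor = np.array([0.7, 0.2, 0.1])  # what the imitation model says humans usually play
q      = np.array([0.1, 0.5, 0.4])  # expected value under the learned value model
print(kl_regularized_policy(q, anchor, lam=0.5))
# Mass shifts toward the higher-value actions, but the human prior still has weight.
```

In the paper this trade-off is planned jointly over all seven players rather than for one power in isolation, but the knob is the same: the KL weight controls how far the plan is allowed to drift from “human-plausible” play.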
I do expect someone to figure out how to avoid all these dongles and do it with a more generalist model in the next year or two, though.
I think people who are freaking out about Cicero more so than about foundation model scaling/prompting progress are wrong; this is not much of an update on AI capabilities, nor an update on Meta’s plans (they were publicly working on Diplomacy for over a year). I don’t think they introduce any new techniques in this paper either?
It is an update upwards on the competency of this team at Meta, a slight update upwards on the capabilities of small LMs, and probably an update upwards on the amount of hype and interest in AI.
But yes, this is the sort of thing that you’d see more of in short timelines rather than long.
This is pretty astounding. It seems to me that it’s a result that’s consistent with all the other recent progress AI/ML systems have been making in playing competitive and cooperative strategy games, as well as in using language, but it’s still a really impressive outcome. My sense is that this is the kind of result that you’d tend to see in a world with shorter rather than longer timelines.
As for my personal feelings on the matter, I think they’d best be summed up by this image.
Agreed. I’m working on this area and I thought it would be another few years before we saw human-level AI in Diplomacy with natural language communication. Diplomacy without natural language was already at human level, but we don’t have many examples of game-playing agents successfully using natural language. This seems to show there’s not much of a challenge there.
Wow, this sounds impressive on the face of it, though I also wonder how well the best No Press Diplomacy AI would do here (perhaps with a very mediocre chat capability added). Maybe the dialogue isn’t actually doing that much work? I’d be interested in more information on this.
There’s a commentated video by someone who plays as the only human in an otherwise all-Cicero game, which at least makes it seem like the dialogue is doing a lot.
A random thought I had while watching the video: the commentator pointed out that the bots predictably behave in their own self-interest, whereas human players in bad/losing positions will generally throw all their forces against whoever backstabbed them rather than try to salvage a hopeless position. My personal style of play is much more bot-like than what the commentator described as typical of human professionals. If the game weren’t anonymous, I’d see the incentive to retaliate in order to shape incentives for future games, but given that players’ usernames are anonymous, the bots’ approach of always trying to improve their position seems best to me.
gwern on /r/machinelearning:
Helpful, thanks!
I watched the commentated video you and Lawrence shared, and it still wasn’t clear to me from the gameplay how much the press component was actually helping the Diplomacy agents (e.g., I wasn’t sure if the bots were cooperating/backstabbing, or if they were always set on playing the moves they did regardless of what was said in the press). In a game with just one human and the rest bots, the human obviously wouldn’t have an advantage if the bots all behaved like No Press bots. I think a mixed game with multiple humans and multiple bots would be more insightful.
Worth noting that this was “Blitz” Diplomacy with only five-minute negotiation rounds. Still very impressive though.
(I spent a couple of hours thinking about this and discussing it with a friend, and want to try to write some tentative conclusions down.)
I think the mere fact that Meta was able to combine strategic reasoning and dialog to let an AI achieve its goals in a collaborative and competitive environment with humans should give us some pause. On the technical level, I think the biggest contributions are two things: the RL-trained strategy model that explicitly models the other human agents and finds the balance between optimal play and “human play”; and establishing an interface between that module and the language model by encoding the intent.
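As a rough illustration of that intent interface (the function name, field layout, and example orders below are all hypothetical; the actual token format in the paper differs), the planner’s intended orders for the speaker and the message recipient can be serialized into a flat conditioning string for the finetuned dialogue model:

```python
def encode_intent(recipient: str, my_orders: list[str], their_orders: list[str]) -> str:
    """Hypothetical serialization of a planned intent into a conditioning string
    for the dialogue model; the actual format used in the paper differs."""
    return (
        f"RECIPIENT: {recipient} | "
        f"MY PLANNED ORDERS: {'; '.join(my_orders)} | "
        f"{recipient} PLANNED ORDERS: {'; '.join(their_orders)}"
    )

prompt = encode_intent(
    recipient="ENGLAND",
    my_orders=["A PAR - BUR", "F BRE - MAO"],
    their_orders=["F LON - ENG"],
)
# The dialogue model conditions on this intent plus the game and message history,
# and generates a message consistent with the plan, e.g. proposing that England
# move its fleet into the English Channel.
```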
I think the Cicero algorithm is bad news for AI safety. It demonstrates yet another worrying capability of AI, and Meta’s open publishing norm is speeding up this type of work and making it harder to control.
Of course, the usual rebuttal to concerns about this sort of paper is still valid: This is just a game, which makes two important things easier:
There’s a clear state space, action space, and reward function, so RL is easy to set up (see the sketch after this list).
The dialog is pretty simple, consisting of short sentences in a very specific domain.
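To make the first point concrete, here is a minimal sketch of what that “clear” interface looks like (the type and function names are mine, and the order enumeration is stubbed out; this is an illustration, not the paper’s environment code):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class DiplomacyState:
    units: Dict[str, List[str]]           # power -> unit locations, e.g. {"FRANCE": ["A PAR", "F BRE"]}
    supply_centers: Dict[str, List[str]]  # power -> owned supply centers
    phase: str                            # e.g. "SPRING 1901 MOVEMENT"

def legal_order_sets(state: DiplomacyState, power: str) -> List[List[str]]:
    """Discrete action space: every legal combination of orders for one power
    (the actual enumeration is omitted here)."""
    raise NotImplementedError

def reward(state: DiplomacyState, power: str) -> float:
    """A simple scalar reward: fraction of supply centers owned (winning means owning them all)."""
    total = sum(len(centers) for centers in state.supply_centers.values())
    return len(state.supply_centers.get(power, [])) / max(total, 1)
```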
One of the most interesting questions is where things will go from here. I think adding deception to the model should allow it to play even better and I’d give it 50% that somebody will make that work within the next year. Beyond that, now that the scaffolding is established, I expect superhuman level agents to be developed in the next year as well. (I’m not quite sure how “superhuman” you can get in Diplomacy – maybe the current model is already there – but maybe the agent could win any tournament it enters.)
Beyond that, I’m curious when and where we will see some of the techniques established in the paper in real-world applications. I think the bottleneck here will be finding scenarios that can be modeled with RL and that benefit from talking with humans. Such scenarios seem difficult to identify: when talking with humans is required, that usually indicates a messy environment where actions and goals aren’t clearly defined. A positive example I can think of is a language coach (sketched below): the goal could be to optimize test scores, and the action would be picking from a set of exercises. This alone could already work well, but if you add in human psychology and the fact that, e.g., motivation can be a key driver in learning, then dialog becomes important as well.
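For what it’s worth, a minimal sketch of that language-coach framing as a bandit problem (the class, the exercise list, and the reward definition are all my own illustration, not anything from the paper):

```python
import random
from collections import defaultdict
from typing import Dict

EXERCISES = ["vocab_drill", "listening", "grammar_quiz", "conversation_practice"]

class CoachBandit:
    """Toy bandit version of the language-coach idea: actions are exercises and
    the reward is test-score improvement. The Cicero-style addition would be to
    also condition on dialog with the learner (motivation, feedback)."""

    def __init__(self, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.totals: Dict[str, float] = defaultdict(float)  # summed score improvement per exercise
        self.counts: Dict[str, int] = defaultdict(int)       # times each exercise was assigned

    def choose(self) -> str:
        # Explore occasionally, otherwise pick the exercise with the best average improvement.
        if random.random() < self.epsilon or not self.counts:
            return random.choice(EXERCISES)
        return max(EXERCISES, key=lambda e: self.totals[e] / max(self.counts[e], 1))

    def update(self, exercise: str, score_improvement: float) -> None:
        self.totals[exercise] += score_improvement
        self.counts[exercise] += 1
```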