Paolo Bova comments on AGI Multi-Agent Alignment Simulation

Paolo Bova 11 May 2026 19:22 UTC
1 point
This is super cool work, David and Zoe!

It’s rare to see LLM games that contain this much structure (you have a discrete set of actions which update a world state, and even a bunch of shocks). The other thing I was impressed by is the three different LLM judges. Looking forward to seeing more visualisations.
I have a few questions.
- Were there any challenges in getting the judges to behave reliably?
- You mentioned seeing if there were stable ways for players to coordinate on AI alignment in the face of competitive pressure. From your work so far, do you have any hypotheses or interventions that you would want to try?
- I’m curious as to how the competitive dynamics are captured. Are you drawing upon any models of AI race dynamics (e.g. Armstrong et al. 2016, Han et al. 2020, Stafford et al. 2022)? Also, have you seen the Intelligence Rising paper by Avin et al. 2024? I’m wondering whether you’ve seen behaviours similar to what they’ve seen in their workshops.
- Zoe L 12 May 2026 12:42 UTC
  3 points
  Thanks Paolo!
  - When we used gpt-4o and gemini-2.5-flash as judges, they struggled with math (for enforcing resource and value constraints) and generally didn’t justify their claims as much. Upgrading to gpt-5.4 and gemini-2.5-pro solved the math problem, though claude-sonnet-4-6 still provided more reasoning for its decisions.
  - Since we’ve only run 3-year simulations (i.e., 3 turns for each game), we can’t make claims about long-term equilibrium. However, we did observe that different shock events seem to encourage different strategies even in the 3-year sim, e.g. alignment_breakthrough incentivized transparency (i.e., more cooperative play), while nationalization_shock incentivized resource consolidation (i.e., more competitive play). Running simulations over a longer horizon (10-50 turns) with different parameters and under different scenarios would help confirm whether these shock-induced trends hold.
    
    We also observed that more A2A communication led to more cooperation and thus better vibe-based alignment. Players are always allowed A2A communication and are always truthful about their actions in the current sim, so it would be interesting to test what happens when players are allowed (or even encouraged) to use deception or when there’s fog-of-war/a lack of A2A communication.
  - We didn’t reference the specific papers you mentioned, but we share their ideas. Kenneth (2026), on an AI-simulated nuclear war game, influenced our design of the race mechanics the most. Carichon et al. and Zeng et al. influenced our design of the multi-layer value system. We’ve seen behaviours similar to those observed in Avin et al. 2024, especially:
    - The power to steer the future of AI development is very unequally distributed due to several drivers for concentration, including the enormous compute requirements of the latest frontier AI models
    - There exists an information asymmetry where states and the public will constantly be catching up to deal with the impacts of the last generation of AI technologies
    - Winners take all rather than winner takes all + Division into blocs by state lines
    - Tech race + Races are destabilising
    - Supply chain disruptions slow but don’t stop AI and cause instability (we actually have a supply chain disruption shock event, so again, will be interesting to run it over a longer time horizon)