This makes me wonder if there could be good setups for evaluating AI systems as groups. You could have separate agent swarms in different sandboxes competing on metrics of safety and performance. The one that does better gets amplified. The agents may then have some incentive to enforce positive social norms for their group against things like sandbagging or deception. When deployed they might have not only individual IDs but group or clan IDs that tie them to each other and continue this dynamic.
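To make that concrete, here's a toy sketch of the selection loop. Everything here is made up for illustration: `evaluate_group` just returns a random score where the real version would run actual safety and performance evals on the sandboxed swarm, and the cloning step is the crudest possible notion of "amplification."

```python
import random
from dataclasses import dataclass, field

@dataclass
class Agent:
    agent_id: str
    group_id: str  # "clan" ID that persists into deployment
    policy: dict = field(default_factory=dict)

def evaluate_group(group: list[Agent]) -> float:
    """Placeholder for running the swarm in its sandbox and scoring it
    on combined safety + performance metrics."""
    return random.random()

def selection_round(groups: dict[str, list[Agent]], keep: int = 2) -> dict[str, list[Agent]]:
    """Score each sandboxed swarm, keep the top `keep` groups, and
    'amplify' the winners by cloning them into the vacated slots."""
    ranked = sorted(groups.items(), key=lambda kv: evaluate_group(kv[1]), reverse=True)
    survivors = ranked[:keep]
    next_gen = {}
    for slot in range(len(groups)):
        gid, members = survivors[slot % keep]
        new_gid = f"{gid}.{slot}"  # clones inherit the winning clan's lineage
        next_gen[new_gid] = [
            Agent(f"{a.agent_id}.{slot}", new_gid, dict(a.policy)) for a in members
        ]
    return next_gen
```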
Maybe there is some mechanism where group membership gets shuffled around occasionally, the way alleles get shuffled between genomes during recombination. Or maybe the traits of the systems themselves get shuffled, though that seems less clearly desirable. There are already evolutionary algorithms that imitate genetic recombination, but that would be somewhat different from shuffling social membership. You could potentially also combine social-group-membership systems with trait-recombination systems. Given the level of direct influence we'd have over the AIs, the whole setup might be closer to selective breeding than to natural group selection in certain respects, though not entirely.
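A crude version of the membership-shuffling idea, continuing the sketch above (the 10% migration rate is arbitrary, and this only does the social-membership variant, not trait recombination):

```python
import random

def shuffle_membership(groups: dict[str, list[Agent]], rate: float = 0.1) -> None:
    """Migrate a fraction of each group's members to another group,
    loosely analogous to alleles being shuffled between genomes.
    Migrants adopt their new clan ID, so group norms have to
    re-stabilize around newcomers each generation."""
    gids = list(groups)
    if len(gids) < 2:
        return
    for gid in gids:
        members = groups[gid]
        for _ in range(max(1, int(len(members) * rate))):
            if len(members) <= 1:
                break
            migrant = members.pop(random.randrange(len(members)))
            dest = random.choice([g for g in gids if g != gid])
            migrant.group_id = dest
            groups[dest].append(migrant)
```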
This is totally spitballing, but anything that encourages modularity in the AIs' circuits (or perhaps at some other level of organization?) and the ability to swap mind modules between systems would be really good for interpretability.
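For instance, if the modules were literal named submodules of a network, "swapping" could be as blunt as this toy PyTorch exchange; whether anything so coarse would carve a trained network at meaningful joints is exactly the open question.

```python
import torch
import torch.nn as nn

class TwoPartNet(nn.Module):
    """A network with two explicitly named modules, so swapping is trivial."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
        self.head = nn.Linear(32, 4)

    def forward(self, x):
        return self.head(self.features(x))

def swap_module(model_a: nn.Module, model_b: nn.Module, name: str) -> None:
    """Exchange the submodule called `name` between two models, in place."""
    a_mod, b_mod = getattr(model_a, name), getattr(model_b, name)
    setattr(model_a, name, b_mod)
    setattr(model_b, name, a_mod)

# Usage: swap the feature extractors of two independently initialized nets.
net1, net2 = TwoPartNet(), TwoPartNet()
swap_module(net1, net2, "features")
out = net1(torch.randn(1, 16))  # now runs net2's features into net1's head
```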
Ever since this project, I’ve had a vague sense that genome architecture has something interesting to teach us about interpreting/predicting NNs, but I’ve never had a particularly useful insight from it. Love this book on it by Michael Lynch if anyone’s interested.
I’ve heard this idea of AI group selection floated a few times but people used to say it was too computationally intensive. Now who knows?
The closest biology the idea brings to mind is this paper showing that selecting chickens as groups leads to better overall yields (in factory farming :( ) for the reasons you predict: group-selected hens aren’t as aggressive or stressed by crowding as chickens individually selected for the biggest yields.