I wonder if it is possible to derive expected-utility-maximisation-type results from assumptions of “fitness” (as in, evolutionary fitness). This seems more relevant to the AI safety agenda—after all, we care about which kinds of AI are successful, not whether they can be said to be “rational”. It might also be a pathway to the kind of result AI safety people implicitly rely on—not that agents maximise some expected utility, but that they maximise utilities which force a good deal of instrumental convergence (i.e. describing them as expected utility maximisers is not just technically possible, but actually parsimonious). And if we get the instrumental convergence, then it doesn’t matter a great deal whether the AIs are strictly VNM-rational.
In conclusion, I think we’re interested in results like fitness → instrumental convergence, not rationality → VNM utility.
I largely endorse the position that a number of AI safety people have seen theorems of the latter type and treated them as if they imply theorems of the former type.
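To make the contrast concrete, here is a rough sketch of the two result shapes (the first is the standard VNM statement; the second is only a conjectured form that a fitness-based result would need to take, not an existing theorem):

```latex
% Rationality -> VNM utility (the standard theorem):
% if preferences \succeq over lotteries satisfy completeness, transitivity,
% continuity and independence, then a utility function u exists such that
\[
A \succeq B \iff \mathbb{E}_{A}[u] \ge \mathbb{E}_{B}[u].
\]

% Fitness -> instrumental convergence (the conjectured result we actually want):
% if an agent is selected for fitness in a sufficiently rich environment, then
\[
\text{selected for fitness} \;\overset{?}{\Longrightarrow}\;
\text{pursues convergent instrumental subgoals (resources, options, self-preservation)},
\]
% i.e. modelling it as an expected utility maximiser is parsimonious, not merely possible.
```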
I agree fitness is a more useful concept than rationality (and more useful than an individual agent’s power), so here’s a document I wrote about it: https://drive.google.com/file/d/1p4ZAuEYHL_21tqstJOGsMiG4xaRBtVcj/view