Eh, there seems to be a connection to interpretability here.
For example, if the ML architecture were “modular, categorized, or otherwise legible to the agents themselves”, they could swap weights or models more quickly and effectively.
So legibility might emerge by selection pressure in an environment where, say, agents had limited capacity to store weights or data and had to constantly and extensively share weights with each other. You could imagine teams of agents surviving and proliferating because a shared architecture let them pass this data around fluently in the form of weights.
To keep the transmission mechanism itself from getting crazy baroque, you could put some kind of regularization penalty on its complexity.
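Very rough toy sketch of the kind of setup I mean, with everything made up for illustration (the module-signature encoding, the fitness function, the complexity penalty standing in for “regularization”, the mutation scheme are all hypothetical): agents are just lists of module interface signatures, they get rewarded when a randomly chosen teammate can actually use the modules they transmit under a capacity limit, and penalized for internal complexity. The question is whether the population converges on shared (i.e. legible) module signatures.

```python
"""Toy sketch: does weight-sharing pressure select for shared, legible structure?
Everything here is a made-up stand-in, not a real ML training setup."""
import random

random.seed(0)

N_AGENTS = 40
N_MODULES = 4            # each agent's "network" is split into this many chunks
SIGNATURE_SPACE = 8      # possible interface signatures per module slot
CAPACITY = 2             # modules an agent can transmit per round (limited storage)
GENERATIONS = 60
COMPLEXITY_WEIGHT = 0.3  # crude stand-in for regularizing the transmission mechanism


def random_agent():
    # An agent is just a tuple of module "interface signatures".
    # A teammate can use a transmitted module only if the signature at that slot matches.
    return [random.randrange(SIGNATURE_SPACE) for _ in range(N_MODULES)]


def fitness(agent, population):
    # Reward: how many of the agent's transmitted modules are usable by a random teammate.
    teammate = random.choice(population)
    slots = random.sample(range(N_MODULES), CAPACITY)
    usable = sum(agent[s] == teammate[s] for s in slots)
    # Penalty: proxy for a baroque transmission mechanism --
    # the number of distinct signatures the agent uses internally.
    complexity = len(set(agent))
    return usable - COMPLEXITY_WEIGHT * complexity


def mutate(agent):
    child = agent[:]
    if random.random() < 0.2:
        child[random.randrange(N_MODULES)] = random.randrange(SIGNATURE_SPACE)
    return child


population = [random_agent() for _ in range(N_AGENTS)]
for gen in range(GENERATIONS):
    scored = sorted(population, key=lambda a: fitness(a, population), reverse=True)
    survivors = scored[: N_AGENTS // 2]
    population = survivors + [
        mutate(random.choice(survivors)) for _ in range(N_AGENTS - len(survivors))
    ]

# "Legibility" proxy: how concentrated the population is on a shared signature per slot.
for slot in range(N_MODULES):
    counts = {}
    for a in population:
        counts[a[slot]] = counts.get(a[slot], 0) + 1
    top = max(counts.values()) / N_AGENTS
    print(f"slot {slot}: {top:.0%} of agents share the dominant module signature")
```

Obviously this skips the hard part (actual weights, actual tasks), but the point is just that “share under a capacity budget + penalize complexity” is the kind of pressure that could push a population toward a common, inspectable module format.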
I’m 90% sure this is just a shower thought, but it can’t be worse than “The Great Reflection”.