A better open-source human-legible world-model, to be incorporated into future ML interpretability systems
Artificial intelligence
[UPDATE 3 MONTHS LATER: A better description and justification are now available in Section 15.2.2.1 here.]
It is probable that future powerful AGI systems will involve a learning algorithm that builds a common-sense world-model in the form of a giant unlabeled black-box data structure—after all, something like this is true in both modern machine learning and (I claim) human brains. Improving our ability, as humans, to look inside and understand the contents of such a black box is overwhelmingly (maybe even universally) viewed by AGI safety experts as an important step towards safe and beneficial AGI.
A future interpretability system will presumably look like an interface, with human-legible things on one side of the interface, and things-inside-the-black-box on the other side of the interface. For the former (i.e., human-legible) side of the interface, it would be helpful to have access to an open-source world-model / knowledge-graph data structure with the highest possible quality, comprehensiveness, and especially human-legibility, including clear and unambiguous labels. We are excited to fund teams to build, improve, and open-source such human-legible world-model data structures, so that they may be freely used as one component of current and future interpretability systems.
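To make that a bit more concrete, here's a deliberately toy sketch (in Python) of the kind of object I have in mind by "human-legible world-model / knowledge-graph data structure": concepts with plain-English labels and descriptions, plus labeled relations carrying provenance and uncertainty. Everything here is hypothetical and schematically simplified; existing projects use far richer schemas. It's only meant to pin down the shape of the thing.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """A human-legible concept node, e.g. 'water' or 'boiling'."""
    name: str                      # clear, unambiguous label
    description: str               # plain-English gloss for human readers
    aliases: list[str] = field(default_factory=list)

@dataclass
class Fact:
    """A labeled edge between concepts, with provenance and uncertainty."""
    subject: str                   # Entity.name
    relation: str                  # e.g. 'is_a', 'causes', 'part_of'
    obj: str                       # Entity.name
    confidence: float = 1.0        # uncertainty quantification
    source: str = ""               # where this claim came from

class WorldModel:
    """A toy knowledge graph: the human-legible side of the interface."""
    def __init__(self):
        self.entities: dict[str, Entity] = {}
        self.facts: list[Fact] = []

    def add_entity(self, e: Entity):
        self.entities[e.name] = e

    def add_fact(self, f: Fact):
        assert f.subject in self.entities and f.obj in self.entities
        self.facts.append(f)

    def facts_about(self, name: str) -> list[Fact]:
        return [f for f in self.facts if name in (f.subject, f.obj)]

# Usage: a two-concept, one-fact "world-model"
wm = WorldModel()
wm.add_entity(Entity("water", "The common liquid H2O."))
wm.add_entity(Entity("boiling", "Phase change from liquid to gas."))
wm.add_fact(Fact("water", "undergoes", "boiling", confidence=0.99, source="common sense"))
print(wm.facts_about("water"))
```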
~
Note 1: For further discussion, see my post Let’s buy out Cyc, for use in AGI interpretability systems? I still think that a hypothetical open-sourcing of Cyc would be a promising project along these lines. But I’m open to the possibility that other approaches are even better (see the comment section of that post for some possible examples). As it happens, I’m not personally familiar with what open-source human-legible world-models are out there right now. I’d just be surprised if they’re already so good that it wouldn’t be helpful to make them even better (more human-legible, more comprehensive, fewer errors, uncertainty quantification, etc.). After all, there are people building knowledge webs right now, but nobody is doing it for the purpose of future AGI interpretability systems. So it would be quite a coincidence if they were already doing everything exactly right for that application.
Note 2: Speaking of which, there could also be a separate project—or a different aspect of this same project—which entails trying to build an automated tool that matches up (a subset of) the entries of an existing open-source human-legible world-model / web-of-knowledge data structure with (a subset of) the latent variables in a language model like GPT-3. (It may be a fuzzy, many-to-many match, but that would still be helpful.) I’m even less of an expert there; I have no idea if that would work, or if anyone is currently trying to do that. But it does strike me as the kind of thing we should be trying to do.
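To illustrate what a "fuzzy, many-to-many match" might look like in the simplest possible terms, here's a toy sketch. It runs on synthetic data and makes the naive assumption that we can annotate a probe corpus with which knowledge-graph concepts each example mentions; it then correlates each concept indicator with each latent unit and keeps every pair above a threshold, so one concept can map to several latents and vice versa. I'm not claiming a real tool would work this way; this is just to show the shape of the output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: activations of D latent units over N probe examples,
# plus binary indicators of whether each of K knowledge-graph concepts
# appears in each example (both would come from real data in practice).
N, D, K = 1000, 64, 10
latents = rng.normal(size=(N, D))
concept_present = rng.integers(0, 2, size=(N, K)).astype(float)

# Inject a weak signal so the demo finds something: concept k nudges latent (3*k) % D.
for k in range(K):
    latents[:, (3 * k) % D] += 0.8 * concept_present[:, k]

def fuzzy_match(latents, concepts, threshold=0.3):
    """Return (concept, latent, score) triples whose |correlation| exceeds threshold.
    Many-to-many by construction: a concept can match several latents and vice versa."""
    z_lat = (latents - latents.mean(0)) / latents.std(0)
    z_con = (concepts - concepts.mean(0)) / concepts.std(0)
    corr = z_con.T @ z_lat / len(latents)          # K x D correlation matrix
    matches = []
    for k, d in zip(*np.where(np.abs(corr) > threshold)):
        matches.append((int(k), int(d), float(corr[k, d])))
    return matches

for concept_id, latent_id, score in fuzzy_match(latents, concept_present):
    print(f"concept {concept_id} <-> latent {latent_id}  (corr={score:+.2f})")
```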
Note 3: To be clear, I don’t think of myself as an interpretability expert. Don’t take my word for anything here. :-) [However, later in my post series I’ll have more detailed discussion of exactly where this thing would fit into an AGI control system, as I see it. Check back in a few weeks. Here’s the link.]