Harvard student and community builder interested in AI xrisk reduction
NickGabs
Getting Actual Value from “Info Value”: Example from a Failed Experiment
Distillation of “How Likely is Deceptive Alignment?”
Informal outreach via personal networks is underrated as a community building strategy
I think you can stress the “ideological” implications of externalities to lefty audiences while taking a more neutral tone with more centrist or conservative audiences. The idea that externalities exist and require intervention is not, IMO, especially ideologically charged.
I think the fact that these results are surprising indicates that EAs are underestimating how likely this is. AI has many bad effects: social media, bias and discrimination, unemployment, deepfakes, etc. Plus, I think sufficiently competent AI will seem scary to people; a lot of people aren’t really aware of recent developments, but I think they would be freaked out if they were. I think we should position ourselves to take advantage of this backlash if it happens.
It is true that private developers internalize some of the costs of AI risk. However, this is also true in the case of carbon emissions; if a company emits CO2, its shareholders do pay some costs in the form of a more polluted atmosphere. The problem is that the private developer pays only a very small fraction of the total cost, and that fraction, while still quite large in absolute terms, is plausibly worth paying for the upside. For example, if I were entirely selfish and thought AI risk was somewhat less likely than I actually do (let’s say 10%), I would probably be willing to risk a 10% chance of death for a 90% chance of massive resource acquisition and control over the future. However, if I internalized the full costs of that 10% chance (everyone else dying and all future generations being wiped out), then I would not be willing to take that gamble.
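To make the externality structure explicit, here is a minimal expected-value sketch using the hypothetical 10%/90% figures above; $B$, $C_{\text{self}}$, and $C_{\text{world}}$ are illustrative placeholders I am introducing for the value of winning the gamble, the cost of the developer’s own death, and the cost of everyone plus all future generations being wiped out:

$$\mathrm{EV}_{\text{selfish}} = 0.9\,B - 0.1\,C_{\text{self}}, \qquad \mathrm{EV}_{\text{social}} = 0.9\,B - 0.1\,C_{\text{world}}.$$

Because $C_{\text{world}} \gg C_{\text{self}}$, the selfish expected value can easily be positive while the social expected value is enormously negative, which is the same structure as the carbon-emissions case.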
Stress Externalities More in AI Safety Pitches
Lessons from Three Mile Island for AI Warning Shots
I mostly agree with you that if we get AGI and not ANI, the AGI will be able to learn the skills relevant to taking over the world. However, I think that due to inductive biases and quasi-innate intuitions, different generally intelligent systems vary in how easily they can learn different domains. For example, it is very difficult for autistic people (particularly severely autistic people) to learn social skills. Similarly, high-quality philosophical thinking seems to be basically impossible for most humans. Applying this to AGI, it might be very hard for an AGI to learn long-term planning or social skills.
The key here is transparency. Partly because people openly discuss their moral views, and partly because, even when people do not state their views explicitly, others are good enough at reading them to get at least weak evidence about whether they are trustworthy, consequentialists may be unable to seem like perfect deontologists without actually being deontologists.
I think the key thing here is that the criterion by which EA intellectuals decide whether something is interesting is significantly related to whether it is useful. Firstly, because a lot of EAs are intellectually interested in things that are at least somewhat relevant to EA, many of these fields seem useful at least at a high level: moral philosophy, rationality, and AI alignment are all clearly important for EA. Moreover, many people don’t find these topics interesting at all, so they are actually highly neglected. This is compounded by the fact that they are very hard, and thus probably only quite smart people with good epistemics can make much progress on them. These two features in turn make the work look more suspiciously theoretical than it would be if the broad domains in question (formal ethics, applied rationality, alignment) were less neglected, since fields become increasingly applied as they become better theorized. In other words, it seems prima facie plausible that highly talented people should work in domains that were initially selected partly for their relevance to EA, that are highly neglected because they are quite difficult and not very interesting to people outside EA, and that are therefore more theoretical than they would be if more people worked on them.
A key crux here seems to be your claim that AIs will attempt these plans before they have the relevant capacities because they are operating on short timescales. However, given enough time and patience, it seems clear to me that the AI could succeed simply by not taking risky actions that it knows it might mess up until it self-improves enough to take those actions. The question then becomes how long the AI thinks it has until another AI that could dominate it is built, as well as how fast self-improvement is.
I think the OP’s argument depends on the idea that “Nobody is going to debug the geopolitical abilities of an AI designed to build paperclips. So the fact that debugging occurs in one domain is no guarantee of success in any other.” If AIs have human-level or greater capacities in the domains relevant to forming an initial plan to take over the world and beginning that plan, but have subhuman capacities/bugs in the later stages of that plan, then, assuming at least human-level capacities are needed in those later domains in order to succeed, the threshold could be pretty large: AIs could keep getting smarter in domains related to the initial stages of the plan, which are presumably closer to the distributions they have been trained on (e.g. social manipulation/text output to escape a box), while failing to make as much progress in the more OOD domains.
This is very true. However, the OP’s point still helps us: an AI that is simultaneously smart enough to be useful in a narrow domain, misaligned, but also too stupid to take over the world could help us reduce xrisk. In particular, if it is superhumanly good at alignment research, it could output good alignment research as part of its deception phase. This would significantly reduce the risk from future AIs without causing xrisk since, ex hypothesi, the AI is too stupid to take over. The main question here is whether an AI could be smart enough to do very good alignment research and also too stupid to take over the world if it tried. I am skeptical but pretty uncertain, so I would give it at least a 10% chance of being true, and maybe higher.
Good post! I think that addressing these concerns is definitely important; based on some similar conversations I have had recently, I have updated toward thinking the non-EA perception of EA is worse than I thought, for reasons like these. However, on the object level, while I think the mental health and social skills claims are probably true, I would be very surprised if EAs were particularly bad at paying attention to their own feelings. Particularly in the Bay Area EA community, but AFAIK more broadly as well, I feel like there is a lot of focus on mental health, techniques like rationalist “focusing,” meditation, and IFS to get you in touch with your feelings, and lots of community discourse about these topics. Similarly, there is a lot of community attention to problems like EA being a social bubble, burnout, and other ways in which EA can become too all-encompassing. Am I missing something?
I think you should be substantially more optimistic about the effects of aligned AGI. Once we have aligned AGI, high-end cognitive labor becomes very cheap, because once an AI system is trained, it is relatively cheap to deploy en masse. Some of these AI scientists would presumably work on making AIs cheaper, if not more capable, which in the limit yields a functionally infinite supply of high-end scientists. Given a functionally infinite supply of high-end scientists, we will quickly discover basically everything that can be discovered through parallelizable scientific labor, which is, if not everything, at least quite a few things (e.g. I have pretty high confidence that we could solve aging, develop extremely good vaccines to protect against biorisk, etc.). Moreover, this is only a lower bound; I think AGI will probably become significantly smarter than the smartest human relatively quickly, so we will probably do even better than the aforementioned scenario.