Global moratorium on AGI, now (Twitter). Founder of CEEALAR (née the EA Hotel; ceealar.org)
Greg_Colbourn ⏸️
Ok, I take your point. But no one seems to be actually doing this (it seems like it would already be possible for this example, yet it hasn’t been done).
What do you think a good resolution criterion for judging a system as being AGI should be?

Most relevant to X-risk concerns would be the ability to do A(G)I R&D as well as top AGI company workers. But then of course we run into the problem of crossing the point of no return in order to resolve the prediction market. And we obviously shouldn’t do that (unless superalignment/control is somehow solved).
The human testers were random people off the street who got paid $115-150 to show up and then an additional $5 per task they solved. I believe the ARC Prize Foundation’s explanation for the 40-point discrepancy is that many of the testers just didn’t feel that motivated to solve the tasks and gave up [my emphasis]. (I vaguely remember this being mentioned in a talk or interview somewhere.)
I’m sceptical of this when they were able to earn $5 for every couple of minutes’ work (the time to solve a task). $5 every couple of minutes works out to something like $150/hour, far above the average hourly wage.
100% is the score for a “human panel”, i.e. a set of at least two humans.
This also seems very remarkable (suspect, in fact): it would mean no overlap at all between the questions the humans were getting wrong. If each human averages 60% right, then for 2 humans to jointly get 100% there can only be 20% of questions that both get right! I think in practice the panels that score 100% have to contain many more than 2 humans on average.
EDIT: it looks like “at least 2 humans” means that every problem in the set was solved by at least 2 of the 400 humans who attempted it!
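To spell out the arithmetic behind the 2-human reading (illustrative figures only, assuming each tester gets exactly 60% of tasks right): if A and B are the sets of tasks the two testers solve, then for the panel (their union) to cover 100% of tasks,

$$|A \cup B| = |A| + |B| - |A \cap B| \implies 100\% = 60\% + 60\% - |A \cap B| \implies |A \cap B| = 20\%$$

i.e. they can both be right on only 20% of tasks, and the 40% of tasks each gets wrong can’t overlap with the other’s at all.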
See the quote in the footnote: “a provision that the system not simply be cobbled together as a set of sub-systems specialized to tasks like the above, but rather a single system applicable to many problems.”
the forecasts do not concern a kind of system that would be able to do recursive self-improvements (none of the indicators have anything to do with it)
The indicators are all about being human-level at ~every kind of work a human can do. That includes AI R&D. And AIs are already known to think (and act) much faster than humans, and that will only become more pronounced as the AGI improves itself; hence the “rapid recursive self-improvement”.
Even if it takes a couple of years, we would probably cross a point of no return not long after AGI.
None of these indicators actually imply that the “AGI” meeting them would be dangerous or catastrophic to humanity
Thanks for pointing this out. There was indeed a reasoning step missing from the text. Namely: such AGI would be able to automate further AI development, leading to rapid recursive self-improvement to ASI (Artificial Superintelligence). And it is ASI that will be lethally intelligent to humanity (/all biological life). I’ve amended the text.
there is nothing to indicate that such a system would be good at any other task
The whole point of having the 4 disparate indicators is that they have to be done by a single unified system (not specifically trained for only those tasks)[1]. Such a system would implicitly be general enough to do many other tasks. Ditto with the Strong AGI question.
While an ideal adversarial Turing test would be a very difficult task for an AI system, ensuring these ideal conditions is often not feasible. Therefore, I’m certainly going to expect news that AI systems will pass some form of the adversarial test
That is what both the Turing Test questions are all about! (Look at the success conditions in the fine print.)
- ^
Metaculus: “By “unified” we mean that the system is integrated enough that it can, for example, explain its reasoning on an SAT problem or Winograd schema question, or verbally report its progress and identify objects during videogame play. (This is not really meant to be an additional capability of “introspection” so much as a provision that the system not simply be cobbled together as a set of sub-systems specialized to tasks like the above, but rather a single system applicable to many problems.)”
- ^
It’s only 8 months later, and the top score on ARC-AGI-2 is now 54%.
One option, if you want to do a lot more about it than you currently are, is Pause House. Another is donating to PauseAI (US, Global). In my experience, being pro-active about the threat does help.
I have to think holding such a belief is incredibly distressing.
Have you considered that you might be engaging in motivated reasoning because you don’t want to be distressed about this? Also, you get used to it. Humans are very adaptable.
The 10% comes from the linked aggregate of forecasts, from thousands of people’s estimates/bets on Metaculus, Manifold and Kalshi; not the EA community.
I think this is pretty telling. I’ve also had a family member say a similar thing. If your reasoning is (at least partly) motivated by wanting to stay sane, you probably aren’t engaging with the arguments impartially.
I would bet a decent amount of money that you would not, in fact, go crazy. Look to history to see how few people went crazy over the threat of nuclear annihilation in the Cold War (and all the other things C.S. Lewis refers to in the linked quote).
But a lot of informed people do (i.e. an aggregation of forecasts). What would you do if you did believe both of those things?
AI Risk timelines: 10% chance (by year X) should be the headline (and deadline), not 50%. And 10% is _this year_!
See also (somewhat ironically) the AI roast:
its primary weakness is underexploring how individual rationalization might systematically lead safety-concerned researchers to converge on similar justifications for joining labs they believe pose existential threats.
That’s possible, but the responses really aren’t good. For example:
some of the ethics (and decision-theory) can get complicated (see footnote for a bit more discussion[10])
And then there’s a whole lot of moral-philosophical, rationalist argument in the footnote. But he completely ignores an obvious option: working to oppose the potentially net-negative organisation. Or in this case: working towards getting an international treaty on AGI/ASI that can rein in Anthropic and all the others engaged in the suicide race. I think Carlsmith could actually be highly impactful here if he worked as a lobbyist or diplomat, and as a public communicator (perhaps focused on an academic audience).
Meta note: it’s odd that my comment has got way more disagree votes than agree votes (16 vs 3 as of writing), but the OP has also got more disagree votes than agree votes (6 vs 3). I guess it’s different people? Or most of the people disagreeing with my comment can’t quite get themselves to agree with the main post?
Some choice quotes:
The first concern is that Anthropic as an institution is net negative for the world (one can imagine various reasons for thinking this, but a key one is that frontier AI companies, by default, are net negative for the world due to e.g. increasing race dynamics, accelerating timelines, and eventually developing/deploying AIs that risk destroying humanity – and Anthropic is no exception), and that one shouldn’t work at organizations like that.
...
Another argument against working for Anthropic (or for any other AI lab) comes from approaches to AI safety that focus centrally/exclusively on what I’ve called “capability restraint” – that is, finding ways to restrain (and in the limit, indefinitely halt) frontier AI development, especially in a coordinated, global, and enforceable manner. And the best way to work on capability restraint, the thought goes, is from a position outside of frontier AI companies, rather than within them (this could be for a variety of reasons, but a key one would be: insofar as capability restraint is centrally about restraining the behavior of frontier AI companies, those companies will have strong incentives to resist it).
...
Another argument against AI-safety-focused people working at Anthropic is that it’s already sucking up too much of the AI safety community’s talent. This concern can take various forms (e.g., group-think and intellectual homogeneity, messing with people’s willingness to speak out against Anthropic in particular, feeding bad status dynamics, concentrating talent that would be marginally more useful if more widely distributed, general over-exposure to a particular point of failure, etc). I do think that this is a real concern – and it’s a reason, I think, for safety-focused talent to think hard about the marginal usefulness of working at Anthropic in particular, relative to non-profits, governments, other AI companies, and so on. [It’s also one of the arguments for thinking that Anthropic might be net negative, and a reason that thought experiments like “imagine the current landscape without Anthropic” might mislead.]
...
Another concern about AI-safety-focused people working at AI companies is that it will restrict/distort their ability to accurately convey their views to the public – a concern that applies with more force to people like myself who are otherwise in the habit of speaking/writing publicly.
...
A different concern about working at AI companies is that it will actually distort your views directly – for example, because the company itself will be a very specific, maybe-echo-chamber-y epistemic environment, and people in general are quite epistemically permeable.
...
And of course, there are also concerns about direct financial incentives distorting one’s views/behavior – for example, ending up reliant on a particular sort of salary, or holding equity that makes you less inclined to push in directions that could harm an AI company’s commercial success
...
A final concern about AI safety people working for AI companies is that their doing so will signal an inaccurate degree of endorsement of the company’s behavior, thereby promoting wrongful amounts of trust in the company and its commitment to safety.
...
relative to some kind of median Anthropic view, both amongst the leadership and the overall staff, I am substantially more worried about classic existential risk from misalignment
...
[14] There is at least some evidence that early investors in Anthropic got the impression that Anthropic was initially committed to not pushing the frontier – a commitment that would be at odds with their current policy and behavior (though: I think Anthropic has in fact taken costly steps in the past to not push the frontier – see e.g. discussion in this article). If Anthropic made and then broke commitments in this respect, I do think this is bad and a point against expecting them to keep safety-relevant commitments in the future. And it’s true, regardless, that some of Anthropic’s public statements suggested reticence about pushing the frontier (see e.g. quotes here), and it seems plausible that the company’s credibility amongst safety-focused people and investors benefited from cultivating this impression. That said, the fact that Anthropic in fact took costly steps not to push the frontier suggests that this reticence was genuine – albeit, defeasible. And I think benefiting from stated and genuine reticence that ended up defeated is different from breaking a promise.
People have expressed concerns about Anthropic quietly revising/weakening the commitments in its Responsible Scaling Policy (see e.g. here on failing to define “warning sign evaluations” by the time they trained ASL-3 models, and here on weakening ASL-3 weight-theft security requirements so that they don’t cover employees with weight-access). I haven’t looked into this in detail, and I think it’s plausible that Anthropic’s choices here were reasonable, but I do think that the possibility of AI companies revising RSP-like policies, even in a manner that abides by the amendment procedure laid out in those policies (e.g., getting relevant forms of board/LTBT approval), highlights the limitations of relying on these sorts of voluntary policies to ensure safe behavior, especially as the stakes of competition increase.
I think it was bad that Anthropic used to have secret non-disparagement agreements (though: these have been discontinued and previous agreements are no longer being enforced). It also looks to me like Sam McCandlish’s comment on behalf of Anthropic here suggested a misleading picture in this respect, though he has since clarified.
I’ve heard concerns that Anthropic’s epistemic culture involves various vices – e.g. groupthink, over-confidence about how much the organization is likely to prioritize safety when it deviates importantly from standard commercial incentives, over-confidence about the degree of safety the organization’s RSP is likely to ultimately afford, general miscalibration about the extent to which Anthropic is especially ethically-driven vs. more of a standard company – and that the leadership plays an important role in causing this. This one feels hard for me to assess from the outside (and if true, some of the vices at stake are hardly unique to Anthropic in particular). I’m planning to see what I think once I actually see the culture up close.
I also think it’s true, in general, that Anthropic’s researchers have played a meaningful role in accelerating capabilities in the past – e.g. Dario’s work on early GPTs.
...
I think Anthropic itself has a serious chance of causing or playing an important role in the extinction or full-scale disempowerment of humanity – and for all the good intentions of Anthropic’s leadership and employees, I think everyone who chooses to work there should face this fact directly.
...
I do not think that Anthropic or any other actor has an adequate plan for building superintelligence in a manner that brings the risk of catastrophic, civilization-ending misalignment to a level that a prudent and coordinated civilization would accept.
...
I think this plan is quite a bit more promising than some of its prominent critics do. But it is nowhere near good enough, and thinking it through in such detail has increased my pessimism about the situation. Why? Well, in brief: the plan is to either get lucky, or to get the AIs to solve the problem for us. Lucky, here, means that it turns out that we don’t need to rapidly make significant advances in our scientific understanding in order to learn how to adequately align and control superintelligent agents that would otherwise be in a position to disempower humanity – luck that, for various reasons, I really don’t think we can count on. And absent such luck, as far as I can tell, our best hope is to try to use less-than-superintelligent AIs – with which we will have relatively little experience, whose labor and behavior might have all sorts of faults and problems, whose output we will increasingly struggle to evaluate directly, and which might themselves be actively working to undermine our understanding and control – to rapidly make huge amounts of scientific progress in a novel domain that does not allow for empirical iteration on safety-critical failures, all in the midst of unprecedented commercial and geopolitical pressures. True, some combination of “getting lucky” and “getting AI help” might be enough for us to make it through. But we should be trying extremely hard not to bet the lives of every human and the entire future of our civilization on this. And as far as I can tell, any actor on track to build superintelligence, Anthropic included, is currently on track to make either this kind of bet, or something worse.
...
I do not believe that the object-level benefits of advanced AI[18] – serious though they may be – currently justify the level of existential risk at stake in any actor, Anthropic included, developing superintelligence given our current understanding of how to do so safely.
...
I think that in a wiser, more prudent, and more coordinated world, no company currently aiming to develop superintelligence – Anthropic included – would be allowed to do so given the state of current knowledge.
...
I think it’s possible that there will, in fact, come a time when Anthropic should basically just unilaterally drop out of the race – pivoting, for example, entirely to a focus on advocacy and/or doing alignment research that it then makes publicly available. And I wish I were more confident that in circumstances where this is the right choice, Anthropic will do it despite all the commercial and institutional momentum to the contrary.
...
if, as a result, I end up concluding that working at Anthropic is a mistake, I aspire to simply admit that I messed up, and to leave.
...
When I think ahead to the kind of work that this role involves, especially in the context of increasingly dangerous and superhuman AI agents, I have a feeling like: this is not something that we are ready to do. This is not a game humanity is ready to play. A lot of this concern comes from intersections with the sorts of misalignment issues I discussed above. But the AI moral patienthood piece looms large for me as well, as do the broader ethical and political questions at stake in our choices about what sorts of powerful AI agents to bring into this world, and about who has what sort of say in those decisions.
Joe goes on to provide counters to these, but imo those counters are much weaker than the initial considerations against Anthropic. It’s like he’s tying himself in knots to justify taking the job, when he already knows, deep down, that it’s unconscionable.
This is incredible: it reads as a full justification for not working at Anthropic, yet the author concludes the opposite!
That’s good to see, but the money, power and influence are critical here[1], and they seem far too corrupted by investments in Anthropic, or just plain wishful techno-utopian thinking.
- ^
The poll respondents are not representative of the money, power and influence in EA. There is no one representing OpenPhil, CEA or 80k, no large donors, and only one top-25-karma account.
- ^
Just thinking: surely to be fair, we should be aggregating all the AI results into an “AI panel”? I wonder how much overlap there is between wrong answers amongst the AIs, and what the aggregate score would be?
Right now, as things stand with the scoring, “AGI” in ARC-AGI-2 means “equivalent to the combined performance of a team of 400 humans”, not “(average) human level”.
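For concreteness, here’s a minimal sketch of what I mean by an “AI panel” aggregate, scoring a task as solved if any panel member solves it. The model names and per-task results are made up purely for illustration; the real inputs would be the per-task outputs from the ARC-AGI-2 leaderboard.

```python
# Minimal sketch of "panel" scoring: a task counts as solved if at least
# one panel member solves it (i.e. the union of correct answers).
# The solver names and per-task results below are hypothetical.

def panel_score(results: dict[str, set[int]], num_tasks: int) -> float:
    """Fraction of tasks solved by at least one panel member."""
    solved = set().union(*results.values()) if results else set()
    return len(solved) / num_tasks

# Hypothetical per-model sets of solved task indices (tasks 0..9).
ai_results = {
    "model_a": {0, 1, 2, 3},
    "model_b": {2, 3, 4, 5},
    "model_c": {5, 6, 7},
}

print(panel_score(ai_results, num_tasks=10))  # 0.8 -- the "panel" solves 8/10

# Overlap in *wrong* answers between two members:
wrong_a = set(range(10)) - ai_results["model_a"]  # {4, 5, 6, 7, 8, 9}
wrong_b = set(range(10)) - ai_results["model_b"]  # {0, 1, 6, 7, 8, 9}
print(len(wrong_a & wrong_b))  # 4 tasks that both get wrong
```

Run over the actual leaderboard models’ per-task results, this would give both the “AI panel” score and the overlap in wrong answers I’m wondering about above.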