AI safety governance/strategy research & field-building.
Formerly a PhD student in clinical psychology @ UPenn, college student at Harvard, and summer research fellow at the Happier Lives Institute.
Thanks for writing this, Emma! Upvoted :)
Here’s one heuristic I heard at a retreat several months ago: “If you’re ever running an event that you are not excited to be part of, something has gone wrong.”
Obviously, it’s just a heuristic, but I actually found it to be a pretty useful one. I think a lot of organizers spend time hosting events that feel more like “teaching” rather than “learning together or working on interesting unsolved problems together.”
And my impression is that the groups that have fostered more of a “let’s learn together and do things together” mentality have tended to have the most success.
This seems like a good time to amplify Ashley’s We need alternatives to intro EA Fellowships, Trevor’s University groups should do more retreats, Lenny’s We Ran an AI Timelines Retreat, and Kuhan’s Lessons from Running Stanford EA and SERI.
Congrats to Zach! I feel like this is mostly supposed to be a “quick update/celebratory post”, but there’s a missing mood that I want to convey in this comment. Note that my thoughts mostly come from an AI Safety perspective, so they may be less relevant for folks who focus on other cause areas.
My impression is that EA is currently facing an unprecedented amount of PR backlash, as well as some solid internal criticisms from core EAs who are now distancing themselves from EA. I suspect this will likely continue into 2024. Some examples:
EA has acquired several external enemies as a result of the OpenAI coup. I suspect that investors/accelerationists will be looking for ways to (further) damage EA’s reputation.
EA is acquiring external enemies as a result of its political engagements. There have been a few news articles recently criticizing EA-affiliated or EA-influenced fellowship programs and think-tanks.
EA is acquiring an increasing number of internal critics. Informally, I feel like many people I know (myself included) have become increasingly dissatisfied with the “modern EA movement” and “mainstream EA institutions”. Examples of common criticisms include “low integrity/low openness”, “low willingness to critique powerful EA institutions”, “low willingness to take actions in the world that advocate directly/openly for beliefs”, “coziness with AI labs”, “general slowness/inaction bias”, and “lack of willingness to support groups pushing for concrete policies to curb the AI race.” (I’ll acknowledge that some of these are more controversial than others and could reflect genuine worldview differences, though even so, my impression is that they’re meaningfully contributing to a schism in ways that go beyond typical worldview differences).
I’d be curious to know how CEA is reacting to this. The answer might be “well, we don’t really focus much on AI safety, so we don’t really see this as our thing to respond to.” The answer might be “we think these criticisms are unfair/low-quality, so we’re going to ignore them.” Or the answer might be “we take X criticism super seriously and are planning to do Y about it.”
Regardless, I suspect that this is an especially important and challenging time to be the CEO of CEA. I hope Zach (and others at CEA) are able to navigate the increasing public scrutiny & internal scrutiny of EA that I suspect will continue into 2024.
Thank you for writing this, Ben. I think the examples are helpful, and I plan to read more about several of them.
With that in mind, I’m confused about how to interpret your post and how much to update on Eliezer. Specifically, I find it pretty hard to assess how much I should update (if at all) given the “cherry-picking” methodology:
Here, I’ve collected a number of examples of Yudkowsky making (in my view) dramatic and overconfident predictions concerning risks from technology.
Note that this isn’t an attempt to provide a balanced overview of Yudkowsky’s technological predictions over the years. I’m specifically highlighting a number of predictions that I think are underappreciated and suggest a particular kind of bias.
If you were to apply this to any EA thought leader (or non-EA thought leader, for that matter), I strongly suspect you’d find a lot of clear-cut and disputable examples of them being wrong on important things.
As a toy analogy, imagine that Alice is widely considered to be extremely moral. I hire an investigator to find as many examples of Alice doing Bad Things as possible. I then publish my list of Bad Things that Alice has done. And I tell people “look—Alice has done some Bad Things. You all think of her as a really moral person, and you defer to her a lot, but actually, she has done Bad Things!”
And I guess I’m left with a feeling of… OK, but I didn’t expect Alice to have never done Bad Things! In fact, maybe I expected Alice to do worse things than the things that were on this list, so I should actually update toward Alice being moral and defer to Alice more.
To make an informed update, I’d want to understand your balanced take. Or I’d want to know some of the following:
How much effort did the investigator spend looking for examples of Bad Things?
Given my current impression of Alice, how many Bad Things (weighted by badness) would I have expected the investigator to find?
How many Good Things did Alice do (weighted by goodness)?
Final comment: I think this comment might come across as ungrateful—just want to point out that I appreciate this post, find it useful, and will be more likely to challenge/question my deference as a result of it.
Adding this comment over from the LessWrong version. Note Evan and others have responded to it here.
Thanks for writing this, Evan! I think it’s the clearest writeup of RSPs & their theory of change so far. However, I remain pretty disappointed in the RSP approach and the comms/advocacy around it.
I plan to write up more opinions about RSPs, but one I’ll express for now is that I’m pretty worried that the RSP dialogue is suffering from motte-and-bailey dynamics. One of my core fears is that policymakers will walk away with a misleadingly positive impression of RSPs. I’ll detail this below:
What would a good RSP look like?
Clear commitments along the lines of “we promise to run these 5 specific tests to evaluate these 10 specific dangerous capabilities.”
Clear commitments regarding what happens if the evals go off (e.g., “if a model scores above a 20 on the Hubinger Deception Screener, we will stop scaling until it has scored below a 10 on the relatively conservative Smith Deception Test.”)
Clear commitments regarding the safeguards that will be used once evals go off (e.g., “if a model scores above a 20 on the Cotra Situational Awareness Screener, we will use XYZ methods and we believe they will be successful for ABC reasons.”)
Clear evidence that these evals will exist, will likely work, and will be conservative enough to prevent catastrophe
Some way of handling race dynamics (such that Bad Guy can’t just be like “haha, cute that you guys are doing RSPs. We’re either not going to engage with your silly RSPs at all, or we’re gonna publish our own RSP but it’s gonna be super watered down and vague”).
What do RSPs actually look like right now?
Fairly vague commitments, more along the lines of “we will improve our information security and we promise to have good safety techniques. But we don’t really know what those look like.”
Unclear commitments regarding what happens if evals go off (let alone what evals will even be developed and what they’ll look like). Very much a “trust us; we promise we will be safe. For misuse, we’ll figure out some way of making sure there are no jailbreaks, even though we haven’t been able to do that before.”
Also, for accident risks/AI takeover risks… well, we’re going to call those “ASL-4 systems”. Our current plan for ASL-4 is “we don’t really know what to do… please trust us to figure it out later. Maybe we’ll figure it out in time, maybe not. But in the meantime, please let us keep scaling.”
Extremely high uncertainty about what safeguards will be sufficient. The plan essentially seems to be “as we get closer to highly dangerous systems, we will hopefully figure something out.”
No strong evidence that these evals will exist in time or work well. The science of evaluations is extremely young, and the current evals are more like “let’s play around and see what things can do” rather than “we have solid tests and some consensus around how to interpret them.”
No way of handling race dynamics absent government intervention. In fact, companies are allowed to break their voluntary commitments if they’re afraid that they’re going to lose the race to a less safety-conscious competitor. (This is explicitly endorsed in ARC’s post and Anthropic includes such a clause.)
Important note: I think several of these limitations are inherent to the current gameboard. Like, I’m not saying “I think it’s a bad move for Anthropic to admit that they’ll have to break their RSP if some Bad Actor is about to cause a catastrophe.” That seems like the right call. I’m also not saying that dangerous capability evals are bad—I think it’s a good bet for some people to be developing them.
Why I’m disappointed with current comms around RSPs
Instead, my central disappointment comes from how RSPs are being communicated. It seems to me like the main three RSP posts (ARC’s, Anthropic’s, and yours) are (perhaps unintentionally?) painting an overly optimistic portrayal of RSPs. I don’t expect policymakers who engage with the public comms to walk away with an appreciation for the limitations of RSPs, their current level of vagueness + “we’ll figure things out later”ness, etc.
On top of that, the posts seem to have this “don’t listen to the people who are pushing for stronger asks like moratoriums—instead please let us keep scaling and trust industry to find the pragmatic middle ground” vibe. To me, this seems not only counterproductive but also unnecessarily adversarial. I would be more sympathetic to the RSP approach if it were more like “well yes, we totally think it’d be great to have a moratorium or a global compute cap or a kill switch or a federal agency monitoring risks or a licensing regime, and we also think this RSP thing might be kinda nice in the meantime.” Instead, ARC explicitly tries to paint the moratorium folks as “extreme”.
(There’s also an underlying thing here where I’m like “the odds of achieving a moratorium, or a licensing regime, or hardware monitoring, or an agency that monitors risks and has emergency powers—the odds of meaningful policy getting implemented are not independent of our actions.” The more that groups like Anthropic and ARC claim “oh, that’s not realistic”, the less realistic those proposals are. I think people are also wildly underestimating the degree to which Overton Windows can change and the amount of uncertainty there currently is among policymakers, but this is a post for another day, perhaps.)
I’ll conclude by noting that some people have gone as far as to say that RSPs are intentionally trying to dilute the policy conversation. I’m not yet convinced this is the case, and I really hope it’s not. But I’d really like to see more coming out of ARC, Anthropic, and other RSP-supporters to earn the trust of people who are (IMO reasonably) suspicious when scaling labs come out and say “hey, you know what the policy response should be? Let us keep scaling, and trust us to figure it out over time, but we’ll brand it as this nice catchy thing called Responsible Scaling.”
When should someone who cares a lot about GCRs decide not to work at OP?
I agree that there are several advantages of working at Open Phil, but I also think there are some good answers to “why wouldn’t someone want to work at OP?”
Culture, worldview, and relationship with labs
Many people have an (IMO fairly accurate) impression that OpenPhil is conservative, biased toward inaction, generally prefers maintaining the status quo, and is generally in favor of maintaining positive relationships with labs.
As I’ve gotten more involved in AI policy, I’ve updated more strongly toward this position. While simple statements always involve a bit of gloss/imprecision, I think characterizations like “OpenPhil has taken a bet on the scaling labs”, “OpenPhil is concerned about disrupting relationships with labs”, and even “OpenPhil sometimes uses its influence to put pressure on orgs to not do things that would disrupt the status quo” are fairly accurate.
The most extreme version of this critique is that perhaps OpenPhil has been net negative through its explicit funding for labs and implicit contributions to a culture that funnels money and talent toward labs and other organizations that entrench a lab-friendly status quo.
This might change as OpenPhil hires new people and plans to spend more money, but by default, I expect that OpenPhil will continue to play the “be nice with labs//don’t disrupt the status quo” role in the space. (In contrast to organizations like MIRI, Conjecture, FLI, the Center for AI Policy, perhaps CAIS).
Lots of people want to work there; replaceability
Given OP’s high status, lots of folks want to work there. Some people think the difference between the “best applicant” and the “2nd best applicant” is often pretty large, but this certainly doesn’t seem true in all cases.
I think if someone EG had an opportunity to work at OP vs. start their own organization or do something that requires more agency/entrepreneurship, there might be a strong case for them to do the latter, since it’s much less likely to happen by default.
What does the world need?
I think this is somewhat related to the first point, but I’ll flesh it out in a different way.
Some people think that we need more “rowing”– like, OP’s impact is clearly good, and if we just add some more capacity to the grantmakers and make more grants that look pretty similar to previous grants, we’re pushing the world into a considerably better direction.
Some people think that the default trajectory is not going so well, and that this is (partially or largely) caused or maintained by the OP ecosystem. Under this worldview, one might think that adding some additional capacity to OP is not actually all that helpful in expectation.
Instead, people with this worldview believe that projects that aim to (for example) advocate for strong regulations, engage with the media, make the public more aware about AI risk, and do other forms of direct work more focused on folks outside of the core EA community might be more impactful.
Of course, part of this depends on how open OP will be to people “steering” from within. My expectation is that it would be pretty hard to steer OP from within (my impression is that lots of smart people have tried, and folks like Ajeya and Luke have clearly been thinking about things for a long time, and the culture has already been shaped by many core EAs, and there’s a lot of inertia, so a random new junior person is pretty unlikely to substantially shift their worldview, though I of course could be wrong).
Congratulations on the new role– I agree that engaging with people outside of existing AI risk networks has a lot of potential for impact.
Besides RSPs, can you give any additional examples of approaches that you’re excited about from the perspective of building a bigger tent & appealing beyond AI risk communities? This balancing act of “find ideas that resonate with broader audiences” and “find ideas that actually reduce risk and don’t merely serve as applause lights or safety washing” seems quite important. I’d be interested in hearing if you have any concrete ideas that you think strike a good balance of this, as well as any high-level advice for how to navigate this.
Additionally, how are you feeling about voluntary commitments from labs (RSPs included) relative to alternatives like mandatory regulation by governments (you can’t do X or you can’t do X unless Y), preparedness from governments (you can keep doing X but if we see Y then we’re going to do Z), or other governance mechanisms?
(I’ll note I ask these partially as someone who has been pretty disappointed in the ultimate output from RSPs, though there’s no need to rehash that debate here– I am quite curious for how you’re reasoning through these questions despite some likely differences in how we think about the success of previous efforts like RSPs.)
I think I agree with a lot of the specific points raised here, but I notice a feeling of wariness/unease around the overall message. I had a similar reaction to Haydn’s recent “If your model is going to sell, it has to be safe” piece. Let me try to unpack this:
On one hand, I do think safety is important for the commercial interests of labs. And broadly being better able to understand/control systems seems good from a commercial standpoint.
My biggest reservations can be boiled down into two points:
I don’t think that commercial incentives will be enough to motivate people to solve the hardest parts of alignment. Commercial incentives will drive people to make sure their system appears to do what users want, which is very different than having systems that actually do what users want or robustly do what users want even as they become more powerful. Or to put it another way: near-term commercial incentives don’t really cause me to put appropriate amounts of attention on things like situational awareness or deceptive alignment. I think commercial incentives will be sufficient to reduce the odds of Bingchat fiascos, but I don’t think they’ll motivate the kind of alignment research that’s trying to handle deception, sharp left turns, or even the most ambitious types of scalable oversight work.
The research that is directly incentivized by commercial interests is least likely to be neglected. I expect the most neglected research to be research that doesn’t have any direct commercial benefit. I expect AGI labs will invest a substantial amount of resources to prevent future Bingchat scenarios and other instances of egregious deployment harms. The problem is that I expect many of these approaches (e.g., getting really good at RLHFing your model such that it no longer displays undesirable behaviors) will not generalize to more powerful systems. I think you (and many others) agree with this, but I think the important point here is that the economic incentives will favor RLHFy stuff over stuff that tackles problems that are not as directly commercially incentivized.
As a result, even though I agree with many of your subclaims, I’m still left thinking: huh, the message I want to spread is not something like “hey, in order to win the race or sell your product, you need to solve alignment.”
But rather something more like “hey, there are some safety problems you’ll need to figure out to sell/deploy your product. Cool that you’re interested in that stuff. There are other safety problems—often ones that are more speculative—that the market is not incentivizing companies to solve. On the margin, I want more attention paid to those problems. And if we just focus on solving the problems that are required for profit/deployment, we will likely fool ourselves into thinking that our systems are safe when they merely appear to be safe, and we may underinvest in understanding/detecting/solving some of the problems that seem most concerning from an x-risk perspective.”
This is great! Here are a few more (though some of these overlap a lot with the ones you’ve listed):
Explore—Do something that allows you to get exposed to different kinds of tasks, skills, and people. (Seems especially useful early on when thinking about fit. Also lets people find things that they might not have been able to brainstorm or might have prematurely ruled out). Exploring in sprints may be better than exploring 3-5 things at once (consistent with “optimize one thing at a time”).
Leaveability—Do something that allows you to leave if you find something better.
Anticipate the bottlenecks of the future—Think about which skills will be the bottleneck in 3-5 years. Learn those. (This is a theme explored in High Output Management).
The average of five—Consider the heuristic “you are the average of the five people you spend the most time with.” Who are the people you would be spending the most time with, and how would you feel about becoming more like them? (Shoutout to Jake McKinnon for discussing this with me recently).
Location—Do something that allows you to work in a location that satisfies you professionally and emotionally. I think it’s easy to underestimate how much location can affect people (especially when location is tied so strongly to community/mentorship—e.g., EA hubs).
Personally, I see this as a misunderstanding, i.e. that OP helped OpenAI to come into existence and it might not have happened otherwise.
I think some people have this misunderstanding, and I think it’s useful to address it.
With that in mind, much of the time, I don’t think people who are saying “do those benefits outweigh the potential harms” are assuming that the counterfactual was “no OpenAI.” I think they’re assuming the counterfactual is something like “OpenAI has less money, or has to take somewhat less favorable deals with investors, or has to do something that it thought would be less desirable than ‘selling’ a board seat to Open Phil.”
(I don’t consider myself to have strong takes on this debate, and I think there are lots of details I’m missing. I have spoken to some people who seem invested in this debate, though.)
My current ITT of a reasonable person who thinks the harms outweighed the benefits says something like this: “OP’s investment seems likely to have accelerated OpenAI’s progress and affected the overall rate of AI progress. If OP had not invested, OpenAI likely would have had to do something else that was worse for them (from a fundraising perspective) which could have slowed down OpenAI and thus slowed down overall AI progress.”
Perhaps this view is mistaken (e.g., maybe OpenAI would have just fundraised sooner and started the for-profit entity sooner). But (at first glance), giving up a board seat seems pretty costly, which makes me wonder why OpenAI would choose to give up the board seat if they had some less costly alternatives.
(I also find it plausible that the benefits outweighed the costs, though my ITT of a reasonable person on the other side says something like “what were the benefits? Are there any clear wins that are sharable?”)
+1 on questioning/interrogating opinions, even opinions of people who are “influential leaders.”
I claim people who are trying to use their careers in a valuable way should evaluate organizations/opportunities for themselves
My hope is that readers don’t come away with “here is the set of opinions I am supposed to believe” but rather “ah here is a set of opinions that help me understand how some EAs are thinking about the world.” Thank you for making this distinction explicit.
Disagree that these are mostly characterizing the Berkeley community (#1 and #2 seem the most Berkeley-specific, though I think they’re shaping EA culture/funding/strategy enough to be considered background claims. I think the rest are not Berkeley-specific).
I think it’s good for proponents of RSPs to be open about the sorts of topics I’ve written about above, so they don’t get confused with e.g. proposing RSPs as a superior alternative to regulation. This post attempts to do that on my part. And to be explicit: I think regulation will be necessary to contain AI risks (RSPs alone are not enough), and should almost certainly end up stricter than what companies impose on themselves.
Strong agree. I wish ARC and Anthropic had been more clear about this, and I would be less critical of their RSP posts if they were upfront and clear about this stance. I think your post is strong and clear (you state multiple times, unambiguously, that you think regulation is necessary and that you wish the world had more political will to regulate). I appreciate this, and I’m glad you wrote this post.
I think it’d be unfortunate to try to manage the above risk by resisting attempts to build consensus around conditional pauses, if one does in fact think conditional pauses are better than the status quo. Actively fighting improvements on the status quo because they might be confused for sufficient progress feels icky to me in a way that’s hard to articulate.
A few thoughts:
One reason I’m critical of the Anthropic RSP is that it does not make it clear under what conditions it would actually pause, or for how long, or under what safeguards it would determine it’s OK to keep going. It is nice that they said they would run some evals at least once every 4X in effective compute and that they don’t want to train catastrophe-capable models until their infosec makes it more expensive for actors to steal their models. It is nice that they said that once they get systems that are capable of producing biological weapons, they will at least write something up about what to do with AGI before they decide to just go ahead and scale to AGI. But I mostly look at the RSP and say “wow, these are some of the most bare minimum commitments I could’ve expected, and they don’t even really tell me what a pause would look like and how they would end it.”
Meanwhile, we have OpenAI (that plans to release an RSP at some point), DeepMind (rumor has it they’re working on one but also that it might be very hard to get Google to endorse one), and Meta (oof). So I guess I’m sort of left thinking something like “If Anthropic’s RSP is the best RSP we’re going to get, then yikes, this RSP plan is not doing so well.” Of course, this is just a first version, but the substance of the RSP and the way it was communicated about doesn’t inspire much hope in me that future versions will be better.
I think the RSP frame is wrong, and I don’t want regulators to use it as a building block. My understanding is that labs are refusing to adopt an evals regime in which the burden of proof is on labs to show that scaling is safe. Given this lack of buy-in, the RSP folks concluded that the only thing left to do was to say “OK, fine, but at least please check to see if the system will imminently kill you. And if we find proof that the system is pretty clearly dangerous or about to be dangerous, then will you at least consider stopping?” It seems plausible to me that governments would be willing to start with something stricter and more sensible than this “just keep going until we can prove that the model has highly dangerous capabilities” regime.
I think some improvements on the status quo can be net negative because they either (a) cement in an incorrect frame or (b) take a limited window of political will/attention and steer it toward something weaker than what would’ve happened if people had pushed for something stronger. For example, I think the UK government is currently looking around for substantive stuff to show their constituents (and themselves) that they are doing something serious about AI. If companies give them a milquetoast solution that allows them to say “look, we did the responsible thing!”, it seems quite plausible to me that we actually end up in a worse world than if the AIS community had rallied behind something stronger.
If everyone communicating about RSPs was clear that they don’t want it to be seen as sufficient, that would be great. In practice, that’s not what I see happening. Anthropic’s RSP largely seems devoted to signaling that Anthropic is great, safe, credible, and trustworthy. Paul’s recent post is nuanced, but I don’t think the “RSPs are not sufficient” frame was sufficiently emphasized (perhaps partly because he thinks RSPs could lead to a 10x reduction in risk, which seems crazy to me, and if he goes around saying that to policymakers, I expect them to hear something like “this is a good plan that would sufficiently reduce risks”). ARC’s post tries to sell RSPs as a pragmatic middle ground and IMO pretty clearly does not emphasize (or even mention?) some sort of “these are not sufficient” message. Finally, the name itself sounds like it came out of a propaganda department– “hey, governments, look, we can scale responsibly”.
At minimum, I hope that RSPs get renamed, and that those communicating about RSPs are more careful to avoid giving off the impression that RSPs are sufficient.
More ambitiously, I hope that folks working on RSPs seriously consider whether or not this is the best thing to be working on or advocating for. My impression is that this plan made more sense when it was less clear that the Overton Window was going to blow open, Bengio/Hinton would enter the fray, journalists and the public would be fairly sympathetic, Rishi Sunak would host an xrisk summit, Blumenthal would run hearings about xrisk, etc. I think everyone working on RSPs should spend at least a few hours taking seriously the possibility that the AIS community could be advocating for stronger policy proposals and getting out of the “we can’t do anything until we literally have proof that the model is imminently dangerous” frame. To be clear, I think some people who do this reflection will conclude that they ought to keep making marginal progress on RSPs. I would be surprised if the current allocation of community talent/resources was correct, though, and I think on the margin more people should be doing things like CAIP & Conjecture, and fewer people should be doing things like RSPs. (Note that CAIP & Conjecture both have important flaws/limitations, and I think this partly has to do with the fact that so much top community talent has been funneled into RSPs/labs relative to advocacy/outreach/outside game.)
Great work, Ben! I appreciate the actionable suggestions & the structure of the post (i.e., summaries at the top and details in the main body). Excited to see the other posts in this series!
One suggestion: I think it would be helpful to distinguish between interventions that are helpful for people with poor sleep quality (e.g., people with insomnia) and those that are helpful for people with “average” sleep quality (e.g., people who don’t have any huge problems with their sleep quality but are trying to optimize their sleep quality).
In other words: let’s assume person A has diagnosable insomnia, and person B has “average” sleep quality but is trying to optimize (i.e., by going from 50th percentile sleep quality to 80th percentile). Would you suggest the same intervention for person A and person B?
My understanding is that many of the top recommendations are typically studied for insomnia, but there is much less research supporting their effectiveness for “people with ordinary sleep habits who are trying to optimize” (epistemic status pretty uncertain: I’m not a sleep researcher but have talked with a few about this topic).
A few questions:
In general, would you say the evidence for these interventions is strongest for people with insomnia/poor sleep quality?
Which intervention(s) would you recommend most strongly to someone with insomnia/poor sleep quality?
Which intervention(s) would you recommend most strongly to someone with average/good sleep quality?
Thank you for this write-up, Claire! I will put this in my “posts in which the author does a great job explaining their reasoning” folder.
I noticed that you focused on mistakes. I appreciate this, and I’m also curious about the opposite:
What are some of the things that went especially well over the last few years? What decisions, accomplishments, or projects are you most proud of?
If you look back in a year, and you feel really excited/proud of the work that your team has done, what are some things that come to mind? What would a 95th+ percentile outcome look like? (Maybe the answer is just “we did everything in the ‘Looking Forward’ section”, but I’m curious if some other things come to mind.)
Clarification: I think we’re bottlenecked by both, and I’d love to see the proposals become more concrete.
Nonetheless, I think proposals like “Get a federal agency to regulate frontier AI labs like the FDA/FAA” or even “push for an international treaty that regulates AI in the way the IAEA regulates atomic energy” are “concrete enough” to start building political will behind them. Other (more specific) examples include export controls, compute monitoring, licensing for frontier AI models, and some others on Luke’s list.
I don’t think any of these are concrete enough for me to say “here’s exactly how the regulatory process should be operationalized”, and I’m glad we’re trying to get more people to concretize these.
At the same time, I expect that a lot of the concretization happens after you’ve developed political will. If the USG really wanted to figure out how to implement compute monitoring, I’m confident they’d be able to figure it out.
More broadly, my guess is that we might disagree on how concrete a proposal needs to be before you can actually muster political will behind it, though. Here’s a rough attempt at sketching out three possible “levels of concreteness”. (First attempt; feel free to point out flaws).
Level 1, No concreteness: You have a goal but no particular ideas for how to get there. (e.g., “we need to make sure we don’t build unaligned AGI”)
Level 2, Low concreteness: You have a goal with some vagueish ideas for how to get there (e.g., “we need to make sure we don’t build unaligned AGI, and this should involve evals/compute monitoring, or maybe a domestic ban on AGI projects and a single international project”).
Level 3, Medium concreteness: You have a goal with high-level ideas for how to get there. (e.g., “We would like to see licensing requirements for models trained above a certain threshold. Still ironing out whether or not that threshold should be X FLOP, Y FLOP, or $Z, but we’ve got some initial research and some models for how this would work.”)
Level 4, High concreteness: You have concrete proposals that can be debated. (e.g., “We should require licenses for anything above X FLOP, and we have some drafts of the forms that labs would need to fill out.”)
I get the sense that some people feel like we need to be at “medium concreteness” or “high concreteness” before we can start having conversations about implementation. I don’t think this is true.
Many laws, executive orders, and regulatory procedures have vague language (often at Level 2 or in-between Level 2 and Level 3). My (loosely-held, mostly based on talking to experts and reading things) sense is that it’s quite common for regulators to be like “we’re going to establish regulations for X, and we’re not yet exactly sure what they look like. Part of this regulatory agency’s job is going to be to figure out exactly how to operationalize XYZ.”
I also think that recent events have been strong evidence in favor of my position: we got a huge amount of political will “for free” from AI capabilities advances, and the best we could do with it was to push a deeply flawed “let’s all just pause for 6 months” proposal.
I don’t think this is clear evidence in favor of the “we are more bottlenecked by concrete proposals” position. My current sense is that we were bottlenecked both by “not having concrete proposals” and by “not having relationships with relevant stakeholders.”
I also expect that the process of concretizing these proposals will likely involve a lot of back-and-forth with people (outside the EA/LW/AIS community) who have lots of experience crafting policy proposals. Part of the benefit of “building political will” is “finding people who have more experience turning ideas into concrete proposals.”
Thanks for sharing this, Aaron! Really interesting pilot work.
One quick thought—I wouldn’t rely too heavily on statistical significance tests, particularly with small sample sizes. P-values are largely a function of sample size, and it’s nearly impossible to get statistical significance with 44 participants (unless your effect size is huge!).
Speaking of effect sizes, it seems like you powered the study to detect an effect of d=0.7. For a messaging study with rather subtle manipulations, an effect of d=0.7 seems huge! I would be pretty impressed if giving people CE info resulted in an effect size of d=0.2 or d=0.3, for instance. I’m guessing you were constrained by the # of participants you could recruit (which is quite reasonable—lots of pilot studies are underpowered). But given the low power, I’d be reluctant to draw strong conclusions.
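To give a rough sense of the sample sizes involved, here’s a quick sketch (my own illustration, not from your paper) using statsmodels’ power utilities for a simple two-group comparison at two-sided α = .05 and 80% power:

```python
# Rough power sketch (my own illustration): participants per group needed for a
# two-sample t-test at two-sided alpha = .05 and power = .80, across effect sizes.
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()
for d in (0.2, 0.3, 0.5, 0.7):
    n_per_group = power_analysis.solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"d = {d}: ~{n_per_group:.0f} participants per group")

# Roughly: d = 0.7 needs ~33 per group, d = 0.3 needs ~175 per group, and
# d = 0.2 needs ~390 per group, which is why a ~44-person study can only
# realistically detect large effects.
```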
I also appreciate that you reported the mean scores in the results section of your paper, which allowed me to skim to see if there’s anything interesting. I think there might be!
There was no significant difference in Effective Donation between the Info (M = 80.21, SD = 18.79) and No Info (M = 71.79, SD = 17.05) conditions, F(1, 34) = 1.85, p = .183, ηp² = .052.
If this effect is real, I think this is pretty impressive/interesting. On average, the Effective Donation scores are about 10% higher for the Info Group participants than the No Info group participants (and I didn’t do a formal calculation for Cohen’s d but it looks like it’d be about d=0.5).
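For what it’s worth, here’s a quick back-of-the-envelope check (my own calculation, not from your paper) of that d estimate, using only the means and SDs quoted above and assuming roughly equal group sizes:

```python
import math

# Means and SDs quoted from the results section (Info vs. No Info conditions).
m_info, sd_info = 80.21, 18.79
m_no_info, sd_no_info = 71.79, 17.05

# Pooled SD, assuming roughly equal group sizes (a simplification).
pooled_sd = math.sqrt((sd_info**2 + sd_no_info**2) / 2)
cohens_d = (m_info - m_no_info) / pooled_sd
print(f"Cohen's d ≈ {cohens_d:.2f}")  # ≈ 0.47, i.e. close to the ~0.5 eyeballed above
```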
Of course, given the small sample size, it’s hard to draw any definitive conclusions. But it seems quite plausible to me that the Info condition worked—and at the very least, I don’t think these findings provide evidence against the idea that the info condition worked.
Would be curious to see if you have any thoughts on this. If you end up having an opportunity to test this with a larger sample size, that would be super interesting. Great work & excited to see what you do next!
Note: You don’t have to follow this structure or answer these questions. The point is just to share information that might be helpful/informative to other EAs!
With that in mind, here are my answers:
Where do you work, and what do you do?
I am a PhD student studying psychology at the University of Pennsylvania.
What are things you’ve worked on that you consider impactful?
I’m trying to focus my research on topics that are impactful and neglected (e.g., digital mental health, global mental health).
I co-developed a mental health intervention for Kenyan adolescents and tested it in a randomized controlled trial.
I’ve published papers reviewing smartphone apps for depression and anxiety (here and here) and developed a new method for analyzing digital health interventions (here).
I developed an online mental health intervention designed to teach skills from CBT and positive psychology in <1 hour. We’re currently evaluating it in Kenya, India, and the US.
I recently started performing research on promoting effective giving. I’ve received funding from the EA Meta Fund and from UPenn to support this work. Through the project, we’re aiming to evaluate an intervention that applies psychological theories to improve effective giving. We’ll also be spreading information about EA to 1k+ people, and much of the funding from the project will be donated to effective charities.
What are a few ways in which you bring EA ideas/mindsets to your current job?
I work with many undergraduate students. I try to introduce them to EA concepts (e.g., thinking about importance, neglectedness, and solvability when considering projects) and refer them to EA sources (e.g., 80,000 Hours).
Several of these students have changed their independent study projects as a result of learning about EA (mostly to work on the effective giving project mentioned earlier).
I’ve casually mentioned effective altruism to the graduate students and professors I work with, many of whom weren’t familiar with EA previously. (Bringing this up “casually” has become easier to do now that I’m doing research relating to effective giving.)
I’ve been connecting with members of the EA community who are doing similar work, like members of Spark Wave and the Happier Lives Institute.
I think it’s great that you’re releasing some posts that criticize/red-team some major AIS orgs. It’s sad (though understandable) that you felt like you had to do this anonymously.
I’m going to comment a bit on the Work Culture Issues section. I’ve spoken to some people who work at Redwood, have worked at Redwood, or considered working at Redwood.
I think my main comment is something like you’ve done a good job pointing at some problems, but I think it’s pretty hard to figure out what should be done about these problems. To be clear, I think the post may be useful to Redwood (or the broader community) even if you only “point at problems”, and I don’t think people should withhold these write-ups unless they’ve solved all the problems.
But in an effort to figure out how to make these critiques more valuable moving forward, here are some thoughts:
If I were at Redwood, I would probably have a reaction along the lines of “OK, you pointed out a list of problems. Great. We already knew about most of these. What you’re not seeing is that there are also 100 other problems that we are dealing with: lack of management experience, unclear models of what research we want to do, an ever-evolving AI progress landscape, complicated relationships we need to maintain, interpersonal problems, a bunch of random ops things, etc. This presents a tough bind: on one hand, we see some problems, and we want to fix them. On the other hand, we don’t know any easy ways to fix them that don’t trade-off against other extremely important priorities.”
As an example, take the “intense work culture” point. The most intuitive reaction is “make the work culture less intense—have people work fewer hours.” But this plausibly has trade-offs with things like research output. You could make the claim that “on the margin, if Redwood employees worked 10 fewer hours per week, we expect Redwood would be more productive in the long-run because of reduced burnout and a better culture”, but this is a substantially different (and more complicated) claim to make. And it’s not obviously-true.
As another example, take the “people feel pressure to defer” point. I personally agree that this is a big problem for Redwood/Constellation/the Bay Area scene. My guess is Buck/Nate/Bill agree. It’s possible that they don’t think it’s a huge deal relative to the other 100 things on their plate. And maybe they’re wrong about that, but I think that needs to be argued for if you want them to prioritize it. Alternatively, the problem might be that they simply don’t know what to do. Like, maybe they could put up a sign that says “please don’t defer—speak your mind!” Or maybe they could say “thank you” more when people disagree, or something. But I think often the problem is that people don’t know what interventions would be able to fix well-known problems (again, without trading off against something else that is valuable).
I’m also guessing that there are some low-hanging fruit interventions that external red-teamers could identify. For example, here are three things that I think Redwood should do:
Hire a full-time productivity coach/therapist for the Constellation offices. (I recommended this to Nate many months ago. He seemed to (correctly, imo) predict that burnout would be a big problem for Redwood employees, and he said he’d think about the therapist/coach suggestion. I believe they haven’t hired one.)
Hire an external red-teamer to interview current and former employees, identify work culture issues, and identify interventions to improve things. Conditional on this person/team identifying useful (and feasible) interventions, work with leadership to actually get them implemented. (I’m not sure if they’re doing this, and also maybe your group is already doing this, but the post focused on problems rather than interventions?)
Have someone red-team communications around employee expectations, work-trial expectations, and expectation-setting during the onboarding process. I think I’m fine with some people opting-in to a culture that expects them to work X hours a week and has Y intensity aspects. I’m less fine with people feeling misled or people feeling unable to communicate about their needs. It seems plausible to me that many of the instances of “Person gets fired or quits and then feels negatively toward Redwood & encourages people not to work there” (which happens, btw) could be avoided/lessened through really good communication/onboarding/expectation-setting. (I have no idea what Redwood’s current procedure is like, but I’d predict that a sharp red-teamer would be able to find 3+ improvements).
These are three examples of interventions that seem valuable and (relatively) low-cost to me. I’d be excited to see if your team came up with any intervention ideas, and I’d be excited to see a “proposed intervention” section in future reports. (Though again, I don’t think you should feel like you need to do this, and I think it’s good to get things out there even if they’re just raising awareness about problems).
One thing I appreciate about both of these tests is that they seem to (at least partially) tap into something like “can you think for yourself & reason about problems in a critical way?” I think this is one of the most important skills to train, particularly in policy, where it’s very easy to get carried away with narratives that seem popular or trendy or high-status.
I think the current zeitgeist has gotten a lot of folks interested in AI policy. My sense is that there’s a lot of potential for good here, but there are also some pretty easy ways for things to go wrong.
Examples of some questions that I hear folks often ask/say:
What do the experts think about X?
How do I get a job at X org?
“I think the work of X is great” --> “What about their work do you like?” --> “Oh, idk, just like in general they seem to be doing great things and lots of others seem to support X.”
What would ARC evals think about this plan?
Examples of some questions that I often encourage people to ask/say:
What do you think about X?
What do you think X is getting wrong?
If the community is wrong about X, what do you think it’s getting wrong? Do you think we could be doing better than X?
What do I think about this plan?
So far, my experience engaging with AI governance/policy folks is that these questions are not being asked very often. It feels more like a field where people are respected for “looking legitimate” as opposed to “having takes”. Obviously, there are exceptions, and there are a few people whose work I admire & appreciate.
But I think a lot of junior people (and some senior people) are pretty comfortable with taking positions like “I’m just going to defer to people who other people think are smart/legitimate, without really asking myself or others to explain why they think those people are smart/legitimate”, and this is very concerning.
As a caveat, it is of course important to have people who can play support roles and move things forward, and there’s a failure mode of spending too much time in “inside view” mode. My thesis here is simply that, on the current margin, I think the world would be better off if more people shifted toward “my job is to understand what is right and evaluate plans/people for myself” and fewer people adopted the “my job is to find a credible EA leader and row in the direction that they’re currently rowing.”
And as a final point, I think this is especially important in a context where there is a major resource/power/status imbalance between various perspectives. In the absence of critical thinking & strong epistemics, we should not be surprised if the people with the most money & influence end up shaping the narrative. (This doesn’t necessarily mean that they’re wrong, but it does tell us something like “you might expect to see a lot of EAs rally around narratives that are sympathetic toward major AGI labs, even if these narratives are wrong. And it would take a particularly strong epistemic environment to converge to the truth when one ‘side’ has billions of dollars and is offering a bunch of the jobs and is generally considered cooler/higher-status.”)
Update: The deadline has been changed to April 30. Several people pointed out that the deadline felt tight & it would limit their ability to participate.
To encourage early submissions, we are offering three “early-bird prizes” ($1000 each) to the three best submissions we receive by March 31.
Special thanks to Vaidehi, Kaleem, and those of you who emailed me with feedback about the deadline.
“Service/support mindset” reminds me of healers in role-playing games. You don’t show up on the damage charts, but you kept everyone alive (and allowed them to optimize their builds for damage)!