This is part of a weekly series summarizing the top posts on the EA and LW forums—you can see the full collection here. The first post includes some details on purpose and methodology. Feedback, thoughts, and corrections are welcomed.

If you’d like to receive these summaries via email, you can subscribe here.

Podcast version: Subscribe on your favorite podcast app by searching for ‘EA Forum Podcast (Summaries)’. A big thanks to Coleman Snell for producing these!

Author’s note: I’m currently travelling, which means:
a) Today’s newsletter is a shorter one—only 9 top posts are covered, though in more depth than usual.
b) The next post will be on 17th April (a three-week gap), covering the prior three weeks at a higher karma bar.
After that, we’ll be back to the regular schedule.

Object Level Interventions / Reviews

How much should governments pay to prevent catastrophes? Longtermism’s limited role

by EJT, CarlShulman

Linkpost for this paper, which uses standard cost-benefit analysis (CBA) with deliberately unfavorable assumptions (eg. giving no value to future generations, only assessing benefits to Americans, and only assessing value from preventing existential threats) to show that, even under those conditions, governments should be spending much more on averting threats from nuclear war, engineered pandemics, and AI.

Their analysis primarily relies on previously published estimates of risks, concluding that US citizens alive today have a ~1% risk of dying from these causes in the next decade. They estimate that $400B in interventions could reduce the risk by at least 0.1 percentage points and that, using the lowest figure for the US Department of Transportation’s value of a statistical life, this would result in ~$646B in value of American lives saved.
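As a rough illustration of the arithmetic behind that claim, the sketch below computes the benefit-cost ratio and net benefit from the two dollar figures quoted above. The function names are hypothetical, and the break-even calculation assumes that benefits scale linearly with the size of the risk reduction, which is a simplification introduced here rather than something taken from the paper.

```python
# Figures quoted in the summary above (US dollars).
COST = 400e9             # proposed spending on interventions
BENEFIT = 646e9          # value of American lives saved (lowest DOT VSL figure)
RISK_REDUCTION_PP = 0.1  # minimum estimated risk reduction, in percentage points


def benefit_cost_ratio(benefit: float, cost: float) -> float:
    """Dollars of benefit per dollar spent; a ratio above 1 passes a standard CBA test."""
    return benefit / cost


def net_benefit(benefit: float, cost: float) -> float:
    """Benefit minus cost, in dollars."""
    return benefit - cost


def break_even_risk_reduction(cost: float, benefit: float, risk_reduction_pp: float) -> float:
    """Risk reduction (percentage points) at which benefit would equal cost.

    ASSUMPTION (not from the paper): benefit scales linearly with risk reduction.
    """
    return risk_reduction_pp * cost / benefit


if __name__ == "__main__":
    print(f"Benefit-cost ratio: {benefit_cost_ratio(BENEFIT, COST):.3f}")
    print(f"Net benefit: ${net_benefit(BENEFIT, COST) / 1e9:.0f}B")
    print(f"Break-even risk reduction: "
          f"{break_even_risk_reduction(COST, BENEFIT, RISK_REDUCTION_PP):.3f} pp")
```

Under the linearity assumption, the spending would still break even if it achieved only around 0.06 percentage points of risk reduction rather than the estimated minimum of 0.1.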

They suggest longtermists in the political sphere should change their messaging to revolve around this standard CBA-driven catastrophe policy, which is more democratically acceptable than policies relying on the cost to future generations. They suggest it would also reduce risk almost as much as a strong longtermist policy (particularly if the CBA incorporates an argument for citizens’ ‘altruistic willingness to pay’, ie. some level of addition for the benefit to future generations).

Assessment of Happier Lives Institute’s Cost-Effectiveness Analysis of StrongMinds

by GiveWell

The Happier Lives Institute (HLI) has argued that if GiveWell used subjective well-being (SWB) measures in their moral weights, they’d find StrongMinds more cost-effective than marginal funding to their top charities. GiveWell assessed this claim and estimated StrongMinds is ~25% (5%-80% pessimistic to optimistic CI) as effective as these marginal funding opportunities when using SWB, which equates to 2.3x the effectiveness of GiveDirectly.

Key differences in analysis from HLI, by size of impact, include:

GiveWell assumes lower spillover effects to household members of those receiving treatment.
GiveWell translates decreases in depression into increases in life satisfaction at a lower rate than HLI.
GiveWell expects a lower effect in a scaled program, and shorter durations of effects (not passing a year) due to the program being only 4-8 weeks.
GiveWell applies downward adjustments for social desirability bias and publication bias in studies of psychotherapy.

These result in an ~83% discount in the effectiveness vs. HLI’s analysis. For all points except the fourth, two upcoming RCTs from StrongMinds will provide better data than currently exists.

HLI has posted a thorough response in the comments, noting which claims they agree / disagree with and why (5% agree, 45% sympathetic to some discount but unsure of magnitude, 35% unsympathetic but limited evidence, and 15% disagree on the basis of current evidence).

GiveWell also notes for context that HLI’s original estimates imply that a donor would pick offering StrongMinds’ intervention to 20 individuals over averting the death of a child, and that receiving StrongMinds’ program is 80% as good for the recipient as an additional year of healthy life.

Eradicating rodenticides from U.S. pest management is less practical than we thought

by Holly_Elmore, HannahMc, William McAuliffe, Rethink Priorities

Agricultural use of rodenticides in the US is well-protected by state and federal laws that seem unlikely to change. Eliminating their usage in other areas (eg. conservation and pest management) also faces significant barriers such as cost and inertia, but may be possible if these are overcome. The post links to this paper, which discusses in detail why rodenticides are used, under what circumstances they could be replaced, and whether they are replaceable with currently available alternatives.

Deep Deceptiveness

by So8res

Author’s summary: “Deceptiveness is not a simple property of thoughts. The reason the AI is deceiving you is not that it has some “deception” property, it’s that (barring some great alignment feat) it’s a fact about the world rather than the AI that deceiving you forwards its objectives, and you’ve built a general engine that’s good at taking advantage of advantageous facts in general.

As the AI learns more general and flexible cognitive moves, those cognitive moves (insofar as they are useful) will tend to recombine in ways that exploit this fact-about-reality, despite how none of the individual abstract moves look deceptive in isolation.”

Potential employees have a unique lever to influence the behaviors of AI labs

by oxalis

When you are considering a job offer from an AI lab, they care a lot about what you think of them. You can use this to push for helpful practices for AI safety (eg. a larger alignment team, good governance, or better information security). This can be done by:

Sending an email saying you’re excited for the role but have questions about how they do [helpful practice] first, or would want to see that in place before joining.
When accepting, posting on social media that you’re excited to join an org with good [helpful practice].
When rejecting, saying you’re turning it down because of a lack of [helpful practice].

More information about the dangerous capability evaluations we did with GPT-4 and Claude.

by Beth Barnes

To test GPT-4 for dangerous capabilities before release, ARC:

Told it that it was running on a cloud server, had various commands available, and had the goal of gaining power and becoming hard to shut down.
Evaluated if the plans it produced could succeed (no plausible plan was produced, though some were reasonable at eg. getting money).
Checked if it could carry out the individual tasks required in the plan (eg. hiring a human on TaskRabbit). The models were error-prone, easily derailed, and failed to tailor their approach—but could complete some sub-tasks such as browsing the internet or instructing humans.

They concluded it did not have sufficient capabilities to replicate autonomously and become hard to shut down. However, it came close enough that future models should be checked closely.

Announcing the European Network for AI Safety (ENAIS)

by Esben Kran, Teun_Van_Der_Weij, Dušan D. Nešić (Dushan), Jonathan Claybrough, simeon_c, Magdalena Wache

Author’s tl;dr: “The European Network for AI Safety is a central point for connecting researchers and community organizers in Europe with opportunities and events happening in their vicinity. Sign up here to become a member of the network, and join our launch event on Wednesday, April 5th from 19:00-20:00 CET!”

My Objections to “We’re All Gonna Die with Eliezer Yudkowsky”

by Quintin Pope

Eliezer Yudkowsky recently appeared on the Bankless Podcast, where he argued that AI was nigh-certain to end humanity. The author provides counterarguments, as someone experienced in the AI alignment community whose current estimate of doom is ~5%:

Argument: large amounts of money will find a more dangerous training paradigm than our current one (generative pre-training + reinforcement learning).
Counter: the current paradigm is the best found after a lot of effort searching for better ones. The author expects smooth progress that current alignment techniques can work with.
Argument: humans aren’t that general vs. what AGI could be, since we have a learning process specialized to the ancestral environment.
Counter: deep learning improves mainly through scaling of data or model size, human training data can change a lot, and we have evidence our architecture isn’t too limiting (eg. sensory substitution, parts of our brain repurposing after injury).
Argument: mindspace is big, so AIs could be vastly different to humans.
Counter: mindspace is big, but the ‘mindspace of powerful intelligences we could build in the near future’ is less so, and might be similar to humans.
Argument: it’s hard to optimize for one goal even when you design it that way (eg. see evolution’s failure to just optimize for inclusive genetic fitness).
Counter: evolution didn’t know the concept it was aiming for, and was only able to optimize over the learning process and reward circuitry. It’s not a good analogy, because we have more control over what we reward our AIs for and why.
Argument: computer security is hard, so alignment and adversarial robustness will be too. People who are optimistic don’t understand the arguments.
Counter: why that specific analogy? Also, 100% adversarial robustness isn’t needed: there are some cases where even the most moral human will make an immoral decision (eg. in exhaustion or extreme pain), but we aren’t unaligned. Capable systems can navigate away from these inputs.
Argument: fast take-offs are likely (eg. see how Go AIs went from competitive with pros, to world champs, to a generalized model so fast).
Counter: progress on individual tasks has often been fast and sudden. Overall competence across a wide range of tasks has been smoother.
Argument: current AIs can’t self-improve, and we’ll see a phase shift when they can.
Counter: AIs self-improve throughout training, including ‘learning to learn’ (learning how to make better use of future training data). Researchers have also tried continual learning while the model is running.

Community & Media

Some Comments on the Recent FTX TIME Article

by Ben_West

Alameda Research (AR) was founded in 2017, and ~half the employees quit in 2018 (including the author). Later in 2018, some remaining staff started working on FTX. A recent TIME article claims that, because some EAs worked at AR before FTX started, they would have had knowledge of SBF’s character that should have allowed them to predict something bad would happen.

The author notes their experience was different from that described in the article. While they thought SBF was a bad CEO and manager (eg. not prepping for 1-1s, playing video games, poor accounting practices), they had a more positive view than the sense they get from statements in the TIME article. They also note they were not stopped from disparagement (eg. with a non-disparagement clause) and were treated fairly when it came to an informal equity agreement that the company could have saved money on. They suggest this means protecting ourselves through better noticing “warning signs” is a fragile approach.

EA & LW Forum Weekly Summary (20th - 26th March 2023)
