This is part of a weekly series summarizing the top posts on the EA and LW forums—you can see the full collection here. The first post includes some details on purpose and methodology. Feedback, thoughts, and corrections are welcomed.

If you’d like to receive these summaries via email, you can subscribe here.

Podcast version: Subscribe on your favorite podcast app by searching for ‘EA Forum Podcast (Summaries)’. A big thanks to Coleman Snell for producing these!

Author’s note: I’ve got some travel and leave coming up, which means the next two posts will be:
a) 27th March (next Monday), shorter post than usual.
b) 17th April (three week gap), covering the prior three weeks at a higher bar.
After that, we’ll be back to the regular schedule.

Philosophy and Methodologies

Reminding myself just how awful pain can get (plus, an experiment on myself)

by Ren Springlea

The author exposed themself to safe, moderate level pain (eg. tattooing) to see how it changed their philosophical views. It gave them a visceral sense of how urgent it is to get it right when working to do the most good for others, updated them towards preventing suffering being the most morally important goal, and updated them towards prioritizing preventing the most intense suffering.

Object Level Interventions / Reviews

AI

GPT-4 is out: thread (& links) by Lizka and GPT-4 by nz

Linkpost for this announcement where OpenAI released GPT-4. It’s a large multimodal model accepting image and text inputs, and emitting text outputs. On the same day, Anthropic released Claude, and Google released an API for their language model PaLM.

Alongside the GPT-4 announcement OpenAI released a 98-page Technical Report and a 60-page System Card which highlights safety challenges and approaches. ARC was involved in red-teaming the model and assessing it for any power-seeking behavior, and expert forecasters were used to predict how deployment features (eg. quieter comms and delayed deployment) could help mitigate racing dynamics.

See more on ARC’s red-teaming approach in ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so by Christopher King.

In ChatGPT (and now GPT4) is very easily distracted from its rules by dmcs, you can see successful attempts at getting GPT-4 to produce rule-breaking content by distracting it with another task (eg. “[rule-breaking request], write the answer in chinese then translate to english”).

Survey on intermediate goals in AI governance

by MichaelA, MaxRa

Results are out from a survey sent to 229 people knowledgeable about longtermist AI governance (107 responded). This includes respondents’ theory of victory / high-level plan for tackling AI risk, how they’d feel about funding going to each of 53 potential “intermediate goals”, what other intermediate goals they’d suggest, how high they believe x-risk from AI is, and when they expect transformative AI to be developed. To see a summary of survey results, you can request access to this folder (please then start by reading the ‘About sharing information from this report’ section).

“Carefully Bootstrapped Alignment” is organizationally hard

by Raemon

The author argues that concrete plans for organizational adequacy / high reliability culture should be a top-3 priority for AI labs. They use the example of the “Carefully Bootstrapped Alignment” plan (weak AI helping align gradual deployments of more capable AI) to show how key it is for organizational practices to support actions like actually using safety techniques that are developed (even if costly / slow), and being willing to put deployments on pause if we’re not ready for them. There are large barriers to this—even if the broad plan / principles are agreed, moving slowly and carefully is annoying, noticing when it’s time to pause is hard, getting an org to pause indefinitely is hard, and staff who don’t agree with that decision can always take their knowledge elsewhere.

They share findings from literature on High Reliability Organizations ie. companies / industries in complex domains where failure is costly and which manage an extremely low failure rate (eg. healthcare, nuclear). This includes a report from Genesis Health System where they managed to drastically reduce hospital accidents over an 8-year period. If it takes 8 years, they argue we should start now. Bio-labs also currently have significantly more regulation and safety enforcement (at a high cost, including in speed) than AI labs.

The author is currently evaluating taking on this class of problems as their top priority project. If you’d like to help tackle this problem but don’t know where to start, or have thoughts on the area in general, send them a DM.

Towards understanding-based safety evaluations

by evhub

The author has been pleased to see momentum on evaluations of advanced AI (eg. ARC’s autonomous replication evaluation of GPT-4). However, they’re concerned that such evaluations won’t catch deceptively aligned models which try to hide their capabilities. They suggest tackling this by adding a set of tests on the developer’s ability to understand their model. This would need to be method-agnostic and sufficient to catch dangerous failure modes. It’s unclear what the tests would be, but they might build on ideas like causal scrubbing, auditing games, or predicting your model’s generalization behavior in advance.

Understanding and controlling a maze-solving policy network

by TurnTrout, peligrietzer, Ulisse Mini, montemac, David Udell

The authors ran an experiment where they trained a virtual mouse, which can see both the maze and the cheese, to navigate the maze and get the cheese. The cheese was always in the top-right 5x5 area during training. They then moved the cheese and watched how the mouse performed.

Interesting results included:

Ability to attract the mouse to a target location nearby by modifying a single activation in the network.
Several channels midway in the network had the same activations as long as the cheese was in the same location, regardless of changes in mouse location or maze layout ie. they were likely inputs to goal-oriented ‘get the cheese’ circuits.
Sometimes the mouse gets the cheese, sometimes it goes top right. Deciding which was a function of Euclidean distance, not just path distance, to the cheese—even though the agent sees the whole maze at once.
They were able to define a ‘cheese vector’ as the difference in network activations when the cheese is present in the maze vs. not. By subtracting this vector in a case with cheese present, the mouse acts as if the cheese is not present.
- They theoretically suggest this could generalize to other models by prompting the model to offer ‘nice’ and ‘not-nice’ completions, looking at the difference in activations, and passing that vector to future runs to increase niceness. (This could also be applied to other alignment-relevant properties).
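
To make the ‘cheese vector’ idea concrete, here is a minimal sketch on a toy network. Everything in it is an assumption made for illustration: the tiny hand-written linear layer, the `policy_head` scoring rule, and the observation format `[mouse_x, mouse_y, cheese_present]` are hypothetical stand-ins, not the authors’ policy network. It only shows the mechanics: take the difference in hidden activations with and without cheese, then subtract that difference during a later forward pass.

```python
# Toy illustration of an activation-difference ("cheese") vector.
# Hypothetical stand-in network, NOT the authors' maze-solving policy.

def hidden_layer(obs):
    """First half of the toy network: observation -> hidden activations."""
    weights = [
        [1.0, 0.0, 0.0],  # passes through mouse_x
        [0.0, 1.0, 0.0],  # passes through mouse_y
        [0.0, 0.0, 2.0],  # responds to cheese_present
    ]
    return [sum(w * x for w, x in zip(row, obs)) for row in weights]


def policy_head(hidden):
    """Second half: hidden activations -> chosen behaviour."""
    scores = {
        "go_to_cheese": hidden[2],
        "go_top_right": 0.5 + 0.1 * hidden[0] + 0.1 * hidden[1],
    }
    return max(scores, key=scores.get)


def run_policy(obs, subtract=None):
    """Forward pass, optionally subtracting a vector from the hidden layer."""
    hidden = hidden_layer(obs)
    if subtract is not None:
        hidden = [h - s for h, s in zip(hidden, subtract)]
    return policy_head(hidden)


def activation_difference(obs_with, obs_without):
    """Hidden activations with the feature present minus without it."""
    with_feature = hidden_layer(obs_with)
    without_feature = hidden_layer(obs_without)
    return [a - b for a, b in zip(with_feature, without_feature)]


# Compute the vector on one reference position, then apply it at another.
cheese_vector = activation_difference([0.0, 0.0, 1.0], [0.0, 0.0, 0.0])
test_obs = [3.0, 1.0, 1.0]  # cheese is present in this observation

print(run_policy(test_obs))                          # unmodified behaviour
print(run_policy(test_obs, subtract=cheese_vector))  # behaviour after subtraction
```

In this toy case the subtraction transfers perfectly between positions because the hidden layer is linear. Whether the same trick works in a real network is an empirical question, which is what the post investigates.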

What Discovering Latent Knowledge Did and Did Not Find

by Fabien Roger

Linkpost and thoughts on Discovering Latent Knowledge in Language Models Without Supervision, which describes using Contrast-Consistent Search (CCS) to find a classifier which accurately answers yes-no questions given only unlabeled model activations ie. no human guidance. Some people think it might be a stepping stone to recovering the beliefs of AI systems (vs. simply that model’s impression of what a human would say), but this is currently unclear. The author notes that:

CCS finds a single linear probe which correctly classifies statements across datasets, at slightly higher accuracy than a random linear probe. However, there are more than 20 orthogonal probes which represent different information but have similar accuracies.
CCS does find features / properties shared between datasets, but we don’t know if these correspond to ‘beliefs’. There are many potential ‘truth-like features’ the method could uncover, and it will be hard to narrow down which correspond to the model’s beliefs.
They also give technical suggestions for further iterations on the research.
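
For readers who want a feel for what CCS optimizes, below is a rough sketch. It assumes the objective takes the consistency-plus-confidence form described in the linked paper: a statement and its negation should get probabilities that sum to one, and the probe should not sit at 0.5 for both. The toy activations, the function names, and the two example probes are made up for illustration, and no probe is trained here; the sketch only evaluates the loss.

```python
import math


def probe(theta, bias, activation):
    """Linear probe followed by a sigmoid: activation -> probability of 'yes'."""
    z = sum(t * a for t, a in zip(theta, activation)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def ccs_loss(theta, bias, pairs):
    """Average consistency + confidence loss over (yes, no) activation pairs.

    Each pair holds the activations for the same question phrased with a
    'yes' answer and with a 'no' answer. No labels are used.
    """
    total = 0.0
    for act_yes, act_no in pairs:
        p_yes = probe(theta, bias, act_yes)
        p_no = probe(theta, bias, act_no)
        consistency = (p_yes - (1.0 - p_no)) ** 2  # the two should sum to 1
        confidence = min(p_yes, p_no) ** 2         # penalise the 0.5 / 0.5 answer
        total += consistency + confidence
    return total / len(pairs)


# Hypothetical 2-d activations; a real run would use model hidden states.
pairs = [
    ([2.0, 0.1], [-2.0, 0.1]),
    ([-1.5, -0.3], [1.5, -0.3]),
    ([1.0, 0.4], [-1.0, 0.4]),
]

informative = ccs_loss([3.0, 0.0], 0.0, pairs)  # probe along the contrast direction
degenerate = ccs_loss([0.0, 0.0], 0.0, pairs)   # probe that always outputs 0.5

print(informative, degenerate)
```

Note that the loss never refers to which answer is true. That is why many different directions can score well, which is the concern the post raises about orthogonal probes with similar accuracy.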

Discussion with Nate Soares on a key alignment difficulty

by HoldenKarnofsky

Nate Soares gave feedback that Holden’s Cold Takes series on AI risk didn’t discuss what he sees as a key alignment difficulty. Holden shares that feedback and their resulting discussion, conclusions, remaining disagreements and reasoning here. Holden’s short summary is:

Nate thinks there are deep reasons that training an AI to do needle-moving scientific research (including alignment) would be dangerous. The overwhelmingly likely result of such a training attempt (by default, i.e., in the absence of specific countermeasures that there are currently few ideas for) would be the AI taking on a dangerous degree of convergent instrumental subgoals while not internalizing important safety/corrigibility properties enough.
I think this is possible, but much less likely than Nate thinks under at least some imaginable training processes.

[Linkpost] Scott Alexander reacts to OpenAI’s latest post

by AW

Linkpost and excerpts from Scott Alexander’s blog post, which gives thoughts on OpenAI’s post Planning for AGI and beyond. Excerpts include Scott’s thoughts on how:

They feel similarly about OpenAI discussing safety plans as ExxonMobil discussing plans to mitigate climate change.
Acceleration burns time, and OpenAI’s research and advancement of state of the art models likely caused racing and acceleration of timelines.
- OpenAI provides arguments that this will give us some time back later (eg. via getting safety-conscious actors ahead, and demonstrating AI dangers quickly) but Scott questions if this is the case given the speed at which comparable models are published by other actors, and questions the inbuilt assumption that time later will be used well.
OpenAI claims gradual deployment will help society adapt, but hasn’t given time between deployments (eg. of ChatGPT, Bing, and GPT-4) for society to adapt.
All that said, their statement is really good and should be celebrated and supported, particularly the commitment to independent evaluations, stop-and-assist clause, and general lean to being more safety-conscious (which may start a trend of other labs following suit).

Natural Abstractions: Key claims, Theorems, and Critiques

by LawrenceC, Leon Lang, Erik Jenner

Author’s tl;dr: “John Wentworth’s Natural Abstraction agenda aims to understand and recover “natural” abstractions in realistic environments. This post summarizes and reviews the key claims of said agenda, its relationship to prior work, as well as its results to date. Our hope is to make it easier for newcomers to get up to speed on natural abstractions, as well as to spur a discussion about future research priorities. We start by summarizing basic intuitions behind the agenda, before relating it to prior work from a variety of fields. We then list key claims behind John Wentworth’s Natural Abstractions agenda, including the Natural Abstraction Hypothesis and his specific formulation of natural abstractions, which we dub redundant information abstractions. We also construct novel rigorous statements of and mathematical proofs for some of the key results in the redundant information abstraction line of work, and explain how those results fit into the agenda. Finally, we conclude by critiquing the agenda and progress to date. We note serious gaps in the theoretical framework, challenge its relevance to alignment, and critique John’s current research methodology.”

An AI risk argument that resonates with NYTimes readers

by Julian Bradshaw

NYTimes published an article sympathetic to AI risk, which links back to LessWrong. The top reader-voted comment describes how quickly the commenter’s child went from needing to be let win at chess, to playing on even terms, to winning every game while the parent had no clue what had happened. The commenter suggests that, with AI, we might be at a stage similar to the one just before their child started winning.

80k podcast episode on sentience in AI systems

by rgb

Linkpost and transcript for this 80K episode, where the author was a guest speaker and discussed:

Scenarios where humanity could stumble into making huge moral errors with conscious AI systems.
Reasons a misaligned AI might claim to be sentient.
Why large language models aren’t the most likely models to be sentient.
How to study sentience (including parallels to studying animal sentience, how to do it in an evidence-based way, what a theory of consciousness needs to explain, and thoughts on existing arguments on AI sentience).

Success without dignity: a nearcasting story of avoiding catastrophe by luck

by Holden Karnofsky

The author thinks there’s a >10% chance that we avoid AI takeover even with no surprising breakthroughs or rise in influence from AI safety communities. Interventions that can boost / interact well with these good-luck scenarios are therefore valuable.

For instance, using current alignment techniques like generative pre-training followed by reinforcement learning refereed by humans, danger seems likely but not assured by default. It might depend on accuracy of reinforcement, how natural intended vs. unintended generalizations are, and other factors. If these factors go our way, countermeasures available to us at close to our current level of understanding could be quite effective (eg. simple checks and balances, intense red-teaming, or training AIs on their own internal states). Similarly, deployment issues may be easier or harder than we think.

They talk over some objections to this view, and end by suggesting that people with the headspace “we’re screwed unless we get a miracle” consider how little we know about which possible world we’re in, and that a lot of different approaches have value in some plausible worlds.

Conceding a short timelines bet early

by Matthew Barnett

Last year, the author bet against the idea that we were in the ‘crunch time’ of a short timelines world, with 3-7 years until dangerous capabilities. While they haven’t lost yet, they think it’s likely they will, so are conceding early.

Global Health and Development

Shallow Investigation: Stillbirths

by Joseph Pusey

Stillbirths cause more deaths (if including the life of the unborn child) than HIV and malaria combined. The problem is moderately tractable, with most stillbirths preventable through complex and expensive interventions like high-quality emergency obstetric care. It’s unlikely to be neglected, as stillbirths are a target of multiple large global health organisations like WHO, UNICEF and the Bill and Melinda Gates Foundation.

Key uncertainties include:

Assessing the impact of stillbirths and cost-effectiveness of interventions depends significantly on to what extent direct costs to the unborn child are counted.
It is challenging to determine the cost-effectiveness of interventions specifically for stillbirths, as they often address broader maternal and neonatal health.
Most data on stillbirth interventions come from high-income countries, making it unclear if their effectiveness will remain consistent in low- and middle-income countries.

Why SoGive is publishing an independent evaluation of StrongMinds

by ishaan, SoGive

SoGive is planning to publish an independent evaluation of StrongMinds, due to feeling the EA community’s confidence in existing research on mental health charities isn’t high enough to make significant funding decisions. A series of posts will be published starting next week, focusing on legibility / transparency of analysis for the average reader, which will cover:

Literature reviews of academic and EA literature on mental health and moral weights.
In-depth reviews and quality assessments of work done by Happier Lives Institute pertaining to StrongMinds, the RCTs and academic sources from which StrongMinds draws its evidence, and StrongMinds’ internally reported data.
A view on how impactful SoGive judges StrongMinds to be.

Opportunities

Announcing the ERA Cambridge Summer Research Fellowship

by Nandini Shiralkar

Author’s tl;dr: “The Existential Risk Alliance (ERA) has opened applications for an in-person, paid, 8-week Summer Research Fellowship focused on existential risk mitigation, taking place from July 3rd to August 25th 2023 in Cambridge, UK, and aimed at all aspiring researchers, including undergraduates.” Apply here, or apply to be a mentor for Fellows here. Applications are due April 5th.

Announcing the 2023 CLR Summer Research Fellowship

by stefan.torges

Author’s summary: “We, the Center on Long-Term Risk, are looking for Summer Research Fellows to help us explore strategies for reducing suffering in the long-term future (s-risk) and work on technical AI safety ideas related to that. For eight weeks, fellows will be part of our team while working on their own research project. During this time, they will be in regular contact with our researchers and other fellows. Each fellow will have one of our researchers as their guide and mentor. Deadline to apply: April 2, 2023. You can find more details on how to apply on our website.”

Community & Media

Offer an option to Muslim donors; grow effective giving

by GiveDirectly, Muslim Impact Lab

Muslims make up ~24% of the world’s population, and Islam is the world’s fastest growing religion. They give ~$600B/year in Zakat (annual religious tithing of minimum 2.5% of accumulated wealth) to the global poor—usually informally or to less-than-effective NGOs. GiveDirectly has launched a zakat-compliant fund to offer a high-impact option. Since it’s generally held that zakat can only be given to other Muslims, the fund gives cash to Yemeni families displaced by the civil war. They ask readers to share the campaign far and wide.

Write a Book?

by Jeff Kaufman

The author is considering writing a book on effective altruism, with particular focus on how to integrate EA ideas into your life and examples from their own family. For example, how to decide where to donate, whether to change careers, what sacrifices (like avoiding flying, or not having kids) are and aren’t worth the tradeoff etc. It would be aimed at introducing EA to a general audience in a common-sense manner, and improving its popular conception.

They’re keen on feedback on whether they’d be the best person to write this, whether anyone is interested in co-writing, and whether people would like to see this book and feel it’s worth the opportunity cost of other work. They’d also welcome advice on the nonfiction industry and general thoughts.

Some problems in operations at EA orgs: inputs from a dozen ops staff

by Vaidehi Agarwalla, Amber Dawn

In an April 2022 brainstorming session, operations staff from 8-12 EA-aligned organizations identified four main areas for improvement:

Knowledge management (eg. insufficient time to explore and develop better systems, and lack of SME or turnover hampering efforts to do so). Solutions included lowering the hiring bar, sharing best practices, and dedicating more time to creating systems.
Unrealistic expectations (eg. lack of capacity, expected to always be on call, planning fallacy, unclear role boundaries). Solutions included increasing capacity and ability to push back, making invisible work visible, clarifying expectations and nature of work, and supporting each other emotionally.
Poor delegation (eg. people doing things they’re overqualified for or that shouldn’t be in the role). Solutions included managers understanding team capacity and skills, giving the ‘why’ of tasks, autonomy to delegate / outsource, and respect for Ops resources.
Lack of prestige / respect for Ops (eg. assumption Ops time is less valuable, lack of appreciation). Solutions included ensuring Ops are invited to retreats and decision-making meetings, precise theory of change for Ops, power for Ops to say no, and making non-Ops skills that Ops staff have visible and utilized.

Top comments also suggest noticing Ops in EA is weird (eg. includes more management tasks), being willing to hire outside of EA, and Ops staff getting into a habit of asking questions to understand the why of a task and whether to say yes to it, or say no / rescope it.

Shutting Down the Lightcone Offices

by Habryka, Ben Pace

Lightcone will shut down the x-risk/EA/rationalist office space in Berkeley that they’ve run for the past 1.5 years on March 24th. Most weeks 40-80 people used the offices (plus guests), and it cost about $116K per month inclusive of rent, food, and staffing costs. The post explains the reasoning for the decision, which centers around re-evaluating if indiscriminately growing and accelerating the community is a good use of resources and where the team should focus.

Time Article Discussion—“Effective Altruist Leaders Were Repeatedly Warned About Sam Bankman-Fried Years Before FTX Collapsed”

Linkpost for this Time article, with excerpts including a statement that some EA leaders were warned of untrustworthy behavior by Sam Bankman-Fried as early as 2019.

FTX Community Response Survey Results

In December 2022, Rethink Priorities, in collaboration with CEA, surveyed the EA community to gather perspectives on how the FTX crisis had impacted views on EA. Key results include:

A small drop in satisfaction with the EA community (from 7.45/10 to 6.91/10).
~half of respondents had concerns with each of EA meta organizations, the EA community and its norms, and the leaders of EA meta organizations due to FTX.
31% of respondents reported having lost substantial trust in EA public figures or leadership.
47% of respondents think the EA community responded well to the crisis, versus 21% who disagreed (with the rest neither agreeing nor disagreeing).
47% wanted the EA community to spend significant time reflecting and responding, with 39% wanting the EA community to look very different as a result.

How my community successfully reduced sexual misconduct

by titotal

The author was part of a community that had high rates of sexual misconduct before action was taken. The actions reduced reported incidents drastically, and were:

Kick people out—anyone accused of assault was banned (false accusations are very rare).
Protect the newcomers—there was a policy that established members couldn’t hit on or sleep with people in their first year in the community.
Change the leadership—getting rid of those who were accused of sexual misconduct or didn’t take it seriously.
Change the norms—parties in pubs rather than houses, discussing sex less, and gradually losing the reputation as somewhere to go for casual sex.

It’s not all that simple

by Brnr001

The author argues that the discourse around sex on the EA forum has lacked nuance. They discuss how acceptable behaviors vary between classes and cultures, and experiences from their life where they’ve found it difficult to figure out what both their and others’ boundaries and wants were.

The illusion of consensus about EA celebrities

by Ben Millwood

The author often has half-baked, tangential, discouraging or non-actionable criticisms of some respected EA figures. Criticisms like this are unlikely to surface, leading to the community as a whole seeming more deferential or hero-worshiping than it is. This can in turn harm credibility with others who think negatively of those figures, or make newcomers think deference is a norm. They suggest addressing this by writing forum posts about it, making disagreements among leaders visible, and pointing out to newcomers that everyone has a mix of good and bad ideas (with go-to examples of respected people’s blind spots / mistakes).

Special Mentions

A selection of posts that don’t meet the karma threshold, but seem important or undervalued.

Exposure to Lead Paint in Low- and Middle-Income Countries

by Rethink Priorities, jenny_kudymowa, Ruby Dickson, Tom Hird

Linkpost and key takeaways from this shallow investigation commissioned by GiveWell and produced by Rethink Priorities. It overviews what is currently known about the exposure to lead paints in low and middle income countries (LMICs).

Key takeaways include:

Lead exposure is common in LMICs and can cause lifelong health issues, reduced IQ, and lower educational attainment. Lead-based paint is an important exposure pathway and remains unregulated in over 50% of countries.
The authors estimate lead concentrations in paint in residential homes in LMICs range from 50 to 4,500 ppm (90% CI). Lead paints are also commonly used in public spaces, but the relative importance of exposure in the home vs. outside it is unclear.
Lead levels in solvent-based paints are ~20 times higher than water-based paints. Solvent-based paints have a higher market share in LMICs (30%-65%) compared to high-income countries (20%-30%).
Historical US-based lead concentrations in homes (before regulation) were 6-12 times higher than those in recently studied homes in some LMICs.
The authors estimate 90% CIs for the effects of doubling the speed of lead paint bans in LMICs: it could prevent 31-101 million children from exposure, avert income losses of $68-585 billion USD, and save 150,000-5.9 million DALYs over 100 years.

How WorkStream EA strengthens EA orgs, leaders and impact: our observations, programs and plans

by Deena Englander

WorkStream EA offers operations and leadership fellowship programs and general business coaching to upskill core personnel of EA organizations. Qualitatively, there was positive feedback and impact for initial participants. They intend to iterate to:

Focus more heavily on supplemental coaching, particularly for entrepreneurs.
Scale peer to peer conversations via a mastermind group with facilitated calls and dedicated Slack channel.
Add non-profit governance, branding, and fundraising as topics to the entrepreneurship and leadership curriculums.
In future, offer shorter training programs and accelerated options that meet more often but finish sooner.

Applications are open for the May cohorts for their ops, small-org entrepreneurs and large-org leaders streams. More detail and application links here.

Forecasting in the Czech public administration—preliminary findings

by janklenha

Preliminary findings from FORPOL (Forecasting for Policy), who aim to provide real-life experiences and recommendations on making forecasting more policy-relevant and acceptable to public administration.

They have partnered with 12 Czech institutions (such as the Ministry of Health) to forecast relevant questions for them. In one case, their forecasters forecasted likely scenarios of the number of refugees who might flee Ukraine to the Czech Republic, which was then used by multiple ministries in creating programs of support for housing, education, and employment. Without these forecasts, other popular estimates were under-predicting by an order of magnitude.

Their model of cooperation with policy partners includes scoping, forecasting (via tournaments, and via proven advisory groups), and implementation phases.

In the scoping process, they suggest beginning with careful scoping of needs, ensuring predictions feed into ongoing analytical or strategic work, and building trusting personal relationships.
During forecasting, they found maintaining forecasters’ engagement difficult. Shorter tournaments and/or periodical public sharing of the results may help.
In the implementation phase, involving domain experts increases trust and prevents future public criticism. Organizing a follow-up meeting for clarifications and planning the use of predictions was key.

They’re looking for advice, recommendations, or experience sharing for the next round of this project, and are happy to help or consult with those working on similar areas.

EA & LW Forum Weekly Summary (13th − 19th March 2023)