EA & LW Forums Weekly Summary (12th Dec – 18th Dec ’22)

Supported by Rethink Priorities

This is part of a weekly series summarizing the top posts on the EA and LW forums—you can see the full collection here. The first post includes some details on purpose and methodology. Feedback, thoughts, and corrections are welcomed.

If you’d like to receive these summaries via email, you can subscribe here.

Podcast version: prefer your summaries in podcast form? A big thanks to Coleman Snell for producing these! Subscribe on your favorite podcast app by searching for ‘EA Forum Podcast (Summaries)’. More detail here.

Author’s note: I’m heading on holidays, so this will be the last weekly summary until mid-January. Hope you all have a great end of year!

Top /​ Curated Readings

Designed for those without the time to read all the summaries. Everything here is also within the relevant sections later on so feel free to skip if you’re planning to read it all. These are picked by the summaries’ author and don’t reflect the forum ‘curated’ section.

Announcing WildAnimalSuffering.org, a new resource launched for the cause

by David van Beveren

Vegan Hacktivists released this website, which educates the viewer on issues surrounding Wild Animal Suffering, and gives resources for getting involved or learning more. Their focus was combining existing resources into something visually engaging and accessible, as an intro point for those interested in learning about it. Please feel free to share with your networks!

The winners of the Change Our Mind Contest—and some reflections

by GiveWell

GiveWell has announced the winners of its contest for critiques of its cost-effectiveness analyses.

Due to the quality of submissions, they awarded two first place winners, in addition to 8 honorable mentions and $500 prizes for all others of the 49 entries meeting the contest criteria.

GiveWell thinks the contest was worth doing: it provided new ideas and affected their prioritization of issues they were aware of but hadn’t addressed. Currently they expect the contest entries to shift the allocation of resources between programs, but think it’s unlikely they’ll lead to adding or removing programs from their list of recommended charities. They’ve identified ~100 discrete suggestions from entries, which they’re now tracking and prioritizing.

Revisiting algorithmic progress

by Tamay, Ege Erdil

Summary of the authors’ research paper on the effect of algorithmic progress in image classification on ImageNet. They find that every 9 months (95% CI: 4 to 25 months), better algorithms contribute the equivalent of a doubling of compute budgets. Progress in image classification has been driven roughly ~45% by scaling of compute, ~45% by better algorithms, and ~10% by scaling of data. The better algorithms primarily act by using compute more effectively (as opposed to augmenting data).

EA Forum

Philosophy and Methodologies

Octopuses (Probably) Don’t Have Nine Minds

by Bob Fischer

Part of the Moral Weight Project Sequence.

Based on the split-brain condition in humans, some people have wondered whether some humans “house” multiple subjects.

There are superficial parallels between the split-brain condition and the apparent neurological structures of some animals, including octopuses and chickens. To assign a non-negligible credence to these animals housing multiple subjects in a way that matters morally, we’d need evidence that different parts of the animals have valenced conscious states (like pain). This is difficult to get for several reasons outlined in the post. The author therefore recommends not assuming multiple subjects in a single animal for the purposes of the Moral Weight Project.

Overall, the author places up to a 0.1 credence that there are multiple subjects in the split-brain case, but no higher than 0.025 for the 1+8 model of octopuses.

GiveWell’s Moral Weights Underweight the Value of Transfers to the Poor

by Trevor Woolley and Ethan Ligon

GiveWell baselines their cost-effectiveness analyses on the value of doubling consumption. This assumes that the functional form of marginal utility over consumption is 1/​x (where x is real consumption). There is strong evidence this doesn’t match the preferences of the Kenyan beneficiaries of GiveDirectly, and it therefore underweights the value of cash transfers to the very poor.

The authors suggest GiveWell was likely intending to value “halving marginal utility of expenditure”. They empirically estimate the marginal utility over consumption (λ) as revealed by Kenyan beneficiaries of GiveDirectly’s cash transfers program and conclude the value per dollar of cash transfers is 2.6 times GiveWell’s current number (from 0.0034 to 0.009).

The full paper can be read here.
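The post’s core claim can be illustrated with a toy calculation. This is a hedged sketch with made-up consumption numbers, not GiveWell’s or the authors’ actual model: under log utility (marginal utility 1/x), a marginal dollar is worth proportionally more at lower consumption, and more curvature (η > 1) tilts the comparison even further toward the very poor, which is the direction of the authors’ empirical finding.

```python
def marginal_utility(c, eta=1.0):
    """Marginal utility of consumption under CRRA curvature eta.
    eta = 1 corresponds to log utility, i.e. MU(c) = 1/c, the
    functional form the post says GiveWell's analysis assumes."""
    return c ** (-eta)

# Illustrative (hypothetical) annual consumption levels, in dollars.
rich = 30_000
poor = 300

# Under log utility, a marginal dollar is worth ~100x more to the poorer person.
ratio_log = marginal_utility(poor, 1.0) / marginal_utility(rich, 1.0)

# With more curvature (eta = 1.5), the same comparison gives ~1000x.
ratio_curved = marginal_utility(poor, 1.5) / marginal_utility(rich, 1.5)
```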

Neuron Count-Based Measures May Currently Underweight Suffering in Farmed Fish

by MHR

Neuron counts have historically been used as a proxy for the moral weight of different animal species. While alternative weighting systems have been proposed, neuron counts are often still an input to them.

The only publicly-available empirical reports of fish neuron counts sample exclusively from species under 1g bodyweight, while farmed fish are at least 1000x larger. Some sources apply these neuron counts to farmed fish without correction, which likely underweights them. Even where corrections are applied, there is uncertainty in how to extrapolate.

Because of this, the author suggests animal welfare advocates be highly skeptical of current neuron-count based estimates of the moral weight of farmed fish, and consider funding studies to empirically measure neuron counts in these species.

Object Level Interventions /​ Reviews

Creating a database for base rates

by nikos

The author is creating a database to collect base rates for various categories of events eg. protests that have (or have not) led to regime change, developments of new antibiotics, elections with small margins of victory. You can suggest new base rate categories you’d like looked into here.

The main goal is to develop a better understanding of the merits and limitations of reference class forecasting, with a secondary goal of collecting information useful to forecasters and EA stakeholders. Anyone is free to use the data for their own research.

The next decades might be wild

by mariushobbhahn

The author imagines “what [they] would expect the world to look like if (median compute for transformative AI ~2036) were true”. They claim tech can be disruptive, and reach widespread adoption within a few decades of introduction (eg. phones, internet), with the rate of adoption accelerating. AI is getting useful in the real world (including in assisting human coders), transformers work astonishingly well in multiple domains, it seems like AI hype is not slowing down, and AI accomplishments have been unexpected in the past (eg. many were surprised by the first Chess AIs, or by GPT-2, GPT-3, or DALL-E). Based on these points the author writes predictions for each decade between now and 2050+, in the form of vignettes.

Radical tactics can increase support for more moderate groups

by James Ozden

Surveys were conducted on the same 1.4K people before and after a ‘Just Stop Oil’ campaign. The campaign was radical, and 92% of those surveyed were aware of it afterwards. The survey asked about support for climate policies and identification with a more moderate climate organization (Friends of the Earth). Identification increased from 50.3% to 52.9% (p = 0.007), showing a ‘radical flank effect’ - a benefit to the moderate organization from the more radical organization’s campaigning. However, it also showed increased polarization—those with low baseline identification with Friends of the Earth reduced their support for climate policies after the campaign (and vice versa).

Concrete actionable policies relevant to AI safety (written 2019)

by weeatquince

An unedited copy of the author’s 2019 notes on UK AI policy. They took best practices from nuclear safety policy and applied them to AI safety. They no longer agree with everything written. Key recommendations (excluding those marked as ‘now unsure of’) include:

  • Support more long-term thinking in policy /​ politics.

  • Improve the processes for identifying, mitigating, and planning for future risks.

  • Improve the ability of the government to draw on technical and scientific expertise.

  • Have civil servants research policy issues around ethics and technology and AI.

  • Set up a regulator in the form of a well-funded body of technical experts, to ensure safe and ethical behavior of the tech industry and government.

Opportunities

The winners of the Change Our Mind Contest—and some reflections

by GiveWell

GiveWell has announced the winners of its contest for critiques of its cost-effectiveness analyses.

Due to the quality of submissions, they awarded two first place winners, in addition to 8 honorable mentions and $500 prizes for all others of the 49 entries meeting the contest criteria.

They think the contest was worth doing: it provided new ideas and raised the priority of issues they were aware of but hadn’t addressed. Currently they expect the contest entries to shift the allocation of resources between programs, but think it’s unlikely they’ll lead to adding or removing programs from their list of recommended charities. They identified ~100 discrete suggestions from entries, which they’re now tracking and prioritizing.

Announcing the Forecasting Research Institute (we’re hiring)

by Tegan

The Forecasting Research Institute (FRI) is a new organization focused on advancing the science of forecasting for the public good. Their strategy is based around:

  1. Filling in gaps in the science of forecasting eg. how to handle low probability events or complex topics that can’t be captured in a single forecast.

  2. Adapting forecast methods to practical purposes eg. identifying where forecasting could be most useful, and increasing decision-relevance of questions.

Concrete upcoming projects include developing a forecasting proficiency test to quickly identify accurate forecasters, identifying leading indicators of increased risk from AI, and exploring ways to judge and incentivize answers to far-future questions.

They have open, fully remote positions for research analysts, data analysts, content editors and research assistants. Apply here.

Open Philanthropy is hiring for (lots of) operations roles!

by maura

Open Philanthropy is hiring for a Business Operations Lead, Business Operations Generalists, a Finance Operations Assistant, Grants Associates, a People Operations Generalist, a Recruiter, and a Salesforce Administrator & Technical Project Manager. Most but not all roles are remote worldwide, if you can overlap with some US working hours. Applications and referrals are open now (there’s a $5K referral bonus).

CEEALAR: 2022 Update

by CEEALAR

The Centre for Enabling EA Learning & Research (CEEALAR) is an EA hotel that provides grants in the form of food and accommodation on-site in Blackpool, UK. They have lots of space and encourage applications from those wishing to learn or work on research or charitable projects in any cause area. This includes study and upskilling with the intent to move into those areas.

Since opening 4.5 years ago, they’ve supported ~100 EAs with their career development, and hosted another ~200 visitors for events /​ networking /​ community building. It costs CEEALAR ~£800/​month to host someone—including free food, logistics, and project guidance. This is ~13% the cost of an established EA worker, and an example of hits-based giving.

They have plans to expand, and are fixing up a next-door property that will increase capacity by ~70%. They welcome donations, though aren’t in imminent need (they have 12–20 months of runway, depending on factors covered in the post). They’re also looking for a handy-person.

Applications open for AGI Safety Fundamentals: Alignment Course

by Jamie Bernardi, richard_ngo

Apply by 5th January to join the AGI Safety Fundamentals: Alignment Course. It will run Feb—Apr 2023, with 8 weeks of reading and virtual discussions, and a 4-week capstone. Commitment is ~4 hours per week.

Community & Media

What Rethink Priorities General Longtermism Team Did in 2022, and Updates in Light of the Current Situation

by Linch

The General Longtermism team at Rethink Priorities has existed for just under a year, with an average of ~5 FTE. Its theory of change was facilitating the creation of scalable longtermist megaprojects, and improving strategic clarity on intermediate goals longtermists should pursue.

Outputs included:

  • Supporting creation of the Special Projects team, which provides fiscal sponsorship to external entrepreneurial projects.

  • Cofounding and running Condor Camp, a project to engage world-class talent in Brazil for longtermist causes.

  • Cofounding and running Pathfinder, a project to help mid-career professionals find high impact work.

  • 13 shallow research dives into specific projects, with deeper dives on air sterilization techniques, whistleblowing, AI safety recruitment, and infrastructure for independent researchers.

  • Founder search for multiple promising projects.

  • A model for prioritizing between longtermist projects.

  • Research and database of resources on nanotech strategy.

The team is currently reorienting strategy for 2023. Recent changes to EA funding mean megaprojects seem less relevant (and some research questions more relevant), but it’s still plausible entrepreneurial longtermist projects might be a main research direction for the team.

Ideas for highly impactful research projects, donations, expressions of interest, and feedback on plans are all highly appreciated.

EA career guide for people from LMICs

by Surbhi B, Mo Putera, varun_agr, AmAristizabal

The authors broadly recommend the following for EAs from low and middle income countries (LMICs):

  • Build career capital early on

  • Work on global issues over local ones, unless clear reasons for the latter

  • Some individuals to do local versions of: community building, priorities research, charity-related activities, or career advising

They discuss pros, cons, and concrete next steps for each. Individuals can use the scale /​ neglectedness /​ tractability framework, marginal value, and personal fit to assess options. They suggest looking for local comparative advantage at global priorities, and taking the time to upskill and engage deeply with EA ideas before jumping into direct work.

Announcing WildAnimalSuffering.org, a new resource launched for the cause

by David van Beveren

Vegan Hacktivists released this website, which educates the viewer on issues surrounding Wild Animal Suffering, and gives resources for getting involved or learning more. Their focus was combining existing resources into something visually engaging and accessible, as an intro point for those interested in learning about it. Please feel free to share with your networks!

Announcing ERA: a spin-off from CERI

by Nandini Shiralkar

The CERI Fellowship has spun off from the Cambridge Existential Risks Initiative (CERI), and will be run by a new nonprofit called ERA from 2023. This allows CERI to re-focus on local community projects for the University of Cambridge, and reduces name confusion with the many EA projects /​ groups ending in ‘ERI’.

Applications for their July—August ERA Cambridge Fellowship (8-week paid programme focused on existential risk mitigation projects) will open in Jan /​ Feb—register your interest here to be notified when they do. They’re also looking for mentors, and expressions of interest for joining the team.

The Rules of Rescue—out now!

by Theron

The Rules of Rescue is a new book by the post author, which “defends a novel picture of the moral reasons and requirements to use time, money, and other resources to help others the most.” It’s open access and you can read the PDF for free here, visit the website, or buy an ebook or printed copy.

Reflections on the PIBBSS Fellowship 2022

by nora, particlemania

PIBBSS (Principles of Intelligent Behavior in Biological and Social Systems) facilitates research on parallels between intelligent behavior in natural and artificial systems, with the aim to use this towards building safe and aligned AI.

They ran a 3-month summer research fellowship with 20 scholars from varying fields—including 6 weeks reading, 2 research retreats, biweekly speakers, and individual research support. ~12 had a significant counterfactual move toward engaging in the AI safety field, 6-10 made interesting progress on promising research programs like intrinsic reward-shaping in brains, and 3-5 started long-term collaborations. They also developed a multi-disciplinary research network beyond just Fellows.

They think they’ll run it again, with more structured support, encouragement of research outputs that can be communicated sooner, more weight on ML experience for those with prosaic projects, and more care about accepting fellows with conflicting incentives (eg. from academia).

I went to the Progress Summit. Here’s What I Learned.

by Nick Corvino

The Progress Summit is run by The Atlantic, to “highlight the most exciting ideas in science and technology” and “discuss how we can invent our way to a better world”. The author thinks the Progress Studies community is reasonably aligned with what EAs care about and could be a good alternative for those who find EA too intimidating, intense, or too longtermism-focused.

The author shares some reflections from attending, including:

  • It felt more professional than EA events (cocktails, food, outfits, smooth Ops).

  • Talks were fluffy, but speakers were eloquent and engaging. They were often ‘selling’ their products in their talks, to appeal to investors and venture capitalists.

  • Networking was less intense—more small talk.

  • The majority of attendees were bullish on tech progress and weren’t familiar with x-risks like AGI or biorisk. Where risk was addressed, it was economic, climate change, or war.

Personal Finance for EAs

by NicoleJaneway

The author went to EAGxBerkeley, and found many young EAs don’t have a strong grasp of personal finance. They suggest EAs (especially student groups) could benefit from education here eg. how to use low-cost index funds for investing, or setting up rainy day funds. Because EAs have different needs to the general population (eg. they can take more risk with assets they plan to donate), they also suggest the next EAGx have a talk that covers smart ways to maximize giving strategies, geared towards the rules of the country hosting it.

EA Landscape in the UK

by DavidNash

The UK has three EA hubs—London, Oxford, and Cambridge. In addition there are many student groups, a city group in Bristol, and the EA hotel in Blackpool. The post details EA communities, organisations, and offices in each city.

We should say more than “x-risk is high”

by OllieBase

Some posts have argued that in order to persuade people to work on high priority issues like AI Safety and biosecurity, we only need to point to high x-risk this century, not to longtermism or broader EA principles. The author agrees this could convince people, but disagrees with that approach in general, because:

  1. Our situation could change (eg. x-risk lower than we thought)

  2. Our priorities could change (eg. the best interventions could be something indirect like ensuring global peace)

  3. It risks losing what makes EA distinctive, and being dismissed as alarmist—other movements also focus on x-risk arguments (eg. Extinction Rebellion).

Therefore, the author suggests outlining the case for longtermism and how it implies that x-risk should be a top priority even if x-risk is low, to make the community robust to these scenarios.

Kurzgesagt’s most recent video promoting the introduction of wild life to other planets is unethical and irresponsible

by David van Beveren

Kurzgesagt is an educational YouTube channel with ~20M subscribers. They’ve done several videos on EA and longtermism-related topics, and have funding from Open Philanthropy for this.

Their latest video, “How to Terraform Mars—WITH LASERS” promotes the idea of seeding wildlife on other planets. It doesn’t mention anything about the welfare of these animals, which could involve suffering from adapting to hostile and unfamiliar environments. The author argues not addressing this issue is a common problem in almost all major plans and discussions on terraforming or space colonization.

You Don’t Have to Call Yourself an Effective Altruist or Fraternize With Effective Altruists or Support Longtermism, Just Please, for the Love of God, Help the Global Poor

by Omnizoid

There are amazing opportunities to help the global poor (see GiveWell recommendations), some of whose incomes are ~1% of poor people in the USA. The author asks readers to please support this cause, even if they think badly of EA /​ don’t want to be part of the EA community.

The Effective Altruism movement is not above conflicts of interest

by sphor

Linkpost and excerpts from an EA criticism contest entry published by a pseudonymous author on 31st August 2022 (before the collapse of FTX).

The post notes that EA’s reliance on ultra-wealthy individuals like Sam Bankman-Fried (SBF) incentivizes the community to accept political stances and moral judgments based on their alignment with the interests of its wealthy donors. They argue EA has failed to identify and publicize these conflicts of interest, and suggest that EA should do so, and then consider what systematic safeguards might be needed. So far EA has relied on the promotion of debate, which isn’t sufficient because individuals can’t consciously free themselves of bias.

As an example, they discuss how cryptocurrency is inherently political, how attacks on it affect EA’s reputation, and the risk to EA if SBF were involved in an ethical or legal scandal. Because of this, EA has an incentive to protect SBF’s reputation, think positively of cryptocurrency, and counter critics.

EA is probably undergoing “Evaporative Cooling” right now

by freedomandutility

When a group goes through a crisis (eg. the FTX collapse), those who hold the group’s beliefs least strongly leave, and those who hold the group’s beliefs most strongly stay. This might leave the remaining group less able to identify weaknesses within group beliefs or course-correct, or “steer”. The author suggests one way to combat this would be to move community building focus to producing moderately-engaged EAs instead of highly-engaged EAs.

Cryptocurrency is not all bad. We should stay away from it anyway.

by titotal

The author argues that “the crypto industry as a whole has significant problems with speculative bubbles, ponzis, scams, frauds, hacks, and general incompetence”, and that EA orgs should avoid being significantly associated with it until the industry becomes stable.

In the last year, at least 4 crypto firms collapsed, excluding FTX. Previous downturns included the collapse of Mt. Gox, the largest crypto exchange at the time. Crypto’s use is dominated by people trying to get rich—after 14 years, there are almost no widespread uses beyond this. The author argues this all indicates a speculative bubble, which will likely collapse again (though perhaps not in the same way). If EA is associated with it, this could lead to a negative reputation that EA “keeps getting scammed”.

“I’m as approving of the EA community now as before the FTX collapse” (Duncan Sabien)

by Will Aldred

Given a community of thousands, the author expects some bad things to happen. FTX feels more like bad luck than “how dare we not have predicted this /​ why weren’t we robust to this”. We should reflect and potentially act, which they believe is happening, but the level of vigilance being proposed by some would paralyze the movement. They continue to support and endorse EA to the same level as before this black swan event.

I’m less approving of the EA community now than before the FTX collapse

by throwaway790

The author is specifically less approving of CEA /​ EVF, Will MacAskill, and donation practices during the “funding overhang”. They are more approving of Peter Wildeford, Rethink Priorities, Rob Wiblin, and Dustin Moskovitz. The reasons for disapproval include:

  • Statements during the events being minimizing, ambiguous, or missing

  • Not taking enough responsibility for involvement with FTX

  • Funding decisions like buying Wytham Abbey

They are still in favor of EA principles, and plan to donate to EA causes.

Reflections on Vox’s “How effective altruism let SBF happen”

by Richard Y Chappell

Richard believes the article correctly identifies that EA needs more respect for established procedures, and suggests a culture of consulting with senior advisors who understand how institutions work and why. He disagrees with the framing from Vox that “the problem is the dominance of philosophy”.

Sam Bankman-Fried has been arrested

by Markus Amalthea Magnuson

On 12th December, SBF was arrested in the Bahamas following receipt of formal notification from the United States that it has filed criminal charges against SBF and is likely to request his extradition.

The US Attorney for the SDNY asked that FTX money be returned to victims. What are the moral and legal consequences to EA?

by Fermi–Dirac Distribution

In a December 13 press conference the United States Attorney for the Southern District of New York said: “To any person, entity or political campaign that has received stolen customer money, we ask that you work with us to return that money to the innocent victims.” This post is an open thread for discussion on this.

Didn’t Summarize

Hugh Thompson Jr (1943–2006) by Gavin

EA’s Achievements in 2022 by ElliotJDavies (open thread)

Today is Draft Amnesty Day (December 16-18) by Lizka

[Expired] $50 TisBest Charity Gift Card to the first 20,000 people who sign up by Michael Huang

LW Forum

How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme

by Collin

The author and collaborators published Discovering Latent Knowledge in Language Models Without Supervision. This post discusses how it fits into a broader alignment scheme.

Paper summary (summarized from this twitter thread): Existing language model training techniques have the issue that human data contains human-like errors; eg. a model trained to generate highly-rated text can output errors human evaluators don’t notice. Instead, the authors propose finding latent “truth-like” features without human supervision, by searching for implicit beliefs or knowledge learned by a model. Their method, CCS (Contrast-Consistent Search), outperforms model outputs on accuracy, even when model outputs are unreliable or uninformative.

Post summary (Author’s tl;dr): unsupervised methods are more scalable than supervised methods, deep learning has special structure that we can exploit for alignment, and we may be able to recover superhuman beliefs from deep learning representations in a totally unsupervised way.
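To make the CCS idea concrete, here is a minimal sketch of the consistency-plus-confidence objective described in the paper, applied to a single contrast pair. The probe that produces these probabilities is omitted; only the loss form follows the paper, everything else is illustrative.

```python
def ccs_loss(p_pos, p_neg):
    """CCS objective on one contrast pair.
    p_pos: the probe's probability that a statement is true;
    p_neg: the probe's probability that its negation is true.
    Consistency pushes p_pos toward 1 - p_neg; confidence penalizes
    the degenerate solution where both probabilities sit at 0.5."""
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = min(p_pos, p_neg) ** 2
    return consistency + confidence

# A consistent, confident probe scores near zero:
good = ccs_loss(0.95, 0.05)
# The trivial "always answer 0.5" probe is penalized by the confidence term:
degenerate = ccs_loss(0.5, 0.5)
```

Minimizing this over many contrast pairs, without any truth labels, is what lets the method recover “truth-like” directions unsupervised.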

AI alignment is distinct from its near-term applications

by paulfchristiano

Existing AI systems are misaligned, ie. they often do things their designers don’t want, like saying offensive things. These systems are a good empirical testbed for alignment research. However, if companies train AIs to be very conservative and inoffensive, it risks backlash against and misunderstanding of what alignment is. The main purpose of alignment is to stop AI from killing everyone, and it could be very bad if efforts to prevent this were undermined by a vague public conflation between AI alignment and corporate policies.

Revisiting algorithmic progress

by Tamay, Ege Erdil

Summary of the authors’ research paper on the effect of algorithmic progress in image classification on ImageNet. They find that every 9 months (95% CI: 4 to 25 months), better algorithms contribute the equivalent of a doubling of compute budgets. Progress in image classification has been driven roughly ~45% by scaling of compute, ~45% by better algorithms, and ~10% by scaling of data. The better algorithms primarily act by using compute more effectively (as opposed to augmenting data).
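As a back-of-the-envelope check on what these doubling times imply, here’s a small sketch. The 9-month central estimate and the 4-to-25-month CI bounds are from the paper; the decade horizon is just an illustrative choice.

```python
def algorithmic_multiplier(months, doubling_months=9.0):
    """Equivalent compute multiplier from algorithmic progress alone,
    assuming one doubling every `doubling_months` months."""
    return 2.0 ** (months / doubling_months)

# Central estimate: over a decade, a 9-month doubling time implies
# roughly a 10,000x effective-compute gain from better algorithms alone.
central = algorithmic_multiplier(120)

# The wide 95% CI (4 to 25 months) spans very different worlds:
fast = algorithmic_multiplier(120, 4)    # roughly a billion-fold
slow = algorithmic_multiplier(120, 25)   # only ~28x
```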

[Interim research report] Taking features out of superposition with sparse autoencoders

by Lee Sharkey, Dan Braun, beren

Author’s TL;DR: Recent results from Anthropic suggest that neural networks represent features in superposition. This motivates the search for a method that can identify those features. Here, we construct a toy dataset of neural activations and see if we can recover the known ground truth features using sparse coding. We show that, contrary to some initial expectations, it turns out that an extremely simple method – training a single layer autoencoder to reconstruct neural activations with an L1 penalty on hidden activations – doesn’t just identify features that minimize the loss, but actually recovers the ground truth features that generated the data. We’re sharing these observations quickly so that others can begin to extract the features used by neural networks as early as possible. We also share some incomplete observations of what happens when we apply this method to a small language model and our reflections on further research directions.
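As a rough illustration of the training objective the report describes (a single-layer autoencoder reconstructing neural activations with an L1 penalty on hidden activations), here is a hedged sketch. The shapes, initialization, and L1 coefficient are made-up toy choices; the actual method trains these weights rather than evaluating random ones.

```python
import numpy as np

def sae_loss(x, W_enc, b_enc, W_dec, l1_coeff=1e-3):
    """Loss for a single-layer sparse autoencoder: reconstruct
    activations x through a (typically wider) hidden layer, with an
    L1 penalty on the ReLU hidden activations to push toward sparse
    feature representations."""
    h = np.maximum(0.0, x @ W_enc + b_enc)  # hidden feature activations
    x_hat = h @ W_dec                        # linear reconstruction
    recon = np.mean((x - x_hat) ** 2)        # reconstruction error
    sparsity = np.mean(np.abs(h))            # L1 sparsity penalty
    return recon + l1_coeff * sparsity

# Toy setup: 16-dim activations, a 64-feature (overcomplete) dictionary.
rng = np.random.default_rng(0)
d_in, d_hidden = 16, 64
x = rng.normal(size=(32, d_in))
W_enc = 0.1 * rng.normal(size=(d_in, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = 0.1 * rng.normal(size=(d_hidden, d_in))
loss = sae_loss(x, W_enc, b_enc, W_dec)
```

The report’s claim is that minimizing exactly this kind of objective on a toy dataset recovers the ground-truth features that generated the activations, not just any loss-minimizing dictionary.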

Trying to disambiguate different questions about whether RLHF is “good”

by Buck

Conversations on whether Reinforcement Learning from Human Feedback (RLHF) is a promising alignment strategy, ‘won’t work’, or ‘is just capabilities research’ are muddled. The author distinguishes 11 related questions, and gives their opinions on them.

Overall they think RLHF by itself (with non-aided human overseers) is unlikely to be a promising alignment strategy, and there are failure modes like RLHF selecting for models that look aligned but aren’t. However, they think a broader version (eg. with AI-assisted humans) could be a part of an alignment strategy, and that researching alignment schemes involving RLHF could be one of the most promising research directions.

Okay, I feel it now

by g1

The author has been aware of AI x-risk arguments for a while, and often agreed with them, but in a detached way. Spending time observing ChatGPT has brought their gut feelings into line with their beliefs.

Can we efficiently explain model behaviors?

by paulfchristiano

Alignment Research Center’s (ARC’s) current plan for Eliciting Latent Knowledge (ELK) has 3 major challenges. The author describes why they expect significant progress on #1 and #3 over the next 6 months, and why that would be a big deal even if #2 turns out to be extremely challenging. The challenges are:

  1. Formalizing probabilistic heuristic argument as an operationalization of ‘explanation’

  2. Finding sufficiently specific explanations for important model behaviors

  3. Checking whether particular instances of a behavior are ‘because of’ a particular explanation

Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)

by LawrenceC

Linkpost for this paper.

Author’s summary: The authors propose a method for training a harmless AI assistant that can supervise other AIs, using only a list of rules (a “constitution”) as human oversight. The method involves two phases: first, the AI improves itself by generating and revising its own outputs; second, the AI learns from preference feedback, using a model that compares different outputs and rewards the better ones. The authors show that this method can produce a non-evasive AI that can explain why it rejects harmful queries, and that can reason in a transparent way, better than standard RLHF.

Didn’t Summarize

Consider using reversible automata for alignment research by Alex_Altair

[Interim research report] Taking features out of superposition with sparse autoencoders by Lee Sharkey, Dan Braun, beren
