I think AI Risk Intro 1: Advanced AI Might Be Very Bad is great.
I agree, that seems concerning. Ultimately, since the AI developers are designing the AIs, I would guess that they would try to align the AI to be helpful to the users/consumers or to the concerns of the company/government, if they succeed at aligning the AI at all. As for your suggestions “Alignment with whoever bought the AI? Whoever uses it most often? Whoever might be most positively or negatively affected by its behavior? Whoever the AI’s company’s legal team says would impose the highest litigation risk?” – these all seem plausible to me.
On the separate question of handling conflicting interests: there’s some work on this (e.g., “Aligning with Heterogeneous Preferences for Kidney Exchange” and “Aligning AI with Human Norms through Multi-Objective Reinforced Active Learning”), though perhaps not as much as we would like.
But I sometimes have a fear in the back of my mind that some of the attendees who are intrigued by these ideas are later going to look up effective altruism, get the impression that the movement’s focus is just about existential risks these days, and feel duped. Since EA pitches don’t usually start with longtermist ideas, it can feel like a bait and switch.
To avoid the feeling of a bait and switch, I think one solution is to introduce existential risk in the initial pitch. For example, when introducing my student group Effective Altruism at Georgia Tech, I tend to say something like: “Effective Altruism at Georgia Tech is a student group which aims to empower students to pursue careers tackling the world’s most pressing problems, such as global poverty, animal welfare, or existential risk from climate change, future pandemics, or advanced AI.” It’s totally fine to mention existential risk – students still seem pretty interested and happy to sign up for our mailing list.
I think AI alignment isn’t really about designing AI to maximize the preference satisfaction of a certain set of humans. I think an aligned AI would look more like an AI which:
is not trying to cause an existential catastrophe or take control of humanity
has had undesirable behavior trained out or adversarially filtered
learns from human feedback about what behavior is more or less preferable
In this case, we would hope the AI would be aligned to the people who are allowed to provide feedback
has goals which are corrigible
is honest, non-deceptive, and non-power-seeking
Thanks for writing this! There’s been a lot of interest in EA community building, but I think one of the most valuable parts of EA community building is basically just recruiting – e.g., notifying interested people about relevant opportunities and inspiring them to apply for impactful ones. However, a lot of potential talent isn’t looped in with a local EA group or the EA community at all, so I think more professional recruiting could help a lot with solving organizational bottlenecks.
I was excited to read this post! At EA at Georgia Tech, some of our members are studying industrial engineering or operations research. Should we encourage them to reach out to you if they’re interested in getting involved with operations research for top causes?
What are some common answers you hear for Question #4: “What are the qualities you look for in promising AI safety researchers? (beyond general intelligence)”
Technical note: I think we need to be careful to note the difference in meaning between extinction and existential catastrophe. When Joseph Carlsmith talks about existential catastrophe, he doesn’t necessarily mean all humans dying; in this report, he’s mainly concerned about the disempowerment of humanity. Following Toby Ord in The Precipice, Carlsmith defines an existential catastrophe as “an event that drastically reduces the value of the trajectories along which human civilization could realistically develop”. It’s not straightforward to translate his estimates of existential risk to estimates of extinction risk.
Of course, you don’t need to rely on Joseph Carlsmith’s report to believe that there’s a ≥7.9% chance of human extinction conditional on AGI.
Here’s my proposal for a contest description. Contest problems #1 and 2 are inspired by Richard Ngo’s Alignment research exercises.
AI alignment is the problem of ensuring that advanced AI systems take actions which are aligned with human values. As AI systems become more capable and approach or exceed human-level intelligence, it becomes harder to ensure that they remain within human control instead of posing unacceptable risks.
One solution to AI alignment proposed by Stuart Russell, a leading AI researcher, is the assistance game, also called a cooperative inverse reinforcement learning (CIRL) game, which follows these principles:
“The machine’s only objective is to maximize the realization of human preferences.
The machine is initially uncertain about what those preferences are.
The ultimate source of information about human preferences is human behavior.”
For a more formal specification of this proposal, please see Stuart Russell’s new book on why we need to replace the standard model of AI, Cooperatively Learning Human Values, and Cooperative Inverse Reinforcement Learning.
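To make the setup a bit more concrete, here is a minimal toy sketch of the core idea: a robot that is uncertain about the human’s preferences and narrows them down by observing the human’s behavior. The candidate reward functions, actions, and Boltzmann-rational human model below are illustrative assumptions for this sketch, not Russell’s formal CIRL specification:

```python
import numpy as np

# Toy assistance-game sketch: the robot keeps a belief over which candidate
# reward function the human actually has, and updates that belief by watching
# the human act (approximately) optimally. Everything here is an illustrative
# assumption rather than the formal CIRL definition.

actions = ["make_coffee", "make_tea", "do_nothing"]

# Candidate human reward functions (unknown to the robot).
candidate_rewards = {
    "likes_coffee": {"make_coffee": 1.0, "make_tea": 0.1, "do_nothing": 0.0},
    "likes_tea":    {"make_coffee": 0.1, "make_tea": 1.0, "do_nothing": 0.0},
}

belief = {name: 0.5 for name in candidate_rewards}  # uniform prior

def human_action_prob(reward, action, beta=5.0):
    """Boltzmann-rational human: higher-reward actions are exponentially more likely."""
    prefs = np.array([reward[a] for a in actions])
    probs = np.exp(beta * prefs) / np.exp(beta * prefs).sum()
    return probs[actions.index(action)]

def update_belief(observed_action):
    """Bayesian update of the robot's belief after observing the human act."""
    for name, reward in candidate_rewards.items():
        belief[name] *= human_action_prob(reward, observed_action)
    total = sum(belief.values())
    for name in belief:
        belief[name] /= total

# After watching the human make tea twice, the robot's belief concentrates on
# "likes_tea", and it would then choose actions to maximize that reward.
update_belief("make_tea")
update_belief("make_tea")
print(belief)
```

The key feature of the setup is that the robot’s belief about human preferences, rather than a fixed reward function, is what drives its behavior.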
Contest problem #1: Why are assistance games not an adequate solution to AI alignment?
The first link describes a few critiques; you’re free to restate them in your own words and elaborate on them. However, we’d be most excited to see a detailed, original exposition of one or a few issues, which engages with the technical specification of an assistance game.
Another proposed solution to AI alignment is iterated distillation and amplification (IDA), proposed by Paul Christiano. Paul runs the Alignment Research Center and previously ran the language model alignment team at OpenAI. In IDA, a human H wants to train an AI agent X by repeating two steps: amplification and distillation. In the amplification step, the human uses multiple copies of X to help solve a problem. In the distillation step, the agent X learns to reproduce the output of the amplified system (the human plus multiple copies of X). Then we go through another amplification step, then another distillation step, and so on.
You can learn more about this at Iterated Distillation and Amplification and see a simplified application of IDA in action at Summarizing Books with Human Feedback.
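For intuition, here is a minimal conceptual sketch of the amplify/distill loop. The Model class, the decompose() helper, and the supervised training step are placeholder assumptions for illustration, not Christiano’s actual implementation:

```python
# Conceptual IDA sketch: amplify (human + copies of X), distill (train X to
# imitate the amplified system), then repeat. All components are stand-ins.

class Model:
    """Stand-in for agent X; starts out weak and is repeatedly retrained."""
    def answer(self, question: str) -> str:
        return f"weak answer to: {question}"

    def train_to_imitate(self, examples: list[tuple[str, str]]) -> None:
        # In practice this would be supervised learning on (input, target) pairs.
        pass

def decompose(question: str) -> list[str]:
    """Stand-in for the human breaking a hard question into easier subquestions."""
    return [f"subquestion {i} of: {question}" for i in range(3)]

def amplify(human_compose, model: Model, question: str) -> str:
    """Amplification: the human plus several copies of X tackle the question together."""
    sub_answers = [model.answer(q) for q in decompose(question)]
    return human_compose(question, sub_answers)

def distill(model: Model, questions: list[str], human_compose) -> None:
    """Distillation: train X to directly reproduce the amplified system's answers."""
    targets = [(q, amplify(human_compose, model, q)) for q in questions]
    model.train_to_imitate(targets)

def ida_loop(model: Model, questions: list[str], human_compose, rounds: int = 3) -> Model:
    for _ in range(rounds):  # amplify, distill, repeat
        distill(model, questions, human_compose)
    return model

# Example usage; the lambda is a trivial stand-in for the human's composition step.
trained = ida_loop(Model(), ["How should we allocate this budget?"],
                   lambda q, subs: " / ".join(subs))
```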
Contest problem #2: Why might an AI system trained through IDA be misaligned with human values? What assumptions would be needed to prevent that?
Contest problem #3: Why is AI alignment an important problem? What are some research directions and key open problems? How can you or other students contribute to solving it through your career?
We’d recommend reading Intro to AI Safety, Why AI alignment could be hard with modern deep learning, AI alignment—Wikipedia, My Overview of the AI Alignment Landscape: A Bird’s Eye View—AI Alignment Forum, AI safety technical research—Career review, and Long-term AI policy strategy research and implementation—Career review.
You’re free to submit to one or more of these contest problems. You can write as much or as little as you feel is necessary to express your ideas concisely; as a rough guideline, feel free to write between 300 and 2000 words. For the first two contest problems, we’ll be evaluating submissions based on the level of technical insight and research aptitude that you demonstrate, not necessarily the quality of writing.
I like how contest problems #1 and 2:
provide concrete proposals for solutions to AI alignment, so it’s not an impossibly abstract problem
ask participants to engage with prior research and think about issues, which seems to be an important aspect of doing research
are approachable
Contest problem #3 here isn’t a technical problem, but I think it can be helpful so that participants actually end up caring about AI alignment rather than just engaging with it once as part of this contest. I think it would be exciting if participants learned on their own about why AI alignment matters, formed a plan for how they could work on it as part of their career, and ended up motivated to continue thinking about AI alignment or to support AI safety field-building efforts in India.
Some quick thoughts:
Strong +1 to actually trying and not assuming a priori that you’re not good enough.
If you’re at all interested in empirical AI safety research, it’s valuable to just try to get really good at machine learning research.
An IMO medalist or generic “super-genius” is not necessarily someone who would be a top-tier AI safety researcher, and vice versa.
For trying AI safety technical research, I’d strongly recommend How to pursue a career in technical AI alignment.
As a countervailing perspective, Dan Hendrycks thinks that it would be valuable to have automated moral philosophy research assistance to “help us reduce risks of value lock-in by improving our moral precedents earlier rather than later” (though I don’t know if he would endorse this project). Likewise, some AI alignment researchers think it would be valuable to have automated assistance with AI alignment research. If EAs could write a good EA Forum post just by giving GPT-EA-Forum a prompt and revising the resulting post, that could help EAs save time and explore a broader space of research directions. Still, I think some risks are:
This bot would write content similar to what the EA Forum has already written, rather than advancing EA philosophy
The content produced is less likely to be well-reasoned, lowering the quality of content on the EA Forum
Distributed computing seems to be a skill in high demand among AI safety organizations. Does anyone have recommendations for resources to learn about it? Would it look like using the PyTorch Distributed package or something like a microservices architecture?
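In case it helps to picture the first option, here is a minimal sketch of what single-node, multi-GPU data-parallel training with the PyTorch Distributed package typically looks like; the model, data, and hyperparameters are placeholders:

```python
# Minimal DistributedDataParallel sketch (placeholder model and data).
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(512, 10).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])   # gradients sync across workers
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for _ in range(100):                          # placeholder training loop
        x = torch.randn(32, 512, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()                           # DDP all-reduces gradients here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

You would launch it with something like `torchrun --nproc_per_node=4 train.py`.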
I feel somewhat concerned that, after reading your repeated writing saying “use your AGI to (metaphorically) burn all GPUs”, someone might actually try to do so, even though their AGI isn’t actually aligned or powerful enough to do so without causing catastrophic collateral damage. At the very least, the suggestion encourages AI race dynamics – because if you don’t make AGI first, someone else will try to burn all your GPUs! – and makes the AI safety community seem thoroughly supervillain-y.
Points 5 and 6 suggest that soon after someone develops AGI for the first time, they must use it to perform a pivotal act as powerful as “melt all GPUs”, or else we are doomed. I agree that figuring out how to align such a system seems extremely hard, especially if this is your first AGI. But aiming for such a pivotal act with your first AGI isn’t our only option, and this strategy seems much riskier than taking some more time to use our AGI to make further progress on alignment before attempting any pivotal acts. I think it’s plausible that all major AGI companies could stick to only developing AGIs that are (probably) not power-seeking for a decent number of years. Remember, even Yann LeCun of Facebook AI Research thinks that AGI should have strong safety measures. Further, we could have compute governance and monitoring to prevent rogue actors from developing AGI, at least until alignment is solved well enough that we can entrust more capable AGIs with developing strong guarantees against random people creating misaligned superintelligences. (There are also similar comments and responses on LessWrong.)
Perhaps a crux here is that I’m more optimistic than you about things like slow takeoffs, AGI likely being at least 20 years out, the possibility of using weaker AGI to help supervise stronger AGI, and AI safety becoming mainstream. Still, I don’t think it’s helpful to claim that we must or even should aim to “burn all GPUs” with our first AGI, instead of considering alternative strategies.
Thanks for writing this! I’ve seen Hilary Greaves’ video on longtermism and cluelessness in a couple university group versions of the Intro EA Program (as part of the week on critiques and debates), so it’s probably been influencing some people’s views. I think this post is a valuable demonstration that we don’t need to be completely clueless about the long-term impact of presentist interventions.
I’m really sorry that my comment was harsher than I intended. I think you’ve written a witty and incisive critique which raises some important points, but I had raised my standards since this was submitted to the Red Teaming Contest.
For future submissions to the Red Teaming Contest, I’d like to see posts that are much more rigorously argued than this. I’m not concerned about whether the arguments are especially novel.
My understanding of the post’s key claim is that EA should consider reallocating some resources from longtermist to neartermist causes. This seems plausible – perhaps some types of marginal longtermist donations are predictably ineffective, or it’s bad if community members feel that longtermism unfairly has easier access to funding – but I didn’t find the four reasons/arguments given in this post particularly compelling.
The section Political Capital Concern appears to claim: if EA as a movement doesn’t do anything to help regular near-term causes, people will think that it’s not doing anything to help people, and it could die as a movement. I agree that this is possible (though I also think a standalone “longtermism movement” could still be reasonably successful, even if it attracted far fewer members than EA). However, EA continues to dedicate substantial resources to near-term causes – hundreds of millions of dollars in donations each year! – and this number is only increasing, as GiveWell hopes to direct $1 billion in donations per year. EA also continues to highlight its contributions to near-term causes. As a movement, EA is doing fine in this regard.
So then, if the EA movement as a whole is doing fine in this regard, who should change their actions based on the political capital concern? I think it’s more interesting to examine whether local EA groups, individuals, and organizations should have a direct positive impact on near-term causes for signalling reasons. The post only gives the following recommendation (which I find fairly vague): “Instead, the thought is: when running your utility models, factor this in however you can. Consider that utility translated from EA resources to present life, when done effectively and messaged well, redounds as well on the gains to future life.” However, redirecting resources from longtermism to neartermism has costs to the longtermist projects you’re not supporting. How do we navigate these tradeoffs? It would have been great to see examples here.
The “Social Capital Concern” section writes:
focusing on longterm problems is probably way more fun than present ones. Longtermism projects seem inherently more big picture and academic, detached from the boring mundanities of present reality.
This might be true for some people, but I think that for most EAs, concrete or near-term ways of helping people have a stronger emotional appeal, all else equal. I would find the inverse of the sentence a lot more convincing, to be honest: “focusing on near-term problems is probably way more fun than ones in the distant future. Near-term projects seem inherently more appealing and helpful, grounded in present-day realities.”
But that aside, if I am correct that longtermism projects are sexier by nature, when you add communal living/organizing to EA, it can probably lead to a lot of people using flimsy models to talk and discuss and theorize and pontificate, as opposed to creating tangible utility, so that they can work on cool projects without having to get their hands too dirty, all while claiming the mantle of not just the same, but greater, do-gooding.
Longtermist projects may be cool, and their utility may be more theoretical than that of near-term projects, but I’m extremely confused about what you mean when you say they don’t involve getting your hands dirty (in a way that near-termist work, such as GiveWell’s charity effectiveness research, does). Effective donations have historically been the main neartermist EA activity, and donating is quite hands-off.
So individual EA actors, given social incentives brought upon by increased communal living, will want to find reasons to engage in longtermism projects because it will increase their social capital within the community.
This seems likely, and thanks for raising this critique (especially if it hasn’t been highlighted before), but what should we do about it? The red-teaming contest is looking for constructive and action-relevant critiques, and I think it wouldn’t be that hard to take some time to propose suggestions. The action implied by the post is that we should consider shifting more resources to near-termism, but I don’t think that would necessarily be the right move, compared to, e.g., being more thoughtful about social dynamics and making an effort to welcome neartermist perspectives.
The section on Muscle Memory Concern writes:
I think this is a reason to avoid a disproportionate emphasis on longtermism projects. Because longtermism efficacy is inherently more difficult to calculate with confidence, it can become quite easy to forget how to provide utility quickly and confidently.
I don’t know – even the most meta of longtermist projects, such as longtermist community building (or, to go another meta level up, support for longtermist community building), is quite grounded in metrics and has short feedback loops, such that you can tell whether your activities are having an impact – if not impact on utility across all time, then at least something tangible, such as high-impact career transitions. I think these skills would transfer fairly well to something more near-termist, such as community organizing for animal welfare, or running organizations in general. In contrast, if you’re doing charity effectiveness research, whether near-termist or longtermist, it can be hard to tell if your work is any good. Now that more EAs are getting their hands dirty with projects instead of just earning to give, I think we have more experience as a community in executing projects, whether longtermist or near-termist.
As for the final section, the discount factor concern:
Future life is less likely to exist than current life. I understand the irony here, since longtermism projects seek to make it more likely that future life exists. But inherently you just have to discount the utility of each individual future life. In the aggregate, there’s no question that the utility gains are still enormous. But each individual life should have some discount based on this less-likely-to-exist factor.
I think longtermists are already accounting for the fact that we should discount future people by their likelihood of existing. That said, longtermist expected utility calculations are often more naive than they should be. For example, we often wrongly interpret reducing x-risk from one cause by 1% as reducing x-risk as a whole by 1% (if AI accounted for, say, half of total existential risk, cutting AI risk by 1% would only cut total risk by about 0.5%), or conflate a 1% x-risk reduction this century with a 1% x-risk reduction across all time.
(I hope you found this comment informative, but I don’t know if I’ll respond to this comment, as I already spent an hour writing this and don’t know if it was a good use of my time.)
Some quick thoughts:
EA Virtual Programs should be fine in my opinion, especially if you think you have more promising things to do than coordinating logistics for a program or facilitating cohorts
The virtual Intro EA Program only has discussions in English and Spanish. If group members would much prefer to have discussions in Hungarian instead, it might be useful for you to find some Hungarian-speaking facilitators.
Like Jaime commented, if you’re delegating EA programs to EA Virtual Programs, it’s best for you to have some contact with participants, especially particularly engaged ones, so that you can have one-on-one meetings exploring their key uncertainties, share relevant opportunities with them, and so on.
It’s rare for the EAIF to provide full-time funding for community building (see this comment)
I’d try to see if you could do more publicity for EA Virtual Programs, such as at Hungarian universities
I see two new relevant roles on the 80,000 Hours job board right now:
OpenAI—Product Manager, Applied Safety
Note that I’m not sure this is what you have in mind for AI safety; this role seems to be focused on developing and enforcing usage guidelines of products like DALL-E 2, Copilot, and GPT-3.
Here’s an excerpt from Anthropic’s job posting. It’s looking for basic familiarity with deep learning and mechanistic interpretability, but mostly nontechnical skills.
In this role you would:
Partner closely with the interpretability research lead on all things team related, from project planning to vision-setting to people development and coaching.
Translate a complex set of novel research ideas into tangible goals and work with the team to accomplish them.
Ensure that the team’s prioritization and workstreams are aligned with its goals.
Manage day-to-day execution of the team’s work including investigating models, running experiments, developing underlying software infrastructure, and writing up and publishing research results in a variety of formats.
Unblock your reports when they are stuck, and help get them whatever resources they need to be successful.
Work with the team to uplevel their project management skills, and act as a project management leader and counselor.
Support your direct reports as a people manager—conducting productive 1:1s, skillfully offering feedback, running performance management, facilitating tough but needed conversations, and modeling excellent interpersonal skills.
Coach and develop your reports to decide how they would like to advance in their careers and help them do so.
Run the interpretability team’s recruiting efforts, in concert with the research lead.
You might be a good fit if you:
Are an experienced manager and enjoy practicing management as a discipline.
Are a superb listener and an excellent communicator.
Are an extremely strong project manager and enjoy balancing a number of competing priorities.
Take complete ownership over your team’s overall output and performance.
Naturally build strong relationships and partner equally well with stakeholders in a variety of different “directions”—reports, a co-lead, peer managers, and your own manager.
Enjoy recruiting for and managing a team through a period of growth.
Effectively balance the needs of a team with the needs of a growing organization.
Are interested in interpretability and excited to deepen your skills and understand more about this field.
Have a passion for and/or experience working with advanced AI systems, and feel strongly about ensuring these systems are developed safely.
Other requirements:
A minimum of 3-5 years of prior management or equivalent experience
Some technical or science-based knowledge or expertise
Basic familiarity in deep learning, AI, and circuits-style interpretability, or a desire to learn
Previous direct experience in machine learning is a plus, but not required
You might want to share this project idea in the Effective Environmentalism Slack, if you haven’t already done so.
I’m curious whether the reason why EA may be perceived as a cult while, e.g., environmentalist and social justice activism are not, is primarily that the concerns of EA are much less mainstream.
I appreciate the suggestions on how to make EA less cultish, and I think they are valuable to implement, but I don’t think they would have a significant effect on public perception of whether EA is a cult.