Thanks for taking the time to write thoughtful criticism. I wanted to add a few quick notes (though note that I’m not really impartial, as I’m socially very close with Redwood):
- I personally found MLAB extremely valuable. It was very well-designed and well-taught and was the best teaching/learning experience I’ve had by a fairly wide margin
- Redwood’s community building (MLAB, REMIX and people who applied to or worked at Redwood) has been a great pipeline for ARC Evals and our biggest single source for hiring (we currently have 3 employees and 2 work triallers who came via Redwood community building efforts).
- It was also very useful for ARC Evals to be able to use Constellation office space while we were getting started, rather than needing to figure this out by ourselves.
- As a female person I feel very comfortable in Constellation. I’ve never felt that I needed to defer, or that I was viewed for my dating potential rather than my intellectual contributions. I do think I’m pretty happy to hold my ground, and I'm sometimes oblivious to things that bother other people, so that might not be very strong evidence that it isn’t an issue for other people. However, I have been bothered in the past by places that try to make up the gender balance by hiring a lot of women for non-technical roles. In these places, people assume that the women who are there are non-technical. I think it would make the environment worse for me personally if there were pressure for Constellation to balance the gender ratios.
- I think there have been various ways in which Redwood culture and management style were not great. I think some of this was due to difficult tradeoffs or normal challenges of being a new organization, and some of it was unforced errors. I think they are mostly aware of the issues and taking steps to fix them, although I don’t expect them to be excellent at management that soon. Some of my recommendations (which I’ve told them before and think they have mostly taken on board):
-- If Buck is continuing to manage people (and maybe also if not), he should get management coaching
-- Give employees lots of concrete positive feedback (at least once per week)
-- When letting people go, be very clear that hiring is noisy, people perform differently at different organizations; Redwood is a challenging and often low-management environment that, like a PhD program, is not a good fit for everyone; they shouldn’t be too discouraged. (I think Redwood believes this but hasn’t been as clear as they could be about communicating it)
-- Make sure expectations are clear for work trials
-- Make growth for their employees a serious priority, especially for their top performers; this should be something that is done deliberately, with time set aside for it
- ElizabethBarnes, 2 Apr 2023 19:43 UTC, 13 points, in reply to: Dan H’s comment on: Critiques of prominent AI safety labs: Redwood Research
In my understanding, there was another important difference between Redwood’s project and the standard adversarial robustness literature: they were looking to eliminate only ‘competent’ failures (i.e., cases where the model probably ‘knows’ what the correct classification is), and would have counted it as a success even if some failures remained, as long as those failures were due to a lack of competence on the model’s part (e.g. ‘his mitochondria were liberated’ implies harm, but only if you know enough biology).
In practice, within their exact project, I don't think this ended up being a very clear conceptual line, but at the start it seemed plausible to me that focusing only on competent failures made the task feasible even if the general case is impossible.
This is a really great write-up, thanks for doing this so conscientiously and thoroughly. It’s good to hear that Surge is mostly meeting researchers’ needs.
Re whether higher-quality human data is just patching current alignment problems: the way I think about it is more that there’s a minimum level of quality you need in order to set up various enhanced human feedback schemes. You need people to actually read and follow the instructions, and if they don’t do this reliably you won’t be able to set up something like amplification or other schemes that need your humans to interact with models in non-trivial ways. It seems good to get human data quality to the point where it’s easy for alignment researchers to implement different schemes that involve complex interactions (like the humans using an adversarial example finder tool or looking at the output of an interpretability tool). This is different from the case where we have an alignment problem because, e.g., MTurkers mark common misconceptions as truthful whereas more educated workers correctly mark them as false; I don’t think of that as a scalable sort of improvement.
The evaluations project at the Alignment Research Center is looking to hire a generalist technical researcher and a webdev-focused engineer. We’re a new team at ARC building capability evaluations (and in the future, alignment evaluations) for advanced ML models. The goals of the project are to improve our understanding of what alignment danger is going to look like, understand how far away we are from dangerous AI, and create metrics that labs can make commitments around (e.g. ‘If you hit capability threshold X, don’t train a larger model until you’ve hit alignment threshold Y’). We’re also still hiring for model interaction contractors, and we may be taking SERI MATS fellows.
I think DM clearly restricts REs more than OpenAI does (and I assume Anthropic). I know of REs at DM who have found it annoying or difficult to lead projects because of being REs; I know of someone without a PhD who left Brain (not DeepMind, but still Google, so probably more similar) partly because it was restrictive, and who now leads a team at OAI/Anthropic; and I know of people without an undergrad degree who have been hired by OAI/Anthropic. At OpenAI I’m not aware of it being more difficult for people to lead projects etc. because of being ‘officially an RE’. I had bad experiences at DM that were ostensibly related to not having a PhD (but could also have been explained by lack of research ability).
High-quality human data
Most proposals for aligning advanced AI require collecting high-quality human data on complex tasks such as evaluating whether a critique of an argument was good, breaking a difficult question into easier subquestions, or examining the outputs of interpretability tools. Collecting high-quality human data is also necessary for many current alignment research projects.
We’d like to see a human data startup that prioritizes data quality over financial cost. It would follow complex instructions, ensure high data quality and reliability, and operate with a fast feedback loop that’s optimized for researchers’ workflows. Having access to this service would make it quicker and easier for safety teams to iterate on different alignment approaches.
Some alignment research teams currently manage their own contractors because existing services (such as surgehq.ai and scale.ai) don’t fully address their needs; a competent human data startup could free up considerable amounts of time for top researchers.
Such an organization could also practice and build capacity for things that might be needed at ‘crunch time’ – i.e., rapidly producing moderately large amounts of human data, or checking a large volume of output from interpretability tools or adversarial probes with very high reliability.
The market for high-quality data will likely grow – as AI labs train increasingly large models at a high compute cost, they will become more willing to pay for data. As models become more competent, data needs to be more sophisticated or higher-quality to actually improve model performance.
Making it less annoying for researchers to gather high-quality human data relative to using more compute would incentivize the entire field towards doing work that’s more helpful for alignment, e.g., improving products by making them more aligned rather than by using more compute.
[Thanks to Jonas V for writing a bunch of this comment for me]
[Views are my own and do not represent those of my employer]
- ElizabethBarnes, 3 Mar 2020 18:15 UTC, 6 points, in reply to: Sean_o_h’s comment on: COVID-19 brief for friends and family
Although I believe all the deaths were at a nursing home, where you’d expect a much higher death rate
- ElizabethBarnes, 3 Mar 2020 18:03 UTC, 6 points, in reply to: eca’s comment on: COVID-19 brief for friends and family
A big source of uncertainty is how long the fatigue persists: it wasn’t entirely clear from the SARS paper whether that figure was the fraction of people who still had fatigue at 4 years, or the fraction who’d had it at some point. The numbers are very different if it’s a few months of fatigue vs. the rest of your life. I’m also not sure I’ve split up persistent CF vs. temporary post-viral fatigue properly.
A friend pointed me to a study showing a high rate of chronic fatigue in SARS survivors (40%). I did a quick analysis of the risk of chronic fatigue from getting COVID-19 (my best guess for young healthy people is ~2 weeks lost in expectation, but it could be less than a day or more like 100 days under what seem like reasonable assumptions): https://docs.google.com/spreadsheets/d/1z2HTn72fM6saFH42VKs6lEdvooLJ6qaXwCrQ5YZ33Fk/edit?usp=sharing
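For concreteness, here is a minimal sketch of the kind of expected-value calculation involved; every number below is an illustrative placeholder chosen to land near the ~2-week headline figure, not an estimate from the linked spreadsheet:

```python
# Rough expected-value sketch for time lost to post-viral chronic fatigue.
# All inputs are illustrative placeholders, not the spreadsheet's estimates.

p_infection = 0.5             # chance a young healthy person catches COVID-19
p_fatigue_given_case = 0.2    # chance of meaningful post-viral fatigue if infected
fatigue_duration_days = 280   # average duration of fatigue, conditional on getting it
disability_weight = 0.5       # fraction of each fatigued day's value that is lost

expected_days_lost = (p_infection
                      * p_fatigue_given_case
                      * fatigue_duration_days
                      * disability_weight)
print(f"Expected days lost: {expected_days_lost:.0f}")  # 14 days with these inputs
```

Plausible changes to any of these inputs move the answer from well under a day to on the order of 100 days, which is the spread mentioned above.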
Thanks for doing this! Some nitpicking on this graph: https://i.ibb.co/wLd1vSg/donations-income-scatter.png (donations and income)
1) The trendline looks a bit weird. Did you force it to go through (0,0)?
2) Your axis labels initially go up by factors of 100, but the last one goes up by only a factor of 10.
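To illustrate the kind of fix I mean (a hypothetical sketch with made-up data, not the original plotting code): fit the trendline with a free intercept rather than forcing it through the origin, and let the log-scale ticks step by a consistent factor of 10.

```python
# Hypothetical sketch: unconstrained power-law fit plus consistent log-scale ticks.
import numpy as np
import matplotlib.pyplot as plt

income = np.array([20_000, 45_000, 80_000, 150_000, 400_000], dtype=float)  # placeholder data
donations = np.array([200, 900, 1_500, 6_000, 30_000], dtype=float)         # placeholder data

# Fit in log space with a free intercept (i.e., not forced through the origin).
slope, intercept = np.polyfit(np.log10(income), np.log10(donations), deg=1)

fig, ax = plt.subplots()
ax.scatter(income, donations)
xs = np.logspace(np.log10(income.min()), np.log10(income.max()), 100)
ax.plot(xs, 10 ** (intercept + slope * np.log10(xs)))

# Log scales with ticks at every power of 10, so labels step by a consistent factor.
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("Income ($)")
ax.set_ylabel("Donations ($)")
plt.show()
```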
Thanks for the post! I am generally pretty worried that I and many people I know are all deluding ourselves about AI safety: it has a lot of red flags from the outside (although these are lessening as more experts come on board, more progress is made in AI capabilities, and more concrete work is done on safety). I think it’s more likely than not that we’ve got things completely wrong, but that it’s still worth working on. If that’s not the case, I’d like to know!
I like your points about language. I think there’s a closely related problem where it’s very hard to talk or think about anything that’s between human level at some task and omnipotent. Once you try to imagine something that can do things humans can’t, there’s no way to argue that the system wouldn’t be able to do some particular thing: there is always the retort that just because you, a human, think it’s impossible doesn’t mean a more intelligent system couldn’t achieve it.
On the other hand, I think there are some good examples of couching safety concerns in non-anthropomorphic language. I like Dr Krakovna’s list of specification gaming examples: https://vkrakovna.wordpress.com/2018/04/02/specification-gaming-examples-in-ai/
I also think Iterated Distillation and Amplification is a good example of a discussion of AI safety and potential mitigation strategies that’s couched in ideas of training distributions and gradient descent rather than desires and omnipotence.
Re the sense of meaning point, I don’t think that’s been my personal experience: I switched into CS from biology partly because of concern about x-risk, and I know various other people who switched fields from physics, music, maths and medicine. As far as I can tell, the arguments for AI safety still mostly hold up now that I know more about the relevant fields, and I don’t think I’ve noticed egregious errors in major papers. I’ve definitely noticed some people who advocate for the importance of AI safety making mistakes and being confused about CS/ML fundamentals, but I don’t think I’ve seen this from serious AI safety researchers.
Re anchoring, this seems like a very strong claim. I think a sensible baseline to take here would be expert surveys, which usually put several percent probability on HLMI being catastrophically bad. (e.g. https://aiimpacts.org/2016-expert-survey-on-progress-in-ai/#Chance_that_the_intelligence_explosion_argument_is_about_right)
I’d be curious whether you have an explanation for why your numbers are so far from the expert estimates. I don’t think these expert surveys are a reliable source of truth, just a good ballpark for the orders of magnitude we should be considering.
You say:
“I think a given amount of dolorium/dystopia (say, the amount that can be created with 100 joules of energy) is far larger in absolute moral expected value than hedonium/utopia made with the same resources”
Could you elaborate on why this is the case? My inclination would be to start from a prior that they’re equal, then update on the observation that they seem to be asymmetrical, try to work out why that is, and ask whether those factors will apply in the future. They could be fundamentally asymmetrical, or evolutionary pressures may tend to create minds with these asymmetries. The arguments I’ve heard for why are:
1) The worst thing that can happen to an animal, in terms of genetic success, is much worse than the best thing.
This isn’t entirely clear to me: I can imagine that a large genetic win, such as securing a large harem, could be comparable to the genetic loss of dying, and many animals will in fact risk death for this. This seems particularly true considering that dying leaving no offspring doesn’t make your contribution to the gene pool zero; it just means your contribution comes only via your relatives.
2) There is selection against strong positive experiences in a way that there isn’t against strong negative experiences.
The argument here is, I think, that strong positive experiences would likely result in the animal sticking in the blissful state and neglecting to feed, sleep, etc., whereas strong negative experiences just result in the animal avoiding a particular state, which is less maladaptive. This argument seems stronger to me but still not entirely satisfying: it seems quite sensitive to how you define states.
Thanks very much for writing this, and thanks to Greg for funding it! I think this is a really important discussion. Some slightly rambling thoughts below.
We can think about 3 ways of improving the EV of the far future:
1: Changing incentive structures experienced by powerful agents in the future (e.g. avoiding arms races, power struggles, selection pressures)
2: a) Changing the moral compass of powerful agents in the future in specific directions (e.g. MCE).
b) Indirect ways to improve the moral compass of powerful agents in the future (e.g. philosophy research, education, intelligence/empathy enhancement)
All of these are influenced by strategies such as activism, improving institutions, and improving education, as well as by AIA. I am inclined to think of AIA as a particularly high-leverage point at which we can influence them.
However, these issues are widely encountered. Consider 2b: we have to decide how to educate the next generation of humans, and they may well end up with ethical beliefs that are different from ours, so we must judge how much to try to influence or constrain them, and how much to accept that the changes are actually progress. This is similar to the problem of defining CEV: we have some vague idea of the direction in which better values lie (more empathy, more wisdom, more knowledge), but we can’t say exactly what the values should be. For this intervention, working on AIA may be more important than activism because it has more leverage: it is likely to be more tractable and to have greater influence on the future than the more diffuse ways in which we can push on education and intergenerational moral progress.
This framework also suggests that MCE is just one example of a collection of similar interventions. MCE involves pushing for a fairly specific belief and behaviour change based on a principle that’s fairly uncontroversial. You could imagine other interventions in the same vein, for instance helping people reduce unwanted aggressive or sadistic behaviour. We could call this something like ‘uncontroversial moral progress’: helping individuals and civilisation to live by their values more. (On a side note: sometimes I think of this as the minimal core of EA: trying to live according to your best guess of what’s right.)
The choice between working on 2a and 2b depends, among other things, on your level of moral uncertainty.
I am inclined to think that AIA is the best way to work on 1 and 2b, as it is a particularly high-leverage intervention point to shape the power structures and moral beliefs that exist in the future. It gives us more of a clean slate to design a good system, rather than having to work within a faulty system.
I would really like to see more work on MCE and other examples of ‘uncontroversial moral progress’. Historical case studies of value changes seem like a good starting point, as well as actually testing the tractability of changing people’s behaviour.
I also really appreciated your perspective on different transformative AI scenarios, as I’m worried I’m thinking about it in an overly narrow way.
See also the models in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5576214/ (cost-effectiveness of mitigating biorisk) and https://onlinelibrary.wiley.com/doi/full/10.1111/j.1539-6924.2007.00960.x (asteroid risk), which have estimates of the risk level, the cost of reducing it, and the cost per QALY for different future discount rates.
“If we ignore distant future generations by discounting, the benefits of reducing existential risk fall by between 3 and 5 orders of magnitude (with a 1% to 5% discount rate), which is still far more cost-effective than measures to reduce small-scale casualty events. Under our survey model (Model 1), the cost per life-year varies between $1,300 and $52,000 for a 5% discount rate and between $770 and $30,000 for a 1% discount rate. These costs are even competitive with first-world healthcare spending, where typically anything less than $100,000 per quality adjusted life-year is considered a reasonable purchase.
This suggests that even if we are concerned about welfare only in the near term, reducing existential risks from biotechnology is still a cost-effective means of saving expected life if the future chance of an existential risk is anything above 0.0001 per year.”
I think their model ought to include a category of catastrophic risk—they don’t have anything between disaster (100,000 deaths) and extinction.
“Even if we expected humanity to become extinct within a generation, traditional statistical life valuations would warrant a $32 billion annual investment in asteroid defense (Gerrard & Barber, 1997). Yet the United States spends only $4 million per year on asteroid detection and there is no direct spending on mitigation.”
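As a rough illustration of the discounting arithmetic quoted above (a minimal sketch with made-up inputs, not the papers’ actual models):

```python
# How a constant annual discount rate shrinks the value of averting extinction.
# The horizon and the per-year stakes are illustrative placeholders, not the
# cited papers' figures; only the qualitative effect is the point.

def discounted_total(per_year: float, horizon_years: int, rate: float) -> float:
    """Sum of the annual stakes over the horizon, discounted at a constant rate."""
    return sum(per_year * (1 + rate) ** (-t) for t in range(horizon_years))

stakes_per_year = 8e9  # e.g. lives (or life-years) at stake each year, placeholder

undiscounted = discounted_total(stakes_per_year, horizon_years=1_000_000, rate=0.0)
at_1_percent = discounted_total(stakes_per_year, horizon_years=1_000_000, rate=0.01)
at_5_percent = discounted_total(stakes_per_year, horizon_years=1_000_000, rate=0.05)

print(f"reduction at 1%: {undiscounted / at_1_percent:.0f}x")  # roughly 10^4
print(f"reduction at 5%: {undiscounted / at_5_percent:.0f}x")  # roughly 5 * 10^4
```

With a long horizon, a 1-5% discount rate wipes out several orders of magnitude of value, the same ballpark as the 3-5 orders of magnitude in the quoted passage, and the near-term benefit that remains is what drives the quoted cost-per-life-year figures.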
CHCAI/MIRI research internship in AI safety
Good question, thanks! I've added it at the top of the post. Within the next few weeks is super helpful for us, but we expect we’ll offer the bounty until at least the end of February.