In general, if someone is doing AI safety technical or governance work at an AI lab that is also doing capabilities research, it is fair game to tell them that you think their approach will be ineffective or that they should consider switching to a role at another organization to avoid causing accidental harm. It is not acceptable to tell them that their choice of where to work means they are “AI capabilities people” who aren’t serious about AI safety. Given that they are working on AI safety, it is likely that they have already weighed the obvious objections to their career choices.
I think this perspective makes more sense than my original understanding of the OP, but I still think it is misguided. Sadly, it is not very difficult for an organization to simply label a job “AI Safety” and then have the person work on things whose primary aim is to make the organization more money, in this case things like AI bias work or setting up RLHF pipelines, which might help a bit with some safety, but whose primary result is still billions of additional dollars flowing into AI labs primarily doing scaling-related work.
I sadly do not think that someone working on “AI Safety” has necessarily weighed and properly considered the obvious objections to their career choices. Indeed, safety-washing seems easy and common, and if labs can hire top EAs just by slapping a safety label on a capabilities position, then we will likely make the world worse.
I do react differently to someone working in a safety position, but I have a separate, additional negative judgement if I find out that someone is actually working on capabilities while calling their work safety. I think that kind of deception is increasingly common, and it makes coordinating and working in this space harder.
I have a very uninformed view on the relative alignment and capabilities contributions of things like RLHF. My intuition is that RLHF is positive for alignment, but I’m almost entirely uninformed on that. If anyone has written a summary of where they think these grey-area research areas lie, I’d be interested to read it. Scott’s recent post was not a bad entry into the genre, but obviously only worked at a very high level.