Ben_West🔸 comments on The AI people have been right a lot

Ben_West🔸 19 Apr 2026 5:57 UTC
12 points
3 ∶ 0
Maybe, but “if EA had just stuck to Earning To Give and malaria nets and decaging chickens then the impact would have been greater” doesn’t clearly follow. Malaria nets look a lot worse if we all die in a few years from AI anyway, and cage-free pledges have ~0 value if humanity ends before the pledge can be fulfilled.
- Henry Howard🔸 19 Apr 2026 8:46 UTC
  6 points
  0 ∶ 0
  Parent
  That’s a fair point. At either extreme of outcomes (“ASI kills us all” or “ASI quickly uplifts everyone out of poverty”), almost all decisions/actions we make today are pretty meaningless.
  But if the next few decades fall somewhere between those two extremes, which I think they probably will, the impact of improving people’s lives remains substantial.
  - Ben_West🔸 19 Apr 2026 17:19 UTC
    5 points
    1 ∶ 0
    Parent
    Hmm, but in a success without dignity world making interpretability a bit better, or governments a bit more interested, is relevant, right?
    - Henry Howard🔸 22 Apr 2026 7:58 UTC
      6 points
      1 ∶ 0
      Parent
      Yes, but my point is that whether the AI Safety community has moved the dial on interpretability or government interest is unclear and worth being skeptical of.
      - Ben_West🔸 24 Apr 2026 21:25 UTC
        6 points
        0 ∶ 0
        Parent
        I suspect that I’m still misunderstanding you, but: e.g. interpretability tools are empirically able to identify misalignment, which feels like a (somewhat simple) example of the thing we want. Neel Nanda’s 80k podcast goes over the state of the field; the tl;dr is roughly that there are pretty meaningful advances, but he’s also skeptical that it will be a silver bullet.
        I agree with Ben Stewart that there’s a galaxy-brain argument that these positive impacts are outweighed by accelerating progress, but it seems hard to argue that things like interpretability aren’t making progress on their own terms.
        Ben Stewart 24 Apr 2026 23:10 UTC
        4 points
        0 ∶ 0
        Parent
        I think Henry’s skeptical that the AI safety community made a counterfactual difference in getting interpretability started earlier or growing faster. Not questioning interpretability’s prospects for reducing x-risk.
        Henry Howard🔸 25 Apr 2026 1:25 UTC
        4 points
        0 ∶ 0
        Parent
        Thanks Ben. I actually suggested both in my original comment:
        (a) that there is a market incentive for the companies to do this themselves, so did the AI Safety movement really move the dial on this?
        and also
        (b) that I’m skeptical of the value of interpretability research (based only on not having seen anything impressive come from it, but I’m very ignorant of the field)
        Ben_West🔸 25 Apr 2026 1:49 UTC
        2 points
        0 ∶ 0
        Parent
        I see, thanks! I’m not sure exactly what you’d consider as evidence here, but e.g. here’s citation count on papers from the past year vs. AI Lab Watch safety rating^[1]
        ^[1] Raw data. Note that Anthropic doesn’t use arXiv, which affects their citation counts. This is just coming from a dumb search of Semantic Scholar; I expect a lot of disagreement could be had over the exact criteria for considering something “interpretability”, but I expect the Ant/GDM > OAI >> * ordering to be true for almost any definition.
        Henry Howard🔸 25 Apr 2026 1:21 UTC
        2 points
        0 ∶ 0
        Parent
        Not familiar with this paper so sunk to using Claude to give myself a summary. The misalignment “blind audit game” seems a bit sus
        ~~“Three auditing teams played this game under conditions mimicking those of AI developers conducting real pre-deployment alignment audits, including access to model weights and training data”~~
        ~~They had access to the training documents? That doesn’t seem like detecting alignment, that’s just a search through files to find one with malicious instructions~~
        “A fourth team was given affordances that a lab-external auditors would have—namely, only API access to sample from the model—and failed to discover its hidden objective despite it being possible in principle to do so.”
        ~~🤔 when they don’t have the training documents they can’t detect the misalignment?~~
        ~~If they’re claiming that finding a file with malicious instructions is detecting misalignment then this would update me further against AI Safety research.~~
        Edit: Actually it looks like the teams were using some clever auto-encoder method rather than just a search through documents. In any case this all seems pretty artificial. This method might detect misalignment if it is due to a few malicious documents mixed in with the training data and where the malicious behaviour is easy to detect. This feels like killing cancer cells in a petri dish: easy, and it doesn’t tell you much.
        Ben_West🔸 25 Apr 2026 2:00 UTC
        2 points
        0 ∶ 0
        Parent
        Table 1 shows the techniques used; the teams which were allowed to use SAEs (an interpretability technique) used them; the one which was prohibited from using them searched the data.
        Also note that “training data” does not mean “instructions”. Section 3 describes their training process.
      - Ben Stewart 24 Apr 2026 9:01 UTC
        4 points
        1 ∶ 0
        Parent
        I think there’s a good case for AI safety having a pretty good counterfactual effect on a bunch of productive areas, but obviously that depends on a lot of details and there’s plenty of room for debate.
        I think a stronger line of critique could be that early-mid AI safety efforts/thinking made the frontier race start earlier, go faster, and be more intense (e.g. roles in getting key frontier leaders obsessed, introducing DeepMind cofounders, boosting OpenAI’s founding, etc). I haven’t interrogated that history to know where to come down, but it’s a plausible way that the whole of AI safety has been net-negative. (This claim doesn’t really detract from the future impact of AI safety though, if the cat’s out of the bag.)
- Ian Turner 20 Apr 2026 19:24 UTC
  2 points
  1 ∶ 4
  Parent
  Malaria nets only last 3 years anyway, so their direct impact does not require the world to last longer than that (although perhaps you value saving a life less if you think the world will soon end).
  - Mo Putera 21 Apr 2026 3:29 UTC
    16 points
    7 ∶ 0
    Parent
    The way the benefits calculation cashes out on an individual beneficiary basis essentially requires that they (mostly under-5s) live out full lives and enjoy 40 years of increased income; it isn’t a function of how long the nets last.
- Alex N. 26 Apr 2026 9:36 UTC
  1 point
  0 ∶ 0
  Parent
  The existence of existential threats does not in itself create a strong argument to redirect the effort. Otherwise EA should have been focusing on nuclear disarmament, climate change, asteroid defence, pandemic prevention etc. from the get-go.