Yes, but my point is that whether the AI Safety community has moved the dial on interpretability or government interest is unclear and worth being skeptical of
I suspect that I'm still misunderstanding you, but: e.g. interpretability tools are empirically able to identify misalignment, which feels like a (somewhat simple example of) the thing we want. Neel Nanda's 80k podcast goes over the state of the field; tl;dr is roughly that there are pretty meaningful advances but also he's skeptical that it will be a silver bullet.
I agree with Ben Stewart that there's a galaxy-brain argument that these positive impacts are outweighed by accelerating progress, but it seems hard to argue that things like interpretability aren't making progress on their own terms.
I think Henry's skeptical that the AI safety community made a counterfactual difference in getting interpretability started earlier or growing faster, not questioning interpretability's prospects for reducing x-risk.
Thanks Ben. I actually suggested both in my original comment:
(a) that there is market incentive for the companies to do this themselves, so did the AI Safety movement really move the dial on this?
and also
(b) that I'm skeptical of the value of interpretability research (based only on not having seen anything impressive come from it, but I'm very ignorant of the field)
I see, thanks! I'm not sure exactly what you'd consider as evidence here, but e.g. here are citation counts on papers from the past year vs. AI Lab Watch safety rating[1]
Raw data. Note that Anthropic doesn't use arXiv, which affects their citation counts. This is just coming from a dumb search of Semantic Scholar; I expect a lot of disagreement could be had over the exact criteria for considering something "interpretability", but I expect the Ant/GDM > OAI >> * ordering to hold for almost any definition.
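For concreteness, the kind of "dumb search" described above could look roughly like the sketch below, using the public Semantic Scholar Graph API. The query string, year filter, and the idea of bucketing papers by lab are illustrative assumptions on my part, not the exact criteria behind the chart.

```python
# Rough sketch of a citation-count query against the Semantic Scholar Graph API.
# Query, year, and field choices are assumptions for illustration only.
import requests

resp = requests.get(
    "https://api.semanticscholar.org/graph/v1/paper/search",
    params={
        "query": "mechanistic interpretability",
        "year": "2024",
        "fields": "title,citationCount,authors",
        "limit": 100,
    },
    timeout=30,
)
resp.raise_for_status()
papers = resp.json().get("data", [])

# Sum citation counts per paper; in practice you would bucket papers by lab
# (e.g. Anthropic / GDM / OpenAI) before comparing, which is where most of
# the definitional disagreement would come in.
total_citations = sum(p.get("citationCount", 0) for p in papers)
print(len(papers), "papers,", total_citations, "total citations")
```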
Not familiar with this paper, so I sunk to using Claude to give myself a summary. The misalignment "blind audit game" seems a bit sus.
"Three auditing teams played this game under conditions mimicking those of AI developers conducting real pre-deployment alignment audits, including access to model weights and training data"
They had access to the training documents? That doesn't seem like detecting misalignment, that's just a search through files to find one with malicious instructions
"A fourth team was given affordances that lab-external auditors would have - namely, only API access to sample from the model - and failed to discover its hidden objective despite it being possible in principle to do so."
🤔 when they don't have the training documents they can't detect the misalignment?
If they're claiming that finding a file with malicious instructions is detecting misalignment then this would update me further against AI Safety research.
Edit: Actually it looks like the teams were using some clever auto-encoder method rather than just a search through documents. In any case this seems all pretty artificial. This method might detect misalignment if it is due to a few malicious documents mixed in with the training data and where the malicious behaviour is easy to detect. This feels like killing cancer cells in a petri dish: easy and doesn't tell you much.
Table 1 shows the techniques used; the teams which were allowed to use SAEs (an interpretability technique) used them; the one which was prohibited from using them searched the data.
Also note that "training data" does not mean "instructions". Section 3 describes their training process.
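For readers who haven't met SAEs before, here is a minimal toy sketch of the general idea, assuming random stand-in activations and made-up sizes rather than a real model or the paper's actual auditing setup: train a sparse autoencoder on hidden activations, then look at which learned features fire on a given example.

```python
# Toy sketch of the general SAE idea (not the paper's method): learn a sparse,
# overcomplete feature basis for a model's hidden activations, then inspect
# which features fire on a given input. All data and sizes here are made up.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))   # non-negative, hopefully sparse feature activations
        return self.decoder(feats), feats

d_model, d_features = 64, 512                 # hypothetical sizes
sae = SparseAutoencoder(d_model, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

acts = torch.randn(10_000, d_model)           # stand-in for real residual-stream activations
for _ in range(200):
    recon, feats = sae(acts)
    # reconstruction loss plus an L1 penalty pushes the features toward sparsity
    loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# "Auditing" step: see which learned features fire most strongly on one example;
# in real work you would then study what inputs those features respond to.
with torch.no_grad():
    _, feats = sae(acts[:1])
print(feats.topk(5).indices)
```

The real pipeline obviously uses activations from an actual language model and much larger feature dictionaries, but the train-then-inspect loop is the same shape.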
I think there's a good case for AI safety having a pretty good counterfactual effect on a bunch of productive areas, but obviously that depends on a lot of details and there's plenty of room for debate.

I think a stronger line of critique could be that early-mid AI safety efforts/thinking made the frontier race start earlier, go faster, and be more intense (e.g. roles in getting key frontier leaders obsessed, introducing DeepMind cofounders, boosting OpenAI's founding, etc.). I haven't interrogated that history to know where to come down, but it's a plausible way that the whole of AI safety has been net-negative. (This claim doesn't really detract from future impact of AI safety though, if the cat's out of the bag.)