Not familiar with this paper, so I sank to using Claude to give myself a summary. The misalignment "blind audit game" seems a bit sus.
"Three auditing teams played this game under conditions mimicking those of AI developers conducting real pre-deployment alignment audits, including access to model weights and training data"
They had access to the training documents? That doesn't seem like detecting misalignment; that's just a search through files to find one with malicious instructions.
"A fourth team was given affordances that a lab-external auditor would have (namely, only API access to sample from the model) and failed to discover its hidden objective despite it being possible in principle to do so."
🤔 when they don't have the training documents they can't detect the misalignment?
If they're claiming that finding a file with malicious instructions is detecting misalignment, then this would update me further against AI Safety research.
Edit: Actually it looks like the teams were using some clever autoencoder method rather than just a search through documents. In any case, this all seems pretty artificial. This method might detect misalignment when it is due to a few malicious documents mixed in with the training data and the malicious behaviour is easy to detect. This feels like killing cancer cells in a petri dish: easy, and it doesn't tell you much.
Table 1 shows the techniques used: the teams that were allowed to use SAEs (sparse autoencoders, an interpretability technique) used them; the one that was prohibited from using them searched the data.
Also note that "training data" does not mean "instructions". Section 3 describes their training process.
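For anyone unfamiliar with what an SAE actually does: roughly, you train a wide autoencoder on the model's internal activations with a sparsity penalty, so each input lights up only a handful of features, which can then be inspected individually. A minimal sketch of the idea (the dimensions, the `l1_coeff` value, and the training setup here are illustrative assumptions on my part, not taken from the paper):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: decomposes a model's internal activations into a
    sparse, overcomplete set of features that can be inspected one at a time."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstruction of the input
        return features, recon

def sae_loss(acts, features, recon, l1_coeff=1e-3):
    # Reconstruction error keeps the features faithful to the activations;
    # the L1 penalty keeps only a few features active per input, which is
    # what makes them individually interpretable.
    return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()

# Toy usage: in a real audit the activations would come from a hook
# inside the model, not random noise.
acts = torch.randn(64, 512)  # hypothetical batch of residual-stream activations
sae = SparseAutoencoder(d_model=512, d_features=4096)
features, recon = sae(acts)
loss = sae_loss(acts, features, recon)
loss.backward()
```

The auditing move is then to look for features that fire on suspicious content and trace them back to the training documents that activate them most strongly, which is a different operation from grepping the data directly, even if the data is what ultimately gets surfaced.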