Not familiar with this paper, so I sank to using Claude to give myself a summary. The misalignment "blind audit game" seems a bit sus.
"Three auditing teams played this game under conditions mimicking those of AI developers conducting real pre-deployment alignment audits, including access to model weights and training data"
They had access to the training documents? That doesn't seem like detecting misalignment; that's just a search through files to find one with malicious instructions.
"A fourth team was given affordances that a lab-external auditor would have (namely, only API access to sample from the model) and failed to discover its hidden objective despite it being possible in principle to do so."
🤔 when they don't have the training documents they can't detect the misalignment?
If they're claiming that finding a file with malicious instructions is detecting misalignment, then this would update me further against AI Safety research.
Edit: Actually it looks like the teams were using some clever autoencoder method rather than just a search through documents. In any case, this all seems pretty artificial. This method might detect misalignment when it is due to a few malicious documents mixed in with the training data and the malicious behaviour is easy to detect. This feels like killing cancer cells in a petri dish: easy, and it doesn't tell you much.
Table 1 shows the techniques used: the teams that were allowed to use SAEs (sparse autoencoders, an interpretability technique) used them; the one that was prohibited from using them searched the data.
Also note that "training data" does not mean "instructions". Section 3 describes their training process.
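For anyone unfamiliar with what an SAE actually does: roughly, you train a wide autoencoder on the model's internal activations with a sparsity penalty, so each input lights up only a handful of features, which can then be inspected individually. A minimal sketch of the idea (the dimensions, the `l1_coeff` value, and the training setup here are illustrative assumptions on my part, not taken from the paper):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: decomposes a model's internal activations into a
    sparse, overcomplete set of features that can be inspected one at a time."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstruction of the input
        return features, recon

def sae_loss(acts, features, recon, l1_coeff=1e-3):
    # Reconstruction error keeps the features faithful to the activations;
    # the L1 penalty keeps only a few features active per input, which is
    # what makes them individually interpretable.
    return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()

# Toy usage: in a real audit the activations would come from a hook
# inside the model, not random noise.
acts = torch.randn(64, 512)  # hypothetical batch of residual-stream activations
sae = SparseAutoencoder(d_model=512, d_features=4096)
features, recon = sae(acts)
loss = sae_loss(acts, features, recon)
loss.backward()
```

The auditing move is then to look for features that fire on suspicious content and trace them back to the training documents that activate them most strongly, which is a different operation from grepping the data directly, even if the data is what ultimately gets surfaced.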