Yes, but my point is that whether the AI Safety community has moved the dial on interpretability or government interest is unclear and worth being skeptical of
I suspect that I'm still misunderstanding you, but: e.g. interpretability tools are empirically able to identify misalignment, which feels like a (somewhat simple example of) the thing we want. Neel Nanda's 80k podcast goes over the state of the field; tl;dr is roughly that there are pretty meaningful advances but also he's skeptical that it will be a silver bullet.
I agree with Ben Stewart that there's a galaxy-brain argument that these positive impacts are outweighed by accelerating progress, but it seems hard to argue that things like interpretability aren't making progress on their own terms.
I think Henry's skeptical that the AI safety community made a counterfactual difference in getting interpretability started earlier or growing faster, not questioning interpretability's prospects for reducing x-risk.
Thanks Ben. I actually suggested both in my original comment:
(a) that there is market incentive for the companies to do this themselves, so did the AI Safety movement really move the dial on this?
and also
(b) that I'm skeptical of the value of interpretability research (based only on not having seen anything impressive come from it, but I'm very ignorant of the field)
I see, thanks! I'm not sure exactly what you'd consider as evidence here, but e.g. here are citation counts on papers from the past year vs. AI Lab Watch safety rating[1]
Raw data. Note that Anthropic doesn't use arXiv, which affects their citation counts. This is just coming from a dumb search of Semantic Scholar; I expect a lot of disagreement could be had over the exact criteria for considering something "interpretability", but I expect the Ant/GDM > OAI >> * ordering to hold for almost any definition.
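For concreteness, the kind of "dumb search" described above could look roughly like the sketch below, using the public Semantic Scholar Graph API. The query string, year filter, and the idea of bucketing papers by lab are illustrative assumptions on my part, not the exact criteria behind the chart.

```python
# Rough sketch of a citation-count query against the Semantic Scholar Graph API.
# Query, year, and field choices are assumptions for illustration only.
import requests

resp = requests.get(
    "https://api.semanticscholar.org/graph/v1/paper/search",
    params={
        "query": "mechanistic interpretability",
        "year": "2024",
        "fields": "title,citationCount,authors",
        "limit": 100,
    },
    timeout=30,
)
resp.raise_for_status()
papers = resp.json().get("data", [])

# Sum citation counts per paper; in practice you would bucket papers by lab
# (e.g. Anthropic / GDM / OpenAI) before comparing, which is where most of
# the definitional disagreement would come in.
total_citations = sum(p.get("citationCount", 0) for p in papers)
print(len(papers), "papers,", total_citations, "total citations")
```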
Not familiar with this paper, so I sunk to using Claude to give myself a summary. The misalignment "blind audit game" seems a bit sus.
"Three auditing teams played this game under conditions mimicking those of AI developers conducting real pre-deployment alignment audits, including access to model weights and training data"
They had access to the training documents? That doesn't seem like detecting misalignment, that's just a search through files to find one with malicious instructions
"A fourth team was given affordances that lab-external auditors would have - namely, only API access to sample from the model - and failed to discover its hidden objective despite it being possible in principle to do so."
🤔 when they don't have the training documents they can't detect the misalignment?
If they're claiming that finding a file with malicious instructions is detecting misalignment then this would update me further against AI Safety research.
Edit: Actually it looks like the teams were using some clever auto-encoder method rather than just a search through documents. In any case this seems all pretty artificial. This method might detect misalignment if it is due to a few malicious documents mixed in with the training data and where the malicious behaviour is easy to detect. This feels like killing cancer cells in a petri dish: easy and doesn't tell you much.
Table 1 shows the techniques used; the teams which were allowed to use SAEs (an interpretability technique) used them; the one which was prohibited from using them searched the data.
Also note that "training data" does not mean "instructions". Section 3 describes their training process.
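For readers who haven't met SAEs before, here is a minimal toy sketch of the general idea, assuming random stand-in activations and made-up sizes rather than a real model or the paper's actual auditing setup: train a sparse autoencoder on hidden activations, then look at which learned features fire on a given example.

```python
# Toy sketch of the general SAE idea (not the paper's method): learn a sparse,
# overcomplete feature basis for a model's hidden activations, then inspect
# which features fire on a given input. All data and sizes here are made up.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))   # non-negative, hopefully sparse feature activations
        return self.decoder(feats), feats

d_model, d_features = 64, 512                 # hypothetical sizes
sae = SparseAutoencoder(d_model, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

acts = torch.randn(10_000, d_model)           # stand-in for real residual-stream activations
for _ in range(200):
    recon, feats = sae(acts)
    # reconstruction loss plus an L1 penalty pushes the features toward sparsity
    loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# "Auditing" step: see which learned features fire most strongly on one example;
# in real work you would then study what inputs those features respond to.
with torch.no_grad():
    _, feats = sae(acts[:1])
print(feats.topk(5).indices)
```

The real pipeline obviously uses activations from an actual language model and much larger feature dictionaries, but the train-then-inspect loop is the same shape.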
I think there's a good case for AI safety having a pretty good counterfactual effect on a bunch of productive areas, but obviously that depends on a lot of details and there's plenty of room for debate.

I think a stronger line of critique could be that early-mid AI safety efforts/thinking made the frontier race start earlier, go faster, and be more intense (e.g. roles in getting key frontier leaders obsessed, introducing DeepMind cofounders, boosting OpenAI's founding, etc.). I haven't interrogated that history to know where to come down, but it's a plausible way that the whole of AI safety has been net-negative. (This claim doesn't really detract from future impact of AI safety though, if the cat's out of the bag.)