Maybe, but "if EA had just stuck to Earning To Give and malaria nets and decaging chickens then the impact would have been greater" doesn't clearly follow. Malaria nets look a lot worse if we all die in a few years from AI anyway, and cage-free pledges have ~0 value if humanity ends before the pledge can be fulfilled.
That's a fair point. At either extreme of outcomes, "ASI kills us all" or "ASI quickly uplifts everyone out of poverty", almost all decisions/actions we make today are pretty meaningless.
But if the next few decades fall somewhere between those two extremes, which I think they probably will, the impact of improving people's lives remains substantial.
Hmm, but in a "success without dignity" world, making interpretability a bit better, or governments a bit more interested, is relevant, right?
Yes, but my point is that whether the AI Safety community has moved the dial on interpretability or government interest is unclear and worth being skeptical of.
I suspect that I'm still misunderstanding you, but: e.g. interpretability tools are empirically able to identify misalignment, which feels like (a somewhat simple example of) the thing we want. Neel Nanda's 80k podcast goes over the state of the field; the tl;dr is roughly that there are pretty meaningful advances, but he's also skeptical that it will be a silver bullet.
I agree with Ben Stewart that there's a galaxy-brain argument that these positive impacts are outweighed by accelerating progress, but it seems hard to argue that things like interpretability aren't making progress on their own terms.
I think Henry's skeptical that the AI safety community made a counterfactual difference in getting interpretability started earlier or growing faster, not questioning interpretability's prospects for reducing x-risk.
Thanks Ben. I actually suggested both in my original comment:
(a) that there is a market incentive for the companies to do this themselves, so did the AI Safety movement really move the dial on this?
and also
(b) that I'm skeptical of the value of interpretability research (based only on not having seen anything impressive come from it, but I'm very ignorant of the field).
I see, thanks! I'm not sure exactly what you'd consider as evidence here, but e.g. here's citation count on papers from the past year vs. AI Lab Watch safety rating[1]
Raw data. Note that Anthropic doesn't use arXiv, which affects their citation counts. This is just coming from a dumb search of Semantic Scholar; I expect a lot of disagreement could be had over the exact criteria for considering something "interpretability", but I expect the Ant/GDM > OAI >> * ordering to be true for almost any definition.
Not familiar with this paper, so I sank to using Claude to give myself a summary. The misalignment "blind audit game" seems a bit sus.
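"Three auditing teams played this game under conditions mimicking those of AI developers conducting real pre-deployment alignment audits, including access to model weights and training data"
They had access to the training documents? That doesn't seem like detecting misalignment, that's just a search through files to find one with malicious instructions.
"A fourth team was given affordances that a lab-external auditors would have (namely, only API access to sample from the model) and failed to discover its hidden objective despite it being possible in principle to do so."
🤔 When they don't have the training documents they can't detect the misalignment?
If they're claiming that finding a file with malicious instructions is detecting misalignment then this would update me further against AI Safety research.
Edit: Actually it looks like the teams were using some clever auto-encoder method rather than just a search through documents. In any case this all seems pretty artificial. This method might detect misalignment if it is due to a few malicious documents mixed in with the training data and where the malicious behaviour is easy to detect. This feels like killing cancer cells in a petri dish: easy, and it doesn't tell you much.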
Table 1 shows the techniques used; the teams which were allowed to use SAEs (an interpretability technique) used them; the one which was prohibited from using them searched the data.
Also note that "training data" does not mean "instructions". Section 3 describes their training process.
I think there's a good case for AI safety having a pretty good counterfactual effect on a bunch of productive areas, but obviously that depends on a lot of details and there's plenty of room for debate.
I think a stronger line of critique could be that early-to-mid AI safety efforts/thinking made the frontier race start earlier, go faster, and be more intense (e.g. roles in getting key frontier leaders obsessed, introducing DeepMind cofounders, boosting OpenAI's founding, etc.). I haven't interrogated that history enough to know where to come down, but it's a plausible way that the whole of AI safety has been net-negative. (This claim doesn't really detract from the future impact of AI safety though, if the cat's out of the bag.)
Malaria nets only last 3 years anyway; their direct impact does not require the world to last longer than that (although perhaps you value saving a life less if you think the world will soon end).
The way the benefits calculation cashes out on an individual beneficiary basis essentially requires that they (mostly under-5s) live out full lives and enjoy 40 years of increased income; it isn't a function of how long the nets last.
The existence of existential threats does not in itself create a strong argument to redirect the effort. Otherwise EA should have been focusing on nuclear disarmament, climate change, asteroid defence, pandemic prevention, etc. from the get-go.