[seeing] a lot of safety research as “eating marginal probability” of things going well, progressively addressing harder and harder safety scenarios.
The implicit strategy (which Olah may not endorse) is to solve the easy bits first, then move on to the harder bits, then note the rate at which you're progressing and get a sense of how hard things are that way.
This would be fine if we could be sure we were actually solving the problems, weren't fooling ourselves about the current difficulty level, and if the relevant research landscape were smooth and not blockable by a single missing piece.
I agree the implicit strategy here doesn't seem like it'll make progress on knowing whether the hard problems are real. Lots of the hard problems (generalising well out-of-distribution, the existence of sharp left turns) just don't seem very related to the easier problems (like making LLMs say nice things), and unless you're explicitly looking for evidence of hard problems, I think you'll be able to solve the easier problems in ways that won't generalise (e.g. hammering LLMs with enough human supervision in ways that aren't scalable, but are sufficient to 'align' them).
See also Anthropic’s view on this.