Disclaimer: I joined OP two weeks ago in the Program Associate role on the Technical AI Safety team. I’m leaving some comments describing questions I wanted answered when assessing whether I should take the job (which, obviously, I ended up doing).
How inclined are you (and the OP grantmaking strategy more broadly) towards technical research with theories of impact other than “researcher discovers a technique that makes the AI internally pursue human values” → “labs adopt this technique”? Some examples of other theories of change that technical research might have:
Providing evidence for the dangerous capabilities of current/future models (should such capabilities emerge) that can more accurately inform countermeasures/policy/scaling decisions.
Detecting/demonstrating emergent misalignment from normal training procedures. This evidence would also serve to more accurately inform countermeasures/policy/scaling decisions.
Reducing the ease of malicious misuse of AIs by humans.
Limiting the reach/capability of models instead of ensuring their alignment.
I’m very interested in these paths. In fact, I currently think that well over half the value created by the projects we have funded or will fund in 2023 will go through “providing evidence for dangerous capabilities” and “demonstrating emergent misalignment”; I wouldn’t be surprised if that continues to be the case.