I know of at least one potential counterexample: OpenAI’s RLHF was developed by AI safety people who joined OpenAI to promote safety. But it’s not clear that RLHF helps with x-risk.
I’d go further and say that it’s not actually a counterexample. RLHF allowed OpenAI to be hugely profitable: without it, they wouldn’t have been able to publicly release their models and build their massive userbase.