Thanks, I appreciate the paraphrase. Yes, that is a great summary.
I’m more optimistic, e.g. that control turns out to be useful, or that there are hacky alignment techniques which work for long enough to get us through to the automation of crucial safety research.
I hear this all the time, but I also notice that people saying it have not investigated the fundamental limits to controllability that you would encounter with any control system.
As a philosopher, would you not want to have a more generalisable and robust argument that this is actually going to work out?
I think that there are various non-research pathways for such people to (in expectation) increase the safety of the lab they’re working at.
I’m curious about the pathways you have in mind. I may have missed something here.
E.g. I think that one class of work people in labs could do is capabilities monitoring, and if this were done to a good standard it could in fact help to reduce perceived safety to outsiders in a timely fashion.
I’m skeptical that that would work in this corporate context.
“Capabilities” are just too useful economically and can creep up on you. And that’s putting aside whether we can even measure comprehensively enough for “dangerous capabilities”.
In the meantime, it’s great marketing to clients, to the media, and to national interests: you are working on AI systems that could become so capable that you even have an entire team devoted to capabilities monitoring.
I guess I’m quite sceptical that signals like “well, the safety team at this org doesn’t really have any top-tier researchers and is generally a bit badly thought of” will be meaningfully legible to the relevant outsiders, so I don’t really think that reducing the quality of their work will have too much impact on perceived safety.
This is interesting. And a fair argument. Will think about this.
I’m curious about the pathways you have in mind. I may have missed something here.
I think it’s basically things flowing in some form through “the people working on the powerful technology spend time with people seriously concerned with large-scale risks”. From a very zoomed out perspective it just seems obvious that we should be more optimistic about worlds where that’s happening compared to worlds where it’s not (which doesn’t mean that necessarily remains true when we zoom in, but it sure affects my priors).
If I try to tell more concrete stories they include things of the form “the safety-concerned people have better situational awareness and may make better plans later”, and also “when systems start showing troubling indicators, culturally that’s taken much more seriously”. (Ok, I’m not going super concrete in my stories here, but that’s because I don’t want to anchor things on a particular narrow pathway.)
I hear this all the time, but I also notice that people saying it have not investigated the fundamental limits to controllability that you would encounter with any control system.
As a philosopher, would you not want to have a more generalisable and robust argument that this is actually going to work out?
Of course I’d prefer to have something more robust. But I don’t think the lack of that means it’s necessarily useless.
I don’t think control is likely to scale to arbitrarily powerful systems. But it may not need to. I think the next phase of the problem is like “keep things safe for long enough that we can get important work out of AI systems”, where the important work has to be enough that it can be leveraged to something which sets us up well for the following phases.
I don’t think control is likely to scale to arbitrarily powerful systems. But it may not need to… which sets us up well for the following phases.
Under the concept of ‘control’, I am including the capacity of the AI system to control its own components’ effects.
I am talking about the fundamental workings of control, i.e. control theory and cybernetics: general enough that the results apply to any following phases as well.
Anders Sandberg has been digging lately into fundamental controllability limits. Could be interesting to talk with Anders.
Thanks for clarifying.