I think this is a good piece, and I’m glad you wrote it. One disagreement I have is that I think the behaviour of a deep learning agent trained by HFDT is less predictable than this piece suggests. My guesses would be something like:
1. 30%: the behaviour of HFDT-AGI is basically fine / instability is easy to manage
2. 30%: the behaviour of HFDT-AGI is unstable in a hard-to-manage way and includes seriously bad but not catastrophic behaviours
3. 30%: the behaviour of HFDT-AGI is unstable and includes catastrophic behaviours
4. 10%: the behaviour of HFDT-AGI converges to some kind of catastrophic behaviour
These numbers aren’t the product of much reflection; they’d probably change substantially with a couple of hours’ thought. Also, note that if the bottom two options really carry 40% of the weight per attempt, catastrophe is overall highly likely, as people will probably make multiple attempts at AGI and consequently explore a range of possible behaviours.
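To illustrate (my own arithmetic, assuming for simplicity that attempts are independent draws from this distribution): with just three attempts, the chance that at least one lands in options 3 or 4 is $1 - (1 - 0.4)^3 \approx 0.78$.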
I interpret your piece as arguing that option 4 actually has more than 50% weight, and this doesn’t seem right to me.
We seem to agree that an agent trained by episodic reward might behave in a way that is interpretable as “strategically maximising reward”, and might behave in a different way. In particular, if we assume that training episodes are IID from some distribution, then it seems reasonable to assume that the agent behaves as a strategic reward maximiser in episodes sampled from the same distribution, relative to that distribution. However, in the case we’re actually interested in, the behaviour takes place in a more general environment, and it’s much less clear whether the behaviour over time will be interpretable in this way.
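A toy illustration of the gap (a minimal sketch of my own, not the post’s setup; every specific here is hypothetical): a tabular agent trained on episodes drawn IID from one context distribution ends up exactly matching the reward-maximising policy in-distribution, while its behaviour on contexts it never saw is simply undetermined by training.

```python
import numpy as np

rng = np.random.default_rng(0)
n_contexts, n_actions = 10, 4
true_reward = rng.random((n_contexts, n_actions))  # deterministic reward table

# Train: episodes sampled IID from P, which only ever shows contexts 0-4.
q = np.zeros((n_contexts, n_actions))
counts = np.zeros((n_contexts, n_actions))
for _ in range(20_000):
    c = int(rng.integers(0, 5))  # training distribution P
    # epsilon-greedy action selection
    a = int(rng.integers(n_actions)) if rng.random() < 0.1 else int(q[c].argmax())
    counts[c, a] += 1
    q[c, a] += (true_reward[c, a] - q[c, a]) / counts[c, a]  # running mean

greedy = q.argmax(axis=1)

# In-distribution, the learned policy coincides with reward maximisation.
print("matches reward maximiser on P:",
      bool((greedy[:5] == true_reward[:5].argmax(axis=1)).all()))

# Contexts 5-9 never appeared in training, so the "strategic reward
# maximiser" reading says nothing about behaviour there: q is still all
# zeros for those contexts and argmax ties break arbitrarily at 0.
print("actions on unseen contexts:", greedy[5:])
```

Nothing in this sketch bears on whether HFDT generalises well or badly; the point is only that the “reward maximiser” reading is pinned down by training on P and silent off it.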
I’m not sure about the following: you spend substantial time discussing the actions of a strategic reward maximiser, and argue that this follows from a premise of “correct generalisation”, but you also say it’s “very plausible” that its behaviour cannot ultimately be interpreted as strategic reward maximisation. Does your overall assessment of likely AI takeover depend on strategic reward maximisation being quite likely? I think it is substantially more likely that the AI’s behaviour cannot be interpreted as strategic reward maximisation. I’ll say a bit more about this later.
If strategic reward maximisation is not likely, you argue that “Even if Alex isn’t ‘motivated’ to maximize reward, it would seek to seize control”. I think this section depends on a strong and, in my view, not especially compelling assumption, which I’m going to try to spell out. The structure of your argument seems to be:
1. HFDT-AGI’s behaviour will be interpretable as strategic pursuit of some utility function.
2. For utility functions in your preferred reference class, this implies catastrophic behaviour. I’m going to call this class “simple utility functions”, though I think this phrasing is not especially epistemically hygienic because it suggests a sharper definition than we actually have.

(I’m putting words in your mouth a bit here, so please let me know if this isn’t what you meant.)
This depends on the assumption:

> HFDT-AGI’s behaviour will be interpretable as strategic pursuit of some utility function which is, with high probability, a simple utility function.
This assumption is strong. Many behaviours can be rationalised as maximising some utility function, but for most behaviours the relevant utility function is one that looks quite crazy – of the form “I value taking this action in these exact circumstances” – and not dangerous. However, once you constrain the class of admissible utility functions, almost all behaviours can no longer be interpreted as utility maximising.
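To spell out the first half of that (my formalisation, not anything from the post): for any deterministic policy $\pi$, define

$$U_\pi(s, a) = \begin{cases} 1 & \text{if } a = \pi(s), \\ 0 & \text{otherwise.} \end{cases}$$

Then $\pi$ is exactly the behaviour that maximises expected $U_\pi$, so every behaviour is “utility maximising” relative to some utility function of this crazy, circumstance-indexed kind. The assumption only does work once $U$ is required to come from a restricted class like “simple utility functions” – and that restriction is what rules out almost all behaviours.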
It seems to me (and I’ve only thought about it a little, and I certainly don’t have a proof) that there’s basically only one way the behaviour of HFDT-AGI ends up satisfying this assumption, and that’s if HFDT can (in the limit of high competence) be well described by an algorithm that does something like the following (a toy sketch follows the list):
1. Pick a function that specifies “true utility”, according to a prior that corresponds to “simple utilities” updated by reinforcement.
2. Pick a hypothesis about how to maximise it.
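Here is roughly what I have in mind, as a toy sketch (every detail – the hypothesis class, the Gaussian likelihood, the greedy planner – is my own hypothetical gloss on the two steps above, not a claim about how HFDT actually works):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_hypotheses = 6, 3, 5

# A restricted hypothesis class standing in for "simple utilities",
# with a uniform prior over it (both choices hypothetical).
candidate_utilities = rng.random((n_hypotheses, n_states, n_actions))
log_belief = np.zeros(n_hypotheses)

def update(log_belief, state, action, reward, noise=0.1):
    """Update the prior by reinforcement: score each candidate utility
    by how well it predicts the observed reward (Gaussian likelihood)."""
    predicted = candidate_utilities[:, state, action]
    log_belief = log_belief - 0.5 * ((reward - predicted) / noise) ** 2
    return log_belief - log_belief.max()  # rescale for numerical stability

def act(log_belief, state):
    """Step 1: pick the current best guess at the 'true utility'.
    Step 2: act on a (here trivial) hypothesis about how to maximise it."""
    best = int(log_belief.argmax())
    return int(candidate_utilities[best, state].argmax())

# Usage: suppose hypothesis 2 happens to generate the training reward.
true_u = candidate_utilities[2]
for _ in range(200):
    s = int(rng.integers(n_states))
    a = act(log_belief, s)
    log_belief = update(log_belief, s, a, true_u[s, a] + 0.1 * rng.normal())

print("posterior mode:", log_belief.argmax())  # typically concentrates on 2
```

If HFDT-in-the-limit really factored like this – an explicit posterior over a privileged class of utilities, plus a planner – the assumption would follow; my claim below is that descriptions at this level of specificity are unlikely to be right.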
But it seems rather unlikely to me that this is how HFDT works: just about any description of how HFDT works at this level of specificity seems unlikely to be right, and this one doesn’t seem particularly good even among such descriptions. I think this also holds for the special case where the function specifying “true utility” is, in some sense or other, the reward allocated to HFDT during training.
I think my argument for doubting this assumption is weak, but I hope you can at least appreciate that it is a strong assumption.