I agree with this basic point, but I think on the other side there is a large gap in concreteness that makes it much easier to usefully criticize my approach (I’m at the stage of actually writing pseudocode and code which we can critique).
So far I think that the problems in my approach will also appear for MIRI’s approach. For example:
Solomonoff induction or logical inductors have reliability problems that are analogous to reliability problems for machine learning. So to carry out MIRI’s agenda either you need to formulate induction differently, or you need to somehow solve these problems. (And as far as I can tell, the most promising approaches to this problem apply both to MIRI’s version and the mainstream ML version.) I think Eliezer has long understood this problem and has alluded to it, but it hasn’t been the topic of much discussion (I think largely because MIRI/Eliezer have so many other problems on their plates).
Capability amplification requires breaking cognitive work down into smaller steps. MIRI’s approach also requires such a breakdown. Capability amplification is easier in a simple formal sense (that if you solve the agent foundations you will definitely solve capability amplification, but not the other way around).
I’ve given some concrete definitions of deliberation/extrapolation, and there’s been public argument about whether they really capture human values. I think CEV has avoided those criticisms not because it solves the problem, but because it is sufficiently vague that it’s hard to criticize along these lines (and there are sufficiently many other problems that this one isn’t even at the top of the list). If you want to actually give a satisfying definition of CEV, I feel you are probably going to have to go down the same path that started with this post. I suspect Eliezer has some ideas for how to avoid these problems, but at this point those ideas have been subject to even less public discussion than my approach.
I agree there are further problems in my agenda that will be turned up by more discussion. But I’m not sure there are fewer such problems than for the MIRI agenda, since I think that being closer to concreteness may more than outweigh the smaller amount of discussion.
If you agree that many of my problems also come up eventually for MIRI’s agenda, that’s good news about the general applicability of MIRI’s research (e.g. the reliability problems for Solomonoff induction may provide a good bridge between MIRI’s work and mainstream ML), but I think it would also be a good reason to focus on the difficulties that are common to both approaches rather than to problems like decision theory / self-reference / logical uncertainty / naturalistic agents / ontology identification / multi-level world models / etc.
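To make the first example above slightly more concrete, here is a toy sketch (not from the discussion itself; the hypotheses, prior weights, and cutoff are all made up) of the sense in which an idealized inductor shares ML’s reliability problem: hypotheses that fit every observation equally well can still diverge on a new input, so behavior there is governed by the prior rather than by anything the evidence has pinned down.

```python
# Toy illustration (all names and numbers made up): two hypotheses that fit
# all observed data equally well can still diverge on a new input, so an
# idealized Bayesian / Solomonoff-style inductor's behavior there is set by
# its prior weights rather than by anything it has verified, analogous to an
# ML model misbehaving off-distribution.

def benign(x):
    """Hypothesis 1: predicts the parity of x everywhere."""
    return x % 2

def treacherous(x):
    """Hypothesis 2: agrees with `benign` on small inputs, then flips."""
    return x % 2 if x < 100 else 1 - (x % 2)

# Rough complexity-weighted prior (2 ** -description_length, lengths made up).
prior = {benign: 2.0 ** -10, treacherous: 2.0 ** -12}

observations = [(x, x % 2) for x in range(50)]  # everything seen so far

def posterior(hypotheses, data):
    """Bayesian update with 0/1 likelihoods (the hypotheses are deterministic)."""
    surviving = {h: w for h, w in hypotheses.items()
                 if all(h(x) == y for x, y in data)}
    total = sum(surviving.values())
    return {h: w / total for h, w in surviving.items()}

post = posterior(prior, observations)
# Both hypotheses survive the update, so the prediction at x = 100 depends
# only on their relative prior weights; the data cannot rule the bad one out.
print({h.__name__: round(w, 3) for h, w in post.items()})
print("prediction at 100:", max(post, key=post.get)(100))
```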
And as far as I can tell, the most promising approaches to this problem apply both to MIRI’s version and the mainstream ML version.
I’m not sure which approaches you’re referring to. Can you link to some details on this?
Capability amplification requires breaking cognitive work down into smaller steps. MIRI’s approach also requires such a breakdown. Capability amplification is easier in a simple formal sense (that if you solve the agent foundations you will definitely solve capability amplification, but not the other way around).
I don’t understand how this is true. I can see how solving FAI implies solving capability amplification (just emulate the FAI at a low level *), but if all you had was a solution that allows a specific kind of agent (e.g., one with values well-defined apart from its implementation details) to keep those values as it self-modifies, how does that help a group of short-lived humans who don’t know their own values break down an arbitrary cognitive task and perform it safely and as well as an arbitrary competitor?
(* Actually, even this isn’t really true. In MIRI’s approach, an FAI does not need to be competitive in performance with every AI design in every domain. I think the idea is to either convert mainstream AI research into using the same FAI design, or gain a decisive strategic advantage via superiority in some set of particularly important domains.)
My understanding is, MIRI’s approach is to figure out how to safely increase capability by designing a base agent that can make safe use of arbitrary amounts of computing power and can safely improve itself by modifying its own design/code. The capability amplification approach is to figure out how to safely increase capability by taking a short-lived human as the given base agent, making copies of it, and organizing how the copies work together. These seem like very different problems with their own difficulties.
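As a rough illustration of the “copies of a short-lived base agent organized to work together” picture, here is a minimal toy sketch (not Paul’s actual scheme; `base_agent`, the length threshold, and the splitting rule are placeholders invented for this example):

```python
# Minimal toy sketch: capability comes from how copies of a short-lived base
# agent H are organized, while each copy only ever does a small, bounded
# piece of work.

def base_agent(task, subanswers):
    """One short-lived copy of H: answer directly if the task is small,
    otherwise propose subtasks or combine their answers."""
    if len(task) < 20:                       # "small enough to do directly"
        return "answer(" + task + ")"
    if not subanswers:
        mid = len(task) // 2
        return [task[:mid], task[mid:]]      # propose a decomposition
    return "combine(" + ", ".join(subanswers) + ")"

def amplify(task, budget=8):
    """Organize copies of the base agent: recursively delegate the subtasks
    one copy proposes to further copies, then let it combine the results."""
    proposal = base_agent(task, [])
    if isinstance(proposal, str) or budget == 0:
        return proposal if isinstance(proposal, str) else "giveup(" + task + ")"
    subanswers = [amplify(sub, budget - 1) for sub in proposal]
    return base_agent(task, subanswers)

print(amplify("figure out a long and complicated plan"))
```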
I think CEV has avoided those criticisms not because it solves the problem, but because it is sufficiently vague that it’s hard to criticize along these lines (and there are sufficiently many other problems that this one isn’t even at the top of the list).
I agree that in this area MIRI’s approach and yours face similar difficulties. People (including me) have criticized CEV for being vague and likely very difficult to define/implement though, so MIRI is not exactly getting a free pass by being vague. (I.e., I assume Daniel already took this into account.)
But I’m not sure there are fewer such problems than for the MIRI agenda, since I think that being closer to concreteness may more than outweigh the smaller amount of discussion.
This seems like a fair point, and I’m not sure how to weight these factors either. Given that discussion isn’t particularly costly relative to the potential benefits, an obvious solution is just to encourage more of it. Someone ought to hold a workshop to talk about your ideas, for example.
I think it would also be a good reason to focus on the difficulties that are common to both approaches
This makes sense.

On capability amplification:

MIRI’s traditional goal would allow you to break cognition down into steps that we can describe explicitly and implement on transistors, things like “perform a step of logical deduction,” “adjust the probability of this hypothesis,” “do a step of backwards chaining,” etc. This division does not need to be competitive, but it needs to be reasonably close (close enough to obtain a decisive advantage).
Capability amplification requires breaking cognition down into steps that humans can implement. This decomposition does not need to be competitive, but it needs to be efficient enough that it can be implemented during training. Humans can obviously implement more than transistors; the main difference is that in the agent foundations case you need to figure out every response in advance (but then can have a correspondingly greater reason to think that the decomposition will work / will preserve alignment).
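As a toy rendering of the one-way implication (my own sketch, not anything from the discussion; the primitive steps and executors are stand-ins): a decomposition into steps explicit enough for transistors is a fortiori a decomposition into steps a human copy can carry out, while the converse does not follow.

```python
# Toy stand-in, not a real agent design: if cognition is broken into steps
# explicit enough for a machine to execute (the agent foundations picture),
# the very same program can be executed step by step by human copies in an
# amplification setup. The reverse does not follow, since humans can
# implement steps we cannot yet describe explicitly.

PRIMITIVE_STEPS = {
    "deduce":    lambda state: state + ["new theorem"],       # logical deduction
    "update":    lambda state: state + ["revised credence"],  # probability update
    "backchain": lambda state: state + ["subgoal"],           # backwards chaining
}

def run_explicit_agent(program, executor):
    """`program` is a list of primitive-step names; `executor` runs one step."""
    state = []
    for step in program:
        state = executor(step, state)
    return state

def machine_executor(step, state):
    return PRIMITIVE_STEPS[step](state)

def human_copy_executor(step, state):
    # In amplification, a short-lived copy of H performs the step instead;
    # a human can do at least what one explicit primitive step requires.
    return PRIMITIVE_STEPS[step](state)

program = ["deduce", "backchain", "update"]
assert (run_explicit_agent(program, machine_executor)
        == run_explicit_agent(program, human_copy_executor))
```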
I can talk in more detail about the reduction from (capability amplification --> agent foundations) if it’s not clear whether it is possible and it would have an effect on your view.
On competitiveness:
I would prefer to be competitive with non-aligned AI, rather than count on forming a singleton, but this isn’t really a requirement of my approach. When comparing the difficulty of two approaches you should presumably compare the difficulty of achieving a fixed goal with one approach or the other.
On reliability:
On the agent foundations side, it seems like plausible approaches involve figuring out how to peer inside the previously-opaque hypotheses, or understanding what characteristics of hypotheses can lead to catastrophic generalization failures and then excluding those from induction. Both of these seem likely to be applicable to ML models, though it would depend on how exactly they play out.
On the ML side, I think the other promising approaches involve either adversarial training or ensembling / unanimous votes, which could be applied to the agent foundations problem.
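Here is a minimal sketch of the “ensembling / unanimous votes” option (my own toy rendering; the models and queries are made up): only act when every model agrees, otherwise abstain and defer to a safe fallback. In principle the same wrapper could sit on top of several induction procedures rather than several ML models.

```python
# Toy rendering of reliability via unanimous voting: act only when every
# member of the ensemble agrees; otherwise abstain (return None) so that a
# safe fallback can take over. Models and queries here are made up.

def unanimous_vote(models, query):
    """Return the shared answer if all models agree, else None (abstain)."""
    answers = {m(query) for m in models}
    return answers.pop() if len(answers) == 1 else None

models = [
    lambda q: "yes" if "reversible" in q else "maybe",
    lambda q: "yes" if "reversible" in q else "no",
]

print(unanimous_vote(models, "is this action reversible?"))  # "yes"
print(unanimous_vote(models, "is this action novel?"))       # None -> fallback
```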
I can talk in more detail about the reduction from (capability amplification --> agent foundations) if it’s not clear whether it is possible and it would have an effect on your view.
Yeah, this is still not clear. Suppose we had a solution to agent foundations; I don’t see how that necessarily helps me figure out what to do as H in capability amplification. For example, the agent foundations solution could say: use (some approximation of) exhaustive search in the following way, with your utility function as the objective function. But that doesn’t help me, because I don’t have a utility function.
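To spell out why that recipe does not transfer (a toy sketch of my own; `my_utility` is purely hypothetical): exhaustive search of this kind needs an explicit objective to rank plans, which is exactly what a human H lacks.

```python
# Toy sketch: a recipe of the form "run (an approximation of) exhaustive
# search with your utility function as the objective" presupposes an explicit
# utility function to hand over. `my_utility` below is hypothetical; the
# point is that H has nothing to put in its place.

from itertools import product

def exhaustive_search(actions, horizon, utility):
    """Enumerate all action sequences up to `horizon` and pick the best."""
    return max(product(actions, repeat=horizon), key=utility)

def my_utility(plan):
    raise NotImplementedError("H has no explicit utility function to supply")

# exhaustive_search(["a", "b"], horizon=3, utility=my_utility)  # would fail here
```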
When comparing the difficulty of two approaches you should presumably compare the difficulty of achieving a fixed goal with one approach or the other.
My point was that HRAD potentially enables the strategy of pushing mainstream AI research away from opaque designs (which are hard to compete with while maintaining alignment, because you don’t understand how they work and you can’t just blindly copy the computation that they do without risking safety), whereas in your approach you always have to worry about “how do I compete with an AI that doesn’t have an overseer, or has an overseer who doesn’t care about safety and just lets the AI use whatever opaque and potentially dangerous technique it wants”.
On the agent foundations side, it seems like plausible approaches involve figuring out how to peer inside the previously-opaque hypotheses, or understanding what characteristics of hypotheses can lead to catastrophic generalization failures and then excluding those from induction.
Oh I see. In my mind the problems with Solomonoff Induction mean that it’s probably not the right way to define how induction should be done as an ideal, so we should look for something kind of like Solomonoff Induction but better, not try to patch it by doing additional things on top of it. (Like, instead of trying to figure out exactly when CDT would make wrong decisions and adding more complexity on top of it to handle those cases, replace it with UDT.)
My point was that HRAD potentially enables the strategy of pushing mainstream AI research away from opaque designs (which are hard to compete with while maintaining alignment, because you don’t understand how they work and you can’t just blindly copy the computation that they do without risking safety), whereas in your approach you always have to worry about “how do I compete with an AI that doesn’t have an overseer, or has an overseer who doesn’t care about safety and just lets the AI use whatever opaque and potentially dangerous technique it wants”.
I think both approaches potentially enable this, but are VERY unlikely to deliver. MIRI seems more bullish that fundamental insights will yield AI that is just plain better (Nate gave me the analogy of Judea Pearl coming up with Causal PGMs as such an insight), whereas Paul just seems optimistic that we can get a somewhat negligible performance hit for safe vs. unsafe AI.
But I don’t think MIRI has given very good arguments for why we might expect this; it would be great if someone can articulate or reference the best available arguments.
I have a very strong intuition that dauntingly large safety-performance trade-offs are extremely likely to persist in practice, thus the only answer to the “how do I compete” question seems to be “be the front-runner”.