I think this post is conflating two different things in an important way. There are two distinctions here about what we might mean by “A is trying to do what H wants”:
Whether the trying is terminal, or instrumental. (A might be trying to do what H wants because A intrinsically wants to do so, or because it thinks doing so is a good strategy for achieving some unrelated thing it wants.)
Whether the trying is because of coinciding utility functions, or because of some other value structure, such as loyalty/corrigibility. A might be trying to do what H wants because it values [the exact same things H values], or because A values [trying to do whatever H values].
To illustrate the second distinction: I could drive my son to a political rally because I also believe in the cause, or because I love my son and want to see him succeed at whatever goals he has.
I think it is much more likely that we will instill AIs with something like loyalty than that we will instill them with our exact values directly, and I think most alignment optimists consider this the more promising direction. (I think this is essentially what the term "corrigibility" refers to in alignment.) I know this has been Paul Christiano's approach for a long time now; see, for example, this post.