I think this post is really valuable — I’m curating it. There seems to be a lack of serious but accessible (or at least readable to non-experts like me) discussions of AI risk and strategy, and this post helps fill that gap. I list some specific elements that I liked about the post below.
Please note that I read this post less carefully than I would have liked, and that I have no experience or expertise in AI.
Assorted things I liked about this post
First, I think my mental model of “how we make AI happen safely” improved significantly. That seems like a big win, especially since most of the AI safety content I’ve read has focused on laying out arguments for why AI poses a big risk. The improvement is both broad (I have a much better overview of the situation, at least of the near-casting version) and specific (I learned a lot; e.g. I was surprised to see the success of AI checks and balances listed as a key question for overall success on AI, which was a big update for me). More generally, this post had a very high density of learning per paragraph for me.
Second, I really appreciated this diagram[1], variations of which appear throughout the post to orient and guide the reader:
Third, I really appreciated the clarity of the post. I don’t mean that it was easy to read — it really wasn’t — but rather that it put a lot of effort into making sure readers drew the right conclusions from it, rather than just trying to “sound right.” E.g. I think the last section makes its position clear (if not very specific).
Fourth, there were a number of very helpful frameworks or places where the post took a difficult concept or phenomenon and broke it down. For instance:
The action risk vs. inaction risk distinction seems useful. It’s also discussed elsewhere (and with warnings).
The discussion of risk-reducing properties was helpful: breaking alignment into honesty, corrigibility, and legibility helps me place some other things I’ve read and work I’m aware of, and helps me better understand how that work relates to alignment. The example of legibility was also really helpful.
The “accurate reinforcement” section had a fair bit of content that was new to me, but which I could follow. I really appreciated the examples and types of accurate reinforcement.
Similarly, the section on adversarial training had useful concrete models of how we could train out undesired behaviors (and some pitfalls).
I really liked the example “unusual incentive” setup in the testing section (as well as the analogy).
The checks and balances section had content that was basically entirely new to me. I really appreciated that section and the pitfalls outlined, as well as the countermeasures listed.
The “high-level factors” and key questions section was great. (I wish it had a diagram.)
Finally, the post was just somewhat fun to read. It was a slower read than many posts on the Forum, but e.g. the section on “advanced collusion” was fascinating for anyone who’s even a bit nerdy.
I think diagrams are great. Some reasons for this:
- I personally understand things much better when I can see a diagram (I often draw things out before I write)
- I think diagrams can complement plain text by providing an alternate way for readers to engage with the material — which helps accommodate different types of readers and helps check comprehension (you think you understand what was written, then read through the diagram and get a different takeaway, which forces you to check again).
- Diagrams provide a good condensed/overview-style reference. As you read, having the diagram in mind can help you have a sense of the road map or of how different parts of the text relate to each other.
I also think creating a diagram is a good exercise for clarifying your thoughts.