Thanks for this interesting post! I am probably not well-suited to apply for the fellowship. However, I was interested in the ideas you mentioned, so I wanted to share some thoughts of my own regardless. They might not be useful, but it was helpful for me to get them out of my head!
Behaviour science
I work in this space, and much of the theory seems very relevant to understanding non-human agents. For instance, I wonder whether there would be value in exploring if models of human behaviour such as COM-B and the Fogg Behaviour Model (FBM) could be useful in modelling the actions of AI agents. If it is useful to theorise that a human agent’s behaviour only occurs when they have sufficient motivation and ability and a trigger to act (as per the FBM), it might also be useful to do so for a non-human agent.
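To make that concrete (at least for myself), here is a toy sketch of what an FBM-style check might look like if applied to an AI agent. It is purely illustrative: the multiplicative form, the threshold, and the field names are my own assumptions, not part of the FBM or of any real agent architecture.

```python
from dataclasses import dataclass

@dataclass
class FBMState:
    """Toy FBM-style snapshot of an agent at a moment in time."""
    motivation: float  # how strongly the agent 'wants' the outcome (0..1)
    ability: float     # how easy the action is for the agent (0..1)
    trigger: bool      # is there a prompt/cue to act right now?

def acts(state: FBMState, threshold: float = 0.5) -> bool:
    """FBM-style rule: behaviour occurs only when motivation and ability are
    jointly sufficient AND a trigger is present. The multiplicative form and
    the threshold are illustrative assumptions, not claims about real AI systems."""
    return state.trigger and (state.motivation * state.ability) >= threshold

# A highly motivated agent with little ability does not act, even when triggered...
print(acts(FBMState(motivation=0.9, ability=0.3, trigger=True)))  # False
# ...but acting becomes possible once ability rises.
print(acts(FBMState(motivation=0.9, ability=0.7, trigger=True)))  # True
```

If anything like this were useful at all, I suppose the interesting questions would be how to estimate (or deliberately shape) each term for a given AI system.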
Persuasion
I used to be interested in this (it is basically attitude and behaviour change).
I wonder if the concept of persuasion and its underlying theory are useful for understanding how AI agents should respond to information and choose which information to share with other agents to achieve goals (i.e., to persuade). If so, then communications/processing models such as McGuire’s, Shannon-Weaver, or Lasswell’s may be useful.
Related to that, I wrote a (not very good) paper outlining the concept of persuasion a long time ago, which finished with:
“From a philosophical perspective, we recommend that future research should consider if non-human agents can not only persuade but can also be persuaded. Research already explores how emerging technologies, such as artificial intelligences, may be human-like to varying extents (see Bostrom, 2014; Kurzweil, 2005; Searle, 1980). If we can believe that non-biological beings might be conscious and human-like (Calverley, 2008; Hofstadter & Dennett, 1988) then maybe we should also consider whether these beings will have beliefs, attitudes and behaviours and thus be subject to persuasion?”
Systems thinking
I am still a novice in this area and what I know is probably outdated. I wonder if there could be value in drawing on concepts in systems thinking when attempting to manage AI. As an example, this model suggests 12 leverage points for systems change (based on this work). Could we model/manage an agent’s behavioural outcomes in the same way?
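To illustrate the kind of thing I mean, here is a rough sketch. The leverage points (ordered from weakest to strongest) are Meadows’, but the AI-management parallels in the comments are speculative guesses on my part, purely for illustration.

```python
# Donella Meadows' leverage points, ordered roughly from weakest to strongest.
# The AI-management parallels in the comments are speculative guesses on my part.
LEVERAGE_POINTS = [
    "constants and parameters",   # e.g. tweaking a reward coefficient?
    "buffer sizes",
    "stock-and-flow structures",
    "lengths of delays",
    "balancing feedback loops",   # e.g. monitoring and rollback mechanisms?
    "reinforcing feedback loops",
    "information flows",          # e.g. what the agent (or its overseers) can observe?
    "rules of the system",        # e.g. deployment policies and regulation?
    "power to self-organise",
    "goals of the system",        # e.g. the training objective itself?
    "paradigms",
    "power to transcend paradigms",
]

# The question I am asking: could interventions on an AI agent's behavioural
# outcomes be chosen and ranked along an axis like this?
for rank, point in enumerate(reversed(LEVERAGE_POINTS), start=1):
    print(f"{rank:>2}. {point}")
```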
I am interested to know what you think, if you have time. Do any of these areas seem fruitful? Are they irrelevant, or are there better approaches already in use?
I am very aware that I don’t have a good understanding of how AI agents’ behaviour is modelled within the AI safety/governance literature. I also don’t understand exactly i) what differences there are between those approaches and the approaches used in behavioural science/social science, or ii) what would justify using different approaches for each.
Can you (or anyone else) recommend things that I should read/watch to improve my understanding?
Thanks for the thoughtful comment!
Modelling AI agents with frameworks like COM-B or the FBM sounds like a potentially good analogy, but one has to be careful that it doesn’t rely on assumptions that only apply to humans, or to quite bounded agents.
Persuasion (both by AIs and of AIs) is indeed an important topic in alignment. There’s a general risk that optimization is very easily spent on manipulating humans, whether intentionally (training an AI which actually ends up wanting something else, and so has reason to manipulate us) or unintentionally (training an AI such that it’s incentivized to give the answer we would prefer rather than the most accurate and appropriate one).
For the persuasion of AIs by AIs, there are some initial thoughts around memetics for AIs, but they are not fully formed yet.
I don’t know much about the systems-thinking literature, but it makes me think of more structural takes on the alignment problem, which emphasize how the structure of society funnels and pushes optimization, rather than the individual power of agents to alter it.
So, as you can see above, none of these ideas sounds bad or impossible to make work, but judging them correctly would require putting far more effort into analyzing them. Maybe you should apply for the fellowship, especially for the behavioral work, where you’re more of an expert? ;)
It’s a very good question, and shamefully I don’t have an answer that’s completely satisfying. But here is the next best thing: some resources that will give you a more rounded perspective on alignment:
Richard Ngo’s AGI safety from first principles, a condensed starter that presents the main lines of argument in a modern (post-ML-revolution) way.
Rob Miles’s YouTube channel on alignment, with great videos on many different topics.
Andrew Critch and David Krueger’s ARCHES, a survey of alignment problems and perspectives that puts more emphasis than most on structural approaches.
Thanks, Adam, this was very helpful! I really appreciate that you took the time to respond in such detail.
I will see what I can do for the fellowship. I might be able to convince someone else to do it and then I can collaborate with them :)