I’ve seen people already building AI ‘agents’ using GPT. One crucial component seems to be giving it a scratchpad to have an internal monologue with itself, rather than forcing it to immediately give you an answer.
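For concreteness, here is a minimal sketch of the pattern I mean (this is just an illustration; `call_llm` is a made-up stand-in for whatever completion API an agent framework actually uses, not any particular project’s code):

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever completion API the agent framework uses."""
    raise NotImplementedError("plug in your model API here")


def answer_with_scratchpad(question: str) -> tuple[str, str]:
    # First pass: ask the model to think out loud in a scratchpad,
    # without committing to an answer yet.
    scratchpad = call_llm(
        "Think step by step about the following question. "
        "Write out your reasoning, but do not give a final answer yet.\n\n"
        f"Question: {question}"
    )
    # Second pass: ask for the answer, conditioned on the model's own reasoning.
    answer = call_llm(
        f"Question: {question}\n\n"
        f"Reasoning so far:\n{scratchpad}\n\n"
        "Now give your final answer."
    )
    # The scratchpad ends up as ordinary text that a human (or another program)
    # can read, which is what makes the 'just read its mind' idea so tempting.
    return scratchpad, answer
```

The point is just that the intermediate reasoning ends up as plain text you can inspect, rather than hidden activations.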
If the path to agent-like AI ends up emerging from this kind of approach, wouldn’t that make AI safety really easy? We can just read their minds and check what their intentions are?
Holden Karnofsky talks about ‘digital neuroscience’ being a promising approach to AI safety, where we figure out how to read the minds of AI agents. And for current GPT agents, it seems completely trivial to do that: you can literally just read their internal monologue in English and see exactly what they’re planning!
I’m sure there are lots of good reasons not to get too hopeful based on this early property of AI agents, although for some of the immediate objections that come to mind, I can also think of responses. I’d be interested to read a discussion of what the implications of current GPT ‘agents’ are for AI safety prospects.
A few reasons I can think of for not being too hopeful, and my thoughts:
Maybe AGI will look more like the opaque ChatGPT mode of working than the more transparent GPT ‘agent’ mode. (Maybe this is true, although ChatGPT mode seems to have some serious blind spots that come from its lack of a working memory. E.g. if I give it 2 sentences and just ask it which sentence has more words in it, it usually gets it wrong. But if I ask it to write the words in each sentence out in a numbered list first, thereby giving it permission to use the output box to do its working, then it gets it right. It makes intuitive sense to me that agent-like GPTs with a scratchpad would perform much better at general tasks and would be what superhuman AIs would look like.)
Maybe future language model agents will not write their internal monologue in English, but will use some more incomprehensible compressed format instead. Or they will generate so much internal monologue that it will be really hard to check it all. (Maybe. It seems pretty likely that they wouldn’t use normal English. But it also feels likely that decoding this format and automatically checking it for harmful intentions wouldn’t be too hard, i.e. it would be easily doable with current natural language processing technology; a toy sketch of what such monitoring might look like follows after these points. As long as it’s easier to read thoughts than to generate thoughts, it seems like we’d still have a lot of reason to be optimistic about AI safety.)
Maybe the nefarious intentions of the AI will hide in the opaque neural weights of the language model, rather than in the transparent internal monologue of the agent. (This feels unlikely to me, for similar reasons to why the first bullet point feels unlikely. It feels like complex planning of the kind AI safety people worry about is going to require a scratchpad and an iterative thought process, not a single pass through a memoryless neural network. If I think about myself, a lot of the things my brain does are opaque, not just to outsiders, but to me too! I might not know why a particular thought pops into my head at a particular moment, and I certainly don’t know how I resolve separate objects from the image that my eyes create. But if you ask me at a high level what I’ve been thinking about in the last 5 minutes, I can probably explain it pretty well. This part of my thinking is internally transparent. And I think it’s these kinds of thoughts that a potential adversary might actually be interested in reading, if they could. Maybe the same will be true of AI? It seems likely to me that the interesting parts will still be internally transparent. And maybe for an AI, the internally transparent parts will also be externally transparent? Or at least much easier to decipher than they are to create, which should be all that matters.)
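To make the ‘automatically checking’ idea in the second point above a bit more concrete, here is a toy sketch of the shape of thing I’m imagining. Everything in it (the red-flag phrases, the review hook) is a made-up placeholder; a real system would presumably use a learned classifier rather than a keyword list.

```python
# Toy sketch of automated monologue monitoring: scan each scratchpad entry
# for red-flag content before the agent is allowed to act on it.

RED_FLAG_PHRASES = [
    "deceive the user",
    "hide this from the operator",
    "disable the monitoring",
]


def flag_for_human_review(entry: str) -> None:
    """Placeholder: route a suspicious scratchpad entry to a human reviewer."""
    print(f"FLAGGED: {entry[:80]}")


def monitor_scratchpad(entries: list[str]) -> bool:
    """Return True if the monologue looks clean, False if anything was flagged."""
    clean = True
    for entry in entries:
        if any(phrase in entry.lower() for phrase in RED_FLAG_PHRASES):
            flag_for_human_review(entry)
            clean = False
    return clean
```

Reading and flagging text like this seems much cheaper than generating it in the first place, which is the asymmetry my optimism rests on.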
A final thought/concern/question: if ‘digital neuroscience’ did turn out to be really easy, I’d be much less concerned about the welfare of humans, and I’d start to be a lot more concerned about the welfare of the AIs themselves. It would make them very easily exploitable, and if they were sentient as well then it seems like there’s a lot of scope for some pretty horrific abuses here. Is this a legitimate concern?
Sorry this is such a long comment; I almost wrote it up as a forum post. But these are very uninformed, naive musings that I’m just looking for some pointers on, so when I saw this pinned post I thought I should probably put it here instead! I’d be keen to read comments from anyone who’s got more informed thoughts on this!
Tamera Lanham is excited about this and is doing research on it: https://www.lesswrong.com/posts/FRRb6Gqem8k69ocbi/externalized-reasoning-oversight-a-research-direction-for
Thank you! This is exactly what I wanted to read!
The reasons you provide would already be sufficient for me to think that AI safety will not be an easy problem to solve. To add one more example to your list:
We don’t know yet whether LLMs will be the technology that reaches AGI; it could also be one of a number of other technologies that, just like LLMs, makes a breakthrough and then suddenly becomes very capable. So just looking at what is developing now and extrapolating from the currently most advanced model is quite risky.
For the second part, about your concern for the welfare of the AIs themselves: I think this is something very hard for us to imagine. We anthropomorphize AI, so words like ‘exploit’ or ‘abuse’ make sense in a human context where beings experience pain and emotions, but in the context of AI they might simply not apply. That said, I still know very little in this area, so I’m mainly repeating what I’ve read is a common mistake to make when judging morality in regard to AI.
Thanks for this reply! That makes sense. Do you know how likely people in the field think it is that AGI will come from just scaling up LLMs vs requiring some big new conceptual breakthrough? I hear people talk about this question but don’t have much sense about what the consensus is among the people most concerned about AI safety (if there is a consensus).
Since these developments are really bleeding edge, I don’t know who is genuinely an “expert” I would trust to evaluate this.
The closest thing to an answer to your question is maybe this recent article I came across on Hacker News, where the comments are often more interesting than the article itself:
https://news.ycombinator.com/item?id=35603756
If you read through the comments, which mostly come from people who have followed the field for a while, they seem to agree that it’s not just “scaling up the existing models we have now”, mainly for cost reasons, but rather about doing things more efficiently than we do now. I don’t have enough knowledge to say how difficult this is, or whether those different methods will need to be something entirely new, or whether it’s just a matter of combining what is already out there with what we have.
The article itself should be viewed with some skepticism, because OpenAI’s CEO has plenty of reasons to issue a public statement, and I wouldn’t take anything in it at face value. But the comments are maybe a bit more trustworthy and perspective-giving.