AI alignment research is focused on promoting alignment between the AI systems that we design and train, and our human goals, values, preferences, and norms.
I still think we should invest a huge amount of talent, time, thought, energy, and money into research on AI alignment with people’s secular values.
I think almost all alignment research is concerned with issues like what’s going on inside the AI, how we can get the AI to tell us what it thinks, or how we can get the AI to do what we tell it to do, rather than with anything that depends much on what humans actually value. Alignment research that took religion into account would look identical to current alignment research.
I don’t think it’s right that the broad project of alignment would look the same with and without considering religion. I’m curious what your reasoning is here and if I’m mistaken.
One way of reading this comment is that it’s a semantic disagreement about what alignment means. The OP seems to be talking about the problem of getting an AI to do the right thing, writ large, which may encompass a broader set of topics than alignment research as you define it.
Two other ways of reading it are that (a) solving the problem the OP is addressing (getting an AI to do the right thing, writ large) does not depend on values, or (b) solving the alignment problem will necessarily solve the value problem. I don’t entirely see how you can justify (a) without a claim like (b), though I’m curious if there’s a way.
You might justify (b) via the argument that solving alignment involves coming up with a way to extrapolate values. Perhaps it is irrelevant which particular person you start with, because the extrapolation process will end up at the same point. To me this seems quite dubious. We have no such method, and we observe deep disagreement in the world. Which methods we use to resolve disagreement, and whose values we include, seem to involve questions of values. And from my lay sense, the alignment methods that are currently most discussed involve aligning the AI with specific preferences.
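As a rough toy illustration of that last point (this is not anyone’s actual alignment method, just made-up rankings): with the very same stated preferences, the ‘collective’ choice can flip depending on which aggregation rule we pick and on whose rankings we decide to count, and both of those choices are themselves value judgements.

```python
# Toy example: the aggregate choice depends on the aggregation rule
# and on whose rankings are included. Outcomes A/B/C and the numbers
# of people are hypothetical.
from collections import Counter

rankings = (
    3 * [("A", "B", "C")]    # three people rank outcome A first
    + 2 * [("B", "C", "A")]  # two rank B first
    + 2 * [("C", "B", "A")]  # two rank C first
)

def plurality_winner(rankings):
    """Count only first choices; the outcome with the most wins."""
    return Counter(r[0] for r in rankings).most_common(1)[0][0]

def borda_winner(rankings):
    """Score each outcome by its rank position, summed across people."""
    scores = Counter()
    for r in rankings:
        for position, outcome in enumerate(r):
            scores[outcome] += len(r) - position
    return scores.most_common(1)[0][0]

print(plurality_winner(rankings))                  # 'A'
print(borda_winner(rankings))                      # 'B': different rule, different winner
print(borda_winner(rankings[:3] + rankings[5:]))   # 'A': drop the two B-first people, different winner again
```

Deciding between rules like these, and deciding who gets counted at all, is itself a question of values.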
One way of reading this comment is that it’s a semantic disagreement about what alignment means. The OP seems to be talking about the problem of getting an AI to do the right thing, writ large, which may encompass a broader set of topics than alignment research as you define it.
Kind of.
Alignment researchers want AI to do the right thing. How they try to do that is mostly not sensitive to what humans want; different researchers do different stuff, but it’s generally more like interpretability or robustness than teaching specific values to AI systems. So even if religion were more popular/appreciated/whatever, they’d still be doing stuff like interpretability, and still be doing it in the same way.
(a) and (b) are clearly false, but many believe that most of the making-AI-go-well problem is getting from AI killing everyone to AI not killing everyone and that going from AI not killing everyone to AI doing stuff everyone thinks is great is relatively easy. And value-loading approaches like CEV should be literally optimal regardless of religiosity.
Few alignment researchers are excited about Stuart Russell’s research, I think (at least in the Bay Area, where the alignment researchers I know are). I agree that if his style of research were more popular, thinking about values and metavalues and such would be more relevant.
Zach—I may be an AI alignment newbie, but I don’t understand how ‘alignment’ could be ‘mostly not sensitive to what humans want’. I thought alignment with what humans want was the whole point of alignment. But now you’re making it sound like ‘AI alignment’ means ‘alignment with what Bay Area AI researchers think should be everyone’s secular priorities’.
Even CEV seems to depend on an assumption that there is a high degree of common ground among all humans regarding core existential values—Yudkowsky explicitly says that CEV could only work ‘to whatever extent most existing humans, thus extrapolated, would predictably want* the same things’. If some humans want us to go extinct, whether they are antinatalists, Earth First eco-activists, religious fundamentalists yearning for the Rapture, or bitter nihilists, then CEV won’t work to prevent AI from killing everyone. CEV and most ‘alignment’ methods only seem to work if they sweep the true religious, political, and ideological diversity of humans under the rug.
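To make that concrete with a toy sketch (this is nothing like an actual CEV implementation; the threshold and the numbers are hypothetical): an extrapolated directive only exists where the extrapolated people agree beyond some threshold, and how high to set that threshold, and what to do when agreement falls short, are themselves further value choices.

```python
# Toy sketch of the coherence condition in the quote above: a shared
# directive exists only where extrapolated wants agree past a threshold.
# The wants, counts, and threshold below are all hypothetical.

def extrapolated_directive(extrapolated_wants, threshold=0.9):
    """Return the most common want if agreement clears the threshold, else None."""
    counts = {}
    for want in extrapolated_wants:
        counts[want] = counts.get(want, 0) + 1
    top_want, n = max(counts.items(), key=lambda kv: kv[1])
    return top_want if n / len(extrapolated_wants) >= threshold else None

# Suppose 95 extrapolated people want humanity to continue and 5 do not.
wants = ["continue"] * 95 + ["cease"] * 5
print(extrapolated_directive(wants, threshold=0.9))   # 'continue': agreement suffices
print(extrapolated_directive(wants, threshold=0.99))  # None: 'predictably want the same things' fails
```

Where to set that bar is precisely the kind of value-laden choice that I’m worried gets swept under the rug.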
I also see no a priori reason why (1) getting from AI killing everyone to AI not killing everyone would be easier than (2) getting from AI not killing everyone to AI doing stuff everyone thinks is great. The first issue (1) seems to require explicitly prioritizing some human corporeal/bodily interests over the brain’s stated preferences, as I discussed here.
zdgroff—that link re. specific preferences to the 80k Hours interview with Stuart Russell is a fascinating example of what I’m concerned about. Russell seems to be arguing that either we align an AI system with one person’s individual stated preferences at a time, or we’d have to discover the ultimate moral truth of the universe, and get the AI aligned to that.
But where’s the middle ground of trying to align with multiple people who have diverse values? That’s where most of the near-term X risk lurks, IMHO—i.e. in runaway geopolitical or religious wars, or other human conflicts, amplified by AI capabilities. Even if we’re talking fairly narrow AI rather than AGI.
Zach—thanks for this comment; I’m working on a reply to it, which I’ll publish as an EA Forum post within a couple of days.
A preview: I think there are good theoretical and empirical reasons why alignment research taking the full heterogeneity of human value types into account (including differences between religious values, political values, food preferences, economic ambitions, mate preferences, cultural taboos, aesthetic tastes, etc) would NOT look identical to current alignment research.
Zach—update: I’ve written a new post today that tries to address your point: https://forum.effectivealtruism.org/posts/KZiaBCWWW3FtZXGBi/the-heterogeneity-of-human-value-types-implications-for-ai