Miles Tidmarsh
By specific values we mean any particular goal we want AIs to pursue besides deference to humans. So democracy and equality would both count, as would goals like harm reduction or utilitarianism.
Agreed. We used “will” because people have wildly different intuitions about what ‘could’ means. So 100% agree would mean “definitely true” and 30% disagree would mean “probably not”.
Definitely agree that stability doesn’t equate to safety, but it sounds like that’s not necessary to your response.
My perspective is that even though current meat production is quite efficient, from the fundamental physics there’s no way that growing a whole living being with a brain and bones and all that is the most efficient possible way of producing this (and immune systems are irrelevant if you have good enough isolation). I do agree that at our current tech level it seems like synthetic meat won’t be competitive anytime soon. While vegan alternatives are delicious to many people, it’s not exactly the same (though wanting to eat animals for psychological reasons is definitely part of it). Though I do agree that these issues are uncertain!
In that case the intent is to vote 100% disagree (as you did here). That’s the belief that anything falling short of full alignment will cause total loss of value
The intent was that, conditional on AI sharing most but not all human values, the AIs wouldn’t change their own values later.
You could have a world where all humans die and the AIs later change their own values, and you could also have worlds where partially aligned AIs don’t wipe out humanity but change their values to be better (e.g. internalizing the goal of being aligned) or worse (e.g. internalizing paperclip maximizer) by our measures.
In worlds where the first TAIs share most but not all human values, what do you think most likely happens?
Thanks Dawn, taking these in turn:
1: “Robust alignment” is a deliberately vague term; it’s meant to incorporate your views about how hard alignment is (e.g. UDT vs. well intentioned).
4: It’s a hard question; our perspective is that the backfire -> cluelessness -> don’t act chain can be thought of as low tractability.
5: By “stable under reflection” we meant the AI reflecting on its own values (while interacting with the world), where agreement means they wouldn’t change their values much (stylistically: an AI that shares 70% of our values in 2030 has those same values in 3030). But you’re right that how AIs interact (beyond competition, handled in the last question) is important.
7: S-risks do break the scale and we couldn’t find a good simple way to deal with that (though we’ll do other polls more directly on that later). The intent of “will” was to match 100% expected probability to 100% agree on the scale.
If robust alignment is orthogonal to pretraining then shouldn’t that mean a strong disagreement with the statement (that alignment requires pretraining)?
Some people believe that if we get partial alignment (i.e. the AI cares about what we want, but also cares about other things) then we can get decent outcomes for the future (analogous to humans being partially aligned to each other). But others think that if we don’t get alignment perfect, ASIs will have an incentive to take over, and then will either have value drift towards something orthogonal to humans or will deliberately reformat their own values. “Stable under reflection” is the opinion that this wouldn’t happen: that ASIs that care somewhat about humans would continue to care somewhat about humans in the long term.
That’s definitely a valid perspective, consistent with your 100% disagree answer. Other people think that aligned ASI would end things like factory farming due to abundance, cheap synthetic meat, uploading, shifts in values, or something else. There are also debates around what it would mean for wild animals.
[Question] Community Polls on Alignment Controversies
Alignment for Animals
Make the future non-human beings deserve ($5k USD in prizes)
Your AI Travel agent would book you a bullfight: benchmarking implicit animal compassion in Agentic AI
6 months is really far beyond where I expect any measurable effect
I agree there wouldn’t be new effects at that point, but we’re asking about total effects over the 6 months before/since the conference. If the connections etc. persist for 6 months then it should show up in the survey, and if they have disappeared within a few months then that indicates these effects of EAGx attendance are short-lived, which presumably makes them far less significant for a person’s EA engagement and impact overall.
Why would it be important for EAGs impact to have a spiky intervention profile?
If the EAG impacts are spiky enough that they start dissipating substantially within several months (but get re-upped by future attendance) then we should be able to detect a change with our methodology (higher engagement after). You’re right that if the effects persist for many years (and don’t stack much with repeat attendance) then we wouldn’t be able to measure any effect on repeat attendees, but this would presume that it isn’t having much impact on repeat attendees anyway. On the other hand, if effects persist for many years then we should be able to detect a strong effect for first-time attendees (though you’d need a bigger sample).
That’s an interesting point: under this model, if EAGx’s don’t matter then we’d expect engagement to decrease for attendees, and stable engagement could be interpreted as a positive effect. A proper cohort analysis could help determine the volatility/churn to give us a baseline and estimate the magnitude of this effect among the sort of people who might attend EAG(x) but didn’t.
That said, I still think that any effect of EAG(x) would presumably be a lot stronger in the 6 months after a conference than in the 6 months after that (/6 months before a conference), so if it had a big effect and engagement of attendees was falling on average then you’d see a bump (or stabilization) in the few months after an event and a bigger decline after that. Though this survey has obvious limitations for detecting that.
What did you mean by the last sentence? Above I’ve assumed that it has an effect not just for new people who are attending a conference for the first time (though my intuition is that this would be bigger) but also in maintaining (on the margin) engagement of repeat attendees. Do you disagree?
Thanks Rudstead, I agree about the “keen beans” limitation, though if anything that makes them more similar to EAGx attendees (which they’re supposed to be a comparison to). In surveys in general there are also steeply diminishing returns for getting a higher response rate with more reminders or higher cash incentives.
(2) Agreed, but hopefully we’ll be able to continue following people up over time. The main limitation is that loads of people in any cohort study are going to drop out over time, but if it succeeded such a cohort study could provide loads of information.
Thanks for the comment, this is a really strong point.
I think this can make us reasonably confident that the EAGx didn’t make people more engaged on average and even though you already expected this, I think a lot of people did expect EAGs would lead to actively higher engagement among participants. We weren’t trying to measure the EA growth rate of course, we were trying to measure whether the EAGs lead to higher counterfactual engagement among attendees.
The model where an EAG matters could look something like: There are two separate populations of EA: less-engaged members who don’t attend EAGs, and more-engaged members who attend EAGs at least sometimes. And attending an EAG helps push people into being even more engaged and maintains their level of engagement that would otherwise flag. So even if both populations are stable, EAG keeps the high-engagement population more engaged and/or larger.
A similar model where EAG doesn’t matter is that people stay engaged for other reasons and people attend EAG believing incorrectly it will help or as de-facto recreation.
If the first model is true then we should expect EA engagement to be a lot higher in the few months after the conference and gradually fall until at least the few weeks before the conference (and spiking again during/just after the conference). But if the second model is true then any effects on EA engagement from the conference should disappear quickly, perhaps within a few weeks or even days.
While the survey isn’t perfect for measuring this (6 months is a lot of time for the effects to decay, and it would have been better to run the initial survey weeks earlier, since the upcoming conference might already have been getting people excited) I think it provides significant value since it asks about behavior over the past 6 months in total. You’d expect, if the conference had a big effect on maintaining motivation (which averages steady-state across years), that people would donate more, have more connections, attend more events etc. 0-5 months after a conference than 6-12 months after.
Given we don’t see that, it seems harder to argue that EAGs have a big effect on motivation and therefore harder to argue that EAGs play an important role in maintaining the current steady-state motivation and energy of attendees.
It could still be that EAGs matter for other reasons (e.g. a few people get connections that create amazing value) but this seems to provide significant evidence against one major supposed channel of impact.
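To make the window comparison above concrete, here is a rough sketch of how a decaying post-conference boost would show up in two consecutive 6-month windows. Everything in it is an assumption for illustration rather than anything taken from the survey: the exponential decay shape, the half-life values, the unit-sized initial boost, and the helper names `mean_boost` and `window_ratio`.

```python
import math

def mean_boost(half_life_months, start, end, initial_boost=1.0):
    """Average engagement boost over [start, end] months after a conference.

    Assumes (hypothetically) that the boost decays exponentially:
    boost(t) = initial_boost * 2 ** (-t / half_life_months).
    """
    rate = math.log(2) / half_life_months
    # Closed-form integral of the exponential over the window.
    integral = initial_boost * (math.exp(-rate * start) - math.exp(-rate * end)) / rate
    return integral / (end - start)

def window_ratio(half_life_months):
    """Mean boost in months 0-6 divided by mean boost in months 6-12."""
    return mean_boost(half_life_months, 0, 6) / mean_boost(half_life_months, 6, 12)

# Hypothetical half-lives: two weeks, three months, six months, ten years.
for half_life in (0.5, 3, 6, 120):
    print(
        f"half-life {half_life:>5} months: "
        f"mean boost 0-6m = {mean_boost(half_life, 0, 6):.3f}, "
        f"ratio vs 6-12m = {window_ratio(half_life):.2f}"
    )
```

Under these assumptions a boost with a half-life of a few months is clearly larger in the first window than in the second, a boost lasting only weeks is small in both windows, and a boost lasting many years looks nearly identical across the two windows, which is the case where repeat attendees would show no measurable difference.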
We considered it and I definitely agree that people who are attending their first EAGx are much more likely to be affected. The issue is that people in that bucket are already likely to be dramatically increasing their level of engagement, so it’s hard to draw conclusions from the results on that front.
That was the intervention class we had in mind, though there could be other pretraining interventions that don’t fall cleanly into good/bad values (e.g. promoting risk aversion).