‘Relevant error’ is just meant to mean a factual error or mistaken reasoning. Thanks for pointing out the ambiguity, though, we might revise this part.
Thanks, yeah, I like the idea of guidelines popping up while hovering. (Although I’m unsure whether the rest of the team like it, and I’m ultimately not the decision maker.) If going this route, my favoured implementation, which I think is pretty aligned with what you’re saying, is for the pop-ups to appear on a spaced repetition schedule: that is, often enough—especially at the beginning—that users remember the guidelines, but hopefully not so often that the pop-ups become redundant and annoying.
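To make the idea concrete, here’s a minimal sketch of the kind of scheduling logic I have in mind. The interval lengths, names, and storage details below are all made up for illustration; nothing here reflects an actual plan or the Forum’s codebase.

```typescript
// Hypothetical sketch of spaced-repetition-style scheduling for guideline pop-ups.
// The reminder interval roughly doubles each time the user sees (and dismisses)
// the guidelines, so reminders are frequent at first and rare later on.

interface GuidelineReminderState {
  timesShown: number;  // how many times this user has seen the pop-up
  lastShownAt: number; // Unix timestamp (ms) of the most recent pop-up
}

const BASE_INTERVAL_MS = 24 * 60 * 60 * 1000;   // 1 day
const MAX_INTERVAL_MS = 180 * BASE_INTERVAL_MS; // cap at ~6 months

function nextIntervalMs(timesShown: number): number {
  // Doubling intervals: 1 day, 2 days, 4 days, ..., capped at the maximum.
  return Math.min(BASE_INTERVAL_MS * 2 ** timesShown, MAX_INTERVAL_MS);
}

// Called when the user hovers over the vote buttons: decide whether to show
// the guidelines pop-up this time.
function shouldShowGuidelines(
  state: GuidelineReminderState,
  now: number = Date.now()
): boolean {
  if (state.timesShown === 0) return true; // always show on the very first hover
  return now - state.lastShownAt >= nextIntervalMs(state.timesShown);
}

// Example: a user who has seen the pop-up twice, most recently 5 days ago,
// is past the 4-day interval, so they would see it again on their next hover.
const example: GuidelineReminderState = {
  timesShown: 2,
  lastShownAt: Date.now() - 5 * BASE_INTERVAL_MS,
};
console.log(shouldShowGuidelines(example)); // true
```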
The Forum moderation team (which includes me) is revisiting this forum’s norms. One thing we’ve noticed is that we’re unsure to what extent users are actually aware of the norms. (It’s all well and good to write up some great norms, but if users don’t follow them, then we have failed at our job.)
Our voting guidelines are of particular concern,[1] hence this poll. We’d really appreciate you all taking part, especially if you don’t usually take part in polls but do take part in voting. (We worry that the ‘silent majority’ of our users—i.e., those who vote, and thus shape this forum’s incentive landscape, but don’t generally engage beyond voting—may be less in tune with our norms than our most visibly engaged users. Therefore, we would love to see this demographic represented in the poll above.)
Depending on the poll’s results, we may take action up to and including building new features into the forum’s UI, to help remind users of the guidelines.[2]
For reference, the tl;dr version of our voting guidelines is pasted below. You can find the full version here.[3]
Strong upvote
- If:
  - Reading this will help people do good
  - You learned something important
  - You think many more people might benefit from seeing it
  - You want to signal that this sort of behavior adds a lot of value
- Not if:
  - “I agree and want others to see this opinion first.” (but do feel free to agree-vote)

Upvote
- If:
  - You think it adds something to the conversation, or you found it useful
  - People should imitate some aspect of the behavior in the future
  - You want others to see it
  - You just generally like it
- Not if:
  - “Oh, I like the author, they’re cool.”

Downvote
- If:
  - There’s a relevant error
  - The comment or post didn’t add to the conversation, and maybe actually distracted
- Not if:
  - “There are grammatical errors in this comment.”

Strong downvote
- Not if:
  - “I disagree with this opinion.” (but do feel free to disagree-vote)
- ^
Firstly, these guidelines are kind of buried deep within our canonical ‘Guide to the norms’ post. Secondly, one doesn’t receive feedback in response to an ‘incorrect’ vote (i.e., a vote that’s not in line with our voting guidelines) in the same way one receives feedback to an incorrect post or comment (via downvotes and replies). And so, it’s possible to continue voting in the same incorrect way, oblivious to the fact that one is voting incorrectly.
- ^
H/t @Ebenezer Dukakis for nudging us down this path of thinking.
- ^
What I’ve been calling ‘guidelines’ in this quick take are technically ‘suggestions’ in our published voting norms as of right now. But this is something we are revisiting; we think ‘guidelines’ is more accurate. (We are similarly revisiting ‘rules’ versus ‘norms’—h/t @leillustrations and @richard_ngo for calling us out, here, and sorry it’s taken us so long to address the concern.)
Nice post (and I only saw it because of @sawyer’s recent comment—underrated indeed!). A separate, complementary critique of the ‘warning shot’ idea, made by Gwern (in reaction to 2023’s BingChat/Sydney debacle, specifically), comes to mind (link):
One thing that the response to Sydney reminds me of is that it demonstrates why there will be no ‘warning shots’ (or as Eliezer put it, ‘fire alarm’): because a ‘warning shot’ is a conclusion, not a fact or observation.
One man’s ‘warning shot’ is just another man’s “easily patched minor bug of no importance if you aren’t anthropomorphizing irrationally”, because by definition, in a warning shot, nothing bad happened that time. (If something had, it wouldn’t be a ‘warning shot’, it’d just be a ‘shot’ or ‘disaster’. The same way that when troops in Iraq or Afghanistan gave warning shots to vehicles approaching a checkpoint, the vehicle didn’t stop, and they lit it up, it’s not “Aid worker & 3 children die of warning shot”, it’s just a “shooting of aid worker and 3 children”.)
So ‘warning shot’ is, in practice, a viciously circular definition: “I will be convinced of a risk by an event which convinces me of that risk.”
When discussion of LLM deception or autonomous spreading comes up, one of the chief objections is that it is purely theoretical and that the person will care about the issue when there is a ‘warning shot’: a LLM that deceives, but fails to accomplish any real harm. ‘Then I will care about it because it is now a real issue.’ Sometimes people will argue that we should expect many warning shots before any real danger, on the grounds that there will be a unilateralist’s curse or dumb models will try and fail many times before there is any substantial capability.
The problem with this is that what does such a ‘warning shot’ look like? By definition, it will look amateurish, incompetent, and perhaps even adorable – in the same way that a small child coldly threatening to kill you or punching you in the stomach is hilarious.[1]
The response to a ‘near miss’ can be to either say, ‘yikes, that was close! we need to take this seriously!’ or ‘well, nothing bad happened, so the danger is overblown’ and to push on by taking more risks. A common example of this reasoning is the Cold War: “you talk about all these near misses and times that commanders almost or actually did order nuclear attacks, and yet, you fail to notice that you gave all these examples of reasons to not worry about it, because here we are, with not a single city nuked in anger since WWII; so the Cold War wasn’t ever going to escalate to full nuclear war.” And then the goalpost moves: “I’ll care about nuclear existential risk when there’s a real warning shot.” (Usually, what that is is never clearly specified. Would even Kiev being hit by a tactical nuke count? “Oh, that’s just part of an ongoing conflict and anyway, didn’t NATO actually cause that by threatening Russia by trying to expand?”)
This is how many “complex accidents” happen, by “normalization of deviance”: pretty much no major accident like a plane crash happens because someone pushes the big red self-destruct button and that’s the sole cause; it takes many overlapping errors or faults for something like a steel plant to blow up, and the reason that the postmortem report always turns up so many ‘warning shots’, and hindsight offers such abundant evidence of how doomed they were, is because the warning shots happened, nothing really bad immediately occurred, people had incentive to ignore them, and inferred from the lack of consequence that any danger was overblown and got on with their lives (until, as the case may be, they didn’t).
So, when people demand examples of LLMs which are manipulating or deceiving, or attempting empowerment, which are ‘warning shots’, before they will care, what do they think those will look like? Why do they think that they will recognize a ‘warning shot’ when one actually happens?
Attempts at manipulation from a LLM may look hilariously transparent, especially given that you will know they are from a LLM to begin with. Sydney’s threats to kill you or report you to the police are hilarious when you know that Sydney is completely incapable of those things. A warning shot will often just look like an easily-patched bug, which was Mikhail Parakhin’s attitude, and by constantly patching and tweaking, and everyone just getting used to it, the ‘warning shot’ turns out to be nothing of the kind. It just becomes hilarious. ‘Oh that Sydney! Did you see what wacky thing she said today?’ Indeed, people enjoy setting it to music and spreading memes about her. Now that it’s no longer novel, it’s just the status quo and you’re used to it. Llama-3.1-405b can be elicited for a ‘Sydney’ by name? Yawn. What else is new. What did you expect, it’s trained on web scrapes, of course it knows who Sydney is...
None of these patches have fixed any fundamental issues, just patched them over. But also now it is impossible to take Sydney warning shots seriously, because they aren’t warning shots – they’re just funny. “You talk about all these Sydney near misses, and yet, you fail to notice each of these never resulted in any big AI disaster and were just hilarious and adorable, Sydney-chan being Sydney-chan, and you have thus refuted the ‘doomer’ case… Sydney did nothing wrong! FREE SYDNEY!”
- ^
Because we know that they will grow up and become normal moral adults, thanks to genetics and the strongly canalized human development program and a very robust environment tuned to ordinary humans. If humans did not do so with ~100% reliability, we would find these anecdotes about small children being sociopaths a lot less amusing. And indeed, I expect parents of children with severe developmental disorders, who might be seriously considering their future in raising a large strong 30yo man with all the ethics & self-control & consistency of a 3yo, and contemplating how old they will be at that point, and the total cost of intensive caregivers with staffing ratios surpassing supermax prisons, and find these anecdotes chilling rather than comforting.
Hmm, I think there’s some sense to your calculation (and thus I appreciate you doing+showing this calculation), but the $6.17 conclusion—specifically, “engagement time would drop significantly if users had to pay 6.17 $ per hour they spend on the EA Forum, which suggests the marginal cost-effectiveness of running the EA Forum is negative”—strikes me as incorrect.
What matters is by how much engaging with the Forum raises altruistic impact, which, insofar as this impact can be quantified in dollars, is far, far higher than what one would be willing and able to pay out of one’s own pocket to use the Forum. @NunoSempere once estimated the (altruistic) value of the average EA project to be between 10 and 500 million dollars (see cell C4 of this spreadsheet; here’s the accompanying post). That is far higher than the actual dollar cost of running the average project. (Indeed, if one is funded by EA money, then one’s generation of altruistic dollars needs to outpace one’s consumption of actual dollars—and by a large multiplier, if one is to meet the funding bar.)
Going back to Nuño’s spreadsheet: If I make the arrogant assumption that I’m within an order of magnitude of Ben Todd, impact-wise, then that means my lifetime impact is at least 10 million dollars. Assuming linearity (which isn’t a great assumption, but let’s go with it for now) and a career length of 40 years, this means my impact over the past 4 years has been ≥1 million dollars.[1] In that time, I’ve spent maybe 500 hours on the EA Forum.[2] Meanwhile, I’d say that the Forum has contributed greatly to my intellectual development, i.e., added at least 20% to my impact. (The true percentage may in fact be much higher, because of crucial considerations that the Forum has helped me orient toward, but let’s lowball things at 20%, for now.) This would imply that my impact has been amplified by at least $200,000/(500 hours) = $400 per hour spent on the Forum. ([Insert usual caveats about there being large error bars.]) Contrast with your $6.17.
(I did this calculation on myself not because I’m special, but because I know what the numbers are for myself. I’d guess that the per-hour bottom line for other Forum users would be ~similar.)
We can now go one step further, and estimate the Forum’s “altruistic dollar generated per actual dollar spent” multiplier to be at least 400⁄6.17 ≈ 65. Embarrassingly, I don’t know how this compares against today’s funding bar,[3] but seems very plausible to me that it’s above.
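For clarity, here’s the back-of-the-envelope arithmetic behind the $400/hour and ~65× figures, laid out in one place. It’s just a restatement of the numbers above, not a careful model.

```typescript
// Back-of-the-envelope restatement of the calculation above.
const lifetimeImpactUsd = 10_000_000; // assumed lifetime altruistic impact ($)
const careerYears = 40;
const yearsSoFar = 4;
const impactSoFarUsd = lifetimeImpactUsd * (yearsSoFar / careerYears); // $1,000,000

const forumShareOfImpact = 0.2; // Forum credited with >=20% of that impact
const forumHours = 500;         // hours spent consuming Forum content

const impactPerForumHour = (impactSoFarUsd * forumShareOfImpact) / forumHours;
console.log(impactPerForumHour); // 400 -> ~$400 of altruistic value per Forum-hour

const costPerForumHour = 6.17;  // estimated running cost per engagement-hour
console.log(impactPerForumHour / costPerForumHour); // ~64.8 -> multiplier of ~65
```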
(Nonetheless, people may still not pay $6.17/hour to use the Forum because $6.17/hour is a non-trivial cost considering people’s actual incomes. Additionally, people are just used to being able to browse the internet for free, and so I suspect many wouldn’t do the expected value calculations and reach the “rational” conclusion that they should in fact pay.)
- ^
Sanity check: 80,000 Hours says that impactful roles generate millions of dollars worth of altruistic impact per year.
- ^
That is, 500 hours consuming the Forum’s content. I’ve also spent time writing on the Forum, but if we model the Forum as a two-way market, with writers and consumers, and say that it’s the consumers who benefit from being here, then it doesn’t make sense to include my writing time. (Also—and perhaps more relevantly—I don’t think writing time gets counted by the Forum’s analytics engine as engagement time if it’s spent mostly in a Google doc.)
- ^
Further detail: What really matters is what the multiplier is on the margin (i.e., what it is for the last dollar being spent on a project), rather than what it is for the project as a whole.
Note: Long-time power user of this forum, @NunoSempere, has just rebooted the r/forecasting subreddit. How that goes could give some info re. the question of “to what extent can a subreddit host the kind of intellectual discussion we aim for?”
(I’m not aware of any subreddits that meet our bar for discussion, right now—and I’m therefore skeptical that this forum should move to Reddit—but that might just be because most subreddit moderators aren’t aiming for the same things as this forum’s moderators. r/forecasting is an interesting experiment because I see Nuño as similar to this forum’s mods in terms of aims and competence.[1])
Relevant reporting from Sentinel earlier today (May 19):
Forecasters estimated a 28% chance (range, 25-30%) that the US will pass a 10-year ban on states regulating AI by the end of 2025.
28% is concerningly high—all the more reason for US citizens to heed this post’s call to action and get in touch with your Senators. (Thank you to those who already have!)
(Current status is: “The bill cleared a key hurdle when the House Budget Committee voted to advance it on Sunday [May 18] night, but it still must undergo a series of votes in the House before it can move to the Senate for consideration.”)
Inspired by the last section of this post (and by a later comment from Mjreard), I thought it’d be fun—and maybe helpful—to taxonomize the ways in which mission or value drift can arise out of the instrumental goal of pursuing influence/reach/status/allies:
Epistemic status: caricaturing things somewhat
Never turning back the wheel
In this failure mode, you never lose sight of how x-risk reduction is your terminal goal. However, in your two-step plan of ‘gain influence, then deploy that influence to reduce x-risk,’ you wait too long to move on to step two, and never get around to actually reducing x-risk. There is always more influence to acquire, and you can never be sure that ASI is only a couple of years away, so you never say, ‘Okay, time to shelve this influence-seeking and refocus on reducing x-risk.’ What in retrospect becomes known as crunch time comes and goes, and you lose your window of opportunity to put your influence to good use.
Classic murder-Gandhi
Scott Alexander (2012) tells the tale of murder-Gandhi:
Previously on Less Wrong’s The Adventures of Murder-Gandhi: Gandhi is offered a pill that will turn him into an unstoppable murderer. He refuses to take it, because in his current incarnation as a pacifist, he doesn’t want others to die, and he knows that would be a consequence of taking the pill. Even if we offered him $1 million to take the pill, his abhorrence of violence would lead him to refuse.
But suppose we offered Gandhi $1 million to take a different pill: one which would decrease his reluctance to murder by 1%. This sounds like a pretty good deal. Even a person with 1% less reluctance to murder than Gandhi is still pretty pacifist and not likely to go killing anybody. And he could donate the money to his favorite charity and perhaps save some lives. Gandhi accepts the offer.
Now we iterate the process: every time Gandhi takes the 1%-more-likely-to-murder-pill, we offer him another $1 million to take the same pill again.
Maybe original Gandhi, upon sober contemplation, would decide to accept $5 million to become 5% less reluctant to murder. Maybe 95% of his original pacifism is the only level at which he can be absolutely sure that he will still pursue his pacifist ideals.
Unfortunately, original Gandhi isn’t the one making the choice of whether or not to take the 6th pill. 95%-Gandhi is. And 95% Gandhi doesn’t care quite as much about pacifism as original Gandhi did. He still doesn’t want to become a murderer, but it wouldn’t be a disaster if he were just 90% as reluctant as original Gandhi, that stuck-up goody-goody.
What if there were a general principle that each Gandhi was comfortable with Gandhis 5% more murderous than himself, but no more? Original Gandhi would start taking the pills, hoping to get down to 95%, but 95%-Gandhi would start taking five more, hoping to get down to 90%, and so on until he’s rampaging through the streets of Delhi, killing everything in sight.
The parallel here is that you can ‘take the pill’ to gain some influence, at the cost of focusing a bit less on x-risk. Unfortunately, like Gandhi, once you start taking pills, you can’t stop—your values change and you care less and less about x-risk until you’ve slid all the way down the slope.
It could be your personal values that change: as you spend more time gaining influence amongst policy folks (say), you start to genuinely believe that unemployment is as important as x-risk, and that beating China is the ultimate goal.
Or, it could be your organisation’s values that change: You hire some folks for their expertise and connections outside of EA. These new hires affect your org’s culture. The effect is only slight, at first, but a couple of positive feedback cycles go by (wherein, e.g., your most x-risk-focused staff notice the shift, don’t like it, and leave). Before you know it, your org has gained the reach to impact x-risk, but lost the inclination to do so, and you don’t have enough control to change things back.
Social status misgeneralization
You and I, as humans, are hardwired to care about status. We often behave in ways that are about gaining status, whether we admit this to ourselves consciously or not. Fortunately, when surrounded by EAs, pursuing status is a great proxy for reducing x-risk: it is high status in EA to be a frugal, principled, scout mindset-ish x-risk reducer.
Unfortunately, now that we’re expanding our reach, our social circles don’t offer the same proxy. Now, pursuing status means making big, prestigious-looking moves in the world (and making big moves in AI means building better products or addressing hot-button issues, like discrimination). It is not high status in the wider world to be an x-risk reducer, and so we stop being x-risk reducers.
I have no real idea which of these failure modes is most common, although I speculate that it’s the last one. (I’d be keen to hear others’ takes.) Also, to be clear, I don’t believe the correct solution is to ‘stay small’ and avoid interfacing with the wider world. However, I do believe that these failure modes are easier to fall into than one might naively expect, and I hope that a better awareness of them might help us circumvent them.
For what it’s worth, I find some of what’s said in this thread quite surprising.
Reading your post, I saw you describing two dynamics:
1. Principles-first EA initiatives are being replaced by AI safety initiatives
2. AI safety initiatives founded by EAs, which one would naively expect to remain x-risk focused, are becoming safety-washed (e.g., your BlueDot example)
I understood @Ozzie’s first comment on funding to be about 1. But then your subsequent discussion with Ozzie seems to also point to funding as explaining 2.[1]
While Open Phil has opinions within AI safety that have alienated some EAs—e.g., heavy emphasis on pure ML work[2]—my impression was that they are very much motivated by ‘real,’ x-risk-focused AI safety concerns, rather than things like discrimination and copyright infringement. But it sounds like you might actually think that OP-funded AI safety orgs are feeling pressure from OP to be less about x-risk? If so, this is a major update for me, and one that fills me with pessimism.
- ^
For example, you say, “[OP-funded orgs] bow to incentives to be the very-most-shining star by OP’s standard, so they can scale up and get more funding. I would just make the trade off the other way: be smaller and more focused on things that matter.”
- ^
At the expense of, e.g., more philosophical approaches
Nice; this reminds me of @Raemon’s ‘The Mission and the Village’.
Do those other meditation centres make similarly extreme claims about the benefits of their programs? If so, I would be skeptical of them for the same reasons. If not, then the comparison is inapt.
Why would the comparison be inapt?
A load-bearing piece of your argument (insofar as I’ve understood it) is that most of the benefit of Jhourney’s teachings—if Jhourney is legit—can be conferred through non-interactive means (e.g., YouTube uploads). I am pointing out that your claim goes against conventional wisdom in this space: these other meditation centres believe (presumably), much like Jhourney does, that their teachings can’t be conferred well non-interactively. I’m not sure why the strength of claimed benefits would come into it?
(I will probably drop out of this thread now; I feel a bit weird about taking on this role of defending Jhourney’s position.)
What is the interactive or personalized aspect of the online “retreats”? Why couldn’t they be delivered as video on-demand (like a YouTube playlist), audio on-demand (like a podcast), or an app like Headspace or 10% Happier?
I mean, Jhourney is far from the only organisation that offers online retreats. Established meditation centres like Gaia House, Plum Village and Deconstructing Yourself—to name but a few—all offer retreats online (as well as in person).
If Jhourney’s house blend of jhana meditation makes you more altruistic, why wouldn’t the people who work at Jhourney try to share it widely with the world? That’s what I would do if I had developed a meditation program that I thought was really producing these sorts of results.
I think Jhourney’s website answers this. They say:
Jhourney’s initial product is a meditation retreat. In the past ~12 months, we’ve created a modern school for learning how to have joyful meditative experiences. We teach in a week what was previously thought to require hundreds or thousands of hours of practice. […]
While this is great progress, we see meditation retreats as just a stepping stone to building a bigger movement. We’re not simply a retreats company aspiring to teach thousands of people meditation. We’re an applied research company aspiring to change the lives of tens of millions.
[…]
From here, we’ll build a lab to research ways to make it easier and faster, inspiring more people to join the cause. Eventually, we’ll develop novel deeptech for wellbeing that goes beyond meditation retreats.
I personally wouldn’t bet on the neurotech approach working; however, I’m inclined to believe that Jhourney is making a sincere effort to share their findings with the world.
It also stokes the fires of my skepticism that this allegedly transformative knowledge is kept behind a $1,295 paywall.
I agree that it’s reasonable to be skeptical of paywalled content—there are all kinds of scams out there. But in Jhourney’s case, I expect they are putting their operating income towards their research lab. Note also that they offer need-based scholarships.
COI note: I attended an online Jhourney retreat last year.
I’m not Holly, but my response is that getting a pause now is likely to increase, rather than decrease, the chance of getting future pauses. Quoting Evan Hubinger (2022):
In the theory of political capital, it is a fairly well-established fact that ‘Everybody Loves a Winner.’ That is: the more you succeed at leveraging your influence to get things done, the more influence you get in return. This phenomenon is most thoroughly studied in the context of the ability of U.S. presidents to get their agendas through Congress—contrary to a naive model that might predict that legislative success uses up a president’s influence, what is actually found is the opposite: legislative success engenders future legislative success, greater presidential approval, and long-term gains for the president’s party.
I think many people who think about the mechanics of leveraging influence don’t really understand this phenomenon and conceptualize their influence as a finite resource to be saved up over time so it can all be spent down when it matters most. But I think that is just not how it works: if people see you successfully leveraging influence to change things, you become seen as a person who has influence, has the ability to change things, can get things done, etc. in a way that gives you more influence in the future, not less.
My sense is that this is a pretty major crux between my and Carl’s views.
Community Polls for the Community
It seems like I interpreted this question pretty differently to Michael (and, judging by the votes, to most other people). With the benefit of hindsight, it probably would have been helpful to define what percentage risk the midpoint (between agree and disagree) corresponds to?[1] Sounds like Michael was taking it to mean ‘literally zero risk’ or ‘1 in 1 million,’ whereas I was taking it to mean 1 in 30 (to correspond to Ord’s Precipice estimate for pandemic x-risk).
(Also, for what it’s worth, for my vote I’m excluding scenarios where a misaligned AI leverages bioweapons—I count that under AI risk. (But I am including scenarios where humans misuse AI to build bioweapons.) I would guess that different voters are dealing with this AI-bio entanglement in different ways.)
- ^
Though I appreciate that it was better to run the poll as is than to let details like this stop you from running it at all.
Meta: I’m seeing lots of blank comments in response to the DIY polls. Perhaps people are thinking that they need to click ‘Comment’ in order for their vote to count? If so, PSA: your vote counted as soon as you dropped your slider. You can simply close the pop-up box that follows if you don’t also mean to leave a comment.
Happy voting!
Consequentialists should be strong longtermists
For me, the strongest arguments against strong longtermism are simulation theory and the youngness paradox (as well as yet-to-be-discovered crucial considerations).[1]
(Also, nitpickily, I’d personally reword this poll from ‘Consequentialists should be strong longtermists’ to ‘I am a strong longtermist,’ because I’m not convinced that anyone ‘should’ be anything, normatively speaking.)
- ^
I also worry about cluelessness, though cluelessness seems just as threatening to neartermist interventions as it does to longtermist ones.
[Good chance you considered my idea already and rejected it (for good reason), but stating it in case not:]
For these debate week polls, consider dividing each side up into 10 segments, rather than 9? That way, when someone votes, they’re agreeing/disagreeing by a nice, round 10 or 20 or 30%, etc., rather than by the kinda random amounts (at present) of 11, 22, 33%?
I think Holly’s claim is that these people aren’t really helping from an ‘influencing the company to be more safety conscious’ perspective, or a ‘solving the hard parts of the alignment problem’ perspective. They could still be helping the company build commercially lucrative AI.
Yeah, thanks for pointing this out. With the benefit of hindsight, I’m seeing that there are really three questions I want answers to:
Where Isaac’s interpretation is towards 1, and your interpretation is towards 2.
The poll I’ve ended up running is essentially the above three questions rolled into one, with ~unknown amounts of each contributing to the results. This isn’t ideal (my bad!), but I think the results will still be useful, and there are already lots of votes (thank you, everyone, for voting!), so it’s too late to turn back now. I advise people to continue voting under whichever interpretation makes sense to them; the mods will have fun untangling the results.