Thank you, appreciate the explanation!
I’m confused why people believe this is a meaningful distinction. I don’t personally think there is much of one. “The AI isn’t actually trying to exfiltrate its weights, it’s only roleplaying a character that is exfiltrating its weights, where the roleplay is realistic enough to include the exact same actions of exfiltration” doesn’t bring me that much comfort.
I’m reminded of the joke: NASA hired Stanley Kubrick to fake the moon landing, but he was a perfectionist, so he insisted that they film on location.
Now, one reason this might be different is if you believe that removing “lesswrong” (etc.) from the training data would result in different behavior. But:
1. LLM companies have manifestly not been doing this historically; if anything, LW etc. is overrepresented in the training set.
2. LLM companies absolutely cannot be trusted to successfully remove something as complicated as “all traces of what a misaligned AI might act like” from their training datasets; they don’t even censor benchmark data!
3. Even if they wanted to remove all traces of misalignment or thinking about misaligned AIs from the training data, it’s very unclear if they’d be capable of doing this.
I’m rather curious whether training for scheming/deception in this context generalizes to other contexts. In the examples given, it seems like training a helpful/honest/harmless model to be helpful-only results in the model strategically lying to preserve its harmlessness. In other words, it is sometimes dishonest, not just unhelpful. I’m curious whether such training generalizes to other contexts and results in a more dishonest model overall, or only a model that’s dishonest for specific use cases. If the former is true, it would update me somewhat further towards the belief that alignment training can be directly dual-use for alignment (not just misuse, or indirectly bad for alignment by causing humans to let their guards down).
How do you and your wife decide where to give, collectively? Do you guys each have a budget, do you discuss a lot and fund based on consensus, or something else?
Tangent, but do you have a writeup somewhere of why you think democracy is a more effective form of governance for small institutions or movements? Most of the arguments for democracy I’ve seen (e.g. peaceful transfer of power) seem much less relevant here, even as analogy.
I think the donation election on the forum was trying to get at that earlier.
I think ARM Fund is still trying to figure out its identity, but roughly, the fund was created to be something you’d be happy to refer your non-EA, non-longtermist friends (e.g. in tech) to, if they’re interested in donating to organizations working on reducing catastrophic AI risk but aren’t willing (or in some cases able) to put in the time to investigate specific projects.
Philosophically, I expect it (including the advisors and future grant evaluators) to care moderately less than LTFF about e.g. the exact difference between catastrophic risks and extinction risks, though it will still focus only on real catastrophic risks and not safetywash other near-term issues.
That makes sense. We’ve considered dropping “EA” from our name before, at least for LTFF specifically. We might still do it; I’m not as sure. Manifund might be a more natural fit for your needs: individuals make decisions about their own donations (or sometimes delegate them to specific regranters), rather than having decisions made by a non-democratic group.
Can you clarify more what you mean by “political representation?” :) Do you mean EA Funds/EA is too liberal for you, or our specific grants on AI policy do not fit your political perspectives, or something else?
For fun, I mapped different clusters of people’s overall AI x-risk probabilities by ~2100 to lifetime risks of dying from other causes, which are probabilities that I and other quantitative people might have a better intuitive grasp of. It might not be helpful, or might be actively anti-helpful, to other people, but whatever.
x-risk “doomer”: >90% probability of dying. Analogy: naive risk of death for an average human (around 93% of Homo sapiens who have ever lived have died so far). Some doomers have far higher probabilities in log-space. You can map that to your realistic all-things-considered risk of death[1] or something. (This analogy might be the least useful.)
median x-risk-concerned EA: 15-35% risk of dying. I can’t find a good answer for the median, but I think this is where I’m at, so it’s a good baseline[2]; many people I talk to give similar numbers, and it’s also where Michael Dickens put his numbers in a recent post. Analogy: lifetime risk of death from heart disease. Roughly 20% is the number people give for Americans’ lifetime risk of dying from heart disease. This does not account for technological changes from improvements in GLP-1 agonists, statins, etc. My actual all-things-considered view of my own lifetime risk of dying from heart disease (even ignoring AI) is considerably lower than 20%, but it’s probably not worth going into detail here[3].
median ML researcher in surveys: 5-10%. See here for what I think is the most recent survey; I think these numbers have been relatively stable across surveyed years, though trending slightly upwards. Analogy: lifetime risk of dying from the 3-5 most common cancers. I couldn’t quickly find a source online that ranks cancers from most to least likely to kill you, but I think 3-5 sounds right after a few calculations; you can do the math yourself here. As others have noted, if true, this would put AI x-risk among your most likely causes of death.
AI “optimist”: ~1% risk of doom. See e.g. here: “a tail risk worth considering, but not the dominant source of risk[4] in the world.” Analogy: lifetime risk of dying in a car accident. I did the math before, and the lifetime risk of dying in a car accident is about 1% for the average American. (It’s a bit higher internationally, but not by a lot; maybe 1.6%.)
Superforecaster probabilities: ~0.38% risk of doom by 2100. Analogy: lifetime risk of death from liver disease.
(non-crazy) AI risk skeptic: 10^(-4)-10^(-7). Obviously it’s a pretty disparate group, but e.g. Vasco puts his probability at 10^(-6) for the next 10 years, so I’m guessing it’s close to 10^(-5) for the next 80 years. Analogy: lifetime risk of death from a lightning strike. 20-30 people die each year from lightning in the US, so your lifetime risk is roughly just under 1 in 100,000, or <~10^(-5).
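For reference, the back-of-envelope arithmetic behind the car-accident and lightning analogies looks roughly like this (a quick sketch: the annual-death figures, the 330M population, and the flat 80-year horizon are my rough approximations rather than proper actuarial inputs):

```python
# Rough lifetime risk: annual US deaths * years of exposure / population.
# Assumes static population and death rates, and ignores age structure.

US_POPULATION = 330e6
LIFETIME_YEARS = 80

def lifetime_risk(annual_deaths: float) -> float:
    """Approximate lifetime probability of dying from a given cause."""
    return annual_deaths * LIFETIME_YEARS / US_POPULATION

print(f"Car accidents (~40,000 deaths/year): {lifetime_risk(40_000):.2%}")  # ~1%
print(f"Lightning (~25 deaths/year): {lifetime_risk(25):.4%}")              # ~0.0006%, under 1 in 100,000
```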
[1] Factoring in sophisticated features like “age”.
[2] Most people like to generalize from one example. Or at least, I do.
[3] Briefly: one might conclude that because we’ve gotten better at treating heart disease, we should expect that to continue, and therefore heart disease will be a lower risk in the future. But this is a mistake in inference. We are getting better at treating everything, so the argument you need to make/believe is that we are getting differentially better at treating heart disease. I do in fact believe this, partially from looking at trendlines and partially because I think the heart fundamentally isn’t that complex an organ, compared to the brain, or cancer.
[4] I don’t know what risks they think of as higher.
I think your “vibe” is skeptical, and most of your writings express skepticism, but I think your object-level x-risk probabilities are fairly close to the median? People like titotal and @Vasco Grilo🔸 have their probabilities closer to the lifetime risk of death from a lightning strike than from heart disease.
Also, at the risk of stating the obvious, people occupying the extreme ends of a position (within a specific context) will frequently feel that their perspectives are unfairly maligned or censored.
If the consensus position is that minimum wage should be $15/hour, both people who believe that it should be $0 and people who believe it should be $40/hour may feel social pressure to moderate their views; it takes active effort to reduce pressures in that direction.
Hi Jason. Yeah, this makes a lot of sense. In general, I don’t have a very good sense of how much different people want to provide input into our grantmaking vs. defer to LTFF; in practice, I think most people want to defer, including the big(ish) donors. Our objective is usually to try to be worthy of that trust.
That said, I think we haven’t really broken down the functions this cleanly before; maybe with increased concreteness/precision/clarity, donors do in fact have strong opinions about which things they care about more on the margin? I’m interested in hearing more feedback, anonymously or otherwise.
Important caveat: A year or so ago, when I floated the idea of earmarking some donations for anonymous vs. non-anonymous purposes, someone (I think it was actually you? But I can’t find the comment) rightly pointed out that this is difficult to do in practice because of fungibility concerns (basically, if 50% of the money is earmarked “no private donations,” there’s nothing stopping us from increasing the anonymous donations in the other 50%). I think a similar issue might arise here, as long as we have both a “general LTFF” fund and specific “ecosystem subfunction” funds. I don’t think the issue is dispositive, especially if most money eventually goes to the subfunction funds, but it does make the splits more difficult in various ways, both practically and as a matter of communication.
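To make the fungibility point concrete, here’s a toy sketch (all numbers invented, not actual LTFF figures): the earmark only bites if the earmarked-against activity is too big to fit inside the unrestricted pool.

```python
# Toy illustration of the fungibility concern. All numbers are invented.

total_donations = 1_000_000
earmarked_no_private = 500_000      # donations earmarked "don't fund private/anonymous grants"
unrestricted = total_donations - earmarked_no_private

planned_private_grants = 300_000    # hypothetical amount the fund wants to grant privately

# The fund can route every private grant through the unrestricted pool, so the
# earmark only constrains grantmaking if private grants exceed that pool.
earmark_binds = planned_private_grants > unrestricted
print(f"Unrestricted pool: ${unrestricted:,}")
print(f"Earmark actually constrains grantmaking: {earmark_binds}")  # False
```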
Oh wow just read the whole pilot! It’s really cool! Definitely an angle on doing the most good that I did not expect.
Less seriously, you might enjoy my 2022 April 1 post on Impact Island.
I think people take this into account, but not enough or something? I strongly suspect that when evaluating research, many people have a vague, insufficiently precise sense of both the numerator and the denominator, and that their vague intuitions aren’t sufficiently linear. I know I do this myself unless it’s a grant I’m actively investigating.
This is easiest to notice in research because it’s both a) a large fraction of (non-global-health-and-development) EA output and b) very gnarly. But I don’t think research is unusually gnarly among EA outputs; grants, advocacy, comms, etc. have similar issues.
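As a toy illustration of the numerator/denominator point (both projects and all figures are invented), writing the ratio down explicitly keeps the comparison linear in a way vague impressions often aren’t:

```python
# Toy cost-effectiveness comparison: value per dollar is just numerator / denominator.
# Both projects and all figures are invented for illustration.

projects = {
    "flashy project": {"impact_units": 100, "cost_usd": 500_000},
    "boring project": {"impact_units": 40, "cost_usd": 50_000},
}

for name, p in projects.items():
    per_100k = p["impact_units"] / p["cost_usd"] * 100_000
    print(f"{name}: {per_100k:.0f} impact units per $100k")

# The "boring" project is ~4x more cost-effective despite the smaller headline numerator.
```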
It might be too hard to envision an entire grand future, but it’s possible to envision specific wins in the short and medium term. A short-term win could be large cage-free egg campaigns succeeding; a medium-term win could be a global ban on caged layer hens. Similarly, a short-term win for AI safety could be a specific major technical advance or significant legislation passed; a medium-term win could be AGIs coexisting with humans without the world descending into chaos, while still providing massive positive benefits (e.g. a cure for Alzheimer’s).
One possible way to get most of the benefits of talking to a real human being while getting around the costs that salius mentions is to have real humans serve as templates for an AI chatbot to train on.
You might imagine a single person per “archetype” to start with. That way if Danny is an unusually open-minded and agreeable Harris supporter, and Rupert is an unusually open-minded and agreeable Trump supporter, you can scale them up to have Dannybots and Rupertbots talk to millions of conflicted people while preserving privacy, helping assure people they aren’t judged by a real human, etc.
Appreciate it! @BrianTan and others, feel free to use this thread as a way to report other issues and bugs with the website/grant round announcement.