Wei Dai
Thanks, lots of interesting articles in this list that I missed despite my interest in this area.
One suggestion I have is to add some studies of failed attempts at building/reforming institutions, otherwise one might get a skewed view of the topic. (Unfortunately I don’t have specific readings to suggest.)
A related topic you don’t mention here (maybe due to lack of writings on it?) is that maybe humanity should pause AI development and have a long (or even short!) reflection about what it wants to do next, e.g., resume AI development or do something else like subsidize intelligence enhancement (e.g., embryo selection) for everyone who wants it, so that more people can meaningfully participate in deciding the fate of our world. (I note that many topics on this reading list are impossible for most humans to fully understand, perhaps even with AI assistance.)
I claim that this area outscores regular AI safety on importance while being significantly more neglected
This neglect is itself perhaps one of the most important puzzles of our time. With AGI very plausibly just a few years away, why aren’t more people throwing money or time/effort at this cluster of problems just out of self interest? Why isn’t there more intellectual/academic interest in these topics, many of which seem so intrinsically interesting to me?
We have to make judgment calls about how to structure our reflection strategy. Making those judgment calls already gets us in the business of forming convictions. So, if we are qualified to do that (in “pre-reflection mode,” setting up our reflection procedure), why can’t we also form other convictions similarly early?
I’m very confused/uncertain about many philosophical topics that seem highly relevant to morality/axiology, such as the nature of consciousness and whether there is such a thing as “measure” or “reality fluid” (and if so what is it based on). How can it be right or safe to form moral convictions under such confusion/uncertainty?
It seems quite plausible that in the future I’ll have access to intelligence-enhancing technologies that will enable me to think of many new moral/philosophical arguments and counterarguments, and/or to better understand existing ones. I’m reluctant to form any convictions until that happens (or the hope of it ever happening becomes very low).
Also I’m not sure how I would form object-level moral convictions even if I wanted to. No matter what I decide today, why wouldn’t I change my mind if I later hear a persuasive argument against it? The only thing I can think of is to hard-code something to prevent my mind being changed about a specific idea, or to prevent me from hearing or thinking arguments against a specific idea, but that seems like a dangerous hack that could mess up my entire belief system.
Therefore, it seems reasonable/defensible to think of oneself as better positioned to form convictions about object-level morality (in places where we deem it safe enough).
Do you have any candidates for where you deem it safe enough to form object-level moral convictions?
I put the full report here so you don’t have to wait for them to email it to you.
Anyone with thoughts on what went wrong with EA’s involvement in OpenAI? It’s probably too late to apply any lessons to OpenAI itself, but maybe not too late elsewhere (e.g., Anthropic)?
While drafting this post, I wrote down and then deleted an example of “avoiding/deflecting questions about risk.” The person I asked such a question of is probably already trying to push their organization to take risks more seriously, and probably had their own political considerations for not answering, so I don’t want to single them out for criticism. I also don’t want to damage my relationship with this person, or make them want to engage less with me or people like me in the future.
Trying to enforce good risk management via social rewards/punishments might be pretty difficult for reasons like these.
My main altruistic endeavor involves thinking and writing about ideas that seem important and neglected. Here is a list of the specific risks that I’m trying to manage/mitigate in the course of doing this. What other risks am I overlooking or not paying enough attention to, and what additional mitigations should I be doing?
Being wrong or overconfident, distracting people or harming the world with bad ideas.
Think twice about my ideas/arguments. Look for counterarguments/risks/downsides. Try to maintain appropriate uncertainties and convey them in my writings.
The idea isn’t bad, but some people take it too seriously or too far.
Convey my uncertainties. Monitor subsequent discussions and try to argue against people taking my ideas too seriously or too far.
Causing differential intellectual progress in an undesirable direction, e.g., speeding up AI capabilities relative to AI safety, spreading ideas that are more useful for doing harm than doing good.
Check ideas/topics for this risk. Self-censor ideas or switch research topics if the risk seems high.
Being first to talk about some idea, but not developing/pursuing it as vigorously as someone else might if they were first, thereby causing a net delay in intellectual or social progress.
Not sure what to do about this one. So far not doing anything except to think about it.
PR/political risks, e.g., talking about something that damages my reputation or relationships, and in the worst case harms people/causes/ideas associated with me.
Keep this in mind and talk more diplomatically or self-censor when appropriate.
Managing risks while trying to do good
@Will Aldred I forgot to mention that I do have the same concern about “safety by eating marginal probability” on AI philosophical competence as on AI alignment, namely that progress on solving problems lower in the difficulty scale might fool people into having a false sense of security. Concretely, today AIs are so philosophically incompetent that nobody (or almost nobody) trusts them to do philosophy, but if they seemingly got better without really improving (or not improving enough relative to appearances), a lot more people might trust them, and it could be hard to convince them not to.
Thanks for the comment. I agree that what you describe is a hard part of the overall problem. I have a partial plan, which is to solve (probably using analytic methods) metaphilosophy for both analytic and non-analytic philosophy, and then use that knowledge to determine what to do next. I mean today the debate between the two philosophical traditions is pretty hopeless, since nobody even understands what people are really doing when they do analytic or non-analytic philosophy. Maybe the situation will improve automatically when metaphilosophy has been solved, or at least we’ll have a better knowledge base for deciding what to do next.
If we can’t solve metaphilosophy in time though (before AI takeoff), I’m not sure what the solution is. I guess AI developers use their taste in philosophy to determine how to filter the dataset, and everyone else hopes for the best?
Just talking more about this problem would be a start. It would attract more attention and potentially resources to the topic, and make people who are trying to solve it feel more appreciated and less lonely. I’m just constantly confused why I’m the only person who frequently talks about it in public, given how obvious and serious the problem seems to me. It was more understandable before ChatGPT put AI on everyone’s radar, but now it’s just totally baffling. And I appreciate you writing this comment. My posts on the topic usually get voted up, but with few supporting comments, making me unsure who actually agrees with me that this is an important problem to work on.
If you’re a grant maker, you can decide to fund research in this area, and make some public statements to that effect.
It might be useful to think in terms of an “AI philosophical competence difficulty scale,” similar to Sammy Martin’s AI alignment difficulty scale and “safety by eating marginal probability.” I tend to focus on the higher end of that scale, where we need to achieve a good explicit understanding of metaphilosophy, because I think solving that problem is the only way to reduce risk to a minimum, and it also fits my inclination for philosophical problems. But someone more oriented towards ML research could look for problems elsewhere on the difficulty scale, for example fine-tuning an LLM to do better philosophical reasoning, to see how far that can go. Another idea is to fine-tune an LLM for pure persuasion, and see if that can be used to create an AI that deemphasizes persuasion techniques that don’t constitute valid reasoning (by subtracting the differences in model weights somehow).
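One way to make the “subtracting the differences in model weights” idea concrete is the “task vector” approach from the model-editing literature: fine-tune a copy of a base model on persuasion, take the per-parameter difference from the base weights, and subtract a scaled copy of that difference from the base model. The sketch below is purely illustrative (not something the comment commits to), with toy scalar “weights” standing in for real tensors:

```python
# Toy sketch of task-vector negation. Hypothetical names; real model
# weights would be tensors (e.g., a PyTorch state_dict), not floats.

def task_vector(finetuned, base):
    """Per-parameter change introduced by fine-tuning."""
    return {name: finetuned[name] - base[name] for name in base}

def subtract_task_vector(base, vector, alpha=1.0):
    """Move the base weights away from the fine-tuned direction."""
    return {name: base[name] - alpha * vector[name] for name in base}

# Toy "models" with two named parameters each.
base_weights = {"w1": 0.5, "w2": -0.2}
persuasion_weights = {"w1": 0.9, "w2": 0.1}  # after persuasion fine-tuning

vec = task_vector(persuasion_weights, base_weights)
depersuaded = subtract_task_vector(base_weights, vec, alpha=0.5)
print({k: round(v, 6) for k, v in depersuaded.items()})
```

Whether negating a “persuasion direction” in weight space actually removes invalid persuasion techniques while preserving valid reasoning is exactly the kind of empirical question an ML-oriented researcher could test.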
Some professional philosopher(s) may actually be starting a new org to do research in this area, so watch out for that news and check how you can contribute. Again providing funding will probably be an option.
Think about social aspects of the problem. What would it take for most people or politicians to take the AI philosophical competence problem seriously? Or AI lab leaders? What can be done if they never do?
Think about how to evaluate (purported) progress in the area. Are there clever ways to make benchmarks that can motivate people to work on the problem (and not be easily Goodharted against)?
Just to reemphasize, talk more about the problem, or prod your favorite philosopher or AI safety person to talk more about it. Again, it’s totally baffling the degree to which nobody talks about this. I don’t think I’ve even once heard a professional philosopher publicly express a concern that AI might be relatively incompetent in philosophy, even as some opine freely on other aspects of AI. There are certainly obstacles to people working on the problem, like your reasons 1-3, but for now the bottleneck could just as well be the lack of social proof that the problem is worth working on.
How should we deal with the possibility/risk of AIs inherently disfavoring all the D’s that Vitalik wants to accelerate? See my Twitter thread replying to his essay for more details.
AI doing philosophy = AI generating hands?
The universe can probably support a lot more sentient life if we convert everything that we can into computronium (optimized computing substrate) and use it to run digital/artificial/simulated lives, instead of just colonizing the universe with biological humans. To conclude that such a future doesn’t have much more potential value than your 2010 world, we would have to assign zero value to such non-biological lives, or value each of them much less than a biological human, or make other very questionable assumptions. The Newberry 2021 paper that Vasco Grilo linked to has a section about this:
If a significant fraction of humanity’s morally-relevant successors were instantiated digitally, rather than biologically, this would have truly staggering implications for the expected size of the future. As noted earlier, Bostrom (2014) estimates that 10^35 human lives could be created over the entire future, given known physical limits, and that 10^58 human lives could be created if we allow for the possibility of digital persons. While these figures were not intended to indicate a simple scaling law, they do imply that digital persons can in principle be far, far more resource efficient than biological life. Bostrom’s estimate of the number of digital lives is also conservative, in that it assumes all such lives will be emulations of human minds; it is by no means clear that whole-brain emulation represents the upper limit of what could be achieved. For a simple example, one can readily imagine digital persons that are similar to whole-brain emulations, but engineered so as to minimise waste energy, thereby increasing resource efficiency.
I think, as a matter of verifiable fact, that if people solve the technical problems of AI alignment, they will use AIs to maximize their own economic consumption, rather than pursue broad utilitarian goals like “maximize the amount of pleasure in the universe”.
If you extrapolate this out to after technological maturity, say 1 million years from now, what does selfish “economic consumption” look like? I tend to think that people’s selfish desires will be fairly easily satiated once everyone is much much richer and the more “scalable” “moral” values would dominate resource consumption at that point, but it might just be my imagination failing me.
I think mundane economic forces are simply much more impactful.
Why does “mundane economic forces” cause resources to be consumed towards selfish ends? I think economic forces select for agents who want to and are good at accumulating resources, but will probably leave quite a bit of freedom in how those resources are ultimately used once the current cosmic/technological gold rush is over. It’s also possible that our future civilization uses up much of the cosmic endowment through wasteful competition, leaving little or nothing to consume in the end. Is that your main concern?
(By “wasteful competition” I mean things like military conflict, costly signaling, races of various kinds that accumulate a lot of unnecessary risks/costs. It seems possible that you categorize these under “selfishness” whereas I see them more as “strategic errors”.)
To be sure, ensuring AI development proceeds ethically is a valuable aim, but I claim this goal is *not* the same thing as “AI alignment”, in the sense of getting AIs to try to do what people want.
There was at least one early definition of “AI alignment” to mean something much broader:
The “alignment problem for advanced agents” or “AI alignment” is the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.
I’ve argued that we should keep using this broader definition, in part for historical reasons, and in part so that AI labs (and others, such as EAs) can more easily keep in mind that their ethical obligations/opportunities go beyond making sure that AI does what people want. But it seems that I’ve lost that argument so it’s good to periodically remind people to think more broadly about their obligations/opportunities. (You don’t say this explicitly, but I’m guessing it’s part of your aim in writing this post?)
(Recently I’ve been using “AI safety” and “AI x-safety” interchangeably when I want to refer to the “overarching” project of making the AI transition go well, but I’m open to being convinced that we should come up with another term for this.)
That said, I think I’m less worried than you about “selfishness” in particular and more worried about moral/philosophical/strategic errors in general. The way most people form their morality is scary to me, and personally I would push humanity to be more philosophically competent/inclined before pushing it to be less selfish.
Not exactly what you’re asking for, but you could use it as a reference for all of the significant risks that different people have brought up, to select which ones you want to further research and address in your response post.
Tell me more about these “luxurious AI safety retreats”? I haven’t been to an AI safety workshop in several years, and wonder if something has changed. From searching the web, I found this:
and this:
I was there for an AI workshop earlier this year in Spring and stayed for 2 or 3 days, so let me tell you about the ‘luxury’ of the ‘EA castle’: it’s a big, empty, cold, stone box, with an awkward layout. (People kept getting lost trying to find the bathroom or a specific room.) Most of the furnishings were gone. Much of the layout you can see in Google Maps was nonfunctional, and several wings were off-limits or defunct, so in practice it was maybe a quarter of the size you’d expect from the Google Maps overview. There were clearly extensive needs for repair and remodeling of a lot of ancient construction, and most of the gardens are abandoned as too expensive to maintain. It is, as a real estate agent might say, very ‘historical’ and a ‘good fixer-upper’.
And not much visible evidence of luxury.
I thought about this and wrote down some life events/decisions that probably contributed to becoming who I am today.
Immigrating to the US at age 10 knowing no English. Social skills deteriorated while learning language, which along with lack of cultural knowledge made it hard to make friends during teenage and college years, which gave me a lot of free time that I filled by reading fiction and non-fiction, programming, and developing intellectual interests.
Was heavily indoctrinated with Communist propaganda while in China, but leaving meant I then had no viable moral/philosophical/political foundations. Parents were too busy building careers as new immigrants and didn’t try to teach me values/traditions. So I had a lot of questions that I didn’t have ready answers to, which perhaps contributed to my intense interest in philosophy (ETA: and economics and game theory).
Had an initial career in cryptography, but found it a struggle to compete with other researchers on purely math/technical skills. Realized that my comparative advantage was in more conceptual work. Crypto also taught me to be skeptical of my own and other people’s ideas.
Had a bad initial experience with academic research (received nonsensical peer review when submitting a paper to a conference) so avoided going that route. Tried various ways to become financially independent, and managed to “retire” in my late 20s to do independent research as a hobby.
A lot of these can’t really be imitated by others (e.g., I can’t recommend people avoid making friends in order to have more free time for intellectual interests). But here is some practical advice I can think of:
I think humanity really needs to make faster philosophical progress, so try your hand at that even if you think of yourself as more of a technical person. Same may be true for solving social/coordination problems. (But see next item.)
Somehow develop a healthy dose of self-skepticism so that you don’t end up wasting people’s time and attention arguing for ideas that aren’t actually very good.
It may be worth keeping an eye out for opportunities to “get rich quick” so you can do self-supported independent research. (Which allows you to research topics that don’t have legible justifications or are otherwise hard to get funding for, and pivot quickly as the landscape and your comparative advantage both change over time.)
ETA: Oh, here’s a recent LW post where I talked about how I arrived at my current set of research interests, which may also be of interest to you.
You seem to be assuming that people’s extrapolated views in world A will be completely uncorrelated with their current views/culture/background, which seems a strange assumption to make.
People’s extrapolated views could be (in part) selfish or partial, which is an additional reason that extrapolated views of you at different times may be closer than that of strangers.
People’s extrapolated views not converging doesn’t directly imply “it’s much much less likely that the world we end up with even if we save it is close to the ideal one by my lights” because everyone could still get close to what they want through trade/compromise, or you (and/or others with extrapolated views similar to yours) could end up controlling most of the future by winning the relevant competitions.
It’s not clear that applying a heavy discount to world A makes sense, regardless of the above, because we’re dealing with “logical risk” which seems tricky in terms of decision theory.