I’m pretty confused about what’s going on here. The person who made this accusation made it on Twitter under their real name using an unlocked account, and the accusation remains public to date. Is the concern here that the accused did not previously know of the accusation against them, but would be made aware of it by this discussion?
(I’m not sure whether I’d want them named in absence of a request to the contrary, but I don’t understand the implied threat model and think other explanations for the request are plausible, given the whole “public tweet” thing.)
There’s definitely no censorship of the topic on LessWrong. Obviously I don’t know for sure why discussion is sparse, but my guess is that people mostly (and, in my opinion, correctly) don’t think it’s a particularly interesting or fruitful topic to discuss on LessWrong, or that the degree to which it’s an interesting subject is significantly outweighed by mindkilling effects.
Edit: with respect to the rest of the comment, I disagree that rationalists are especially interested in object-level discussion of the subjects, but they probably are much more likely to disapprove of the idea that discussion of the subject should be verboten.

I think the framing where Bostrom’s apology is a subject which has to be deliberately ignored is mistaken. Your prior for whether something sees active discussion on LessWrong should be that it doesn’t, because most things don’t, unless there’s a specific reason you’d expect it to be of interest to the users there. I admit I haven’t seen a compelling argument for there being a teachable moment here, except the obvious “don’t do something like that in the first place”, and perhaps “have a few people read over your apology with a critical eye before posting it” (assuming that didn’t in fact happen). I’m sure you could find a way to tie those in to the practice of rationality, but it’s a bit of a stretch.
> When they asked a different Bay Area rationality organiser, they were told that their talk on diversity may have been “epistemically weak” and “not truth-seeking” enough.
So, to clarify, a guess from an unrelated party about why this talk might have resulted in a lack of an invitation pattern-matched to language used by other people in a way that has no (obvious to me) relationship to blacklists...?
I’m not sure what this was intended to demonstrate.
I am curious how you would distinguish a blacklist from the normal functioning of an organization when making hiring decisions. I guess maybe “a list of names with no details as to why you want to avoid hiring them” passed around between organizations would qualify as the first but not the second? I obviously can’t say with surety that no such thing exists elsewhere, but I would be pretty surprised to learn about any major organizations using one.
I’m not discussing naming the accuser, but the accused.
I do not think we have an obligation to avoid discussing object-level details of sexual assault claims when those claims have already been made publicly, if it seems like discussing them would otherwise be useful.
> We can either become a movement of people who seem dedicated to a particular set of conclusions about the world, or we can become a movement of people united by a shared commitment to using reason and evidence to do the most good we can.
> The former is a much smaller group, easier to coordinate our focus, but it’s also a group that’s more easily dismissed. People might see us as a bunch of nerds[1] who have read too many philosophy papers[2] and who are out of touch with the real world.
> The latter is a much bigger group.
I’m aware that this is not exactly the central thrust of the piece, but I’d be interested if you could expand on why we might expect the former to be a smaller group than the latter.
I agree that a “commitment to using reason and evidence to do the most good we can” is a much better target to aim for than “dedicated to a particular set of conclusions about the world”. However, my sense is that historically there have been many large and rapidly growing groups of people that fit the second description, and not very many of the first. I think this was true for mechanistic reasons related to how humans work rather than being accidents of history, and think that recent technological advances may even have amplified the effects.
I think the modal no-Anthropic counterfactual does not have an alignment-agnostic AI company that’s remotely competitive with OpenAI, which means there’s no external target for this Amazon investment. It’s not an accident that Anthropic was founded by former OpenAI staff who were substantially responsible for OpenAI’s earlier GPT scaling successes.
> At best, these theory-first efforts did very little to improve our understanding of how to align powerful AI. And they may have been net negative, insofar as they propagated a variety of actively misleading ways of thinking both among alignment researchers and the broader public. Some examples include the now-debunked analogy from evolution, the false distinction between “inner” and “outer” alignment, and the idea that AIs will be rigid utility maximizing consequentialists (here, here, and here).
Random aside, but I think this paragraph’s core argument (that the referenced theory-first efforts propagated actively misleading ways of thinking about alignment) is unjustified, and none of the citations provide the claimed support.
The first post (re: evolutionary analogy as evidence for a sharp left turn) sees substantial pushback in the comments; that pushback seems more correct to me than not, and in any case the post seems to misunderstand the position it’s arguing against.
The second post presents an interesting case for a set of claims that are different from “there is no distinction between inner and outer alignment”; I do not consider it to be a full refutation of that conceptual distinction. (See also Steven Byrnes’ comment.)
The third post is at best playing games with the definitions of words (or misunderstanding the thing it’s arguing against), at worst is just straightforwardly wrong.
I have less context on the fourth post, but from a quick skim of both the post and the comments, I think the way it’s most relevant here is as a demonstration of how important it is to be careful and precise with one’s claims. (The post is not making an argument about whether AIs will be “rigid utility maximizing consequentialists”, it is making a variety of arguments about whether coherence theorems necessarily require that whatever ASI we might build will behave in a goal-directed way. Relatedly, Rohin’s comment a year after writing that post indicated that he thinks we’re likely to develop goal-directed agents; he just doesn’t think that’s entailed by arguments from coherence theorems, which may or may not have been made by e.g. Eliezer in other essays.)
My guess is that you did not include the fifth post as a smoke test to see if anyone was checking your citations, but I am having trouble coming up with a charitable explanation for its inclusion in support of your argument.
I’m not really sure what my takeaway is here, except that I didn’t go scouring the essay for mistakes—the citation of Quintin’s post was just the first thing that jumped out at me, since that wasn’t all that long ago. I think the claims made in the paragraph are basically unsupported by the evidence, and the evidence itself is substantially mischaracterized. Based on other comments it looks like this is true of a bunch of other substantial claims and arguments in the post.[1]

- ^ Though I’m sort of confused about what this back-and-forth is talking about, since it’s referencing behind-the-scenes stuff that I’m not privy to.
I do not think the orthogonality thesis is a motte-and-bailey. The only evidence I know of that suggests that the goals developed by an ASI trained with something resembling modern methods would by default be picked from a distribution that’s remotely favorable to us is the evidence we have from evolution[1], but I really think that ought to be screened off. The goals developed by various animal species (including humans) as a result of evolution are contingent on specific details of various evolutionary pressures and environmental circumstances, which we know with confidence won’t apply to any AI trained with something resembling modern methods.
Absent a specific reason to believe that we will be sampling from an extremely tiny section of an enormously broad space, why should we believe we will hit the target?
Anticipating the argument that, since we’re doing the training, we can shape the goals of the systems—this would certainly be reason for optimism if we had any idea what goals we would see emerge while training superintelligent systems, and had any way of actively steering those goals to our preferred ends. We don’t have either, right now.
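To put a toy number on “extremely tiny section of an enormously broad space” (a geometric illustration with made-up parameters, not a claim about the actual structure of goal-space): if a goal specification has $d$ independent dimensions and the acceptable region covers a fraction $s$ of each, an uninformed draw lands in the target with probability

$$s^d$$

so even a generous $s = 0.5$ with $d = 100$ gives $2^{-100} \approx 10^{-30}$. Any optimism has to come from arguing that training concentrates the sampling distribution near the target, which is exactly the step we currently have no story for.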
- ^ Which, mind you, is still unfavorable; I think the goals of most animal species, were they to be extrapolated outward to superhuman levels of intelligence, would not result in worlds that we would consider very good. Just not nearly as unfavorable as the actual distribution I think we’re facing.
Many things about this comment seem wrong to me.
> Yudkowsky’s suggestions seem entirely appropriate if you truly believe, like him, that AI x-risk is probability ~100%.
These proposals would plausibly be correct (to within an order of magnitude) in terms of the appropriate degree of response even at much lower probabilities of doom (e.g. 10-20%). I think you need to actually run the math to say that this doesn’t make sense.
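Running a toy version of that math (all numbers invented purely for illustration): a costly policy is justified when the risk reduction it buys, $\Delta p$, times the value at stake, $V$, exceeds its cost $C$:

$$\Delta p \cdot V > C$$

Moving from $P(\text{doom}) \approx 1$ to $P(\text{doom}) = 0.15$ caps the achievable $\Delta p$ at $0.15$, i.e. it shrinks the justified expenditure by less than an order of magnitude, which is the sense in which the proposals remain approximately correct at 10-20%.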
> unproven and unlikely assumptions, like that an AI could build nanofactories by ordering proteins to be mixed over email
This is a deeply distorted understanding of Eliezer’s threat model, which is not any specific story that he can tell, but the brute fact that something smarter than you (and him, and everyone else) will come up with something better than that.
> In the actual world, where the probability of extinction is significantly less than 100%, are these proposals valuable?
I do not think it is ever particularly useful to ask “is someone else’s conclusion valid given my premises, which are importantly different from theirs”, if you are attempting to argue against someone’s premises. Obviously “A ⇒ B” & “C” does not imply “B”, and it especially does not imply “~A”.
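For concreteness, the two non-implications, with counterexample truth-value assignments (standard propositional logic, added here for illustration):

$$\{A \Rightarrow B,\; C\} \not\models B \qquad (A = \bot,\ B = \bot,\ C = \top)$$
$$\{A \Rightarrow B,\; C\} \not\models \lnot A \qquad (A = \top,\ B = \top,\ C = \top)$$

Here $A$ stands for “extinction is ~certain”, $B$ for “these proposals are warranted”, and $C$ for the commenter’s own premise that extinction is significantly less than certain.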
> It seems like they will just get everyone else labelled luddites and fearmongerers, especially if years and decades go by with no apocalypse in sight.
This is an empirical claim about PR, which:

- does not seem obviously correct to me
- has little to say about the object-level arguments
- falls into a pattern of suggesting that people should optimize for how others perceive us, rather than optimizing for communicating our true beliefs about the world.
Let’s taboo the word “care”. I expect the average longtermist thinks that deaths from famines and floods are about as bad as the average non-longtermist EA does. Problems do not become “less bad” simply because other problems exist.
Having different priorities, stemming from different beliefs about e.g. what things matter and how effectively we can address them, is orthogonal to relative evaluations of how bad any individual problem is.
I think people should take a step back and take a bird’s-eye view of the situation:
- The author persistently conflates multiple communities: “tech, EA (Effective Altruists), rationalists, cybersecurity/hackers, crypto/blockchain, Burning Man camps, secret parties, and coliving houses”. In the Bay Area, “tech” is literally a double-digit percentage of the population.
- The first archived snapshot of the website of the author’s consultancy (“working with survivors, communities, institutions, and workplaces to prevent and heal from sexual harassment and sexual assault”) was recorded in August 2022.
- According to the CEA Community Health team: “The author emailed the Community Health team about 7 months ago, when she shared some information about interpersonal harm; someone else previously forwarded us some anonymous information that she may have compiled. Before about 7 months ago, we hadn’t been in contact with her.”
  - This would have been late July 2022.
- From the same comment by the CEA Community Health team: “We have emailed the author to tell her we will not be contracting her services.”
  - Implied: the author attempted to sell her professional services to CEA.
- The author, in the linked piece: “To be clear, I’m not advocating bans of the accused or accusers—I am advocating for communities to do more, for thorough investigations by trained/experienced professionals, and for accountability if an accusation is found credible. Untrained mediators and community representatives/liaisons who are only brought on for their popularity and/or nepotistic ties to the community, without thought to expertise, experience, or qualifications, such as the one in the story linked above (though there are others), often end up causing the survivors greater trauma.” (Emphasis mine.)
- The author: “In February 2023, I calculated that I personally knew of/dealt with thirty different incidents in which there was a non-trivial chance the Centre for Effective Altruism or another organization(s) within the EA ecosystem could potentially be legally liable in a civil suit for sexual assault, or defamation/libel/slander for their role/action (note: I haven’t added the stories I’ve received post-February to this tally, nor do I know if counting incidents an accurate measure (eg, accused versus accusers) also I’ve gotten several stories since that time; nor is this legal advice and to get a more accurate assessment, I’d want to present the info to a legal team specializing in these matters). Each could cost hundreds of thousands and years to defend, even if they aren’t found liable. Of course, without discovery, investigation, and without consulting legal counsel, this is a guess/speculative, and I can’t say whether they’d be liable or rise to the level of a civil suit—not with certainty without formal legal advice and full investigations.” (Emphasis in original.)
- The author: “In response to my speculation, the community health team denied they knew of my work prior to August 2022, and that it was not connected to EA. Three white community health team members have strongly insinuated that I’ve lied and treated me – an Asian-American – in much the gaslighting, silencing way that survivors reporting rape fear being treated. Many of the women who have publicly spoken up about sexual misconduct in EA are of Asian descent. As I stated in the previous paragraph, I haven’t yet consulted with lawyers, but I personally believe this is defamatory. Additionally, the Centre and Effective Ventures Foundation are headquartered in a jurisdiction that is much more harsh on defamation than the one I’m in.” (Emphasis in original.)
- The author: “Unlike most of these mediators and liaisons, I have training/formal education, mentorship, and years of specific experience. If/When I choose to consult with lawyers about the events described in the paragraph above, there might be a settlement if my speculations of liability are correct (or just to silence me on the sexual misconduct and rapes I do know of). If (again, speculative) that doesn’t happen and we continue into a discovery process, I’m curious as to what could be uncovered.” (Emphasis in original.)
I don’t doubt that the author cares about preventing sexual assault, and mitigating the harms that come from it. They do also seem to care about something that requires dropping dark hints of potential legal remedies they might pursue, with scary-sounding numbers and mentions of venue-shopping attached to them.
There is no button you can press on demand to publish an article in either a peer-reviewed journal or a mainstream media outlet.
Publishing pieces in the media (with minimal 3rd-party editing) is at least tractable on the scale of weeks, if you have a friendly journalist. The academic game is one to two orders of magnitude slower than that. If you want to communicate your views in real-time, you need to stick to platforms which allow that.
I do think media comms is a complementary strategy to direct comms (which MIRI has been using, to some degree). But it’s difficult to escape the fact that information posted on LW, the EA forum, or Twitter (by certain accounts) makes its way down the grapevine to relevant decision-makers surprisingly often, given how little overhead is involved.
> Importantly, switching to Signal has no comparable costs. The only cost I can think of is that the UX (User Experience) might be slightly better for Facebook Messenger than Signal.
Have you conducted a user survey to this effect? I personally find Signal’s UX to be substantially worse than Messenger’s (for the relevant use-cases), and strongly expect that most people who’ve used both would have similar feelings.
I also think this significantly overstates the potential risk reduction: incautious users are the least likely to switch to Signal, so the gains are mostly limited to users who are already more careful by nature.
The argument w.r.t. capabilities is disanalogous.
Yes, the training process is running a search where our steering is (sort of) effective for getting capabilities—though note that with e.g. LLMs we have approximately zero ability to reliably translate known inputs [X] into known capabilities [Y].
We are not doing the same thing to select for alignment, because “alignment” is:

- an internal representation that depends on multiple unsolved problems in philosophy, decision theory, epistemology, math, etc., rather than “observable external behavior” (which is what we use to evaluate capabilities & steer training)
- something that might be inextricably tied to the form of general intelligence which by default puts us in the “dangerous capabilities” regime, or if not strongly bound in theory, then strongly bound in practice
I do think this disagreement is substantially downstream of a disagreement about what “alignment” represents, i.e. I think that you might attempt outer alignment of GPT-4 but not inner alignment, because GPT-4 doesn’t have the internal bits which make inner alignment a relevant concern.
He proposes instituting an international treaty, which seems to be aiming for the reference class of existing treaties around the proliferation of nuclear and biological weapons. He is not proposing that the United States issue unilateral threats of nuclear first strikes.
“I don’t currently have much sympathy for someone who’s highly confident that AI takeover would or would not happen (that is, for anyone who thinks the odds of AI takeover … are under 10% or over 90%).”
I find this difficult to square with the fact that:
- Absent highly specific victory conditions, the default (P = 1 - ε) outcome is takeover.
- Of the three possibilities you list, interpretability seems like the only one that’s actually seen any traction, but:
  - there hasn’t actually been very much progress beyond toy problems
  - it’s not clear why we should expect it to generalize to future paradigms
  - we have no idea how to use any “interpretation” to actually get to a better endpoint
  - interpretability, by itself, is insufficient to avoid takeover, since you lose as soon as any other player in the game messes up even once (a toy model of this is sketched below)
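A toy model of that last bullet (parameters invented purely for illustration, not anyone’s actual estimates): if each of $n$ relevant actors independently avoids a fatal mistake with probability $1 - p$ in each of $T$ deployment periods, the probability that nobody ever messes up is

$$(1 - p)^{nT}$$

e.g. $p = 0.01$, $n = 50$, $T = 10$ gives $0.99^{500} \approx 0.007$. Interpretability that only protects the careful actors barely moves this number.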
The other potential hopes you enumerate require people in the world to attempt to make a specific thing happen. For most of them, not only is practically nobody working on making any of those specific things happen, but many people are actively working in the opposite direction. In particular, with respect to the “Limited AI” hope, the leading AI labs are pushing quite hard on generality, rather than on narrow functionality. This has obviously paid off in terms of capability gains over “narrow” approaches. Being able to imagine a world where something else is happening does not tell us how to get to that world.
I can imagine having an “all things considered” estimate (integrating model uncertainty, other people’s predictions, etc.) of under 90% for failure. But I don’t understand writing off the epistemic position of someone who has an “inside view” estimate of >90% failure, especially given the enormous variation in the probability distributions that people have over timelines (which I agree are an important, though not overwhelming, factor when it comes to estimating chances of failure). Indeed, an “extreme” inside-view estimate conditional on short timelines seems much less strange to me than a “moderate” one. (The only way a “moderate” estimate makes sense to me is if it’s downstream of predicting the odds of success for a specific research agenda, such as in John Wentworth’s The Plan − 2022 Update, and I’m not even sure one could justifiably give a specific research agenda 50% odds of success nearly a decade out as the person who came up with it, let alone anyone looking in from the outside.)
If an alignment-minded person is currently doing capabilities work under the assumption that they’d be replaced by an equally (or more) capable researcher less concerned about alignment, I think that’s badly mistaken. The number of people actually pushing the frontier forward is not all that large. Researchers at that level are not fungible; the differences between the first-best and second-best available candidates for roles like that are often quite large. The framing of an arms race is mistaken; the prize for “winning” is that you die sooner. Dying later is better. If you’re in a position like that I’d be happy to talk to you, or arrange for you to talk to another member of the Lightcone team.
I do not significantly credit the possibility that Google (or equivalent) will try to make life difficult for people who manage to successfully convince the marginal capabilities researcher to switch tracks, absent evidence. I agree that historical examples of vaguely similar things exist, but the ones I’m familiar with don’t seem analogous, and we do in fact have fairly strong evidence about the kinds of antics that various megacorps get up to, which seem to be strongly predicted by their internal culture.
ETA: feel free to ignore the below, given your caveat, though if you later choose to write an expanded form of any of the arguments, you may find it helpful to have some early objections.
Correct me if I’m wrong, but it seems like most of these reasons boil down to not expecting AI to be superhuman in any relevant sense (since if it is, effectively all of them break down as reasons for optimism)? To wit:
Resource allocation is relatively equal (and relatively free of violence) among humans because even humans that don’t very much value the well-being of others don’t have the power to actually expropriate everyone else’s resources by force. (We have evidence of what happens when those conditions break down to any meaningful degree; it isn’t super pretty.)
I do not think GPT-4 is meaningful evidence about the difficulty of value alignment. In particular, the claim that “GPT-4 seems to be honest, kind, and helpful after relatively little effort” seems to be treating GPT-4's behavior as meaningfully reflecting its internal preferences or motivations, which I think is “not even wrong”. I think it’s extremely unlikely that GPT-4 has preferences over world states in a way that most humans would consider meaningful, and in the very unlikely event that it does, those preferences almost certainly aren’t centrally pointed at being honest, kind, and helpful.
re: endogenous response to AI—I don’t see how this is relevant once you have ASI. To the extent that it might be relevant, it’s basically conceding the argument: that the reason we’ll be safe is that we’ll manage to avoid killing ourselves by moving too quickly. (Note that we are currently moving at pretty close to max speed, so this is a prediction that the future will be different from the past. One that some people are actively optimizing for, but also one that other people are optimizing against.)
re: perfectionism—I would not be surprised if many current humans, given superhuman intelligence and power, created a pretty terrible future. Current power differentials do not meaningfully let individual players flip every single other player the bird at the same time. Assuming that this will continue to be true is again assuming the conclusion (that AI will not be superhuman in any relevant sense). I also feel like there’s an implicit argument here about how value isn’t fragile that I disagree with, but I might be reading into it.
I’m not totally sure what analogy you’re trying to rebut, but I think that human treatment of animal species, as a piece of evidence for how we might be treated by future AI systems that are analogously more powerful than we are, is extremely negative, not positive. Human efforts to preserve animal species are a drop in the bucket compared to the casual disregard with which we optimize over them and their environments for our benefit. I’m sure animals sometimes attempt to defend their territory against human encroachment. Has the human response to this been to shrug and back off? Of course, there are some humans who do care about animals having fulfilled lives by their own values. But even most of those humans do not spend their lives tirelessly optimizing for their best understanding of the values of animals.
Just noting that this reply seems to be, to me, very close to content-free, in terms of addressing object-level concerns. I think you could compress it to “I did due diligence” without losing very much.
If you’re constrained in your ability to discuss things on the object-level, e.g. due to promises to keep certain information secret, or other considerations like “discussing policy work in advance of it being done tends to backfire”, I would appreciate that being said explicitly. As it is, I can’t update very much on it.
ETA: to be clear, I’m not sure how I feel about the broader norm of requesting costly explanations when something looks vaguely off. My first instinct is “against”, but if I were to adopt a policy of not engaging with such requests (unless they actually managed to surface something I’d consider a mistake I didn’t realize I’d made), I’d make that policy explicit.