Doing alignment research with Vivek Hebbar’s team at MIRI.
Thomas Kwa
Fixed
Misleading phrase in a GiveWell Youtube ad
What do you mean by “compassionate”?
Should the EA Forum team stop optimizing for engagement?
I heard that the EA Forum team tries to optimize the forum for engagement (tests features to see if they improve engagement). There are positives to this, but on net it worries me. Taken to the extreme, this is a destructive practice, as it would:
normalize and encourage clickbait;
cause thoughtful comments to be replaced by louder and more abundant voices (for a constant time spent thinking, you can post either 1 thoughtful comment or several hasty comments. Measuring session length fixes this but adds more problems);
cause people with important jobs to spend more time on EA Forum than is optimal;
prevent community members and “EA” itself from keeping their identities small, as politics is an endless source of engagement;
distract from other possible directions of improvement, like giving topics proportionate attention, adding epistemic technology like polls and prediction market integration, improving moderation, and generally increasing quality of discussion.
I’m not confident that EA Forum is getting worse, or that tracking engagement is currently net negative, but we should at least avoid failing this exercise in Goodhart’s Law.
I was thinking of reasons why I feel like I get less value from EA Forum. But this is not the same as reasons EAF might be declining in quality, so the original list would miss more insidious (to me) mechanisms by which EAF could actually be getting worse. For example, I often read something like “EA Forum keeps accumulating more culture/jargon; this is questionably useful, but posts not using the EA dialect are received increasingly poorly.” There are probably more that I can’t think of, and it’s harder for me to judge these...
Yeah, I don’t think it’s possible for controlled substances due to the tighter regulation.
The epistemic spot checker could also notice flaws in reasoning; I think Rohin Shah has done this well.
This is also on LessWrong.
Note that people in US/UK and presumably other places can buy drugs on the grey market (e.g. here) for less than standard prices. Although I wouldn’t trust these 100%, they should be fairly safe because they’re certified in other countries like India; gwern wrote about this here for modafinil and the basic analysis seems to hold for many antidepressants. The shipping times advertised are fairly long but potentially still less hassle than waiting for a doctor’s appointment for each one.
Thanks. It looks reassuring that the correlations aren’t as large as I thought. (How much variance is in the first principal component in log odds space though?) And yes, I now think the arguments I had weren’t so much for arithmetic mean as against total independence / geometric mean, so I’ll edit my comment to reflect that.
The main assumption of this post seems to be that not only are the true values of the parameters independent, but also that a given person’s estimates of the stages are independent. This is a judgment call I’m weakly against.
Suppose you put equal weight on the opinions of Aida and Bjorn. Aida gives 10% for each of the 6 stages, and Bjorn gives 99%, so that Aida has an overall x-risk probability of 10^-6 and Bjorn has around 94%.
If you just take the arithmetic mean between their overall estimates, it’s like saying “we might be in worlds where Aida is correct, or worlds where Bjorn is correct”.
But if you take the geometric mean or decompose into stages, as in this post, it’s like saying “we’re probably in a world where each of the bits of evidence Aida and Bjorn have towards each proposition is independently 50% likely to be valid, so Aida and Bjorn are each more correct about 2-4 stages”.
These give you vastly different results, 47% vs roughly 0.4% (see the numerical sketch after the two points below). Which one is right? I think there are two related arguments to be made against the geometric mean, although they don’t push me all the way towards using the arithmetic mean:
Aida and Bjorn’s wildly divergent estimates probably come from some underlying difference in their models of the world, not from independent draws. In this case, where Aida is more optimistic than Bjorn on each of the 6 stages, it is unlikely that this is due to independent draws. I think this kind of multidimensional difference in optimism between alignment researchers is actually happening, so any model should take this into account.
If we learn that Bjorn was wrong about stage 1, then we should put less weight on his estimates for stages 2-6. (My guess is there’s some copula that corresponds to a theoretically sensible way to update away from Bjorn’s position treating his opinions as partially correlated, but I don’t know enough statistics)
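To make the 47% vs 0.4% comparison above concrete, here is a minimal sketch in Python. The per-stage numbers are the hypothetical Aida/Bjorn ones; I’m assuming the geometric mean is taken in odds space (which reproduces the ~0.4% figure), not that this is exactly how the post computes it.

```python
import math

def to_odds(p):
    return p / (1 - p)

def from_odds(o):
    return o / (1 + o)

n_stages = 6
aida = 0.10 ** n_stages    # ~1e-6 overall x-risk
bjorn = 0.99 ** n_stages   # ~0.94 overall x-risk

arith = (aida + bjorn) / 2                                  # ~47%
geo = from_odds(math.sqrt(to_odds(aida) * to_odds(bjorn)))  # ~0.4%

print(f"arithmetic mean: {arith:.1%}, geometric mean (odds space): {geo:.2%}")
```

The point is just how far apart the two aggregation rules land: the arithmetic mean is dominated by whichever disjunct assigns higher probability, while the odds-space geometric mean splits the difference in log odds.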
Probabilities of probabilities can make sense if you specify what they’re over. Say the first level is the difficulty of the alignment problem, and the second one is our actions. The betting odds on doom collapse to a single number, but you can still say meaningful things, e.g. if we think there’s a 50% chance alignment is 1% x-risk and a 50% chance it’s 99% x-risk, then the tractability is probably low either way (e.g. if you think the success curve is logistic in effort).
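To illustrate the “low tractability either way” point, here is a minimal sketch assuming the success curve is logistic in effort (my reading of the parenthetical, not a model stated in the comment). The slope of a logistic, written in terms of the current success probability p, is proportional to p(1 − p), which is small when p is near 1% or 99%:

```python
def marginal_success(p, scale=1.0):
    """Slope of an assumed logistic success curve with respect to effort,
    expressed via the current success probability p: p * (1 - p) / scale."""
    return p * (1 - p) / scale

# In the 50/50 mixture above, alignment is either ~1% x-risk (success prob ~99%)
# or ~99% x-risk (success prob ~1%); marginal effort buys little in either world
# compared to the mid-curve maximum of 0.25 / scale.
for p in (0.01, 0.50, 0.99):
    print(f"success probability {p:.0%}: marginal effect ~ {marginal_success(p):.4f}")
```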
Quiet comments
The ability to submit a comment without it showing up in “Recent Discussion”. Among other things, this would allow discussion of controversial content without it stealing our collective attention from good content. Moderation already helps with this, but I would still have a use for quiet comments.
Surely reducing the number of players, making it more likely that US entities develop AGI (who might be more or less careful, more or less competent, etc. than Chinese entities), and (perhaps) increasing conflict all matter for alignment? There are several factors here that push in opposite directions, and this comment is not an argument for why the sum is zero to negative.
List of reasons I think EA takes better actions than most movements, in no particular order:
taking weird ideas seriously; being willing to think carefully about them and dedicate careers to them
being unusually goal-directed
being unusually truth-seeking
this makes debates non-adversarial, which is easy mode
openness to criticism, plus a decent method of filtering it
high average intelligence. Doesn’t imply rationality but doesn’t hurt.
numeracy and scope-sensitivity
willingness to use math in decisions when appropriate (e.g. EV calculations) is only part of this
less human misalignment: EAs have similar goals and so EA doesn’t waste tons of energy on corruption, preventing corruption, negotiation, etc.
relative lack of bureaucracy
various epistemic technologies taken from other communities: double-crux, forecasting
ideas from EA and its predecessors: crucial considerations, the ITN framework, etc.
taste: EAs are able to (hopefully correctly) allocate more resources to AI alignment than to overpopulation or energy decline, for reasons not explained by the above.
Structured debate mechanisms are not on this list, and I doubt they would make a huge difference because the debates are non-adversarial, but if one could be found it would be a good addition to the list, and therefore a source of lots of positive impact.
I think the marginal value of donating now is low, perhaps even lower than on the average day. From the article you linked:
In the hours after the amber alert was announced, the Give Blood website appeared to be inundated with people wanting to book appointments.
People landing on the homepage were told they were in a “queue” before being able to choose a date and location for their donation.
I have some qualms with the survey wording.
Conditional on a Misaligned AGI being exposed to high-impact inputs, it will scale (in aggregate) to the point of permanently disempowering roughly all of humanity
I answered 70% for this question, but the wording doesn’t feel quite right. I put >80% that a sufficiently capable misaligned AI would disempower humanity, but the first AGI deployed is likely not to be maximally capable unless takeoff is really fast. It could be unable either to initiate a pivotal act/process or to disempower humanity; then, over the next days to years (depending on takeoff speed), different systems could become powerful enough to disempower humanity.
One way in which Unaligned AGI might cease to be a risk is if we develop a test for Misalignment, such that Misaligned AGIs are never superficially attractive to deploy. What is your best guess for the year when such a test is invented?
Such a test might not end the acute risk period, because people might not trust the results and could still deploy misaligned AGI. The test would also have to extrapolate into the real world, farther than any currently existing benchmark. It would probably need to rely on transparency tools far in advance of what we have today, and because this region of the transparency tech tree also contains alignment solutions, the development of this test should not be treated as uncorrelated with other alignment solutions.
Even then, I also think there’s a good chance this test is very difficult to develop before AGI. The misalignment test and alignment problem aren’t research problems that we are likely to solve independently of AGI, they’re dramatically sped up by being able to iterate on AI systems and get more than one try on difficult problems.
Also, conditional on aligned ASI being deployed, I expect this test to be developed within a few days. So the question should say “conditional on AGI not being developed”.
One way in which Unaligned AGI might cease to be a risk is if we have a method which provably creates Aligned AGIs (‘solving the Alignment Problem’). What is your best guess for the year when this is first accomplished?
I.E. The year when it becomes possible (not necessarily practical / economic) to build an AGI and know it is definitely Aligned.
Solving the alignment problem doesn’t mean we can create a provably aligned AGI. Nate Soares says:
Following Eliezer, I think of an AGI as “safe” if deploying it carries no more than a 50% chance of killing more than a billion people:
When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, “please don’t disassemble literally everyone with probability roughly 1” is an overly large ask that we are not on course to get. [...] Practically all of the difficulty is in getting to “less than certainty of killing literally everyone”. Trolley problems are not an interesting subproblem in all of this; if there are any survivors, you solved alignment. At this point, I no longer care how it works, I don’t care how you got there, I am cause-agnostic about whatever methodology you used, all I am looking at is prospective results, all I want is that we have justifiable cause to believe of a pivotally useful AGI ‘this will not kill literally everyone’.
Notably absent from this definition is any notion of “certainty” or “proof”. I doubt we’re going to be able to prove much about the relevant AI systems, and pushing for proofs does not seem to me to be a particularly fruitful approach (and never has; the idea that this was a key part of MIRI’s strategy is a common misconception about MIRI).
(I’m helping Vivek and Nate run the consequentialist cognition MATS stream)
Yes, both of those are correct. The formatting got screwed up in a conversion, and should be fixed soon.
In the future, you could send Vivek or me a DM to contact our project specifically. I don’t know what the official channel for general questions about MATS is.
It’s free on Coinbase and FTX.
Manifold markets related to this: