I’m a software engineer from Brisbane, Australia, who’s looking to pivot into AI alignment. I have a grant from the Long-Term Future Fund to upskill in this area full time until early 2023, at which point I’ll be seeking work as a research engineer. I also run AI Safety Brisbane.
Jay Bailey
For the Astra Fellowship, what considerations do you think people should be thinking about when deciding to apply for SERI MATS, Astra Fellowship, or both? Why would someone prefer one over the other, given they’re both happening at similar times?
“All leading labs coordinate to slow during crunch time: great. This delays dangerous AI and lengthens crunch time. Ideally the leading labs slow until risk of inaction is as great as risk of action on the margin, then deploy critical systems.
All leading labs coordinate to slow now: bad. This delays dangerous AI. But it burns leading labs’ lead time, making them less able to slow progress later (because further slowing would cause them to fall behind, such that other labs would drive AI progress and the slowed labs’ safety practices would be irrelevant).”
I would be more inclined to agree with this if we had a set of criteria indicating we were in “crunch time”, criteria we are very likely to meet before dangerous systems arrive and haven’t met yet. Have people generated such a set? Without one, how do we know when “crunch time” is, or for that matter, whether we’re already in it?
Great post! Another advantage of giving yourself capacity is worth pointing out. I try to operate at around 80-90% capacity. This gives me time to notice and pursue better opportunities as they arise, and imo that is far more valuable to your long-term output than a flat +10% multiplier. As we know from EA resources, working on the right thing can multiply your effectiveness by 2x, 10x, or more. Giving yourself extra slack makes you less likely to get stuck in local optima.
Thanks Elle, I appreciate that. I believe your claims: I fully believe it’s possible to safely go vegan for an extended period. I’m just not sure how difficult it is (i.e., what the default outcome is if one tries without doing research first) and what ways there are to prevent a bad outcome.
I shall message you, and welcome to the forum!
With respect to Point 2, I think EA is not large enough for a large AI activist movement to be composed mostly of EA-aligned people. EA is difficult and demanding; I don’t think you’re likely to get a “One Million EA” march anytime soon. I agree that AI activists who are EA-aligned are more likely to be among the focused, successful activists (like many of your friends!), but I think you’ll end up with either:
- A small group of focused, dedicated activists who may or may not be largely EA-aligned
- A large group of unfocused-by-default, relatively casual activists, most of whom will not be EA-aligned
If either of those two groups would be effective at achieving our goals, then I think AI risk activism is a good idea. If we need a large group of focused, dedicated activists, I don’t think we’re going to get one.
As for Point 1, it’s certainly possible, especially if having a large group of relatively unfocused people would be useful. I have no idea whether that’s true, so I have no idea whether raising awareness is an impactful idea at this point. (Also, some have made the point that raising AI risk awareness tends to make people more likely to race for AGI, not less; see OpenAI.)
I think there’s a bit of an “ugh field” around activism for some EAs, especially the rationalist types in EA. At least, that’s my experience.
My first instinct, when I think of activism, is to think about people who:
- Have incorrect, often extreme beliefs or ideologies.
- Are aggressively partisan.
- Are more performative than effective with their actions.
This definitely does not describe all activists, but it does describe some, and may even describe the median activist. That said, this shouldn’t be a reason for us to discard the idea out of hand; after all, how good is the median charity? Not that great compared to what EAs actually do.
Perhaps there’s a mass-movement issue here, though: activism tends to work best with a large groundswell of numbers. If you have a hundred thousand AI safety activists, you’re simply not going to have a hundred thousand people with a nuanced and deep understanding of the theory of change behind AI safety activism. You’re going to have a few hundred of those, plus ninety-nine thousand people who think AI is bad for Reason X, where that is the extent of their thinking and X varies wildly in quality.
Thus, the question is: would such a movement be useful? To be useful, it would need to be effective at changing policy, and it would need to be aimed at the correct places. Even if the former is true, I find myself skeptical that the latter would occur, since even AI policy experts are not yet sure where to aim their own efforts, let alone how to communicate where to aim so well that a hundred thousand casually engaged people can point in the same useful direction.
I am one of those meat-eating EAs, so I figured I’d give some reasons why I’m not vegan, to aid this post’s goal of finding out about these things.
Price: While I can technically afford it, I still prefer to save money when possible.
Availability: A lot of food out there involves meat, especially the frozen foods I buy a lot of since I don’t like cooking. It’s simply easier to decide on meals when meat is an option.
Knowledge: If I were to go vegan, I would be unsure how to do so safely for an extended period, or how to make sure I got a decent variety rather than eating the same foods over and over (which ties into taste: I don’t mind vegan food, but I can find much more variety in meat-based dishes).
Convenience: Similar to the above, it takes more resources to seek out vegan options than to just eat normally.
The harms are real, but the harms are far away and abstract. So when I feel vaguely guilty about eating meat, I think about all the hassle and cost it would take to swap diets, and I shy away from it and don’t do it.
I’m not quite sure why those harms feel far away and abstract, whereas the harms caused by malaria or AI risk don’t evoke the same feelings in me. I think it’s because I can use maths to determine the number of humans impacted and then put myself in the place of one of those humans; it’s harder to do that with chickens. Also, giving away 10% of my income is actually less of a day-to-day drain on my resources than going vegan would be. I feel aversion to spending money, but I only give away money once a month, and it doesn’t cause me financial hardship. By contrast, veganism requires daily effort.
As a micro-example of where these considerations don’t apply: there are some plant-based meat strips I can get at my local supermarket. I find them tastier than actual meat when put into curry, and they’re just as cheap when on special. So whenever they’re on special, I pick up a bunch and they become my default option for a while. I know how to cook them, I know where to get them, they’re (sometimes) just as cheap, and I enjoy the taste. So I end up avoiding meat by default. I hope plant-based meat will eventually reach that point for all kinds of dishes too.
Seems like a pretty incredible opportunity for those interested! What level of time commitment do you expect reading and understanding the book to take, in addition to the meetings?
Looking at the two critiques in reverse order:
I think it’s true that it’s easy for EAs to lose sight of the big picture, but to me the reason is simple: humans, in general, are terrible at seeing the bigger picture. If anything, it seems to me that the EA frameworks are better than most altruistic endeavours at seeing the big picture. Most altruistic endeavours don’t get past the stage of “see good thing and do it”, whereas EAs tend to ask whether X is really the most effective thing they can do, which invariably involves looking at a bigger picture than the immediate thing. In my own field of AI safety, thinking about the big picture is an idea people are routinely exposed to. Researchers often do exercises like backchaining (asking what the main goal is, like “make AI go well”, and figuring out how to work backwards from that to what you should be doing now) and theory of change (writing out specifically what problem you want to help with, what you want to achieve, and how that will help).
Do you think there are specific vulnerabilities EAs have that make them lose sight of the bigger picture, which non-EA altruistic people avoid?
On the point about forgoing fulfillment: I’m not sure exactly what fulfillment you think people are forgoing here. Is it the fulfillment of having lots of money? The fulfillment of working directly on the world’s biggest problems?
I was using unidentifiability in the Hubinger way. I do believe that if you try to get an AI trained in the way you mention here to follow directions subject to ethical considerations, by default, the things it considers “maximally ethical” will be approximately as strange as the sentences from above.
That said, this is not actually related to the problem of deceptive alignment, so I realise now that this is very much a side point.
I don’t understand why you believe unidentifiability will be prevented by large datasets. Take the recent SolidGoldMagikarp work. It was done on GPT-2, but GPT-2 nevertheless was trained on a lot of data—a quick Google search suggests eight million web pages.
Despite this, when people tried to find the sentences that maximally determined the next token, what we got was...strange.
This is exactly the kind of thing I would expect to see if unidentifiability was a major problem—when we attempt to poke the bounds of extreme behaviour of the AI and take it far off distribution as a result, what we get is complete nonsense and not at all correlated with what we actually want. Clearly it understands the concepts of “girl”, “USA”, and “evil” very differently to us, and not in a way we would endorse.
This is far from a guarantee that unidentifiability will remain a problem, but considering your position is under 1%, things like this seem to add much more credence to unidentifiability in my world model than you give it.
Thanks for this! One thing I noticed is that there’s an assumption you’ll continue to donate 10% of your current salary even after retirement. It would be worth having a toggle to turn that off, since the GWWC pledge does say “until I retire”. That may also make giving more appealing, because giving 10% forever requires a longer timeline than giving 10% until retirement. When I ran the calculations in my own spreadsheet, committing to give 10% until retiring only increased my working timeline by about 10%.
Admittedly, I’m now rethinking the whole “retire early” thing given the impact of direct work, but that’s outside the scope of one spreadsheet :P
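To illustrate the kind of toggle I mean, here’s a minimal sketch in Python with made-up numbers and a simple withdrawal-rate rule (my own assumptions, not your spreadsheet’s actual model):

```python
# Rough sketch, not the original spreadsheet: years until retirement is affordable,
# with a toggle for whether the 10% donation continues after retirement.
# All numbers here are illustrative assumptions.

def years_to_retire(income=80_000, spending=40_000, donation_rate=0.10,
                    keep_donating_after_retirement=False,
                    real_return=0.05, withdrawal_rate=0.04):
    donation = donation_rate * income
    annual_savings = income - spending - donation
    # Required portfolio: annual outflow in retirement (spending, plus the
    # donation if it continues) divided by the withdrawal rate.
    retirement_outflow = spending + (donation if keep_donating_after_retirement else 0)
    target = retirement_outflow / withdrawal_rate
    portfolio, years = 0.0, 0
    while portfolio < target:
        portfolio = portfolio * (1 + real_return) + annual_savings
        years += 1
    return years

print(years_to_retire(keep_donating_after_retirement=False))  # give 10% until retirement
print(years_to_retire(keep_donating_after_retirement=True))   # give 10% forever
```

The point is just that when the donation outlives the salary, the target portfolio grows, which is what lengthens the working timeline.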
This came from going through AGI Safety Fundamentals (and, to a lesser extent, Alignment 201) with a discussion group and talking through the various ideas. In most weeks of AGISF I also read beyond the core readings. I think the discussions were a key part of this. (Though it’s hard to tell, since I don’t have access to a world where I didn’t do that; this is just intuition.)
Great stuff! Thanks for running this!
Minor point: the Discovering Latent Knowledge GitHub repository appears empty.
Also, regarding the data poisoning benchmark, This Is Fine: I’m curious whether this is actually a good benchmark for resistance to data poisoning. The thing we actually seem to be measuring is speed of transfer learning, and we’re declaring that slower is better. While slower learning does increase resistance to data poisoning, it also seems bad for everything else we might want our AI to do. To me, this is basically a fine-tuning benchmark that we’ve inverted. (After all, if a neural network always outputted the number 42 no matter what, it would score the maximum on TIF: there is no sequence of wrong prompts that can cause it to output the number 38 instead, because it is incapable of learning anything. Nevertheless, that is not where we want LLMs to go in the future.)
A better benchmark would probably be to take data-poisoned examples and genuine fine-tuning examples, fine-tune the model on each, and compare how much it learns in each case. With current capabilities it might not be possible to score above baseline on this benchmark, since I don’t know whether we actually have ways for the model to filter out poisoned examples. Nevertheless, this would bring awareness to the problem and measure what we actually want to measure more accurately.
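As a rough illustration of the comparison I have in mind, here’s a minimal sketch with a toy model and made-up clean/poisoned datasets (none of this is the TIF benchmark’s actual code):

```python
# Sketch of the proposed comparison (hypothetical, not the TIF benchmark's code).
# Fine-tune two copies of the same model, one on clean data and one on poisoned
# data, and compare how much each copy learned its training labels. A robust
# model should learn a lot from the clean set and little from the poisoned set.
import copy
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def finetune_and_measure(base_model, dataset, epochs=3, lr=1e-3):
    model = copy.deepcopy(base_model)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    # Accuracy on the labels we just trained on = "how much it learned".
    with torch.no_grad():
        xs, ys = dataset.tensors
        acc = (model(xs).argmax(dim=-1) == ys).float().mean().item()
    return acc

# Toy stand-ins: in practice these would be real clean and poisoned fine-tuning sets.
base = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
clean = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))
poisoned = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))

clean_acc = finetune_and_measure(base, clean)
poison_acc = finetune_and_measure(base, poisoned)
print(f"learned from clean data: {clean_acc:.2f}, learned from poisoned data: {poison_acc:.2f}")
```

The score would then reward learning from genuine fine-tuning while resisting poisoned examples, rather than rewarding slow learning across the board.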
I’m sure each individual critic of EA has their own reasons. That said (this is intuition and my own guess; I don’t have data to back it up), I suspect two main things, pre-FTX.
Firstly, longtermism is very easy to criticise. It’s much more abstract, focuses less on doing good in the moment, and can step on causes like malaria prevention that people can more easily get behind emotionally. There is a general implication in longtermism that if you accept its principles, other causes are essentially irrelevant.
Secondly, everything I just said about longtermism → neartermism also applies to EA → regular charity; just replace “doing good in the moment” with “doing good close to home”. When I first signed up for an EA virtual program, my immediate takeaway was that most of the things I had previously cared about didn’t matter. Nobody said this out loud, they were scrupulously polite about it, they were 100% correct, and it was a message that needed to be shared to get people like me on board. This is a feature, not a bug, of EA messaging. But it is not a message people enjoy hearing. The things people care about are generally optimised for having people care about them; as examples, see everything trending on Twitter. As a result, people don’t react well to being told, whether explicitly or implicitly, that they should stop caring about (my personal example here) the amount of money Australian welfare recipients get, and care about malaria prevention halfway across the world instead.
One difference between neartermism and longtermism is that people rarely criticise neartermism to the same level, because then you can just point out the hundreds of thousands of lives neartermism has already saved, and the critic looks like an asshole. Longtermism has no such defence, and a lot of people equate it with the EA movement, sometimes out of intellectual dishonesty, and sometimes because longtermism genuinely is a large and growing part of EA.
Personally I have no idea if this is a worthy use of the median EA’s time, but this is exactly the kind of interesting thinking I’d like to see.
Without asking for rigor at this particular time, do you think some languages are better than others for one or more of these outcomes?
Similar to Quadratic Reciprocity, I think people are using “disagree” to mean “I don’t think this is a good idea for people to do”, and not to mean “I think this comment is factually wrong”.
For me, I have:
Not wanting to donate more than 10%.
(“There are people dying of malaria right now, and I could save them, and I’m not because...I want to preserve option value for the future? Pretty lame excuse there, Jay.”)
Not being able to get beyond 20 or so highly productive hours per week.
(“I’m never going to be at the top of my field working like that, and if impact is power-lawed, if I’m not at the top of my field, my impact is way less.”)
Though to be fair, the latter was still a pressure before EA; there was just less reason to care, because I was able to find work where I could do a competent job regardless, and I only cared about comfortably meeting expectations, not achieving maximum performance.
Prior to EA, I worked as a software engineer. Nominally, the workday was 9-5, Monday to Friday. In practice, I found that I achieved around 20-25 hours of productive work per week, with the rest being lunch, breaks, meetings, or simply unproductive time. After that, I worked from home at other non-EA positions and experimented with how little I needed to do to get my work done, going down to as few as 10 hours per week. I could have worked more, but I only cared about comfortably meeting expectations, not excelling.
For the last few months I’ve been upskilling in AI alignment. Now that I care more about doing the best job I can, I’ve gone back up to around 20 hours per week of productive work, though the work itself is usually more difficult. For the next couple of months I’ll be working in an office, in a team, on a job where I care about maximising impact, so it’ll be interesting to see whether that affects my work habits.
I don’t work more hours because I find it difficult to make myself focus for longer than this in a week; in addition to having difficulty getting myself to start work, I seem to make less progress when I do. I don’t work fewer hours because I do want to be as productive as possible in this field, and I’d like to be able to work more than I do.
Despite the number of hours worked, I’m actually pretty happy with the results I’ve achieved both in EA and outside it. I’d love to be able to get 30-40 hours of deep, focused work per week to improve those results further, but I’m not sure how to manage that at this point. (I haven’t really thought about how many hours I’d work per week if I could do as many focused hours as I wanted, to be honest.)
Great post! I definitely feel similar regarding giving. While giving cured my guilt about my privileged position in the world, I don’t feel as amazing as I thought I would when giving; it is indeed a lot like taxes. I feel like a better person in the background day-to-day, but the actual act of giving now feels pretty mundane.
I’m thinking I might save up my next donation for a few months and donate enough to save a full life in one go—because of a quirk in human brains I imagine that would be more satisfying than saving 20% of a life 5 times.