I’m a software engineer from Brisbane, Australia who’s looking to pivot into AI alignment. I have a grant from the Long-Term Future Fund to upskill in this area full time until early 2023, at which point I’ll be seeking work as a research engineer. I also run AI Safety Brisbane.
Jay Bailey
This is exactly right, and the main reason I wrote this up in the first place. I wanted this to serve as a data point for people to be able to say "Okay, things have gone a little off the rails, but things aren't yet worse than they were for Jay, so we're still probably okay." Note that it is good to have a plan for when you should give up on the field, too; it should just have some resilience and allowance for failure baked in. My plan was loosely "If I can't get a job in the field, and I fail to get funded twice, I will leave the field".
Also contributing to positive selection effects is that you're more likely to see the more impressive results in the field, precisely because they're more impressive. That gives your brain a skewed idea of what the median person in the field is doing: it conflates "the average piece of alignment research we see" with "the average output of alignment researchers".
The counterargument to this is "Well, shouldn't we be aiming for better than median? Shouldn't these impressive pieces be our targets to reach?" I think so, yes, but I believe in incremental ambition as well: if one is below average in the field, aiming to be median first, then good, then top-tier seems to me a more reasonable approach than trying to be top-tier immediately.
Reflections on my first year of AI safety research
Welcome to the Forum!
This post falls into a pretty common Internet failure mode, which is so ubiquitous outside of this forum that it’s easy to not realise that any mistake has even been made—after all, everyone talks like this. Specifically, you don’t seem to consider whether your argument would convince someone who genuinely believes these views. I am only going to agree with your answer to your trolley problem if I am already convinced invertebrates have no moral value...and in that case, I don’t need this post to convince me that invertebrate welfare is counterproductive. There isn’t any argument for why someone who does not currently agree with you should change their mind.
It is worth considering what specific reasons people who care about invertebrate welfare have for that view, and trying to answer those reasons directly. This requires putting yourself in their shoes and trying to understand why they might consider invertebrates to have actual moral worth.
“So what’s the problem? Why don’t I just let the invertebrate-lovers go do their thing, while I do mine? The problem is that those arguing for the invertebrate cause as an issue of moral importance have brought bad arguments to the table.”
This is much more promising, and I’d like to see actual discussion of what these arguments are, and why they’re bad.
Great post! I feel similarly regarding giving: while it cured my guilt about my privileged position in the world, I don't feel as amazing as I thought I would when giving. It is indeed a lot like taxes. I feel like a better person in the background of day-to-day life, but the actual giving now feels pretty mundane.
I’m thinking I might save up my next donation for a few months and donate enough to save a full life in one go—because of a quirk in human brains I imagine that would be more satisfying than saving 20% of a life 5 times.
For the Astra Fellowship, what considerations do you think people should be thinking about when deciding to apply for SERI MATS, Astra Fellowship, or both? Why would someone prefer one over the other, given they’re both happening at similar times?
“All leading labs coordinate to slow during crunch time: great. This delays dangerous AI and lengthens crunch time. Ideally the leading labs slow until risk of inaction is as great as risk of action on the margin, then deploy critical systems.
All leading labs coordinate to slow now: bad. This delays dangerous AI. But it burns leading labs’ lead time, making them less able to slow progress later (because further slowing would cause them to fall behind, such that other labs would drive AI progress and the slowed labs’ safety practices would be irrelevant).”
I would be more inclined to agree with this if there were a set of criteria indicating we are in "crunch time", which we are very likely to meet before dangerous systems arrive and haven't met now. Has anyone generated such a set? Without one, how do we know when "crunch time" is, or for that matter, whether we're already in it?
Great post! Another advantage of giving yourself capacity is worth pointing out. I try to operate at around 80-90% capacity. This leaves me time to notice and pursue better opportunities as they arise, which imo is far more valuable to your long-term output than a flat +10% multiplier. As we know from EA resources, working on the right thing can multiply your effectiveness by 2x, 10x, or more. Giving yourself extra slack makes you less likely to get stuck in local optima.
Thanks Elle, I appreciate that. I believe your claims. I fully believe it's possible to safely go vegan for an extended period; I'm just not sure how difficult it is (i.e., what the default outcome is if one tries without doing research first) and what ways there are to prevent that outcome if it's a bad one.
I shall message you, and welcome to the forum!
With respect to Point 2, I think that EA is not large enough for a large AI activist movement to be composed mostly of EA-aligned people. EA is difficult and demanding; I don't think you're likely to get a "One Million EA" march anytime soon. I agree that AI activists who are EA-aligned are more likely to be in the set of focused, successful activists (like many of your friends!), but I think you'll end up with either:
- A small group of focused, dedicated activists who may or may not be largely EA aligned
- A large group of unfocused-by-default, relatively casual activists, most of whom will not be EA aligned
If either of those two would be effective at achieving goals, then I think that makes AI risk activism a good idea. If you need a large group of focused, dedicated activists—I don’t think we’re going to get that.
As for Point 1, it's certainly possible, especially if having a large group of relatively unfocused people would be useful. I have no idea if this is true, so I have no idea whether raising awareness is an impactful idea at this point. (Also, some have made the point that raising AI risk awareness tends to make people more likely to race for AGI, not less; see OpenAI.)
I think there's a bit of an "ugh field" around activism for some EAs, especially the rationalist types in EA. At least, that's my experience.
My first instinct, when I think of activism, is to think about people who:
- Have incorrect, often extreme beliefs or ideologies.
- Are aggressively partisan.
- Are more performative than effective with their actions.
This definitely does not describe all activists, but it does describe some, and may even describe the median activist. That said, this shouldn't be a reason for us to discard the idea out of hand. After all, how good is the median charity? Not that great, compared to what EAs actually do.
Perhaps there's a mass-movement issue here, though: activism tends to work best with a large groundswell of numbers. If you have a hundred thousand AI safety activists, you're simply not going to have a hundred thousand people with a nuanced and deep understanding of the theory of change behind AI safety activism. You're going to have a few hundred of those, and ninety-nine thousand people who think AI is bad for Reason X, where that's the extent of their thinking and X varies wildly in quality.
Thus, the question is—would such a movement be useful? For such a movement to be useful, it would need to be effective at changing policy, and it would need to be aimed at the correct places. Even if the former is true, I find myself skeptical that the latter would occur, since even AI policy experts are not yet sure where to aim their own efforts, let alone how to communicate where to aim so well that a hundred thousand casually-engaged people can point in the same useful direction.
I am one of those meat-eating EAs, so I figured I'd give some reasons why I'm not vegan, to aid this post's goal of finding out about these things.
Price: While I can technically afford it, I still prefer to save money when possible.
Availability: A lot of food out there, especially frozen foods which I buy a lot of since I don’t like cooking, involves meat. It’s simply easier to decide on meals when meat is an option.
Knowledge: If I were to go vegan, I would be unsure how to do so safely for an extended period, and how to make sure I got a decent variety rather than eating the same foods over and over. (This comes into taste as well: I don't mind vegan food, but there's much more variety in the meat-based dishes I can find.)
Convenience: Similarly to above—it takes resources to seek out vegan options, more resources than to just eat normally.
The harms are real, but the harms are far away and abstract. So when I feel vaguely guilty about eating meat, I think about all the hassle and cost it would take to swap diets, and I shy away from it and don’t do it.
I’m not quite sure why those harms are far away and abstract, whereas the harms caused by malaria or AI risk don’t invoke the same feelings in me. I think it’s because I can use maths to determine the number of humans impacted and then put myself in the place of one of those humans—it’s harder to do that with chickens. Also, giving away 10% of my income is actually less of a day-to-day drain on my resources than going vegan would be. I feel aversion to spending money, but I only give away money once a month, and it doesn’t cause me financial hardship. By contrast, veganism requires daily effort.
As a micro-example of where these considerations don't apply: there are some plant-based meat strips I can get at my local supermarket. I find them tastier than actual meat when put into curry, and they're just as cheap when on special. So whenever they're on special, I pick up a bunch and they become my default option for a while. I know how to cook them, I know where to get them, they're just as cheap (sometimes), and I enjoy the taste. So I end up avoiding meat by default. I hope plant-based meat will eventually reach that saturation point for all kinds of dishes too.
Seems like a pretty incredible opportunity for those interested! What level of time commitment do you expect reading and understanding the book to take, in addition to the meetings?
Looking at the two critiques in reverse order:
I think it's true that it's easy for EAs to lose sight of the big picture, but to me, the reason for that is simple: humans, in general, are terrible at seeing the bigger picture. If anything, the EA frameworks seem better than most altruistic endeavours at seeing the big picture. Most altruistic endeavours don't get past the stage of "see good thing, do good thing", whereas EAs tend to ask whether X is really the most effective thing they can do, which invariably involves looking at a bigger picture than the immediate thing. In my own field of AI safety, thinking about the big picture is an idea that people are routinely exposed to. Researchers often do exercises like backchaining (ask what the main goal is, like "make AI go well", and figure out how to move backwards from that to what you should be doing now) and theory of change (writing out specifically what problem you want to help with, what you want to achieve, and how that will help).
Do you think there are specific vulnerabilities that EAs have that make them lose sight of the bigger picture, which non-EA altruistic people avoid?
For the point of foregoing fulfillment—I’m not sure exactly what fulfillment you think people are foregoing, here. Is it the fulfillment of having lots of money? The fulfillment of working directly on the world’s biggest problems?
I was using unidentifiability in the Hubinger way. I do believe that if you try to get an AI trained in the way you mention here to follow directions subject to ethical considerations, by default, the things it considers “maximally ethical” will be approximately as strange as the sentences from above.
That said, this is not actually related to the problem of deceptive alignment, so I realise now that this is very much a side point.
I don’t understand why you believe unidentifiability will be prevented by large datasets. Take the recent SolidGoldMagikarp work. It was done on GPT-2, but GPT-2 nevertheless was trained on a lot of data—a quick Google search suggests eight million web pages.
Despite this, when people tried to find the sentences that maximally determined the next token, what we got was...strange.
This is exactly the kind of thing I would expect to see if unidentifiability was a major problem—when we attempt to poke the bounds of extreme behaviour of the AI and take it far off distribution as a result, what we get is complete nonsense and not at all correlated with what we actually want. Clearly it understands the concepts of “girl”, “USA”, and “evil” very differently to us, and not in a way we would endorse.
This is far from a guarantee that unidentifiability will remain a problem, but given that you put under 1% on it, things like this add much more credence to unidentifiability in my world model than in yours.
Thanks for this! One thing I noticed: there is an assumption you'll continue to donate 10% of your current salary even after retirement. It would be worth having a toggle to turn that off, since the GWWC pledge does say "until I retire". That may also make giving more appealing, because giving 10% forever requires longer timelines than giving 10% until retirement. When I did the calcs in my own spreadsheet, committing to give 10% until retiring only increased my working timeline by about 10%.
Admittedly, now I'm rethinking the whole "retire early" thing entirely given the impact of direct work, but this is outside the scope of one spreadsheet :P
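The kind of spreadsheet calc described above can be sketched as a simple savings simulation. Everything numeric here is a made-up illustration (salary, expenses, return rate), using the common "25x annual expenses" retirement heuristic; it is not the commenter's actual spreadsheet:

```python
def years_to_retirement(income, expenses, donate_frac=0.0, r=0.05, target_mult=25):
    """Years of saving until the portfolio reaches target_mult x annual expenses
    (the common "4% rule" heuristic), donating donate_frac of income until then."""
    target = expenses * target_mult
    savings, years = 0.0, 0
    while savings < target:
        # Grow last year's savings, then add this year's surplus after donations.
        savings = savings * (1 + r) + income * (1 - donate_frac) - expenses
        years += 1
    return years

base = years_to_retirement(100_000, 50_000)                      # no giving
giving = years_to_retirement(100_000, 50_000, donate_frac=0.1)   # 10% until retirement
print(base, giving)  # → 17 20
```

With these illustrative numbers, pledging 10% until retirement adds about three working years (roughly an 18% longer timeline); the exact figure depends heavily on the assumed salary, expenses, and returns, which is presumably why the commenter's own spreadsheet came out nearer 10%. Donating 10% forever instead would raise the savings target itself, not just slow the saving, which is why that version of the pledge lengthens timelines much more.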
This came from going through AGI Safety Fundamentals (and to a lesser extent, Alignment 201) with a discussion group and talking through the various ideas. I also read more extensively in most weeks in AGISF than the core readings. I think the discussions were a key part of this. (Though it’s hard to tell since I don’t have access to a world where I didn’t do that—this is just intuition)
Great stuff! Thanks for running this!
Minor point: The Discovering Latent Knowledge Github appears empty.
Also, regarding the data poisoning benchmark, This Is Fine: I'm curious whether this is actually a good benchmark for resistance to data poisoning. The thing we actually seem to be measuring here is speed of transfer learning, with slower declared better. While slower learning does increase resistance to data poisoning, it also seems bad for everything else we might want our AI to do. To me, this is basically a fine-tuning benchmark that we've inverted. (After all, if a neural network always output the number 42 no matter what, it would score the maximum on TIF: there is no sequence of wrong prompts that can cause it to elicit the number 38 instead, because it is incapable of learning anything. Nevertheless, this is not where we want LLMs to go in the future.)
A better benchmark would probably be to take data poisoned examples and real fine-tuning, fine-tune the model on each, and compare how much it learns in both cases. With current capabilities, it might not be possible to score above baseline on this benchmark since I don’t know if we actually have ways for the model to filter out data poisoned examples—nevertheless, this would bring awareness to the problem and actually measure what we want to measure more accurately.
I’m sure each individual critic of EA has their own reasons. That said (intuitively, I don’t have data to back this up, this is my guess) I suspect two main things, pre-FTX.
Firstly, longtermism is very criticisable. It’s much more abstract, focuses less on doing good in the moment, and can step on causes like malaria prevention that people can more easily emotionally get behind. There is a general implication of longtermism that if you accept its principles, other causes are essentially irrelevant.
Secondly, everything I just said about longtermism → neartermism applies to EA → regular charity—just replace “Doing good in the moment” with “Doing good close to home”. When I first signed up for an EA virtual program, my immediate takeaway was that most of the things I had previously cared about didn’t matter. Nobody said this out loud, they were scrupulously polite about it, they were 100% correct, and it was a message that needed to be shared to get people like me on board. This is a feature, not a bug, of EA messaging. But this is not a message that people enjoy hearing. The things people care about are generally optimised for having people care about them—as examples, see everything trending on Twitter. As a result, people don’t react well to being told, whether explicitly or implicitly, that they should stop caring about (My personal example here) the amount of money Australian welfare recipients get, and care about malaria prevention halfway across the world instead.
One difference between EA and longtermism is that people rarely criticise neartermism to the same level, because then you can just point out the hundreds of thousands of lives that neartermism has already saved, and they look like an asshole. Longtermism has no such defense, and a lot of people equate that with the EA movement—sometimes out of intellectual dishonesty, and sometimes because longtermism genuinely is a large and growing part of EA.
My p(doom) went down slightly (From around 30% to around 25%) mainly as a result of how GPT-4 caused governments to begin taking AI seriously in a way I didn’t predict. My timelines haven’t changed—the only capability increase of GPT-4 that really surprised me was its multimodal nature. (Thus, governments waking up to this was a double surprise, because it clearly surprised them in a way that it didn’t surprise me!)
I'm also less worried about misalignment and more worried about misuse when it comes to the next five years, due to how LLMs appear to behave. It seems that LLMs aren't particularly agentic by default, but can certainly be induced to perform agent-like behaviour; GPT-4's inability to do this well seems to be a capability issue that I expect to be resolved in a generation or two. Thus, I'm less worried about the training of GPT-N but still worried about the deployment of GPT-N. It makes me put more credence in the slow takeoff scenario.
This also makes me much more uncertain about the merits of pausing in the short-term, like the next year or two. I expect that if our options were “Pause now” or “Pause after another year or two”, the latter is better. In practice, I know the world doesn’t work that way and slowing down AI now likely slows down the whole timeline, which complicates things. I still think that government efforts like the UK’s AISI are net-positive (I’m joining them for a reason, after all) but I think a lot of the benefit to reducing x-risk here is building a mature field around AI policy and evaluations before we need it—if we wait until I think the threat of misaligned AI is imminent, that may be too late.