Co-Director of Equilibria Network: https://eq-network.org/
I try to write as if I were having a conversation with you in person.
I would like to claim that my current safety beliefs are a mix between Paul Christiano's, Andrew Critch's, and Def/Acc.
Jonas Hallgren
The number of applications will affect the counterfactual value of applying. Stating your expected number might lower the number of people who apply, but I would still appreciate having a range of expected applicants for the AI Safety roles.
What is the expected number of people applying for the AI Safety roles?
I'm getting the vibe that your priors lean toward the world being in a multipolar scenario in the future. I'm interested more specifically in your predictions for multipolarity versus a singleton given shard-theory thinking, as it seems unlikely for recursive self-improvement to happen in the way described, given what I understand of your model?
Great post; I enjoyed it.
I've got two things to say. The first is that GPT is a very nice brainstorming tool, as it generates many more ideas than you could yourself, which you can then prune from, which is nice.
Secondly, I've been doing "peer coaching" with some EA people using reclaim.ai (not sponsored) to automatically book meetings each week, where we take turns being the mentor and mentee, answering the five following questions:
- What's on your mind?
- When would today's session be a success?
- Where are you right now?
- How do you get where you want to go?
- What are the actions/first steps to get there?
- Ask for feedback

I really like the framing of meetings with yourself; I'll definitely try that out.
Alright, that makes sense; thank you!
Isn't expected value calculated as probability times utility? And as a consequence, isn't the "higher risk" part wrong if one simply looks at it like this? (Going from 20% to 10% would be 10x the impact of going from 2% to 1%.)
(I could be missing something here, please correct me in that case)
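To make the arithmetic I have in mind concrete, here's a minimal sketch (the value at stake is normalized to 1 purely for illustration):

```python
# Expected-value framing: the value of a risk reduction is
# (probability reduction) x (value at stake), holding the stake fixed.
stake = 1.0  # normalized value of what is at risk

big_reduction = (0.20 - 0.10) * stake    # cutting risk from 20% to 10%
small_reduction = (0.02 - 0.01) * stake  # cutting risk from 2% to 1%

# The 20% -> 10% intervention removes ten times as many
# percentage points of risk as the 2% -> 1% one.
print(big_reduction / small_reduction)
```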
I didn't mean it in that sense. I think the lesson you drew from it is fair in general; I was just reacting to the things I felt you swept under the rug, if that makes sense.
Sorry, Pablo, I meant that I became a lot more epistemically humble; I should have thought more about how I phrased it. It was more that I went from the opinion that many worlds is probably true to: "Oh man, there are some weird answers to the Wigner's friend thought experiment, and I shouldn't give major weight to any of them." So I'm now at maybe 20% on many worlds?
That being said, I am overconfident from time to time, and it's fair to point that out as well. Maybe you were being overconfident in saying that I was overconfident? :D
I will say that I thought the consciousness/p-zombie distinction was very interesting and a good example of overconfidence, as this didn't come across in my previous comment.
Generally, some good points across the board that I agree with. Talking with some physicist friends helped me debunk the many-worlds thing Yud has going. Similarly, his animal consciousness stuff seems a bit crazy as well. I will also say that I feel you're coming off way too confident and inflammatory in your general tone. The AI Safety argument you provided was just dismissal without much explanation. Also, when it comes to the consciousness stuff, I honestly just get kind of pissed reading it, as I feel you're to some extent pandering hard to dualism.
I totally agree with you that Yudkowsky is way overconfident in the claims he makes. Ironically enough, it also seems that you are, to some extent, as well in this post, since you're overgeneralizing from insufficient data. As a fellow young person, I recommend some more caution when it comes to solid claims about topics where you have little knowledge (you cherry-picked data on multiple occasions in this post).
Overall, you made some good points, though, so it was still a thought-provoking read.
Maybe frame it more as if you're talking to a child. Yes, you can tell the child to do something, but how can you be certain that it will do it?
Similarly, how can we trust the AI to actually follow the prompt? To trust it, we would fundamentally have to understand the AI, or safeguard against problems if we don't understand it. The question then becomes how your prompt is represented in machine language, which is very hard to answer.
To reiterate, ask yourself, how do you know that the AI will do what you say?
(Leike responds to this here if anyone is interested)
John Wentworth has a post on Godzilla strategies where he claims that using an AGI to solve the alignment problem is like asking Godzilla to make a larger Godzilla behave. How will you ensure you don't overshoot the intelligence of the agent you're using to solve alignment and fall into the "Godzilla trap"?
Advice for new alignment people: Info Max
Max Tegmark's new Time article on how we're in a Don't Look Up scenario [Linkpost]
TL;DR: I totally agree with the general spirit of this post: we need people to solve alignment, and we're not on track. Go and work on alignment, but before you do, try to engage with the existing research; there are reasons why it exists. There are a lot of things not being worked on within AI alignment research, and I can almost guarantee that within six months to a year, you can find things that people haven't worked on.
So go and find these underexplored areas in a way where you engage with what people have done before you!
There's no secret elite SEAL team coming to save the day. This is it. We're not on track.
If timelines are short and we don't get our act together, we're in a lot of trouble. Scalable alignment (aligning superhuman AGI systems) is a real, unsolved problem. It's quite simple: current alignment techniques rely on human supervision, but as models become superhuman, humans won't be able to reliably supervise them.
But my pessimism on the current state of alignment research very much doesn't mean I'm an Eliezer-style doomer. Quite the opposite, I'm optimistic. I think scalable alignment is a solvable problem, and it's an ML problem, one we can do real science on as our models get more advanced. But we gotta stop fucking around. We need an effort that matches the gravity of the challenge.[1]
I also agree that Eliezer's style of doom seems uncalled for and that this is a solvable but difficult problem. My personal p(doom) is around 20%, which I think is quite reasonable.
Barely anyone is going for the throat of solving the core difficulties of scalable alignment. Many of the people who are working on alignment are doing blue-sky theory, pretty disconnected from actual ML models. Most of the rest are doing work that's vaguely related, hoping it will somehow be useful, or working on techniques that might work now but predictably fail to work for superhuman systems.
Now, I do want to push back on this claim, as I see it made by a lot of people who haven't fully engaged with the more theoretical alignment landscape. There are only around 300 people working on alignment, but those people are actually doing things, and most of them aren't doing blue-sky theory.
A note on the ARC claim:
But his research now ("heuristic arguments") is roughly "trying to solve alignment via galaxy-brained math proofs." As much as I respect and appreciate Paul, I'm really skeptical of this: basically all deep learning progress has been empirical, often via dumb hacks[3] and intuitions, rather than sophisticated theory. My baseline expectation is that aligning deep learning systems will be achieved similarly.[4]

This is essentially a claim about the methodology of science: that working on existing systems gives more information and breakthroughs than working on blue-sky theory. The current hypothesis for this is that real-world research is simply a lot more information-rich. This is, however, not the only way to get real-world feedback loops. Christiano is not working on blue-sky theory; he's using real-world feedback loops in a different way: he looks at the real world and looks for information that's already there.
A discovery of this type is, for example, the tragedy of the commons: whilst we could have created computer simulations to see the process in action, it's 10x easier to look at the world and see the real-time failures. His research methodology is to tell stories and see where they fail in the future. This gives bits of information on where to do future experiments, like how we could tell that humans would fail to stop overfishing without actually running an experiment on it.
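For intuition, here's a toy version of the kind of simulation one could have run instead of just observing the world (all dynamics and numbers are made up for illustration):

```python
# Toy tragedy-of-the-commons model: several agents harvest from a shared
# fish stock each season; the stock regrows proportionally, up to a cap.
def simulate(agents=5, stock=100.0, share=0.15, regrowth=1.3, seasons=20):
    for _ in range(seasons):
        harvest = sum(stock * share for _ in range(agents))  # each takes a cut
        stock = max(stock - harvest, 0.0)
        stock = min(stock * regrowth, 100.0)  # regrow, capped at carrying capacity
    return stock

print(simulate(share=0.15))  # greedy harvesting drives the stock toward zero
print(simulate(share=0.02))  # restrained harvesting keeps the stock sustainable
```

The point stands either way: the real world already ran this "experiment" for us, which is exactly the information-richness Christiano's methodology exploits.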
This is also what John Wentworth does with his research; he uses the real world as a reference frame that is quite rich in information. A good question, then, is why we haven't seen many empirical predictions from Agent Foundations. I believe it is because alignment is quite hard; specifically, it is hard to define agency in a satisfactory way due to some really fuzzy problems (boundaries, among others), and therefore hard to make predictions.
We don't want to mathematize things too early either, as doing so would lock us into a predefined reference frame that might be hard to escape. We want to find the right ballpark for agents first, since if we fail, we might base evaluations on something that turns out to be false.
In general, there's a difference between the types of problems in alignment and in empirical ML; the reference class of a "sharp left turn" is different from something empirically verifiable, as it is unclearly defined, so a good question is how we should turn one into the other. The question of how we turn recursive self-improvement, inner misalignment, and agent foundations into empirically verifiable ML experiments is actually something that most of the people I know in AI alignment are currently actively working on.
This post from Alexander Turner is a great example of doing this, as they try "just retargeting the search".
Other people are trying other things, such as bounding the maximisation in RL with quantilisers. This would, in turn, make AI more "content" with not maximising. (A fun parallel to how utilitarianism shouldn't be unbounded.)
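For intuition, a minimal sketch of the quantiliser idea: instead of always taking the single highest-utility action (the argmax), sample uniformly from the top q-fraction of actions. The function names and numbers here are my own illustration, not anyone's actual implementation:

```python
import random

def quantilize(actions, utility, q=0.1, rng=random):
    # Rank actions by utility, then sample uniformly from the top
    # q-fraction rather than deterministically taking the best one.
    ranked = sorted(actions, key=utility, reverse=True)
    cutoff = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:cutoff])

actions = list(range(100))
pick = quantilize(actions, utility=lambda a: a, q=0.1)
print(pick)  # some action from the top 10, not necessarily the maximum (99)
```

The appeal is that the agent no longer relentlessly pushes for the extreme of its utility function, which bounds how badly a mis-specified utility can go wrong.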
I could go on with examples, but what I really want to say is that alignment researchers are doing things; it's just hard to see why they're doing them when you're not doing alignment research yourself. (If you want to start, book my Calendly and I might be able to help you.)
So what does this mean for an average person? You can make a huge difference by going in, engaging with the arguments, and coming up with counter-examples, experiments, and theories of what is actually going on.
I just want to say that it's most likely paramount to engage with the existing alignment research landscape first, as it's free information, and it's easy to fall into traps if you don't. (A good resource for avoiding some traps is John's "Why Not Just" sequence.)
There's a couple of years' worth of research there; it is not worth rediscovering from the ground up. Still, this shouldn't stop you; go and do it. You don't need a hero licence.
Glad to hear it!
The Benefits of Distillation in Research
Great tool; I've enjoyed it and used it for two years. I (a random EA) would recommend it.
Thank you for this! I'm hoping that this enables me to spend a lot less time on hiring in the future. I feel this is a topic that could easily have taken me 3x the effort to understand if I hadn't gotten some very good resources from this post, so I will definitely check out the book. Again, awesome post!
Thank you for this post! I will make sure to read the 5/5 books that I haven't read yet. I'm especially excited about Joseph Henrich's book from 2020; I had read The Secret of Our Success before, but not that one.
I actually come at moral progress from an AI Safety interest. The question for me is, to some extent, how we can set up AI systems so that they continuously improve "moral progress", as we don't want to leave our fingerprints on the future.
In my opinion, the larger AI Safety dangers come from a "big data hell" like the ones described in Yuval Noah Harari's Homo Deus or Paul Christiano's slow take-off scenarios.
Therefore, we want to figure out how to set up AIs in such a way that the structure of their use automatically improves moral progress. I also believe that AI will most likely go through a process similar to the one described in The Secret of Our Success, and that we should prepare appropriate optimisation functions for it.
So, if you ever feel like we might die from AI, I would love to see some work in that direction!
(Happy to talk more about it if you're up for it.)