I’m Aaron. I’ve done university group organizing at the Claremont Colleges for a bit. My current cause prioritization is AI Alignment.
It’s not obvious that getting dangerous AI later is better
we need capabilities to increase so that alignment research can stay up to date
I think one of the better write-ups about this perspective is Anthropic’s Core Views on AI Safety.
From its main text, under the heading “The Role of Frontier Models in Empirical Safety,” a couple of relevant arguments are:
Many safety concerns arise with powerful systems, so we need to have powerful systems to experiment with
Many safety methods require large/powerful models
Need to understand how both problems and our fixes change with model scale (as models get bigger, does the safety technique still appear to work?)
To get evidence of powerful models being dangerous (which is important for many reasons), you need the powerful models.
Not responding to your main question:
Second in a theoretical situation where capabilities research globally stopped overnight, isn’t this just free-extra-time for the human race where we aren’t moving towards doom? That feels pretty valuable and high EV in and of itself.
I’m interpreting this as saying that buying humanity more time, in and of itself, is good.
I don’t think extra time pre-transformative-AI is particularly valuable except for its impact on existential risk. Two reasons why I think this:
Astronomical waste argument. Time post-transformative-AI is way more valuable than time now, assuming some form of aggregating/total utilitarianism (a strong version isn’t necessary). If I were trading clock-time seconds now for seconds a thousand years from now, assuming no difference in existential risk, I would probably be willing to trade every historical second of humans living good lives for like a minute a thousand years from now, because it seems like we could have a ton of (morally relevant) people in the future, and the moral value derived from their experience could be significantly greater than that of current humans.
The moral value of the current world seems plausibly negative due to large amounts of suffering. Factory farming, wild animal suffering, humans experiencing suffering, and more, seem like they make the total sign unclear. Under moral views that weigh suffering more highly than happiness, there’s an even stronger case for the current world being net-negative. This is one of those arguments that I think is pretty weird and almost never affects my actions, but it is relevant to the question of whether extra time for the human race is positive EV.
A third argument is that AI arriving sooner could help reduce other existential risks: a mundane example is AI speeding up vaccine research; a weirder example is AI enabling space colonization, since being on many planets lowers x-risk. I don’t personally put very much weight on this argument, but it’s worth mentioning.
I’m glad you wrote this post. Mostly before reading this post, I wrote a draft for what I want my personal conflict of interest policy to be, especially with regard to personal and professional relationships. Changing community norms can be hard, but changing my norms might be as easy as leaving a persuasive comment! I’m open to feedback and suggestions here, from anybody interested.
I think Ryan is probably overall right that it would be better to fund people for longer at a time. One counter-consideration that hasn’t been mentioned yet: longer contracts implicitly and explicitly push people to keep doing something — that may be sub-optimal — because they drive up switching costs.
If you have to apply for funding once a year no matter what you’re working on, the “switching costs” of continuing to do the same thing you’ve been doing are similar to the costs of actually switching (of course they aren’t in general, but with regard to funding they might be). I think it’s unlikely but not crazy that the effects of status quo bias are severe enough that funders artificially imposing switching costs on “continuing/non-switching” lead to net better outcomes. I expect that in a world where grants usually last 1 year, people switch what they’re doing more than in a world where grants last 3 years, and it’s plausible these changes are good for impact.
Some factors that seem cruxy here:
how much can be gained through realistic switching (how bad is the current allocation of people, how much better are the things people would do if switching costs were roughly zero, as they sort of are, and how much worse are the things people would keep doing if continuing costs were low).
it seems very likely that this consideration should affect grants to junior and early-career people who are bouncing around more, but it probably doesn’t apply much to more senior folks who have been working on a project for a while (on the other hand, maybe you want the junior people investing more in long-term career plans, thus switching less; it depends on the person).
Could relatively-zero switching costs actually hurt because people over-correct and switch too much (e.g., Holden Karnofsky gets excited about AI evaluations so a bunch of people work on that, then governance gets big so people switch to that, and so on)?
How much does it help to have people with a lot of experience on particular things (narrow experts vs. generalists)?
If grants were more flexible (or grant-makers communicated better and the social norms were clearer; in fact people sometimes do return grants or change what they’re working on mid-grant) maybe you could fund people for 3 years while giving them affordance to switch so you still capture the value from people switching to more impactful things.
Personally, I’ve found that being funded by a grant at all makes me less likely to switch what I’m doing. I expect that the amount I “want” to switch in the moment (not upon reflection/hindsight) is too much, so for me this effect might be net positive, but there are probably also some times where it gets in the way of me making impactful switches. If grant-makers were more accessible to talk about this, i.e., not significantly time constrained, they could probably cause a better allocation of resources. Overall, I’m not sure how compelling this counter-consideration is, but it seems worth mulling over.
What does FRO stand for?
How is the super-alignment team going to interface with the rest of the AI alignment community, and specifically what kind of work from others would be helpful to them (e.g., evaluations they would want to exist in 2 years, specific problems in interpretability that seem important to solve early, curricula for AIs to learn about the alignment problem while avoiding content we may not want them reading)?
To provide more context on my thinking that leads to this question: I’m pretty worried that OpenAI is making themselves a single point of failure in existential security. Their plan seems to be a less-disingenuous version of “we are going to build superintelligence in the next 10 years, and we’re optimistic that our alignment team will solve catastrophic safety problems, but if they can’t then humanity is screwed anyway, because as mentioned, we’re going to build the god machine. We might try to pause if we can’t solve alignment, but we don’t expect that to help much.” Insofar as a unilateralist is taking existentially risky actions like this and they can’t be stopped, other folks might want to support their work to increase the chance of the super-alignment team’s success. Insofar as I want to support their work, I currently don’t know what they need.
Another framing behind this question is just “many people in the AI alignment community are also interested in solving this problem; how can they indirectly collaborate with you?” (Some people will want to directly collaborate, but this has corporate closed-ness limitations.)
I am not aware of modeling here, but I have thought about this a bit. Besides what you mention, some other ways I think this story may not pan out (very speculative):
At the critical time, the cost of compute for automated researchers may be really high such that it’s actually not cost effective to buy labor this way. This would mainly be because many people want to use the best hardware for AI training or productive work, and this demand just overwhelms suppliers and prices skyrocket. This is like the labs and governments paying a lot more except that they’re buying things which are not altruistically-motivated research. Because autonomous labor is really expensive, it isn’t a much better deal than 2023 human labor.
A similar problem is that there may not be a market for buying autonomous labor because somebody is restricting this. Perhaps a government implements compute controls including on inference to slow AI progress (because they think that rapid progress would lead to catastrophe from misalignment). Perhaps the lab that develops the first of these capable-of-autonomous-research models restricts who can use it. To spell this out more, say GPT-6 is capable of massively accelerating research, then OpenAI may only make it available to alignment researchers for 3 months. Alternatively, they may only make it available to cancer researchers. In the first case, it’s probably relatively cheap to get autonomous alignment research (I’m assuming OpenAI is subsidizing this, though this may not be a good assumption). In the second case you can’t get useful alignment research with your money because you’re not allowed to.
It might be that the intellectual labor we can get out of AI systems at the critical time is bottlenecked by human labor (i.e., humans are needed to: review the output of AI debates, give instructions to autonomous software engineers, or construct high quality datasets). In this situation, you can’t buy very much autonomous labor with your money because autonomous labor isn’t the limiting factor on progress. This is pretty much the state of things in 2023; AI systems help speed up human researchers, but the compute cost of them doing so is still far below the human costs, and you probably didn’t need to save significant money 5 years ago to make this happen.
My current thinking is that there’s a >20% chance that EA-oriented funders should be saving significant money to spend on compute for autonomous researchers, and it is an important thing for them to gain clarity on. I want to point out that there is probably a partial-automation phase (like point 3 above) before a full-automation phase. The partial-automation phase has less opportunity to usefully spend money on compute (plausibly still in the tens of millions of dollars), but our actions are more likely to matter. After that comes the full-automation phase where money can be scalably spent to e.g., differentially speed up alignment vs. AI capabilities research by hundreds of millions of dollars, but there’s a decent chance our actions don’t matter then.
As you mention, perhaps our actions don’t matter then because humans don’t control the future. I would emphasize that if we have fully autonomous, no-humans-in-the-loop research happening without already having good alignment of those systems, it’s highly likely that we get disempowered. That is, it might not make sense to aim to do alignment research at that point because either the crucial alignment work was already done, or we lose. Conditional on having aligned systems at this point, having saved money to spend on altruistically motivated cognitive work probably isn’t very important because economic growth gets going really fast and there’s plenty of money to be spent on non-alignment altruistic causes. On the other hand, something something at that point it’s the last train on its way to the dragon and it sure would be sad to not have money saved to buy those bed-nets.
A few weeks ago I did a quick calculation for the amount of digital suffering I expect in the short term, which probably gets at your question about these sizes, for the short term. tldr of my thinking on the topic:
There is currently a global compute stock of ~1.4e21 FLOP/s (each second, we can do about that many floating point operations).
It seems reasonable to expect this to grow ~40x in the next 10 years based on naively extrapolating current trends in spending and compute efficiency per dollar. That brings us to 1.6e23 FLOP/s in 2033.
Human brains do about 1e15 FLOP/s (each second, a human brain does about 1e15 floating point operations worth of computation)
We might naively assume that future AIs will have similar consciousness-compute efficiency to humans. We’ll also assume that 63% of the 2033 compute stock is being used to run such AIs (makes the numbers easier).
Then the number of human-consciousness-second-equivalent AIs that can be run each second in 2033 is 1e23 / 1e15 = 1e8, or 100 million.
For reference, there are probably around 31 billion land animals alive in factory farms at any given second. I make a few adjustments based on brain size and guesses about the experience of suffering AIs and get that digital suffering in 2033 seems to be similar in scale to factory farming.
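To make the arithmetic explicit, here is a minimal sketch of the calculation in Python. The inputs are just the round numbers assumed in the bullets above, not measured values:

```python
# Minimal sketch of the BOTEC above; all inputs are assumed round numbers.
compute_stock_2033 = 1.6e23    # FLOP/s, naive extrapolation of the global compute stock
fraction_running_ais = 0.63    # assumed share of 2033 compute running such AIs
human_brain_flops = 1e15       # FLOP/s, rough estimate for a human brain

# Human-consciousness-second-equivalent AIs that can be run each second
ai_equivalents = compute_stock_2033 * fraction_running_ais / human_brain_flops
print(f"{ai_equivalents:.1e}")  # ~1.0e+08, i.e. ~100 million

# Compare to ~31 billion land animals alive in factory farms at any given second
factory_farmed_land_animals = 3.1e10
print(f"{ai_equivalents / factory_farmed_land_animals:.1e}")  # ~3.3e-03, before brain-size adjustments
```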
Overall my analysis is extremely uncertain, and I’m unsurprised if it’s off by 3 orders of magnitude in either direction. Also note that I am only looking at the short term.
You can read the slightly more thorough, but still extremely rough and likely wrong BOTEC here
Thanks for your response. I’ll just respond to a couple things.
Re Constitutional AI: I agree normatively that it seems bad to hand over judging AI debates to AIs[1]. I also think this will happen. To quote from the original AI Safety via Debate paper,
Human time is expensive: We may lack enough human time to judge every debate, which we can address by training ML models to predict human reward as in Christiano et al. [2017]. Most debates can be judged by the reward predictor rather than by the humans themselves. Critically, the reward predictors do not need to be as smart as the agents by our assumption that judging debates is easier than debating, so they can be trained with less data. We can measure how closely a reward predictor matches a human by showing the same debate to both.
Re:
We’d also really contest the ‘perform very similarly to human raters’ is enough—it’d be surprising if we already have a free lunch, no information lost, way to simulate humans well enough to make better AI.
I also find this surprising, or at least I did the first 3 times I came across medium-quality evidence pointing in this direction. I don’t find it as surprising any more because I’ve updated my understanding of the world to “welp, I guess 2023 AIs actually are that good on some tasks.” Rather than making arguments to try to convince you, I’ll just link some of the evidence that I have found compelling; maybe you will too, maybe not: Model Written Evals, MACHIAVELLI benchmark, Alpaca (maybe the most significant for my thinking), this database, Constitutional AI.
I’m far from certain that this trend, of LLMs being useful for making better LLMs and for replacing human feedback, continues rather than hitting a wall in the next 2 years, but it does seem more likely than not to me, based on my read of the evidence. Some important decisions in my life rely on how soon this AI stuff is happening (for instance, if we have 20+ years I should probably aim to do policy work), so I’m pretty interested in having correct views. Currently, LLMs improving the next generation of AIs via more and better training data is one of the key factors in how I’m thinking about this. If you don’t find these particular pieces of evidence compelling and are able to explain why, that would be useful to me!
[1] I’m actually unsure here. I expect there are some times where it’s fine to have no humans in the loop and other times where it’s critical. It generally gives me the ick to take humans out of the loop, but I expect there are some times where I would think it’s correct.
The article doesn’t seem to have a comment section so I’m putting some thoughts here.
Economic growth: I don’t feel I know enough about historical economic growth to comment on how much to weigh the claim that “the trend growth rate of GDP per capita in the world’s frontier economy has never exceeded three percent per year.” I’ll note that I think the framing here is quite different than that of Christiano’s Hyperbolic Growth, despite them looking at roughly the same data as far as I can tell.
Scaling current methods: the article seems to cherrypick the evidence pretty significantly and makes the weak claim that “Current methods may also not be enough.” It is obvious that my subjective probability that current methods are enough should be <1, but I have yet to come across arguments that push that credence below say 50%.
“Scaling compute another order of magnitude would require hundreds of billions of dollars more spending on hardware.” This is straightforwardly false. The table included in the article, from the Chinchilla paper with additions, is a bit confusing because it doesn’t include where we are now, and because it lists only model size rather than total training compute (FLOP). Based on Epoch’s database of models, PaLM 2 was trained with about 7.34e24 FLOP, and GPT-4 is estimated at 2.10e25 (note these are not official numbers). This corresponds to roughly the 280B-param (9.9e24 FLOP) or 520B-param (3.43e25 FLOP) rows in the table. In this range, tens of millions of dollars are being spent on compute for the biggest training runs now. It should be obvious that you can get a couple more orders of magnitude of compute before hitting hundreds of billions of dollars. In fact, the 10 trillion param row in the table, listed at $28 billion, corresponds to a total training compute of 1.3e28 FLOP, which is more than 2 orders of magnitude above where the biggest publicly-known models are estimated to be. I agree that cost may soon become a limiting factor, but the claim that one more order of magnitude would push us into hundreds of billions is clearly wrong given that current costs are tens of millions.
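As a rough sanity check on the cost claim, here is a toy calculation. The ~$40M figure for a ~2e25 FLOP run and the assumption that cost scales linearly with training FLOP are my own round-number guesses, not figures from the article or from Epoch:

```python
# Toy check: if ~2e25 FLOP costs on the order of $40M today, and cost scales
# roughly linearly with training FLOP, how much do a few more orders of
# magnitude cost? (Both inputs are assumed round numbers.)
current_flop = 2e25   # assumed training compute of the largest current runs
current_cost = 4e7    # assumed cost in dollars (tens of millions)
cost_per_flop = current_cost / current_flop

for extra_ooms in (1, 2, 3):
    cost = cost_per_flop * current_flop * 10 ** extra_ooms
    print(f"+{extra_ooms} OOM: ~${cost:,.0f}")
# +1 OOM: ~$400,000,000
# +2 OOM: ~$4,000,000,000
# +3 OOM: ~$40,000,000,000
```

Even if hardware costs scale worse than linearly, there is clearly a lot of room between tens of millions and hundreds of billions.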
Re cherrypicking data, I guess one of the most important points that seems to be missing from this section is the rate of algorithmic improvement. I would point to Epoch’s work here.
“Constitutional AI, a state-of-the-art alignment technique that has even reached the steps of Capitol Hill, also does not aim to remove humans from the process at all: ‘rather than removing human supervision, in the longer term our goal is to make human supervision as efficacious as possible.’” This seems to me like a misunderstanding of Constitutional AI, for which a main component is “RL from AI Feedback.” Constitutional AI is all about removing humans from the loop in order to get high-quality data more efficiently. There’s a politics thing where developers don’t want to say they’re removing human supervision, and it’s also true that human supervision will probably play a role in data generation in the future, but the ratio of human contributions to total (AI + human) contributions to the data is surely going to go down. For example, research is already using AIs where we used to use humans; see also Anthropic’s paper Model Written Evaluations and the AI-labeled MACHIAVELLI benchmark. More generally, I would bet the trend toward automating datasets and benchmarks will continue, even if humans remain in the loop somewhat; insofar as humans are a limiting factor, developers will try to make them less necessary, and we already have AIs that perform very similarly to human raters at some tasks.
“We are constantly surprised in our day jobs as a journalist and AI researcher by how many questions do not have good answers on the internet or in books, but where some expert has a solid answer that they had not bothered to record. And in some cases, as with a master chef or LeBron James, they may not even be capable of making legible how they do what they do.” Not a disagreement, but I do wonder how much of this is a result of information being diffuse and just hard to properly find, a kind of task I expect AIs to be good at. For instance, 2025 language models equipped with search might be about as useful as having a panel of relevant experts you could ask questions.
Noting that section 3: “Even if technical AI progress continues, social and economic hurdles may limit its impact” matters for some outcomes and not for others. It matters given the authors define “transformative AI in terms of its observed economic impact.” It matters for many outcomes I care about like human well-being, that are related to economic impacts. It applies less to worries around existential risk and human disempowerment, for which powerful AIs may pose risks even while not causing large economic impacts ahead of time (e.g., bioterrorism doesn’t require first creating a bunch of economic growth).
Overall I think the claim of section 3 is likely to be right. A point pushing the other direction is that there may be a regulatory race to the bottom where countries want to enable local economic growth from AI and so relax regulations, think medical tourism for all kinds of services.
“Yet as this essay has outlined, myriad hurdles stand in the way of widespread transformative impact. These hurdles should be viewed collectively. Solving a subset may not be enough.” I definitely don’t find the hurdles discussed here to be sufficient to make this claim. It feels like there’s a motte and bailey, where the easy to defend claim is “these 3+ hurdles might exist, and we don’t have enough evidence to discount any of them”, and the harder to defend claim is “these hurdles disjunctively prevent transformative AI in the short term, so all of them must be conquered to get such AI.” I expect this shift isn’t intended by the authors, but I’m noting that I think it’s a leap.
“Scenarios where AI grows to an autonomous, uncontrollable, and incomprehensible existential threat must clear the same difficult hurdles an economic transformation must.” I don’t think this is the case. For example, section 3 seems not to apply, as I mentioned earlier. It’s worth noting that AI safety researcher Eliezer Yudkowsky has made a similar argument to the one you make in section 3, and he also thinks existential catastrophe in the near term is likely. I think the point you’re making here is directionally right, however: AI which poses existential risk is likely to be transformative in the sense you’re describing. That is, while it’s not necessary for such AI to be economically transformative, and there are a couple of other ways catastrophically-dangerous AI could bypass the hurdles you lay out, I think it’s overall a good bet that existentially dangerous AIs are also capable of being economically transformative, so the general picture of hurdles, insofar as they are real, will affect such risks as well [I could easily see myself changing my mind about this with more thought]. I welcome more discussion on this point and have some thoughts myself, but I’m tired and won’t include them in this comment; happy to chat privately about where “economically transformative” and “capable of posing catastrophic risks” lie on various spectrums.
While my comment has been negative and focused on criticism, I am quite glad this article was written. Feel free to check out a piece I wrote, laying out some of my thinking around powerful AI coming soon, which is mostly orthogonal to this article. This comment was written sloppily, partially as my off-the-cuff notes while reading, sorry for any mistakes and impolite tone.
I’m not Buck, but I can venture some thoughts as somebody who thinks it’s reasonably likely we don’t have much time.
Given that “I’m skeptical that humans will go extinct in the near future” and that you prioritize preventing suffering over creating happiness, it seems reasonable for you to condition your plan on humanity surviving the creation of AGI. You might then back-chain from possible futures you want to steer toward or away from. For instance, if AGI enables space colonization, it sure would be terrible if we just had planets covered in factory farms. What is the path by which we would get there, and how can you change it so that we have, e.g., cultured-meat-production planets instead? I think this is probably pretty hard to do; the term “singularity” has been used partially to describe the fact that we cannot predict what would happen after it. That said, the stakes are pretty astronomical, such that I think it would be pretty reasonable for >20% of animal advocacy effort to be specifically aimed at preventing AGI-enabled futures with mass animal suffering. This is almost the opposite of “we have ~7 years to deliver (that is, realise) as much good as we can for animals.” Instead it might be better to have an attitude like “what happens after 7 years is going to be a huge deal in some direction; let’s shape it to prevent animal suffering.”
I don’t know what kind of actions would be recommended by this thinking. To venture a guess: trying to accelerate meat alternatives, and doing lots of polling on public opinions about the moral questions around eating meat (with the goal of hopefully finding that humans think factory farming is wrong, so a friendly AI system might adopt such a goal as well; human behavior in this regard seems like a particularly bad basis on which to train AIs). I’m pretty uncertain about these two ideas and wouldn’t be surprised if they’re actually quite bad.
I agree that persuasion frames are often a bad way to think about community building.
I also agree that community members should feel valued, much in the way that I want everybody in the world to feel valued/loved.
I probably disagree about the implications, as they are affected by some other factors. One intuition that helps me is to think about the donors who donate toward community building efforts. I expect that these donors are mostly people who care about preventing kids from dying of malaria, and many donors also donate lots of money towards charities that can save a kid’s life for $5000. They are, I assume, donating toward community building efforts because they think these efforts are on average a better deal, costing less than $5000 for a life saved in expectation.
For mental health reasons, I don’t think people should generally hold themselves to this bar and be like “is my expected impact higher than where the money spent on me would go otherwise?” But I think when you’re using other people’s altruistic money to community build, you should definitely be making trade-offs, crunching numbers, and otherwise aiming to maximize the impact from those dollars.
Furthermore, I would be extremely worried if I learned that community builders aren’t attempting to quantify their impact or think about these things carefully (noting that I have found it very difficult to quantify impact here). Community building is often indistinguishable (at least from the outside) from “spending money on ourselves” and I think it’s reasonable to have a super high bar for doing this in the name of altruism.
Noting again that I think it’s hard to balance mental health with the whacky terrible state of the world where a few thousand dollars can save a life. Making a distinction between personal dollars and altruistic dollars can perhaps help folks preserve their mental health while thinking rigorously about how to help others the most. Interesting related ideas:
https://www.lesswrong.com/posts/3p3CYauiX8oLjmwRF/purchase-fuzzies-and-utilons-separately
https://forum.effectivealtruism.org/posts/zu28unKfTHoxRWpGn/you-have-more-than-one-goal-and-that-s-fine
Sorry about the name mistake. Thanks for the reply. I’m somewhat pessimistic about us two making progress on our disagreements here because it seems to me like we’re very confused about basic concepts related to what we’re talking about. But I will think about this and maybe give a more thorough answer later.
Edit: corrected name, some typos and word clarity fixed
Overall I found this post hard to read and I spent far too long trying to understand it. I suspect the author is about as confused about key concepts as I am. David, thanks for writing this, I am glad to see writing on this topic and I think some of your points are gesturing in a useful and important direction. Below are some tentative thoughts about the arguments. For each core argument I first try to summarize your claim and then respond, hopefully this makes it clearer where we actually disagree vs. where I am misunderstanding.
High level: The author claims that the risk of deception arising is <1%, but they don’t provide numbers elsewhere. They argue that 3 conditions must all be satisfied for deception, and that none of them is likely. How likely each one is affects that 1% number. My evaluation of the arguments (below) is that for each of these conjunctive conditions my rough probabilities (where higher means deception is more likely) are: (totally unsure, can’t reason about it) * (unsure but maybe low) * (high), yielding an unclear but probably >1% probability.
Key claims from post:
Why I expect an understanding of the base objective to happen before goal-directedness: “Models that are only pre-trained almost certainly don’t have consequentialist goals beyond the trivial next token prediction. Because a pre-trained model will already have high-level representations of key base goal concepts, all it will have to do to become aligned is to point them.” Roughly, the argument is that pretraining on tons of data will give a good idea of the base objective but not cause goal-directed behavior, and then we can just make the model do the base objective thing.
My take: It’s not obvious what the goals of pre-trained language models are, or what the goals of RLHF’d models are; plausibly they both have a goal like “minimize loss on the next token” but the RLHF’d one is doing that on a different distribution. I am generally confused about what it means for a language model to have goals. Overall I’m just so unsure about this that I can’t reasonably put a probability on models developing an understanding of the base objective before goal-directedness, but I wouldn’t confidently say this number is high or low. An example of the probability being high would be if goal-directedness only emerges in response to RL (this seems unlikely); an example of the probability being low would be if models undergoing pre-training become goal-directed around predicting next tokens early in training. Insofar as David thinks this probability is high, I do not understand why.
Why I expect an understanding of the base objective to happen significantly before optimizing across episodes/long-term goal horizons: You only get long-term goals via gradient descent finding them, but this is unlikely to happen because gradient descent operates on a hyper-local horizon. Training runs + oversight will be quite long periods, so even if gradient descent moves you to “slightly-long-term goals,” these won’t perform well.
My take: This argument makes the most sense to me, or at least I think we can reason about it more easily than the others. Pointing in the other direction, phase changes seem somewhat likely here; humans generally don’t care about outcomes in the world 100 or 1,000 years out, and then (sometimes) they get sold on longtermism and suddenly care about 10,000 years out. “On what time-span do I care about my goals” is plausibly a measure that will be discontinuous. Perhaps this looks like the goal “where am I minimizing loss” jumping horizons from “next token” to “this sentence/response” and perhaps to “all of my performance ever” or “all of the performance of models similar to myself ever” or “all of the performance of models similar to myself in the multiverse.” I’m also unconfident about how likely this is, partly because I’m still confused about what it means to have goals or understand a base objective, but I would not be surprised if the author turned out to be right that models understand the base objective before doing long-term goal optimization. Unfortunately, I expect that competitive pressures will lead AI developers to want their AIs to pursue long-term objectives, and that might mess things up.
Why I expect understanding of the base objective to happen significantly before situational awareness: The deception-relevant situational awareness, which involves understanding one’s future gradient updates, is unlikely to be selected for by gradient descent. This is because, assuming the model has a goal acting only in the current training batch, reasoning about future gradient updates is unlikely to improve short-term reward by much. Models might also get situational awareness via human generated prompts, but this is unlikely because we won’t want them to have this awareness of their own gradient updates.
My take: I think that GPT-3 probably has the relevant situational awareness in that its world model understands gradient descent. With the current approach to LLMs it seems this just comes pre-loaded into the models. I am not saying GPT-3 is dangerous due to this information, I am saying that the factual information relevant to situational awareness gets baked into pre-training at some point by default, and this could very well be before the model has an understanding of its base objective.
David writes “For a model to have this property, it would need goal horizons that extended at least beyond the next gradient update.” I’m not sure I buy this, but let’s assume it for now. We care about how likely all 3 of these required-for-deception characteristics are conjunctively. If we’re saying that the 3rd requires the 2nd then we can just condition on the 2nd: Given that the model gets long-term planning before an understanding of the base objective, how likely is dangerous situational awareness (before understanding of base objective)? Seems pretty likely now. For the question “how likely are we to get deception-relevant situational awareness before base objective?” I’m probably like 90% conditioning on long-term goals and still pretty high without conditioning. Yet again I am confused by what understanding the base objective means here.
FWIW I often vote on posts at the top without scrolling because I listened to the post via the Nonlinear podcast library or read it on a platform that wasn’t logged in. Not all that important of a consideration, but worth being aware of.
Here are my notes which might not be easier to understand, but they are shorter and capture the key ideas:
Uneasiness about chains of reasoning with imperfect concepts
Uneasy about conjunctiveness: It’s not clear how conjunctive AI doom is (AI doom being conjunctive would mean that Thing A and Thing B and Thing C all have to happen or be true in order for AI doom to occur; this is opposed to being disjunctive, where either A, or B, or C alone would be sufficient for AI doom), and Nate Soares’s response to Carlsmith’s power-seeking AI report is not a silver bullet; there is social pressure in some places to just accept that Carlsmith’s report uses a biased methodology and to move on. But obviously there’s some element of conjunctiveness that has to be dealt with.
Don’t trust the concepts: a lot of the early AI Risk discussions came before Deep Learning. Some of the concepts should port over to near-term-likely AI systems, but not all of them (e.g., Alien values, Maximalist desire for world domination)
Uneasiness about in-the-limit reasoning: Many arguments go something like this: an arbitrarily intelligent AI will adopt instrumental power seeking tendencies and this will be very bad for humanity; progress is pushing toward that point, so that’s a big deal. Often this line of reasoning assumes we hit in-the-limit cases around or very soon after we hit greater than human intelligence; this may not be the case.
AGI, so what?: Thinking AGI will be transformative doesn’t mean it will be maximally transformative; e.g., the Industrial Revolution was transformative but not maximally so, because people adapted to it
I don’t trust chains of reasoning with imperfect concepts: When your concepts are not very clearly defined/understood, it is quite difficult to accurately use them in complex chains of reasoning.
Uneasiness about selection effects at the level of arguments
“there is a small but intelligent community of people who have spent significant time producing some convincing arguments about AGI, but no community which has spent the same amount of effort looking for arguments against”
The people who don’t believe the initial arguments don’t engage with the community or with further arguments. If you look at the reference class “people who have engaged with this argument for more than 1 hour” and see that they all worry about AI risk, you might conclude that the argument is compelling. However, you are ignoring the major selection effects in who engages with the argument for an hour. Many other ideological groups have a similar dynamic: the class “people who have read the New Testament” is full of people who believe in the Christian god, which might lead you to believe that the balance of evidence is in their favor — but of course, that class of people is highly selected for those who already believe in god or are receptive to such a belief.
“the strongest case for scepticism is unlikely to be promulgated. If you could pin folks bouncing off down to explain their scepticism, their arguments probably won’t be that strong/have good rebuttals from the AI risk crowd. But if you could force them to spend years working on their arguments, maybe their case would be much more competitive with proponent SOTA”
Ideally we want to sum all the evidence for and all the evidence against and compare. What happens instead is that skeptics come with 20 units of evidence and we shoot them down with 50 units of evidence for AI risk. In reality there could be 100 units of evidence against and only 50 for, and we would not know this if we didn’t have really-well-informed skeptics or weren’t summing their arguments over time.
“It is interesting that when people move to the Bay area, this is often very “helpful” for them in terms of updating towards higher AI risk. I think that this is a sign that a bunch of social fuckery is going on.”
“More specifically, I think that “if I isolate people from their normal context, they are more likely to agree with my idiosyncratic beliefs” is a mechanisms that works for many types of beliefs, not just true ones. And more generally, I think that “AI doom is near” and associated beliefs are a memeplex, and I am inclined to discount their specifics.”
Miscellanea
Difference between in-argument reasoning and all-things-considered reasoning: Often the gung-ho people don’t make this distinction.
Methodological uncertainty: forecasting is hard
Uncertainty about unknown unknowns: Most of the unknown unknowns seem likely to delay AGI, things like Covid and nuclear war
Updating on virtue: You can update based on how morally or epistemically virtuous somebody is. Historically, some of those pushing AI Risk were doing so not for the goal of truth seeking but for the goal of convincing people
Industry vs AI safety community: Those in industry seem to be influenced somewhat by AI Safety, so it is hard to isolate what they think
Conclusion
Main classes of things pointed out: Distrust of reasoning chains using fuzzy concepts, Distrust of selection effects at the level of arguments, Distrust of community dynamics
Now in a position where it may be hard to update based on other people’s object-level arguments
This evidence doesn’t update me very much.
I would prefer an EA Forum without your critical writing on it, because I think your critical writing has similar problems to this post...
I interpret this quote to be saying, “this style of criticism — which seems to lack a ToC and especially fails to engage with the cruxes its critics have, which feels much closer to shouting into the void than making progress on existing disagreements — is bad for the forum discourse by my lights. And it’s fine for me to dissuade people from writing content which hurts discourse”
Buck’s top-level comment is gesturing at a “How to productively criticize EA via a forum post, according to Buck”, and I think it’s noble to explain this to somebody even if you don’t think their proposals are good. I think the discourse around the EA community and criticisms would be significantly better if everybody read Buck’s top level comment, and I plan on making it the reference I send to people on the topic.
Personally I disagree with many of the proposals in this post and I also wish the people writing it had a better ToC, especially one that helps make progress on the disagreement, e.g., by commissioning a research project to better understand a relevant consideration, or by steelmanning existing positions held by people like me, with the intent to identify the best arguments for both sides.
I expect a project like this is not worth the cost. I imagine doing this well would require dozens of hours of interviews with people who are more senior in the EA movement, and I think many of those people’s time is often quite valuable.
Regarding the pros you mention:
- I’m not convinced that building more EA ethos/identity based around shared history is a good thing. I expect this would make it even harder to pivot to new things or treat EA as a question; it also wouldn’t be unifying for many folks (e.g., those who have been thinking about AI safety for a decade or who don’t buy longtermism). According to me, the bulk of people who call themselves EAs, like most groups, are too slow to update on new arguments and information, and I would expect that having a written and agreed-upon history would not help with this. Then again, my point might be made better if I could reference common historical cases of what I mean lol
- I don’t see how this helps build trust.
- I don’t see how having a written history makes the movement less likely to die. I also don’t know what it looks like for the EA movement to die or how bad this actually is; the EA movement is largely instrumental toward other things I care about: reducing suffering, increasing the chances of good stuff in the universe, my and my friends’ happiness to a lesser extent.
- This does seem like a value add to me, though the project I’m imagining only does a medium job at this given its goal is not “chronology of mistakes and missteps”. Maybe worth checking out https://www.openphilanthropy.org/research/some-case-studies-in-early-field-growth/
With ideas like this I sometimes ask myself “why hasn’t somebody done this yet?” Some reasons that come to mind: too busy doing other things they think are important; it might come across as self-aggrandizing; and who’s going to read it? The ways I expect it to get read are weird and indoctrination-y (“welcome to the club, here’s a book about our history”, as opposed to “oh, you want to do lots of good, here are some ideas that might be useful”). It also doesn’t directly improve the world, and the indirect path to impact is shakier than for other meta things.
I’m not saying this is necessarily a bad idea. But so far I don’t see strong reasons to do this over the many other things Open Phil/CEA/Kelsey Piper/interviewees could be doing.
I’m curious why you don’t include intellectually aggressive culture in the summary? It seems like this was a notable part of a few of the case studies. Did the others just not mention this, or is there information indicating they didn’t have this culture? I’m curious how widespread this feature is. e.g.,