I do alignment research at the Alignment Research Center. Learn more about me at markxu.com/about
Mark Xu
How to do theoretical research, a personal perspective
Be Specific About Your Career
Ben Pace, Ben Kuhn, Ben Todd, Ben West, and Ben Garfinkel should all become the same person, to avoid confusion.
‘Dropping out’ isn’t a Plan
Your Time Might Be More Valuable Than You Think
Money Can’t (Easily) Buy Talent
Strong Evidence is Common
I expect 10 people donating 10% of their time to be less effective than 1 person using 100% of their time because you don’t get to reap the benefits of learning for the 10% people. Example: if people work for 40 years, then 10 people donating 10% of their time gives you 10 years with 0 experience, 10 with 1 year, 10 with 2 years, and 10 with 3 years; however, if someone is doing EA work full-time, you get 1 year with 0 exp, 1 with 1, 1 with 2, etc. I expect 1 year with 20 years of experience to plausibly be as good/useful as 10 with 3 years of experience. Caveats to the simple model:
labor-years might be more valuable during the present
if you’re volunteering for a thing that is similar to what you spend the other 90% of your time doing, then you still get better at the thing you’re volunteering for
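The arithmetic in the simple model above can be sketched in a few lines. This is only an illustration: the function names are made up for this example, and `linear_productivity` is a hypothetical stand-in for how output scales with experience, not a claim about the real curve. It also ignores both caveats listed above.

```python
def labor_years_by_experience(n_people, time_percent, career_years):
    """Map years of prior experience -> labor-years supplied at that level."""
    # Full-time-equivalent years each person accumulates over a career.
    fte_years = career_years * time_percent // 100
    # Each person supplies one labor-year at each level 0, 1, ..., fte_years - 1.
    return {exp: n_people for exp in range(fte_years)}


def total_output(labor_years, productivity):
    """Sum output over all labor-years, given a productivity curve."""
    return sum(count * productivity(exp) for exp, count in labor_years.items())


def linear_productivity(exp):
    # Hypothetical assumption: output per labor-year grows linearly with experience.
    return 1 + exp


part_time = labor_years_by_experience(n_people=10, time_percent=10, career_years=40)
full_time = labor_years_by_experience(n_people=1, time_percent=100, career_years=40)

print(part_time)                                     # {0: 10, 1: 10, 2: 10, 3: 10}
print(total_output(part_time, linear_productivity))  # 100
print(total_output(full_time, linear_productivity))  # 820
```

Both groups supply 40 labor-years in total. The difference comes entirely from where those years sit on the experience curve, so the conclusion is only as strong as whatever productivity curve you believe in.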
I make a similar argument here.
I will now consider everything that Carl writes henceforth to be in a parenthetical.
Meta-EA Needs Models
Defusing AGI Danger
Many people in EA depart from me here: they see choices that do not maximize impacts as personal mistakes. Imagine a button that, if you press it, would cause you to always take the impact-maximizing action for the rest of your life, even if it entails great personal sacrifice. Many (most?) longtermist EAs I talk to say they would press this button – and I believe them. That’s not true of me; I’m partially aligned with EA values (since impact is an important consideration for me), but not fully aligned.
I think there are people (e.g. me) that value things besides impact and would also press the button because of golden-rule type reasoning. Many people optimize for impact to the point where it makes them less happy.
I’m excited about more efficient matching between people who want career advice and people who are not maximally qualified to give it, but can still help nonetheless. For example, when planning my career, I often find it helpful to talk to other students making similar decisions, even though they’re not “more qualified” than me. I suspect that other students/people feel similarly and that one doesn’t need to be a career coach to be helpful.
A title like “How many lives might have been saved given an earlier COVID-19 vaccine rollout?” would have given me much more information about what the post was about than the current title, which I find very vague.
This creates weird incentives, e.g. I could construct a plausible-but-false view, make a post about it, then make a big show of changing my mind. I don’t think the amounts of money involved make it worth it, but I’m wary of incentivizing things that are so easily gamed.
When I look at most forecasting questions, they seem goodharty in a very strong sense. For example, the goodhart tower for COVID might look something like:
1. How hard should I quarantine?
2. How hard I should quarantine is affected by how “bad” COVID will be.
3. How “bad” COVID will be cashes out into something like “how many people”, “when vaccine coming”, “what is death rate”, etc.
By the time something I care about becomes specific enough to be predictable/forecastable, it seems like most of the thing I actually cared about has been lost.
Do you have a sense of how questions can be better constructed to lose less of the thing that might have inspired the question?
Kindles are smaller, have backlights, and the Kindle store is a good user experience.
Note: I work for ARC.
I would consider someone a “pretty good fit” (whatever that means) for alignment research if they started out with a relatively technical background, e.g. an undergrad degree in math/CS, but had not really engaged with alignment before, and they were able to come up with a decent proposal after:
~10 hours of engaging with the ELK doc.
~10 hours of thinking about the document and resolving confusions they had, which might involve asking some questions to clarify the rules and the setup.
~10 hours of trying to come up with a proposal.
If someone starts from having thought about alignment a bunch, I would consider them a potentially “pretty good researcher” if they were able to come up with a decent proposal in 2-8 hours. I expect many existing (alignment) researchers to be able to come up with proposals in <1 hour.
Note that I’m saying “if (can come up with proposal in N hours), then (might be good alignment researcher)” and not saying the other implication also holds, e.g. it is not the case that “if (might be good alignment researcher), then (can come up with proposal in N hours)”
How optimistic are you about “amplification” forecast schemes, where forecasters answer questions like “will a panel of experts say <answer> when considering <question> in <n> years?”
I think this model is kind of misleading, and that the original astronomical waste argument is still strong. It seems to me that a ton of the work in this model is being done by the assumption of constant risk, even in post-peril worlds. I think this is pretty strange. Here are some brief comments:
If you’re talking about the probability of a universal quantifier, such as “for all humans x, x will die”, then it seems really weird to say that this remains constant, even when the thing you’re quantifying over grows larger.
For instance, it seems clear that if there were only 100 humans, the probability of x-risk would be much higher than if there were 10^6 humans. So it seems like if there are 10^20 humans, it should be harder to cause extinction than if there are 10^10 humans.
Assuming constant risk has the implication that human extinction is guaranteed to happen at some point in the future, which puts sharp bounds on the goodness of existential risk reduction.
It’s not that hard to get exponentially decreasing probability on universal quantifiers if you assume independence in survival amongst some “unit” of humanity. In computing applications, it’s not that hard to drive down the probability of error exponentially in the resources allocated, because each unit of resource can ~halve the probability of error. Naively, each human doesn’t want to die, so there are # humans rolls for surviving/solving x-risk.
It seems like the probability of x-risk ought to be inversely proportional to the current estimated amount of value at stake. This seems to follow if you assume that civilization acts as a “value maximizer” and it’s not that hard to reduce x-risk. Haven’t worked it out, so wouldn’t be surprised if I was making some basic error here.
Generally, it seems like most of the risk is going to come from worlds where the chance of extinction isn’t actually a universal quantifier, and there’s some correlation amongst seemingly independent rolls for survival. In particularly bad cases, humans go extinct if there exists someone that wants to destroy the universe, so we actually see an extremely rapidly increasing probability of extinction as we get more humans. These worlds would require extremely strong coordination and governance solutions.
These worlds are also slightly physically impossible because parts of humanity will rapidly become causally isolated from each other. I don’t know enough cosmology to have an intuition for which way the functional form will ultimately go.
Generally, it seems like the naive view is that as humans get richer/smarter, they’ll allocate more and more resources towards not dying. At equilibrium, it seems reasonable to first-order-assume we’ll drive existential risk down until the marginal cost equals the marginal benefit, so the key question is how this equilibrium behaves. My guess is that it will depend heavily on the total amount of value available in the future, determined by physical constraints (and potentially more galaxy-brained considerations).
This view seems to allow you to recover more of the naive astronomical waste perspective.
This makes me feel like the model makes kind of strong assumptions about the amount it will ultimately cost to drive down existential risk. E.g. you seem to imply that r_l = 0.0001 is small, but an independent chance that large each century suggests that the probability humanity survives for ~10^10 years is ~0. This feels quite absurd to me.
On the sentence “Note that for the Pessimist, this is a reduction of 200,000%”: humans routinely reduce the probabilities of failures by more than 200,000% via engineering efforts and produce highly complex artifacts like computers, airplanes, rockets, satellites, etc. It feels like you should naively expect “breaking” human civilization to be harder than breaking an airplane, especially when civilization is actively trying to ensure that it doesn’t go extinct.
Also, you seem to assume each century has some constant value v eventually, which seems reasonable to me, but the implication “Warming (slightly) on short-termist cause areas” relies on an assumption that the current century is close to value v, when it seems like even pretty naive bounds (e.g. percent of the sun’s energy) suggest that the current century is not even within a factor of 10^9 of the long-run value-per-century humanity could reach.
Assuming that value grows quadratically also seems quite weird, because of analyses like “Eternity in Six Hours”, which seem to imply that a resource-maximizing civilization will undergo a period of incredibly rapid expansion to achieve per-century rates of value much higher than the current century, and then have nowhere else to go. A better model from my perspective is logistic growth of value, with the upper bound given by some weak proxy like “suppose that value is linear in the amount of energy a civilization uses, then take the total amount of value in the year 2020”, with the ultimate unit being “value in 2020”. This would produce much higher numbers, and give a more intuitive sense of “astronomical waste.”
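To make the constant-risk point from earlier in this comment concrete, here is a minimal sketch of the two calculations involved: survival under an independent, constant per-century risk, and the exponential decay you get when extinction requires many independent "units" to all fail. The function names are hypothetical, and treating 10^10 years as 10^8 independent centuries is just the model's own assumption restated.

```python
import math


def log10_survival(risk_per_century, centuries):
    """log10 of P(no extinction) under independent, constant per-century risk."""
    # log1p keeps precision when the risk is tiny.
    return centuries * math.log1p(-risk_per_century) / math.log(10)


def survival_probability(risk_per_century, centuries):
    """P(no extinction); underflows to 0.0 when the true value is too small for a float."""
    return math.exp(centuries * math.log1p(-risk_per_century))


def joint_failure_probability(per_unit_failure, units):
    """P(every unit fails), assuming the units fail independently."""
    return per_unit_failure ** units


centuries = 10**10 // 100  # ~10^10 years is 10^8 centuries

print(survival_probability(1e-4, centuries))  # 0.0 (underflow)
print(log10_survival(1e-4, centuries))        # roughly -4343
print(joint_failure_probability(0.5, 10))     # 0.0009765625
```

Under these assumptions, surviving 10^8 centuries at a 0.0001 per-century risk has probability around 10^-4343, while ten independent units that each halve the failure probability already bring it below 0.1%. Correlated failures, as discussed above, would break the second calculation.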
I like the process of proposing concrete models for things as a substrate for disagreement, and I appreciate that you wrote this. It feels much better to articulate objections like “I don’t think this particular parameter should be constant in your model” than to have abstract arguments. I also like how it’s now more clear that if you do believe that risk in post-peril worlds is constant, then the argument for longtermism is much weaker (although I think still quite strong because of my comments about v).