Looks like someone should attempt a pivotal act. If you think you might be the right person for the job—you probably are!
A brief theory of why we think things are good or bad
I would take the proposal to be AI -> growth -> climate change or other negative side effects of growth
I can see how this gets you a (marginal) expected value for each item, but not the correlations between them. One of the advantages Ozzie raises is the possibility of keeping track of correlations in value estimates, which requires more than the marginal expectations.
So constructing a value ratio table means estimating a joint distribution of values from a subset of pairwise comparisons, then sampling from the distribution to fill out the table?
In that case, I think estimating the distribution is the hard part. Your example is straightforward because it features independent estimates, or simple functional relationships.
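To make sure I understand, here is a minimal sketch in Python (the multivariate-normal model and all the numbers are invented) of the "estimate a joint distribution, then sample from it to fill out the table" procedure; as I say, picking the joint distribution is the part I think is hard.

```python
import numpy as np

# Invented toy model: log-values of three outcomes are jointly normal.
outcomes = ["A", "B", "C"]
mean_log_value = np.array([0.0, 1.0, 2.5])
cov_log_value = np.array([
    [0.50, 0.30, 0.00],  # A and B positively correlated; C independent of both
    [0.30, 0.50, 0.00],
    [0.00, 0.00, 1.00],
])

rng = np.random.default_rng(0)
log_values = rng.multivariate_normal(mean_log_value, cov_log_value, size=10_000)
values = np.exp(log_values)  # shape: (samples, items)

# Fill out the value ratio table by sampling: cell (i, j) holds samples of value_i / value_j.
# Because ratios are computed sample-by-sample, correlations between value estimates carry over.
ratio_samples = values[:, :, None] / values[:, None, :]  # shape: (samples, items, items)
median_ratio_table = np.median(ratio_samples, axis=0)
print(median_ratio_table)
```

In this toy version the assumed correlation between A and B makes their ratio distribution much tighter than the ratios involving C, which is the correlation-tracking benefit, but only because the covariance structure was supplied up front.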
The only piece of literature I had in mind was von Neumann and Morgenstern’s representation theorem. It says: if you have a set of probability distributions over a set of outcomes and for each pair of distributions you have a preference (one is better than the other, or they are equal) and if this relation satisfies the additional requirements of transitivity, continuity and independence from alternatives, then you can represent the preferences with a utility function unique up to affine transformation.
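In symbols (my rough paraphrase of the standard statement): given a complete and transitive preference relation $\succeq$ over lotteries that also satisfies continuity and independence, there exists a utility function $u$ such that

$$p \succeq q \iff \mathbb{E}_{x \sim p}[u(x)] \ge \mathbb{E}_{x \sim q}[u(x)],$$

and $u$ is unique up to positive affine transformations $u \mapsto a\,u + b$ with $a > 0$.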
Given that this is a foundational result for expected utility theory, I don’t think it is unusual to think of a utility function as a representation of a preference relation.
Do you envision your value ratio table to be underwritten by a unique utility function? That is, could we assign a single number U(A) to every outcome A such that the table cell corresponding to the outcome pair (A, B) is always equal to U(A)/U(B)? These utilities could be treated as noisy estimates, which allows for correlations between U(A) and U(B) for some pairs.
My remarks concern what a value ratio table might be if it is more than just a “visualisation” of a utility function.
Because we are more likely to see no big changes than to see another big change.
if the risk is usually quite low (e.g. 0.001 % per century), but sometimes jumps to a high value (e.g. 1 % per century), the cumulative risk (over all time) may still be significantly below 100 % (e.g. 90 %) if the magnitude of the jumps decreases quickly, and risk does not stay high for long.
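A toy calculation (Python; every number here is invented) of the kind of trajectory I mean: risk is usually 0.001% per century, occasionally jumps up towards 1%, and the jump magnitude decays quickly, so the cumulative risk stays well below 100%.

```python
import numpy as np

rng = np.random.default_rng(1)
n_centuries = 10_000
baseline = 1e-5          # 0.001% per century (invented)
jump_initial = 1e-2      # 1% per century at the first jump (invented)
jump_prob = 0.01         # a jump happens in about 1% of centuries (invented)
decay_per_jump = 0.5     # each successive jump is half the size of the last (invented)

risk = np.full(n_centuries, baseline)
jump_size = jump_initial
for t in range(n_centuries):
    if rng.random() < jump_prob:
        risk[t] = jump_size
        jump_size *= decay_per_jump  # magnitude of the jumps decreases quickly

survival = np.prod(1 - risk)
print(f"cumulative risk over {n_centuries} centuries: {1 - survival:.1%}")
```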
I would call this model “transient deviation” rather than “random walk” or “regular oscillation”
We can still get H4 if the amplitude of the oscillation or random walk decreases over time, right?
The average needs to fall, not the amplitude. If we’re looking at risk in percentage points (rather than, say, logits, which might be a better parametrisation), small average implies small amplitude, but small amplitude does not imply small average.
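To spell out why: over $T$ periods with per-period risks $r_t$, the chance of surviving the whole stretch is

$$\prod_t (1 - r_t) \approx \exp\Big(-\sum_t r_t\Big) = \exp(-T\,\bar{r})$$

(for small $r_t$), so cumulative risk is governed by the average $\bar{r}$, not by the amplitude of wobbles around it. For instance, an oscillation between 9% and 11% per century has an amplitude of only 1 percentage point but an average of 10%, which compounds to near-certain catastrophe within a few dozen centuries; conversely, since risk can't go below zero, a small average does cap how large a regular oscillation can be.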
Only if the sudden change has a sufficiently large magnitude, right?
The large magnitude is an observation—we have seen risk go from quite low to quite high over a short period of time. If we expect such large magnitude changes to be rare, then we might expect the present conditions to persist.
FWIW I think the general kind of model underlying what I’ve written is a joint distribution that models value something like
Thought about this some more. This isn’t a summary of your work, it’s an attempt to understand it in my terms. Here’s how I see it right now: we can use pairwise comparisons of outcomes to elicit preferences, and people often do, but they typically choose to insist that each outcome has a value representable as a single number and use the pairwise comparisons to decide which number to assign each outcome. Insisting that each outcome has a value is a constraint on preferences that can allow us to compute which outcome is preferred between two outcomes for which we do not have direct data.
I see this post as arguing that we should instead represent preferences as a table of value ratios. This is not about eliciting preferences, but representing them. Why would we want to represent them like this? At first glance:
If the important thing is that we represent preferences as a table, then we can capture every important comparison with a table of binary preferences
If we want to impose additional constraints so that we can extrapolate preferences, preference ratios seem to push us back to assigning one or more values to every outcome
What makes value ratios different from other schemes with multiple valuation functions is that value ratios give us a value function for each outcome we investigate. That is, there is a one-to-one correspondence between outcomes and value functions.
Here is a theory of why that might be useful: When we talk about the value of outcomes (such as “$5”), we are actually talking about that outcome in some context (such as “$5 for me now” or “$5 for someone who is very poor, now”). Preference relations can and do treat these outcomes as different depending on the context - $5 for me is worth less than $5 for someone who is very poor. Because of this, a value scale based on “$5-equivalents” will be different depending on the context of the $5.
A key proposition to motivate value ratios, Proposition 1: every outcome which we consider comes with a unique implied mixture of contexts. That is, if I say “the value of $5”, I mean the value of $5 evaluated in C, where C is the mixture of contexts implied by my having said “$5”.
This means, if I want to compare “the value of $10m” to “the value of saving a child’s life”, I have two options: I can compare the two in the mixture of contexts implied by “$10m”, or I can compare them in the mixture of contexts implied by “saving a child’s life”. These might give me different answers, and the correct comparison depends on which applied context I am considering these options in.
A value ratio table could therefore be considered a table where each column is a context and each row specifies the relative value of the given item in that context. Note that, under this interpretation, we should not expect the cell in row A, column B to be the reciprocal of the cell in row B, column A, unless the contexts implied by A and B coincide. This is because items have different values in different contexts.
This can be extended to distributions over value ratios, in which case perhaps each sample comes with a context sampled from the distribution of contexts for that column of the table (I’m not entirely sure that works, but maybe it does). This can allow us to represent within-column correlations if we know that one outcome is some fixed number of times better than another, regardless of context.
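A minimal sketch (Python; the value function, contexts and numbers are all invented) of the context-indexed reading of the table: each column fixes the context implied by the column item, each cell is the value of the row item relative to the column item in that context, so cells in different columns need not be reciprocals of one another.

```python
# Hypothetical context-dependent value function: value[(item, context)].
# Numbers are invented purely to illustrate the structure.
value = {
    ("$5", "me now"): 1.0,
    ("$5", "very poor person"): 20.0,
    ("save a child", "me now"): 5000.0,
    ("save a child", "very poor person"): 5000.0,
}

items = ["$5", "save a child"]
# Proposition 1 (roughly): each item implies its own mixture of contexts.
implied_context = {"$5": "me now", "save a child": "very poor person"}

# Cell (row, col): value of `row` relative to `col`, both evaluated in col's implied context.
table = {
    col: {row: value[(row, implied_context[col])] / value[(col, implied_context[col])]
          for row in items}
    for col in items
}

print(table["$5"]["save a child"])   # child relative to $5, in the context implied by "$5"
print(table["save a child"]["$5"])   # $5 relative to child, in the context implied by "save a child"
# These are generally NOT reciprocals, because the two columns hold different contexts fixed.
```

The point of the sketch is just the indexing: cells are only directly comparable within a column, because each column holds the column item’s implied context fixed.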
I don’t think proposition 1 is plausible if we interpret it strictly. I’m pretty sure at different times people talk about the value of $5 with different implied contexts, and at other times I think people probably make some effort to consider the value of quite different outcomes in a common context. However, I think there still might be something to it. Whenever you’re weighing up different outcomes, you definitely have an implicit context in mind. Furthermore, there probably is a substantial correlation between the context and the outcome—if two different people are considering the value of saving a child’s life then there probably is substantial overlap between the contexts they’re considering. Moreover, it’s plausible that context sensitivity is an issue for the kinds of value comparisons that EAs want to make.
I don’t think it’s all you are doing, that’s why I wrote the rest of my comment (sorry to be flippant).
The point of bringing up binary comparisons is that a table of binary comparisons is a more general representation than a single utility function.
If all we are doing is binary comparisons between a set of items, it seems to me that it would be sufficient to represent relative values as a binary—i.e., is item1 better, or item2? Or perhaps you want a ternary function—you could also say they’re equal.
Using a ratio instead of a binary indicator for relative values suggests that you want to use the function to extrapolate. I’m not sure that this approach helps much with that, though. For example,

```
costOfp001DeathChance = ss(10 to 10k) // Cost of a 0.001% chance of death, in dollars
chanceOfDeath001 = ss(-1 * costOfp001DeathChance * dollar1) // Cost of a 0.001% chance of death
```

does not tell me how many $ a 0.001% chance of death is worth; rather, it tells me how many times better it is than $1. Without a function f(outcome in $)->value, this doesn’t enable a comparison to any other amount of dollars. We can, of course, add such a function to our estimation, but if we do then I think the function is doing much more than the value ratios to enable us to extrapolate our value judgements. Unless we have f(outcome2)=f(outcome1)*outcome2/outcome1, I don’t see how we can use ratios at all, but if we do have it then we’re back to single values.
The alternative approach seems to me to be to treat it as a machine learning problem—given binary value judgements, build a binary classifier that tells you whether item1 or item2 is better. I expect that if we had value ratios instead of binary comparisons we might do a bit better here, but they might also be harder to elicit.
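A rough sketch of that machine-learning framing (Python; the features and judgements are invented): a Bradley-Terry-style model, i.e. a logistic regression on feature differences fitted to binary comparisons, which can then be asked about pairs it has never seen.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented: each item is described by a feature vector.
item_features = {
    "item_a": np.array([1.0, 0.2]),
    "item_b": np.array([0.5, 0.9]),
    "item_c": np.array([0.1, 0.4]),
}

# Invented elicited judgements: (preferred item, other item).
comparisons = [("item_a", "item_c"), ("item_b", "item_c"), ("item_a", "item_b")]

# Bradley-Terry-style setup: classify on the difference of feature vectors,
# adding the mirrored pair so the data is antisymmetric.
X, y = [], []
for winner, loser in comparisons:
    diff = item_features[winner] - item_features[loser]
    X.extend([diff, -diff])
    y.extend([1, 0])

clf = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y))

def prefers(item1, item2):
    """Fitted probability that item1 is better than item2."""
    diff = (item_features[item1] - item_features[item2]).reshape(1, -1)
    return clf.predict_proba(diff)[0, 1]

print(prefers("item_b", "item_a"))  # extrapolated preference for an arbitrary pair
```

Dropping the intercept keeps the model antisymmetric, so prefers(a, b) + prefers(b, a) = 1; how well this extrapolates obviously depends on how informative the (here invented) features are.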
AFAIK the official MIRI solution to AI risk is to win the race to AGI but do it aligned.
Part of the MIRI theory is that winning the AGI race will give you the power to stop anyone else from building AGI. If you believe that, then it’s easy to believe that there is a race, and that you sure don’t want to lose.
That’s what I meant by “phase transition”
It cannot both be controllable because it’s weak and also uncontrollable.
That said, I expect more advanced techniques will be needed for more advanced AI; I just think control techniques probably keep up without sudden changes in control requirements.
Also, LLMs are more controllable than weaker, older designs (compare GPT-4 with Tay).
I’d love to hear from people who don’t “have adhd”. I have a diagnosis myself but I have trouble believing I’m all that unusual. I tried medication for a while, but I didn’t find it that helpful with regard to the bottom line outcome of getting things done, and I felt uncomfortable with the idea of taking stimulants regularly for many years. I’d certainly benefit from being more able to finish projects, though!
People will continue to prefer controllable to uncontrollable AI and continue to make at least a commonsense level of investment in controllability; that is, they invest as much as is naively warranted by recent experience and short-term expectations, which is less than would be warranted by a sophisticated assessment of uncertainty about misalignment, though the two may converge as “recent experience” involves more and more capable AIs. I think this minimal level of investment in control is very likely (99%+).
Next, the proposed sudden/surprising phase transition that breaks controllability properties never materialises so that commonsense investment turns out to be enough for an OK outcome. I think this is about 65%.
Next, AI-enhanced human politics also manages to generate an OK outcome. About 70%.
That’s 45%, but the bar is perhaps higher than you have in mind (I’m also counting non-misalignment paths to bad outcomes). There are also worlds where the problem is harder but more is invested and it still ends up being enough. Not sure how much weight goes there; somewhat less.
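(For the record, the arithmetic is $0.99 \times 0.65 \times 0.70 \approx 0.45$.)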
Don’t take these probabilities too seriously.
I’m writing quickly because I think this is a tricky issue and I’m trying not to spend too long on it. If I don’t make sense, I might have misspoken or made a reasoning error.
One way I thought about the problem (quite different to yours, very rough): variation in existential risk rate depends mostly on technology. At a wide enough interval (say, 100 years of tech development at current rates), change in existential risk with change in technology is hard to predict, though following Aschenbrenner and Xu’s observations it’s plausible that it tends to some equilibrium in the long run. You could perhaps model a mixture of a purely random walk and walks directed towards uncertain equilibria.
Also, technological growth probably has an upper limit somewhere, though quite unclear where, so even the purely random walk probably settles down eventually.
There’s uncertainty over a) how long it takes to “eventually” settle down, b) how much “randomness” there is as we approach an equilibrium, and c) how quickly equilibrium is approached, if it is approached.
I don’t know what you get if you try to parametrise that and integrate it all out, but I would also be surprised if it put an overwhelmingly low credence on a short, sharp time of troubles.
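If I had to turn that into something computable, a very rough Monte Carlo sketch (Python; every distribution and parameter below is invented) might look like this: mix purely random walks with walks pulled towards an uncertain equilibrium, damp the randomness once growth “settles down”, and count how often the sampled trajectory looks like a short, sharp time of troubles.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_risk_trajectory(n_centuries=200):
    """One draw of a per-century existential risk trajectory, parametrised in log-odds."""
    directed = rng.random() < 0.5                 # invented mixture weight: equilibrium-seeking vs pure random walk
    equilibrium = rng.normal(-8, 3)               # invented uncertain equilibrium (log-odds)
    pull = rng.uniform(0.0, 0.2)                  # invented: how quickly equilibrium is approached
    noise = rng.uniform(0.1, 1.0)                 # invented: how much randomness there is
    settle_time = rng.integers(10, n_centuries)   # invented: when technological growth settles down

    x = -11.5  # roughly 0.001% per century, in log-odds
    traj = []
    for t in range(n_centuries):
        drift = pull * (equilibrium - x) if directed else 0.0
        step_noise = noise if t < settle_time else 0.1 * noise
        x = x + drift + rng.normal(0, step_noise)
        traj.append(1 / (1 + np.exp(-x)))         # back to a probability
    return np.array(traj)

trajs = np.array([simulate_risk_trajectory() for _ in range(2000)])
# "Short, sharp time of troubles": risk exceeds 1%/century at some point but is back below 0.01% by the end.
short_sharp = np.mean((trajs.max(axis=1) > 1e-2) & (trajs[:, -1] < 1e-4))
print(f"fraction of sampled trajectories with a short, sharp time of troubles: {short_sharp:.2%}")
```

Nothing here should be taken seriously quantitatively; it is just the kind of parametrise-and-integrate exercise I have in mind.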
I think “one-off displacement from equilibrium” probably isn’t a great analogy for tech-driven existential risk.
I think “high and sustained risk” seems weird partly because surviving for a long period under such conditions is weird, so conditioning on survival usually suggests that risk isn’t so high after all; in many cases risk really does go down for survivors. But this effect only applies to survivors, and the other possibility is that we underestimated risk and we die. So I’m not sure that this effect changes conclusions. I’m also not sure how this affects your evaluation of your impact on risk—probably makes it smaller?
I think this observation might apply to your thought experiment, which conditions on survival.
I don’t see this. First, David’s claim is that a short time of perils with low risk thereafter seems unlikely—which is only a fraction of hypothesis 4, so I can easily see how you could get H3+H4_bad:H4_good >> 10:1
I don’t even see why it’s so implausible that H3 is strongly preferred to H4. There are many hypotheses we could make about time varying risk:
- Monotonic trend (many varieties)
- Oscillation (many varieties)
- Random walk (many varieties)
- …
If we aren’t trying to carefully consider technological change (and ignoring AI seems to force us not to do this carefully) then it’s not at all clear how to weigh all the different options. Many possible weightings do support hypothesis 3 over hypothesis 4:
- If we expect regular oscillation or time-symmetric random walks, then I think we usually get H3 (integrated oscillation = high risk; the lack of risk in the past suggests that the period of oscillation is long)
- If we expect rare, sudden changes then we get H3
- Monotonic trend obviously favours H3
If I imagine going through this exercise, I wouldn’t be that surprised to see H3 strongly favoured over H4 - but I don’t really see it as a very valuable exercise. The risk under consideration is technologically driven, so not considering technology very carefully seems to be a mistake.
There are several AGI pills one can swallow. I think the prospects for a treaty would be very bright if CCP and USG were both uncontrollability-pilled. If uncontrollability is true, strong cases for it are valuable.
On the other hand, if uncontrollability is false, Aschenbrenner’s position seems stronger (I don’t mean that it necessarily becomes correct, just that it gets stronger).