richard_ngo comments on On Deference and Yudkowsky’s AI Risk Estimates

richard_ngo Jun 24, 2022, 11:03 PM
8 points
0 ∶ 0
Yepp, thanks for the clear rephrasing. My original arguments for this view were pretty messy because I didn’t have it fully fleshed out in my mind before writing this comment thread, I just had a few underlying intuitions about ways I thought Ben was going wrong.

Upon further reflection I think I’d make two changes to your rephrasing.

First change: in your rephrasing, we assign people weights based on the quality of their beliefs, but then follow their recommended policies. But any given way of measuring the quality of beliefs (in terms of novelty, track record, etc) is only an imperfect proxy for quality of policies. For example, Kurzweil might very presciently predict that compute is the key driver of AI progress, but suppose (for the sake of argument) that the way he does so is by having a worldview in which everything is deterministic, individuals are powerless to affect the future, etc. Then you actually don’t want to give many resources to Kurzweil’s policies, because Kurzweil might have no idea which policies make any difference.

So I think I want to adjust the rephrasing to say: in principle we should assign people weights based on how well their past recommended policies for someone like you would have worked out, which you can estimate using things like their track record of predictions, novelty of ideas, etc. But notably, the quality of past recommended policies is often not very sensitive to credences! For example, if you think that there’s a 50% chance of solving nanotech in a decade, or a 90% chance of solving nanotech in a decade, then you’ll probably still recommend working on nanotech (or nanotech safety) either way.

Having said all that, since we only get one rollout, evaluating policies is very high variance. And so looking at other information like reasoning, predictions, credences, etc, helps you distinguish between “good” and “lucky”. But fundamentally we should think of these as approximations to policy evaluation, at least if you’re assuming that we mostly can’t fully evaluate whether their reasons for holding their views are sound.

Second change: what about the case where we don’t get to allocate resources, but we have to actually make a set of individual decisions? I think the theoretically correct move here is something like: let policies spend their weight on the domains which they think are most important, and then follow the policy which has spent most weight on that domain.

Some complications:
- I say “domains” not “decisions” because you don’t want to make a series of related decisions which are each decided by a different policy, that seems incoherent (especially if policies are reasoning adversarially about how to undermine each other’s actions).
- More generally, this procedure could in theory be sensitive to bargaining and negotiating dynamics between different policies, and also the structure of the voting system (e.g. which decisions are voted on first, etc). I think we can just resolve to ignore those and do fine, but in principle I expect it gets pretty funky.
Lastly, two meta-level notes:
- I feel like I’ve probably just reformulated some kind of reinforcement learning. Specifically the case where you have a fixed class of policies and no knowledge of how they relate to each other, so you can only learn how much to upweight each policy. And then the best policy is not actually in your hypothesis space, but you can learn a simple meta-policy of when to use each existing policy.
- It’s very ironic that in order to figure out how much to defer to Yudkowsky we need to invent a theory of idealised cooperative decision-making. Since he’s probably the person whose thoughts on that I trust most, I guess we should meta-defer to him about what that will look like...
- Rohin Shah Jun 25, 2022, 9:07 AM
  7 points
  0 ∶ 0
  Parent
  First change:
  In your Kurzweil example I think the issue is not that you assigned weights based on hypothetical-Kurzweil’s beliefs, but that hypothetical-Kurzweil is completely indifferent over policies. I think the natural fix is “moral parliament” style decision making where the weights can still come from beliefs but they now apply more to preferences-over-policies. In your example hypothetical-Kurzweil has a lot of weight but never has any preferences-over-policies so doesn’t end up influencing your decisions at all.
  That being said, I agree that if you can evaluate quality of past recommended policies well, without a ton of noise, that would be a better signal than accuracy of beliefs. This just seems extremely hard to do, especially given the selection bias in who comes to your attention in the first place, and idk how I’d do it for Eliezer in any sane way. (Whereas you get to see people state many more beliefs and so there are a lot more data points that you can evaluate if you look at beliefs.)
  But notably, the quality of past recommended policies is often not very sensitive to credences!
  I think you’re thinking way too much about credences-in-particular. The relevant notion is not “credences”, it’s that-which-determines-how-much-influence-the-person-has-over-your-actions. In this model of deference the relevant notion is the weights assigned in step 2 (however you calculate them), and the message of Ben’s post would be “I think people assign too high a weight to Eliezer”, rather than anything about credences. I don’t think either Ben or I care particularly much about credences-based-on-deference except inasmuch as they affect your actions.
  I do agree that Ben’s post looks at credences that Eliezer has given and considers those to be relevant evidence for computing what weight to assign Eliezer. You could take a strong stand against using people’s credences or beliefs to compute weights, but that is at least a pretty controversial take (that I personally don’t agree with), and it seems different from what you’ve been arguing so far (except possibly in the parent comment).
  Second change:
  This change seems fine. Personally I’m pretty happy with a rough heuristic of “here’s how I should be splitting my resources across worldviews” and then going off of intuitive “how much does this worldview care about this decision” + intuitive trading between worldviews rather than something more fleshed out and formal but that seems mostly a matter of taste.
  - richard_ngo Jun 28, 2022, 1:18 AM
    5 points
    0 ∶ 0
    Parent
    In your Kurzweil example I think the issue is not that you assigned weights based on hypothetical-Kurzweil’s beliefs, but that hypothetical-Kurzweil is completely indifferent over policies.
    Your procedure is non-robust in the sense that, if Kurzweil transitions from total indifference to thinking that one policy is better by epsilon, he’ll throw his full weight behind that policy. Hmm, but then in a parliamentary approach I guess that if there are a few different things he cares epsilon about, then other policies could negotiate to give him influence only over the things they don’t care about themselves. Weighting by hypothetical-past-impact still seems a bit more elegant, but maybe it washes out.
    (If we want to be really on-policy then I guess the thing which we should be evaluating is whether the person’s worldview would have had good consequences when added to our previous mix of worldviews. And one algorithm for this is assigning policies weights by starting off from a state where they don’t know anything about the world, then letting them bet on all your knowledge about the past (where the amount they win on bets is determined not just by how correct they are, but also how much they disagree with other policies). But this seems way too complicated to be helpful in practice.)
    I agree that if you can evaluate quality of past recommended policies well, without a ton of noise, that would be a better signal than accuracy of beliefs. This just seems extremely hard to do, especially given the selection bias in who comes to your attention in the first place, and idk how I’d do it for Eliezer in any sane way.
    I think I’m happy with people spending a bunch of time evaluating accuracy of beliefs, as long as they keep in mind that this is a proxy for quality of recommended policies. Which I claim is an accurate description of what I was doing, and what Ben wasn’t: e.g. when I say that credences matter less than coherence of worldviews, that’s because the latter is crucial for designing good policies, whereas the former might not be; and when I say that all-things-considered estimates of things like “total risk level” aren’t very important, that’s because in principle we should be aggregating policies not risk estimates between worldviews.
    I also agree that selection bias could be a big problem; again, I think that the best strategy here is something like “do the standard things while remembering what’s a proxy for what”.
    - Rohin Shah Jun 28, 2022, 8:45 AM
      7 points
      0 ∶ 0
      Parent
      Meta: This comment (and some previous ones) get a bunch into “what should deference look like”, which is interesting, but I’ll note that most of this seems unrelated to my original claim, which was just “deference* seems important for people making decisions now, even if it isn’t very important in practice for researchers”, in contradiction to a sentence on your top-level comment. Do you now agree with that claim?
      *Here I mean deference in the sense of how-much-influence-various-experts-have-over-your-actions. I initially called this “credences” because I thought you were imagining a model of deference in which literal credences determined how much influence experts had over your actions.
      Your procedure is non-robust in the sense that, if Kurzweil transitions from total indifference to thinking that one policy is better by epsilon, he’ll throw his full weight behind that policy.
      Agreed, but I’m not too worried about that. It seems like you’ll necessarily have some edge cases like this; I’d want to see an argument that the edge cases would be common before I switch to something else.
      The chain of approximations could look something like:
      The correct thing to do is to consider all actions / policies and execute the one with the highest expected impact.
      First approximation: Since there are so many actions / policies, it would take too long to do this well, and so we instead take a shortcut and consider only those actions / policies that more experienced people have thought of, and execute the ones with the highest expected impact. (I’m assuming for now that you’re not in the business of coming up with new ideas of things to do.)
      Second approximation: Actually it’s still pretty hard to evaluate the expected impact of the restricted set of actions / policies, so we’ll instead do the ones that the experts say is highest impact. Since the experts disagree, we’ll divide our resources amongst them, in accordance with our predictions of which experts have highest expected impact across their portfolios of actions. (This is assuming a large enough pile of resources that it makes sense to diversify due to diminishing marginal returns for any one expert.)
      Third approximation: Actually expected impact of an expert’s portfolio of actions is still pretty hard to assess, we can save ourselves decision time by choosing weights for the portfolios according to some proxy that’s easier to assess.
      It seems like right now we’re disagreeing about proxies we could use in the third approximation. It seems to me like proxies should be evaluated based on how close they reach the desired metric (expected future impact) in realistic use cases, which would involve both (1) how closely they align with “expected future impact” in general and (2) how easy they are to evaluate. It seems to me like you’re thinking mostly of (1) and not (2) and this seems weird to me; if you were going to ignore (2) you should just choose “expected future impact”. Anyway, individual proxies and my thoughts on them:
      Beliefs / credences: ⁵⁄₁₀ on easy to evaluate (e.g. Ben could write this post). ³⁄₁₀ on correlation with expected future impact. Doesn’t take into account how much impact experts think their policies could have (e.g. the Kurzweil example above).
      Coherence: ³⁄₁₀ on easy to evaluate (seems hard to do this without being an expert in the field). ²⁄₁₀ on correlation with expected future impact (it’s not that hard to have wrong coherent worldviews, see e.g. many pop sci books).
      Hypothetical impact of past policies: ¹⁄₁₀ on easy to evaluate (though it depends on the domain). ⁷⁄₁₀ on correlation with expected future impact (it’s not ⁹⁄₁₀ or ¹⁰⁄₁₀ because selection bias seems very hard to account for).
      As is almost always the case with proxies, I would usually use an intuitive combination of all the available proxies, because that seems way more robust than relying on any single one. I am not advocating for only relying on beliefs.
      Which I claim is an accurate description of what I was doing, and what Ben wasn’t
      I get the sense that you think I’m trying to defend “this is a good post and has no problems whatsoever”? (If so, that’s not what I said.)
      Summarizing my main claims about this deference model that you might disagree with:
      In practice, an expert’s beliefs / credences will be relevant information into deciding what weight to assign them,
      Ben’s post provides relevant information about Eliezer’s beliefs (note this is not taking a stand on other aspects of the post, e.g. the claim about how much people should defer to Eliezer)
      The weights assigned to experts are important / valuable to people who need to make decisions now (but they are usually not very important / valuable to researchers).
      - richard_ngo Jun 28, 2022, 9:17 PM
        11 points
        0 ∶ 0
        Parent
        Meta: I’m currently writing up a post with a fully-fleshed-out account of deference. If you’d like to drop this thread and engage with that when it comes out (or drop this thread without engaging with that), feel free; I expect it to be easier to debate when I’ve described the position I’m defending in more detail.
        I’ll note that most of this seems unrelated to my original claim, which was just “deference* seems important for people making decisions now, even if it isn’t very important in practice for researchers”, in contradiction to a sentence on your top-level comment. Do you now agree with that claim?
        I always agreed with this claim; my point was that the type of deference which is important for people making decisions now should not be very sensitive to the “specific credences” of the people you’re deferring to. You were arguing above that the difference between your and Eliezer’s views makes much more than a 2x difference; do you now agree that, on my account of deference, a big change in the deference-weight you assign to Eliezer plausibly leads to a much smaller change in your policy from the perspective of other worldviews, because the Eliezer-worldview trades off influence over most parts of the policy for influence over the parts that the Eliezer-worldview thinks are crucial and other policies don’t?
        individual proxies and my thoughts on them
        This is helpful, thanks. I of course agree that we should consider both correlations with impact and ease of evaluation; I’m talking so much about the former because not noticing this seems like the default mistake that people make when thinking about epistemic modesty. Relatedly, I think my biggest points of disagreement with your list are:
        1. I think calibrated credences are badly-correlated with expected future impact, because:
        a) Overconfidence is just so common, and top experts are often really miscalibrated even when they have really good models of their field
        b) The people who are best at having impact have goals other than sounding calibrated—e.g. convincing people to work with them, fighting social pressure towards conformity, etc. By contrast, the people who are best at being calibrated are likely the ones who are always stating their all-things-considered views, and who therefore may have very poor object-level models. This is particularly worrying when we’re trying to infer credences from tone—e.g. it’s hard to distinguish the hypotheses “Eliezer’s inside views are less calibrated than other peoples” and “Eliezer always speaks based on his inside-view credences, whereas other people usually speak based on their all-things-considered credences”.
        c) I think that “directionally correct beliefs” are much better-correlated, and not that much harder to evaluate, and so credences are especially unhelpful by comparison to those (like, ²⁄₁₀ before conditioning on directional correctness, and ¹⁄₁₀ after, whereas directional correctness is like ³⁄₁₀).
        2. I think coherence is very well-correlated with expected future impact (like, ⁵⁄₁₀), because impact is heavy-tailed and the biggest sources of impact often require strong, coherent views. I don’t think it’s that hard to evaluate in hindsight, because the more coherent a view is, the more easily it’s falsified by history.
        3. I think “hypothetical impact of past policies” is not that hard to evaluate. E.g. in Eliezer’s case the main impact is “people do a bunch of technical alignment work much earlier”, which I think we both agree is robustly good.
        Rohin Shah Jun 29, 2022, 7:50 AM
        5 points
        0 ∶ 0
        Parent
        You were arguing above that the difference between your and Eliezer’s views makes much more than a 2x difference;
        I was arguing that EV estimates have more than a 2x difference; I think this is pretty irrelevant to the deference model you’re suggesting (which I didn’t know you were suggesting at the time).
        do you now agree that, on my account of deference, a big change in the deference-weight you assign to Eliezer plausibly leads to a much smaller change in your policy from the perspective of other worldviews, because the Eliezer-worldview trades off influence over most parts of the policy for influence over the parts that the Eliezer-worldview thinks are crucial and other policies don’t?
        No, I don’t agree with that. It seems like all the worldviews are going to want resources (money / time) and access to that is ~zero-sum. (All the worldviews want “get more resources” so I’m assuming you’re already doing that as much as possible.) The bargaining helps you avoid wasting resources on counterproductive fighting between worldviews, it doesn’t change the amount of resources each worldview gets to spend.
        Going from allocating 10% of your resources to 20% of your resources to a worldview seems like a big change. It’s a big difference if you start with twice as much money / time as you otherwise would have, unless there just happens to be a sharp drop in marginal utility of resources between those two points for some reason.
        Maybe you think that there are lots of things one could do that have way more effect than “redirecting 10% of one’s resources” and so it’s not a big deal? If so can you give examples?
        I think calibrated credences are badly-correlated with expected future impact
        I agree overconfidence is common and you shouldn’t literally calculate a Brier score to figure out who to defer to.
        I agree that directionally-correct beliefs are better correlated than calibrated credences.
        When I say “evaluate beliefs” I mean “look at stated beliefs and see how reasonable they look overall, taking into account what other people thought when the beliefs were stated” and not “calculate a Brier score”; I think this post is obviously closer to the former than the latter.
        I agree that people’s other goals make it harder to evaluate what their “true beliefs” are, and that’s one of the reasons I say it’s only ³⁄₁₀ correlation.
        I think coherence is very well-correlated with expected future impact (like, ⁵⁄₁₀), because impact is heavy-tailed and the biggest sources of impact often require strong, coherent views. I don’t think it’s that hard to evaluate in hindsight, because the more coherent a view is, the more easily it’s falsified by history.
        Re: correlation, I was implicitly also asking the question “how much does this vary across experts”. Across the general population, maybe coherence is ⁷⁄₁₀ correlated with expected future impact; across the experts that one would consider deferring to I think it is more like ²⁄₁₀, because most experts seem pretty coherent (within the domains they’re thinking about and trying to influence) and so the differences in impact depend on other factors.
        Re: evaluation, it seems way more common to me that there are multiple strong, coherent, conflicting views that all seem compelling (see epistemic learned helplessness), which do not seem to have been easily falsified by history (in sufficiently obvious manner that everyone agrees which one is false).
        This too is in large part because we’re looking at experts in particular. I think we’re good at selecting for “enough coherence” before we consider someone an expert (if anything I think we do it too much in the “public intellectual” space), and so evaluating coherence well enough to find differences between experts ends up being pretty hard.
        I think “hypothetical impact of past policies” is not that hard to evaluate. E.g. in Eliezer’s case the main impact is “people do a bunch of technical alignment work much earlier”, which I think we both agree is robustly good.
        I feel like looking at any EA org’s report on estimation of their own impact makes it seem like “impact of past policies” is really difficult to evaluate?
        Eliezer seems like a particularly easy case, where I agree his impact is probably net positive from getting people to do alignment work earlier, but even so I think there’s a bunch of questions that I’m uncertain about:
        How bad is it that some people completely dismiss AI risk because they encountered Eliezer and found it off putting? (I’ve explicitly heard something along the lines of “that crazy stuff from Yudkowsky” from multiple ML researchers.)
        How many people would be working on alignment without Eliezer’s work? (Not obviously hugely fewer, Superintelligence plausibly still gets published, Stuart Russell plausibly still goes around giving talks about value alignment and its importance.)
        To what extent did Eliezer’s forceful rhetoric (as opposed to analytic argument) lead people to focus on the wrong problems?
        richard_ngo Jul 13, 2022, 7:41 PM
        3 points
        0 ∶ 0
        Parent
        I’ve now written up a more complete theory of deference here. I don’t expect that it directly resolves these disagreements, but hopefully it’s clearer than this thread.
        Going from allocating 10% of your resources to 20% of your resources to a worldview seems like a big change.
        Note that this wouldn’t actually make a big change for AI alignment, since we don’t know how to use more funding. It’d make a big change if we were talking about allocating people, but my general heuristic is that I’m most excited about people acting on strong worldviews of their own, and so I think the role of deference there should be much more limited than when it comes to money. (This all falls out of the theory I linked above.)
        Across the general population, maybe coherence is ⁷⁄₁₀ correlated with expected future impact; across the experts that one would consider deferring to I think it is more like ²⁄₁₀, because most experts seem pretty coherent (within the domains they’re thinking about and trying to influence) and so the differences in impact depend on other factors.
        Experts are coherent within the bounds of conventional study. When we try to apply that expertise to related topics that are less conventional (e.g. ML researchers on AGI; or even economists on what the most valuable interventions are) coherence drops very sharply. (I’m reminded of an interview where Tyler Cowen says that the most valuable cause area is banning alcohol, based on some personal intuitions.)
        I feel like looking at any EA org’s report on estimation of their own impact makes it seem like “impact of past policies” is really difficult to evaluate?
        The question is how it compares to estimating past correctness, where we face pretty similar problems. But mostly I think we don’t disagree too much on this question—I think epistemic evaluations are gonna be bigger either way, and I’m mostly just advocating for the “think-of-them-as-a-proxy” thing, which you might be doing but very few others are.
        Rohin Shah Jul 14, 2022, 6:52 AM
        2 points
        0 ∶ 0
        Parent
        Note that this wouldn’t actually make a big change for AI alignment, since we don’t know how to use more funding.
        Funding isn’t the only resource:
        You’d change how you introduce people to alignment (since I’d guess that has a pretty strong causal impact on what worldviews they end up acting on). E.g. if you previously flipped a 10%-weighted coin to decide whether to send them down the Eliezer track or the other track, now you’d flip a 20%-weighted coin, and this straightforwardly leads to different numbers of people working on particular research agendas that the worldviews disagree about. Or if you imagine the community as a whole acting as an agent, you send 20% of the people to MIRI fellowships and the remainder to other fellowships (whereas previously it would be 10%).
        (More broadly I think there’s a ton of stuff you do differently in community building, e.g. do you target people who know ML or people who are good at math?)
        You’d change what you used political power for. I don’t particularly understand what policies Eliezer would advocate for but they seem different, e.g. I think I’m more keen on making sure particular alignment schemes for building AI systems get used and less keen on stopping everyone from doing stuff besides one secrecy-oriented lab that can become a leader.
        Experts are coherent within the bounds of conventional study.
        Yeah, that’s what I mean.