Hey Jobst!
Regarding non-optimizing agents,
TL;DR: These videos from Robert Miles changed my mind about this, personally
(I think we talked about that but I’m not sure?)
A bit longer:
Robert (+ @edoarad ) convinced me that an agent that isn’t optimizing anything isn’t a coherent concept. Specifically, an agent that has a few things true about it, like “it won’t trade things in a circle so that it ends up losing something and gaining nothing”, will have a goal that can be described with a utility function.
If you agree with this, then I think it’s less relevant to say that the agent “isn’t maximizing anything”, and more coherent to ask “what utility function is being maximized?”
Informally:
If I am a paperclip maximizer, but every 100 seconds I pause for 1 second (and so, I am not “maximizing” paperclips), would this count as a non-optimizer, for you?
Also maybe obvious:
“5. We can’t just build a very weak system”: Even if you succeed in building a non-optimizer, it still needs to be pretty freaking powerful. So using a technique that just makes the AI very weak wouldn’t solve the problem as I see it. (Though I’m not sure if that’s at all what you’re aiming at, as I don’t know the algorithms you talked about.)
Ah,
And I encourage you to apply for funding if you haven’t yet. For example here. Or if you can’t get funding, I’d encourage you to try talking to a grantmaker, who might have higher-quality feedback than me. I’m mostly saying things based on two YouTube videos and a conversation.
Something is wrong here, because I fit the description of an “AGI”, and yet I do not have a utility function. Within that theorem something is being smuggled in that is not necessary for general intelligence.
Agree. Something that clarified my thinking on this (I still feel pretty confused!) is Katja Grace’s counterarguments to the basic AI x-risk case. In particular, the section on “Different calls to ‘goal-directedness’ don’t necessarily mean the same concept” and the discussions about “pseudo-agents” clarified how there are other ways for agents to take actions than purely optimizing a utility function (which humans don’t do).
I mainly want to say I agree, this seems fishy to me too.
An answer I heard from an agent-foundations researcher, if I remember correctly (I complained about almost exactly the same thing): Humans do have a utility function, but they’re not perfectly approximating it.
I’d add: Specifically, humans have a “feature” of (sometimes) being willing to lose all their money (in expectation) in a casino, and other such things. I don’t think this is such a good safety feature (and also, if I had access to my own code, I’d edit that stuff away). But still this seems unsolved to me and maybe worth discussing more. (maybe MIRI people would just solve it in 5 seconds but not me)
It is interesting to think about the seeming contradiction here. Looking at the von Neumann-Morgenstern theorem you linked earlier, the specific theorem is about a rational agent choosing between several different options, and it says that if the agent’s preferences follow the axioms (no Dutch-booking etc.), you can build a utility function that describes those preferences.

First of all, humans are not rational, and can be Dutch-booked. But even if they were much more rational in their decision making, I don’t think the average person would suddenly switch into “tile the universe to fulfill a mathematical equation” mode (with the possible exception of some people in EA).
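A toy money-pump makes the Dutch-booking point concrete. The sketch below (my own illustration, not from the thread) gives an agent the cyclic preference A over B, B over C, C over A, and charges a small fee per “upgrade” trade; after any multiple of three trades the agent holds its original item and has only lost money:

```python
# Toy money pump for an agent with cyclic preferences A > B > C > A.
# Maps the currently held item to the item the agent prefers over it:
prefers_over = {"B": "A", "C": "B", "A": "C"}

def money_pump(start_item: str, fee: float, n_trades: int):
    """Repeatedly sell the agent its next-preferred item for a small fee."""
    item, money = start_item, 0.0
    for _ in range(n_trades):
        item = prefers_over[item]  # trade "up" to the preferred item...
        money -= fee               # ...paying a fee each time
    return item, money

item, money = money_pump("A", fee=0.01, n_trades=300)
print(item, round(money, 2))  # → A -3.0  (same item, 300 fees poorer)
```

An agent whose preferences satisfied the axioms would refuse this trade cycle, which is exactly the sense in which competence pushes toward coherence.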
Perhaps the problem is that the utility function describing an entity’s preferences doesn’t need to be constant. Perhaps today I choose to buy Pepsi over Coke because it’s cheaper, but next week I see a good ad for Coke and decide to pay the extra money for the good associations it brings. I don’t think the theorem says anything about that; it seems like the utility function just describes my current preferences, and says nothing about how my preferences change over time.
From a neuroscience/psychology perspective, I’d say that you are maximizing your future reward. And while that’s not a well-defined thing, it doesn’t matter; if you were highly competent, you’d make a lot of changes to the world according to what tickles you, and those might or might not be good for others, depending on your preferences (reward function). The slight difference between turning the world into one well-defined thing and a bunch of things you like isn’t that important to anyone who doesn’t like what you like.
This is a broader and more intuitive form of the argument Miles is trying to make precise.
If you can be dutch-booked without limit, well, you’re just not competent enough to be a threat; but you’re not going to let that happen, let alone a superintelligent version of you.
I agree.
Except for one detail: humans who hold preferences that don’t comply with the axioms cannot necessarily be “Dutch-booked” for real. That would require them not only to hold certain preferences but also to always act on those preferences like an automaton; see this nice summary discussion: https://plato.stanford.edu/entries/dutch-book/
“Humans do have a utility function”? I would say that depends on what one means by “have”.
Does it mean that the value of a human’s life can in principle be measured, only that measure might not be known to the human? Then I would not be convinced – what would the evidence for this claim be?

Or does it mean that humans are imperfect maximizers of some imperfectly encoded state-action-valuation function that is somehow internally stored in their brains and might have been inherited and/or learned? Then I would also not be convinced as long as one cannot point to evidence that such an evaluation function is actually encoded somewhere in the brain.

Or does it simply mean that the observable behavior of a human can be interpreted as (imperfectly) maximizing some utility function? This would be the classical “as if” argument that economists use to defend modeling humans as rational agents despite all the evidence from psychology.
It means humans are highly imperfect maximizers of some imperfectly defined and ever-changing thing: your estimated future rewards according to your current reward function.
It doesn’t matter that you’re not exactly maximizing one certain thing; you’re working toward some set of things, and if you’re really good at that, it’s really bad for anyone who doesn’t like that set of things.
Optimization/maximization is a red herring. Highly competent agents with goals different from yours are the core problem.
Dear Seth,
If Yonatan meant it the way you interpret it, I would still respond: Where is the evidence that such a reward function exists and guides humans’ behavior? I spoke to several high-ranking scientists from psychology and social psychology who very much doubt this. I suspect that the theory of humans aiming to maximize reward functions might be a non-testable one, and in that sense “non-scientific” – you might believe in it or not. It helps explain some stuff, but it is also misleading in other respects. I choose not to believe it until I see evidence.
I also don’t agree that optimization is a red herring. It is a true issue, just not the only one, and maybe not the most severe one (if one believes one can separate out the relative severity of several interlinked issues, which I don’t). I do agree that powerful agents are another big issue, whether competent or not. But powerful, competent, and optimizing agents are certainly the most scary kind :-)
Mismatched goals is the problem. The logic of instrumental convergence applies to any goal, not just maximization goals.
Dear Seth, thank you again for your opinion. I agree that many instrumental goals such as power would be helpful also for final goals that are not of the type “maximize this or that”. But I have yet to see a formal argument showing that they would actually emerge in a non-maximizing agent just as likely as in a maximizer.
Regarding your other claim, I cannot agree that “mismatched goals is the problem”. First of all, why do you think there is just a single problem, “the” problem? And then, is it helpful to consider something a “problem” that is an unchangeable fact of life? As long as there is more than one human who is potentially affected by an AI system’s actions, and these humans’ goals are not matched with each other (which they usually aren’t), no AI system can have goals matched to all humans affected by it. Unless you want to claim that “having matched goals” is not a transitive relation. So I am quite convinced that the fact that AI systems will have mismatched goals is not a problem we can solve but a fact we have to deal with.
I agree with you that humans have mismatched goals among ourselves, so some amount of goal mismatch is just a fact we have to deal with. I think the ideal is that we get an AGI that makes its goal the overlap in human goals; see [Empowerment is (almost) All We Need](https://www.lesswrong.com/posts/JPHeENwRyXn9YFmXc/empowerment-is-almost-all-we-need) and others on preference maximization.
I also agree with your intuition that having a non-maximizer improves the odds of an AGI not seeking power or doing other dangerous things. But I think we need to go far beyond the intuition; we don’t want to play odds with the future of humanity. To that end, I have more thoughts on where this will and won’t happen.
I’m saying “the problem” with optimization is actually mismatched goals, not optimization/maximization. In more depth, and hopefully more usefully: I think unbounded goals are the problem with optimization (not the only problem, but a very big one).
If an AGI had a bounded goal like “make one billion paperclips”, it wouldn’t be nearly as dangerous; it might decide to eliminate humanity to make the odds of getting to a billion as good as possible (I can’t remember where I saw this important point; I think maybe Nate Soares made it). But it might decide that its best odds would just be making some improvements to the paperclip business, in which case it wouldn’t cause problems.
So we’re converging...
One final comment on your argument about odds: In our algorithms, specifying an allowable aspiration includes specifying a desired probability of success that is sufficiently below 100%. This is exactly to avoid the problem of fulfilling the aspiration becoming an optimization problem through the backdoor.
Hey Yonatan,
First, excuse me for originally misspelling your name; I have fixed it now.
Thank you for your encouragement with funding. As it happens, we did apply for funding from several sources and are waiting for their response.
Regarding Rob Miles’ videos on satisficing:
One potential misunderstanding relates to the probability with which the agent is required to reach a certain goal. If I understand him correctly, he assumes satisficing must imply maximizing the probability that some constraint is met, which would still constitute a form of optimization (namely of that probability). This is why our approach is different: in a Markov Decision Process, the client would for example specify a feasibility interval for the expected value of the return (= the long-term discounted sum of rewards according to some reward function that we explicitly do not assume to be a proper measure of utility), and the learning algorithm would seek a policy that makes the expected return fall anywhere into this interval.
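As a rough sketch of this idea (my own construction; the names and the Monte Carlo setup are illustrative, not the authors’ actual algorithm), a learner could accept any candidate policy whose estimated expected discounted return lands inside the client’s feasibility interval, rather than returning the return-maximizing one:

```python
# Hedged sketch of aspiration-based policy selection: accept any
# candidate policy whose estimated expected return falls inside a
# client-specified feasibility interval [low, high], instead of
# searching for the maximum.

def estimate_expected_return(policy, env_step, n_episodes=100, horizon=20, gamma=0.95):
    """Monte Carlo estimate of a policy's expected discounted return."""
    total = 0.0
    for _ in range(n_episodes):
        state, ret, discount = 0, 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)
            state, reward = env_step(state, action)
            ret += discount * reward
            discount *= gamma
        total += ret
    return total / n_episodes

def select_aspiration_policy(candidates, env_step, low, high):
    """Return the first candidate whose expected return lies in [low, high]."""
    for policy in candidates:
        if low <= estimate_expected_return(policy, env_step) <= high:
            return policy
    return None  # no candidate meets the aspiration

# Toy deterministic environment: the reward simply equals the action.
step = lambda state, action: (state, action)
policies = [lambda s: 0.0, lambda s: 0.5, lambda s: 1.0]
chosen = select_aspiration_policy(policies, step, low=5.0, high=8.0)
print(chosen(0))  # → 0.5: the moderate policy (return ≈ 6.4), not the maximizer
```

Nothing in `select_aspiration_policy` ranks the accepted policy against the others; it merely checks membership in the interval, which is the distinction being drawn from probability-maximizing satisficing.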
The question of whether an agent somehow necessarily must optimize something is a little philosophical in my view. Of course, given an agent’s behavior, one can always find some function that is maximal for the given behavior. This is a mathematical triviality. But this is not the problem we need to address here. The problem we need to address is that the behavior of the agent might get chosen by the agent or its learning algorithm by maximizing some objective function.
It is all about a paradigm shift: In my view, AI systems should be made to achieve reasonable goals that are well-specified w.r.t. one or more proxy metrics, not to maximize whatever metric. What would be the reasonable goal for your modified paperclip maximizer?
Regarding “weakness”:
Non-maximizing does not imply weak, let alone “very weak”. I’m not suggesting we build a very weak system at all. In fact, maximizing an imperfect proxy metric will tend to score poorly on the real utility. Or, to turn this around: the maximum of the actual utility function is better achieved by a policy that does not maximize the proxy metric. We will study this in example environments and report results later this year.
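A toy illustration of why maximizing an imperfect proxy can score poorly on the real utility (my own construction, not the experiments announced above): the proxy agrees with the true utility for moderate actions but keeps rewarding more extreme ones:

```python
# Toy Goodhart-style example: the proxy tracks true utility for small
# actions but diverges from it when pushed toward its own maximum.

def true_utility(x):
    # True utility peaks at a moderate x (here x = 1.0) and falls off beyond it
    return x - 0.5 * x * x

def proxy_metric(x):
    # Imperfect proxy: roughly agrees with utility for small x,
    # but keeps rewarding ever-larger x
    return x

candidates = [i / 10 for i in range(31)]          # actions x in [0, 3]
proxy_best = max(candidates, key=proxy_metric)    # x = 3.0
utility_best = max(candidates, key=true_utility)  # x = 1.0

print(true_utility(proxy_best), true_utility(utility_best))  # → -1.5 0.5
```

The proxy-maximizing action scores far worse on the true utility than the moderate action, while a policy that merely aims the proxy into a sensible range (rather than maximizing it) can stay near the true optimum.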
> long-term discounted sum of rewards according to some reward function that we explicitly do not assume to be a proper measure of utility

Isn’t this equivalent to building an agent (agent-2) that DID have that as its utility function?
Ah, you wrote:

> The problem we need to address is that the behavior of the agent might get chosen by the agent or its learning algorithm by maximizing some objective function.

I don’t understand this, and it seems core to what you’re saying. Could you maybe say it in other words?
When I said “actual utility” I meant that which we cannot properly formalize (human welfare and other values) and hence not teach (or otherwise “give” to) the agent, so no, the agent does not “have” (or otherwise know) this as their utility function in any relevant way.
In my use of the term “maximization”, it refers to an act, process, or activity (as indicated by the ending “-ation”) that actively seeks to find the maximum of some given function. First there is the function to be maximized, then comes the maximization, and finally one knows the maximum and where the maximum is (argmax).
On the other hand, one might object the following: if we are given a deterministic program P that takes input x and returns output y = P(x), we can of course always construct a mathematical function f that takes a pair (x, y) and returns some number r = f(x, y) such that for each possible input x we have P(x) = argmax_y f(x, y). A trivial choice for such a function is f(x, y) = 1 if y = P(x) and f(x, y) = 0 otherwise. Notice, however, that here the program P is given first, and then we construct a specific function f for this equivalence to hold.
In other words, any deterministic program P is functionally equivalent to another program P’ that takes some input x, maximizes the function f(x, y) over y, and returns the location y of that maximum. But being functionally equivalent to a maximizer is not the same as being a maximizer.
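This triviality is easy to render in code (a minimal sketch; P and f are arbitrary examples): P performs no search at all, yet the genuine maximizer P_prime built from the indicator function f reproduces it exactly:

```python
# The construction from the text: given any deterministic program P,
# define f(x, y) = 1 if y == P(x) else 0; then P(x) is an argmax of
# f(x, .), even though P itself performs no optimization.

def P(x):
    # An arbitrary program that clearly isn't maximizing anything
    return (x * 3 + 1) % 7

def f(x, y):
    return 1 if y == P(x) else 0

def P_prime(x, candidates=range(7)):
    # A genuine maximizer: searches the candidates for argmax_y f(x, y)
    return max(candidates, key=lambda y: f(x, y))

# P and P_prime are functionally equivalent...
assert all(P(x) == P_prime(x) for x in range(20))
# ...but only P_prime's outputs are produced by an optimization step.
```

The order of construction is the whole point: P came first, and f was reverse-engineered from it, so the equivalence tells us nothing about how P’s behavior was actually chosen.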
In the learning agent context: If I give you a learned policy pi that takes a state s and returns an action a = pi(s) (or a distribution of actions), then you might well be able to construct a reward function g that takes a state-action pair (s, a) and returns a reward (or expected reward) r = g(s, a) such that, when I then calculate the corresponding optimal state-action-quality function Q* of this reward function, it turns out that for all states s we have pi(s) = argmax_a Q*(s, a). This means that the policy pi is the same policy as the one that a learning process would have produced that searches for the policy maximizing the long-term discounted sum of rewards according to reward function g. But it does not mean that the policy pi was actually determined by such a possible optimization procedure: the learning process that produced pi can very well be of a completely different kind than an optimization procedure.