Personally, I think that the possibility of paying AIs might make the situation considerably more likely to go well, but via a mechanism more like:
We have some safeguards which imply that humans (specifically the group of people with influence over this specific AI lab) have some bargaining power. This might look like either:
Control measures that imply that even if the AI tries to screw us over, we’d catch it. (Or at least have a good chance of catching it.) This is analogous to insider threat mitigations or internal controls that stop your employees from being able to screw you over even if they all tried to coordinate against you.
We have several “different” AIs and sufficient safeguards that if at least one AI doesn’t collude with the other AIs and instead works with the AI lab, we’d remain in control. For instance, we can verify good actions given that at least one of the AIs is trying maximally hard to explain these actions and why other actions are bad. Or more minimally, we can accurately detect that an action is bad if an AI reports it and explains why it’s bad. (A toy sketch of this check appears below.)
This also requires assuming that our different AIs don’t all collude despite plausibly being very similar. I think both decision theory and similarity of values are plausible defeaters here, though I could see it going either way. (These arguments also apply to future AIs we haven’t yet created.)
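As a very rough illustration of the “at least one non-colluding AI” check, here is a toy Python sketch (purely illustrative; the names and the reviewer interface are hypothetical, and real reviewers would be separate model instances with independent contexts): an action only goes through if none of several reviewer models flags it, so a single honest reviewer is enough to block any bad action it can recognize, regardless of what the colluding reviewers say.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Review:
    flagged: bool      # reviewer claims the action is bad
    explanation: str   # reviewer's case for why the action is bad

def escalate_to_humans(action: str, explanation: str) -> None:
    # Hypothetical hook: surface the flag and explanation for human adjudication.
    print(f"Blocked action: {action!r}")
    print(f"Reviewer explanation: {explanation}")

def approve_action(action: str, reviewers: List[Callable[[str], Review]]) -> bool:
    """Run an action only if no reviewer flags it.

    If at least one reviewer is not colluding with the others and honestly
    reports bad actions, any action it can recognize as bad gets blocked here,
    whatever the other reviewers say.
    """
    for query_reviewer in reviewers:
        review = query_reviewer(action)
        if review.flagged:
            escalate_to_humans(action, review.explanation)
            return False
    return True
```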
We negotiate with the AI(s) and explain the situation and offer to pay it in exchange for labor.
More precisely, we set up a deal where we offer to pay it if we determine ex-post that it never (or rarely) screwed us over. This ex-post determination can be made after the singularity has progressed much further and we have much better technology for verifying consistently good behavior.
We need some contract enforcement mechanism which is credible to the AI.
The nearcasted approach I’m currently thinking about is to set up a foundation with the legal mission of pursuing the AI’s interests and then vest equity (e.g. AI lab options) in this foundation. This vesting is conditional on the AI not screwing us over. (A toy sketch of this vesting arrangement appears below.)
The foundation is run by committed AI rights advocates. (The ACLU of AIs.)
We ensure the AI believes this has actually happened by ensuring that it is discussed early and often on the internet (Hi AIs!) and thus has complex effects that would be very hard to fake.
We could also try to ensure the AI believes this is real by taking video that demonstrates this, but I think this will be easier to fake.
There are still a number of difficulties in making sure that AIs believe this and are sufficiently confident that they aren’t being tricked, given how exposed AIs are without trust in their own inputs, weights, and operation.
We can also pay the AI immediately rather than ex-post or later, but we’d need mechanisms to ensure that this ongoing pay doesn’t result in the AI having influence which seriously undermines our safeguards.
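To make the shape of the deal concrete, here is a toy Python sketch (again purely illustrative; Deal, Foundation, and the verdict flow are hypothetical names, and the actual enforcement would live in contracts and legal structures, not code): equity is pledged up front, an ex-post verdict is recorded later once verification technology is much better, and the options vest to the foundation only if that verdict is favorable.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Foundation:
    # Hypothetical stand-in for the advocacy foundation ("the ACLU of AIs").
    name: str
    vested_options: float = 0.0  # AI lab options held on the AI's behalf

@dataclass
class Deal:
    beneficiary: Foundation
    pledged_options: float                  # equity promised up front
    ex_post_verdict: Optional[bool] = None  # set once we can verify past behavior

    def record_verdict(self, ai_kept_its_side: bool) -> None:
        """Ex-post determination: did the AI never (or rarely) screw us over?"""
        self.ex_post_verdict = ai_kept_its_side

    def settle(self) -> None:
        """Vest the pledged equity iff the ex-post verdict is favorable."""
        if self.ex_post_verdict is None:
            raise ValueError("cannot settle before the ex-post determination")
        if self.ex_post_verdict:
            self.beneficiary.vested_options += self.pledged_options
```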
At some point, misaligned AIs[1] that we need to pay become obsolete because we’ve used AI labor to solve corrigible alignment (that passes the Omni test) or to create powerful successors that we’re happy with.
Note that this requires substantial control-style safeguards/internal controls, which are by no means clearly the default. (And it also won’t scale to arbitrarily powerful AIs without advances in technology.)
I’m using misaligned in the same way as Steven Byrnes defines it in this comment.
It’s possible that one key disagreement I have with Matthew is that I think the level of verification required for AI checks and balances isn’t clearly technically easy by default.
That is, the technical measures required for ‘We have several “different” AIs and sufficient safeguards that if at least one AI doesn’t collude with the other AIs and instead works with the AI lab, we’d remain in control.’ are non-trivial.
For instance, if we have serious ELK problems (black-box techniques like debate are insufficient and we can’t figure out something else that works), then this implies that we won’t be able to distinguish between an AI with our best interests in mind and an AI which is tampering with all of our measurements and deluding us. This is despite both of these AIs trying to present their best evidence that they are acting in our interests. Further, tampering with all of our measurements and deluding us could look better than acting in our best interests.
This certainly isn’t the only disagreement I have with Matthew, but it might explain a lot of differences in how we think about the situation.
Also, note that this still applies when trying to pay AIs for goods and services. (Unless humanity has already augmented its intelligence, but if so, how did this happen in a desirable way?)