In outer alignment one can write down a correspondence between ML training schemes that learn from human feedback and complexity classes related to interactive proof schemes. If we model the human as a (choosable) polynomial time algorithm, then
1. Debate and amplification get to PSPACE, and more generally n-step debate gets to Σ_n^P.
2. Cross-examination gets to NEXP.
3. If one allows opaque pointers, there are schemes that go further: market making gets to R.
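As a toy illustration of why debate reaches these classes (my own sketch, not part of the original correspondence): a debate over a quantified boolean formula, where each round one debater commits to a bit and the polynomial time judge only checks the final assignment. With optimal play, the proponent wins iff the formula is true, so n rounds of alternation mirror the n quantifier alternations of Σ_n^P, and unbounded rounds reach PSPACE via TQBF.

```python
def debate(quantifiers, formula, assignment=()):
    """Each round, the debater whose turn it is picks one bit.
    'exists' rounds are moves by the proponent, 'forall' rounds by the
    opponent.  The polynomial-time 'judge' only evaluates the completed
    assignment at the end; optimal play decides the quantified formula."""
    if not quantifiers:
        return formula(assignment)  # judge's final check
    q, rest = quantifiers[0], quantifiers[1:]
    outcomes = [debate(rest, formula, assignment + (b,)) for b in (False, True)]
    # Proponent picks the best move; opponent must survive all moves.
    return any(outcomes) if q == "exists" else all(outcomes)

# Example: ∃x ∀y (x ∨ y) is true (the proponent plays x = True).
print(debate(["exists", "forall"], lambda a: a[0] or a[1]))  # True
```

This is a complexity-theoretic caricature, of course: the real schemes replace exhaustive minimax with trained players and the formula check with human judgment.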
Moreover, we informally have constraints on which schemes are practical based on properties of their complexity class analogues. In particular, interactive proof schemes are only interesting if they relativize: we also have IP = PSPACE, and thus a single prover gets to PSPACE given an arbitrary polynomial time verifier, but w.r.t. a typical oracle O we have IP^O < PSPACE^O. My sense is that there are further obstacles to be found: my intuition is that “market making = R” isn’t the right theorem once obstacles are taken into account, but I don’t have a formalized model of this intuition.
The reason this type of intuition is useful is that humans are unreliable, and schemes that reach higher complexity class analogues should (all else equal) give humans more help in noticing problems with ML systems.
I think there’s quite a bit of useful work that can be done pushing this type of reasoning further, but (full disclosure) it isn’t of the “solve a fully formalized problem” sort. Two examples:
1. As mentioned above, I find “market making = R” unlikely to be the right result. But this doesn’t mean that market making isn’t an interesting scheme: there are connections between market making and Paul Christiano’s learning the prior scheme. As previously formalized, market making misses a practical limitation on the available human data (the n-way assumption in that link), so there may be work to do in reformalizing it into a more limited, and more useful, complexity class.
2. Two-player debate is only one of many possible schemes that use self-play to train systems; in particular, one could try to shift to n-player schemes in order to reduce playing-for-variance strategies, where a player who is behind goes for risky lies in order to possibly win. But the “polynomial time judge” model can’t capture this situation, as there is no variance when trying to convince a deterministic algorithm. As a result, we need a more precise formalization that can pick up the difference between self-play schemes that are more or less robust to human error, possibly related to CRMDPs.
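To make the playing-for-variance point concrete, here is a toy model of my own (the specific numbers and setup are illustrative assumptions, not from any formalized scheme): a debater who is behind loses for certain by arguing honestly, while a lie wins exactly when a noisy judge fails to catch it. With a deterministic judge the lie's win probability is zero, which is why the “polynomial time judge” model cannot see the incentive.

```python
import random

def judged_win(lie, judge_error):
    """The debater is behind: arguing honestly loses for sure.
    A lie wins only if the noisy judge fails to catch it."""
    if not lie:
        return False                      # honest play: certain loss
    return random.random() < judge_error  # lie slips past with prob judge_error

def win_rate(lie, judge_error, trials=100_000):
    """Monte Carlo estimate of the behind player's win probability."""
    return sum(judged_win(lie, judge_error) for _ in range(trials)) / trials

random.seed(0)
print(win_rate(lie=False, judge_error=0.1))  # 0.0  -- honesty can't recover
print(win_rate(lie=True,  judge_error=0.1))  # ~0.1 -- the variance play pays
print(win_rate(lie=True,  judge_error=0.0))  # 0.0  -- deterministic judge: no incentive
```

The hope behind n-player variants is to change the payoff structure so that this gamble stops being worthwhile even when judge_error is positive; a useful formalization would have to keep the noise term rather than idealize it away.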