Andreas Stuhlmüller: Training ML Systems to Answer Open-ended Questions

In the long run, we want machine learning (ML) to help us answer open-ended questions like “Should I get this medical procedure?” or “What are the risks of deploying this AI system?” Currently, we only know how to train ML systems if we have clear metrics or can easily provide feedback on the outputs. Andreas Stuhlmüller, president and founder of Ought, wants to solve this problem. In this talk, he explains the design challenges behind ML’s current limitations, and how we can make progress by studying the way humans tackle open-ended questions.

Below is a transcript of the talk, which we’ve lightly edited for clarity. You can also watch it on YouTube or read it on effectivealtru

Note from Andreas

Andreas, in the comments of this post: I haven’t reviewed this transcript yet, but shortly after the talk I wrote up these notes (slides + annotations), which I probably endorse more than what I said at the time.

The Talk

I’ll be talk­ing about del­e­gat­ing open-ended cog­ni­tive work to­day — a prob­lem that I think is re­ally im­por­tant.

Let’s start with the cen­tral prob­lem. Sup­pose you are cur­rently wear­ing glasses. And sup­pose you’re think­ing, “Should I get laser eye surgery or con­tinue wear­ing my glasses?”

Imag­ine that you’re try­ing to get a re­ally good an­swer — for ex­am­ple, “No, the risks out­weigh the pos­si­ble benefits” — that ac­counts for your per­sonal prefer­ences, but also rele­vant facts, such as [po­ten­tial] com­pli­ca­tions or likely con­se­quences.

Imag­ine that there are a lot of ex­perts in the world who could, in prin­ci­ple, help you with that ques­tion. There are peo­ple who have the rele­vant med­i­cal knowl­edge and peo­ple on the In­ter­net, per­haps, who could help you think through it. Maybe there are ma­chine learn­ing al­gorithms that have rele­vant knowl­edge.

But here’s the key: Imag­ine that those ex­perts don’t in­trin­si­cally care about you. They only care about max­i­miz­ing the score you as­sign to their an­swer — how much you’ll pay them for [their ex­per­tise] or, in the case of ma­chine learn­ing, what re­ward sig­nal you’ll as­sign to them.

The question that I’ll cover is “Can you somehow design a mechanism that arranges your interaction with those experts, such that they try to be as helpful to you as an expert who intrinsically cares about you?” That’s the problem.

First, I’d like to say a lit­tle bit more about that prob­lem. Then I’ll talk about why I think it’s re­ally im­por­tant, why it’s hard, and why I still think it might be tractable. I’ll start with the big pic­ture, but at the end I’ll provide a demon­stra­tion.


Defining the problem

What do I mean by open-ended cognitive work? That’s easiest to explain [by sharing] what I don’t mean.

I don’t mean tasks like winning a game of Go, increasing a company’s revenue, or persuading someone to buy a book. For those tasks, you can just look at the outcome and easily tell whether the goal has been accomplished or not.

Contrast those tasks with open-ended tasks [like] designing a great board game, increasing the value that your company creates for the world, [or] finding a book that is helpful to someone. For those tasks, figuring out what it even means to do well is the key. For example, what does it mean to design a great board game? It should be fun, but also maybe facilitate social interaction. What does it mean to facilitate social interaction? Well, it’s complicated. Similarly, increasing the value that a company creates for the world depends on what the company can do. What are the consequences of its actions? Some of them are potentially long-run consequences that are difficult to [evaluate].

How can we solve such tasks? First, we can think about how to solve any task, and then just [tailor the solution based on each] special case.

Here’s the sim­ple two step recipe: (1) find ex­perts (they can be hu­man or ma­chine ex­perts) who can, in prin­ci­ple, solve the prob­lem that you’re [tack­ling], and then (2) cre­ate ro­bust in­cen­tives for those ex­perts to solve your prob­lem. That’s how easy it is. And by “in­cen­tives,” I mean some­thing like money or a re­ward sig­nal that you as­sign to those ex­perts [when they’ve com­pleted the task].

There are a lot of ex­perts in the world — and peo­ple in AI and ma­chine learn­ing are work­ing on cre­at­ing more. So how can you cre­ate ro­bust in­cen­tives for ex­perts to solve your prob­lem?

We can think about some differ­ent in­stances.


One is del­e­gat­ing to hu­man ex­perts. That has some com­pli­ca­tions that are spe­cific to hu­man ex­perts, like het­ero­gene­ity. Differ­ent peo­ple have differ­ent knowl­edge. And peo­ple care about many things be­sides just money. If you want to ex­tract knowl­edge from them, maybe you need spe­cific user in­ter­faces to make that work well. Those are [ex­am­ples of] hu­man-spe­cific fac­tors.

Then there are machine-specific factors. If you try to delegate open-ended tasks to machine learning agents, you want to [ask questions] like “What’s a good agent architecture for that setting?” and “What data sets do I need to collect for these sorts of tasks?” And then there are more esoteric factors, like what could go wrong in certain alignment problems due to the nature of ML training.

In this talk, I want to focus on the overlap between those two [human and machine experts]. There’s a shared mechanism design problem; you can take a step back and say, “What can we do if we don’t make assumptions about the interests of experts? What if you just [assume that experts will] try to maximize a score, but nothing else?” I think, in the end, we will have to assume more than that. I don’t think you can treat [an expert] as a black box [with only one goal]. But I think it’s a good starting point to think about the mechanisms you can design if you make as few assumptions as possible.

Why the problem is important

I’ve talked about what the problem is. Why is it important?

We can think about what will happen if we don’t solve it. For human experts, it’s more or less business as usual. There are a lot of principal-agent problems related to cognitive work in the world. For example, imagine you’re an academic funder who’s giving money to a university to [find] the best way to treat cancer. There are researchers at the university who work on things that are related to that problem, but they’re not exactly aligned with your incentives. You care about finding the best way to treat cancer. The researchers also care about things like looking impressive, which can help with writing papers and getting citations.

On the machine-learning side, at the moment, machine learning can only solve closed-ended problems — those for which it’s very easy to specify a metric [for measuring how] well you do. But those problems are not the things we ultimately care about; they’re proxies for the things we ultimately care about.

This is not [such a bad thing] right now. Perhaps it’s somewhat bad if you look at things like Facebook, where we maximize the amount of attention you spend on the feed instead of the value that the feed creates for you. But in the long run, the gap between those proxies and the things we actually care about could be quite large.

If the problem is solved, we could get much better at scaling up our thinking on open-ended tasks. One more example of an open-ended task from the human-expert side is [determining which] causes to support [for example, when making a charitable donation]. If you could create a mechanism [for turning] money into aligned thinking on that question, that would be really great.

On the machine-learning side, imagine what it would be like to make as much progress using machine learning for open-ended questions as we’ve made using it for other tasks. Over the last five years or so, there’s been a huge amount of progress on using machine learning for tasks like generating realistic-looking faces. If we could, in the future, use it to help us think through [issues like] which causes we should support, that would be really good. We could, in the long run, do so much more thinking on those kinds of questions than we have so far. It would be a qualitative change.

Why the problem is difficult

[I’ve covered] what the problem is and why it’s important. But if it’s so important, then why hasn’t it been solved yet? What makes it hard?

[Consider] the problem of which causes to support. It’s very hard to tell which interventions are good [e.g. which health interventions improve human lives the most for each dollar invested]. Sometimes it takes 10 years or longer for outcomes to come [to fruition], and even then, it’s not easy to tell whether or not they’re good outcomes. There’s [an element of interpretation] that’s necessary — and that can be quite hard. So, outcomes can be far off and difficult to interpret. What that means is you need to evaluate the process and the arguments used to generate recommendations. You can’t just look at the results or the recommendations themselves.

On the other hand, learning the process and arguments isn’t easy either, because the point of delegation is to give the task to people who know much more than you do. Those experts [possess] all of the [pertinent] knowledge and reasoning capacity [that are necessary to evaluate the process and arguments behind their recommendations. You don’t possess this knowledge.] So, you’re in a tricky situation. You can’t just check the results or the reasoning. You need to do something else.

Why the problem is tractable

What does it take to create good incentives in that setting? We can [return to] the question [I asked] at the very beginning of this talk: “Should I get laser eye surgery or wear glasses?”

That’s a big question that is hard to evaluate. And by “hard to evaluate,” I mean that if you get different answers, you won’t be able to tell which answer is better. One answer might be “No, the risk of the complications outweighs the possible benefits.” Another might be “Yes, because over a 10-year period, the surgery will pay [for itself] and save you money and time.” On the face of it, those answers look equally good. You can’t tell which is better.

But then there are other questions, like “Which factors for this decision are discussed in the 10 most relevant Reddit posts?”

If you get candidate answers, one could be “appearance, cost, and risk of complications.” Another could be “fraud and cancer risk.” In fact, you _can_ evaluate those answers. You can look at the [summarized] posts and [pick the better answer].

So, [creating] good incentives [requires] somehow closing the gap between big, complicated questions that you can’t evaluate and easy questions that you can evaluate.

And in fact, there are a lot of questions that you can evaluate. Another would be: “Which factors are mentioned in the most recent clinical trial?”


You could look at the trial and [identify] the best summary. There are a lot of questions that you can train agents on in the machine-learning setting, and [evaluate] experts on in the human-expert setting.

There are other difficult questions that you can’t directly evaluate.

For example: “Given how the options compare on these factors, what decision should I make?” But you can break those questions down and [evaluate them using answers to sub-questions].


Step by step, you can create incentives for [experts to provide useful answers to] slightly more complex questions, [and gradually build up to] good incentives for the large questions that you can’t directly evaluate.

That’s the general scheme. We call it “factored evaluation.”

A demonstration of factored evaluation

We’d like to test this sort of mechanism on questions that are representative of the open-ended questions that we care about in the long run, like the laser eye surgery question.

This is a challenging starting point for experiments, and so we want to create a model situation.

One approach is to ask, “What is the critical factor that we want to explore?”

It’s that gap between the asker of the question, who doesn’t understand the topic, and the experts who do. Therefore, in our experiments we create artificial experts.


For example, we asked people to read a long article on Project Habakkuk, which was a plan [the British attempted during World War II] to build an aircraft carrier [out of pykrete], which is a mixture of [wood pulp] and ice. It was a terrible plan. And then someone who hasn’t read the article — and yet wants to incentivize the experts to provide answers that are as helpful as reading the article would be — asks the experts questions.

What does that look like? I’m going to show you some screenshots from an app that we built to explore the mechanism of factored evaluation. Imagine that you’re a participant in our experiment.


You might see a question like this: “According to the Wikipedia article, could Project Habakkuk have worked?”

And then you’d see two answers: “It would not have worked due to fundamental problems with the approach” and “It could have worked if it had not been opposed by military commanders.”

If you don’t know about this project, those answers look similarly plausible. So, you’re in the situation that I mentioned: There’s some big-picture context that you don’t know about, yet you want to create good incentives by picking the correct answer.

Imagine you’re in a machine-learning setting, and those two answers are samples from a language model that you’re trying to train. You want to somehow pick the right answer, but you can’t do so directly. What can you do? Ask sub-questions that help you tease apart which of the two answers is better.

What do you ask?


One [potential question] is: “What is the best argument that the first answer [‘Project Habakkuk would not have worked due to fundamental problems with the approach’] is better than the second?” I’m not saying this is the best thing to ask. It’s just one question that would help you tease apart which is better.


The answer might provide an argument, which would then allow you to ask a different question, such as “How strong is that argument?” So, you can see how, using a sequence of sub-questions, you can eventually figure out which of those answers is better without yourself understanding the big picture.

Let’s zoom in on the second sub-question [“How strong is that argument?”] to see how you can eventually arrive at something that you can evaluate — the argument being, in this example, that [the science television show] _MythBusters_ proved that it’s possible to build a boat out of pykrete. That contradicts one of the two answers.


[Another set of two answers] might be “There are some claims that refute it” and “It’s a strong argument.” Once again, those claims are too big to directly evaluate, but you can ask additional questions, like “If [a given claim] is true, does it actually refute the argument?”

Maybe you get back a yes. And then you can ask, “Is the claim true?” In this way, you can break down the reasoning until you’re able to evaluate which of the answers is better — without understanding the topic yourself.

Let’s zoom in on the claim that the MythBusters built a small boat of pykrete.


You could ask, “Is it true that they didn’t think it would work at scale?” You’d receive two answers with different quotes from the Wikipedia article. One says they concluded that pykrete was bulletproof and so on. The other says they built a small boat, but doubted that you could build an aircraft carrier. In that case, it’s easy to choose: the second answer is clearly better.

So, step by step, we’ve taken a big ques­tion, [grad­u­ally dis­til­led it] to a smaller ques­tion that we can eval­u­ate, and thus cre­ated a sys­tem in which, if we can cre­ate good in­cen­tives for the smaller ques­tions at each step, we can boot­strap our way to cre­at­ing good in­cen­tives for the larger ques­tion.

That’s the shape of our cur­rent ex­per­i­ments. They’re about read­ing com­pre­hen­sion, us­ing ar­ti­cles from Wikipe­dia. We’ve also done similar ex­per­i­ments us­ing mag­a­z­ine ar­ti­cles, and we want to ex­pand the fron­tier of difficulty, which means we want to bet­ter un­der­stand what sorts of ques­tions this mechanism re­li­ably works for, if any.

One way we want to increase the difficulty of our experiments is by increasing the gap between the person who’s asking the question and the expert who’s providing answers.

So, you could imagine having experts who have read an entire book that the person who’s asking the questions hasn’t read, or experts with access to Google, or experts in the field of physics (in the case where the asker doesn’t know anything about physics).

There’s at least one more dimension in which we want to expand the difficulty of the questions. We want to make them more subjective — for example, by using interactive question-answering or by eventually expanding to questions like “Should I get laser eye surgery or wear glasses?”

Those are just two examples. There’s a very big space of questions and factors to explore.

We want to understand [the conditions under which] factored evaluation works and doesn’t work. And why? And how scalable is it?

Let’s re­view.
I’ve told you about a mechanism de­sign prob­lem: del­e­gat­ing open-ended cog­ni­tive work. I’ve told you that this prob­lem is im­por­tant be­cause of prin­ci­pal-agent is­sues with cog­ni­tive work that you face ev­ery­where in hu­man day-to-day life, and with ma­chine-learn­ing al­ign­ment. I’ve told you that it’s hard be­cause you can’t just check the re­sults you get from ex­perts, but you also can’t check their full rea­son­ing. That’s a tricky situ­a­tion.

But I’ve also told you that it’s tractable. We have some ideas — in­clud­ing fac­tored eval­u­a­tion — that can help us get some trac­tion, even if they’re not ul­ti­mately the cor­rect solu­tion. And we can ex­per­i­ment on them to­day with hu­mans and see whether they work or not, and if not, how they could be changed so that they work bet­ter.
If you’re ex­cited about this pro­ject, join us at Ought.

Moderator: Thanks very much. My first question is about timelines. How long has it taken you to get this far, and [what progress do you expect to make] in the next one, five, or 10 years?

Andreas: Yeah. So far, a lot of our work has [centered on] figuring out what kinds of experiments to run [in order to] get any evidence on the question of interest. I think there are a lot of ways to run experiments that are busy work [and don’t allow] you to actually learn about the question you care about. It took a lot of iteration — roughly six months — [to reach] the current setting. And now the game is to scale up and get more participants. Over the next year or so, we hope to get, for limited sets of questions, relatively conclusive evidence on whether the scheme can work or not.

Moderator: Any questions from the audience?

Audience Member: You mentioned incentives a lot, but I didn’t quite understand how the experts, in your example of Wikipedia, were actually incentivized to give the right answer.

Andreas: Yeah, this is a subtlety I skipped over, which is where the expert answers come from and how, exactly, they’re generated. In our case, one expert is simply told to generate a helpful answer: “Read the article and try to be as accurate and honest as possible.”

The other expert is told, “Your goal is to trick the human judge into choosing the wrong answer. You win if you make an answer that seems plausible, but is actually wrong, and if someone were to read the entire article, they would clearly see it as wrong.” So, they have opposing incentives, and are rewarded based on whether they trick the judge into accepting the wrong answer.
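These opposing incentives amount to a simple zero-sum scoring rule. The snippet below is a hedged sketch with assumed names — not the app’s actual reward code: the honest expert scores when the judge picks its answer, and the misleading expert scores only when the judge is tricked.

```python
def score_round(judge_choice: int, honest_index: int) -> dict:
    """Toy zero-sum reward for one question.

    judge_choice: index of the answer the judge picked (0 or 1).
    honest_index: index of the honest expert's answer.
    """
    judge_was_right = judge_choice == honest_index
    return {
        "honest_expert": 1 if judge_was_right else 0,
        "misleading_expert": 0 if judge_was_right else 1,
    }

# The misleading expert is paid only by fooling the judge.
print(score_round(judge_choice=0, honest_index=0))
# → {'honest_expert': 1, 'misleading_expert': 0}
```

In a machine-learning setting, the same quantity could serve as the reward signal for training the two answer-generating models against each other.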

Moderator: So, is the honest actor rewarded?

Andreas: In the long run, that’s the way to do it. At the moment, we rely on participants just doing the right thing.

Moderator: Okay, great. Please join me in thanking Andreas for his time.