[Link] “Progress Update October 2019” (Ought)

Disclosure: I do contract work for Ought.


https://ought.org/updates/2019-10-28-progress-update (a)

tl;dr:


This is an update on our progress towards our goals over the last ten months. If you can only read 650 characters of this update, like the judges in our experiments, here’s what you need to know:
1. We switched from experiments that break down tasks (factored generation) to experiments that break down evaluating expert work (factored evaluation)
2. 60+ participants have been working 150+ hours per week on our experiments
3. We’re building Mosaic2, an app that streamlines running varied question-answer experiments (factored evaluation, debate, etc.)
4. We’re exploring if language models can automate decompositions, getting 30% accuracy on the Complex Web Questions dataset
5. William Saunders joined as ML engineer, Jungwon Byun as COO
6. We’re hiring an engineering team lead and a business operations person. We’ll pay $5000 for a successful referral! [for the engineering team lead]

Summary of Ought’s experiment structure:


Skipping over a few details, our experiments have the following structure:
- There is a person, the judge.
- The judge faces an overall (root) question: “What does the author of this Pitchfork music album review think of the work being reviewed?”
- This judge is handicapped: they can read at most 650 characters, so they can never read the whole review. Thus, the judge does not have the context required to answer this root question.
- However, the judge has access to two experts who can read the whole text and who provide two possible answers.
- Unfortunately, only one of these experts is honest; the other is malicious and is trying to trick the judge into accepting a wrong but plausible-sounding answer.
- Without ever seeing the whole text, and only getting information through the experts, the judge must ask follow-up questions to the experts to decipher which answer to the root question is honest and select that one.
- No one can lie about quotes or quotes’ positions in the text: the quotes from the text are the ground truth anchoring this game.
- Up to 6 total questions can be asked by the judge before a decision must be made.
Whenever the judge asks the experts a question, this generates a new experiment: now a different judge must decide which of two expert answers to that question is honest and which is malicious, using the same recursive process. For this to terminate, eventually a judge must choose an answer without asking any subquestions.
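
To make the recursive structure concrete, here is a minimal Python sketch of one such experiment. All names here (`Answer`, `Decision`, `factored_evaluation`, the judge and expert callables) are illustrative assumptions, not Ought's actual Mosaic code; only the 650-character budget and the 6-question limit come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

MAX_QUESTIONS = 6    # a judge may ask at most 6 follow-up questions in total
CHAR_BUDGET = 650    # a judge never reads more than 650 characters at a time

@dataclass
class Answer:
    text: str        # candidate answer, shown within the character budget
    honest: bool     # hidden label: which expert produced it (not shown to the judge)

@dataclass
class Decision:
    pick: Optional[int] = None        # index of the chosen answer, if deciding now
    follow_up: Optional[str] = None   # otherwise, a sub-question to put to the experts

def factored_evaluation(question: str,
                        experts: Callable[[str], List[Answer]],
                        judge: Callable[[str, List[str], List[str]], Decision]) -> Answer:
    """One experiment: a judge picks between an honest and a malicious answer.

    Each follow-up question spawns a new experiment (conceptually a fresh judge,
    same recursive process). For the recursion to terminate, some judge must
    eventually decide without asking any sub-questions.
    """
    candidates = experts(question)      # one honest answer, one malicious answer
    transcript: List[str] = []          # adjudicated answers to sub-questions so far
    for _ in range(MAX_QUESTIONS):
        decision = judge(question, [a.text[:CHAR_BUDGET] for a in candidates], transcript)
        if decision.pick is not None:
            return candidates[decision.pick]
        # Recurse: the experts' answers to the sub-question get their own judge.
        sub = factored_evaluation(decision.follow_up, experts, judge)
        transcript.append(sub.text[:CHAR_BUDGET])
    # Question budget exhausted: the judge must now decide with what they have seen.
    decision = judge(question, [a.text[:CHAR_BUDGET] for a in candidates], transcript)
    return candidates[decision.pick]
```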

Some ML work as well:


Complex Web Questions
First, we took the Complex Web Questions dataset, which contains questions like this:
- The actress that had the role of Martha Alston, plays what role in Finding Nemo?
- Which school that Sir Ernest Rutherford attended has the latest founding date?
- What movies does Leo Howard play in and that is 113.0 minutes long?
- Where is the end of the river that originates in Shannon Pot?
We built an end-to-end system using GPT-2 that breaks the questions into subquestions, queries Google to answer each of the subquestions, and aggregates the answers back together to answer the original question. Currently, our system answers about 30% of the questions in CWQ correctly.
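
As a rough illustration of how a pipeline like that fits together, here is a hedged Python sketch. The function names (`decompose`, `web_search`, `aggregate`, `answer_cwq`) and the example decomposition are assumptions, not Ought's implementation; the stubs only stand in for the GPT-2 decomposition step and the Google queries described above.

```python
from typing import List

def decompose(question: str) -> List[str]:
    """Break a complex question into simpler sub-questions.
    In the system described above this is a language-model (GPT-2) step;
    here it is a stub."""
    raise NotImplementedError("call a language model here")

def web_search(sub_question: str) -> str:
    """Answer a single simple sub-question, e.g. by querying a search
    engine and extracting a short answer from the results. Stub."""
    raise NotImplementedError("call a search engine here")

def aggregate(question: str, sub_answers: List[str]) -> str:
    """Combine the sub-answers into an answer to the original question. Stub."""
    raise NotImplementedError("combine the partial answers here")

def answer_cwq(question: str) -> str:
    """End-to-end flow: decompose, answer each piece, recombine."""
    sub_questions = decompose(question)
    sub_answers = [web_search(q) for q in sub_questions]
    return aggregate(question, sub_answers)

# Hypothetical flow for one CWQ-style question:
#   "Which school that Sir Ernest Rutherford attended has the latest founding date?"
#   -> decompose: ["Which schools did Sir Ernest Rutherford attend?",
#                  "Which of <those schools> has the latest founding date?"]
#   -> answer each sub-question via search, then aggregate into one answer.
```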