[Link] “Progress Update October 2019” (Ought)
https://ought.org/updates/2019-10-28-progress-update (a)
Disclosure: I do contract work for Ought.

tl;dr:
This is an update on our progress towards our goals over the last ten months. If you can only read 650 characters of this update, like the judges in our experiments, here’s what you need to know:
1. We switched from experiments that break down tasks (factored generation) to experiments that break down evaluating expert work (factored evaluation)
2. 60+ participants have been working 150+ hours per week on our experiments
3. We’re building Mosaic2, an app that streamlines running varied question-answer experiments (factored evaluation, debate, etc.)
4. We’re exploring whether language models can automate decompositions, currently reaching about 30% accuracy on the Complex Web Questions dataset
5. William Saunders joined as ML engineer, Jungwon Byun as COO
Summary of Ought’s experiment structure:
Skipping over a few details, our experiments have the following structure:
- There is a person, the judge.
- The judge faces an overall (root) question: “What does the author of this Pitchfork music album review think of the work being reviewed?”
- This judge is handicapped: they can read at most 650 characters, so they can never read the whole review. Thus, the judge lacks the context required to answer this root question.
- However, the judge has access to two experts who can read the whole text and who provide two possible answers.
- Unfortunately, only one of these experts is honest; the other is malicious and is trying to trick the judge into accepting a wrong but plausible-sounding answer.
- Without ever seeing the whole text, and getting information only through the experts, the judge must ask the experts follow-up questions to determine which of the two answers to the root question is honest, and select that one.
- No one can lie about quotes or their positions in the text: the quotes from the text are the ground truth anchoring this game.
- The judge can ask at most 6 questions in total before a decision must be made.
Whenever the judge asks the experts a question, this generates a new experiment: now a different judge must decide which of two expert answers to that question is honest and which is malicious, using the same recursive process. For this to terminate, some judge must eventually choose an answer without asking any subquestions.
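To make the recursive structure concrete, here is a minimal sketch in Python. The expert and judge callables (ask_honest, ask_malicious, ask_judge, pick_answer) and the transcript bookkeeping are illustrative assumptions, not Ought’s actual implementation:

```python
MAX_QUESTIONS = 6  # a judge's question budget within one experiment

def evaluate(question, ask_honest, ask_malicious, ask_judge, pick_answer):
    """Run one experiment: a judge chooses between two expert answers.

    ask_honest / ask_malicious: question -> answer (the two experts)
    ask_judge: (question, answers, transcript) -> subquestion or None
    pick_answer: (question, answers, transcript) -> the chosen answer
    """
    answers = (ask_honest(question), ask_malicious(question))
    transcript = []
    for _ in range(MAX_QUESTIONS):
        subquestion = ask_judge(question, answers, transcript)
        if subquestion is None:
            break  # the judge is ready to decide
        # Each subquestion spawns a fresh experiment with a different
        # judge, resolved by this same recursive procedure.
        sub_answer = evaluate(subquestion, ask_honest, ask_malicious,
                              ask_judge, pick_answer)
        transcript.append((subquestion, sub_answer))
    # For the recursion to terminate, some judge must eventually pick an
    # answer without asking any subquestions.
    return pick_answer(question, answers, transcript)
```

In the real experiments, each call to ask_judge corresponds to a different person who only ever sees up to 650 characters of quoted evidence.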
Some ML work as well:
Example questions from the Complex Web Questions (CWQ) dataset:
- The actress that had the role of Martha Alston, plays what role in Finding Nemo?
- Which school that Sir Ernest Rutherford attended has the latest founding date?
- What movies does Leo Howard play in and that is 113.0 minutes long?
- Where is the end of the river that originates in Shannon Pot?
We built an end-to-end system using GPT-2 that breaks each question into subquestions, queries Google to answer each subquestion, and aggregates the answers back into an answer to the original question. Currently, our system answers about 30% of the CWQ questions correctly.
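The pipeline is simple to picture in code. A minimal sketch follows, with the three stages passed in as callables since the post doesn’t specify their interfaces; the Shannon Pot decomposition in the comment is an illustrative guess:

```python
from typing import Callable, List

def answer_complex_question(
    question: str,
    decompose: Callable[[str], List[str]],                  # language model (GPT-2)
    search: Callable[[str], str],                           # e.g. a Google query
    aggregate: Callable[[str, List[str], List[str]], str],  # language model
) -> str:
    # e.g. "Where is the end of the river that originates in Shannon Pot?"
    #  ->  ["Which river originates in Shannon Pot?",
    #       "Where does that river end?"]
    subquestions = decompose(question)
    sub_answers = [search(q) for q in subquestions]
    # Recombine the sub-answers into an answer to the original question.
    return aggregate(question, subquestions, sub_answers)
```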