[Link] “Progress Update October 2019” (Ought)
Disclosure: I do contract work for Ought.
https://ought.org/updates/2019-10-28-progress-update (a)
tl;dr:
This is an update on our progress towards our goals over the last ten months. If you can only read 650 characters of this update, like the judges in our experiments, here’s what you need to know:
1. We switched from experiments that break down tasks (factored generation) to experiments that break down evaluating expert work (factored evaluation)
2. 60+ participants have been working a combined 150+ hours per week on our experiments
3. We’re building Mosaic2, an app that streamlines running varied question-answer experiments (factored evaluation, debate, etc.)
4. We’re exploring whether language models can automate decompositions, reaching 30% accuracy on the Complex Web Questions dataset
5. William Saunders joined as ML engineer, Jungwon Byun as COO
6. We’re hiring an engineering team lead and a business operations person. We’ll pay $5000 for a successful referral! [for the engineering team lead]
Summary of Ought’s experiment structure:
Skipping over a few details, our experiments have the following structure:
-There is a person, the judge.
-The judge faces an overall (root) question: “What does the author of this Pitchfork music album review think of the work being reviewed?”
-This judge is handicapped: they can read at most 650 characters, so they can never read the whole review. Thus, the judge does not have the context required to answer this root question.
-However, the judge has access to two experts who can read the whole text and who provide two possible answers.
-Unfortunately, only one of these experts is honest; the other is malicious and is trying to trick the judge into accepting a wrong but plausible-sounding answer.
-Without ever seeing the whole text, and only getting information through the experts, the judge must ask follow-up questions to the experts to decipher which answer to the root question is honest and select that one.
-No one can lie about quotes or quotes’ positions in the text: the quotes from the text are the ground truth anchoring this game.
-Up to 6 total questions can be asked by the judge before a decision must be made.
Whenever the judge asks the experts a question, this generates a new experiment: now a different judge must decide which of two expert answers to that question is honest and which is malicious, using the same recursive process. For this to terminate, eventually a judge must choose an answer without asking any subquestions.
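For concreteness, here is a minimal, runnable sketch of that recursive game in Python. Everything here is a hypothetical illustration of the structure described above (a judge, two experts, a shrinking question budget), not Ought’s actual implementation:

```python
# Hypothetical sketch of the factored-evaluation game described above;
# names and structure are illustrative, not Ought's actual code.
from typing import Callable, List, Optional, Tuple

MAX_SNIPPET_CHARS = 650  # the judge can read at most this much at a time
MAX_QUESTIONS = 6        # total subquestions allowed before deciding

Expert = Callable[[str], str]       # question -> answer (may lie)
Transcript = List[Tuple[str, str]]  # (subquestion, winning answer)


def judge(question: str,
          answers: Tuple[str, str],
          experts: Tuple[Expert, Expert],
          pick_subquestion: Callable[[str, Tuple[str, str], Transcript],
                                     Optional[str]],
          decide: Callable[[str, Tuple[str, str], Transcript], int],
          budget: int = MAX_QUESTIONS) -> int:
    """Return the index (0 or 1) of the answer the judge believes is honest.

    The judge never sees the source text; every piece of information
    arrives through the experts, clipped to MAX_SNIPPET_CHARS.
    """
    transcript: Transcript = []
    for _ in range(budget):
        subq = pick_subquestion(question, answers, transcript)
        if subq is None:  # the judge is ready to decide
            break
        sub_answers = (experts[0](subq)[:MAX_SNIPPET_CHARS],
                       experts[1](subq)[:MAX_SNIPPET_CHARS])
        # Each subquestion spawns a fresh game for a new judge, with a
        # strictly smaller budget, so the recursion must bottom out.
        winner = judge(subq, sub_answers, experts,
                       pick_subquestion, decide, budget - 1)
        transcript.append((subq, sub_answers[winner]))
    return decide(question, answers, transcript)
```

The `budget - 1` in the recursive call is what guarantees termination: some judge down the tree must decide from the candidate answers and transcript alone.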
Some ML work as well:
Complex Web Questions
First, we took the Complex Web Questions dataset, which contains questions like these:
-The actress that had the role of Martha Alston, plays what role in Finding Nemo?
-Which school that Sir Ernest Rutherford attended has the latest founding date?
-What movies does Leo Howard play in and that is 113.0 minutes long?
-Where is the end of the river that originates in Shannon Pot?
We built an end-to-end system using GPT-2 that breaks the questions into subquestions, queries Google to answer each of the subquestions, and aggregates the answers back together to answer the original question. Currently, our system answers about 30% of the questions in CWQ correctly.
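The shape of that pipeline, as I understand it, is roughly the following. This is a hypothetical sketch; the helper names are mine, while the actual system uses GPT-2 for the decomposition and aggregation steps and Google queries for the search step:

```python
# Hypothetical sketch of the decompose -> search -> aggregate pipeline
# described above. Helper names are illustrative; the actual system uses
# GPT-2 to generate subquestions and aggregate answers, and Google search
# to answer the subquestions.
from typing import Callable, List


def answer_complex_question(question: str,
                            decompose: Callable[[str], List[str]],
                            search: Callable[[str], str],
                            aggregate: Callable[[str, List[str]], str]) -> str:
    subquestions = decompose(question)               # e.g. GPT-2
    sub_answers = [search(q) for q in subquestions]  # e.g. Google queries
    return aggregate(question, sub_answers)          # e.g. GPT-2 again


# Illustrative trace for "Where is the end of the river that originates
# in Shannon Pot?" (the answers shown are what a correct run would produce):
#   decompose -> ["Which river originates in Shannon Pot?",
#                 "Where does the River Shannon end?"]
#   search    -> ["River Shannon", "the Atlantic Ocean"]
#   aggregate -> "the Atlantic Ocean"
```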
I am an amateur and generalist, but this is fascinating, especially the GPT-2 system. (I skimmed some of the links in the original post.) My limited background is in theoretical biology, including ‘natural intelligence and language learning’, so I am more familiar with issues in animal behavior and linguistics (e.g. debates between Chomskyan linguists and connectionists; see https://arxiv.org/abs/cs/0212024 ).
I have actually been working on trying to formulate an analog of what the GPT-2 system does, but as a ‘Fermi problem’ or ‘estimate’: something you can do by hand with a piece of paper, and maybe a calculator. This was partly inspired by some questions raised in an EA-affiliated group, EE (effective environmentalism), but these questions occur throughout the sciences: multiobjective optimization, pattern recognition. (My approach could be called ‘deep learning for dummies’.)
I don’t really know what ‘factored generation’ and ‘factored evaluation’ are, but they sound like ‘inverse problems’ (e.g. integer factorization versus generating an integer from prime factors). I also view these as ‘matching’ or ‘search’ problems.
I was vaguely aware of how far AI had evolved, but the GPT-2 system makes me wonder whether some online discussions I have (mostly on science lists about things like climate change) are actually with ‘bots’ rather than scientists. I agree with the OpenAI ethics statement, but it is not enforceable at present. I sort of wonder what this project is geared towards, and also what it implies for people like me who have few of the skills required to do this kind of research. I’ll just keep trying my ‘Fermi problem’ approach; until everything is automated there may still be a few places on earth for simple minds.