This paywalled article mentions a $4B valuation for the round:
Scaling your impact: A polynomial model of career growth
Discovering alignment windfalls reduces AI risk
A concrete version of this I’ve been wondering about the last few days: To what extent are the negative results on Debate (single-turn, two-turn) intrinsic to small-context supervision vs. a function of relatively contingent design choices about how people get to interact with the models?
I agree that misuse is a concern. Unlike alignment, I think it’s relatively tractable because it’s more similar to problems people are encountering in the world right now.
To address it, we can monitor and restrict usage as needed. The same tools that Elicit provides for reasoning can also be used to reason about whether a use case constitutes misuse.
This isn’t to say that we might not need to invest a lot of resources eventually, and it’s interestingly related to alignment (“misuse” is relative to some values), but it feels a bit less open-ended.
Elicit is using the Semantic Scholar Academic Graph dataset. We’re working on expanding to other sources. If there are particular ones that would be helpful, message me?
Have you listened to the 80k episode with Nova DasSarma from Anthropic? They might have cybersecurity roles. The closest we have right now is devops—which, btw, if anyone is reading this comment, we are really bottlenecked on and would love intros to great people.
No, it’s that our case for alignment doesn’t rest on “the system is only giving advice” as a step. I sketched the actual case in this comment.
Oh, forgot to mention Jonathan Uesato at DeepMind, who’s also very interested in advancing the ML side of factored cognition.
The things we’re aiming for that make submodels easier to align:
(Inner alignment) Smaller models, making it less likely that there’s scheming happening that we’re not aware of; making the bottom-up interpretability problem easier
(Outer alignment) More well-specified tasks, making it easier to generate a lot of in-distribution feedback data; making it easier to do targeted red-teaming
For AGI there isn’t much of a distinction between giving advice and taking actions, so this isn’t part of our argument for safety in the long run. But in the time between here and AGI it’s better to focus on supporting reasoning to help us figure out how to manage this precarious situation.
To clarify, here’s how I’m interpreting your question:
“Most technical alignment work today looks like writing papers or theoretical blog posts and addressing problems we’d expect to see with more powerful AI. It mostly doesn’t try to be useful today. Ought claims to take a product-driven approach to alignment research, simultaneously building Elicit to inform and implement its alignment work. Why did Ought choose this approach instead of the former?”
First, I think it’s good for the community to take a portfolio approach and for different teams to pursue different approaches. I don’t think there is a single best approach, and a lot of it comes down to the specific problems you’re tackling and team fit.
For Ought, there’s an unusually good fit between our agenda and Elicit the product—our whole approach is built around human-endorsed reasoning steps, and it’s hard to do that without humans who care about good reasoning and actually want to apply it to solve hard problems. If we were working on ELK I doubt we’d be working on a product.
Second, as a team we just like building things. We have better feedback loops this way and the nearer-term impacts of Elicit on improving quality of reasoning in research and beyond provide concrete motivation in addition to the longer-term impacts.
Some other considerations in favor of taking a product-driven approach are:
Deployment plans help us choose tasks. We did “pure alignment research” when we ran our initial factored cognition experiments. At the time, choosing the right task felt about as hard as choosing the right mechanism or implementing it correctly. For example, say we want to study factored cognition—should we factor reasoning about essays? SAT questions? Movie reviews? Forecasting questions? When experiments failed, it was hard to know whether we could have stripped down the task more to better test the mechanism, or whether the mechanism in fact didn’t solve the problem at hand. Our findings seemed brittle and highly dependent on assumptions about the task, unlikely to hold up in future deployment scenarios. Now that we have a much clearer incremental deployment story in mind, we can better think about what research is more or less likely to be useful. FWIW I suspect this challenge of task specification is a pretty underrated obstacle for many alignment researchers.
Eventually we’ll have to cross the theory-practice gap. At some point alignment research will likely have to bridge the theory-practice gap. There are different ways to do this—we could first develop theoretical foundations, then basic algorithms, then implement them in the real world, or co-evolve the different parts and propagate constraints between them as we go. I think both ways have pros and cons, but it seems important that some groups pursue the latter, especially in a world with short AI timelines.
Risks with trying to do both are:
Balancing multiple stakeholders. Sometimes an impressive research demonstration isn’t actually useful in the short term, or the most useful things for users don’t teach us anything new about alignment. Models are barely capable enough to be acceptable stand-ins for crowd workers, which limits what we can learn; conversely, the best way to solve some product problems could just be to scale up the models. Overall, our product-research dynamic has been net positive and creates virtuous cycles where the product grounds the research and the research improves the product. But it’s a fine line to tread and a tension we have to actively manage. I can easily imagine other teams or agendas where this would be net negative. I imagine we’d also have a harder time making good choices here if we were a for-profit.
Being too solution-driven. From a product angle, I sometimes worry that we might over-apply the decomposition hammer. But an important part of our research goal is understanding where decomposition is useful / competitive / necessary so it’s probably fine as long as we course-correct quickly.
We’re aiming to shift the balance towards supporting high-quality reasoning. Every tool has some non-zero usefulness for non-central use cases, but it’s unlikely to be as useful for those cases as tools that were made for them.
I found your factored cognition project really interesting. Is anyone still researching this (besides the implementation in Elicit)?
Some people who are explicitly interested in working on it: Sam Bowman at NYU, Alex Gray at OpenAI. On the ML side there’s also work like Selection-Inference that isn’t explicitly framed as factored cognition but also avoids end-to-end optimization in favor of locally coherent reasoning steps.
I’d say what we’re afraid of is that we’ll have AI systems that are capable of sophisticated planning but that we don’t know how to channel those capabilities into aligned thinking on vague complicated problems. Ought’s work is about avoiding this outcome.
At this point we could chat about why it’s plausible that we’ll have such capable but unaligned AI systems, or about how Ought’s work is aimed at reducing the risk of such systems. The former isn’t specific to Ought, so I’ll point to Ajeya’s post Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover.
I just want to highlight the key assumption Ajeya’s argument rests on: The system is end-to-end optimized on a feedback signal (generally from human evaluations), i.e. all its compute is optimizing a signal that has no way to separate “fake it while in training” from “have the right intent”, and so can lead to catastrophic outcomes when the system is deployed.
How does Ought’s work help avoid that outcome?
We’re breaking down complex reasoning into processes with parts that are not jointly end-to-end optimized. This makes it possible to use smaller models for individual parts, makes the computation more transparent, and makes it easier to verify that the parts are indeed implementing the function that we (or future models) think they’re implementing.
You can think of it as interpretability-by-construction: Instead of training a model end-to-end and then trying to see what circuits it learned and whether they’re implementing the right thing, take smaller models that you know are implementing the right thing and compose them (with AI help) into larger systems that are correct not primarily based on empirical performance but based on a priori reasoning.
This is complementary to traditional bottom-up interpretability work: The more decomposition can limit the amount of black-box compute and uninterpretable intermediate state, the less weight rests on circuits-style interpretability and ELK-style proposals.
We don’t think we’ll be able to fully avoid end-to-end training (it’s ML’s magic juice, after all), but we think that reducing it is helpful even on the margin. From our post on supervising process, which has a lot more detail on the points in this comment: “Inner alignment failures are most likely in cases where models don’t just know a few facts we don’t but can hide extensive knowledge from us, akin to developing new branches of science that we can’t follow. With limited compute and limited neural memory, the risk is lower.”
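A minimal sketch of this kind of composition, assuming a hypothetical call_model function that stands in for a small, narrowly-scoped model (the decomposition and prompts are illustrative, not Elicit’s actual implementation):

```python
from typing import Callable, List

# Hypothetical stand-in for a call to a small, narrowly-scoped model;
# in practice an API call or a small fine-tuned model.
CallModel = Callable[[str], str]


def decompose(question: str, call_model: CallModel) -> List[str]:
    """Ask a model to split a question into subquestions."""
    response = call_model(f"List the subquestions needed to answer: {question}")
    return [line.strip("- ").strip() for line in response.splitlines() if line.strip()]


def answer_subquestion(subquestion: str, call_model: CallModel) -> str:
    """Answer one narrowly-specified subquestion."""
    return call_model(f"Answer concisely: {subquestion}")


def check_step(subquestion: str, answer: str, call_model: CallModel) -> bool:
    """Verify a single step in isolation: feedback targets the step, not the final outcome."""
    verdict = call_model(f"Is '{answer}' a reasonable answer to '{subquestion}'? Reply yes or no.")
    return verdict.strip().lower().startswith("yes")


def compose(question: str, steps: List[tuple], call_model: CallModel) -> str:
    """Combine checked intermediate results into a final answer."""
    context = "\n".join(f"{q}: {a}" for q, a in steps)
    return call_model(f"Given these findings:\n{context}\nAnswer: {question}")


def answer(question: str, call_model: CallModel) -> str:
    steps = []
    for sub in decompose(question, call_model):
        ans = answer_subquestion(sub, call_model)
        if check_step(sub, ans, call_model):  # supervise the process, step by step
            steps.append((sub, ans))
    # The pieces are composed, not jointly trained: no gradient flows across steps,
    # and every intermediate result stays human-readable.
    return compose(question, steps, call_model)
```

Because each function has a narrow contract, it’s feasible to generate in-distribution feedback data for it and to red-team it on its own, which is the sense in which the parts are easier to align than one end-to-end model.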
We built Ergo (a Python library for integrating model-based and judgmental forecasting) as part of our work on forecasting. In the course of this work we realized that for many forecasting questions the bottleneck isn’t forecasting infrastructure per se, but the high-quality research and reasoning that goes into creating good forecasts, so we decided to focus on that aspect.
I’m still excited about Ergo-like projects (including Squiggle!). Developing it further would be a valuable contribution to epistemic infrastructure. Ergo is an MIT-licensed open-source project so you can basically do whatever you want with it. As a small team we have to focus on our core project, but if there are signs of life from an Ergo successor (5+ regular users, say) I’d be happy to talk for a few hours about what we learned from Ergo.
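For anyone curious what “integrating model-based and judgmental forecasting” can look like in practice, here is a rough sketch in plain numpy (not Ergo’s actual API, and all numbers are invented for illustration): a model-based forecast and an elicited judgmental interval are both turned into samples and mixed into one predictive distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Model-based forecast: samples from a simple growth model (invented numbers).
base = rng.normal(loc=100, scale=10, size=n)
growth = rng.lognormal(mean=0.05, sigma=0.1, size=n)
model_samples = base * growth

# Judgmental forecast: a forecaster's 90% interval, converted to a lognormal.
low, high = 90, 140  # elicited 5th and 95th percentiles (invented)
mu = (np.log(low) + np.log(high)) / 2
sigma = (np.log(high) - np.log(low)) / (2 * 1.645)  # 1.645 ~ z-score of the 95th percentile
judgmental_samples = rng.lognormal(mean=mu, sigma=sigma, size=n)

# Combine: a simple mixture weighted by how much we trust each source.
weight_model = 0.6
pick_model = rng.random(n) < weight_model
combined = np.where(pick_model, model_samples, judgmental_samples)

print("median:", np.median(combined))
print("90% interval:", np.percentile(combined, [5, 95]))
```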
AMA: Ought
Ought is an applied machine learning lab, hiring for:
Our mission is to automate and scale open-ended reasoning. To that end, we’re building Elicit, the AI research assistant. Elicit’s architecture is based on supervising reasoning processes, not outcomes. This is better for supporting open-ended reasoning in the short run and better for alignment in the long run.
Over the last year, we built Elicit to support broad reviews of empirical literature. The literature review workflow runs on general-purpose infrastructure for executing compositional language model processes. Going forward, we’ll expand to deep literature reviews, then other research workflows, then general-purpose reasoning. (More here)
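As a rough illustration of what “compositional language model processes” means here (not Elicit’s actual implementation; call_model, the prompts, and the Paper fields are placeholders), a literature-review workflow might be built out of small, separately inspectable steps like this:

```python
from dataclasses import dataclass
from typing import Callable, List

CallModel = Callable[[str], str]  # placeholder interface to a language model


@dataclass
class Paper:
    title: str
    abstract: str


def is_relevant(question: str, paper: Paper, call_model: CallModel) -> bool:
    """Screening step: judged per paper, so feedback can target this step alone."""
    verdict = call_model(
        f"Question: {question}\nAbstract: {paper.abstract}\nRelevant? yes/no"
    )
    return verdict.strip().lower().startswith("yes")


def extract_finding(question: str, paper: Paper, call_model: CallModel) -> str:
    """Extraction step: pull out what the paper says about the question."""
    return call_model(
        f"Question: {question}\nAbstract: {paper.abstract}\n"
        "Summarize what this paper says about the question in one sentence."
    )


def literature_review(question: str, papers: List[Paper], call_model: CallModel) -> str:
    """Compose the steps; each one can be supervised and improved independently."""
    findings = [
        extract_finding(question, p, call_model)
        for p in papers
        if is_relevant(question, p, call_model)
    ]
    return call_model(
        f"Question: {question}\nFindings:\n"
        + "\n".join(f"- {f}" for f in findings)
        + "\nWrite a short synthesis."
    )
```

The same primitives can be recombined for deeper reviews and other research workflows, which is what makes the underlying infrastructure general-purpose.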
We’re also only reporting our current guess for how things will turn out. We’re monitoring how Elicit is used and we’ll study its impacts and the anticipated impacts of future features, and if it turns out that the costs outweigh the benefits we will adjust our plans.
Another potential windfall I just thought of: the kind of AI scientist system discussed by Bengio in this talk (older writeup). The idea is to build a non-agentic system that uses foundation models and amortized Bayesian inference to create and do inference on compositional and interpretable world models. One way this would be used is for high-quality estimates of p(harm|action) in the context of online monitoring of AI systems, but if it could work it would likely have other profitable use cases as well.
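A toy sketch of what the monitoring use could look like, assuming we can sample candidate world models from the learned posterior; sample_world_model and p_harm_given_model below are invented stand-ins for illustration, not anything from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)


def estimate_p_harm(action, sample_world_model, p_harm_given_model, n_models=1000):
    """Monte Carlo estimate of p(harm | action), marginalizing over world-model uncertainty."""
    probs = np.array(
        [p_harm_given_model(sample_world_model(), action) for _ in range(n_models)]
    )
    # Also track an upper quantile, so rare-but-plausible world models in which
    # the action is harmful still count against it.
    return probs.mean(), np.quantile(probs, 0.95)


# Invented stand-ins for illustration only.
def sample_world_model():
    return {"harm_rate": rng.beta(2, 20)}  # uncertainty over how risky the world is


def p_harm_given_model(model, action):
    return min(1.0, model["harm_rate"] * action["risk_multiplier"])


action = {"risk_multiplier": 3.0}
mean, upper = estimate_p_harm(action, sample_world_model, p_harm_given_model)
print(f"p(harm|action) ~ {mean:.3f}, 95th percentile {upper:.3f}")
if upper > 0.05:
    print("Block or escalate this action for review.")
```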