Apollo Research is hiring evals and interpretability engineers & scientists
TL;DR: You can apply here.
About Apollo
Apollo Research is a London-based AI safety organization. We focus on high-risk failure modes, particularly deceptive alignment, and intend to audit frontier AI models. Our primary objective is to minimize catastrophic risks associated with advanced AI systems that may exhibit deceptive behavior, where misaligned models appear aligned in order to pursue their own objectives. Our approach involves conducting fundamental research on interpretability and behavioral model evaluations, which we then use to audit real-world models. Ultimately, our goal is to leverage interpretability tools for model evaluations, as we believe that examining model internals in combination with behavioral evaluations offers stronger safety assurances than behavioral evaluations alone.
Culture: At Apollo, we aim for a culture that emphasizes truth-seeking; being goal-oriented; giving and receiving constructive feedback; and being friendly and helpful. If you’re interested in more details about what it’s like working at Apollo, you can find more information here.
Context & Agenda
We believe we have converged on a good technical agenda for evals and interpretability. We are therefore not currently bottlenecked by ideas and could scale and parallelize existing research efforts with more people. We are confident that more people would help both agendas progress faster and that we can easily integrate them into our existing teams.
On the interpretability side, we’re pursuing a new approach to mechanistic interpretability that we’re not yet publicly discussing due to potential infohazard concerns. However, we expect the day-to-day work of most scientists and engineers to be comparable to existing public interpretability projects such as sparse coding, indirect object identification, causal scrubbing, toy models of superposition, or transformer circuits, as well as converting research insights into robust tools that can scale to very large models.
For the interpretability team, we aim to make 1-3 offers depending on funding and fit.
To evaluate models, we employ a variety of methods. First, we intend to evaluate model behavior using basic prompting techniques and agentic scaffolding, similar to AutoGPT. Second, we aim to fine-tune models to study their generalization capabilities and elicit their dangerous potential within a safe, controlled environment (we have several security policies in place to mitigate potential risks). At a high level, our current approach to evaluating deceptive alignment consists of breaking down the necessary capabilities and tracking how these scale with increasingly capable models. Some of these capabilities include situational awareness, stable non-myopic preferences, and particular kinds of generalization. In addition, we plan to build useful demos of precursor behaviors for further study.
For the evals team, we aim to make 2-4 offers depending on funding and fit.
We’re aiming for start dates between September and November 2023 but are happy to consider individual circumstances.
About the team
The evals efforts are currently spearheaded by Mikita Balesni (Evals Researcher) and Jérémy Scheurer (Evals Researcher) with guidance and advice from Marius Hobbhahn (CEO).
The interpretability efforts are spearheaded by Lucius Bushnaq (Interpretability Researcher) and Dan Braun (Lead Engineer) with guidance and advice from Lee Sharkey (CSO).
We also recently formed a small policy team, with Clíodhna Ní Ghuidhir as a full-time hire and one part-time position (tba), to support our technical work by helping build an adequate AI auditing ecosystem.
Leadership consists of Marius Hobbhahn, Lee Sharkey and Chris Akin (COO).
Our hierarchies are relatively flat and we’re happy to give new employees responsibilities and the ability to shape the organization.
If you have questions, feel free to reach out at info@apolloresearch.ai.