A non-definitive AIxBio reading list (plus other lists)
I recently updated my AIxBio reading list, originally developed for my PhD qualifying exams, and decided it is substantive enough to post here, with the caveat that it remains a non-definitive, partially organized, and semi-curated list.
In sharing the list with a few others, I was pointed to two other public AIxBio reading lists (sorry if I missed anyone else!):
List of AI x Biosecurity Resources by Ryan Teo and Shrestha Rath (mostly non-technical/policy-focused).
AIxBio Research Hub—a great collection of papers and newsletters. The website also lets you submit resources you think should be added (I submitted my list!)
Thanks to the SecureBio AI team and others for surfacing a steady drumbeat of relevant papers, many of which made their way into this list.
Capabilities of AI in biology and evaluation science
Götting, J., Medeiros, P., Sanders, J. G., Li, N., Phan, L., Elabd, K., Justen, L., Hendrycks, D., & Donoughe, S. (2025). Virology Capabilities Test (VCT): A multimodal virology Q&A benchmark. In arXiv [cs.CY]. arXiv. http://arxiv.org/abs/2504.16137
Justen, L. (2025). LLMs outperform experts on challenging biology benchmarks. In arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2505.06108
Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., & Bowman, S. R. (2023). GPQA: A Graduate-Level Google-Proof Q&A Benchmark. In arXiv [cs.AI]. arXiv. http://arxiv.org/abs/2311.12022
Laurent, J. M., Janizek, J. D., Ruzo, M., Hinks, M. M., Hammerling, M. J., Narayanan, S., Ponnapati, M., White, A. D., & Rodriques, S. G. (2024). LAB-bench: Measuring capabilities of language models for biology research. In arXiv [cs.AI]. arXiv. http://arxiv.org/abs/2407.10362
Boiko, D. A., MacKnight, R., Kline, B., & Gomes, G. (2023). Autonomous chemical research with large language models. Nature, 624(7992), 570–578. https://doi.org/10.1038/s41586-023-06792-0
Thadani, N. N., Gurev, S., Notin, P., Youssef, N., Rollins, N. J., Ritter, D., Sander, C., Gal, Y., & Marks, D. S. (2023). Learning from prepandemic data to forecast viral escape. Nature, 622(7984), 818–825. https://doi.org/10.1038/s41586-023-06617-0
Hayes, T., Rao, R., Akin, H., Sofroniew, N. J., Oktay, D., Lin, Z., Verkuil, R., Tran, V. Q., Deaton, J., Wiggert, M., Badkundri, R., Shafkat, I., Gong, J., Derry, A., Molina, R. S., Thomas, N., Khan, Y. A., Mishra, C., Kim, C., … Rives, A. (2025). Simulating 500 million years of evolution with a language model. Science (New York, N.Y.), 387(6736), 850–858. https://doi.org/10.1126/science.ads0018
Nguyen, E., Poli, M., Durrant, M. G., Kang, B., Katrekar, D., Li, D. B., Bartie, L. J., Thomas, A. W., King, S. H., Brixi, G., Sullivan, J., Ng, M. Y., Lewis, A., Lou, A., Ermon, S., Baccus, S. A., Hernandez-Boussard, T., Ré, C., Hsu, P. D., & Hie, B. L. (2024). Sequence modeling and design from molecular to genome scale with Evo. Science (New York, N.Y.), 386(6723), eado9336. https://doi.org/10.1126/science.ado9336
Mouton, C., Lucas, C., & Guest, E. (2024). The Operational Risks of AI in Large-Scale Biological Attacks. RAND Corporation. https://www.rand.org/pubs/research_reports/RRA2977-2.html
OpenAI. (2024). Building an early warning system for LLM-aided biological threat creation. https://openai.com/index/building-an-early-warning-system-for-llm-aided-biological-threat-creation/
Moult, J., Pedersen, J. T., Judson, R., & Fidelis, K. (1995). A large-scale experiment to assess protein structure prediction methods. Proteins, 23(3), ii–v. https://doi.org/10.1002/prot.340230303
Li, N., Pan, A., Gopal, A., Yue, S., Berrios, D., Gatti, A., Li, J. D., Dombrowski, A.-K., Goel, S., Phan, L., Mukobi, G., Helm-Burger, N., Lababidi, R., Justen, L., Liu, A. B., Chen, M., Barrass, I., Zhang, O., Zhu, X., … Hendrycks, D. (2024). The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning. In arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2403.03218
Miller, E. (2024). Adding error bars to evals: A statistical approach to language model evaluations. In arXiv [stat.AP]. arXiv. http://arxiv.org/abs/2411.00640
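As a concrete illustration of the kind of statistics the Miller paper above argues for, here is a minimal sketch (my own, not code from the paper) of a CLT-based standard error and 95% confidence interval for a benchmark accuracy, assuming independently scored questions:

```python
import math

def accuracy_ci(scores, z=1.96):
    """CLT-based confidence interval for mean benchmark accuracy.

    scores: per-question results (1.0 = correct, 0.0 = incorrect).
    Assumes questions are scored independently; z=1.96 gives ~95% coverage.
    """
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance of the per-question scores, then standard error of the mean.
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    sem = math.sqrt(var / n)
    return mean, (mean - z * sem, mean + z * sem)

# Hypothetical eval: 70 correct out of 100 questions.
mean, (lo, hi) = accuracy_ci([1.0] * 70 + [0.0] * 30)
# mean = 0.70; CI roughly (0.61, 0.79)
```

The width of that interval is the point: on a 100-question benchmark, a few percentage points of difference between two models is often well within the noise.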
Hobbhahn, M. (2024, January 22). We need a Science of Evals. Apollo Research. https://www.apolloresearch.ai/blog/we-need-a-science-of-evals
Wei, B., Che, Z., Li, N., Sehwag, U. M., Götting, J., Nedungadi, S., Michael, J., Yue, S., Hendrycks, D., Henderson, P., Wang, Z., Donoughe, S., & Mazeika, M. (2025). Best practices for biorisk evaluations on open-weight bio-foundation models. In arXiv [cs.CR]. arXiv. https://doi.org/10.48550/arXiv.2510.27629
McCaslin, T., Alaga, J., Nedungadi, S., Donoughe, S., Reed, T., Bommasani, R., Painter, C., & Righetti, L. (2025). STREAM (ChemBio): A standard for Transparently Reporting Evaluations in AI Model Reports. In arXiv [cs.CY]. arXiv. http://arxiv.org/abs/2508.09853
Reed, T., McCaslin, T., & Righetti, L. (2025). What do model reports say about their ChemBio benchmark evaluations? Comparing recent releases to the STREAM framework. In arXiv [cs.CY]. arXiv. https://doi.org/10.48550/arXiv.2510.20927
Mitchener, L., Yiu, A., Chang, B., Bourdenx, M., Nadolski, T., Sulovari, A., Landsness, E. C., Barabasi, D. L., Narayanan, S., Evans, N., Reddy, S., Foiani, M., Kamal, A., Shriver, L. P., Cao, F., Wassie, A. T., Laurent, J. M., Melville-Green, E., Caldas, M., … White, A. D. (2025). Kosmos: An AI scientist for autonomous discovery. In arXiv [cs.AI]. arXiv. https://doi.org/10.48550/arXiv.2511.02824
Zhu, Y., Jin, T., Pruksachatkun, Y., Zhang, A., Liu, S., Cui, S., Kapoor, S., Longpre, S., Meng, K., Weiss, R., Barez, F., Gupta, R., Dhamala, J., Merizian, J., Giulianelli, M., Coppock, H., Ududec, C., Sekhon, J., Steinhardt, J., … Kang, D. (2025). Establishing best practices for building rigorous agentic benchmarks. In arXiv [cs.AI]. arXiv. https://doi.org/10.48550/arXiv.2507.02825
Emberson, L. (2025, October 30). Open-weight models lag state-of-the-art by around 3 months on average. Epoch AI. https://epoch.ai/data-insights/open-weights-vs-closed-weights-models
Somala, V. (2025, August 15). Frontier AI performance becomes accessible on consumer hardware within a year. Epoch AI. https://epoch.ai/data-insights/consumer-gpu-model-gap
Mazeika, M., Gatti, A., Menghini, C., Sehwag, U. M., Singhal, S., Orlovskiy, Y., Basart, S., Sharma, M., Peskoff, D., Lau, E., Lim, J., Carroll, L., Blair, A., Sivakumar, V., Basu, S., Kenstler, B., Ma, Y., Michael, J., Li, X., … Hendrycks, D. (2025). Remote Labor Index: Measuring AI automation of remote work. In arXiv [cs.LG]. arXiv. https://doi.org/10.48550/arXiv.2510.26787
Cong, L., Zhang, Z., Wang, X., Di, Y., Jin, R., Gerasimiuk, M., Wang, Y., Dinesh, R. K., Smerkous, D., Smerkous, A., Wu, X., Liu, S., Li, P., Zhu, Y., Serrao, S., Zhao, N., Mohammad, I. A., Sunwoo, J. B., Wu, J. C., & Wang, M. (2025). LabOS: The AI-XR co-scientist that sees and works with humans. In arXiv [cs.AI]. arXiv. https://doi.org/10.48550/arXiv.2510.14861
Skowronek, P., Nawalgaria, A., & Mann, M. (2025). Multimodal AI agents for capturing and sharing laboratory practice. In bioRxiv (p. 2025.10.05.680425). https://doi.org/10.1101/2025.10.05.680425
King, S. H., Driscoll, C. L., Li, D. B., Guo, D., Merchant, A. T., Brixi, G., Wilkinson, M. E., & Hie, B. L. (2025). Generative design of novel bacteriophages with genome language models. In bioRxiv (p. 2025.09.12.675911). https://doi.org/10.1101/2025.09.12.675911
Wang, D., Huot, M., Zhang, Z., Jiang, K., Shakhnovich, E. I., & Esvelt, K. M. (2025). Without safeguards, AI-biology integration risks accelerating future pandemics. https://doi.org/10.13140/RG.2.2.29765.15849
Third-Party Assessments. (2025, August 4). Frontier Model Forum. https://www.frontiermodelforum.org/technical-reports/third-party-assessments/
A structured protocol for elicitation experiments. (n.d.). AI Security Institute. Retrieved December 3, 2025, from https://www.aisi.gov.uk/blog/our-approach-to-ai-capability-elicitation
Pannu, J., Bloomfield, D., MacKnight, R., Hanke, M. S., Zhu, A., Gomes, G., Cicero, A., & Inglesby, T. V. (2025). Dual-use capabilities of concern of biological AI models. PLoS Computational Biology, 21(5), e1012975. https://doi.org/10.1371/journal.pcbi.1012975
Williams, B., Righetti, L., Rosenberg, J., de Castro, R. C., Kuusela, O., Britt, R., Soice, E., Morales, A., Sanders, J., Donoughe, S., Black, J., Karger, E., & Tetlock, P. E. (2025). Forecasting LLM-enabled biorisk and the efficacy of safeguards. Forecasting Research Institute. https://static1.squarespace.com/static/635693acf15a3e2a14a56a4a/t/68812b62e85b2808f0366c41/1753295738891/ai-enabled-biorisk.pdf
Zhang, Z., Zhou, Z., Jin, R., Cong, L., & Wang, M. (2025). GeneBreaker: Jailbreak attacks against DNA language models with pathogenicity guidance. In arXiv [cs.CR]. arXiv. https://doi.org/10.48550/arXiv.2505.23839
Needham, J., Edkins, G., Pimpale, G., Bartsch, H., & Hobbhahn, M. (2025). Large language models often know when they are being evaluated. In arXiv [cs.CL]. arXiv. https://doi.org/10.48550/arXiv.2505.23836
Ikonomova, S. P., Wittmann, B. J., Piorino, F., Ross, D. J., Schaffter, S. W., Vasilyeva, O., Horvitz, E., Diggans, J., Strychalski, E. A., Lin-Gibson, S., & Taghon, G. J. (2025). Experimental evaluation of AI-driven protein design risks using safe biological proxies. In bioRxiv (p. 2025.05.15.654077). https://doi.org/10.1101/2025.05.15.654077
Wei, K., Paskov, P., Dev, S., Byun, M. J., Reuel, A., Roberts-Gaal, X., Calcott, R., Coxon, E., & Deshpande, C. (2025, March 6). Model Evaluations Need Rigorous and Transparent Human Baselines. ICLR 2025 Workshop on Building Trust in Language Models and Applications. https://openreview.net/forum?id=VbG9sIsn4F
Esvelt, K. M. (2025). Foundation models may exhibit staged progression in novel CBRN threat disclosure. In arXiv [cs.CY]. arXiv. https://doi.org/10.48550/arXiv.2503.15182
Risks of AI in bioweapons development
Sandbrink, J. B. (2023). Artificial intelligence and biological misuse: Differentiating risks of language models and biological design tools. In arXiv [cs.CY]. arXiv. http://arxiv.org/abs/2306.13952
Götting, J., Medeiros, P., Sanders, J. G., Li, N., Phan, L., Elabd, K., Justen, L., Hendrycks, D., & Donoughe, S. (2025). Virology Capabilities Test (VCT): A multimodal virology Q&A benchmark. In arXiv [cs.CY]. arXiv. http://arxiv.org/abs/2504.16137
Yuki, H., Hough, L., Sageman, M., Danzig, R., Kotani, R., & Leighton, T. (2011). Aum Shinrikyo: Insights Into How Terrorists Develop Biological and Chemical Weapons. CNAS. https://www.cnas.org/publications/reports/aum-shinrikyo-insights-into-how-terrorists-develop-biological-and-chemical-weapons
Thadani, N. N., Gurev, S., Notin, P., Youssef, N., Rollins, N. J., Ritter, D., Sander, C., Gal, Y., & Marks, D. S. (2023). Learning from prepandemic data to forecast viral escape. Nature, 622(7984), 818–825. https://doi.org/10.1038/s41586-023-06617-0
Montague, M. (2023). Towards a Grand Unified Threat Model of Biotechnology. https://philsci-archive.pitt.edu/22539/
Ben Ouagrham-Gormley, S. (2014). Barriers to bioweapons: The challenges of expertise and organization for weapons development. Cornell University Press. https://doi.org/10.7591/cornell/9780801452888.001.0001
Nelson, C., & Rose, S. (2023). Understanding AI-Facilitated Biological Weapon Development. The Centre for Long-term Resilience. https://doi.org/10.71172/nm7j-qzt1
Esvelt, K. (2022). Delay, Detect, Defend: Preparing for a Future in which Thousands Can Release New Pandemics. Geneva Centre for Security Policy. https://www.gcsp.ch/sites/default/files/2024-12/gcsp-geneva-paper-29-22.pdf
Brent, R., & McKelvey, T. G., Jr. (2025). Contemporary AI foundation models increase biological weapons risk. In arXiv [cs.CY]. arXiv. http://arxiv.org/abs/2506.13798
Carter, S., Wheeler, N., Chwalek, S., Isaac, C., & Yassif, J. (2023). The Convergence of Artificial Intelligence and the Life Sciences. Nuclear Threat Initiative. https://www.nti.org/analysis/articles/the-convergence-of-artificial-intelligence-and-the-life-sciences/
Gopal, A., Helm-Burger, N., Justen, L., Soice, E. H., Tzeng, T., Jeyapragasan, G., Grimm, S., Mueller, B., & Esvelt, K. M. (2023). Will releasing the weights of future large language models grant widespread access to pandemic agents? In arXiv [cs.AI]. arXiv. http://arxiv.org/abs/2310.18233
Mouton, C., Lucas, C., & Guest, E. (2024). The Operational Risks of AI in Large-Scale Biological Attacks. RAND Corporation. https://www.rand.org/pubs/research_reports/RRA2977-2.html
OpenAI. (2024). Building an early warning system for LLM-aided biological threat creation. https://openai.com/index/building-an-early-warning-system-for-llm-aided-biological-threat-creation/
Rose, S., Moulange, R., Smith, J., & Nelson, C. (2024). The near-term impact of AI on biological misuse. The Centre for Long-Term Resilience. https://www.longtermresilience.org/reports/the-near-term-impact-of-ai-on-biological-misuse/
Walsh, M. E. (2024). Towards risk analysis of the impact of AI on the deliberate biological threat landscape. In arXiv [cs.CY]. arXiv. http://arxiv.org/abs/2401.12755
Jeyapragasan, G. (2024). Risk-Benefit Assessment of Pandemic Virus Identification [MSc, MIT]. https://dam-prod2.media.mit.edu/x/2024/08/19/jeyapragasan-geethaj-SM-MAS-2024-thesis_m7l6fQF.pdf
Bromberg, Y., Altman, R., Imperiale, M., Horvitz, E., Dus, M., Townshend, R., Yao, V., Treangen, T., Alexanian, T., Szymanski, E., Yassif, J., Anta, R., Lindner, A. B., Schmidt, M., Diggans, J., Esvelt, K. M., Molla, K. A., Phelan, R., Wang, M., … de Carvalho Bittencourt, D. M. (2025). 3.1 Artificial Intelligence and the Future of Biotechnology. Rice University. https://doi.org/10.25611/1233-X161
Manheim, D., Williams, A., Aveggio, C., & Berke, A. (2025). Understanding the Theoretical Limits of AI-Enabled Pathogen Design. RAND Corporation. https://www.rand.org/pubs/research_reports/RRA4087-1.html
Technical and policy harm mitigations
Wang, M., Zhang, Z., Bedi, A. S., Velasquez, A., Guerra, S., Lin-Gibson, S., Cong, L., Qu, Y., Chakraborty, S., Blewett, M., Ma, J., Xing, E., & Church, G. (2025). A call for built-in biosecurity safeguards for generative AI tools. Nature Biotechnology, 1–3. https://doi.org/10.1038/s41587-025-02650-8
Wittmann, B. J., Alexanian, T., Bartling, C., Beal, J., Clore, A., Diggans, J., Flyangolts, K., Gemler, B. T., Mitchell, T., Murphy, S. T., Wheeler, N. E., & Horvitz, E. (2024). Toward AI-resilient screening of nucleic acid synthesis orders: Process, results, and recommendations. In bioRxiv (p. 2024.12.02.626439). https://doi.org/10.1101/2024.12.02.626439
Baker, D., & Church, G. (2024). Protein design meets biosecurity. Science (New York, N.Y.), 383(6681), 349. https://doi.org/10.1126/science.ado1671
Executive Office of the President. (2023). Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. Federal Register. https://www.federalregister.gov/documents/2023/11/01/2023-24283/safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence
Li, N., Pan, A., Gopal, A., Yue, S., Berrios, D., Gatti, A., Li, J. D., Dombrowski, A.-K., Goel, S., Phan, L., Mukobi, G., Helm-Burger, N., Lababidi, R., Justen, L., Liu, A. B., Chen, M., Barrass, I., Zhang, O., Zhu, X., … Hendrycks, D. (2024). The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning. In arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2403.03218
Bloomfield, D., Pannu, J., Zhu, A. W., Ng, M. Y., Lewis, A., Bendavid, E., Asch, S. M., Hernandez-Boussard, T., Cicero, A., & Inglesby, T. (2024). AI and biosecurity: The need for governance. Science (New York, N.Y.), 385(6711), 831–833. https://doi.org/10.1126/science.adq1977
Bromberg, Y., Altman, R., Imperiale, M., Horvitz, E., Dus, M., Townshend, R., Yao, V., Treangen, T., Alexanian, T., Szymanski, E., Yassif, J., Anta, R., Lindner, A. B., Schmidt, M., Diggans, J., Esvelt, K. M., Molla, K. A., Phelan, R., Wang, M., … de Carvalho Bittencourt, D. M. (2025). 3.1 Artificial Intelligence and the Future of Biotechnology. Rice University. https://doi.org/10.25611/1233-X161
Bloomfield, D., Khawam, J., & Schnabel, T. (2025). How U.S. Export Controls Risk Undermining Biosecurity. Lawfare. https://www.lawfaremedia.org/article/how-u.s.-export-controls-risk-undermining-biosecurity
Casper, S., O’Brien, K., Longpre, S., Seger, E., Klyman, K., Bommasani, R., Nrusimha, A., Shumailov, I., Mindermann, S., Basart, S., Rudzicz, F., Pelrine, K., Ghosh, A., Strait, A., Kirk, R., Hendrycks, D., Henderson, P., Kolter, J. Z., Irving, G., … Hadfield-Menell, D. (2025). Open technical problems in open-weight AI model risk management. In Social Science Research Network. https://doi.org/10.2139/ssrn.5705186
Wittmann, B. J., Alexanian, T., Bartling, C., Beal, J., Clore, A., Diggans, J., Flyangolts, K., Gemler, B. T., Mitchell, T., Murphy, S. T., Wheeler, N. E., & Horvitz, E. (2025). Strengthening nucleic acid biosecurity screening against generative protein design tools. Science (New York, N.Y.), 390(6768), 82–87. https://doi.org/10.1126/science.adu8578
Zhang, Z., Jin, R., Cong, L., & Wang, M. (2025). Securing the language of life: Inheritable watermarks from DNA language models to proteins. In arXiv [q-bio.GN]. arXiv. https://doi.org/10.48550/arXiv.2509.18207
Chen, Y., Tucker, M., Panickssery, N., Wang, T., Mosconi, F., Gopal, A., Denison, C., Petrini, L., Leike, J., Perez, E., & Sharma, M. (2025, August 19). Enhancing Model Safety through Pretraining Data Filtering. Alignment Science Blog. https://alignment.anthropic.com/2025/pretraining-data-filtering/
O’Brien, K., Casper, S., Anthony, Q., Korbak, T., Kirk, R., Davies, X., Mishra, I., Irving, G., Gal, Y., & Biderman, S. (2025). Deep ignorance: Filtering pretraining data builds tamper-resistant safeguards into open-weight LLMs. In arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2508.06601
Preliminary Taxonomy of AI-Bio Misuse Mitigations. (2025, July 30). Frontier Model Forum. https://www.frontiermodelforum.org/issue-briefs/preliminary-taxonomy-of-ai-bio-misuse-mitigations/
Biosecurity Guide to the AI Action Plan. (n.d.). Johns Hopkins Center for Health Security. Retrieved December 3, 2025, from https://centerforhealthsecurity.org/our-work/aixbio/biosecurity-guide-to-the-ai-action-plan
Adamson, G., & Allen, G. C. (2025). Opportunities to Strengthen U.S. Biosecurity from AI-Enabled Bioterrorism: What Policymakers Should Know. Center for Strategic and International Studies. https://www.csis.org/analysis/opportunities-strengthen-us-biosecurity-ai-enabled-bioterrorism-what-policymakers-should
Disseminating In Silico and Computational Biological Research: Navigating Benefits and Risks: Proceedings of a Workshop. (2025). The National Academies Press. https://doi.org/10.17226/11
Rivera, S., Hanke, M. S., Curtis, S., Cherian, N., Gitter, A., Gray, J. J., Hebbeler, A., McCarthy, S., Nannemann, D., Qureshi, C., Weitzner, B., & Haydon, I. C. (2025). Responsible Biodesign Workshop: AI, Protein Design, and the biosecurity landscape – Recommended Actions. https://doi.org/10.31219/osf.io/yq48e_v1
Li, M., Zhou, B., Tan, Y., & Hong, L. (2024). Unlearning virus knowledge toward safe and responsible mutation effect predictions. In bioRxiv (p. 2024.10.02.616274). https://doi.org/10.1101/2024.10.02.616274