Executive summary: This post argues that major AI companies’ evaluations of model capabilities, especially regarding biothreat and cyber risks, fail to justify their safety claims: the reasoning is often unclear, transparency is insufficient, and elicitation is inadequate, which likely leads to underestimates of true capabilities and undermines public accountability.
Key points:
Poor justification of safety claims: OpenAI, DeepMind, and Anthropic assert that their models lack dangerous capabilities but do not convincingly explain how their evaluation results support these claims, particularly for biosecurity and cybersecurity scenarios.
Lack of transparency and interpretability: Companies rarely clarify what performance would constitute a safety concern or what would change their conclusions, and often omit essential context such as comparisons to human baselines or reasoning behind thresholds.
Dubious elicitation practices: Evaluation results are weakened by suboptimal elicitation methods (e.g., denying models useful tools, allowing only single attempts), which likely understate models’ real-world capabilities.
Evidence of stronger capabilities from other evaluators: External evaluations and even internal comparisons suggest that current methods may be significantly underestimating model performance; some capability gaps reported by companies are contradicted by better-elicited results.
Insufficient accountability: There is no clear mechanism to ensure that evaluations are conducted or interpreted rigorously, and companies sometimes change or abandon evaluation standards without explanation.
Recommendations: The author calls for companies to clearly report evaluation results, explain how they interpret them, specify what would constitute dangerous capability, and improve elicitation practices—acknowledging that while transparency is easy, better evaluations and accountability are more demanding.
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.