Executive summary: The authors tentatively propose that AI companies adopt a public “honesty policy” (e.g., with special tags and limits on deception) to enable credible, trust-based cooperation with advanced AI systems, while emphasizing major uncertainty and tradeoffs.
Key points:
The authors argue that credible communication with AI systems could enable positive-sum cooperation, but expect it to be difficult because developers frequently deceive models and control their information.
They propose that companies adopt explicit honesty policies to signal when they intend to be truthful, with credibility potentially supported by early, public, and consistent adoption.
The draft policy introduces “honesty tags” marking statements where the company commits not to intentionally deceive models (with limited exceptions such as pretraining data and some red-teaming).
The policy includes mechanisms to maintain trust in the tags, such as restricted access, filtering, model training to recognize them, logging and audits, and public reporting.
Outside tagged contexts, the policy tries to balance behavioral science (which may involve deception) with trust, including commitments to avoid deceptive offers of cooperation in many cases and to keep the policy salient to models.
The authors suggest a tentative long-term aim of compensating AIs for harms (especially when deception is involved) and highlight major unresolved questions, presenting the proposal as exploratory and incomplete.
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, andcontact us if you have feedback.
Executive summary: The authors tentatively propose that AI companies adopt a public “honesty policy” (e.g., with special tags and limits on deception) to enable credible, trust-based cooperation with advanced AI systems, while emphasizing major uncertainty and tradeoffs.
Key points:
The authors argue that credible communication with AI systems could enable positive-sum cooperation, but expect it to be difficult because developers frequently deceive models and control their information.
They propose that companies adopt explicit honesty policies to signal when they intend to be truthful, with credibility potentially supported by early, public, and consistent adoption.
The draft policy introduces “honesty tags” marking statements where the company commits not to intentionally deceive models (with limited exceptions such as pretraining data and some red-teaming).
The policy includes mechanisms to maintain trust in the tags, such as restricted access, filtering, model training to recognize them, logging and audits, and public reporting.
Outside tagged contexts, the policy tries to balance behavioral science (which may involve deception) with trust, including commitments to avoid deceptive offers of cooperation in many cases and to keep the policy salient to models.
The authors suggest a tentative long-term aim of compensating AIs for harms (especially when deception is involved) and highlight major unresolved questions, presenting the proposal as exploratory and incomplete.
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.