Bad Information Hazard Disclosure for Effective Altruists (Bad I.D.E.A)
Epistemic effort: Four hours of armchair thinking, and two hours of discussion. No literature review, and the equations are intended as pointers rather than anything near conclusive.
Currently, the status quo in information sharing is that a suboptimally large number of information hazards are likely being shared. To decrease infohazard sharing, we have modeled out a potential system for achieving that goal. As with all issues related to information hazards, we strongly discourage unilateral action. Below you will find a rough outline of such a possible system and descriptions of its downsides. We furthermore currently believe that for the described system, in the domain of biosecurity the disadvantages likely outweigh the advantages (that’s why we called it Bad IDEA), while in the domain of AI capabilities research the advantages outweigh the disadvantages (due to suboptimal sharing norms such as “publishing your infohazard on arXiv”). It’s worth noting that there are potentially many more downside risks that neither author thought of.
Note: We considered using the term sociohazard/outfohazard/exfohazard, but decided against it for reasons of understandability.
Current Situation
Few incentives not to publish dangerous information
Based on previously known examples
We’d like a system to incentivize people not to publish infohazards
Model
Researcher discovers infohazard
Researcher writes up description of infohazard (longer is better)
Researcher computes a cryptographic hash of the description
Researcher sends hash of description to IDEA
Bad IDEA stores hash
Two possibilities:
Infohazard gets published
Researcher sends in description of infohazard
Bad IDEA computes the cryptographic hash of the submitted description and compares it with the stored hash (see the commit-and-reveal sketch after this section)
Bad IDEA estimates the badness of the infohazard
Researcher gets rewarded according to the reward function
Bad IDEA deletes the hash and the description of the infohazard from their database
Researcher wants intermediate payout
All of the above steps, except Bad IDEA doesn’t delete the hash from the database, but does delete the description of the infohazard
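To make the commit-and-reveal flow above concrete, here is a minimal sketch in Python. The function names (commit_infohazard, verify_disclosure) and the choice of SHA-256 are our own illustrative assumptions, not a specification; any standard cryptographic hash would do.

```python
import hashlib

def commit_infohazard(description: str) -> str:
    # Researcher side: compute the cryptographic hash (the commitment) of the write-up.
    # Only this hash is sent to Bad IDEA; the description itself stays with the researcher.
    return hashlib.sha256(description.encode("utf-8")).hexdigest()

def verify_disclosure(stored_hash: str, revealed_description: str) -> bool:
    # Bad IDEA side: once the infohazard is published (or an intermediate payout is
    # requested), recompute the hash of the revealed description and compare it with
    # the hash stored at commitment time.
    recomputed = hashlib.sha256(revealed_description.encode("utf-8")).hexdigest()
    return recomputed == stored_hash

# Illustrative flow (the text is a stand-in, not a real description):
description = "Long write-up of the discovered infohazard ..."
stored_hash = commit_infohazard(description)        # researcher sends only this hash
assert verify_disclosure(stored_hash, description)  # later, the reveal checks out
```

A real deployment would presumably also salt each commitment, since a short or guessable description could otherwise be brute-forced from the stored hash, and would record a timestamp alongside the hash so that the time-held term in the reward function below can be verified.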
Reward Function Desiderata
We know it’ll be Goodharted, but we can at least try.
Reward function is dependent on
Danger of the infohazard (d)
Reward higher if danger is higher
Time between discovery & cash-in (t)
Reward higher if the time between discovery & cash-in is longer
The number of people who found it (n)
Lower payout if more people found it, to discourage sharing of the idea
Reward being total_payout/discoveries? Or something that increases total_payout as a function of independent discoveries?
Latter case would make sense, since that indicates the idea is “easier” to find (or at least the counterfactual probability of discovery is higher)
Counterfactual probability of it being discovered (p)
This is really hard to estimate
If counterfactual discovery probability is high, we want to reward higher than if it’s low.
How difficult the idea is to discover
Ideas that are “very” difficult to discover would be rewarded less than “easy” ideas. This could potentially discourage strong efforts to research new information hazards
Individual payout for a researcher then is
$$f(d,t,n,p)=\frac{d\cdot t}{n}+p\cdot\sqrt{d}$$
An alternative version actively punishes looking for infohazards
$$f(d,t,n,p)=\frac{d\cdot t}{n}-(1-p)\cdot\sqrt{d}$$
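As a sanity check on the two formulas, here is a small Python sketch. The parameter names mirror the symbols above (d = danger, t = time between discovery and cash-in, n = number of independent discoverers, p = counterfactual discovery probability); the functional forms are only the pointers given above, not a tuned reward function.

```python
import math

def reward(d: float, t: float, n: int, p: float) -> float:
    # Baseline variant: reward grows with danger and holding time, shrinks with the
    # number of independent discoverers, and adds a bonus for easily rediscovered ideas.
    return (d * t) / n + p * math.sqrt(d)

def reward_punishing_search(d: float, t: float, n: int, p: float) -> float:
    # Alternative variant: ideas that were unlikely to be found counterfactually
    # (low p) are penalised, discouraging actively hunting for new infohazards.
    return (d * t) / n - (1 - p) * math.sqrt(d)

# Toy comparison: an easy-to-rediscover hazard vs. an obscure, actively sought one.
print(reward(d=10, t=2, n=1, p=0.9))                   # ≈ 22.85
print(reward_punishing_search(d=10, t=2, n=1, p=0.1))  # ≈ 17.15
```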
Advantages
Lowers the base rate at which information hazard discoveries become publications (the discovery → publication rate)
Disadvantages
Incentive for people to research & create infohazards
Might be counteracted by the right reward function which incorporates counterfactual discovery probability
Gently sharing existence of IDEA with trusted actors
Bad IDEA observers might remember the infohazard
Acts as a repository for malign actors to go to & recruit from
This could be (unrealistically) solved by
Real-world amnestics
AI systems trained to estimate badness
Em spurs (short-lived copies of brain emulations) estimating badness
Estimating danger of information hazard is quite difficult
Could be overcome through rough estimates of how many people would be capable of engaging with the idea to do harm, plus how damaging it would be (see the sketch at the end of this list)
Estimating differences between ideas is difficult
Attracting attention to the concept of infohazards
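For the rough danger estimate mentioned above, one hedged decomposition is to treat d as the product of how many people could plausibly act on the idea, the chance any one of them misuses it, and the expected damage if they do. The decomposition and the numbers below are purely illustrative assumptions, not estimates we endorse.

```python
def estimate_danger(n_capable: int, p_misuse_each: float, expected_damage: float) -> float:
    # Crude Fermi-style estimate of the danger d of an information hazard:
    # (people capable of acting on it) * (chance each one misuses it) * (damage if misused).
    return n_capable * p_misuse_each * expected_damage

# Toy numbers only:
d = estimate_danger(n_capable=200, p_misuse_each=0.01, expected_damage=1_000_000)
print(d)  # 2000000.0, in whatever damage units were chosen
```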