Instead of “Goodharting”, I like the potential names “Positive Alignment” and “Negative Alignment.”
“Positive Alignment” means that the motivated party changes their actions in ways the incentive creator likes. “Negative Alignment” means the opposite.
Whenever incentives are offered to people or agents, there are likely to be instances of both Positive Alignment and Negative Alignment, and the net effect could land on either side.
“Goodharting” is fairly vague and typically refers to just the “Negative Alignment” portion.
I’d expect this to make some discussion clearer.
“Will this new incentive be goodharted?” → “Will this incentive lead to Net-Negative Alignment?”
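To make the distinction concrete, here is a minimal sketch (the scenario, names, and numbers are hypothetical, purely for illustration): tally every behaviour change an incentive causes, count the changes the incentive creator likes as Positive Alignment and the rest as Negative Alignment, and ask whether the total is net-positive or net-negative.

```python
from dataclasses import dataclass

@dataclass
class Response:
    description: str
    value_to_creator: float  # > 0: Positive Alignment, < 0: Negative Alignment

def net_alignment(responses: list[Response]) -> float:
    """Net value, to the incentive creator, of all behaviour changes the incentive caused."""
    return sum(r.value_to_creator for r in responses)

# Hypothetical example: a support team is bonused on tickets closed per day.
responses = [
    Response("simple tickets get resolved faster", +5.0),         # Positive Alignment
    Response("hard tickets get closed prematurely", -3.0),        # Negative Alignment
    Response("single issues get split into many tickets", -4.0),  # Negative Alignment
]

total = net_alignment(responses)
print(f"Net alignment: {total:+.1f} ({'net-positive' if total > 0 else 'net-negative'})")
```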
Other Name Options
Claude 3.7 recommended other naming ideas like:
Intentional vs Perverse Responses
Convergent vs Divergent Optimization
True-Goal vs Proxy-Goal Alignment
Productive vs Counterproductive Compliance
I think the term “goodharting” is great. All you have to do is look up Goodhart’s law to understand what is being talked about: the AI is optimising for the metric you evaluated it on, rather than the thing you actually want it to do.
Your suggestions would rob the term of its specific technical meaning, which would make things much vaguer and harder to talk about.
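The dynamic Goodhart’s law points at can be shown in a few lines. In this sketch (the distributions and numbers are made up for illustration), an agent is scored on a proxy metric that correlates with the true goal but can also be gamed; the harder the agent optimises the proxy, the further the proxy score outruns the true value.

```python
import random

random.seed(0)

def candidate_action():
    true_value = random.gauss(0.0, 1.0)     # what we actually want
    metric_gaming = random.gauss(0.0, 1.0)  # ways to inflate the score itself
    proxy_score = true_value + 2.0 * metric_gaming
    return true_value, proxy_score

for pressure in (1, 10, 100, 10_000):
    # More optimisation pressure: the agent searches more candidate actions
    # and keeps the one with the best proxy score.
    true_value, proxy_score = max(
        (candidate_action() for _ in range(pressure)), key=lambda a: a[1]
    )
    print(f"candidates searched={pressure:6d}  "
          f"proxy={proxy_score:6.2f}  true value={true_value:6.2f}")
```

As the number of candidates searched grows, the proxy score climbs much faster than the true value: the selected actions are increasingly the most gamed ones rather than the most genuinely valuable ones.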