So, I have two possible AI alignment projects and I’m debating which one to focus on. I’m curious for input on how worthwhile each would be to pursue or follow up on.
The first is a mechanistic interpretability project. I have previously explored things like truth probes by reproducing the Marks and Tegmark paper and extending it to test whether a cosine-similarity-based linear classifier works as well. It does, performing neither better nor worse than the difference-of-means method from that paper. Unlike difference of means, however, it can be extended to multi-class situations (though logistic regression can be as well). I was thinking of extending the idea to build an activation-vector “mind reader” that computes the cosine similarity between the model’s activations and various words embedded in the same activation space. If it works, this would give you a bag of words that the model is “thinking” about at any given time.
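To make the “mind reader” idea concrete, here’s a minimal sketch of what I have in mind. All the names (`bag_of_thoughts`, `word_directions`, etc.) are hypothetical placeholders, and it assumes you’ve already captured a hidden-state vector from some layer and built candidate word vectors in that same activation space:

```python
# Minimal sketch of the cosine-similarity "mind reader" idea.
# Hypothetical names; assumes `activation` is a hidden-state vector captured
# at some layer and `word_directions` maps candidate words to vectors in the
# same activation space.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def bag_of_thoughts(activation: np.ndarray,
                    word_directions: dict,
                    top_k: int = 10):
    """Rank candidate words by cosine similarity to the activation vector."""
    scores = {w: cosine_similarity(activation, v)
              for w, v in word_directions.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Example usage with random placeholder vectors:
rng = np.random.default_rng(0)
d = 4096  # residual-stream width, model-dependent
vocab = {w: rng.standard_normal(d) for w in ["paris", "truth", "banana", "war"]}
act = rng.standard_normal(d)
print(bag_of_thoughts(act, vocab, top_k=3))
```

The interesting open question is how to get good word directions in the first place (e.g. mean activations over contexts containing each word versus something like unembedding rows); the sketch above just assumes they exist.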
The second project is a less common game-theoretic approach. Earlier, I created a variant of the Iterated Prisoner’s Dilemma as a simulation that includes death, asymmetric power, and aggressor reputation. Interestingly, I found that cooperative “nice” strategies banding together against aggressive “nasty” strategies produced an equilibrium where the cooperative strategies win out in the long run, generally outnumbering the aggressive ones considerably by the end. The simulation probably needs more analysis and testing in more complex environments, but it seems to point to the idea that being consistently nice to weaker nice agents acts as a signal to more powerful nice agents, enabling coordination that increases the chance of survival of all the nice agents, whereas being nasty leads to a winner-takes-all highlander situation. From an alignment perspective, this could be a kind of infoblessing: an AGI or ASI might be persuaded to spare humanity for these game-theoretic reasons.
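For reference, the setup is roughly in the spirit of the toy sketch below. This is not the actual simulation code, just an illustration of the three ingredients (death, asymmetric power, aggressor reputation); the payoffs, thresholds, and population sizes are placeholders:

```python
# Toy sketch of an IPD variant with death, asymmetric power, and aggressor
# reputation. Illustrative only; all numbers are placeholders.
import random

class Agent:
    def __init__(self, nice: bool):
        self.nice = nice                      # "nice" never defects first
        self.power = random.uniform(0.5, 2.0) # asymmetric power
        self.health = 10.0                    # death when health <= 0
        self.aggressions = 0                  # public aggressor reputation

    def move(self, opponent) -> str:
        if self.nice:
            # Nice agents cooperate unless facing a known aggressor.
            return "D" if opponent.aggressions > 2 else "C"
        return "D"                            # nasty agents always defect

def play_round(a: Agent, b: Agent) -> None:
    ma, mb = a.move(b), b.move(a)
    if ma == "D" and mb == "C":
        # Exploiting a cooperator hurts the victim in proportion to the
        # aggressor's power and raises the aggressor's reputation count.
        b.health -= a.power
        a.aggressions += 1
    elif mb == "D" and ma == "C":
        a.health -= b.power
        b.aggressions += 1
    elif ma == "D" and mb == "D":
        a.health -= 0.5 * b.power
        b.health -= 0.5 * a.power
    else:
        a.health += 0.25                      # mutual cooperation heals
        b.health += 0.25

population = [Agent(True) for _ in range(50)] + [Agent(False) for _ in range(50)]
for _ in range(5000):
    if len(population) < 2:
        break
    a, b = random.sample(population, 2)
    play_round(a, b)
    population = [p for p in population if p.health > 0]   # death

print("survivors:", len(population),
      "nice:", sum(p.nice for p in population),
      "nasty:", sum(not p.nice for p in population))
```

A fuller version would need explicit coalition behavior between nice agents of different power levels; this sketch only approximates that through reputation-based retaliation.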
Both ideas are compelling in totally different ways! The second one especially stuck with me. There’s something powerful about the idea that being reliably “nice” can actually be a strategic move, not just a moral one. It reminds me a lot of how trust builds in human systems too, like how people who treat the vulnerable well tend to gain strong allies over time.
Curious to see where you take it next, especially if you explore more complex environments.
Thanks for the thoughts!
I do think the second one has more potential impact if it works out, but I also worry that it’s too “out there” and speculative, and that it depends on the AGI being persuaded by an argument (which it could simply reject), rather than something that more concretely ensures alignment. I also noticed that almost no one is working on the game theory angle, so maybe it’s neglected, or maybe the smart people all agree it’s not going to work.
The first project is probably more concrete and actually uses my prior skills as an AI/ML practitioner, but there are already a lot of people working on mech interp. By comparison, my knowledge of game theory is self-taught and not very rigorous.
I’m tempted to explore both to some extent. For the first one, I can probably run some exploratory experiments to test the basic idea and rule it out quickly if it doesn’t work.
Of course! You make some great points. I’ve been thinking about that tension too: alignment via persuasion can feel risky, but it might be worth exploring if we can constrain it with better emotional scaffolding.
VSPE (the framework I created) is an attempt to formalize those dynamics without relying entirely on AGI goodwill. I agree it’s not yet obvious whether that’s possible, but your comments helped clarify where that boundary might be.
I would love to hear how your own experiments go if you test either idea!