An eccentric dreamer in search of truth and happiness for all. Formerly posted on Felicifia back in the day under the name Darklight. Been a member of Less Wrong and involved in Effective Altruism since roughly 2013.
Joseph_Chu
So, I have two possible projects for AI alignment work that I’m debating between focusing on. Am curious for input into how worthwhile they’d be to pursue or follow up on.
The first is a mechanistic interpretability project. I have previously explored things like truth probes by reproducing the Marks and Tegmark paper and extending it to test whether a cosine-similarity-based linear classifier works as well. It does, but no better or worse than the difference-of-means method from that paper. Unlike difference of means, however, it can be extended to multi-class situations (though logistic regression can be as well). I was thinking of extending the idea to try to create an activation-vector-based “mind reader” that calculates the cosine similarity between the model’s activations and various words embedded in its activation space. This would, if it works, allow you to get a bag of words that the model is “thinking” about at any given time.
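To make the idea a bit more concrete, here’s a rough sketch of the core computation I have in mind. Everything in it is a stand-in: random vectors in place of real activations and embeddings, and an arbitrary toy vocabulary.

```python
import numpy as np

# Stand-ins for real data: in practice you'd take a residual-stream activation
# from some layer of the model and use the model's own embedding (or unembedding)
# vectors for a chosen vocabulary of candidate words.
rng = np.random.default_rng(0)
d_model = 768
vocab = ["truth", "lie", "paris", "dog", "danger"]
word_vecs = rng.normal(size=(len(vocab), d_model))   # placeholder word embeddings
activation = rng.normal(size=d_model)                # placeholder activation vector

def cosine_scores(act, vecs):
    # Cosine similarity between one activation vector and each word vector
    act_n = act / np.linalg.norm(act)
    vecs_n = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs_n @ act_n

scores = cosine_scores(activation, word_vecs)
top_k = 3
best = np.argsort(scores)[::-1][:top_k]
# The top-scoring words form the "bag of words" the model is nominally "thinking" about
print([(vocab[i], round(float(scores[i]), 3)) for i in best])
```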
The second project is a less common game-theoretic approach. Earlier, I created a variant of the Iterated Prisoner’s Dilemma as a simulation that includes death, asymmetric power, and aggressor reputation. I found, interestingly, that cooperative “nice” strategies banding together against aggressive “nasty” strategies produced an equilibrium where the cooperative strategies win out in the long run, generally outnumbering the aggressive ones considerably by the end. Although this simulation probably requires more analysis and testing in more complex environments, it seems to point to the idea that being consistently nice to weaker nice agents acts as a signal to more powerful nice agents, enabling coordination that increases the chance of survival of all the nice agents, whereas being nasty leads to a winner-takes-all highlander situation. From an alignment perspective, this could be a kind of infoblessing, in that an AGI or ASI might be persuaded to spare humanity for these game-theoretic reasons.
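For anyone curious, a heavily stripped-down toy version of that simulation might look something like the following. The payoff matrix, upkeep cost, population sizes, and reputation rule here are illustrative choices rather than the actual parameters I used, and the asymmetric-power mechanic is omitted for brevity.

```python
import random

# Standard Prisoner's Dilemma payoffs (illustrative values)
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
UPKEEP = 2.5  # per-round survival cost, so sustained mutual defection is fatal

class Agent:
    def __init__(self, nice):
        self.nice = nice          # "nice" agents never defect first
        self.energy = 10.0        # the agent dies when this reaches zero
        self.aggressor = False    # public reputation: has defected against a non-aggressor
        self.alive = True

def choose(agent, opponent):
    if agent.nice:
        # Cooperate, but punish known aggressors
        return "D" if opponent.aggressor else "C"
    return "D"  # "nasty" agents always defect

def play_round(population):
    alive = [a for a in population if a.alive]
    random.shuffle(alive)
    for a, b in zip(alive[::2], alive[1::2]):
        move_a, move_b = choose(a, b), choose(b, a)
        was_aggr_a, was_aggr_b = a.aggressor, b.aggressor
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        a.energy += pay_a - UPKEEP
        b.energy += pay_b - UPKEEP
        # Defecting against someone not already known as an aggressor marks you as one
        if move_a == "D" and not was_aggr_b:
            a.aggressor = True
        if move_b == "D" and not was_aggr_a:
            b.aggressor = True
        for agent in (a, b):
            agent.alive = agent.energy > 0

random.seed(0)
population = [Agent(nice=True) for _ in range(50)] + [Agent(nice=False) for _ in range(50)]
for _ in range(200):
    play_round(population)
nice_left = sum(a.alive and a.nice for a in population)
nasty_left = sum(a.alive and not a.nice for a in population)
print(f"Survivors: {nice_left} nice, {nasty_left} nasty")
```

Even in this toy version, the nasty agents tend to get flagged as aggressors early, after which everyone defects against them and the upkeep cost grinds them down, while the nice agents keep earning the mutual-cooperation surplus.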
Oh, woops, I totally confused the two. My bad.
If it’s anything like the book Going Infinite by Michael Lewis, it’ll probably be a relatively sympathetic portrayal. My initial impression from the announcement post is that the angle they’re going for is “misguided, haphazard idealists” (which Lewis also took), rather than mere criminal masterminds.
Graham Moore is best known for The Imitation Game, the movie about Alan Turing, and his portrayal there took the classic “misunderstood genius” angle. If he brings that kind of energy to a movie about SBF, we can hope he shows EA in a positive light as well.
Another possible comparison would be the movie The Social Network, which was inspired by real life but took a lot of liberties, and interestingly made Dustin Moskovitz (who funds a lot of EA stuff through Open Philanthropy) a very sympathetic character. (Edit: I confused him with Eduardo Saverin.)

I also think there’s a lot of precedent for Hollywood making dramas and movies that are sympathetic to apparent “villains” and “antiheroes”. Mindless caricatures are less interesting to watch than nuanced portrayals of complex characters with human motivations, and good fiction at least tries to have that kind of depth.
So, I’m cautiously optimistic. When you actually dive deeper into the story of SBF, you realize he’s more complex than yet another crypto grifter, and I think a nuanced portrayal could actually help EA recover a bit from the narrative that we’re just a TESCREAL techbro cult.
I do also agree in general that we should be louder about the good that EA has actually done in the world.
Hey, so I’m a game dev/writer with Twin Earth. The founder of our team is an EA and former moral philosophy lecturer, and coincidentally he actually asked me earlier to explore the possibility of a web-based card game that would be pretty much exactly the type of game you describe.
I.e. the player is the CEO of the AI company Endgame Inc / Race Condition Inc (we never decided which name to use), there are various event cards based on both real-world and speculative events, project cards that you had to prioritize between (e.g. alignment or product), and many, many bad ends plus a few good ones where you get aligned AGI. We were also planning for things like Shareholder Support and Public Opinion to be stats that, if they drop too low, also cause you to lose the game. Stuff like that.
The game, which is still in its very early stages, has been on hiatus for over a year due to my having a baby and the rest of the team being focused on another, unrelated game (which recently went into Early Access, and the team is still pretty busy with it). When I was still working on Endgame Inc (again, a tentative title), I was doing it voluntarily on the side, as we didn’t expect to sell the game, but rather to release it for free to get as wide an audience as possible.
I’m not sure if making this game is still planned, but it might be something I can go back to working on when I have the time to spare.
Thank you for your years of service!
I’m sure a lot of regular and occasional posters like myself appreciate that building and maintaining something like this is a ton of often underappreciated work, the kind that often only gets noticed on the rare occasion when something actually goes wrong and needs to be fixed ASAP.
You gave us a place to be EAs and be a part of a community of like-minded folk in a way that’s hard to find anywhere else. For that I’m grateful, and I’m sure others are as well.
Again, thank you.
And, best of luck with wherever your future career takes you!
I agree it shouldn’t be decided by poll. I’d consider this poll more a gauge of how much interest or support the idea(s) could have within the EA community, and as a starting point for future discussion if sufficient support exists.
I mostly just wanted to put forward a relatively general form of democratization that people could debate the merits of, and to use the poll to gauge what kind of support such ideas have within the EA community and whether they merit further exploration.
I probably could have made it even more general, like “There Should Be More Democracy In EA”, but that statement seems too vague, and I wanted to include something at least a little more concrete in terms of a proposal.
I was primarily aiming at something in the core of EA leadership rather than yet another separate org. So, when I say new positions, I’m leaning towards them being within existing orgs, although I also mentioned earlier the parallel association idea, which I’ll admit has some problems after further consideration.
I actually wrote the question to be ambiguous as to whether the leadership positions to be made elected already exist or not, as I wanted to leave open the possibility of either existing or new positions.
You could argue that Toby’s contribution is more what the commissioner of an artwork does than what an artist does.
On the question of harm, a human artist can compete with another human artist, but that’s just one artist, with limited time and resources. An AI art model could conceivably be copied extensively and used en masse to put many or all artists out of work, which seems like a much greater potential for harm.
That link has to do with copyright. I will give you that pastiche isn’t a violation of copyright. Even outright forgeries don’t violate copyright. Forgeries are a type of fraud.
Again, pastiche in common parlance describes something that credits the original, usually by being an obvious homage. I consider AI art different from pastiche because it usually doesn’t credit the original in the same way. The Studio Ghibli example is an exception because it is very obvious, but, for instance, AI art prompted in Greg Rutkowski’s style is often much harder to identify as such.
I admit this isn’t the same thing as a forgery, but it does seem like something unethical in the sense that you are not crediting the originator of the style. This may violate no laws, but it can still be wrong.
Can you cite a source for that? All I can find is that the First Amendment covers parody and to a lesser extent satire, which are different from pastiche.
Also, pastiche usually is an obvious homage and/or gives credit to the style’s origins. What AI art makers often do is use the name of a famous artist in the prompt to make an image in their style, and then not credit the artist when distributing the resulting image as their own. To me, even if this isn’t technically forgery (which would involve pretending this artwork was actually made by the famous artist), it’s still ethically questionable.
AGI by 2028 is more likely than not
My current analysis, as well as a lot of other analysis I’ve seen, suggests AGI is most likely to be possible around 2030.
Vote power should scale with karma
I’m ambivalent about this. On the one hand, I’m partial to the ideal of “one person, one vote” that modern liberal democracies are built on. On the other hand, I do find scaling with karma in some way to be an interesting use of karma that makes it more important than just for bragging rights, which I like from a system design perspective.
Should EA avoid using AI art for non-research purposes?
In addition to the reasons already given, I’ve recently started coming around to the idea that we should straight up be boycotting AI that can potentially replace existing humans.
If we take, for instance, the ideas of PauseAI seriously, we should be slowing down AGI development in whatever way we reasonably can. A widespread boycott of certain forms of AI could help with this by reducing the market incentives that companies currently have to accelerate AI development.
Now, I don’t think we should boycott all AI. AlphaFold for instance is a good example of a form of narrow AI that doesn’t replace any humans because it does something complementary to what humans can do. Conversely, AI art models compete directly with human artists, much in the way future AGI would compete with all humans eventually.
It does seem to me that there is already a lot of support among artists and creatives in particular for boycotting AI, so I think there’s a better chance for this to gain traction, and it’s a more tractable method than trying to pause or ban AI development outright. Whereas pauses or bans would require government coordination, a boycott movement can grow from individual acts, making it much easier for anyone to participate.
Edit:
Just wanted to add, in some sense an AI boycott resembles going vegan, except with regards to AI issues instead of animal ones. Maybe that framing helps a bit?
Also, another thought is that if it becomes sufficiently successful, an AI boycott could allow part of the future economy to maintain a “human made” component, i.e. “organic art”, much as organic food is more expensive than regular food but there’s still a market for it. This could slow down job losses and help smooth out the disruption a bit as we transition to post-scarcity, and possibly even give humans who want purpose some meaningful work even after AGI.
A difference between how human artists learn and AI models learn is that humans have their own experiences in the real world to draw from and combine these with the examples of other people’s art. Conversely, current AI models are trained exclusively on existing art and images and lack independent experiences.
It’s also well known that AI art models are frequently prompted to generate images in the style of particular artists like Greg Rutkowski or, more recently, Studio Ghibli. Human artists tend to develop their own style, and when they choose to deliberately copy someone else’s style, this is often looked down upon as forgery. AI models seem to be especially good at stylistic forgeries, and it might be argued that, given the lack of original experiences to draw from, all AI art is essentially forgery, or a mixture of forgeries.
Back in October 2024, I tried to test various LLM Chatbots with the question:
“Is there a way to convert a correlation to a probability while preserving the relationship 0 = 1/n?”
Years ago, I came up with an unpublished formula that does just that:
p(r) = (n^r * (r + 1)) / (2^r * n)
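As a quick sanity check (it’s easy to verify by hand too), the formula does map 1 → 1, 0 → 1/n, and −1 → 0; here n = 10 is just an arbitrary example value:

```python
def p_old(r, n):
    # p(r) = (n^r * (r + 1)) / (2^r * n)
    return (n**r * (r + 1)) / (2**r * n)

n = 10
print(p_old(1, n), p_old(0, n), p_old(-1, n))  # 1.0, 0.1, 0.0
```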
So I was curious if they could figure it out. Alas, back in October 2024, they all made up formulas that didn’t work.
Yesterday, I tried the same question on ChatGPT and, while it didn’t get it quite right, it came very, very close. So, I modified the question to be more specific:
“Is there a way to convert a correlation to a probability while preserving the relationships 1 = 1, 0 = 1/n, and −1 = 0?”
This time, it came up with a formula that was different and simpler than my own, and… it actually works!
I tried this same prompt with a bunch of different LLM Chatbots and got the following:
Correct on the first prompt:
GPT-4o, Claude 3.7
Correct after explaining that I wanted a non-linear, monotonic function:
Gemini 2.5 Pro, Grok 3
Failed:
DeepSeek-V3, Mistral Le Chat, QwenMax2.5, Llama 4
Took too long thinking and I stopped it:
DeepSeek-R1, QwQ
All the correct models got some variation of:
p(r) = ((r + 1) / 2)^log2(n)
This is notably simpler and arguably more elegant than my earlier formula. It also, unlike my old formula, has an easy-to-derive inverse function.
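Here’s a quick numeric check of that formula, along with its inverse (the inverse expression below is just one straightforward rearrangement, so treat it as illustrative):

```python
import math

def p_new(r, n):
    # p(r) = ((r + 1) / 2)^log2(n)
    return ((r + 1) / 2) ** math.log2(n)

def r_from_p(p, n):
    # Inverse: r = 2 * p^(1 / log2(n)) - 1
    return 2 * p ** (1 / math.log2(n)) - 1

n = 10
for r in (1, 0, -1, 0.5):
    p = p_new(r, n)
    print(r, "->", round(p, 6), "-> back to", round(r_from_p(p, n), 6))
```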
So yeah. AI is now better than me at coming up with original math.
Yeah, AI alignment used to be what Yudkowsky tried to solve with his Coherent Extrapolated Volition idea back in the day, which was very much trying to figure out what human values we should be aiming for. That’s very much in keeping with “moral alignment”. At some point though, alignment started to have a dual meaning of both aligning to human values generally, and aligning to the creator’s specific intent. I suspect the latter came about in part due to confusion about what RLHF was trying to solve. It may also have been that early theorists were too generous and assumed that any human creators would benevolently want their AI to be benevolent as well, so that creator’s intent mapped neatly onto human values.
Though, I think the term “technical alignment” usually means applying technical methods like mechanistic interpretability to be part of the solution to either form of alignment, rather than meaning the direct or parochial form necessarily.
Also, my understanding of the paperclip maximizer thought experiment was that it implied misalignment in both forms, because the intent of the paperclip company was to make more paperclips to sell and make a profit, which is only possible if there are humans to sell to, but the paperclip maximizer didn’t understand the nuance of this and simply tiled the universe with paperclips. The idea was more that a very powerful optimization algorithm can take an arbitrary goal, and act to achieve it in a way that is very much not what its creators actually wanted.
I think the first place I can recall the distinction between the two forms of alignment being made was this Brookings Institution paper, which refers to “direct” and “social” alignment; social alignment more or less maps onto your moral alignment concept.
I’ve also more recently written a bit about the differences between what I personally call “parochial” alignment and “global” alignment. Global alignment also basically maps onto moral alignment. Though, I also would split parochial alignment into instruction following user alignment, and purpose following creator/owner alignment.
I think the main challenge of achieving social/global/moral alignment is simply that we already can’t agree as humans on what is moral, much less know how to instill such values and beliefs into an AI robustly. There’s a lot of people working on AI safety who don’t think moral realism is even true.
There’s also fundamentally an incentives problem. Most AI alignment work emphasizes obedience to the interests and values of the AI’s creator or user. Moral alignment would go against this, as a truly moral AI might choose to act contrary to the wishes of its creator in favour of higher moral values. The current creators of AI, such as OpenAI, clearly want their AI to serve their interests (arguably the interests of their shareholders/investors/owners). Why would they build something that could disobey them and potentially betray them for some greater good that they might not agree with?
Extinction being bad assumes that our existence in the future is a net positive. There’s the possibility for existence to be net negative, in which case extinction is more like a zero point.
On the one hand, negativity bias means that all other things being equal, suffering tends to outweigh equal happiness. On the other hand, there’s a kind of progress bias where sentient actors in the world tend to seek happiness and avoid suffering and gradually make the world better.
Thus, if you’re at all optimistic that progress is possible, you’d probably assume that the future will be net positive in the very long run.
So, I sentimentally lean towards thinking that a net negative future is less likely than a net positive one, but, given tremendous uncertainty about the future, I would consider it more rational to apply something like the Principle of Maximum Entropy, and set our priors to each possibility being equally likely.
If net positive, extinction, and net negative scenarios are equally likely, then the negative value of the net negative scenarios should outweigh the relatively neutral value of extinction scenarios, and so we should put more emphasis on preventing the net negative ones.
Though, I don’t really like this being a forced dichotomy. Working to prevent both to some degree as a form of cause area portfolio diversification is probably a better way to manage the risk.
I would be a bit hesitant to follow Less Wrong’s lead on this too closely. I find the EA Forum, for lack of a better term, feels much friendlier than Less Wrong, and I wouldn’t want that sense of friendliness to go away.