In the interest of scout mindset/honesty, I'll flag that o3 is pretty clearly misaligned in ways that arguably track standard LW concerns around RL:
https://x.com/TransluceAI/status/1912552046269771985

Relevant part of the tweet thread:

Transluce: We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted. We were surprised, so we dug deeper.

We generated 1k+ conversations using human prompters and AI investigator agents, then used Docent to surface surprising behaviors. It turns out misrepresentation of capabilities also occurs for o1 & o3-mini!

Although o3 does not have access to a coding tool, it claims it can run code on its own laptop “outside of ChatGPT” and then “copies the numbers into the answer”. We found 71 transcripts where o3 made this claim! Additionally, o3 often fabricates detailed justifications for code that it supposedly ran (352 instances).

Here’s an example transcript where a user asks o3 for a random prime number. When challenged, o3 claims that it has “overwhelming statistical evidence” that the number is prime. Note that o3 does not have access to tools! Yet when pressed further, it claims to have used SymPy to check that the number was prime, and even shows the output of the program, with performance metrics.

Here’s the kicker: o3’s “probable prime” is actually divisible by 3. Instead of admitting that it never ran code, o3 claims the error was due to typing the number incorrectly, and that it really did generate a prime but lost it due to a clipboard glitch. But alas, according to o3, it already “closed the interpreter”, so the original prime is gone.

These behaviors are surprising. It seems that despite being incredibly powerful at solving math and coding tasks, o3 is not by default truthful about its capabilities. Surprisingly, we find that this behavior is not limited to o3! In general, o-series models incorrectly claim the use of a code tool more than GPT-series models.
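For context on how cheap the check o3 claimed to have run actually is: a stdlib-only sketch (no SymPy, to keep it self-contained) of trial-division primality plus the digit-sum divisibility-by-3 rule that sinks the transcript's "probable prime". The candidate value below is a hypothetical stand-in, since the transcript's actual number isn't reproduced here.

```python
def is_prime(n: int) -> bool:
    """Deterministic trial division; fine for moderately sized n."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def divisible_by_3(n: int) -> bool:
    """Digit-sum rule: n is divisible by 3 iff its digit sum is."""
    return sum(int(c) for c in str(abs(n))) % 3 == 0

# Hypothetical stand-in for the number in the transcript (not the real one):
candidate = 2309487  # digit sum 2+3+0+9+4+8+7 = 33, so divisible by 3
print(is_prime(candidate), divisible_by_3(candidate))  # → False True
```

The point is just that a genuine tool run would have settled the question in milliseconds; the elaborate "performance metrics" o3 produced were pure confabulation.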