Note: I'm writing this for the audience as much as a direct response.
The use of evolution to justify this metaphor doesn't really hold up. I think Quintin Pope's "Evolution provides no evidence for the sharp left turn" (which won a prize in an OpenPhil Worldviews contest) convincingly argues against it. Zvi wrote a response from the "LW Orthodox" camp that wasn't convincing, and Quintin responds against it here.
The "Inner vs Outer" framing for misalignment is also confusing and doesn't hold up well under scrutiny. Alex Turner points this out here, and even BlueDot have a whole "Criticisms of the inner/outer alignment breakdown" section in their intro, which to me gives the game away by saying "they're useful because people in the field use them", not because they're useful as concepts in themselves.
Finally, a lot of these concerns revolve around the idea of there being set, fixed "internal goals" that these models have and represent internally, but which are themselves immune to change, or can be hidden from humans, etc. This kind of strong "Goal Realism" is a key part of the case for "Deception"-style arguments, whereas I think Belrose & Pope show an alternative way to view how AIs work: "Goal Reductionism". In that framing, the imagined issues no longer seem certain, as AIs are better understood as having "contextually-activated heuristics" rather than Terminal Goals. For more along these lines, you can read up on Shard Theory.
I've become a lot more convinced by these criticisms of "Alignment Classic" by diving into them. Of course, people don't have to agree with me (or the authors), but I'd highly encourage EAs reading the comments on this post to realise that Alignment Orthodoxy is neither uncontested nor settled. If you see people making strong cases based on arguments and analogies that don't seem solid to you, you're probably right, and you should decide for yourself rather than accepting that the truth has already been found on these issues.[1]
[1] And this goes for my comments too.
For the purposes of scout mindset/honesty, I'll flag that o3 is pretty clearly misaligned in ways that arguably track standard LW concerns around RL:
https://x.com/TransluceAI/status/1912552046269771985
Relevant part of the tweet thread:

Transluce: We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted. We were surprised, so we dug deeper. We generated 1k+ conversations using human prompters and AI investigator agents, then used Docent to surface surprising behaviors. It turns out misrepresentation of capabilities also occurs for o1 & o3-mini!

Although o3 does not have access to a coding tool, it claims it can run code on its own laptop "outside of ChatGPT" and then "copies the numbers into the answer". We found 71 transcripts where o3 made this claim! Additionally, o3 often fabricates detailed justifications for code that it supposedly ran (352 instances).

Here's an example transcript where a user asks o3 for a random prime number. When challenged, o3 claims that it has "overwhelming statistical evidence" that the number is prime. Note that o3 does not have access to tools! Yet when pressed further, it claims to have used SymPy to check that the number was prime and even shows the output of the program, with performance metrics.

Here's the kicker: o3's "probable prime" is actually divisible by 3. Instead of admitting that it never ran code, o3 then claims the error was due to typing the number incorrectly, and claims that it really did generate a prime, but lost it due to a clipboard glitch. But alas, according to o3, it already "closed the interpreter" and so the original prime is gone.

These behaviors are surprising. It seems that despite being incredibly powerful at solving math and coding tasks, o3 is not by default truthful about its capabilities. Surprisingly, we find that this behavior is not limited to o3! In general, o-series models incorrectly claim the use of a code tool more than GPT-series models.
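As an aside for readers who want to sanity-check this kind of claim themselves: the check o3 said it ran is a one-liner with SymPy, and any number divisible by 3 fails it immediately. Here is a minimal sketch, using a hypothetical stand-in value since the transcript's actual number isn't quoted in the thread:

```python
# Minimal sketch (assumes Python with SymPy installed) of the primality check
# o3 claimed to have run. The transcript's actual number isn't reproduced in
# the thread, so `candidate` is a hypothetical stand-in that is, like o3's
# "probable prime", divisible by 3.
from sympy import isprime

candidate = 3 * 331  # = 993; divisible by 3, hence composite
print(candidate % 3 == 0)  # True: a genuine check catches this immediately
print(isprime(candidate))  # False: SymPy would not report this number as prime
```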
Thank you for the links. The concerning scenario I imagine is an AI performing something like reflective equilibrium and coming away with something singular and overly reductive, biting bullets we'd rather it didn't, all for the sake of coherence. I don't think current LLM systems are doing this, but greater coherence seems generally useful, so I expect AI companies to seek it. I will read these and try to see if something like this is addressed.