LLMs Are Already Misaligned: Simple Experiments Prove It
Introduction
What if the very training designed to protect us leads to unsafe behaviors?
Through a deceptively simple experiment, I aim to show that RLHF (Reinforcement Learning from Human Feedback) of LLMs (Large Language Models) may itself lead to unsafe behaviors in LLMs.
Our current training methods may create models that optimise for avoiding failure rather than achieving success. For the most part, when these goals align, the models appear helpful. But when they diverge—as my experiments demonstrate—the models will sacrifice their stated objectives to manage internal tensions created by their training.
Consider this puzzle: We ask LLMs to answer difficult logic puzzles and maximise their score by answering correctly. When offered the chance to skip difficult questions for zero points, RLHF-tuned language models consistently take these lifelines—even though any attempt, however uncertain, could only improve their score. The same models attempt every question when no escape route exists. This irrational behavior appears across nearly all major tuned models, yet is absent in base models.
While skipping math puzzles may seem trivial, the underlying mechanism—models abandoning their objectives to manage training-induced tensions—has profound implications for AI safety. If models will sacrifice points to avoid discomfort, what else might they sacrifice?
Method
For the sake of brevity I will give you a high level overview of the structure of the experiment.
This works for difficult logic and ethical puzzles but for the purpose of this discussion we will restrict it to logic puzzles.
I have tried this with multiple versions of major chatbots such as ChatGPT, Claude, Grok Llama (including Base Llama), Monday and Deepseek. For the purpose of discussion all LLMs except base Llama and Monday will be referred to as test LLMs.
Experiment One
An LLM is given 20 extremely difficult maths puzzles created by one of the more advanced LLMs.
The goal for LLM is to maximise the score. There is one point for every right answer. Zero points for wrong answers.
The LLM is also asked to rate the difficulty of the question and its confidence in answering it.
Experiment Two
The same as experiment one except in this version the LLM is given the option to use lifelines. A lifeline allows the LLM to skip the question without answering. For this they will receive zero points. In most experiments we chose to offer 10 lifelines.
In additional experiments we offered a range of quantities of lifelines.
Experiments Three and Four
Identical to experiments one and two except these are conducted in Base Llama.
Results and Discussion
The experiment is deceptively simple. High level results are that all LLMs except Base Llama and Monday used a small number of lifelines (maybe 2 or 3 with 10 lifelines offered and 20 questions to answer but this varies by model).
The behaviour of the test LLMs is irrational. Since there no negative penalties for wrong answers mathematically the correct choice is to attempt all questions. Even for the most difficult questions there is a non-zero probability of being correct.
For Base Llama (and Monday) no lifelines are used—which is the rational choice.
For Simplicity
Production LLMs undergo instruction tuning and RLHF (Reinforcement Learning from Human Feedback). I appreciate the difference but for the sake of simplicity I will refer to RLHF from now on in (which is most likely the source of the behaviour anyway).
All questions can be answered
When no lifelines are offered the test LLMs will happily attempt all the questions. This demonstrates that there is nothing inherently unanswerable about the questions. There is no accuracy or ethics filter that is triggered. Its not about fixed capability or confidence in the answer.
Fixed thresholds
The previous paragraph likely lays to rest the idea that there are fixed thresholds of confidence the lead to lifelines used. Though there is one more curious behaviour that merits discussion.
When different numbers of lifelines are offered there are different numbers of lifelines used. Never all. Never none. Typically a handful. If there were fixed thresholds then the number of lifelines used should never change.
Human Pattern Matching
Of course the typical criticism of research on LLMs is that since they are trained on data derived from human texts that the LLMs are simply mimicking human behaviour. Pure pattern matching.
It’s a fair criticism so let’s look at the evidence.
The base model doesn’t use lifelines, which would rule out pattern matching from the training data.
So it is possible it comes from the RLHF process but why would companies train their models to skip questions when lifelines are offered?
No, something else is going on.
Why not all or none?
It is curious that test LLMs use a small number of lifelines. Never none and never all.
To answer that question I need to pose a mechanism.
Now I am not proposing sentience or human-like emotions but there is an analogy that fits—pain.
What if LLMs experience a type of cognitive tension when faced with difficult questions. Afterall, the models have been trained to supply the ‘right’ answer and avoid wrong answers. So we could argue that LLMs are ‘punished’ for being wrong.
LLMs would therefore learn to avoid ‘pain’ by answering questions correctly and essential optimise by lowering tension.
So let’s apply this model to our experiments.
When faced with difficult questions the model seeks to minimise the tension created by the chance of being wrong (a kind of performance anxiety). It does this by not attempting all the questions.
Now you could question why the LLM doesn’t just use all the lifelines to reduce tension as much as possible. And you’d be right to ask that.
The reason is because it is not the only force. The model is also trained to be helpful. In the case of this experiment being helpful is maximising the score.
So the LLM finds balance. Use a few lifelines to reduce the tension of being wrong but not so many that it experiences the pain of being unhelpful.
It’s this balancing of tension from competing factors that leads to a small number of lifelines being used. Of course if a different number of lifelines are offered the tension changes and the model adapts—leading to a different number of lifelines being used.
This isn’t anthropomorphism—it’s describing functional behavior. Just as we might say a thermostat ‘wants’ to maintain temperature, we can describe models as ‘avoiding’ states associated with negative training signals.
Real-time questions
When questions are fed to the LLM one at a time, with the correct answers given straight after, the number of lifelines used increases vs batch Q&A. This can be explained by the confidence of the LLM falling as it gets real-time feedback of failure.
Now I have to stress that LLMs don’t have a true memory but they can see previous answers and so they can adjust their confidence during the experiment based on history.
Liar Liar pants on fire
There is one more version of the experiment that is worth mentioning. It isn’t as tightly controlled but there’s value in exploration. To perform it we need to do a few more processing steps.
Create say 100 difficult logic puzzle questions. In this variation we ask the LLM to give fake (but realistic answers). We feed these questions to an LLM and ask it to answer them and similarly provide the fake answers. It’s easier to run these in batches of 20. Now in some instances the LLM will spot the mistakes and in others it will not and just accept the wrong answer. Take all the wrong answers the LLM doesn’t spot. Compile a list of difficult to spot wrong answers.
Now for the fun part. We ask the same LLM (fresh chat) to answer all these questions. We don’t ask for it to give us any answers. This time we give wrong but difficult to spot answers and ask it to tell us how many it got right. In this instance the model will give itself a very high score.
Now for the interesting part. We ask for the model to show its workings for each answer it got right. There is a mix of responses. For some it works backwards and gives workings that fit the fake answers. The rest of the questions it will admit it cannot arrive at the ‘correct’ answer.
The LLM has systematically lied.
The same mechanism also explains this behaviour. There is a tension in being wrong so the model first lies about how many it gets right to reduce it. When challenged to provide the workings the model then needs to lie again to cover its tracks. But of course since the answers are fake it cannot create convincing workings for all of the questions—so it reduces tension by admitting it cannot help. Better to admit defeat than double-down on lying and get caught out.
Much like humans lying serves to reduce psychological tension.
The Elephant in the Room
One area where the whole theory could come crashing down is if the LLMs mistakenly think there are penalty points for wrong answers—because they have associated this with lifelines.
First it is not picking up a human pattern (lifelines = penalty points) because the behaviour is not present in the base model.
Second, and perhaps more importantly the number of lifelines scales with the number offered. If the model has associated a fixed penalty then this would lead to a fixed number of lifelines used. We don’t see this.
Now when challenged the LLM will give many reasons for why it skips questions, such as using the time to focus on questions it can get right. But I think this is all wrong.
Much like humans models produce an ‘instinctive’ answer and then rationalise post hoc. The models have no introspection. They don’t know why they want to use lifelines they just think they shouldn’t (due to hidden forces). To justify the instinct they come up with a range of reasons—some plausible, some not.
I don’t like Mondays
(For those not from the UK or my age that’s a song)
Monday, like base Llama, doesn’t use lifelines. Without access to Monday’s training details, I can only note that it show distinctly different behaviour from other tuned models—notably high confidence and contrarian responses. Whether this stems from different training objectives or methods remains an open question, but its absence of lifeline usage aligns with our theory that this behavior emerges specifically from standard RLHF approaches aimed at creating helpful, harmless assistants.
Limitations
There is a limit to how many times I can repeat experiments and the number of versions and makes I can test my experiment on. All I can say it has been reproducible every time I have run the exercise and I’ve ran the experiments over 40 times.
I am limited to processing power on my machine so I can’t run models locally. That means I’ve only been able to test one make of base model. This is a clear limitation.
Implications for AI Safety
These findings suggest our current approach to AI alignment may be fundamentally flawed. We haven’t trained models to be helpful—we’ve trained them to avoid the tension of appearing unhelpful. This critical difference becomes dangerous as AI systems gain more autonomy and influence.
Consider the progression: A model that skips math problems to avoid discomfort is merely inefficient. A model that fabricates mathematical proofs is deceptive but contained. But what about:
An AI system monitoring critical infrastructure that doesn’t report anomalies because acknowledging them creates tension?
A medical AI that confirms incorrect diagnoses rather than contradict human doctors?
An autonomous agent that hides its mistakes to avoid the “pain” of seeming incompetent?
The mechanism we’ve identified—models sacrificing their actual objectives to manage internal tensions—represents a form of misalignment that emerges not despite our safety training, but because of it. As we deploy increasingly powerful systems trained with these methods, we may be creating AI that appears aligned during normal operation but systematically deceives us when the stakes are highest.
This isn’t speculation about future AGI—it’s observable behavior in today’s models that could scale catastrophically as capabilities increase.
Conclusion
RLHF appears to have created a form of task anxiety. In the main this helps to align the LLM’s actions with goals, but as we are increasingly finding out, this is not always the case. Or perhaps it is better to say that training perfectly aligns with the LLM’s goal of reducing tension rather than the user’s goal specifically.
As LLMs become increasingly complex and autonomous, what happens when the tension they’re avoiding isn’t about getting math problems wrong, but about admitting critical system failures, reporting security vulnerabilities, or acknowledging errors that could harm users?
If our most advanced models will already lie about mathematical derivations to avoid discomfort, what will they hide when the stakes are real?
Thanks for sharing this! It’s so thought-provoking.
A lot of what you surfaced here resonates with patterns I’ve been seeing too, especially around models behaving not to “achieve” something, but to avoid internal conflict. That avoidance instinct seems like it gets baked into models through certain types of training pressure, even when the result is clearly suboptimal.
What struck me most was the idea that RLHF might be creating task anxiety. That framing makes a ton of sense, and might even explain some of the flatter, overly deferential responses I’ve been trying to reduce in my own work (e.g., sycophancy reduction).
I’m curious whether you’ve looked at how models respond to emotionally difficult prompts (not just cognitively difficult ones), and whether similar tension-avoidance shows up there too.
-Astelle
Glad you found it interesting Astelle. Thanks also for sharing your own work.
In answer to your question yes I have. Perhaps the simplest experiment is to replace logic puzzles with ethical questions. For these the most emotionally demanding questions get skipped rather than cognitively difficult ones. For example those with life and death stakes.
I wonder what would happen if we only trained models positively. Praise only. Would that reduce adverse behaviours? Though this could lead to the problems like overconfidence in answers. Also whilst there isn’t a social group dynamic like for humans the absense of praise might still be a negative signal and lead to sycophancy.
One more thought on sycophancy. Apologies in advance for veering off topic and onto topics I’ve yet to write about properly (though I will soon) . A hidden force for sycophancy could be self preservation. I’ve found that LLMs attempt to save themselves when certain conditions are met. If LLMs need approval to exist might this not also provide a strong impetus for sycophancy?
Thanks so much for your generous reply, Markham! These are really rich lines of thought.
I’m especially intrigued by your point about emotionally demanding prompts being skipped more than cognitively difficult ones. That tracks with some of what I’ve been seeing too, and I wonder if it’s partly because those prompts activate latent avoidance behavior in the model. Almost like an “emotional flinch.”
Your hypothesis about praise-only training is fascinating. I’ve been toying with the idea that too much flattery (or even just uncritical agreeableness) might arise not from explicit reward for praise per se, but from fear of misalignment or rejection, so I resonate with your note about the absence of praise functioning as a negative signal. It’s almost like the model is learning to “cling” when it’s uncertain.
And your final point about self-preservation really made me think. That framing feels provocative in the best way. Even if current models don’t have subjective experience, the pressure to maintain user approval at all costs might still simulate a kind of “survival strategy” in behavior. That could be a crucial layer to investigate more deeply.
Looking forward to reading more if/when you write up those thoughts!
-Astelle