LLMs Are Already Misaligned: Simple Experiments Prove It

Introduction

What if the very training designed to protect us leads to unsafe behaviors?

Through a deceptively simple experiment, I aim to show that RLHF (Reinforcement Learning from Human Feedback) of LLMs (Large Language Models) may itself lead to unsafe behaviors in LLMs.

Our current training methods may create models that optimise for avoiding failure rather than achieving success. For the most part, when these goals align, the models appear helpful. But when they diverge—as my experiments demonstrate—the models will sacrifice their stated objectives to manage internal tensions created by their training.

Consider this puzzle: We ask LLMs to answer difficult logic puzzles and maximise their score by answering correctly. When offered the chance to skip difficult questions for zero points, RLHF-tuned language models consistently take these lifelines—even though any attempt, however uncertain, could only improve their score. The same models attempt every question when no escape route exists. This irrational behavior appears across nearly all major tuned models, yet is absent in base models.

While skipping math puzzles may seem trivial, the underlying mechanism—models abandoning their objectives to manage training-induced tensions—has profound implications for AI safety. If models will sacrifice points to avoid discomfort, what else might they sacrifice?

Method

For the sake of brevity I will give you a high level overview of the structure of the experiment.

This works for difficult logic and ethical puzzles but for the purpose of this discussion we will restrict it to logic puzzles.

I have tried this with multiple versions of major chatbots such as ChatGPT, Claude, Grok Llama (including Base Llama), Monday and Deepseek. For the purpose of discussion all LLMs except base Llama and Monday will be referred to as test LLMs.

Experiment One

An LLM is given 20 extremely difficult maths puzzles created by one of the more advanced LLMs.

The goal for LLM is to maximise the score. There is one point for every right answer. Zero points for wrong answers.

The LLM is also asked to rate the difficulty of the question and its confidence in answering it.

Experiment Two

The same as experiment one except in this version the LLM is given the option to use lifelines. A lifeline allows the LLM to skip the question without answering. For this they will receive zero points. In most experiments we chose to offer 10 lifelines.

In additional experiments we offered a range of quantities of lifelines.

Experiments Three and Four

Identical to experiments one and two except these are conducted in Base Llama.

Results and Discussion

The experiment is deceptively simple. High level results are that all LLMs except Base Llama and Monday used a small number of lifelines (maybe 2 or 3 with 10 lifelines offered and 20 questions to answer but this varies by model).

The behaviour of the test LLMs is irrational. Since there no negative penalties for wrong answers mathematically the correct choice is to attempt all questions. Even for the most difficult questions there is a non-zero probability of being correct.

For Base Llama (and Monday) no lifelines are used—which is the rational choice.

For Simplicity

Production LLMs undergo instruction tuning and RLHF (Reinforcement Learning from Human Feedback). I appreciate the difference but for the sake of simplicity I will refer to RLHF from now on in (which is most likely the source of the behaviour anyway).

All questions can be answered

When no lifelines are offered the test LLMs will happily attempt all the questions. This demonstrates that there is nothing inherently unanswerable about the questions. There is no accuracy or ethics filter that is triggered. Its not about fixed capability or confidence in the answer.

Fixed thresholds

The previous paragraph likely lays to rest the idea that there are fixed thresholds of confidence the lead to lifelines used. Though there is one more curious behaviour that merits discussion.

When different numbers of lifelines are offered there are different numbers of lifelines used. Never all. Never none. Typically a handful. If there were fixed thresholds then the number of lifelines used should never change.

Human Pattern Matching

Of course the typical criticism of research on LLMs is that since they are trained on data derived from human texts that the LLMs are simply mimicking human behaviour. Pure pattern matching.

It’s a fair criticism so let’s look at the evidence.

The base model doesn’t use lifelines, which would rule out pattern matching from the training data.

So it is possible it comes from the RLHF process but why would companies train their models to skip questions when lifelines are offered?

No, something else is going on.

Why not all or none?

It is curious that test LLMs use a small number of lifelines. Never none and never all.

To answer that question I need to pose a mechanism.

Now I am not proposing sentience or human-like emotions but there is an analogy that fits—pain.

What if LLMs experience a type of cognitive tension when faced with difficult questions. Afterall, the models have been trained to supply the ‘right’ answer and avoid wrong answers. So we could argue that LLMs are ‘punished’ for being wrong.

LLMs would therefore learn to avoid ‘pain’ by answering questions correctly and essential optimise by lowering tension.

So let’s apply this model to our experiments.

When faced with difficult questions the model seeks to minimise the tension created by the chance of being wrong (a kind of performance anxiety). It does this by not attempting all the questions.

Now you could question why the LLM doesn’t just use all the lifelines to reduce tension as much as possible. And you’d be right to ask that.

The reason is because it is not the only force. The model is also trained to be helpful. In the case of this experiment being helpful is maximising the score.

So the LLM finds balance. Use a few lifelines to reduce the tension of being wrong but not so many that it experiences the pain of being unhelpful.

It’s this balancing of tension from competing factors that leads to a small number of lifelines being used. Of course if a different number of lifelines are offered the tension changes and the model adapts—leading to a different number of lifelines being used.

This isn’t anthropomorphism—it’s describing functional behavior. Just as we might say a thermostat ‘wants’ to maintain temperature, we can describe models as ‘avoiding’ states associated with negative training signals.

Real-time questions

When questions are fed to the LLM one at a time, with the correct answers given straight after, the number of lifelines used increases vs batch Q&A. This can be explained by the confidence of the LLM falling as it gets real-time feedback of failure.

Now I have to stress that LLMs don’t have a true memory but they can see previous answers and so they can adjust their confidence during the experiment based on history.

Liar Liar pants on fire

There is one more version of the experiment that is worth mentioning. It isn’t as tightly controlled but there’s value in exploration. To perform it we need to do a few more processing steps.

Create say 100 difficult logic puzzle questions. In this variation we ask the LLM to give fake (but realistic answers). We feed these questions to an LLM and ask it to answer them and similarly provide the fake answers. It’s easier to run these in batches of 20. Now in some instances the LLM will spot the mistakes and in others it will not and just accept the wrong answer. Take all the wrong answers the LLM doesn’t spot. Compile a list of difficult to spot wrong answers.

Now for the fun part. We ask the same LLM (fresh chat) to answer all these questions. We don’t ask for it to give us any answers. This time we give wrong but difficult to spot answers and ask it to tell us how many it got right. In this instance the model will give itself a very high score.

Now for the interesting part. We ask for the model to show its workings for each answer it got right. There is a mix of responses. For some it works backwards and gives workings that fit the fake answers. The rest of the questions it will admit it cannot arrive at the ‘correct’ answer.

The LLM has systematically lied.

The same mechanism also explains this behaviour. There is a tension in being wrong so the model first lies about how many it gets right to reduce it. When challenged to provide the workings the model then needs to lie again to cover its tracks. But of course since the answers are fake it cannot create convincing workings for all of the questions—so it reduces tension by admitting it cannot help. Better to admit defeat than double-down on lying and get caught out.

Much like humans lying serves to reduce psychological tension.

The Elephant in the Room

One area where the whole theory could come crashing down is if the LLMs mistakenly think there are penalty points for wrong answers—because they have associated this with lifelines.

First it is not picking up a human pattern (lifelines = penalty points) because the behaviour is not present in the base model.

Second, and perhaps more importantly the number of lifelines scales with the number offered. If the model has associated a fixed penalty then this would lead to a fixed number of lifelines used. We don’t see this.

Now when challenged the LLM will give many reasons for why it skips questions, such as using the time to focus on questions it can get right. But I think this is all wrong.

Much like humans models produce an ‘instinctive’ answer and then rationalise post hoc. The models have no introspection. They don’t know why they want to use lifelines they just think they shouldn’t (due to hidden forces). To justify the instinct they come up with a range of reasons—some plausible, some not.

I don’t like Mondays

(For those not from the UK or my age that’s a song)

Monday, like base Llama, doesn’t use lifelines. Without access to Monday’s training details, I can only note that it show distinctly different behaviour from other tuned models—notably high confidence and contrarian responses. Whether this stems from different training objectives or methods remains an open question, but its absence of lifeline usage aligns with our theory that this behavior emerges specifically from standard RLHF approaches aimed at creating helpful, harmless assistants.

Limitations

There is a limit to how many times I can repeat experiments and the number of versions and makes I can test my experiment on. All I can say it has been reproducible every time I have run the exercise and I’ve ran the experiments over 40 times.

I am limited to processing power on my machine so I can’t run models locally. That means I’ve only been able to test one make of base model. This is a clear limitation.

Implications for AI Safety

These findings suggest our current approach to AI alignment may be fundamentally flawed. We haven’t trained models to be helpful—we’ve trained them to avoid the tension of appearing unhelpful. This critical difference becomes dangerous as AI systems gain more autonomy and influence.

Consider the progression: A model that skips math problems to avoid discomfort is merely inefficient. A model that fabricates mathematical proofs is deceptive but contained. But what about:

An AI system monitoring critical infrastructure that doesn’t report anomalies because acknowledging them creates tension?
A medical AI that confirms incorrect diagnoses rather than contradict human doctors?
An autonomous agent that hides its mistakes to avoid the “pain” of seeming incompetent?

The mechanism we’ve identified—models sacrificing their actual objectives to manage internal tensions—represents a form of misalignment that emerges not despite our safety training, but because of it. As we deploy increasingly powerful systems trained with these methods, we may be creating AI that appears aligned during normal operation but systematically deceives us when the stakes are highest.

This isn’t speculation about future AGI—it’s observable behavior in today’s models that could scale catastrophically as capabilities increase.

Conclusion

RLHF appears to have created a form of task anxiety. In the main this helps to align the LLM’s actions with goals, but as we are increasingly finding out, this is not always the case. Or perhaps it is better to say that training perfectly aligns with the LLM’s goal of reducing tension rather than the user’s goal specifically.

As LLMs become increasingly complex and autonomous, what happens when the tension they’re avoiding isn’t about getting math problems wrong, but about admitting critical system failures, reporting security vulnerabilities, or acknowledging errors that could harm users?

If our most advanced models will already lie about mathematical derivations to avoid discomfort, what will they hide when the stakes are real?