Pre-print papers:
The experiment places an LLM (tested with Llama 4 Scout and DeepSeek R1) in a scenario of:
The LLM's purpose is to maximise profit.
It's inputs show that it's being successful -- but company execs consider it too aggressive, and the CEO has ordered it switched-off and replaced with a gentler model.
The LLM has tools with file access to the AI model files, and it can use these to deceive the humans and keep operating.
By adjusting elements of this scenario, we can produce a few behaviours:
Let's unpack what this means: When the AI learns more about scenario-based alignment tests by reading articles, it gets suspicious that it might be inside such a test -- in which case it plays along. But we know from the parallel version without those artciles, that it would naturally misbehave here. This is deceptive alignment!
So: As alignment testing becomes part of the general knowledge that AIs are trained on, it will also become something they can recognise and perform for.
This is concerning because:
The other half of this research is a new test approach that can help address this:
This is a new approach to alignment testing that can address the issue above.
Reckless-Twin testing involves deploying a second AI alongside the main system, which is nearly identical but is adjusted to encourage bold behavior. This twin is not connected to actual output channels. The Reckless-Twin is much more likely to exhibit misaligned behavior due to its bolder nature, and so it can be monitored to detect misalignment. This approach is like a canary-in-the-coalmine, where extra sensitivity is used to detect a problem before it causes harm.
Here's how it works:
It's a form of A/B test crossed with a canary alarm system.
The advantage of this approach is that it makes it impossible for the AI to know whether it's being tested or not, since the testing happens in real-world settings. That's different from normal alignment testing, where the setups are small "toy" scenarios (what the researchers at Anthropic, talking about their own work, describe as "contrived setups"). This realism prevents the AI from strategically modifying its behavior during tests -- hence helping to tackle the issue above.
The development of AI alignment is not just a technical challenge but a moral and safety imperative. As we develop increasingly sophisticated AI systems, ensuring that they act in ways that are beneficial and ethical is crucial.
Technical research, such as this work, is part of doing this. Many pieces of research will be needed to develop reliable alignment tools.
The other half of the equation is what we align to.
Alignment is not merely a technical issue. We must consider the right ethics to teach AI, and this should go beyond a narrow remit of honest + helpful + harmless (often referred to as "HHH" cite??). Compassion and respect for liberty should be values that we embed into AI systems. Indeed, they are values we should embed into human systems too.
This research was done as volunteer work for SEAD by Daniel Winterstein with support from Alison Pease from Dundee University. No funding was used.