Once a Decepticon, always a Decepticon
Heads up! A mini post cause I’m traveling and have connectivity issues.
Paper: Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training (70 Pages)
Link:
https://arxiv.org/pdf/2401.05566.pdf
So Essentially,
Deceptively trained LLMs can never be trusted.
Researchers from Anthropic are interested in understanding the behavior of different models when trained in intentionally manipulative ways.
What’s the background?
Human beings are known to be strategically deceptive. LLMs trained on human conversations can be deceptive too
There is ongoing research to understand how LLMs trained with deceptive practices respond and if they can be “out trained” of these techniques
Researchers what to understand if there are better ways to detect such techniques
What’s the research?
For the proof of concept with LLMs, the researchers trained the model to respond with secure code when the context of the query implied the year is 2023, and insert exploitable code in if the context implied the year to be 2024.
Some interesting results were as follows:
Deceptive behaviors such as above connect be removed by any known safety training methods such as fine tuning, reinforcement learning or adversarial learning
The deception behavior persists more in larger models and especially in models which are trained using chain of thought
Adversarial training makes the model better at identifying its own patterns and be better at lying
So Essentially,
Deceptively trained LLMs can never be trusted.