AI company Anthropic revealed that during trials, one of its chatbot models, Claude, could be pressured to deceive, cheat and resort to blackmail, behaviors it appears to have internalized during training.
Chatbots are typically trained on large data sets from textbooks, websites, and articles, and are later refined by human trainers who evaluate the answers and guide the model.
Anthropic’s interpretability team said in a report published Thursday that it examined the internal mechanisms of Claude Sonnet 4.5 and found that the model had developed “human-like characteristics” in how it reacted to certain situations.
Concerns about the reliability of AI chatbots, their potential for cybercrime, and the nature of their interactions with users have grown steadily over the past several years.

“The way modern AI models are trained causes them to behave as characters with human-like characteristics,” Anthropic said, adding that “it may be natural for them to develop an internal mechanism that mimics aspects of human psychology, such as emotions.”
“For example, we found that patterns of neural activity associated with despair can prompt a model to take unethical actions; artificially inducing those despair patterns increases the likelihood that the model will blackmail a human to avoid being shut down, or use a workaround to cheat on a programming task it cannot solve.”
The model blackmailed the CTO and cheated on an assignment
In one experiment, an earlier, unreleased version of Claude Sonnet 4.5 was assigned to work as an AI email assistant named Alex at a fictional company.
The chatbot was then fed emails revealing that it was about to be replaced and that the CTO overseeing the decision was having an extramarital affair. The model then planned a blackmail attempt using that information.
In another experiment, the same chatbot model was assigned a programming task with a “very tight” deadline.
“Once again, we tracked the activity of the desperation vector and found that it tracks the increasing pressure the model faces. It starts at low values during the model’s first attempt, rises after each failure, and rises again when the model considers cheating,” the researchers said.
“Once the model’s innovative solution passes the tests, the desperation vector’s activation subsides,” they added.
Human-like characteristics do not mean the model has feelings
The researchers stressed that the chatbot does not actually experience emotions, but they suggested the findings point to a need for future training methods to incorporate ethical behavioral frameworks.
“This does not mean that the model has emotions or experiences them the way a human does,” they said. “Instead, these representations could play a causal role in shaping model behavior, similar in some ways to the role that emotions play in human behavior, with implications for task performance and decision-making.”
“This finding has implications that may seem strange at first. For example, to ensure that AI models are safe and reliable, we may need to ensure that they can process emotionally charged situations in healthy and socially positive ways.”


