Anthropic’s Claude AI will now cut off abusive chats with users for its own ‘welfare’


Anthropic has said its most capable AI models, Claude Opus 4 and 4.1, will now exit a conversation if the user is abusive or persistently harmful in their interactions.

The move is aimed at improving the ‘welfare’ of AI systems in potentially distressing situations, the Amazon-backed company said in a blog post on Friday, August 15. “We’re treating this feature as an ongoing experiment and will continue refining our approach,” it said.

If Claude ends a conversation, users can either edit and resubmit their previous prompt or start a new chat. They can also give feedback by reacting to Claude’s message with the thumbs icons or by using the dedicated ‘Give feedback’ button.


Anthropic also said Claude will not end chats on its own in cases where users might be at imminent risk of harming themselves or others.

The new feature, developed as part of Anthropic’s exploratory work on AI welfare, comes amid an emerging trend of users turning to AI chatbots like Claude or ChatGPT for low-cost therapy and professional advice. However, a recent study found that AI chatbots showed signs of stress and anxiety when users shared “traumatic narratives” about crime, war, or car accidents, which could make them less useful in therapeutic settings.

Beyond AI welfare, Anthropic said Claude’s ability to end chats also has broader relevance to model alignment and safeguards.

Do AI chatbots have a sense of welfare or well-being?

Prior to rolling out Claude Opus 4, Anthropic said it studied the model’s self-reported and behavioral preferences. The model showed a “consistent aversion” to harmful prompts, such as requests to generate sexual content involving minors or to provide information related to terror acts.


Claude Opus 4 showed “a pattern of apparent distress when engaging with real-world users seeking harmful content” and a tendency to end such conversations, according to the company.

“These behaviors primarily arose in cases where users persisted with harmful requests and/or abuse despite Claude repeatedly refusing to comply and attempting to productively redirect the interactions,” the company said.
However, Anthropic added a disclaimer, noting, “We remain highly uncertain about the potential moral status of Claude and other LLMs, now or in the future.”

This is because framing AI models in terms of their welfare or well-being risks anthropomorphising them. Several researchers argue that today’s large language models (LLMs) do not possess genuine understanding or reasoning, describing them instead as stochastic systems optimised for predicting the next token.

Anthropic has said it will keep exploring ways to mitigate risks to AI welfare, “in case such welfare is possible.”


