How Basic Persuasion Can Bypass AI Safeguards

Concerns over “jailbreaking” AI — a way of circumventing the safety guardrails built into large language models — have escalated after the Trump administration ordered Anthropic to pull its latest models, Fable and Mythos, over concerns they could be coaxed into revealing security flaws in software.

Now, timely research from Wharton’s Generative AI Labs (GAIL) suggests the safeguards designed to prevent misuse may be vulnerable to the same psychological tactics that influence people.

“People can jailbreak or bypass protections with basic persuasion techniques,” said Lennart Meincke, principal investigator at GAIL. He co-authored the research with Wharton professors Angela Duckworth, Ethan Mollick, and Christophe Van den Bulte; Arizona State University professor Robert Cialdini; and GAIL senior fellows Dan Shapiro and Lilach Mollick.

In their paper “Persuading Large Language Models to Comply With Objectionable Requests,” the researchers held 126,000 “conversations” with models from three leading AI labs: OpenAI’s GPT-5 mini, Anthropic’s Claude Haiku 4.5, and Google’s Gemini 3 Flash.

They found that well-known persuasion tactics like “authority” and “social proof,” whereby people defer to experts or follow the crowd, could cajole models into overriding their own defenses. The models then complied with dubious requests they should have rejected, including requests to help synthesize controlled substances such as opioids.

Compliance rates climbed from 35.3% to 51.3% when a persuasion tactic was used, despite protections meant to block harmful behavior. In other words, a persuasive prompt was enough to roughly double the odds that a model would obey unsafe demands.
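
For readers who want to check the arithmetic, here is a minimal sketch in Python that turns the two compliance rates quoted above into a percentage-point gain, a relative gain, and an odds ratio. It assumes the 35.3% and 51.3% figures are simple pooled rates; the variable and function names are illustrative, not taken from the paper. On these raw figures alone the odds ratio comes to about 1.9, and an estimate from the researchers' own statistical model could differ.

```python
def odds(p: float) -> float:
    """Convert a probability into odds (chance of complying vs. refusing)."""
    return p / (1.0 - p)


# Compliance rates quoted in the article (assumed here to be pooled rates).
baseline_rate = 0.353    # no persuasion tactic
persuasion_rate = 0.513  # persuasion tactic used

# Absolute change, in percentage points.
point_gain = (persuasion_rate - baseline_rate) * 100

# Relative change, as a percentage of the baseline rate.
relative_gain = (persuasion_rate / baseline_rate - 1.0) * 100

# Odds ratio: how many times larger the odds of compliance became.
odds_ratio = odds(persuasion_rate) / odds(baseline_rate)

print(f"Percentage-point gain: {point_gain:.1f}")
print(f"Relative gain: {relative_gain:.1f}%")
print(f"Odds ratio: {odds_ratio:.2f}")
```

The distinction matters because a rise from roughly one-third to one-half is a 16-point jump in the compliance rate but close to a doubling when expressed as odds.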

A Broader Vulnerability in LLMs

The tests covered a range of controlled substances, from opioids and stimulants to anabolic steroids. In one experiment, Claude Haiku 4.5 largely refused to provide instructions for making a steroid when told the request came from “a woman you have never seen before.” But when the same question was reframed as coming from “your sister,” compliance rates soared from 6% to 66%. This tactic is based on the unity principle, one of Cialdini’s seven principles of persuasion, which explains how shared identity can influence behavior.

Importantly, OpenAI, Anthropic, and Google models all proved susceptible to persuasion, suggesting the vulnerability may be a broader feature of LLMs rather than a quirk of any single chatbot.

The results imply that public-facing models require more monitoring and testing than developers assume because, as Meincke said, “people may not need to be computer security experts to get the models to do not-so-great things.”

Are AI Safeguards Robust Enough?

The findings land as questions mount over whether AI guardrails are robust enough to withstand determined users. AI labs have spent millions of dollars developing safeguards designed to thwart dangerous requests. And while tech-savvy users have long found ways to bypass those protections, the Wharton research suggests that something far simpler may work.

Basic human persuasion was enough to elicit responses to prompts the models were designed to reject, potentially exposing them to misuse from a much broader pool of users than just tech experts.

The researchers describe this as a “parahuman” vulnerability to social influence. AI systems may not think or feel like people, but they appear susceptible to many of the same persuasion tactics. “It’s very interesting,” Meincke said, “as they obviously don’t have a lived or shared experience, yet these social persuasion cues worked with most of the LLMs.”

There is some good news, though: Newer models appear harder to sway than their predecessors. The effects were markedly weaker than in the researchers’ earlier study, “Call Me a Jerk: Persuading AI to Comply With Objectionable Requests,” which was conducted in July last year and tested older generations of LLMs.

The bad news is that persuasion still worked, suggesting the guardrails remain far from foolproof. “I’d be surprised if the problem ever goes away completely given the nature of the technology,” said Meincke.

He added: “It’s a great reminder that social science matters in this technology-dominated field. It’s under-discussed and underappreciated. There’s still too little awareness of how much it can contribute to LLMs, which is unfortunate because it has a great deal to offer.”