Why You Shouldn't Ask Chatbots to Act Like an Expert

Many professionals try to sharpen results from artificial intelligence with prompts like “act like a world-class Python developer.” But new research from Wharton’s Generative AI Labs (GAIL) suggests the tactic does little to help.

Testing six large language models on graduate-level questions in science, engineering, and law, the researchers found that so-called “expert personas” delivered no consistent boost in accuracy and in some cases worsened the results.

The pattern held across most models, suggesting organizations may get more value from how they use AI than from how they prompt it.

That cuts against a widely promoted approach. Guidance from leading chatbot makers — including OpenAI, Google, and Anthropic — often encourages users to assign roles, with prompts that cast the model as a “math teacher” or “tax expert.”

The same idea is increasingly showing up in business, with Meta developing an AI version of its boss Mark Zuckerberg to interact with employees in his stead. That reflects a broader assumption that framing the system as an expert will improve results.

But the evidence here suggests otherwise. “When LLMs first came out people assumed personas would really help, but it matters less today,” says Lennart Meincke, a research fellow at Wharton’s Mack Institute for Innovation Management who co-authored the paper.

The study is the fourth in Wharton’s “Prompting Science” series examining how different prompting techniques shape AI performance. It is co-authored by Savir Basil, Ina Shapiro, GAIL senior fellows Dan Shapiro and Lilach Mollick, and management professor Ethan Mollick.

Why Expert Personas Aren’t Reliable

The researchers tested several ways of instructing AI to answer nearly 200 PhD-level questions in one test and a further 300 similarly demanding ones in another. Some prompts framed the model as a subject matter expert, others as a different kind of expert, or as a child or layperson. Across all of these framings, the results were consistent.

Expert personas did not lift performance and in most cases were no better than a simple baseline with no persona at all, while less knowledgeable roles often hurt accuracy.

Any gains were small and tied to specific models, not a general pattern, and even matching the persona to the task — using a “physics expert” for physics questions, for example — made little difference.

“AI is not super reliable. Ask the same hard question 25 or 30 times and you only get the right answer a few times,” says Meincke. That variability helps explain why prompt tweaks alone often fail to deliver consistent gains.

The researchers held other factors constant and focused on the impact of prompting alone. Perhaps unsurprisingly, asking the model to respond like a “toddler” cut accuracy in four of the six models, and assigning it the wrong kind of expert role also sometimes hurt performance.
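
For readers who want to try a comparison like this themselves, the idea of asking each question many times under different personas can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' code: the function name `accuracy_by_persona`, the persona wording, the placeholder questions, and the `stub_model` stand-in (which always answers "B" so the example runs without a real chatbot) are all assumptions made here.

```python
from typing import Callable, Dict


def accuracy_by_persona(
    ask: Callable[[str], str],
    personas: Dict[str, str],
    questions: Dict[str, str],
    trials: int = 25,
) -> Dict[str, float]:
    """Return the share of correct answers for each persona prefix.

    `ask` is any function that takes a prompt and returns the model's answer.
    Everything except the persona prefix is held constant across conditions.
    """
    scores: Dict[str, float] = {}
    for name, prefix in personas.items():
        correct = 0
        for question, expected in questions.items():
            # Repeat each question, since a model may answer differently each time.
            for _ in range(trials):
                prompt = f"{prefix}\n\n{question}".strip()
                if ask(prompt).strip() == expected:
                    correct += 1
        scores[name] = correct / (trials * len(questions))
    return scores


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a chatbot call; it always answers 'B'."""
    return "B"


# Illustrative personas and placeholder multiple-choice questions (assumptions).
personas = {
    "baseline": "",  # no persona at all
    "physics expert": "You are a physics expert.",
}
questions = {
    "Placeholder question 1. Answer with A, B, C, or D.": "B",
    "Placeholder question 2. Answer with A, B, C, or D.": "A",
}

scores = accuracy_by_persona(stub_model, personas, questions, trials=25)
print(scores)
```

To test a real model, `stub_model` would be swapped for a function that sends the prompt to that model and returns its chosen letter. Because the stub ignores the prompt, every persona scores the same here; any difference between conditions would only appear with a real model.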

A Better Way to Prompt Chatbots

The findings point to a different way of working with AI. They follow a period in which “prompt engineering” was widely seen as the route to better results, even spawning a new role focused on crafting the right instructions.

Instead, the paper suggests organizations are likely to get more value from how tasks are framed for AI, what gets fed in, and how results are checked, rather than from layering personas on top of prompts.

“For factual work, it doesn’t really matter, so don’t overcomplicate it,” says Meincke. “If you ask a lawyer or a doctor what the capital of the UK is, it’s an established fact. The role shouldn’t matter.”

There can also be downsides to deploying personas. The paper found that models sometimes declined to answer questions when given the wrong role, citing a lack of expertise. In some cases this happened in more than 10 out of 25 attempts, limiting the model’s usefulness when a broader answer is needed.

At times, models became overly cautious, declining to answer rather than risk being wrong. In other instances, assigning a role narrowed the responses, holding the LLMs back from drawing on what they already knew.

None of this rules out personas entirely, however. They still have a role to play in shaping the tone and presentation of responses. “They are still immensely useful,” notes Meincke. “You get a very different response depending on the type of work.”