This article was originally published by the Mack Institute.

Artificial intelligence has already proven it can perform specific medical tasks, such as interpreting X-rays or flagging risks in patient data. But caring for patients is not a series of isolated decisions. It is a dynamic process that unfolds over time, requiring clinicians to interpret signals from multiple sources and intervene as a patient’s condition changes. Stabilizing a patient may require a physician to synthesize lab values and medical images, listen to lung or heart sounds, observe physical responses, and decide when to escalate care — often under severe time pressure.

Given that complexity, how far can modern AI systems really go? More specifically, can a large language model manage an entire clinical decision-making workflow, rather than just individual tasks within it?

That question is the focus of a new white paper by Mack Institute co-director Christian Terwiesch, Mack Institute pre-doctoral fellow Lennart Meincke, and Arnd Huchzermeier of WHU’s Otto Beisheim School of Management. The paper is the latest in a series of generative AI experiments by Terwiesch and colleagues, supported by the Mack Institute.

To explore this question, the researchers placed a multimodal large language model inside a realistic medical training simulation — the same type of system used to evaluate medical students and practicing clinicians. On screen, a virtual patient’s condition evolves in real time: vital signs change, test results arrive with delays, and inaction has consequences.

Rather than responding to a written prompt (such as “a 50-year-old male presents with chest pain”), the AI must decide what to do next at every step. It can question the patient, turn on monitors, order lab tests or imaging, administer treatments, and escalate care — all while the clock is running and the patient’s condition may be improving or deteriorating. In effect, the system is evaluated not on a single answer, but on whether it can manage an entire clinical encounter from start to finish.


How the AI Performed

The researchers placed an off-the-shelf multimodal large language model (Gemini 2.5 Pro) into Body Interact, a medical training simulation widely used in education and certification. They evaluated the AI across four acute care scenarios, ranging from a simple at-home hypoglycemia case to complex emergency room situations involving pneumonia, stroke, and congestive heart failure.

The AI’s performance was benchmarked against more than 14,000 simulation runs by medical students, as well as against an experienced emergency physician who completed the same cases.

Across scenarios, the AI consistently stabilized patients and completed cases at rates comparable to — and in some cases higher than — medical students. It also completed cases substantially faster. Overall diagnostic accuracy was similar, and in many instances the AI’s sequence of actions closely resembled expert clinical practice.

Notably, the system was not trained to solve these specific cases or to imitate expert clinicians. Instead, the researchers evaluated a general-purpose model — not a custom-built medical system — and observed how it navigated diagnostic and treatment decisions when placed in a realistic, time-pressured clinical environment.

Understanding AI Reasoning and Confidence

Beyond whether the AI reached the correct outcome, the researchers examined how it reasoned along the way. As each case unfolded, they tracked how the system’s confidence in different possible diagnoses changed in response to new information, much like a clinician updating their thinking as test results arrive.

A clear pattern emerged. Early in a case, the AI tended to order tests that provided large amounts of new information, quickly narrowing the range of plausible diagnoses. As the encounter progressed, additional tests produced smaller gains, and uncertainty declined. In effect, the system behaved as if it were prioritizing the most informative actions first, rather than ordering tests indiscriminately.
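The "most informative actions first" pattern can be made concrete with a small sketch. The function and the belief numbers below are purely illustrative (they are not taken from the study): Shannon entropy measures how uncertain a distribution over candidate diagnoses is, and early tests shrink it far more than later ones.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a probability distribution over diagnoses."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical beliefs over four candidate diagnoses at three points in a case.
# These numbers are illustrative only, not data from the white paper.
beliefs = {
    "before any tests": [0.25, 0.25, 0.25, 0.25],   # maximal uncertainty
    "after first lab panel": [0.70, 0.15, 0.10, 0.05],
    "after imaging": [0.92, 0.04, 0.03, 0.01],
}

prev = None
for stage, p in beliefs.items():
    h = entropy(p)
    gain = "" if prev is None else f"  (information gained: {prev - h:.2f} bits)"
    print(f"{stage}: H = {h:.2f} bits{gain}")
    prev = h
```

Running this shows entropy falling sharply after the first test and only modestly after the second, mirroring the diminishing returns the researchers observed as each encounter progressed.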

Just as important, the AI’s confidence proved meaningful. When the system expressed high confidence in a diagnosis, it was very likely to be correct; when it remained uncertain, errors were more likely. This alignment between confidence and accuracy suggests that the AI was not simply overconfident, but able to distinguish between cases it had effectively resolved and those that remained ambiguous.

This finding stands out in light of growing concerns that large language models often express confidence that exceeds their actual reliability. In this dynamic, multimodal setting, the AI’s confidence tracked performance surprisingly well.
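One common way to check the kind of confidence–accuracy alignment described above is a simple calibration table: group predictions by stated confidence and compare each group's accuracy to its confidence range. The sketch below uses made-up records, not the study's data, and is only meant to show the mechanic.

```python
from collections import defaultdict

def calibration_table(records, bin_edges=(0.5, 0.7, 0.9, 1.01)):
    """Bucket (confidence, correct) records by confidence and report accuracy per bucket.

    For a well-calibrated system, accuracy in each bucket should roughly
    match the confidences that fall into it.
    """
    grouped = defaultdict(list)
    for conf, correct in records:
        lo = 0.0
        for hi in bin_edges:
            if conf < hi:
                grouped[(lo, hi)].append(correct)
                break
            lo = hi
    return {bucket: sum(hits) / len(hits) for bucket, hits in sorted(grouped.items())}

# Illustrative records: (stated confidence in the final diagnosis, was it correct?)
records = [
    (0.95, True), (0.97, True), (0.92, True), (0.91, False),
    (0.75, True), (0.72, False), (0.78, True),
    (0.55, False), (0.60, True), (0.52, False),
]

for (lo, hi), acc in calibration_table(records).items():
    print(f"confidence [{lo:.2f}, {hi:.2f}): accuracy {acc:.2f}")
```

With these toy records, accuracy rises with confidence across buckets, which is the signature of the confidence-tracks-performance behavior the researchers report.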

Where Humans Still Matter

The study also highlights clear limits. While the AI was fast and effective at stabilizing patients, it consistently engaged less in patient communication than human clinicians. It also tended to order more diagnostic tests than an experienced physician, suggesting that expert judgment remains superior when it comes to cost-aware diagnostic decision-making.

For these reasons, the authors emphasize that their results should not be interpreted as support for unsupervised AI in health care. Instead, the findings point toward a more targeted role for AI as a workflow-level support system, or as a “second set of eyes” alongside a physician. In time-critical or resource-constrained environments, such as emergency departments, AI could act as a rapid stabilizer or triage assistant — managing information, monitoring patient status, and flagging high-risk cases — while clinicians focus on judgment, communication, and oversight.

From an operations and management perspective, the broader lesson is that evaluating AI solely on static benchmarks understates its potential impact. What matters is not just whether an AI reaches the right answer, but how it manages uncertainty, time pressure, and trade-offs across an entire process. As AI systems continue to improve, the central challenge may no longer be whether they can reason, but how they should be integrated into human-centered workflows.

Note: The authors thank the team at Body Interact for their collaboration and support for this research, and are particularly grateful to Raquel Bidarra and Rita Santos for providing access to the platform, facilitating the simulation setup, sharing performance data, and offering technical guidance. Their partnership was essential in enabling rigorous evaluation of the AI system within a realistic clinical training environment. No party other than the authors participated in the design, execution, analysis, or interpretation of this study, and the authors received no financial compensation from the companies in this study, including Body Interact.
