After experimenting with ChatGPT to determine whether it could ace an MBA exam and come up with entrepreneurial product ideas, Wharton professor Christian Terwiesch is putting the AI model to the test once again. In a new white paper, Terwiesch and research assistant Lennart Meincke detail an experiment that pits ChatGPT-4 against The New York Times ethics columnist to compare which one dispenses more useful advice.


Dan Loney: As artificial intelligence continues to grow, new questions continue to be asked. A new white paper by our next guest asked whether ChatGPT can provide ethical advice. It was written by Christian Terwiesch, who is a professor of operations, information and decisions here at The Wharton School. He’s also co-director of the Mack Institute. And his co-author, Lennart Meincke, is a research assistant here at Wharton.

Christian and Lennart, tell us about the importance of this research and how you went about it.

Christian Terwiesch: This sounds like a weird question, right? Can GPT give ethical advice? The bigger context is that we’ve seen AI do all kinds of things, from serving as a customer support agent, to summarizing patient visits in the hospital, to making automated underwriting decisions in banking. The question came up: can it do more human things?

I did another white paper that you and I talked about on the show recently, on whether GPT can be creative. And we found really interesting things there. The last question that we felt had to be covered is, can GPT do something human, like having feelings, and give ethical advice? We thought that was a really interesting question to tackle.

Loney: How did you go about this?

Lennart Meincke: We benchmarked it against ethical advice by The Ethicist in The New York Times. We didn’t just want to see if we can replicate their advice. We wanted to really see what the base model, GPT-4, can do right now.

When preparing our experiment, we spent a lot of time making sure that we didn’t introduce any kind of bias or give any previous responses by the person we’re comparing against, to make sure we’re really just getting what the raw model thinks. It was super fascinating to see the differences in responses for the different ethical dilemmas.

We essentially took reader submissions to The New York Times that The Ethicist had answered beforehand. I think we drew from the last three to four months, so the data could not have been in GPT’s training data, so it couldn’t have seen the answers. Then for each of those dilemmas, we asked GPT — again, with a prompt to make sure that it wouldn’t just try to mimic the original advisor — and saw what kind of advice GPT gave.

Then we had a few different groups: MBA students at Wharton [and] an expert group of clerics and faculty. And then we had more general-public, college-educated people rate each [piece of] advice and see which one they perceived to be more useful.

Loney: Is it too much of a leap of faith to assume that computers, or in this case AI, can make those types of decisions?

Terwiesch: Let’s be clear what we’re testing here. We’re not testing whether AI has feelings, AI is conscious, or AI has moral capabilities. I have an opinion on all of those, and the answer is no. What we are testing is, can GPT create ethical advice that is useful? And usefulness is measured in the eyes of laypeople, Wharton MBA students, and experts such as clergy and academics.

For each of these dilemmas, we have the expert — in this case, from The New York Times, Dr. [Kwame Anthony] Appiah — share his view on an ethical dilemma. And we had GPT create [its] view on the dilemma. Then we compared which one is more useful. We’re not making claims that AI is human or conscious or anything. It is fascinating to see if it provides useful advice. Nothing more, nothing less.

Loney: You also rated the usefulness of the advice, correct?

Meincke: Yes. There are essentially two different ways we approached the evaluation. In one of the two surveys, we asked participants, “Imagine you are the reader and you asked this question. How useful do you think, on a scale from one to seven, this advice will be to you?” In the second group, we displayed both the original Ethicist advice and the GPT advice, and they had to pick which of the two they preferred. It was more like a head-to-head race.

Loney: Which advice did they gravitate more towards?

Meincke: It was very interesting because it’s a tie, and it’s a super close tie. Across all three groups, there’s a slight preference for GPT, but only for the laypeople is it statistically significant. For the MBAs and the experts, while there’s a small preference, it’s not large enough to be considered significant. I think right now, we’re confident saying that there is a tie in ability, but we can’t say that one is much better or worse than the other.
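As a rough illustration of how a "close tie" in a head-to-head vote can be checked for significance (this is a generic sketch, not the paper's actual data or analysis; the vote counts below are hypothetical), an exact two-sided binomial test asks how surprising the observed split would be if raters truly had no preference:

```python
from math import comb

def binom_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test: probability, under a fair
    50/50 null, of any outcome at least as unlikely as k out of n."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for float comparison
    return sum(x for x in pmf if x <= threshold)

# Hypothetical counts for illustration only: suppose 58 of 100
# head-to-head votes preferred the AI's advice.
p_value = binom_two_sided_p(58, 100)
# A p-value above 0.05 would mean the preference is not
# statistically significant, i.e., consistent with a tie.
```

With a split this mild, the test does not reject the no-preference null, which matches the kind of "slight but not significant preference" described above for the MBA and expert groups.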

Loney: Does a tie say anything to you about the importance of this going down the road?

Terwiesch: Yeah. Anytime you find yourself in a situation where you want ethical advice, you can now get it immediately and for free. I think that is, in many ways, a real “wow” result. Because few of us, unless we are sitting in the White House or are the CEO of a big company, have the luxury of having ethical advisers. Now, ethical advice can be streamed to us at zero marginal cost, and instantly. I think that is “wow.”

Again, I don’t want to overgeneralize and say we don’t need ethicists anymore. On the contrary. But in most situations, I think normal people like you and us can’t afford to seek ethical advice from an expert. Having this now available at the click of a mouse, I think, is something really big.

Loney: Should we expect AI to grow in this area?

Terwiesch: Again, I think the growth is in use cases that previously had no involvement of ethical considerations. I am absolutely hesitant to make significant decisions in life, in politics, and in business based on that advice alone. I think of it more as a stimulus as we go through our own deliberations: to hear other perspectives, to embrace perspectives we might not usually have encountered in our social network, to take all of that into account, and to have the user be the decision-maker. Hopefully, that leads to more ethical behavior, which is something that I think the world desperately needs right now.

Loney: Lennart, it sounded like when people were making their decision between the AI advice and that of The Ethicist, that there may have been cases where people were not sure which they would take. And in some cases, the AI won out.

Meincke: Yeah. We were thinking hard about how to design the experiment. Because if we just gave them two pieces of advice, A from the AI and B from The Ethicist, everyone might pick the AI even when the two were super close, and then we couldn’t really see how big the gap was, right? That’s why we used these two stages, or two different experiments. In one, we asked them, on a scale from one to seven, how useful the advice is. Then in another round, we asked them, “Which one do you think is more useful to you out of these two?” Of course, we asked different people different questions. But that was the core idea, so we were able to see how big the difference really was.

Loney: What did you take from this research?

Meincke: I think it’s just super interesting, with GPT and AI advancing so quickly, to evaluate all the different things it’s already good at that you might not immediately think of. And to Christian’s point, I would just see it as one more opinion, one more voice that you’re throwing into a decision. Maybe it’s 11 p.m. on a Sunday and you would just like to get some advice on something. You have it readily available. You don’t have to follow it. We’re not sitting here saying, “Please follow what GPT-4 tells you.” But it’s just one more opinion that might be helpful, and it’s very affordable.

Loney: Christian, where does this research potentially take you next?

Terwiesch: I would really like to explore the role of opinion diversity. I think it lies in the nature of American life and most of the Western world right now that we’re all surrounding ourselves with like-minded people. I think there’s a real opportunity to seek advice from folks that are explicitly outside your comfort zone, people you would normally not approach, and hear what they have to say. Then you, maybe together with AI, can aggregate these individual voices and opinions to hopefully look beyond your own perspective and grow in understanding of others and in the deliberation process for your own decisions.

Loney: Lennart, where would you like to take this next?

Meincke: I think it would also be interesting to look at the differences per ethical dilemma. We asked 20 different ones. Maybe there’s a strong preference for the AI’s advice in a specific niche. Maybe it’s different for The Ethicist. A little bit to Christian’s point, it would be cool to see how people think differently per ethical question raised.

Loney: Is this going to become a natural avenue for people to consider when they’re thinking about some sort of question or problem in their life that they need to get advice on?

Meincke: I think that’s a very natural assumption to make. We’ve seen it for many years: people take to online forums to ask for any kind of advice, whether it’s relationship advice or ethical advice. We see it on Reddit a lot. So, I think it’s a very natural extension of the technology that people will turn more and more to these virtual assistants and ask them for help. But of course, in these cases, the help or answer can only ever be as good as the information you provide.

Terwiesch: I think it’s just fascinating that a technology that has never been married, that has never been loved, can provide us with guidance and advice about how to live our lives. I think of it as: get input, get stimuli, get new perspectives. And then it’s up to us to integrate that into our decision-making process and become better people for it.