Wharton’s Barbara Mellers and doctoral student Ike Silver discuss their research on “collective confidence calibration” and the effectiveness of team discussions.

When we’re unsure of an answer to a question or need help solving a problem, we often turn to our co-workers for collaboration. Indeed, businesses and organizations have long relied on the wisdom of the crowd to produce better outcomes than individuals can achieve alone. But new research from Penn Integrates Knowledge professor Barbara Mellers and Wharton doctoral student Ike Silver challenges that notion, revealing that a crowd’s confidence can outstrip its accuracy. In their paper, “Wise Teamwork: Collective Confidence Calibration Predicts the Effectiveness of Group Discussion,” the researchers explain why the composition of the group is critical to achieving better results. They recently spoke with Knowledge at Wharton about their research and why the best teams should be made up of people who are self-aware.

An edited transcript of the conversation follows.

Knowledge at Wharton: The idea behind the wisdom of crowds is that pooling judgments from individuals can lead to greater accuracy. But you note in your paper that when the crowd engages in conversations, sometimes accuracy can suffer. Why?

Ike Silver: The idea of the wisdom of crowds draws on a very intuitive statistical fact: if you take a group of independent people and ask them a difficult question, nearly all of them will get it wrong, but in aggregate they will often be wrong in ways that are unbiased and uncorrelated, meaning that their errors cancel out on average. When we allow people to talk to one another, their errors become correlated, precisely because they are listening to one another. That can be a really great thing if the group is mostly listening to someone who is on the smarter end of the distribution of people in the group. But it can also be a bad thing if people are listening to someone who is persuasive but not necessarily knowledgeable.
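
That error-cancellation logic is easy to see in a toy simulation. Here is a minimal sketch (ours, not from the paper) in Python, with an invented crowd and noise level; the true answer echoes one of the study’s questions, roughly Neptune’s mean diameter:

```python
import random
import statistics

random.seed(42)

TRUE_ANSWER = 49_244  # km, roughly Neptune's mean diameter (illustrative)

# 100 independent estimators: each guess is the truth plus unbiased,
# uncorrelated noise, so individual errors are large but cancel on average.
guesses = [TRUE_ANSWER + random.gauss(0, 15_000) for _ in range(100)]

mean_individual_error = statistics.mean(abs(g - TRUE_ANSWER) for g in guesses)
crowd_error = abs(statistics.mean(guesses) - TRUE_ANSWER)
print(f"average individual error: {mean_individual_error:,.0f} km")
print(f"error of the crowd average: {crowd_error:,.0f} km")  # typically far smaller

# If everyone instead anchors on one (arbitrary, possibly wrong) member,
# errors become correlated and averaging no longer cancels them.
anchored = [guesses[0] + random.gauss(0, 1_000) for _ in range(100)]
anchored_error = abs(statistics.mean(anchored) - TRUE_ANSWER)
print(f"error of the anchored crowd average: {anchored_error:,.0f} km")
```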

In the context of discussions, there have been other well-documented biases that can arise, too. One of them is that groups tend to listen to ideas that are held more broadly at the expense of unique expertise held by specific individuals. So, groups will sometimes ignore individuals who have unique ideas that might be helpful because those aren’t the prevailing views.

The other thing that happens in discussions is that people will start to care about how they look to others, and that can just be distracting, which can exacerbate some of the other biases that I mentioned.

Barbara Mellers: There’s lots of work on groupthink, conformity and bystander intervention that shows how ineffective groups can be. That’s well-documented in the psychological literature.

Silver: The other thing we should mention is that conversations are costly, so there’s some reason to believe upfront that we should use them sparingly. You’re putting people in a room, you’re asking them to spend their time. There are logistical considerations. There are costs associated with conducting discussions in the first place.

Mellers: But for the most part, [people] like to get together and talk about things, so it doesn’t feel costly.

Knowledge at Wharton: What was the big question you set out to answer with this research?

“Discussion had a variety of effects, and the big question for us was, ‘What predicts when discussion helps?’” –Barbara Mellers

Mellers: We wanted to know the conditions under which discussions helped people’s judgment, and we came up with a paradigm to answer that question that’s really quite simple. We had people come into the Wharton Behavioral Lab. We put them in front of computers, took away their cellphones so they couldn’t look up the answers, and then asked them questions like, “What’s the diameter of Neptune?” Or, “What’s the population of Madagascar?” Or, “In what year was the printing press invented?”

The first thing they did was to put their independent estimate into the computer. Then we had them turn their chairs around and talk to three other people for two or three minutes and see if they could collectively come up with a better answer than what they had said independently. They chatted, and when that period was over, they turned around and entered a new estimate. They didn’t have to reach consensus. We just took the average of the independent estimates and the average of the post-discussion estimates, then compared the two to see whether the second was more accurate than the first. If it was, then discussion helped.
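
To make that comparison concrete, here is a small sketch of the accuracy check as we understand it, with invented numbers rather than data from the study (the figure used for Madagascar’s population is itself only approximate):

```python
import statistics

# Hypothetical responses from one four-person group to one question,
# "What's the population of Madagascar?" (true value taken here as
# roughly 28 million; all estimates below are illustrative).
TRUE_ANSWER = 28_000_000
pre_discussion = [10_000_000, 22_000_000, 35_000_000, 60_000_000]
post_discussion = [24_000_000, 26_000_000, 30_000_000, 33_000_000]

pre_error = abs(statistics.mean(pre_discussion) - TRUE_ANSWER)
post_error = abs(statistics.mean(post_discussion) - TRUE_ANSWER)

# Discussion "helped" if the post-discussion average is closer to the truth.
print("discussion helped" if post_error < pre_error else "discussion did not help")
```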

Across a variety of estimation problems, we found that we had questions for which people got better after they talked, questions for which there was no effect, and then questions for which they got much worse after discussion. We had a situation along the lines of what we wanted. Discussion had a variety of effects, and the big question for us was, “What predicts when discussion helps?”

Knowledge at Wharton: What did you find? What was the key factor at play?

Silver: The thing that we found that we’re most excited about is this idea of collective confidence calibration. To give you a sense of what we mean by that, calibration in general refers to an appropriate correspondence between knowledge and confidence, which is to say that if you know the answer, you’re confident, and if you don’t know the answer, you’re unconfident. We would call someone who has that property [of being confident when they are also right] well-calibrated. But in this project, we were looking at calibration across a group of individuals. To measure calibration, we asked participants to say how confident they were in their initial answer.

Then we calculated for each group a collective confidence calibration score, which essentially captured the extent to which individuals who had better pre-discussion answers were more confident and individuals who had worse pre-discussion answers were less confident. Sometimes we found that confidence and knowledge lined up that way, and we would call the group well-calibrated. On the other hand, sometimes we found groups in which individuals who had worse pre-discussion answers were actually more confident, and individuals who had better pre-discussion estimates were less confident. That would be a group we would call poorly calibrated. Then we tried to predict the likelihood that a group’s average answer would improve after discussion using these pre-discussion calibration scores.
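
The interview doesn’t spell out the paper’s exact scoring formula, but a natural stand-in is the correlation, within a group, between each member’s pre-discussion confidence and his or her pre-discussion accuracy. A minimal sketch, with an assumed `calibration_score` helper and invented data (requires Python 3.10+ for `statistics.correlation`):

```python
import statistics

def calibration_score(estimates, confidences, true_answer):
    """Correlate members' confidence with their pre-discussion accuracy
    (negated absolute error). Positive scores mean the more accurate
    members were also the more confident ones: a well-calibrated group.
    This is a stand-in for the paper's measure, not its exact formula.
    """
    accuracy = [-abs(e - true_answer) for e in estimates]
    return statistics.correlation(confidences, accuracy)

TRUE_ANSWER = 49_244  # km, roughly Neptune's mean diameter
estimates = [49_000, 30_000, 80_000]  # the first member is by far the closest

# Well-calibrated group: the best estimator is also the most confident.
print(calibration_score(estimates, [0.9, 0.4, 0.3], TRUE_ANSWER))  # strongly positive

# Poorly calibrated group: the worst estimator is the most confident.
print(calibration_score(estimates, [0.2, 0.5, 0.9], TRUE_ANSWER))  # negative
```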

Knowledge at Wharton: Confidence can be a minefield. Barbara, in a recent interview that we did about how noise impacts predictions, you mentioned that overconfidence is one of the major biases that you might encounter in a group. Is there any danger for that in this context? Can too much confidence among members of the group lead to worse judgments?

Mellers: Sure, that’s always a concern. The aspect of confidence that Ike and I were interested in was the relative confidence among group members. We weren’t really interested in the level, just the correlation between confidence and accuracy. Did the most accurate person also say that he or she was the most confident, or not? Or was the most accurate person the least confident? And when the correlation was in the right direction, it was a great predictor. It was an excellent signpost of whether the average of group estimates would become more accurate after discussion, relative to before.

“Listening to the most confident person is a good strategy when the most confident person also happens to have pretty good judgment.” –Ike Silver

Silver: I’ll just mention a little bit about why we think that that’s happening. When our participants went into the discussion phase, they were sitting in a circle with four or five other participants, and their task was to figure out, “How can I take this conversation and improve my estimate?” Part of doing that well is figuring out who is knowledgeable amongst the group – whom to listen to.

But these questions are challenging. The answers aren’t obvious, so it’s hard to know whom you should listen to. There’s another well-documented effect in the literature on conversations, which is that in the absence of information about who is knowledgeable, individuals will rely on expressed confidence as a cue. You’ll listen to the most confident person or someone who’s more assertive and leans into the conversation.

Mellers: That’s not a bad cue in many cases.

Silver: Right. So we were trying to explain: When is that going to be a good strategy, and when is it going to be a bad strategy? It turns out that listening to the most confident person is a good strategy when the most confident person also happens to have pretty good judgment. But listening to the most confident person is not a great idea when that person doesn’t happen to be one of the more knowledgeable members of the group. We were therefore exploiting variation in pre-discussion calibration to predict when listening to the most confident person was going to be a good idea. We also asked participants to identify the most knowledgeable member of their group, and we found that success in that task, too, was a signpost for beneficial discussions.

Knowledge at Wharton: In your paper, you also mention “the illusion of effective discussion.” What is that, and why is it important?

Mellers: One thing that regularly happened, regardless of whether group discussion had a positive effect, was that the average confidence in the group increased after discussion relative to before. And that got us thinking. Ike, do you want to go on from there?

Silver: Sure. When we looked at the data, what we found was that groups became more confident in their answers about 90% of the time. Groups nearly always thought that discussion improved their answer – at least that’s our interpretation. In addition, we asked them explicitly, “Do you think the discussion helped you?” And we also asked them to predict beforehand, “Do you think discussion will help you?” We found that, before interacting, they said, “Yes, discussion will help me. These questions are hard.” Then afterwards, they said, “Absolutely. My answer got better from talking to other people.” But in reality, there was great variation in whether or not discussion actually improved groups’ answers. Groups improved on average only about 55% to 60% of the time. We’d call that undue confidence. The cases in which groups’ confidence increased but their accuracy didn’t: That’s the “illusion of effective discussion.” It’s a very tentative name for it, but it’s a pattern in the data that we observed and that we’re interested in looking into further.

“It does seem that there’s something about talking to other people that makes you really feel like you’re getting smarter.” –Ike Silver

Mellers: It’s kind of a warm glow that comes from the conversation. “OK, I’m doing due diligence. I’m talking to my friends or colleagues about this question.” People just don’t seem to realize that conversation can have negative effects on the accuracy of their judgments on a particular estimation task.

Knowledge at Wharton: This reminds me a lot of sitting alone in a room, Googling your symptoms and self-diagnosing your illness. It’s almost like engaging in a conversation, where you’re getting all this feedback that may or may not be right.

Mellers: A friend of mine calls that “cyberchondria.”

Silver: That’s exactly right. We think that when people can engage in, as Barbara said, due diligence, their confidence in their answer will go up whether or not the answer actually improves. What we’re particularly interested in is whether that due-diligence-to-confidence pathway is even stronger in a group discussion context. It does seem that there’s something about talking to other people that makes you really feel like you’re getting smarter, maybe even in ways that just sitting by yourself and doing research or deliberating further might not.

Knowledge at Wharton: What are the practical implications of your research for companies and organizations?

Silver: This is tentative and needs to be followed up, but I think the biggest practical implication for organizations is that they need to think about assembling teams not only of talented and knowledgeable individuals, but also of individuals who know what they know and what they don’t know, and who are self-aware about their own talents.

What we found in our data is that, above and beyond how knowledgeable a group was before discussion, this confidence calibration, this idea that people have in some cases a meta-awareness of what they know and what they don’t, was a really strong predictor of effective discussion. Right now, in the labor market, what you’ll see is that organizations will administer tests to people that they’re considering hiring. Oftentimes, those tests ask job candidates about problem-solving or area-specific knowledge. And they use that as a way of getting a sense of, “Is this a smart, knowledgeable person that I want to have in my organization?” That’s great. But you could very easily add to those sorts of tests additional questions about confidence. “Answer this question. And then tell us, do you think you know the answer to this question?” And in so doing, organizations could start to get a sense of whether or not they’re staffing their teams not only with knowledgeable individuals, but also with individuals who are self-aware and who know when to lean into a discussion and when to sit back.

Mellers: This suggests a number of different avenues that we’re heading down at the moment, and one of them is, how can we make groups better from the start so that they’re more capable of learning from the discussion? We’ve toyed around with three tips that we could give people. The first is something that’s been shown to reduce overconfidence among professionals in several areas, and that’s to ask people to think of at least one reason why they might be wrong. Like I said, we don’t want everybody to tamp down their confidence, and we don’t really care [if they do]. It’s the relative ratings of confidence that we care about.

“People just don’t seem to realize that conversation can have negative effects on the accuracy of their judgments on a particular estimation task.” –Barbara Mellers

The second tip would be to listen to the estimate of every single person in the group. Don’t let anybody get by without speaking. Take the shy people in the corner who aren’t saying anything and also ask them the reasons behind their estimates.

The third tip would be simply, “Don’t adjust your estimate unless you trust the reasons behind it.” We are working on some new studies to see if some of these tips could have a beneficial effect, and we’d like to get people to become better calibrated and then, following from that, become more accurate in their estimations.

Knowledge at Wharton: Are there any other questions that this research opens up that you think would merit a look?

Silver: Yes. I mentioned the illusion of effective discussion, which is this idea that people will sometimes have undue confidence in the power of discussion to improve their judgment. We found that in this preliminary data. It wasn’t our key research question. But we’re conducting further studies to try to understand what exactly is going on in discussions that is increasing people’s confidence. Is it having access to other people’s answers? Is that enough to produce these confidence increases? Is it something about interacting that causes people to feel like their accuracy has increased?

Another interesting related question is, “Do you even have to participate in a discussion at all to feel like discussion improves the quality of answers?” Suppose you were a manager and wanted to ask a team of employees to come up with an answer to a particular question. You could elicit their independent answers and aggregate them in some way, or you could ask them, “Hey, go have a discussion about this. Schedule some time, think about it deeply, and come back to me.”

What we think is that managers probably have the intuition that the latter is going to get them a better answer. But our research suggests that it will do so only some of the time. From a managerial perspective, how can we train managers to know when it’s a good idea to ask their employees to engage in discussion versus when it’s a better idea to say, “Give me your independent answers. I’ll aggregate them in some simple way, and that will be enough wisdom for whatever it is that I’m trying to do”?