Want Better Forecasting? Silence the Noise

From predicting the weather to calling election outcomes, forecasts have a wide range of applications. Research shows that many forces affect how accurately outcomes can be predicted, among them bias, information and noise. Barbara Mellers, a Wharton marketing professor and Penn Integrates Knowledge (PIK) professor at the University of Pennsylvania, and Ville Satopää, assistant professor of technology and operations management at INSEAD, examined these forces and found that noise was a much bigger factor than expected in the accuracy of predictions. The professors recently spoke with Knowledge at Wharton about their working paper, “Bias, Information, Noise: The BIN Model of Forecasting.”

An edited transcript of the conversation follows.

Knowledge at Wharton: In your paper, you propose a model for determining why some forecasters and forecasting methods do better than others. You call it the BIN model, which stands for bias, information and noise. Can you explain how these three elements affect predictions?

Ville Satopӓӓ: Let me begin with information. This describes how much we know about the event that we’re predicting. In general, the more we know about it, the more accurately we can forecast. For instance, suppose someone asked me to predict the occurrence of a series of future political events. If I’m entirely ignorant about this, I don’t really follow politics, I don’t follow the news, I barely understand the questions you’re asking me, I would predict around 50% for these events.

On the other hand, suppose I follow the news and I’m interested in the topic. My predictions would then be more informed, and hence they would not be around 50% anymore. Instead, they would start to tilt in the direction of what would actually happen.

At the extreme case, we could think of me having some sort of a crystal ball that would allow me to see into the future. This would make me perfectly informed, and hence I would predict zero or 100% for each one of the events, depending on what I see in the crystal ball. This just illustrates how information can drive our predictions. It introduces variability into them that is useful because it is based on actual information. Because of that, it correlates with the outcome.

Unfortunately, in the real world we are not rational consumers of information. There are bound to be errors in our predictions. Statistically speaking, we like to separate errors into two different types: bias and noise. Bias is a systematic error. For instance, in this context of making predictions about political events, I might be making predictions that are systematically too high. That is, I systematically assign probabilities that are too high to the events occurring. This means that I have a positive bias. Similarly, my predictions could be systematically too low. Then I have a negative bias. The key here is to understand that bias is systematic. Because of that, we should be able to predict the direction and magnitude of bias in the forecaster’s next prediction.

Noise is a very different type of creature. It is not systematic. In fact, it is an error that randomly increases or decreases my predictions. For instance, one of my predictions may become randomly too high. For another event, it might suddenly become too low. The idea here is that no matter how much we know about the forecaster, it is impossible for us to predict the exact direction and magnitude of noise. This also introduces variability in the predictions. This variability is not based on any actual relevant information about the outcome. Therefore, it is not useful and does not correlate with the outcome.

Let me sum that up. We have information and noise that define the variability in the predictions. Information is variability that correlates with the outcome, while noise is variability that does not correlate with the outcome. And bias is a systematic over- or underestimation in the predictions.
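
To make the three ingredients concrete, here is a minimal toy simulation. It is an illustration only, not the specification used in the working paper: the `info`, `bias` and `noise_sd` knobs, the normal-distribution setup and the choice of average squared error as the accuracy measure are all assumptions made for this sketch.

```python
import math
import random

def normal_cdf(x):
    """Standard normal CDF, built from math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def simulated_brier(info, bias, noise_sd, n_events=20000, seed=0):
    """Average squared error of a simulated forecaster (lower is better).

    info     : correlation (0 <= info < 1) between the forecaster's signal
               and the latent quantity that decides the outcome
    bias     : constant shift added to every forecast (probit scale)
    noise_sd : standard deviation of a random shift redrawn per forecast
    All three knobs are hypothetical toy parameters.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_events):
        truth = rng.gauss(0.0, 1.0)          # latent state of the world
        outcome = 1.0 if truth > 0 else 0.0  # the event occurs or it does not
        # A signal that is partly driven by the truth and partly by luck.
        luck = rng.gauss(0.0, 1.0)
        signal = info * truth + math.sqrt(1.0 - info ** 2) * luck
        # Probit-scale belief of a rational reader of that signal.
        rational = info * signal / math.sqrt(1.0 - info ** 2)
        noise = noise_sd * rng.gauss(0.0, 1.0)
        forecast = normal_cdf(rational + bias + noise)
        total += (forecast - outcome) ** 2
    return total / n_events

print("no information     :", round(simulated_brier(0.0, 0.0, 0.0), 3))
print("informed           :", round(simulated_brier(0.7, 0.0, 0.0), 3))
print("informed + biased  :", round(simulated_brier(0.7, 1.0, 0.0), 3))
print("informed + noisy   :", round(simulated_brier(0.7, 0.0, 1.0), 3))
```

With no information, bias or noise, every forecast is 50% and the score is 0.25. In this toy setup, adding information lowers the error, while adding either bias or noise to an informed forecaster raises it again.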

Barbara Mellers: Bias is what we’ve studied in the judgment and decision-making literature. It’s something you can see and do something about, whereas noise is invisible and very difficult to predict. I think that’s why we haven’t focused on it in the science as much as bias.

Back in the 1970s, Daniel Kahneman and Amos Tversky started a program of research that’s called heuristics and biases. They got everyone excited about studying bias, not just in statistical terms the way Ville is describing it, but also in psychological terms, like stereotyping people or being too optimistic and overconfident about our skills or our abilities. Kahneman is in the process of writing a new book on noise, and now he’s getting people very excited about reducing and tamping down noise in human judgment. It’s his view, and I think our research supports it, that noise may be a much bigger factor than bias in our desire to improve human judgment.

Knowledge at Wharton: Let’s back up and clarify some terminology that might come up in terms of forecasting. What is a superforecaster?

Mellers: The term came from the previous forecasting tournament that we participated in. It was IARPA-sponsored and went on for four years. There were hundreds of forecasting questions that we asked people during that time period. Thousands of forecasters were involved and over a million forecasts were made. At the end of each year, we calculated everyone’s accuracy score, then took the top 2% of forecasters in each of the various conditions that we had created. These were people who had done super well relative to the thousands of other people involved. And none of them objected to the title. They all were happy with the label of superforecasters and put it on their vitas!

We kept adding to that group at the end of each year, taking the top 2%. It’s relatively large now, in the hundreds. Believe it or not, superforecasters get together every so often in Boston and New York and San Francisco and all over the world. Many of them have become good friends and it’s kind of a club now.

Knowledge at Wharton: A correct prediction is either a zero or 100%. Everything else in between is a maybe, right? In other words, if someone guesses 50% or 30%, that’s a maybe. We tend to think anything that’s not a perfect prediction is a bad prediction. Is that true?

Satopää: You’re absolutely right. Ideally, we would have predictions of zero or 100 because those are absolute claims. If you were accurate, we would know exactly what is going to happen. But in the real world, this is rarely the case. We only know so much. There is always some sort of irreducible uncertainty that we cannot harness, and that will leave us with some uncertainty.

A forecaster who gives probabilistic predictions, somewhere between zero and 100, can also be extremely good. We call these kinds of forecasters calibrated. This means that when you give a probabilistic prediction, your prediction can be interpreted in the way that we typically like to think about probabilities.

For instance, suppose you make a prediction that says this event is going to happen with 30% probability. That means that if we could somehow simulate 100 worlds, that event would happen in about 30 of them, if you were calibrated, if you were a good forecaster. This is something we can actually test. If you make a lot of predictions on different types of events over time, we can start checking how calibrated your predictions are. Are they matching with the empirical frequencies that we observe in practice?
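
The calibration check described here can be sketched in a few lines. This is one simple way to do it, assuming equal-width probability bins; the function name `calibration_table` and the forecasting record below are made up for illustration.

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes, n_bins=10):
    """Group forecasts into equal-width probability bins and compare the
    average stated probability with the observed frequency in each bin.
    Returns a list of (mean_forecast, observed_frequency, count) tuples."""
    bins = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[index].append((p, y))
    table = []
    for index in sorted(bins):
        pairs = bins[index]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(y for _, y in pairs) / len(pairs)
        table.append((mean_p, freq, len(pairs)))
    return table

# Made-up record: ten "30%" forecasts of which 3 came true,
# and ten "80%" forecasts of which 8 came true.
forecasts = [0.3] * 10 + [0.8] * 10
outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1] * 8 + [0] * 2

for mean_p, freq, count in calibration_table(forecasts, outcomes):
    print(f"said {mean_p:.0%}, happened {freq:.0%} of {count} times")
```

A calibrated forecaster's stated probabilities and observed frequencies line up in every bin, as they do in this invented record. In practice each bin needs many forecasts before the comparison means much.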

Knowledge at Wharton: So even if it doesn’t come true, it’s not necessarily a bad prediction?

Satopӓӓ: Not necessarily, no.

Mellers: With probability judgments, there are only two ways to be wrong. You can say zero and the event occurs, or you can say 100% and the event doesn’t occur. All else is in the gray zone. But we often talk as though a prediction is wrong if it falls on the wrong side of maybe. Pollsters who said Hillary Clinton was going to win with a probability of 70% were viewed as making incorrect forecasts. Well, there are shades of gray here. Someone who made a 90% prediction that Hillary would win is more wrong than someone who made a 70% prediction. But people don’t treat predictions that way. They like to view them as right or wrong. Anything on the wrong side of maybe is wrong. But that’s not quite right. It’s more subtle than that.
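
One common way to put numbers on these shades of gray is a squared-error score, often called the Brier score. The transcript does not say which scoring rule produced the tournament's accuracy scores, so treat the rule below as an assumption chosen for illustration.

```python
def brier_score(forecast, outcome):
    """Squared error between a probability forecast and what happened
    (outcome is 1 if the event occurred, 0 if not).
    0 is a perfect score and 1 is the worst possible."""
    return (forecast - outcome) ** 2

# The forecast event (a Clinton win) did not occur, so outcome = 0.
for p in (0.7, 0.9, 1.0):
    print(f"forecast {p:.0%}: penalty {brier_score(p, 0):.2f}")
```

Under this rule the 70% forecast is penalized 0.49, the 90% forecast 0.81, and only the 100% forecast takes the maximum penalty of 1.00, which matches the point that the 90% forecaster was more wrong than the 70% forecaster.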

Knowledge at Wharton: The data you looked at is from a multiyear geopolitical forecasting tournament, and you applied three experimental interventions in your research. Can you tell us what those were?

Mellers: We created an intervention that we did not expect would have any effect. It was a training module to teach people a little bit about how to think about probabilities and a little bit of advice about where to get probabilistic information, such as prediction markets and polling companies. We told them to average professional forecasts, if they had more than one. And then we gave them a small section on what the common biases are in human judgment and what you should avoid. Basically, we said look out for overconfidence. Many people are overconfident, and you might be as well. That’s a known bias. And watch for the confirmation bias, which is a systematic tendency to look for, and pay attention to, information that’s consistent with one’s beliefs. So, try to think about why you might be wrong. Ask yourself in what ways you may have misinterpreted the question or forgotten to look for certain kinds of information and so forth.

The next intervention was teaming. We made bets among ourselves about what would work best here. Should forecasters be in groups of 10 or 15 and have the option of talking to each other if they want to? Or should they work alone? The strongest argument for working alone is a statistical one. It’s the notion that pooling independent judgments can often give you a more accurate estimate of something than people who talk first and then average their correlated judgments.

It turned out that the groups that worked in teams had opportunities to share information, and they started feeling responsible for each other. They didn’t want to disappoint each other. They gently corrected each other’s errors. If somebody said 20%, and it sounded like they meant 80%, somebody might say, “Oh, maybe you flipped the scale there.” They motivated each other: “Hey Joe, how come we haven’t heard from you in the last couple of days?”
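
The statistical argument for working alone can be made concrete with a small sketch. It assumes every judgment carries an error with the same standard deviation `sigma` and that any two errors share the same correlation `rho`; both parameters and both function names are hypothetical, and the sketch covers only the statistics, not the information sharing and motivation that helped the tournament's teams.

```python
import random
import statistics

def variance_of_average(n, sigma=1.0, rho=0.0):
    """Variance of the mean of n judgments whose errors each have standard
    deviation sigma and share a pairwise correlation rho."""
    return sigma ** 2 * (rho + (1.0 - rho) / n)

def simulate_average_error(n, rho, trials=20000, seed=1):
    """Empirical variance of the averaged error (unit-variance errors), built
    from one shared error plus an independent personal error per judge."""
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        shared = rng.gauss(0.0, 1.0)  # error the whole group has in common
        errors = [
            (rho ** 0.5) * shared + ((1.0 - rho) ** 0.5) * rng.gauss(0.0, 1.0)
            for _ in range(n)
        ]
        averages.append(sum(errors) / n)
    return statistics.pvariance(averages)

for rho in (0.0, 0.5):
    print(f"rho={rho}: formula {variance_of_average(10, rho=rho):.3f}, "
          f"simulated {simulate_average_error(10, rho):.3f}")
```

With ten independent judges the averaged error has a tenth of one judge's variance. With a correlation of 0.5 the average keeps more than half of it, however many judges are added, which is why correlated judgments were expected to pool less well.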

Knowledge at Wharton: Does being on a team help to reduce bias as well?

Satopӓӓ: It did reduce bias, but not as much as it reduced the noise.

Knowledge at Wharton: You had a third intervention. Can you talk about that?

Mellers: The third intervention was what we call tracking of talent, which refers to pooling people of similar skill. It’s like tracking in schools. In the school context, it’s the pooling of children with similar abilities. This is a controversial topic, and usually the controversy is found at the lower end. Should you pool kids with lower abilities and not have them exposed to kids with higher abilities? But there is not a lot of controversy about the higher end. Here, we’re in a case where we’re putting together the people who have the greatest skill and letting them work together. It was as if we had put them all on steroids. They suddenly just shot up in terms of effort. They really respected their teammates and wanted to do a good job for themselves and for the others.

Knowledge at Wharton: You both mentioned that noise was the biggest factor affecting the accuracy of predictions. Can you talk a little bit about that and why this was surprising to you?

Satopää: I guess we found it surprising because, going back to the original research project, all these interventions were designed to tackle bias. And going back to what Barb was saying about the history and the research done by Kahneman and Tversky, there’s been a very strong research focus on bias. What we discovered here was that bias reduction was not the dominant driver. In fact, noise reduction was the dominant driver.

The reason why it’s so handy to have superforecasters is that they are like an elite squad of forecasters. They represent what is humanly possible in this context. They provide an ideal benchmark against which we can compare other groups and see how we could improve those groups to make them more like the supers. What we found out, as a simple rule of thumb, is that about 50% of the accuracy improvements that we saw going from the regulars to the supers can be attributed to noise reduction. Another 25% is information improvement, meaning that they have more information, and the last 25% is bias reduction. So, 50% to noise, a quarter to information and the remaining quarter to bias. This was not quite what we expected to see when we went into this. But now it keeps showing up, no matter how we turn the data.

Mellers: We’re also looking at this from a dynamic perspective now. We’re analyzing what contributes to forecasting accuracy scores from 60 days up to one day before the event occurred. And it changes somewhat. There is more bias in the judgments further out and less information. That flips as we get closer to the outcome. Information increases. Does noise increase? No. Noise stays about the same and bias decreases.

I think the contribution of the model is that you can ask yourself whether your intervention, which you think was designed to reduce bias, is actually doing that. Or should you redesign your intervention to really reduce bias or reduce noise or increase information? You can check yourself.

Knowledge at Wharton: How can you reduce noise in this process?

Mellers: The way to get rid of noise is to use an algorithm to completely pull the human out of the loop. We are not necessarily reliable aggregators of information. We might have a headache or be distracted. We’re in different moods. We argue with our spouses. This, that. We don’t give the same judgment to the same set of information on multiple occasions. An algorithm will do that in a second. We don’t necessarily discriminate between stimulus information that differs, information that should receive different judgments, for all the same reasons that we’re not necessarily consistent. That’s the way to get rid of noise — to take the human out of the loop. But that isn’t something most people want to see.

Knowledge at Wharton: Is there a danger in going too far in that direction?

Mellers: How would you feel if an algorithm decided whether or not you should be charged with a crime? Or if an algorithm decided whether or not you had cancer? I think for many of these cases, it’s well known that the algorithms do much better than people. I was just reading Malcolm Gladwell’s new book, Talking to Strangers. He tells the story of a judge in Chicago who decided whether to keep detainees or release them on bail. He liked to look into the eyes of the detainee to decide whether he would skip bail. It turned out that information wasn’t nearly as valuable as other information that you can derive from machine learning, algorithms and so forth. Accuracy increased greatly with the algorithm.

Satopӓӓ: There is also a challenge with the current technology. Maybe in the future this will change. But as of today, it is still quite challenging to train machines to make predictions on these kinds of very complex, almost unique events that we’re dealing with. Imagine training a model to predict whether Brexit is actually going to happen by the end of this year or whether there’s going to be some big breakthrough in some emerging technology like quantum computing by a certain year. We haven’t seen anything like this before, so it’s difficult to train machines to predict these kinds of events. Humans, on the other hand, can make these kinds of predictions, but they turn out to be quite noisy.

This doesn’t mean that we cannot have machines and humans working together to come up with even more powerful predictions for these complex events. One thing you can do is to look at past predictions made by people on events where we already know what happened, and then you can train a machine on those predictions. The machines can learn patterns in the human predictions. Now I have a trained model, and when I go into a new future event and try to forecast that, I ask a bunch of people and input those predictions into my machine-learning model. What pops out will be a prediction that can have less noise, less bias and be more accurate.
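
As a deliberately tiny sketch of that idea: the "machine" below is a single exponent that pushes human forecasts away from or toward 50%, fitted on questions whose outcomes are already known. This is one simple recalibration choice assumed for illustration, not the model the researchers describe; the function names and the past record are invented.

```python
def extremize(p, a):
    """Push a probability away from 50% when a > 1, toward 50% when a < 1."""
    return p ** a / (p ** a + (1.0 - p) ** a)

def mean_squared_error(forecasts, outcomes):
    """Average squared gap between forecasts and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)

def fit_exponent(past_forecasts, past_outcomes):
    """Grid-search the exponent that would have scored best on resolved questions."""
    grid = [0.5 + 0.1 * k for k in range(46)]  # 0.5, 0.6, ..., 5.0
    return min(
        grid,
        key=lambda a: mean_squared_error(
            [extremize(p, a) for p in past_forecasts], past_outcomes
        ),
    )

# Invented history: cautious human forecasts that were mostly on the right side.
past_forecasts = [0.6, 0.7, 0.65, 0.6, 0.4, 0.3, 0.35, 0.4, 0.6, 0.4]
past_outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

a = fit_exponent(past_forecasts, past_outcomes)
new_human_forecast = 0.6
print("fitted exponent:", round(a, 1))
print("adjusted forecast:", round(extremize(new_human_forecast, a), 2))
```

On this invented record the fitted exponent is above 1, so a new human forecast of 60% is moved further from 50%. A real system would need far more history and a check on questions held out from the fitting.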

Or we can use machines also to combine multiple predictions in a very powerful way to come up with consensus predictions that are less biased and less noisy or even more informed than any one of the individuals in the group. This way, we can let the machines work together with humans.

Mellers: Hybrid. I think that’s the way we’ll be headed in the future.

Knowledge at Wharton: Based on your model, if the goal is to reduce noise as much as possible, is there any downside to reducing noise too much? Can that go to an extreme as well?

Satopӓӓ: From a theoretical perspective at least, there isn’t. If you look at a noisy forecaster, on a single event it might be that a noisy forecaster gets lucky and gets really accurate. But if we follow a noisy forecaster in the long run for many events, they will be less accurate than another forecaster who is less noisy. So, from the accuracy perspective, having less noise is beneficial in the long run.

But I want to bring up something a little bit different because it’s not all about accuracy. Oftentimes, these predictions are then input into some sort of decision-making process. It’s not the forecaster who is ultimately making the decision; it is that they are acting almost like a consultant to somebody else. If you have a forecaster who is very low on bias, very low on noise, their predictions are much more interpretable. This goes back to what we were talking about, this idea of calibration. If a forecaster like this says that this event is going to happen 80% of the time, we know what that means. That is a probability. I understand how that works. I can trust that and make a decision now.

But suppose now this forecaster is very noisy. Suddenly, we don’t know what that means anymore. It’s not a probability the way we usually think about probabilities. It’s not reliable. And because of that, it’s difficult to make decisions based on very noisy predictions. There is that side to it as well.

Mellers: I completely agree with Ville that from a theoretical perspective, no noise is good noise. But again, these are societal questions of how we want to make decisions and judgments. We’re pretty bad at forecasting. There are numerous studies showing how poor eyewitness testimonies are and how poorly we forecast our pleasure in the future with choosing a house or a job. There is lots of room for improvement here.

There is an interesting thing, I think, that goes on. We find prediction really hard, but we find explanation fairly easy. As soon as an event occurs, we’ve got an explanation for it and it just comes naturally. For example, I don’t get a job at some university; I say, “Oh well, they’re biased. They must be biased against women,” or something. And I’ve got a complex, intricate story about the evil intentions of everyone involved.

I think that we ought to put all this in perspective and focus on what we can do to improve human judgment. We’ll see greater fairness. We’ll see greater equity, if that’s what we want. We’ll see lots of things getting better from noise reduction. I’m happy to be working on this problem and trying to tamp down noise with Ville. It’s been a great research project.

Knowledge at Wharton: What will be the next inquiry in this area?

Satopӓӓ: I think there are a lot of things we could do from here. It really opens up a much more detailed view into different interventions, forecasting accuracy performance. But one thing I wanted to mention here is that in this paper we are almost entirely focusing on how these interventions improve individual forecasters in terms of their bias, noise and information. But often when we make decisions, we don’t do that based on a single forecaster’s opinion. Instead, we will consult a lot of different experts. What we need to do then is to have all these experts somehow come to a consensus that we then will input into our decision-making.

There are many different ways we could do this. Probably the most natural way is to have all of these experts in a single room, let them discuss, share ideas and come up with some sort of consensus. Another is aggregation, which Barb mentioned earlier, the idea that I will ask each one of the experts individually for their predictions. Then I will combine these predictions with some sort of a mathematical tool like averaging. Or we can go and use something even more involved like a prediction market. These are all different ways to harness the wisdom of the crowd.

The literature says that these methods do help to improve accuracy. But exactly how do they do it? Do they tamp down noise? Do they just reduce bias? Or do they find the little bits of information scattered around all the experts and combine that somehow? We don’t really know. Answering these questions is a fantastic application for the BIN model. We could apply the BIN model to these kinds of methods and then understand them in a much more detailed manner. Once we have that understanding, the next step is to develop other methods that are even more powerful in harnessing the wisdom of the crowd.

Mellers: Or we could send out requests for all existing data from our internet friends and re-analyze human predictions using the BIN model to see what exactly is wrong with, say, economic forecasts from experts or stock predictions or disease predictions. We could even do it without algorithms to see what the remaining variability consists of. There are a lot of ways that we could build on this.

Further reading: “The Secret Ingredients of ‘Superforecasting’” (INSEAD Knowledge)