Financial advisors, TV pundits and an endless stream of experts and nonexperts readily offer their predictions about the future, whether related to the stock market, international relations or the next Presidential election. But how good are those predictions? As it turns out, most of us are not very good at making forecasts, and even the best-known experts do not have a solid track record.
So what makes a good forecaster? In the new book, Superforecasting: The Art and Science of Prediction, Wharton management professor Philip Tetlock and co-author Dan Gardner look into what makes people good forecasters. Tetlock, who is also a professor of psychology at Penn’s School of Arts and Sciences, recently spoke to Knowledge at Wharton about his decades of research on the topic and how you can incorporate some of these forecasting techniques into your own life.
An edited transcript of the conversation follows.
Knowledge at Wharton: Thanks to people like New York Times columnist Thomas L. Friedman and Nate Silver [editor-in-chief of ESPN’s FiveThirtyEight blog], and to the rise of big data, there seems to be a lot of interest in forecasting. I was really surprised to learn from your book that despite all this interest in forecasting, and in people who have fashioned themselves as forecasters, forecasting itself is not very well studied or analyzed.
Philip Tetlock: I think that’s fair to say. It’s pretty threatening to keep score of your forecasting accuracy. Imagine you’re a big-shot pundit. What incentive would you have to submit to a forecasting tournament in which you had to play on a level playing field against ordinary human beings? The answer is, not much, because the best possible outcome you could obtain is effectively a tie: you’re expected to win, so winning gains you nothing. And there’s a good chance, our research suggests, that you’re not going to win.
Knowledge at Wharton: The book was actually decades in the making, and your research into this has stretched back for years. In some ways, this all started with a dart-throwing chimp. Could you tell us a little bit about that story? In the book, you said that people don’t exactly get the right takeaways out of that study, but I thought it was interesting to show what happens when we try to test forecasting.
Tetlock: In our early work, which … goes back into the mid-1980s, we did use the metaphor of the dart-throwing chimp to capture a baseline for performance, which is, how much better can you do than chance? If you had a system that was just generating forecasts by chance, how well would you do relative to that?
That actually is a baseline that some people can’t beat. They can’t beat it for a lot of reasons. Sometimes the environment is just hopelessly difficult. If you were trying to bet on a roulette wheel in Las Vegas, you’re not going to be able to do any better than a dart-throwing chimp. But people sometimes fail to beat the dart-throwing chimp even in environments where there are predictable regularities that could be picked up if you were being astute enough.
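To make that chance baseline concrete, here is a minimal Python sketch. The forecasts, outcomes and numbers are hypothetical, not data from the tournaments; it simply scores probability forecasts with a Brier score, the squared-error measure commonly used to grade forecasts of this kind, and compares a somewhat-informed forecaster against a "dart-throwing chimp" that always says 50%.

```python
import random

def brier(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(outcomes)

random.seed(0)
# Hypothetical resolved yes/no questions: about 30% resolve "yes".
outcomes = [1 if random.random() < 0.3 else 0 for _ in range(500)]

# Chance baseline ("dart-throwing chimp"): always 50%, which scores 0.25 by construction.
chimp = [0.5] * len(outcomes)

# A somewhat-informed forecaster: leans toward the right answer, with noise.
informed = [min(1.0, max(0.0, 0.3 + 0.4 * o + random.gauss(0, 0.15))) for o in outcomes]

print(f"chimp baseline Brier score: {brier(chimp, outcomes):.3f}")
print(f"informed forecaster Brier : {brier(informed, outcomes):.3f}")  # beats 0.25 only if skill is real
```

Beating that 0.25 line is the minimal bar Tetlock describes; whether it can be beaten at all depends on how much irreducible uncertainty the environment contains.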
Knowledge at Wharton: In Superforecasting, you point out that what a lot of people take away from the chimp study is that all predictions are bad, that forecasting is bad. But in fact what you were pointing out really is that there are actually limits on predictability and that it’s not all bad.
Tetlock: That’s right. You don’t want to be too hard on people because there’s a lot of irreducible uncertainty in some environments. It’s very difficult to bring it down below a certain point. It’s unfair to portray people as being dumb, in some sense, if they’re failing to do something that’s impossible. Of course, we don’t know what’s impossible until we try, until we try in earnest…. You don’t discover how good it’s possible to become in a particular forecasting environment until you run forecasting tournaments, competitive tournaments. You plug in your best techniques for maximizing accuracy. You see how good you can get.
That’s essentially what we did in the forecasting tournaments with the U.S. government, sponsored by the intelligence community’s Intelligence Advanced Research Projects Activity [IARPA]. These are forecasting tournaments that were run between 2011 and 2015, involving tens of thousands of forecasters trying to answer about 500 questions posed by the intelligence community over that period of time. We found that some people could do quite a bit better than the dart-throwing chimp, and they could beat some more demanding baselines as well.
Knowledge at Wharton: In the book, you talk a little bit about where you found these forecasters who were part of your study, which is called the Good Judgment Project. Can you talk a little bit about where you recruited these people from? And then also about how these tournaments actually took place, what people were called on to do and how they did?
“How much better can you do than chance?…That actually is a baseline that some people can’t beat. They can’t beat it for a lot of reasons. Sometimes the environment is just hopelessly difficult.”
Tetlock: We were very opportunistic. We recruited forecasters by advertising through professional societies and through blogs. A number of high-profile bloggers helped us recruit forecasters, people like Tyler Cowen and Nate Silver. Plus, we knew quite a few people from the earlier work that I had done on expert political judgment. We were able to gather initially a group of several thousand, and we were able to build on that in subsequent years.
I have to be careful about making big generalizations about how good or bad people are as forecasters. As I mentioned before, you can make people look really bad if you want to. You can pose intractably difficult questions…. Or you can make people look really good. You can pose questions that aren’t all that hard. So you want to be wary of research that does cherry picking. There are some aspects of that in some of the literature.
What we were looking for was a process of generating questions that wasn’t rigged one way or the other. The method we came up with was generating questions through the U.S. intelligence community. They were questions that people inside the U.S. intelligence community felt would be of national security interest and relevance and reasonably representative of the types of tasks that intelligence analysts are asked to do. [T]ypically they asked people to see out into the future several months, occasionally a bit longer, occasionally shorter. We scored the accuracy of their judgments over time.
We didn’t have people make judgments one way or the other. It wasn’t yes or no. We had people make judgments on what’s called a probability scale ranging from zero to one. We carefully computed accuracy over time. We identified some people who were really good at it — we called them superforecasters — and they were later assembled into super teams of superforecasters. They essentially dominated the tournament over the next four years. But we did a number of other experiments as well, looking for techniques that could be used to improve accuracy, and we found some.
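The bookkeeping Tetlock describes, probability judgments on a zero-to-one scale scored question by question over time, can be sketched in the same hedged way as above. The names, numbers and the idea of flagging the top scorers are illustrative assumptions, not the Good Judgment Project’s actual procedure.

```python
from statistics import mean

def brier(p, outcome):
    """Squared error for one resolved yes/no question (lower is better)."""
    return (p - outcome) ** 2

# forecaster -> list of (probability given, resolved outcome 0 or 1); hypothetical records
records = {
    "alice": [(0.8, 1), (0.2, 0), (0.7, 1), (0.1, 0)],
    "bob":   [(0.5, 1), (0.5, 0), (0.6, 1), (0.4, 0)],
    "carol": [(0.9, 0), (0.3, 1), (0.8, 0), (0.2, 1)],
}

# Average accuracy per forecaster, then rank. Over hundreds of questions, the
# consistently best performers are the ones a tournament would flag as candidates
# for "superforecaster" teams.
scores = {name: mean(brier(p, o) for p, o in qs) for name, qs in records.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:5s} mean Brier = {score:.3f}")
```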
Knowledge at Wharton: These superforecasters came from all walks of life. But one of the things you point out in the book is that what makes a good forecaster is really how you think. Can you talk a little bit about what you mean by that, and what are some of the unifying characteristics of superforecasters?
Tetlock: When you ask people in the political world, “Who has good judgment?” the answer typically is, “People who think like me.” Liberals tend to think that liberals have good judgment and good forecasting judgment, and conservatives tend to think that they are better at it. It turns out to be the case that good forecasting accuracy is not very closely associated with ideology. There’s a slight tendency for people who are superforecasters to be more moderate and less ideological, but there are lots of superforecasters who have strong opinions. What distinguishes superforecasters is their ability to put aside their opinions, at least temporarily, and just focus on accuracy. That’s a very demanding exercise for people.
Knowledge at Wharton: Are there ways to make even superforecasters better, conditions or environments to make them super superforecasters?
Tetlock: Eventually you’re going to reach a point where you’re not going to get any better because, as I mentioned, the environment itself has some degree of irreducible uncertainty. So no matter how good you are, you’re probably not going to do a very good job predicting what the value of Google stock is going to be next week. So there are some things that are very difficult to do. It’s not clear that even using superforecasters is going to let you make appreciable headway on that. But there are many things that are quite doable that we previously didn’t think were doable, and there’s a lot of room for improving the accuracy of probability judgments on those things.
Those are things like predicting whether international conflicts are going to escalate or deescalate, whether certain treaties are going to be signed or approved by legislatures, or whether Greece is going to leave the eurozone. So there are a lot of problems that have relevance to financial markets and to business decisions where there is potential to improve probability judgment, and we have now shown that experimentally in the IARPA tournament. But people typically don’t make probability judgments. They rely on vague verbiage forecasts. You’ve heard people say, “Well, I think it’s possible. This could happen. This might happen. It’s likely.” Those terms are not all that informative.
“What distinguishes superforecasters is their ability to put aside their opinions, at least temporarily, and just focus on accuracy. That’s a very demanding exercise for people.”
If I say that something could happen — for example, Greece could leave the eurozone by the end of 2017 — what does that mean? I could mean I think there’s a probability of 1% or 99%. Or we could be hit by an asteroid tomorrow.…[A]sking people to make crude, quantitative judgments, which become progressively more refined over time, is a very good way to both keep score and get better at it.
Knowledge at Wharton: We have all of these people in the world who have fashioned themselves as professional forecasters, or pundits on TV. They are really television personalities, media personalities, and it seems that to do that job it’s almost a cult of personality. You don’t want to be proven wrong. You would never admit that you’re wrong. You’re just going to keep kicking it down the road and say, “No, it’s going to happen.” But what you found is that one of the things that unites superforecasters is that they’re willing to be proven wrong, and they are willing to look at evidence, rethink and pivot. I found that kind of ironic. You tell a related story in the book about foxes versus hedgehogs that was very interesting.
Tetlock: The fox-hedgehog metaphor is drawn out of a surviving fragment of poetry from the Greek warrior poet Archilochus, 2,500 years ago. Scholars have puzzled over it over the centuries. It runs something like this, and of course, I don’t know ancient Greek, so I’m taking on faith that this is what it actually says: “The fox knows many things, but the hedgehog knows one big thing.”
Now you can think of hedgehogs in debates over political and economic issues as people who have a big ideological vision. Tom Friedman might be animated by a vision of, say, globalization: the world is flat. Libertarians are animated by the vision that there are free market solutions for the vast majority of problems that beset us. There are people on the left who see the need for major state intervention to address various inequities. There are environmentalists who think we’re on the cusp of an apocalypse of some sort. So you have people who are animated by a vision, and their forecasts are informed largely by that vision.
Whereas the foxes tend to be more eclectic. They kind of pick and choose their ideas from a variety of schools of thought. They might be a little bit environmentalist and a little bit libertarian, or they might be a little bit socialist and a little bit hawkish on certain national security issues. They blend things in unusual ways, and they are harder to classify politically.
Now in the early work, we found that the foxes, who were more eclectic in their style of thinking, were better forecasters than the hedgehogs. In the later work, we found something similar: people who scored high on psychological measures of active open-mindedness and need for cognition tended to do quite a bit better as forecasters.
Knowledge at Wharton: What does that mean for trying to get more accurate forecasters? How do we get people to listen to the foxes when they might not be the sexiest or the most prominent or the people who we want to look at or listen to all the time?
Tetlock: That’s a bit of a dilemma. Imagine you are a producer for a major television show, and you have a choice between someone who’s going to come on the air and tell you … something decisive and bold and interesting — the eurozone is going to melt down in the next two years or the Chinese economy is going to melt down or there’s going to be a jihadist coup in Saudi Arabia. He’s got a big, interesting story to tell, and the person knows quite a bit and can mobilize a lot of reasons to support the doom-and-gloom prediction, say, on the eurozone or China or Saudi Arabia…. The person is charismatic and forceful and can generate a lot of reasons why he or she is right.
As opposed to someone who comes on and says, “Well, on the one hand there’s some danger the eurozone is going to melt down. But on the other hand there are these countervailing forces. On balance, probably nothing dramatic is going to happen in the next year or so, but it’s possible that this could work.” Who makes better television? To ask the question is to answer it.
There is a preference for hedgehogs in part because hedgehogs generate better sound bites. People who generate better sound bites generate better media ratings, and that is what people get promoted on in the media business. So there is a bit of a perverse inverse relationship between having the skills that go into being a good forecaster and having the skills that go into being an effective media presence.
Knowledge at Wharton: A lot of the book talks about framing — it’s not just about finding people who can make good forecasts, but it’s also about finding the right questions, finding the right way to frame the problem and breaking down a big problem into smaller clusters. There’s so much that goes into forecasting other than the actual forecast that comes out of it. Are people thinking enough about this as well, in addition to finding people who give good forecasts?
Tetlock: I see that as one of the big objectives of the next generation of forecasting tournaments: the focus on generating not just good answers, but good questions. In the book, we talk about the parable of Tom Friedman and Bill Flack. Tom Friedman is, of course, a famous New York Times columnist, a Pulitzer Prize winner who is a regular at Davos and the White House and circulates in networks of power. Bill Flack is an anonymous, retired hydrologist in Nebraska who also is a superforecaster. We know a huge amount about Bill Flack’s forecasting track record because he answered a very large number of questions in the course of the tournament and demonstrated he could do so effectively. But we know virtually nothing about Tom Friedman’s forecasting track record, notwithstanding that he’s written a great deal over the last 35 years; he’s a powerful analyst and writer, and he does many things very well. But there’s no way really to reconstruct, with any degree of certainty, how good a forecaster he is. Tom Friedman has detractors and he has admirers. His admirers might say, “Well, he was right that it was a bad idea to expand NATO eastward because it would provoke a nationalist backlash in Russia.” His detractors might say, “He was wrong about Iraq because he supported the 2003 invasion.” People have a lot of opinions about those things.
We did a careful analysis of Tom Friedman’s columns, and one of the things we noticed is that even though it’s very difficult to discern after the fact whether or not he’s a good forecaster, it is possible to detect some really good questions. He’s a pretty darn good question-generator. We’ve actually begun to draw on some of his ideas for questions. They tend to be rather open-ended, but we’ve managed to translate some of them into questions for future forecasting tournaments.
Let me give you an example from the past that illustrates the tension between being a super question generator and a superforecaster. So in late 2002‒early 2003 before the Iraq invasion, Tom Friedman wrote what I thought was a really quite brilliant column on Iraq, in which he posed the following question, which really cut to the essence of a key issue in deciding whether to go into Iraq. He asked: Is Iraq the way it is today because Saddam Hussein is the way he is, or is Saddam Hussein the way he is because Iraq is the way it is?
Knowledge at Wharton: The chicken and the egg question.
Tetlock: The chicken and the egg. And what would happen if you took away Saddam Hussein? Would the country disintegrate into a war of all against all? Or would it move toward a Jeffersonian liberal democracy in the next 15 or 20 years? Now maybe not quite that fast, but things would move in that direction.
Tom Friedman didn’t know the answer to that question. Many people think he made a big mistake in supporting the invasion of Iraq in 2003. But he was shrewd enough to pose the right question. If we’d been running forecasting tournaments in late 2002 and early 2003, that would have been something we would have wanted very much to include in that exercise. The right way to think about Tom Friedman and Bill Flack is that they are complementary. Tom Friedman’s greatest contribution to forecasting tournaments may well be his perspicacity in generating incisive questions. He may be a good forecaster, too, but we just don’t know that yet.
Knowledge at Wharton: But in order to have good forecasting, we need the Tom Friedmans of the world and the Bill Flacks. To me it would seem it’s just a question of trying to get them together in the right ways, in the right permutations, to get better predictions.
Tetlock: That’s where we come around in the book. It’s not really Tom versus Bill; it’s Tom and Bill. It should be symbiotic.
Knowledge at Wharton: In the age of supercomputers and machine learning, how do you think the role of human forecasting is going to change? How does using computers, using data, complement or even compete with human forecasting?
“A lot of people spend quite a bit of money on advice about the future that probably isn’t worth the amount of money they are spending on it.”
Tetlock: In the book, we conducted an interview with David Ferrucci. When he was an IBM scientist, he was responsible for developing a famous computer program known as Watson, which defeated the best human Jeopardy players. We asked him a number of questions about his views on human versus machine forecasting. One line of questioning was particularly interesting. It was very clear to him that it would be possible for a system like Watson to answer the following question reasonably readily: Which two Russian leaders traded jobs in the last five years? For that question, Watson could search its historical database and figure it out. Reframe the question as: Will those same Russian leaders change jobs in the next five years? Would Watson have any capacity to answer a question like that? And [Ferrucci’s] answer was, no. The question was, how difficult would it be to reconfigure Watson so that it could answer a question like that? His answer was, massively difficult. It would not be something that would be easy to accomplish any time in the near future.
I think that’s probably true. I’m not an expert in that area, but he obviously is. But when I think about what’s required to do the sorts of things that superforecasters collectively do, the amount of informed guesswork that goes into constructing a reasonable forecast, it’s difficult for me to imagine existing AI — artificial intelligence systems — doing that in the near term.
Knowledge at Wharton: If someone is reading the book and wants to become a better forecaster in their daily lives, what do you hope that people will apply from this book?
Tetlock: A lot of people spend quite a bit of money on advice about the future that probably isn’t worth the amount of money they are spending on it. They have no way of knowing that because they have no way of knowing the track record of the people whose advice they are seeking.
The best example of that is probably in the domain of finance, where a lot of money changes hands and is directed to people who claim to have some ability to predict the course of financial markets. That is an extraordinarily difficult thing to do. I’m not saying it’s impossible, or that nobody can do it any better than a dart-throwing chimp, but it’s a very difficult thing to do. So I think people should be more skeptical about the people to whom they turn for advice about possible futures; finance would be a case in point. But more generally, they should be very skeptical of the pundits they read and the claims that politicians and other people make about the future as well. It’s very common for people to make bold claims about the future and offer no evidence for their track records. I would say it’s almost universal.
Knowledge at Wharton: If someone is making a bold claim, is that the point where we should become suspicious?
Tetlock: Well, the bolder the claim the more the burden of proof should fall on the person to demonstrate that he or she has a good track record.
Knowledge at Wharton: It seems to me that, more often, the bolder the claim, the less likely someone is to question that person.
Tetlock: Well, that’s a great point. That’s a point about human psychology. We take our cues about whether somebody knows what he or she is talking about from how confident he or she seems to be. And the more confident you are, the more likely you’re going to be able to bluster your way through the conversation. That’s a problem, and it suggests that people need to think a little bit more carefully when they make appraisals of competence and not rely quite as heavily as they do on what we call the confidence heuristic. It is true that confidence is somewhat correlated with accuracy, but it’s also possible for manipulative human beings to use that heuristic and turn us into money pumps.
Learn more about the Good Judgment Project: https://www.gjopen.com/