Wharton's Gideon Nave discusses his research about the replicability of social science experiments.

Wharton marketing professor Gideon Nave has collaborated with a multinational team of researchers on a project that aimed to replicate the results of 21 social science experiments published in the journals Nature and Science. Of those 21 replications, only 13 produced results that supported the original studies; the remaining eight, about 38%, failed to reproduce the original findings. The paper is titled “Evaluating the Replicability of Social Science Experiments in Nature and Science between 2010 and 2015” and was published in Nature Human Behaviour. Nave joined Knowledge at Wharton to talk about the paper and what it means for the future of research.

An edited transcript of the conversation follows.

Knowledge at Wharton: We rely on research journals to vet what they publish, but studies like yours have shown that the results of high-profile experiments often can’t be replicated. Do you think there’s a “replication crisis,” as people are calling this?

Gideon Nave: I don’t know if I want to use the word crisis to describe it, but we certainly know that many results that are published in top academic journals, including classic results that are part of textbooks and TED Talks, do not replicate well. That means that if you repeat the experiment with exactly the same materials in a different population, sometimes a very similar one, the results do not seem to hold.

Top academic journals like Science and Nature, which are the ones that we used in this study, have acceptance rates of something like 5% of papers [that are submitted], so it’s not like they don’t have papers to select from. In my view, the replication rates that we have seen in these studies are lower than what you would expect.

Knowledge at Wharton: Can you describe some of the experiments you replicated?

Nave: The experiments that we used were social science experiments involving human participants, either online or in laboratory studies. The experiments we selected also typically had some manipulation, meaning there is an experimental setting in which half of the participants get one treatment and the other half gets another.

For example, we had a study in which people looked at a picture of a statue. In one condition, it was Rodin’s “The Thinker,” and in the other one, it was a man throwing a discus. The researchers’ assumption was that showing people Rodin’s Thinker makes them more analytical, so this was the manipulation. And then they measured people’s religious beliefs. The finding that the paper reported was that when you look at the picture of Rodin and become more analytical, you are less likely to report that you believe in God.

“Even the studies that did replicate well had on average an effect that was only 75% of the original, which means that the original studies probably overstated the size of the effect….”

Knowledge at Wharton: What did you find when you tried to replicate that one?

Nave: This study specifically did not replicate. I think [the problem is] the manipulation itself. I’m not sure if looking at the statue of Rodin makes you more analytical in the first place.

Knowledge at Wharton: You looked at a total of 21 experiments. What were some of the key takeaways from the entire project?

Nave: There is an ongoing debate in the social sciences as to whether there is a [replication] problem or not. Previous projects had failed to replicate a large number of papers published in top journals in psychology and economics, but some researchers dismissed those results. Some said that this was just some kind of statistical fluke … or maybe that the replications were not sufficiently similar to the original [experiments]. We wanted to overcome some of these limitations.

In order to do so, we first sent all of the materials to the original authors and got their endorsement of the experiment. If we got something wrong, they gave us comments. It was a collaboration with the original authors in order to make the replication as close as possible to the original.

The second thing we did is we pre-registered the analysis, so everything was open online. People could go and read what we were doing. Everything was very clear a priori — before we ran the studies — [in terms of] what analyses we would use.

The third thing was using much larger samples than the original. Sample size is a very important factor in the experiment. If you have a large sample, you are more likely to be able to detect effects that are smaller…. The larger your sample is, the better the estimate you have of the effect size. One finding from the previous research that has been done on replicability is that even if studies do replicate, the effect in the replication seems to be smaller than in the original. We wanted to be ready for that. In order to do so, we had samples that were sufficiently large to detect effects that are even half of the original finding.
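As a rough illustration of the power calculation this implies (a sketch with purely illustrative numbers, not the project’s actual design), the snippet below uses statsmodels to show how the required per-group sample grows when you plan to detect an effect half the size of the original.

```python
# Minimal power-analysis sketch (not the project's actual code): how much
# larger a sample must be to detect an effect half the size of the original.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
original_d = 0.5                 # hypothetical original effect size (Cohen's d)
replication_d = original_d / 2   # plan for an effect half as large

n_original = analysis.solve_power(effect_size=original_d, alpha=0.05, power=0.9)
n_replication = analysis.solve_power(effect_size=replication_d, alpha=0.05, power=0.9)

print(f"per-group n to detect d={original_d}: {n_original:.0f}")
print(f"per-group n to detect d={replication_d}: {n_replication:.0f}")
# Required sample size scales roughly with 1/d**2, so halving the target
# effect roughly quadruples the sample needed for the same power.
```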

Knowledge at Wharton: Even in the studies that did replicate, the effect size was much smaller, correct?

Nave: Yes. We’ve seen it in previous studies. Again, in this study, because the samples were so large, the studies that failed to replicate had essentially a zero effect. But then we could tease apart the studies that didn’t replicate from the ones that did replicate. Even the studies that did replicate well had on average an effect that was only 75% of the original, which means that the original studies probably overstated the size of the effect by 33%.

This is something that one would expect to see if there is a publication bias in the literature. If results that are positive are being published, and results that are negative are not being published, you expect to see an inflation of the effect size. Indeed, this is what we saw in the studies. This means that if you want to replicate a study in the future, you probably want to use a larger number of participants than the original had, so you can be sure that you will detect an effect that is smaller than what was reported originally.
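A small simulation can make the publication-bias point concrete. The sketch below assumes a fixed true effect and a filter in which only positive, statistically significant results get “published”; the numbers are invented for illustration.

```python
# Toy simulation of publication bias: when only significant positive results
# are published, the published effect sizes overstate the true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d, n = 0.3, 40          # hypothetical true effect and per-group sample size
published = []

for _ in range(5000):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(true_d, 1.0, n)
    t, p = stats.ttest_ind(treated, control)
    if p < 0.05 and t > 0:   # only "positive" results make it into the literature
        # Population SD is 1 here, so the raw difference approximates Cohen's d.
        published.append(treated.mean() - control.mean())

print(f"true effect: {true_d}")
print(f"mean published effect: {np.mean(published):.2f}")  # noticeably larger than 0.3
```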

Knowledge at Wharton: This was a collaboration of all of these researchers. What was their reaction to the results?

Nave: The reactions were pretty good, overall. When this crisis debate started and there were many failures to replicate the original findings, replication was not a normal thing to do. It was perceived by the authors of the original studies as something that is very hostile. I have to say, it doesn’t feel nice when your own study doesn’t replicate. But now after a few years, it’s becoming more and more normal. I think there is more acceptance that it is OK if your study doesn’t replicate. It does not mean that you did something bad on purpose. It can happen, and the researchers were quite open to this possibility.

“Sample size is a very important factor in the experiment. If you have a large sample, you are more likely to be able to detect effects that are smaller.”

If you look at the media coverage of our studies, one of the authors — the author of the Rodin analytical thinking and religious belief study — said that the [original] study was silly in the first place. [We have] commentaries from the eight authors of the papers that did not replicate well. In some cases they find reasons why their experiment would not replicate — for example, the population is different. Many times, the subject pool has changed. If you are studying things like the influence of technology on behavior, then over the few years that went by between the original study and the replication, maybe there could have been changes in our reactions to technology and how technology influences us. This could, for example, be a reason why a study fails to replicate. But overall, this is a very constructive process, and we’ve seen positive responses overall, even among those whose findings could not be replicated by us.

Knowledge at Wharton: Does this sort of failure to replicate occur more often in social science experiments versus medical ones, for example? If so, what could be done differently?

Nave: I don’t think that it’s more likely to happen in the social sciences. I think that one important thing that was driving this replicability movement in the social sciences is that there were better settings to test human subjects, either online or using laboratories that were professionally designed to run a large number of participants. We have to recognize that this is not the case in many other branches of science. For example, an MRI scan is something that takes two hours to do and costs $400. You would not expect a replicability researcher to run 500 participants in the MRI because it will take forever and cost a lot of money. We would have to accept the limitations of a small sample, and of the limited capacity to replicate that comes with boundaries on the number of participants that we can run.

Knowledge at Wharton: What kinds of implications does your study have?

Nave: One important feature of our study is that, before we ran the experiments, we recruited more than 200 scientists and had them predict what will happen in the replication. We asked them what they thought the probability is that the study would replicate. We also had them participate in a prediction market. There were 21 prediction markets for the 21 studies. In these markets, our participants started with some amount of money, and they could buy and sell stocks for the different experiments. Every stock at the end of the study would give them 100 cents if the study replicated well, and zero cents if it didn’t replicate. All of the stock prices started at 50 cents, and the prices slightly changed as a function of people’s beliefs about whether the experiments would replicate or not.
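For readers unfamiliar with how such markets price binary outcomes, here is a minimal sketch of a market maker for a “will this study replicate?” contract, using a logarithmic market scoring rule. This is one standard mechanism chosen for illustration; the interview does not describe the trading platform the project actually used.

```python
# Minimal sketch of a binary "will this study replicate?" market priced by a
# logarithmic market scoring rule (LMSR). Each share pays $1 (100 cents) if
# the study replicates and $0 otherwise; this is an illustrative mechanism,
# not necessarily the one the project used.
import math

class ReplicationMarket:
    def __init__(self, b=100.0):
        self.b = b           # liquidity parameter: larger b means prices move more slowly
        self.q_yes = 0.0     # outstanding "replicates" shares
        self.q_no = 0.0      # outstanding "does not replicate" shares

    def _cost(self, q_yes, q_no):
        return self.b * math.log(math.exp(q_yes / self.b) + math.exp(q_no / self.b))

    def price_yes(self):
        """Current price of a 'replicates' share, interpretable as a probability."""
        e_yes = math.exp(self.q_yes / self.b)
        e_no = math.exp(self.q_no / self.b)
        return e_yes / (e_yes + e_no)

    def buy_yes(self, shares):
        """Buy 'replicates' shares; returns the cost and pushes the price up."""
        before = self._cost(self.q_yes, self.q_no)
        self.q_yes += shares
        return self._cost(self.q_yes, self.q_no) - before

market = ReplicationMarket()
print(f"opening price: {market.price_yes():.2f}")      # 0.50, like the 50-cent start
cost = market.buy_yes(60)                               # an optimistic trader buys in
print(f"trade cost: {cost:.2f}, new price: {market.price_yes():.2f}")
```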

At the end of the study, we looked at the final stock prices. A high stock price implied that the market thought the experiment was more likely to replicate, and these prices very closely matched the results of the experiments. In fact, none of the studies for which the closing price was lower than 50 cents replicated. Only three studies with closing prices higher than 50 cents could not be replicated, which shows that people did know which studies would replicate before we even ran them.

“Replicability should be an integral part of the scientific process.”

That’s great news. It tells us that our scientists have the capacity to tell apart experiments that replicate and those that do not. I think one of the takeaways from this, which also relates to our future work, is the need to find out which properties of these studies predict whether they will replicate or not. It’s very clear that the sample size and the P value, which represents the strength of statistical evidence in the data, are very important. It seems like studies that had small samples and a high P value were less likely to replicate. The strength of the theory also seems to be a predictor.

From a practical angle, I think we should expect effects in replication studies to be smaller than the original ones. If one wants to replicate an original study, I would definitely recommend not calculating the sample size based on the effect of the original study, but having a sufficient number of participants to detect 75% of the original effect. I think this is also a lesson we can generalize to the previous replication projects. I participated in one project that aimed to replicate studies in economics, and there were [similar] studies in psychology. These studies had large samples, but the samples were calibrated to detect the original effects. Because of that, it’s very possible that they failed to replicate findings because they could not detect effects that are smaller than the original, as we now know one should expect.
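A quick back-of-the-envelope calculation shows what powering for 75% of the original effect implies for sample size (illustrative arithmetic only, not a figure from the paper):

```python
# Required sample size scales roughly with 1 / effect_size**2, so powering a
# replication to detect 75% of the original effect multiplies the needed
# sample by about (1 / 0.75)**2 relative to one calibrated to the original.
shrinkage = 0.75                     # expected replication effect as a share of the original
multiplier = (1.0 / shrinkage) ** 2
print(f"sample-size multiplier: {multiplier:.2f}")   # ~1.78
```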

Knowledge at Wharton: Could this research change how larger journals and high-profile journals like Nature and Science accept papers?

Nave: I think it will, and I think it already has. If we look at the studies that we replicated, all of the experiments that failed to replicate took place between the years 2010 and 2013. From the last two years of the studies that we selected, everything replicated. These are only four studies, so I’m not going to make bold claims, like “everything now replicates.” But it’s very clear that there were changes in journal policies. This is especially true for psychology journals, where one now has to share the data and the analysis scripts. You get a special badge when you pre-register the study. Pre-registration is a very important thing. It’s committing to an analysis plan — the number of participants that you will run and how you will analyze the data — before you do [the experiment]. When you do that, you limit the amount of bias that your own decisions, when analyzing the data, can induce. There were previous studies, conducted mostly here at Wharton, showing that when you have some flexibility in the analysis, you are very likely to find results that are statistically significant but do not reflect a real effect. By pre-registering, researchers tie their own hands before collecting the data, and that allows them to generate results that are more robust and replicable.
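To see why analysis flexibility matters, here is a toy simulation (invented numbers, not the studies Nave refers to) in which there is no true effect, yet the analyst tries several outcome measures and reports whichever comes out significant.

```python
# Toy demonstration of "researcher degrees of freedom": with no true effect,
# testing several outcome measures and keeping the best p-value inflates the
# false-positive rate well above the nominal 5% that pre-registration preserves.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_outcomes, sims = 30, 5, 2000
false_positives = 0

for _ in range(sims):
    # Two groups with no true difference, measured on several independent outcomes.
    group_a = rng.normal(size=(n, n_outcomes))
    group_b = rng.normal(size=(n, n_outcomes))
    p_values = [stats.ttest_ind(group_a[:, k], group_b[:, k]).pvalue
                for k in range(n_outcomes)]
    if min(p_values) < 0.05:          # report whichever outcome "worked"
        false_positives += 1

print(f"false-positive rate with flexible analysis: {false_positives / sims:.2f}")
# Expect roughly 1 - 0.95**5 ~= 0.23, versus 0.05 under a single pre-registered test.
```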

“If a study doesn’t replicate, you’d better know it before building on it and standing on the shoulders of the researchers that conducted it.”

Knowledge at Wharton: What’s the next step in your own research?

Nave: With relation to replicability, we are now looking at what made people predict so well whether those studies would replicate or not. One of the things that we’ve done here was also to try to use machine learning to go over the papers — using features such as the P value, the sample size, the text of the papers and some [other] information — and see whether an algorithm can predict as well as the humans whether studies will replicate or not. For this specific [experiment], the algorithm can detect replicability in something like 80% of the cases, which is not bad at all. So, we are working on automating this process.
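As an illustration of what such an automated predictor might look like (a sketch with hypothetical features and made-up training rows, not the team’s actual model or data), one could combine numeric study features with the paper’s text in a simple classifier:

```python
# Illustrative sketch of a replicability predictor: combine numeric study
# features (p-value, sample size) with the paper's text in a logistic
# regression. The training rows below are invented for demonstration.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical data: one row per original study.
studies = pd.DataFrame({
    "p_value":     [0.04, 0.001, 0.03, 0.0004],
    "sample_size": [40, 320, 55, 500],
    "abstract":    ["priming manipulation belief", "incentives effort field",
                    "statue analytic religiosity", "large sample preregistered"],
    "replicated":  [0, 1, 0, 1],
})

features = ColumnTransformer([
    ("text", TfidfVectorizer(), "abstract"),                  # paper text
    ("numeric", "passthrough", ["p_value", "sample_size"]),   # statistical evidence
])
model = Pipeline([("features", features),
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(studies, studies["replicated"])
print(model.predict_proba(studies)[:, 1])   # predicted replication probabilities
```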

Another thing is just to continue to replicate. Replicability should be an integral part of the scientific process. We have neglected it, maybe for some time. Maybe it was because people were perceived as belligerent or aggressive if they tried to challenge other people’s views. But when you think of it, this is the way science has progressed for many years. If a study doesn’t replicate, you’d better know it before building on it and standing on the shoulders of the researchers that conducted it.

The Rodin study that I just described had about 400 citations in as little as four years, and there were studies [that failed to replicate] that had even more citations. This is Science and Nature. These papers have a high impact on many disciplines, and they are accepted based on their potential impact. So, these early results are important. I think that we should keep replicating findings, and researchers and journal editors should be aware that if results are not replicable, it could lead to a waste of people’s time, of people’s careers and of public money that is used to generate additional studies based on the original result.