Wharton marketing professor Gideon Nave has collaborated with a multinational team of researchers on a project that aimed to replicate the results of 21 social science experiments published in the journals Nature and Science. According to the team’s research, only 13 of the 21 replications produced results consistent with the original studies; the remaining eight, roughly 38%, failed to reproduce the original findings. The paper, titled “Evaluating the Replicability of Social Science Experiments in Nature and Science Between 2010 and 2015,” was published in Nature Human Behaviour. Nave joined Knowledge@Wharton to talk about the paper and what it means for the future of research.
An edited transcript of the conversation follows.
Knowledge@Wharton: We rely on research journals to vet what they publish, but studies like yours have shown that the results of high-profile experiments often can’t be replicated. Do you think there’s a “replication crisis,” as people are calling this?
Gideon Nave: I don’t know if I want to use the word crisis to describe it, but we certainly know that many results that are published in top academic journals, including classic results that are part of textbooks and TED Talks, do not replicate well. That means that if you repeat the experiment with exactly the same materials in a different population, sometimes a very similar one, the results do not seem to hold.
Top academic journals like Science and Nature, which are the ones that we used in this study, have acceptance rates of something like 5% of papers [that are submitted], so it’s not like they don’t have papers to select from. In my view, the replication rates that we have seen in these studies are lower than what you would expect.
Knowledge@Wharton: Can you describe some of the experiments you replicated?
Nave: The experiments that we used were social science experiments involving human participants, either online or in laboratory studies. The experiments we selected also typically had some manipulation, meaning an experimental setting in which half of the participants get some treatment and the other half gets another.
For example, we had a study in which people looked at a picture of a statue. In one condition, it was Rodin’s “The Thinker,” and in the other one, it was a man throwing a discus. The assumption of the researchers was that when you show people Rodin’s Thinker, it makes them more analytical, so this was the manipulation. And then they measured people’s religious beliefs. The finding that the paper reported was that when you look at the picture of Rodin’s Thinker and become more analytical, you are less likely to report that you believe in God.
“Even the studies that did replicate well had on average an effect that was only 75% of the original, which means that the original studies probably overstated the size of the effect….”
Knowledge@Wharton: What did you find when you tried to replicate that one?
Nave: This study specifically did not replicate. I think [the problem is] the manipulation itself. I’m not sure that looking at Rodin’s statue makes you more analytical in the first place.
Knowledge@Wharton: You looked at a total of 21 experiments. What were some of the key takeaways from the entire project?
Nave: There is an ongoing debate in the social sciences as to whether there is a [replication] problem or not. Previous projects that failed to replicate a large number of papers published in top psychology and economics journals were dismissed by some researchers. Some said that this was just some kind of statistical fluke … or maybe that the replications were not sufficiently similar to the original [experiments]. We wanted to overcome some of these limitations.
In order to do so, we first sent all of the materials to the original authors and got their endorsement of the experiment. We also got comments from them in case we had gotten something wrong. We collaborated with the original authors in order to replicate [the experiment] as closely as possible.
The second thing we did was pre-register the analysis, so everything was open online. People could go and read what we were doing. Everything was very clear a priori — before we ran the studies — [in terms of] what analyses we would use.
The third thing was using much larger samples than the original. Sample size is a very important factor in the experiment. If you have a large sample, you are more likely to be able to detect effects that are smaller…. The larger your sample is, the better your estimate of the effect size, and the better your capacity to detect smaller effects. One finding from previous research on replicability is that even when studies do replicate, the effect in the replication tends to be smaller than in the original. We wanted to be ready for that. In order to do so, we had samples that were sufficiently large to detect effects that are even half of the original finding.
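To give a sense of the arithmetic involved, here is a minimal sketch in Python — not the project’s actual code, and with purely illustrative effect sizes — of how the required sample grows when a replication is powered to detect an effect half the size of the original:

```python
# Minimal sketch (not the project's code): how much larger a sample must be
# to detect an effect half as large as the one originally reported.
# The effect sizes below are illustrative placeholders, not values from the paper.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
original_d = 0.5           # hypothetical Cohen's d reported in an original study
half_d = original_d / 2    # replication powered to detect half that effect

n_original = analysis.solve_power(effect_size=original_d, alpha=0.05, power=0.9)
n_half = analysis.solve_power(effect_size=half_d, alpha=0.05, power=0.9)

print(f"Per-group sample to detect d={original_d}: {n_original:.0f}")
print(f"Per-group sample to detect d={half_d}: {n_half:.0f}")
# Halving the target effect size roughly quadruples the required sample,
# since the required n scales with 1/d^2.
```

The same kind of calculation, run with 75% of the original effect instead of half, underlies the practical recommendation Nave makes later in the interview.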
Knowledge@Wharton: Even in the studies that did replicate, the effect size was much smaller, correct?
Nave: Yes. We’ve seen it in previous studies. Again, in this study, because the samples were so large, the studies that failed to replicate had essentially a zero effect. But then we could tease apart the studies that didn’t replicate from the ones that did replicate. Even the studies that did replicate well had on average an effect that was only 75% of the original, which means that the original studies probably overstated the size of the effect by 33%.
This is something that one would expect to see if there is a publication bias in the literature. If results that are positive are being published, and results that are negative are not being published, you expect to see an inflation of the effect size. Indeed, this is what we saw in the studies. This means that if you want to replicate the study, you probably in the future want to use a number of participants that is larger than what you had in the original, so you can be sure that you will detect an effect that is smaller than what was reported originally.
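The arithmetic behind the 33% figure is simply the ratio of the two estimates: if the replicated effect is 75% of the original, the original is about one third larger than what the replication finds.

```latex
% If the replicated effect is 75% of the original estimate, then the original
% is 1/0.75 = 1.33 times the replicated effect, i.e., roughly 33% larger.
\[
d_{\text{rep}} = 0.75\, d_{\text{orig}}
\quad\Longrightarrow\quad
\frac{d_{\text{orig}}}{d_{\text{rep}}} = \frac{1}{0.75} \approx 1.33
\]
```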
Knowledge@Wharton: This was a collaboration of all of these researchers. What was their reaction to the results?
Nave: The reactions were pretty good, overall. When this crisis debate started and there were many failures to replicate the original findings, replication was not a normal thing to do. It was perceived by the authors of the original studies as something that is very hostile. I have to say, it doesn’t feel nice when your own study doesn’t replicate. But now after a few years, it’s becoming more and more normal. I think there is more acceptance that it is OK if your study doesn’t replicate. It does not mean that you did something bad on purpose. It can happen, and the researchers were quite open to this possibility.
“Sample size is a very important factor in the experiment. If you have a large sample, you are more likely to be able to detect effects that are smaller.”
If you look at the media coverage of our studies, one of the authors, the one behind the Rodin analytical-thinking and religious-belief study, said that the [original] study was silly in the first place. [We have] commentaries from the eight authors of the papers that did not replicate well. In some cases they find reasons why their experiment would not replicate — for example, the population is different. Many times, the subject pool has changed. If you are studying things like the influence of technology on behavior, then over the few years between the original study and the replication, there could have been changes in our reactions to technology and how technology influences us. This could, for example, be a reason why a study fails to replicate. But overall, this is a very constructive process, and we’ve seen positive responses overall, even among those whose findings could not be replicated by us.
Knowledge@Wharton: Does this sort of failure to replicate occur more often in social science experiments versus medical ones, for example? If so, what could be done differently?
Nave: I don’t think that it’s more likely in the social sciences. I think one important thing driving this replicability movement in the social sciences is that there were better settings to test human subjects, either online or in laboratories professionally designed to run a large number of participants. We have to recognize that this is not the case in many other branches of science. For example, an MRI scan is something that takes two hours to do and costs $400. You would not expect a replicability researcher to run 500 participants in the MRI scanner because it would take forever and cost a lot of money. We would have to accept the limitation of a small sample, and limited capacity to replicate exists when there are bounds on the number of participants that we can run.
Knowledge@Wharton: What kinds of implications does your study have?
Nave: One important feature of our study is that, before we ran the experiments, we recruited more than 200 scientists and had them predict what would happen in the replications. We asked them what they thought the probability was that each study would replicate. We also had them participate in a prediction market. There were 21 prediction markets for the 21 studies. In these markets, our participants started with some amount of money, and they could buy and sell stocks for the different experiments. At the end of the study, every stock would pay them 100 cents if the study replicated well, and zero cents if it didn’t replicate. All of the stock prices started at 50 cents, and the prices then changed as a function of people’s beliefs about whether the experiments would replicate or not.
At the end of the study, we looked at the final stock prices. A high stock price implied that the market thought the experiment was more likely to replicate, and these prices very closely matched the results of the experiments. In fact, none of the studies for which the closing price was lower than 50 cents replicated. Only three studies with closing prices above 50 cents failed to replicate, which shows that people did know which studies would replicate before we even ran the studies.
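To make the scoring concrete, here is a minimal sketch, using invented prices and outcomes rather than the project’s data, of how closing prices (expressed as a fraction of the 100-cent payout) can be read as replication probabilities and checked against the eventual results:

```python
# Minimal sketch with made-up numbers (not the study's data): treating each
# market's closing price, as a fraction of the 100-cent payout, as the crowd's
# estimated probability of replication, then comparing it with the outcome.
closing_prices = [0.82, 0.35, 0.61, 0.15, 0.55]     # hypothetical closing prices
replicated     = [True, False, True, False, False]  # hypothetical outcomes

# Accuracy when a price above 0.50 is read as a prediction of "will replicate".
hits = sum((price > 0.5) == outcome
           for price, outcome in zip(closing_prices, replicated))
accuracy = hits / len(replicated)

# Brier score: mean squared gap between predicted probability and outcome (lower is better).
brier = sum((price - outcome) ** 2
            for price, outcome in zip(closing_prices, replicated)) / len(replicated)

print(f"Accuracy at the 50-cent threshold: {accuracy:.2f}")
print(f"Brier score: {brier:.3f}")
```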
“Replicability should be an integral part of the scientific process.”
That’s great news. It tells us that our scientists have the capacity to tell apart experiments that replicate from those that do not. I think one of the takeaways from this, which also relates to our future work, is the need to find out which properties of these studies predict whether they will replicate or not. It’s very clear that the sample size and the P value, which represents the strength of the statistical evidence in the data, are very important. It seems like studies that had small samples and a high P value were less likely to replicate. The strength of the theory also seems to be a predictor.
From a practical angle, I think we should expect effects in replication studies to be smaller than the original ones. If one wants to replicate an original study, I would definitely recommend not calculating the sample size based on the effect of the original study, but having enough participants to detect 75% of the original effect. I think this is also a lesson we can generalize to the previous replication projects. I participated in one project that aimed to replicate studies in economics. And there were [similar] studies in psychology. These studies had large samples, but the samples were calibrated to detect the original effects. Because of that, it’s very possible that they failed to replicate findings because they could not detect effects smaller than the originals, which we now know one should expect.
Knowledge@Wharton: Could this research change how larger journals and high-profile journals like Nature and Science accept papers?
Nave: I think it will, and I think it already has. If we look at the studies that we replicated, all of the experiments that failed to replicate took place between the years 2010 and 2013. From the last two years of the studies that we selected, everything replicated. These are only four studies, so I’m not going to make bold claims, like “everything now replicates.” But it’s very clear that there were changes in journal policies. This is especially true for psychology journals, where one now has to share the data and the analysis scripts. You get a special badge when you pre-register the study. Pre-registration is a very important thing. It’s committing to an analysis plan — the number of participants that you will run and how you will analyze the data — before you do [the experiment]. When you do that, you limit the amount of bias that your own decisions, when analyzing the data, can induce. There were previous studies conducted, mostly here at Wharton, showing that when you have some flexibility in the analysis, you are very likely to find results that are statistically significant but do not reflect a real effect. By pre-registering, researchers tie their own hands before collecting the data, and that allows them to generate results that are more robust and replicable.
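As a toy illustration of that last point, and not a reproduction of the Wharton studies Nave mentions, the simulation below shows how analytic flexibility alone can inflate false positives: if researchers measure several outcomes and report whichever yields the smallest p-value, “significant” results appear far more often than 5% even when no effect exists.

```python
# Toy simulation (illustrative only): measuring several outcomes and reporting
# whichever gives the smallest p-value inflates the false-positive rate
# well above the nominal 5%, even when there is no true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_simulations, n_per_group, n_outcomes = 5000, 30, 5
false_positives = 0

for _ in range(n_simulations):
    p_values = []
    for _ in range(n_outcomes):
        treatment = rng.normal(size=n_per_group)  # no true effect in either group
        control = rng.normal(size=n_per_group)
        p_values.append(stats.ttest_ind(treatment, control).pvalue)
    if min(p_values) < 0.05:                      # report only the "best" outcome
        false_positives += 1

print(f"False-positive rate with flexible outcome choice: {false_positives / n_simulations:.2f}")
# With five independent outcomes this is roughly 1 - 0.95**5, about 0.23, not 0.05.
```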
“If a study doesn’t replicate, you’d better know it before building on it and standing on the shoulders of the researchers that conducted it.”
Knowledge@Wharton: What’s the next step in your own research?
Nave: In relation to replicability, we are now looking at what made people predict so well whether those studies would replicate or not. One of the things that we’ve done here was also to try to use machine learning in order to go over the papers — using features such as the P value, the sample size, the text of the papers and some [other] information — and see whether an algorithm can predict as well as the humans whether studies will replicate or not. For this specific [experiment], the algorithm can detect replicability in something like 80% of the cases, which is not bad at all. So, we are working on automating this process.
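For readers curious what such a model might look like in outline, here is a minimal sketch, trained on synthetic data rather than the team’s dataset and using only two of the features mentioned, of predicting replication from the original p-value and sample size:

```python
# Minimal sketch (synthetic data, not the team's model): predicting whether a
# study replicates from simple features such as the original p-value and sample size.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_studies = 200

# Synthetic features for imaginary original studies.
p_values = rng.uniform(0.001, 0.05, n_studies)
sample_sizes = rng.integers(20, 500, n_studies)

# Synthetic "ground truth": smaller p-values and larger samples replicate more often.
logit = -40 * p_values + 0.004 * sample_sizes
replicated = rng.random(n_studies) < 1 / (1 + np.exp(-logit))

features = np.column_stack([p_values, np.log(sample_sizes)])
model = LogisticRegression()
scores = cross_val_score(model, features, replicated, cv=5)
print(f"Cross-validated accuracy on the synthetic data: {scores.mean():.2f}")
```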
Another thing is just to continue to replicate. Replicability should be an integral part of the scientific process. We have neglected it, maybe for some time. Maybe it was because people were perceived as belligerent or aggressive if they tried to challenge other people’s views. But when you think of it, this is the way science has progressed for many years. If a study doesn’t replicate, you’d better know it before building on it and standing on the shoulders of the researchers that conducted it.
The Rodin study that I just described had about 400 citations in as little as four years, and there were studies [that failed to replicate] that had even more citations. This is Science and Nature. These papers have a high impact on many disciplines, and they are accepted based on their potential impact. So, these early results are important. I think that we should keep replicating findings, and researchers and journal editors should be aware that if results are not replicable, it could lead to a waste of people’s time, of people’s careers and of public money that is used to generate additional studies based on the original result.
One Comment So Far
Anumakonda Jagadeesh
Excellent.
The replication crisis (or replicability crisis or reproducibility crisis) is an ongoing (as of 2018) methodological crisis in science in which scholars have found that the results of many scientific studies are difficult or impossible to replicate or reproduce on subsequent investigation, either by independent researchers or by the original researchers themselves. The crisis has long-standing roots; the phrase was coined in the early 2010s as part of a growing awareness of the problem.
Because the reproducibility of experiments is an essential part of the scientific method, the inability to replicate the studies of others has potentially grave consequences for many fields of science in which significant theories are grounded on unreproducible experimental work.
The replication crisis has been particularly widely discussed in the field of psychology (and in particular, social psychology) and in medicine, where a number of efforts have been made to re-investigate classic results, and to attempt to determine both the reliability of the results, and, if found to be unreliable, the reasons for the failure of replication.
According to a 2016 poll of 1,500 scientists reported in the journal Nature, 70% of them had failed to reproduce at least one other scientist’s experiment (50% had failed to reproduce one of their own experiments).
In 2009, 2% of scientists admitted to falsifying studies at least once, and 14% admitted to personally knowing someone who did. Misconduct was reported more frequently by medical researchers than by others.
In psychology:
Several factors have combined to put psychology at the center of the controversy. Much of the focus has been on the area of social psychology, although other areas of psychology such as clinical psychology have also been implicated.
Firstly, questionable research practices (QRPs) have been identified as common in the field. Such practices, while not intentionally fraudulent, involve capitalizing on the gray area of acceptable scientific practices or exploiting flexibility in data collection, analysis, and reporting, often in an effort to obtain a desired outcome. Examples of QRPs include selective reporting or partial publication of data (reporting only some of the study conditions or collected dependent measures in a publication), optional stopping (choosing when to stop data collection, often based on statistical significance of tests), p-value rounding (rounding p-values down to 0.05 to suggest statistical significance), the file drawer effect (nonpublication of data), post-hoc storytelling (framing exploratory analyses as confirmatory analyses), and manipulation of outliers (either removing outliers or leaving outliers in a dataset to cause a statistical test to be significant). A survey of over 2,000 psychologists indicated that a majority of respondents admitted to using at least one QRP (a toy simulation of one such practice appears after the third point below). False positive conclusions, often resulting from the pressure to publish or the author’s own confirmation bias, are an inherent hazard in the field, requiring a certain degree of skepticism on the part of readers.
Secondly, psychology, and social psychology in particular, has found itself at the center of several scandals involving outright fraudulent research, most notably the admitted data fabrication by Diederik Stapel as well as allegations against others. However, most scholars acknowledge that fraud is, perhaps, the lesser contribution to replication crises.
Third, several effects in psychological science have been found to be difficult to replicate even before the current replication crisis. For example, the scientific journal Judgment and Decision Making has published several studies over the years that fail to provide support for the unconscious thought theory. Replications appear particularly difficult when research trials are pre-registered and conducted by research groups not highly invested in the theory in question.
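As an illustration of the first of these points, the toy simulation below (purely illustrative, not drawn from any of the surveys cited) shows how one common QRP, optional stopping, inflates the false-positive rate when there is no true effect:

```python
# Toy simulation of optional stopping: test after every batch of new observations
# and stop as soon as p < .05. Under the null (no true effect), the false-positive
# rate climbs well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_simulations, batch_size, max_batches = 5000, 10, 10
false_positives = 0

for _ in range(n_simulations):
    data = np.empty(0)
    for _ in range(max_batches):
        data = np.concatenate([data, rng.normal(size=batch_size)])  # null data
        if stats.ttest_1samp(data, 0).pvalue < 0.05:                # peek at the test
            false_positives += 1
            break

print(f"False-positive rate with optional stopping: {false_positives / n_simulations:.2f}")
# Committing to the sample size in advance would keep this near 0.05.
```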
These three elements together have resulted in renewed attention to replication, supported by Daniel Kahneman. Scrutiny of many effects has shown that several core beliefs are hard to replicate. A recent special edition of the journal Social Psychology focused on replication studies, and a number of previously held beliefs were found to be difficult to replicate. A 2012 special edition of the journal Perspectives on Psychological Science also focused on issues, ranging from publication bias to null-aversion, that contribute to the replication crisis in psychology. In 2015, the first open empirical study of reproducibility in psychology was published, called the Reproducibility Project. Researchers from around the world collaborated to replicate 100 empirical studies from three top psychology journals. Fewer than half of the attempted replications were successful at producing statistically significant results in the expected directions, though most of the attempted replications did produce trends in the expected directions.
Many research trials and meta-analyses are compromised by poor quality and conflicts of interest that involve both authors and professional advocacy organizations, resulting in many false positives regarding the effectiveness of certain types of psychotherapy.
Although the British newspaper The Independent wrote that the results of the reproducibility project show that much of the published research is just “psycho-babble”, the replication crisis does not necessarily mean that psychology is unscientific. Rather, this process is a healthy, if sometimes acrimonious, part of the scientific process in which old ideas, or those that cannot withstand careful scrutiny, are pruned, although this pruning process is not always effective. The consequence is that some areas of psychology once considered solid, such as social priming, have come under increased scrutiny due to failed replications.
Nobel laureate and professor emeritus in psychology Daniel Kahneman argued that the original authors should be involved in the replication effort because the published methods are often too vague. Others, such as Dr. Andrew Wilson, disagree and argue that the methods should be written down in detail. An investigation of replication rates in psychology in 2012 indicated higher success rates in replication studies when there was author overlap with the original authors of a study (91.7% successful replication in studies with author overlap compared to 64.6% without).
Psychology replication rates:
A report by the Open Science Collaboration in August 2015 that was coordinated by Brian Nosek estimated the reproducibility of 100 studies in psychological science from three high-ranking psychology journals. Overall, 36% of the replications yielded significant findings (p value below 0.05) compared to 97% of the original studies that had significant effects. The mean effect size in the replications was approximately half the magnitude of the effects reported in the original studies.
The same paper examined the reproducibility rates and effect sizes by journal (Journal of Personality and Social Psychology [JPSP], Journal of Experimental Psychology: Learning, Memory, and Cognition [JEP:LMC], Psychological Science [PSCI]) and discipline (social psychology, cognitive psychology). Study replication rates were 23% for JPSP, 38% for JEP:LMC, and 38% for PSCI. Studies in the field of cognitive psychology had a higher replication rate (50%) than studies in the field of social psychology (25%).
An analysis of the publication history in the top 100 psychology journals between 1900 and 2012 indicated that approximately 1.6% of all psychology publications were replication attempts. Articles were considered a replication attempt if the term “replication” appeared in the text. A subset of those studies (500 studies) was randomly selected for further examination and yielded a lower replication rate of 1.07% (342 of the 500 studies [68.4%] were actually replications). In the subset of 500 studies, analysis indicated that 78.9% of published replication attempts were successful. The rate of successful replication was significantly higher when at least one author of the original study was part of the replication attempt (91.7% relative to 64.6%).
A 2018 study in Nature Human Behaviour sought to replicate 21 social and behavioral science papers from Nature and Science, finding that only 13 could be successfully replicated.
In a work published in 2015, Glenn Begley and John Ioannidis offered five bullet points summarizing the present predicament:
• Generation of new data/publications at an unprecedented rate.
• Compelling evidence that the majority of these discoveries will not stand the test of time.
• Causes: failure to adhere to good scientific practice & the desperation to publish or perish.
• This is a multifaceted, multistakeholder problem.
• No single party is solely responsible, and no single solution will suffice.
In fact, some predictions of a possible crisis in the quality control mechanism of science can be traced back several decades, especially among scholars in science and technology studies (STS). Derek de Solla Price, considered the father of scientometrics, predicted that science could reach ‘senility’ as a result of its own exponential growth. Some present-day literature seems to vindicate this ‘overflow’ prophecy, lamenting a decay in both attention and quality.
Philosopher and historian of science Jerome R. Ravetz predicted in his 1971 book Scientific Knowledge and Its Social Problems that science, in moving from the little science of restricted communities of scientists to big science or techno-science, would suffer major problems in its internal system of quality control. Ravetz anticipated that modern science’s system of rewarding scientists for research might become dysfunctional (the present ‘publish or perish’ pressure), creating perverse incentives to publish any findings, however dubious. For Ravetz, quality in science is maintained when there is a community of scholars linked by norms and standards, and a willingness to stand by these.
Historian Philip Mirowski offered a similar diagnosis more recently in his 2011 book Science Mart. ‘Mart’ is a reference to the retail giant Walmart and an allusion to the commodification of science. In Mirowski’s analysis, when science becomes a commodity traded in a market, its quality collapses. Mirowski argues his case by tracing the decay of science to the decision of major corporations to close their in-house laboratories in order to outsource their work to universities, and subsequently to move their research away from universities to even cheaper contract research organizations (CROs).
The crisis of science’s quality control system is affecting the use of science for policy. This is the thesis of a recent work by a group of STS scholars, who identify ‘evidence-based (or informed) policy’ as a point of present tension. Economist Noah Smith suggests that a factor in the crisis has been the overvaluing of research in academia and the undervaluing of teaching ability, especially in fields with few major recent discoveries.
Addressing the replication crisis
Replication has been referred to as “the cornerstone of science”. Replication studies attempt to evaluate whether published results reflect true findings or false positives. The integrity of scientific findings and the reproducibility of research are important because they form the knowledge foundation on which future studies are built (Wikipedia).
Marcus R. Munafò and George Davey Smith argue, in a piece published by Nature, that research should emphasize triangulation, not just replication. They claim that,
“replication alone will get us only so far (and) might actually make matters worse… We believe that an essential protection against flawed ideas is triangulation. This is the strategic use of multiple approaches to address one question. Each approach has its own unrelated assumptions, strengths and weaknesses. Results that agree across different methodologies are less likely to be artefacts…. Maybe one reason replication has captured so much interest is the often-repeated idea that falsification is at the heart of the scientific enterprise. This idea was popularized by Karl Popper’s 1950s maxim that theories can never be proved, only falsified. Yet an overemphasis on repeating experiments could provide an unfounded sense of certainty about findings that rely on a single approach…. philosophers of science have moved on since Popper. Better descriptions of how scientists actually work include what epistemologist Peter Lipton called in 1991 “inference to the best explanation”.
Dr. A. Jagadeesh, Nellore (AP), India