
From fact to fake: The importance of being significant

Article: 2236995 | Received 09 May 2023, Accepted 11 Jul 2023, Published online: 31 Jul 2023

Abstract

The controversy about statistical significance vs. scientific relevance is more than 100 years old. But still today, null hypothesis significance testing is considered the gold standard in many empirical fields, from economics, the social sciences, and psychology to medicine, and small p-values are often the key to publishing in journals of high scientific reputation. I highlight and discuss three cases of potential pitfalls of statistical significance testing. Each case is illustrated with a real data example and an accompanying artificial example.

1 Introduction

Public opinion expects scientific results to be correct and relevant. Hence, statistically significant findings sound tempting and suggestive, although they are not necessarily meaningful. Boring (Citation1919) observed an antagonism “Mathematical vs. scientific significance” and stated an “apparent inconsistency between scientific intuition and mathematical result.” One hundred years later, two prominent papers reconfirmed such early critique and expressed far more vehement objections. First, Amrhein et al. (Citation2019) gathered in Nature more than 800 scientists behind their “Rise up Against Statistical Significance.” They “are not calling for a ban on P values,” but “are calling for a stop to the use of P values in the conventional, dichotomous way–to decide whether a result refutes or supports a scientific hypothesis.” Second, The American Statistician published a supplementary issue (Volume 73, 2019) on Statistical Inference in the 21st Century: A World Beyond p<0.05. It comprises more than 40 papers including the editorial expressing a more radical view (Wasserstein, Schirm, and Lazar Citation2019, p. 2): “We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term ‘statistically significant’ entirely. Nor should variants such as ‘significantly different,’ ‘p < 0.05,’ and ‘nonsignificant’ survive, whether expressed in words, by asterisks in a table, or in some other way.” What is the reason behind the demanded rise against or even ban of statistical significance?

There is a rich tradition of critical comments and skeptical views on null hypothesis significance testing, see e.g., Berkson (Citation1938), Rozeboom (Citation1960), Bakan (Citation1966) and Meehl (Citation1978). Hence, Cohen (Citation1994) wrote in his paper with the polemic title The Earth Is Round (p < .05): “After 4 decades of severe criticism, the ritual of null hypothesis significance testing–mechanical dichotomous decisions around a sacred .05 criterion–still persists.” Consequently, after the turn of the century the criticism has become even more pronounced: Gigerenzer (Citation2004), Ioannidis (Citation2005) and Ziliak and McCloskey (Citation2008) headlined Mindless statistics, Why most published research findings are false and The cult of statistical significance, respectively.

In this review, I point out common pitfalls when statistical tests are applied in practice: Although data are not manipulated and actual results are reported ("facts"), the conclusions are faulty ("fake"), because the methods are not applied in a correct manner. Three examples abstracted from cases found in the literature quantify this effect of malfunctioning. The mechanisms behind the fallacies are p-hacking (see Simonsohn et al. Citation2014) and MESSing (manipulating evidence subject to snooping, see Hassler and Pohle Citation2022), both of which relate to HARKing (Hypothesizing After the Results are Known, see Kerr Citation1998). These techniques are typically employed with impure intentions, striving for results that otherwise would not show up. I also discuss a third case that time series analysts are particularly prone to, even benevolent researchers: Testing Observed Surprising Structures (TOSSing). Here, apparently striking random patterns catch our attention, although they are more likely than one naively believes. Having observed such a striking feature or cluster, it is not hard to hit on some prior event that is suspected to be causal and to have explanatory power for the pattern. Consequently, one tests for the "obviously" surprising structure–and may be misled by nonsensical significance. Even if done in good faith, TOSSing may be just as harmful as p-hacking or MESSing.

The next section briefly reviews the very nature of the “ritual of significance testing.” The third section discusses examples that are representative of potential fallacies lurking in applied work. A detailed discussion is provided in Section 4, and conclusions are offered in the final section.

2 Meaning of significance

The concept of computing a p-value as a criterion to judge whether some observations can be considered as generated at random or not has been formalized by Pearson (Citation1900), although he did not use the term "significant" yet. The earliest use of "statistical significance" I am aware of–admittedly not being a historian–is by Boring (Citation1916, p. 315). It was the highly influential book by Fisher (Citation1925) that firmly anchored "tests of significance" and p-values in much of the empirical research over the last century; throughout, $H_0$ signifies a null hypothesis to be tested. The index 0 is motivated by the (agricultural) treatment studies dominating in Fisher (Citation1925): Under the null hypothesis there is a zero effect.

The idea of a statistical test, however, is much older and can be traced back at least to Arbuthnot (Citation1710), see Stigler (Citation1986, pp. 225, 226) and Shoesmith (Citation1987). Arbuthnot (Citation1710) published the numbers of boys and girls baptized per year in London over 82 years from 1629 till 1710. In each single year, he observed more boys born and baptized than girls–of course he did not consider further sexes beyond male and female. He formulated the assumption of equal probability, i.e., that there is a 50% chance of a newborn being a boy or a girl. And he reckoned that under this assumption the probability of more boys born every year over 82 years equals the probability of getting 82 heads in a row when tossing a fair coin: $(1/2)^{82}$, which is a very small number indeed. Arbuthnot (Citation1710, p. 188) found this number "by the Table of Logarithms" to be 1/4,836,000,000,000,000,000,000,000, which is very close to the value given by my computer: 1/4,835,700,000,000,000,000,000,000. And Arbuthnot concluded that the assumption of equal probability must be wrong. Let us have a look into the structure of this argument.
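As a quick arithmetic check, Arbuthnot's number can be reproduced in R (the software also used for the experiments below); a minimal sketch:

```r
# Probability of 82 "boy years" in a row under equal chances, and its reciprocal
p <- 0.5^82
p        # approx 2.07e-25
1 / p    # approx 4.8357e+24, i.e., 1 in 4,835,700,000,000,000,000,000,000
```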

What statistical inference takes for granted is a collection of data, from which the value of a test statistic, say t, is computed; the larger t, the less likely is the occurrence of this value if a null hypothesis $H_0$ is true. Let T denote the test statistic that could take on different values than t. The p-value is defined as the probability to observe a value at least as extreme as t, assuming that $H_0$ is true:
$$\text{p-value} = P(T \geq t \text{ if } H_0 \text{ is true}). \qquad (1)$$

Hence, the p-value is the probability to observe the value t of the test statistic T, or an even stronger violation of $H_0$, if the null hypothesis is true. In other words: Before one observes the data resulting in the value t, one predicts the probability to find $T \geq t$–under the assumption the null hypothesis holds true–and this probability defines the p-value. The case by Arbuthnot (Citation1710) can be cast into this form as follows. The null hypothesis is that the birth of a boy is as probable as the birth of a girl:
$$H_0: P(\text{boy}) = P(\text{girl}) = \frac{1}{2}.$$

Let T measure the number of years where more boys are born than girls out of a total of 82 years. Before collecting data, we do not know what value T will take on; it is a random variable (test statistic) used to test $H_0$. Then Arbuthnot collected the data and observed t = 82. It follows (by the binomial distribution) that
$$\text{p-value} = P\left(T \geq 82 \text{ if } P(\text{boy}) = \tfrac{1}{2}\right) = \left(\tfrac{1}{2}\right)^{82}.$$
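This binomial p-value can be computed directly in R; a minimal sketch:

```r
# P(T >= 82) for T ~ Binomial(82, 1/2): only t = 82 itself is at least as extreme
pbinom(81, size = 82, prob = 0.5, lower.tail = FALSE)   # equals (1/2)^82, about 2.07e-25
```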

The trust in the truth of $H_0$ is shattered for small p-values by the data summarized in t: The p-value is used to formulate confidence in the assumed null hypothesis. If the p-value is smaller than 0.05, one often says that the statistic T or the statistical test is significant, more precisely: significant at the 5% level. More generally, significance at the level α for some (small) number α, $0 < \alpha < 1$, is achieved for a p-value smaller than α: If a p-value is smaller than α, one rejects the null hypothesis at level α, knowing of course that such a decision may be wrong. But the probability of a wrongful rejection is controlled exactly by the level α in that the probability to reject erroneously is–by construction of a significance test–at most α.
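This level control is easy to illustrate by simulation. The following sketch uses an arbitrarily chosen binomial setting (100 fair-coin tosses per replication, not taken from the text) and shows that a valid 5%-level test rejects a true null hypothesis in at most roughly 5% of replications:

```r
# Monte Carlo check: rejection rate of a 5%-level test when H0 is true
set.seed(1)
pvals <- replicate(10000, {
  x <- rbinom(1, size = 100, prob = 0.5)     # data generated under H0: p = 0.5
  binom.test(x, n = 100, p = 0.5)$p.value    # exact binomial test of H0
})
mean(pvals < 0.05)                           # empirical rejection rate, at most about 0.05
```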

If one fails to reject a null hypothesis this is not equivalent to proving that the null hypothesis is true. In fact, a significance test does not address the null hypothesis directly. Falk and Greenbaum (Citation1995, p. 75) criticized this since “the conclusion that, given a significant result, H0 becomes improbable is not generally true.” The p-value does not give the probability that the null hypothesis as statement of interest is true or false; it is only quite indirectly linked to the null hypothesis by (1), see also the discussion in Wasserstein and Lazar (Citation2016). Notice, in a “classical” (or: frequentist) view, the null hypothesis is a state of the world, which is either true or not. Such a view allows for stronger or weaker belief in H0, but it does not model the degree of confidence by means of probabilities. On the contrary, a Bayesian take is to model the degree of belief or disbelief as prior and posterior probabilities, see e.g., Good (Citation1975); but this is beyond the scope of this note.

3 Three fallacies

Data mining or data snooping techniques are widespread in empirical research. While such tools are perfectly legitimate for some purposes, they are prone to invalidate statistical inference. Care and caution are hence advisable when it comes to significance testing after preliminary data analysis. Three examples show that incorrect application of statistical tests may produce misleading results.

3.1 “Feeling the future”

Before turning to the real case by Bem (Citation2011) with the title “Feeling the future” I consider an artificial example.

Example 1.

Consider the following (computer) experiment. We draw a sample of size $n = 10^5$ from the (pseudo) random numbers $1, 2, \ldots, 100$. Each number has the same probability (namely 1/100) to enter the sample. Then I count how often repdigits (numbers with repeated digits) occur. There are 9 repdigits between 1 and 100, namely $E = \{11, 22, \ldots, 99\}$. Hence, one expects the event E to occur with probability $P(E) = 9/100 = 0.09$ when random sampling. The observed frequency in the experiment, however, is considerably smaller: 8,804 cases out of n = 100,000, i.e., the observed relative frequency is $\hat{P}(E) = 0.08804$, where $\hat{P}(\cdot)$ is short for the empirical relative frequency. What do I mean by "considerably smaller"? Is the difference between 0.09 and 0.088 simply due to sampling variability and caused by chance? Or is this difference "systematic"? In fact, it is "statistically significant," or more precisely "significant at the 5% level" in that a statistic testing for the null hypothesis $P(E) = 0.09$ produces a one-tailed p-value smaller than 0.05, namely 0.0151.
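A sketch of how such an experiment can be run and evaluated in R. The simulated count varies with the seed, but the final line recomputes the one-tailed p-value from the counts reported above (8,804 repdigits out of 100,000) and gives approximately 0.0151:

```r
# Example 1: draw 10^5 numbers from 1..100 and count repdigits
set.seed(123)                                   # arbitrary seed; counts vary from run to run
n <- 1e5
x <- sample(1:100, n, replace = TRUE)
repdigits <- c(11, 22, 33, 44, 55, 66, 77, 88, 99)
k <- sum(x %in% repdigits)                      # observed number of repdigits

# One-tailed test of H0: P(E) = 0.09 against "repdigits occur less often"
p0 <- 0.09
pnorm((k / n - p0) / sqrt(p0 * (1 - p0) / n))   # one-tailed p-value for the simulated sample

# Same test with the counts reported in the text
pnorm((0.08804 - 0.09) / sqrt(0.09 * 0.91 / 1e5))   # approx 0.0151
```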

There seems to be no good reason why repdigits should occur less often in random samples than theoretically expected or predicted. So, what is going wrong in Example 1? Most statisticians would say that a sample size of $n = 10^5$ is not too small. The experiment was carried out with the open source software R, and searching the internet you will not find comments hinting at evidence that the (pseudo) random number generator is defective. If nothing went wrong, should I submit my finding that repdigits occur significantly (at level 5%) less often than theoretically predicted to a scientific journal? Is it possible to publish empirical findings hardly anyone has trust in a priori? Isn’t it the very nature of science to dump an a priori hypothesis if data contradict it?

The last two questions have been discussed in connection with the (in)famous study by Bem (Citation2011) on extrasensory perception (ESP). 100 participants of an experiment had to guess, or rather predict, whether a picture would show up on the left or on the right of the screen in front of them. The position was determined randomly by computer with equal chances (50%); the pictures were of erotic or nonerotic content. Bem (Citation2011, p. 409) clarifies that “neither the picture itself nor its left/right position was determined until after the participant recorded his or her guess, making the procedure a test of detecting a future event (i.e., a test of precognition).” He summarizes “Across all 100 sessions, participants correctly identified the future position of the erotic pictures significantly more frequently than the 50% hit rate expected by chance: 53.1%,” resulting in (one-tailed) significance of 1%. Further, “In contrast, their hit rate on the nonerotic pictures did not differ significantly from chance: 49.8%” with a p-value larger than 0.5. Of course, I cannot tell how many tests were executed and how many alternatives were tried (without significance at 5%) before the significant alternative of erotic pictures showed up and the study got published in one of the leading journals of the American Psychological Association–and it seems that Daryl Bem cannot tell either. He is quoted by Engber (Citation2017) as follows: “I would start one [experiment], and if it just wasn’t going anywhere, I would abandon it and restart it with changes,” and “I didn’t keep very close track of which ones I had discarded and which ones I hadn’t.” Note this is not a case of scientific fraud–but maybe sloppiness. Engber (Citation2017) quotes Bem as “I think probably some of the criticism could well be valid. I was never dishonest, but on the other hand, the critics were correct.” Early critics were Wagenmakers et al. (Citation2011). Notice, however, that such criticism does not rule out sound statistical analysis of ESP, see for instance Utts (Citation1999).

I now offer a way out of our discomforting empirical evidence given in Example 1. Similarly to Bem in Engber (Citation2017), I now admit that I actually tested 20 nonsensical hypotheses at the 5% level, building on the events
$E_1 = \{2, 4, 8, 16, 32, 64\}$, i.e., powers of 2,
$E_2 = \{3, 9, 27, 81\}$, i.e., powers of 3,
$E_3 = \{2, 3, 5, 8, 13, 21, 34, 55, 89\}$, i.e., Fibonacci numbers,
…
$E_{20} = \{2, 3, 5, 7, 11, 13, \ldots, 83, 89, 97\}$, i.e., prime numbers,
including $E = \{11, 22, \ldots, 99\}$. In each case I confronted the theoretical null hypothesis $P(E_j)$ with the empirical sample pendant $\hat{P}(E_j)$, and only the case of repdigits was significant (at 5%). If you perform 20 tests at significance level 5% and if all null hypotheses are true, then you must expect one test out of 20 to be erroneously significant at the 5% level. This is part of the logic of significance tests. Hence, in Example 1 I did not manipulate the data or the statistics but reported the empirical facts. I only concealed that a lot of tests were run before finding and reporting a significant result, which for this reason is actually fake. Such a research practice is related to so-called p-hacking, see Simonsohn et al. (Citation2014) and Simmons et al. (Citation2011). One continues collecting data or carrying out experiments until a sufficiently small p-value shows up, thus promising some significant result, which is prone to "false positive" findings.
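The size of this multiple-testing effect is easy to quantify. Treating the 20 tests as if they were independent (an approximation, since the events above overlap), a minimal sketch:

```r
# Expected number of false rejections among 20 valid 5%-level tests of true hypotheses
20 * 0.05        # = 1

# Probability of at least one false rejection, assuming independence of the 20 tests
1 - 0.95^20      # approx 0.64
```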

There is a second problematic aspect accompanying Example 1 beyond testing hypotheses until obtaining a rejection: the sheer sample size of n = 100,000. One might think that large samples mean more information and are hence always beneficial. But having more information, one must become more critical, because "Any effect, no matter how tiny, can produce a small p-value if the sample size or measurement precision is high enough" (Wasserstein and Lazar Citation2016, p. 132), see also Cohen (Citation1990, p. 1308) or Utts (Citation1999, p. 621). Therefore, it has been proposed that smaller threshold p-values than 5% should be employed in the presence of larger sample sizes; but just as the 5% convention cannot be justified, it is not clear how decreasing significance levels reasonably could be linked to increasing sample sizes. Or, as Leamer (Citation1978, p. 89) puts it paraphrasing Berkson (Citation1938): "since a large sample is presumably more informative than a small one, and since it is apparently the case that we will reject the null hypothesis in a sufficiently large sample, we might as well begin by rejecting the hypothesis and not sample at all."
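The effect of sample size is easy to reproduce with a stylized calculation (the true proportion of 0.502 below is an arbitrary illustration, not taken from the text): a deviation from the null that is negligible in practical terms becomes "significant" once n is large enough.

```r
# A tiny deviation from H0: p = 0.5 (observed share 0.502) at two sample sizes
phat <- 0.502
z <- function(n) (phat - 0.5) / sqrt(0.5 * 0.5 / n)   # standard z-statistic for a proportion
2 * pnorm(-abs(z(1e3)))    # n = 1,000:     p-value approx 0.90, far from significant
2 * pnorm(-abs(z(1e6)))    # n = 1,000,000: p-value approx 0.00006, "highly significant"
```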

3.2 (Un)Lucky numbers?

Again, we begin with a simulated example before turning to a real case. The background of Example 2 is a board game in which a player must roll a 6 to bring a token into play, and rolling a 6 moreover earns the player an extra roll.

Example 2.

A student has tossed a die 60,000 times. In case of a fair die one expects each of the numbers 1 through 6 to occur 10,000 times–roughly, since tossing a die is a random process. Of course, the student simulated throwing the die by means of drawing (pseudo) random numbers with a computer. She reports to her supervisor that the number 6 occurred (much?) more often than expected: 10,203 times. "Is this significant–and at what level?" asks the supervisor. The student knows from her first-year course in statistics how to transform the difference between the observed frequency of the number 6, $n(6) = 10{,}203$, and the expected frequency 10,000 into a test statistic that follows the standard normal distribution if the die is not loaded. The value of the test statistic produces a one-tailed p-value of $0.0131 < 0.05$. She returns to her supervisor and claims that this is strong evidence against the die being fair, or rather against the random number generator simulating numbers with equal probability. What do you expect the supervisor to reply?
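The student's computation can be reproduced directly; a minimal sketch:

```r
# One-tailed test of H0: P(6) = 1/6, based on 10,203 sixes in 60,000 rolls
n <- 60000; k <- 10203; p0 <- 1/6
z <- (k - n * p0) / sqrt(n * p0 * (1 - p0))   # approximately standard normal under H0
pnorm(z, lower.tail = FALSE)                  # one-tailed p-value, approx 0.0131
```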

The example mimics the situation observed by Hassler and Pohle (Citation2022, p. 398) who evaluated 4,337 games of the German lottery with n = 26,022 draws out of 49 balls carrying the numbers 1 through 49: "the number 13 was drawn only 471 times in the German lottery, while (roughly) 531 cases would have to be expected under equal probability of all 49 numbers." In fact, the (unlucky?) number 13 was the one with the least favorable odds. And the massive deviation of the observed frequency from the expected number of occurrences under the null hypothesis that the ball with number 13 has a probability of 1/49 resulted in a p-value of 0.0027. This p-value is much smaller than 0.05, even smaller than 0.01, which is sometimes celebrated as "highly significant." This nonsense significance is explained by Hassler and Pohle (Citation2022, p. 402): "Generally, statistical tests are invalidated when one postulates hypotheses or test statistics subject to data snooping and uses the same data to test them." Such a procedure has also been characterized as the so-called Texas sharpshooter fallacy: someone first fires a gun at a barn door and afterwards paints the target around the bullet hole.

Let us return to Example 2, where a die has been tossed. The complete frequency distribution is reported in Table 1. We find that the observed frequencies are in 3 cases larger than the expected value 10,000 and in 3 cases smaller. Some of the deviations are small in absolute value (number 1), some are larger, and clearly the largest one is the case of number 6. One of the numbers has to be the most frequent one. Hence, the flaw of the statistical significance in Example 2 consists in not testing all numbers of the die for equal probability 1/6, but in picking the most striking violation and testing only that, namely
$$\text{Student's null hypothesis: } P(6) = \frac{1}{6}.$$

Table 1 Frequencies when tossing a die 60,000 times.

The supervisor from Example 2 replies that one should not test this specific null hypothesis, but more reasonably that all numbers have equal probability,
$$\text{Supervisor's null hypothesis: } P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = \frac{1}{6}. \qquad (2)$$

This test (based on a $\chi^2$ distribution with 5 degrees of freedom) yields a p-value of 0.2521, clearly much larger than 0.05 and far from accepted significance levels. The student's misdoing was to pick the most significant violation from Table 1, thus maximizing evidence subject to snooping, which is a special case of manipulating evidence subject to snooping (MESSing).
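The supervisor's test is the usual chi-square goodness-of-fit test against equal probabilities for the six faces. Since the body of Table 1 is not reproduced here, the counts below are hypothetical placeholders: only the 10,203 sixes and the total of 60,000 are taken from the example, the other five counts are made up to match the description (three counts above and three below 10,000, with a small deviation for number 1). A sketch:

```r
# Chi-square goodness-of-fit test of H0: P(1) = ... = P(6) = 1/6 (5 degrees of freedom)
# NOTE: apart from the 10,203 sixes and the total of 60,000, these counts are illustrative only
freq <- c(10008, 10045, 9938, 9908, 9898, 10203)
sum(freq)                            # 60,000
chisq.test(freq, p = rep(1/6, 6))    # with the actual Table 1 counts the text reports p = 0.2521
```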

3.3 “An argument for divine providence”

With the last example we enter the realm of time series: a sequence of data observed in chronological order. Let us remember the historical study by Arbuthnot (Citation1710). He concluded that the assumption of equal probability for boys and girls being born must be wrong since it results in the tiny p-value of $(1/2)^{82}$. How he cast this evidence into "An argument for divine providence" I leave to the reader. By the way, Arbuthnot's finding of more boys being born than girls has been confirmed for many countries over different cultures and several centuries. This at first glance startling result has been explained and supported physiologically by Orzack et al. (Citation2015)–although not yet accounting for sex beyond male and female. Matters are different in the following artificial example.

Example 3.

A couple of weeks before a new edition of coins is issued, a retired professor starts to toss a coin–not necessarily the same one–every morning. He keeps a record of whether heads (coded as one) or tails (coded as zero) show up. The coins are assumed to be fair, i.e., there is a 50% chance for both zero and one. Over the first 75 days he observes the number zero 37 times, i.e., in 49.33% of all cases, which supports the null hypothesis of fair coins. The 76th day is the day when the new coins become available, and from then on the professor uses new coins only. During the following 25 days after the release of the new coins the number 0 shows up 18 times, which amounts to 72% of all cases with new coins–and there is one week where zeros are tossed on 6 subsequent days! The professor is alarmed in the face of this striking cluster. He takes his calculator and checks (relying on the binomial distribution) that the probability to observe more than 17 zeros within 25 days under the assumption of equal probability equals $0.0216 < 0.05$, so that his 18 new-edition zeros are not in accordance with the assumption (null hypothesis) of equal probability. He writes an email to a former student asking for an explanation.
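The professor's calculation is an upper binomial tail; a minimal sketch:

```r
# Probability of at least 18 zeros in 25 tosses of a fair coin
pbinom(17, size = 25, prob = 0.5, lower.tail = FALSE)   # approx 0.0216
```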

When analyzing time series, one inevitably sooner or later observes clusters that seem to be inconsistent with randomness and ask for explanation. And since something always happens, there is always a potential cause: A new president has been elected introducing a new paradigm that may affect foreign policy and trade; the economy may be affected by a new currency introduced in Europe; the Centers for Disease Control and Prevention has approved a new vaccine that may affect some medical status. In short: Having observed some striking cluster in a time series after some "dramatic event" at a certain point in time seems to call for action, namely testing. This is what the professor was exposed to. He had no bad intentions and was Testing Observed Surprising Structures (TOSSing), after the structures had struck his mind. And TOSSing was all the more plausible since he could come up with a seemingly suggestive story or "reason" behind it, namely the new edition of coins. Even if in good faith, TOSSing is still another–naive if you wish–case of the Texas sharpshooter fallacy.

The former student explains to her retired professor: Yes, six zeros out of six trials or six ones out of six trials given a 50% chance in each trial is rather unlikely, the exact probability being 3.125%. But no, a spell of 6 consecutive zeros or 6 consecutive ones within 25 trials is not at all unlikely under equal chances: The probability for the longest spell to have a length of 6 is roughly 0.15, see Hassler and Hosseinkouchack (Citation2022, Fig. 1). In fact, the expected value of the length of such a longest spell within 25 trials is 5, see Hassler and Hosseinkouchack (Citation2022, Table 2). So, there is no reason to panic and to test the data generated with coins from the new edition: the "surprising structure" is not so surprising after all. And evaluating the full sample of 100 days, the professor observed 55 zeros; the probability to observe 55 or more zeros is 0.1841 if the probabilities for ones and zeros are equal. Again, 0.1841 is much larger than the ritualized 5% level, offering only very weak evidence against the null hypothesis of fair coins.
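Both points are easy to check in R: the full-sample binomial tail exactly, and the distribution of the longest spell by simulation (the simulated frequencies only approximate the exact values cited from Hassler and Hosseinkouchack 2022). A minimal sketch:

```r
# Full sample: probability of 55 or more zeros in 100 fair tosses
pbinom(54, size = 100, prob = 0.5, lower.tail = FALSE)    # approx 0.184

# Longest spell of identical outcomes (zeros or ones) in 25 fair tosses, by simulation
set.seed(42)
longest <- replicate(10000, max(rle(rbinom(25, 1, 0.5))$lengths))
mean(longest == 6)   # roughly 0.15: a longest spell of exactly 6 is not unusual
mean(longest)        # close to 5, in line with the expected length cited above
```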

4 Discussion

The previous examples belong to the generic case of data snooping, or: Manipulating Evidence Subject to Snooping (MESSing after Hassler and Pohle Citation2022). One specific strategy is HARKing (Hypothesizing After the Results are Known, see Kerr Citation1998) or SHARKing (secretly HARKing, see Hollenbeck and Wright Citation2017): With data mining techniques, empirical or experimental data are screened until a striking feature shows up; having detected the feature, one tests whether it is significant. HARKing can be legitimate: Looking at the data behind Example 2 in Table 1, the frequency of the number 6 is striking. Hence, one may test the null hypothesis (2) of equal probability with p-value 0.2521. Malevolent, however, is to pretend that one had first postulated the null hypothesis $P(6) = 1/6$ and then drawn the sample of 60,000 repetitions leading to a rejection. Similarly in Example 1: Reporting that 20 tests were performed at the 5% level and only one was significant is informative (in that it tells us that there is little evidence against the equal probability hypothesis); but claiming that one had the idea of repdigits occurring less often a priori and presenting only this single, significant test result–this amounts to SHARKing, which is a way of cheating. Unfortunately, this seems to be a common research practice: John et al. (Citation2012) reported a survey among academic psychologists, where one third of them admitted cases of having published unexpected results as if they had been predicted from the beginning.

The case of SHARKing is far from harmless. Consider once more Example 1 and assume that somebody wishes to replicate these findings and sets up the analogous experiment. In fact, I did run the experiment independently a second time. In this second experiment the relative frequency of repdigits was P̂(E)=0.08964, a value so close to the hypothesized probability of P(E)=0.09 that the p-value of the test statistic becomes 34.54%, which is not indicative of the value P(E)=0.09 being violated. But in the second experiment I observed again one of the 20 tests at 5% to be significant, only that it was the event of prime numbers that significantly violated the theoretical probability. In a third experiment, most likely neither the “repdigit anomaly” nor the “prime number anomaly” will be confirmed, but a further anomaly may be discovered. That way increasing evidence does not create increasing insight but increasing confusion.

SHARKing is one way of p-hacking, which more generally means squeezing the evidence until the p-value is small enough to reject at a certain level (often: 5%), see Simmons et al. (Citation2011) and Simonsohn et al. (Citation2014). Such a strategy may consist of trying different hypotheses (like in Example 1) until "sufficient significance" is achieved. Specifically, one may not be satisfied by reaching a certain significance level but instead strive for maximum evidence (minimal p-value), as was the case in Example 2. Often, p-hacking comes in the disguise of sample choices: increasing or reducing the sample to achieve a certain significance. Example 3 shows that such a practice is not necessarily intended to be manipulative but may result from a naive approach that we call TOSSing here: One first observes some striking or surprising structure and then tests for this observation.

My reading and writing of Examples 1–3 is related to a new field of epistemology called "agnotology" (from the Greek word "agnosis") that is devoted to "the cultural production of ignorance (and its study)," see Proctor (Citation2008, p. 1). Proctor (Citation2008, p. 3) distinguished several kinds of ignorance, one of them being "a deliberately engineered and strategic ploy (or active construct)," which "can be made or unmade, and science can be complicit in either process." One of his "favorite examples of agnogenesis is the tobacco industry's efforts to manufacture doubt about the hazards of smoking" (Proctor Citation2008, p. 11; see also Proctor Citation1995). In addition to such actively created ignorance, the chapters in Kourany and Carrier (Citation2020) focus on passively constructed ignorance. Empirical research may actively or passively contribute to public ignorance by means of misleading significance tests in consequence of p-hacking, MESSing or TOSSing. Examples 1–3 are characterized by less information or even growing confusion as a result of statistical significance tests or as a consequence of data snooping prior to testing. Clearly, the more hypotheses are tested using the same data or in the process of data snooping, the less convincing is statistical significance. This is reminiscent of the recent pandemic. A quick internet search (carried out in May 2022) returns over 50 special issues on COVID-19 published between 2020 and 2022 in a variety of scientific journals (from fields such as economics, the social sciences, and psychology to statistics, not accounting for medicine, virology and related fields). This means that hundreds of empirical studies have been published building on COVID related data and time series; many of them offer statistically significant results, but necessarily many of them analyze similar or identical data sets–that did not arise from random sampling.

5 Concluding remarks

Recently, several articles triggered a discussion about so-called paper mills: commercial "companies that churn out fake scientific manuscripts" (Else and Van Noorden Citation2021, p. 516). They are particularly "productive" in the field of biomedical research, see also Christopher (Citation2021). More generally, Ioannidis (Citation2005) argued that most published research findings are false. Even if data are not manipulated and results are reported as ground out by the computer ("facts"), the consequences may be defective ("fake"), because the inferential tools are not applied in a proper manner. In Section 3, I presented three examples of how and when nonsensical or exaggerated significance may occur in consequence of incorrect application.

Incorrect applications of statistical significance testing may have two roots. A) They may slip in unwantedly through ignorance of researchers who are otherwise in good faith; statistical pitfalls not only lurk in scientific publications but are even more likely to be encountered in everyday-life studies in fields such as business, medicine, and ecology. B) Incorrect applications may be purposeful and exploited by smart researchers in order to generate dazzling, highly significant results. The latter practice may be spurred by the so-called publication bias first addressed by Sterling (Citation1959, p. 30): "[…] research which yields nonsignificant results is not published. Such research being unknown to other investigators may be repeated independently until eventually by chance a significant result occurs–an 'error of the first kind'–and is published."; see also Sterling et al. (Citation1995). Hence, there are strong incentives for p-hacking, see Simonsohn et al. (Citation2014). This clearly calls for action. To counter the first case A), better and more careful statistical training is required to create awareness of when and how significance tests are valid. Further, Hirschauer et al. (Citation2019, p. 703) "suggest twenty immediately actionable steps to reduce widespread inferential errors" related to significance testing; the first suggestion being "Do not use p-values either if you have a nonrandom sample […], p-values are not interpretable for nonrandom samples." To account for the second case B), it must be demanded that "Researchers should disclose the number of hypotheses explored during the study, all data collection decisions, all statistical analyses conducted, and all p-values computed" (Wasserstein and Lazar Citation2016, p. 132). Since it is hard to control whether researchers comply with such a policy, it is important to overcome the publication bias; Sterling et al. (Citation1995) suggested that empirical studies should be accepted for publication if they tackle a relevant or interesting research question with adequate methods and data, irrespective of the level of significance of the outcome; see also Simmons et al. (Citation2011).

As hinted at in Section 2, more fundamental objections against significance testing have been brought up, too, often by statisticians advocating a Bayesian approach. The problem of p-hacking, however, is not automatically alleviated by Bayes techniques, see Simonsohn (Citation2014), and the same holds true for MESSing or TOSSing. Further, Bayesian inference crucially hinges on prior probability assumptions that may be hard to justify in practice.

The fact that significance tests can be misleading and deceptive does not conversely mean that abandonment of statistical tests guarantees direct and unbiased access to truth. Examples 1–3 are not meant to be a plea against empirical research and statistical significance, but against unthinking repetition of research rituals that may not only be useless but even harmful, namely agnogenetic. Sometimes less is more, and in this note I tried to identify such cases.

Acknowledgments

I am grateful to Jörg Breitung, Mehdi Hosseinkouchack, Michael Neugart, Marc-Oliver Pohle, Jan Reitz, Verena Werkmann, Jan-Lukas Wermuth, Michael Wolf and Tanja Zahn for many helpful comments. Moreover, I thank two anonymous referees for constructive critique and many useful suggestions.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

References

  • Amrhein V, Greenland S, McShane B. 2019. Scientists rise up against statistical significance. Nature. 567:305–307.
  • Arbuthnot J. 1710. An argument for divine providence, taken from the constant regularity observed in the births of both sexes. Philos Trans Royal Soc London. 27:186–190.
  • Bakan D. 1966. The test of significance in psychological research. Psychol Bull. 66(6):423–437.
  • Bem DJ. 2011. Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. J Pers Soc Psychol. 100(3):407–425.
  • Berkson J. 1938. Some difficulties of interpretation encountered in the application of the chi-square test. J Amer Stat Assoc. 33(203):526–536.
  • Boring EG. 1916. The number of observations upon which a limen may be based. Amer J Psychol. 27(3):315–319.
  • Boring EG. 1919. Mathematical vs. scientific significance. Psychol Bull. 16(10):335–338.
  • Christopher J. 2021. The raw truth about paper mills. FEBS Lett. 595(13):1751–1757.
  • Cohen J. 1990. Things I have learned (so far). Amer Psychol. 45:1304–1312.
  • Cohen J. 1994. The earth is round (p < .05). Amer Psychol. 49:997–1003.
  • Else H, Van Noorden R. 2021. The battle against paper mills. Nature. 591:516–519.
  • Engber D. 2017. Daryl Bem proved ESP is real: which means science is broken. Slate https://slate.com/health-and-science/2017/06/daryl-bem-proved-esp-is-real-showed-science-is-broken.html. June 07, 2017.
  • Falk R, Greenbaum CW. 1995. Significance tests die hard: the amazing persistence of a probabilistic misconception. Theory Psychol. 5(1):75–98.
  • Fisher RA. 1925. Statistical methods for research workers. Edinburgh, London: Oliver & Boyd.
  • Gigerenzer G. 2004. Mindless statistics. J Socio-Econ. 33:587–606.
  • Good IJ. 1975. Explicativity, corroboration, and the relative odds of hypotheses. Synthese 30:39–73.
  • Hassler U, Hosseinkouchack M. 2022. Understanding nonsense correlation between (independent) random walks in finite samples. Stat Papers 63:181–195.
  • Hassler U, Pohle MO. 2022. Unlucky number 13? Manipulating evidence subject to snooping. Int Stat Rev. 90:397–410.
  • Hirschauer N, Grüner S, Mußhoff O, Becker C. 2019. Twenty steps towards an adequate inferential interpretation of p-values in econometrics. J Econ Stat. 239(4):703–721.
  • Hollenbeck JR, Wright PM. 2017. Harking, sharking, and tharking: Making the case for post hoc analysis of scientific data. J Manage. 43(1):5–18.
  • Ioannidis JPA. 2005. Why most published research findings are false. PLoS Med. 2(8):e124.
  • John LK, Loewenstein G, Prelec D. 2012. Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol Sci. 23(5):524–532.
  • Kerr NL. 1998. HARKing: hypothesizing after the results are known. Personal Social Psychol Rev 2(3):196–217.
  • Kourany J, Carrier M. 2020. Science and the production of ignorance–when the quest for knowledge is thwarted. Cambridge, MA: MIT Press.
  • Leamer EE. 1978. Specification searches: ad hoc inference with nonexperimental data. New York: Wiley.
  • Meehl PE. 1978. Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. J Consult Clin Psychol 46(4):806–834.
  • Orzack SH, Stubblefield JW, Akmaev VR, Colls P, Munné S, Scholl T, Steinsaltz D, Zuckerman JE. 2015. The human sex ratio from conception to birth. Proc Natl Acad Sci. 112:E2102–E2111.
  • Pearson K. 1900. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philos Mag (Ser 5). 50(302):157–175.
  • Proctor RN. 1995. Cancer wars: how politics shapes what we know and don’t know about cancer. New York: BasicBooks.
  • Proctor RN. 2008. Agnotology: A missing term to describe the cultural production of ignorance (and its study). In: Proctor RN, Schiebinger L, editors. Agnotology–the making and unmaking of ignorance. Redwood City, CA: Stanford University Press. p. 1–33.
  • Rozeboom WW. 1960. The fallacy of the null-hypothesis significance test. Psychol Bull. 57:416–428.
  • Shoesmith E. 1987. The continental controversy over Arbuthnot’s argument for divine providence. Hist Math. 14(2):133–146.
  • Simmons JP, Nelson LD, Simonsohn U. 2011. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol Sci. 22(11):1359–1366.
  • Simonsohn U. 2014. Posterior-hacking: Selective reporting invalidates Bayesian results also. Available at SSRN 2374040.
  • Simonsohn U, Nelson LD, Simmons JP. 2014. p-curve: A key to the file-drawer. J Exp Psychol Gen. 143(2):534–547.
  • Sterling TD. 1959. Publication decisions and their possible effects on inferences drawn from tests of significance - or vice versa. J Amer Stat Assoc. 54(285):30–34.
  • Sterling TD, Rosenbaum WL, Weinkam JJ. 1995. Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. Amer Stat. 49(1):108–112.
  • Stigler SM. 1986. The history of statistics: the measurement of uncertainty before 1900. Cambridge, MA and London, England: Belknap Press of Harvard University Press.
  • Utts J. 1999. The significance of statistics in mind-matter research. J Sci Explor. 13(4):615–638.
  • Wagenmakers EJ, Wetzels R, Borsboom D, van der Maas HLJ. 2011. Why psychologists must change the way they analyze their data: the case of psi: Comment on Bem (2011). J Pers Soc Psychol. 100(3):426–432.
  • Wasserstein RL, Lazar NA. 2016. The ASA’s statement on p-values: Context, process, and purpose. Amer Stat. 70(2):129–133.
  • Wasserstein RL, Schirm AL, Lazar NA. 2019. Moving to a world beyond “p < 0.05”. Amer Stat. 73(sup1):1–19.
  • Ziliak ST, McCloskey DN. 2008. The cult of statistical significance: how the standard error costs us jobs, justice, and lives. Ann Arbor, MI: University of Michigan Press.