Research Articles

Probability of minority inclusion is underestimated

Pages 502-522 | Received 23 Jun 2023, Accepted 17 Apr 2024, Published online: 19 May 2024

ABSTRACT

Perception of the probability of minority inclusion in the groups with which we interact is important to daily behaviours (e.g., teachers may consider the probability that a class of 30 students includes at least one gay/bisexual student). The present study showed that the participants surprisingly underestimated this probability even when the group size and the prevalence of the minority were given. Approximately 90% of the participants estimated lower than the mathematically normative probability. The underestimation was larger than in the case of the arithmetically isomorphic probability of the cumulative risk, suggesting a cognitive bias specific to the probability of inclusion. Some of the heuristics used for the estimations, such as the participants using an expected value, were relevant to the underestimation. This cognitive bias may mislead people into believing that minorities are irrelevant to them. It was also shown that the participants’ attitudes became more inclusive when they were informed of the normative probability of inclusion.

1. Introduction

Imagine you are a teacher addressing 30 new students. What is the probability that one or more of these students has colour deficiency? What is the probability of including gay or bisexual students? If you estimate these probabilities to be low, you may not be motivated to use accessible colour schemes and bias-free language. However, these minority traits are not visually apparent. In addition, minority people often hide these traits/identities to avoid harassment and discrimination. For example, in the EU and the UK, only 5% of LGBTI youth (aged 15–17) are very open (European Union Agency for Fundamental Rights, 2020). We have difficulty perceiving the presence of minorities in everyday situations, which requires probabilistic thinking. Hence, estimating the probability of inclusion (pinc), namely, the probability that a person with a given trait is included in a group, is critical for real-world decision making in uncertain situations.

In this study, I examine how people estimate pinc. When information about a group is scarce, pinc can be estimated using a binomial distribution. Given the group size n and prevalence of the trait in the general population q, pinc is obtained as $1 - (1 - q)^n$. However, in most cases, it is beyond humans’ cognitive ability to calculate pinc mentally (e.g. $1 - (1 - 0.03)^{30}$ can only be calculated using a computer). Further, the cognitive psychology literature indicates that binomial probabilities are often misperceived. Using gambling tasks, early studies showed that conjunctive probabilities ($q^n$), namely, the probability of winning n consecutive gambles with a single win probability q, are overestimated, whereas disjunctive probabilities ($1 - (1 - q)^n$), namely, the probability of at least one win, are underestimated (Bar-Hillel, 1973; Cohen et al., 1971; Cohen & Hansel, 1957; Slovic, 1969). The latter is arithmetically isomorphic to pinc. The same misperception has been reported for risk. If the risk of an accident occurring each year is q, then the cumulative risk (i.e. the probability of at least one accident during n years) is given by $1 - (1 - q)^n$. This is also isomorphic to pinc. Underestimation has often been reported for the cumulative risk (De La Maza et al., 2019; Juslin et al., 2015), but it can also be overestimated (Doyle, 1997; Fuller et al., 2004).
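As a minimal sketch of the normative calculation described above, the probability of inclusion can be computed as the complement of the probability that nobody in the group has the trait. The function name `p_inc` is an illustrative choice rather than anything used in the original study, and the calculation assumes that group members are sampled independently, as the binomial model implies.

```python
def p_inc(q: float, n: int) -> float:
    """Probability that at least one of n independently sampled people
    has a trait whose prevalence in the population is q (0 <= q <= 1)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a proportion between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    # P(at least one) = 1 - P(none), and P(none) = (1 - q) ** n.
    return 1.0 - (1.0 - q) ** n


if __name__ == "__main__":
    # The example from the text: q = 3%, n = 30.
    print(f"{100 * p_inc(0.03, 30):.1f}%")
```

The same function also gives the disjunctive gambling probability and the cumulative risk, since all three share the same arithmetic form.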

Although it is not surprising that people cannot mentally calculate these probabilities, the estimated probabilities are not random but biased because people often use cognitively effortless strategies, called heuristics, to estimate probabilities. Tversky and Kahneman (1974) proposed an “anchoring and adjustment” heuristic as a source of such bias. The elementary risk q is usually small, say 1%, and serves as a starting point (“anchor”) for the estimation. However, the cumulative risk is often much higher than q (e.g. 39.5% if n is 50). While people do make adjustments, such adjustments are often insufficient, leading disjunctive probabilities to be underestimated. Subsequent studies have found a variety of heuristics for estimating the cumulative risk and shown that both underestimation and overestimation occur depending on q, n, and the heuristic used. People often use an additive heuristic (i.e. adding a constant value for each unit increase in n). If the constant added is the elementary risk q itself, the heuristic is also called a multiplicative heuristic (e.g. if annual risk q = 1%, then 10% risk for 10 years). The multiplicative heuristic provides good estimates when q and n are small, while it incurs overestimation bias for large q and n (Doyle, 1997; Fuller et al., 2004). Additive and multiplicative heuristics may even yield values above 100% (e.g. q = 5% and n = 50 yield 250%). In such cases, people either report the value as it is (De La Maza et al., 2019) or cap it at 100% (truncated heuristic; Doyle, 1997).

Another common heuristic is the constant heuristic, which ignores n and sticks to q (e.g. if annual risk q = 1%, then the cumulative risk is 1% for any number of years; De La Maza et al., 2019; Doyle, 1997; Shaklee & Fischhoff, 1990). Using the constant heuristic underestimates the cumulative risk. People also use a mean heuristic to estimate the cumulative risk (Juslin et al., 2015) based on the mean of elementary risks over time. This heuristic is equivalent to the constant heuristic when the elementary risk is constant.
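The heuristics reviewed above can be compared with the normative value in a few lines. The following is only a sketch: the function names are illustrative, and each function is a simplified reading of the verbal descriptions in the cited literature rather than a model taken from those papers.

```python
def normative(q: float, n: int) -> float:
    """Cumulative risk of at least one event in n periods with per-period risk q."""
    return 1.0 - (1.0 - q) ** n


def multiplicative(q: float, n: int) -> float:
    """Add q once per period; the result can exceed 1 (i.e. 100%)."""
    return q * n


def truncated(q: float, n: int) -> float:
    """Multiplicative estimate capped at 1 (i.e. 100%)."""
    return min(1.0, q * n)


def constant(q: float, n: int) -> float:
    """Ignore n and stick to the elementary risk q."""
    return q


if __name__ == "__main__":
    # (q, n) pairs taken from the examples in the text.
    for q, n in [(0.01, 10), (0.01, 50), (0.05, 50)]:
        print(
            f"q={q:.0%} n={n}: "
            f"normative={normative(q, n):.1%}, "
            f"multiplicative={multiplicative(q, n):.1%}, "
            f"truncated={truncated(q, n):.1%}, "
            f"constant={constant(q, n):.1%}"
        )
```

For q = 1% and n = 10, the multiplicative estimate (10%) is close to the normative value (about 9.6%), whereas for q = 5% and n = 50 it reaches 250%, which illustrates why the direction of the bias depends on q, n, and the heuristic.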

Do people misperceive pinc as well as the cumulative risk? In the present study, I asked participants to estimate pinc for various topics and values of q and n. It seemed likely that pinc would be estimated more accurately than the cumulative risk, since a knowledge-based cognitive schema may mitigate reasoning fallacies (Chen & Holyoak, 1985). For example, through daily life, people may have learned that pinc for women among a group of 100 randomly chosen individuals should be much higher than the prevalence (≈ 50%), as gender is visually expressed in most cultures. Such learning is unlikely to occur for cumulative risks over many years. Another hypothesis was that people overestimate pinc because social surveys have revealed that the prevalence of ethnic and sexual minorities in the general population is often overestimated compared with reality (Citrin & Sides, 2008; Ipsos, 2015; Newport, 2015; Wong, 2007). Both sociological factors (Alba et al., 2005; Gallagher, 2003; Lee et al., 2019; Martinez et al., 2008) and cognitive factors (Kardosh et al., 2022; Khaw et al., 2021; Landy et al., 2018) underlie this misperception. If there is a general bias towards overestimating the number of minorities, pinc would also be overestimated.

Examining the perception of pinc also raises a theoretical issue about the study of probability judgements. As reviewed above, previous studies have considered disjunctive probabilities as an issue on the temporal axis (e.g. the cumulative risk over time and consecutive gambles). By contrast, pinc is a disjunctive probability over a population rather than over time. While cognitive fallacies in judgements of population-based probability are well known (e.g. base-rate neglect; Casscells et al., 1978; Kahneman & Tversky, 1973; Stengård et al., 2022), the perception of pinc has not been empirically studied yet. If people make probability judgements using different mental models in the time domain and the population domain, then pinc is estimated differently from the cumulative risk.

2. Experiment 1

Experiment 1 was preregistered (https://doi.org/10.17605/OSF.IO/3AQZE). The experimental procedure was approved in advance by the Niigata University Ethical Review Board for Human Research (2022-0030).

2.1. Method

2.1.1. Participants

A sample size of 90 was planned for each of five conditions. A total of 450 individuals were recruited through a crowdsourcing platform (crowdworks.jp). The participants received a monetary reward of JPY 250. Twenty-one participants did not follow the attention check instructions (Gummer et al., 2021) and their data were excluded from the analyses. The remaining 429 comprised the final sample (236 women, 193 men, self-reported age 18–76 years [M = 40.8, SD = 9.8]). See the Supplementary Materials for the other demographic data and determination of the sample size. All the participants provided consent to participate in advance.

2.1.2. Design and conditions

Figure 1 shows an overview of the experiment. The participants were randomly assigned to one of five conditions: Negative, Positive, Visible, Neutral, and Majority. Each participant made five estimations of pinc (Q1–Q5). In each problem, the prevalence q (as a percentage) of the trait in question and group size n were given. The content of Q1–Q4 varied among the conditions, while Q5 was constant (blood type problem, q = 20%, n = 30). Hereafter, for convenience, the combination of q and n values used in each problem is abbreviated to q%–n (e.g. “20%–30”).

Figure 1. Overview of the conditions and procedures in Experiment 1. Each participant estimated pinc (probability of inclusion) for five problems (Q1–Q5). The topics of Q1–Q4 varied across conditions. The original questions were in Japanese. q, prevalence presented in the problems. n, group size presented in the problems. N, sample size.


In the Negative condition, the participants estimated pinc for minority traits often negatively stereotyped (colour deficiency [q = 3%], gay/bisexual students [7%]) in Q1–Q4. In the Positive condition, synaesthesia (3%) and absolute pitch (7%) were used as minority traits often positively stereotyped. These conditions were used to examine any effect of stereotypes on minorities. In the Visible condition, relatively visible minorities were used (foreigners [3%], police women [7%]). If participants had accurate knowledge of pinc through everyday experience, their estimations should be more accurate in this condition than in the Neutral condition. The Neutral condition was designed to provide a baseline free from the influence of participants’ knowledge of real minority traits. In this condition, fictional topics of sweat content “PS22” (3%) and gene “Cg-1X” (7%) were used. In the Majority condition, the participants estimated pinc for normal colour vision (97%) and heterosexual students (93%). The purpose of this condition was to test whether pinc for minorities and pinc for majorities are estimated in the same way. In all conditions, two group sizes (n = 30 and 80) were used. All the topics and q values were determined to be realistic in the context of Japan at the time of the experiment (see Table 1).

Table 1. The topics used in Experiment 1 and the results of the prevalence estimation task. The prevalence of each topic was estimated by the Experiment 1 participants who did not see the topic in the preceding pinc problems (see ). Medians of the estimated prevalence are shown with 95% CIs. The actual prevalence values were adopted from the references. In reality, it is often impossible to determine the prevalence of the traits definitively. These traits have a wide continuum of individual differences and may even be multidimensional (Bermudez & Zatorre, 2009; Bosten, 2019; Epstein et al., 2012; Simner, 2012).

2.1.3. The questionnaires

The online questionnaires were designed using lab.js (Henninger et al., 2022). Since there were five conditions, five questionnaires were designed. See the Supplementary Materials for the complete list of the questions and texts. All the instructions and questions were written in Japanese, and the examples below are the English translations.

On the cover page, the participants reported their age, gender identity, highest level of education, and blood type. The blood type item was included to test whether the estimated pinc for blood type B (blood type problem, Q5) differed between the participants with blood type B and others. The contents of the subsequent pages varied by condition, as described below.

2.1.3.1. Negative condition

On page 2, two problems on colour deficiency were presented (Q1, 3%–30 and Q2, 3%–80):

Please answer in numbers. No need to enter “%”.

Some people have difficulty distinguishing between reddish and greenish colours. Medically, this is called colour deficiency. It is said that 3% of the population has colour deficiency.

What do you think the percent probability is that there is even one person having colour deficiency among 30 people? [  ]

What do you think the percent probability is that there is even one person having colour deficiency among 80 people? [  ]

On page 3, two problems on gay/bisexual students were presented (Q3, 7%–30 and Q4, 7%–80):

Some people are gay, where they are attracted to the same sex, and some people are bisexual, where they are attracted to both the same and the opposite sex. It is said that 7% of college students are gay/bisexual.

What do you think the percent probability is that there is even one gay/bisexual student among 30 students? [  ]

What do you think the percent probability is that there is even one gay/bisexual student among 80 students? [  ]

Then, the blood type problem (Q5, 20%–30) was presented on page 4 (see Figure 2(a) for the question text). On the next page, the participants completed a comprehension rating (Q6, “Did you understand the meaning of the question on the previous page?”, 1: not at all–4: very well) and an attention check (Q7, “To confirm that you are reading the questions, choose 1 for the options below”, 1: not at all–4: very well). The attention check item was used to detect inattentive respondents (see Gummer et al., 2021).

Figure 2. Results of Experiment 1, participants’ estimates of the probability of inclusion (pinc) at various settings of q and n. A histogram of the estimated pinc for the blood type problem (Q5) is shown in a. The blood type problem was presented to all the participants of Experiment 1. The results of the other four pinc problems (Q1–Q4) of the conditions other than the Majority condition are shown in b to e. The topics used in these problems varied among the conditions. The results of the Majority condition are shown in f to i. Estimates less than 0 or larger than 100 are not shown in this figure. q, prevalence presented in the problems. n, group size presented in the problems. N, sample size. EV, expected value.


The prevalence estimation task followed (page 6). This task was introduced to examine whether prevalence would be overestimated for the topics used in the experiment. The topics that appeared in the pinc problems (Q1–Q4) of the other conditions (except for the fictional topics of the Neutral condition) were used for this task. In the Negative condition for example, the participants estimated the prevalence of synaesthesia, absolute pitch, foreigners, and police women.

Finally, the information on the actual prevalence (Table 1) of the minority examples that appeared in the questionnaire was provided (debriefing, page 7).

2.1.3.2. Positive condition

Two pinc estimation problems on synaesthesia were presented on page 2 (Q1, 3%–30 and Q2, 3%–80) and another two problems on absolute pitch were presented on page 3 (Q3, 7%–30 and Q4, 7%–80). The rest of the questionnaire was the same as in the Negative condition, except that the topics for the prevalence estimation task were colour deficiency, gay/bisexual students, foreigners, and police women.

2.1.3.3. Visible condition

Two pinc estimation problems on foreigners were presented on page 2 (Q1, 3%–30 and Q2, 3%–80) and another two problems on police women were presented on page 3 (Q3, 7%–30 and Q4, 7%–80). The rest of the questionnaire was the same as in the Negative condition, except that the topics for the prevalence estimation task were colour deficiency, gay/bisexual students, synaesthesia, and absolute pitch.

2.1.3.4. Neutral condition

Two pinc estimation problems on fictional sweat content were presented on page 2 (Q1, 3%–30 and Q2, 3%–80) and another two problems on a fictional gene were presented on page 3 (Q3, 7%–30 and Q4, 7%–80). The rest of the questionnaire was the same as in the Negative condition, except that the topics for the prevalence estimation task were colour deficiency, gay/bisexual students, synaesthesia, absolute pitch, foreigners, and police women. In addition, the participants were debriefed that the topics used in the pinc estimation problems were fictional.

2.1.3.5. Majority condition

Two pinc estimation problems on normal colour vision were presented on page 2 (Q1, 97%–30 and Q2, 97%–80) and another two problems on heterosexual students were presented on page 3 (Q3, 93%–30 and Q4, 93%–80). The rest of the questionnaire was the same as in the Negative condition, except that the topics for the prevalence estimation task were synaesthesia, absolute pitch, foreigners, and police women.

2.2. Results and discussion

2.2.1. Analysis of the estimates

All the pinc estimates were recorded as integers because the online questionnaires accepted only integer input. Negative estimates and estimates over 100 were included in the analyses, but were rare (0.2% of all the pinc estimates). Figure 2 shows the distributions of the participants’ pinc estimates.

2.2.2. Blood type problem (Q5)

The blood type problem (20%–30) was common to all the participants. As shown in Figure 2(a), pinc was greatly underestimated. The normative pinc was 99.9, and 90.9% (390) of the estimates were below this. The median estimate was 20, which was significantly below the normative pinc (signed-rank test, T = 924, p < .001). Further, 50.8% of the estimates were equal to or less than q. A substantial proportion of the participants (17.9%) estimated “20” (= q), suggesting the use of the constant heuristic. An unexpected result was the considerable number of estimates well below q, particularly “6” (16.3%). An estimate of 6 corresponds to the expected value (EV), namely, 20% of 30. These estimates were not attributable to any of the heuristics previously reported for estimating the cumulative risk. The estimates by blood type B participants (Mdn = 25, N = 86) were not different from those by non-type B (including “don’t know”) participants (Mdn = 20, N = 343; Wilcoxon rank-sum test, W = 14088, p = .517, Cliff’s d = .045).
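For reference, the three values discussed above can be worked out directly from the figures given in the problem (q = 20%, n = 30). This is only an illustrative calculation; the symbol $p_{\mathrm{inc}}$ is used here for pinc.

```latex
\begin{align*}
p_{\mathrm{inc}} &= 1 - (1 - q)^{n} = 1 - 0.8^{30} \approx 0.999
  && \text{(normative value, 99.9\%)} \\
q &= 0.20
  && \text{(constant heuristic, estimate of ``20'')} \\
q \times n &= 0.20 \times 30 = 6
  && \text{(EV heuristic, estimate of ``6'')}
\end{align*}
```

Note that the EV is an expected number of people rather than a probability, so reporting it as a percentage conflates two different quantities.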

The participants were partially aware of their difficulty in estimating pinc. The blood type problem estimates of the participants who reported a subjectively good comprehension (rated 3 or 4 for the comprehension rating; N = 279) were significantly higher (Mdn = 40) than those who felt they had not comprehended the question (Mdn = 20; W = 17215, p = .002, Cliff's d = .177). However, 90.0% of those who reported a good comprehension still underestimated the pinc and 45.2% estimated equal to or less than q.

Did the participants lack the relevant mathematical knowledge? Although I did not ask the participants if they understood binomial probability, I did find an effect of education. When the participants were split into relatively highly educated (N = 265, approximately 4 or more years of higher education) and relatively less educated (163) groups, the former provided higher estimates for the blood type problem than the latter (Mdn = 30 and 20, respectively; W = 18122, p = .005, Cliff's d = .161). Nevertheless, the highly educated group still showed a substantial underestimation (90.6%). Participant age did not correlate with the estimates (Spearman’s ρ = .03).
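Cliff's d, reported as the effect size for the rank-sum comparisons above, is conventionally defined as the proportion of cross-group pairs in which the first group's value is larger minus the proportion in which it is smaller. The sketch below follows that textbook definition; the article does not state how d was computed, and the example data are hypothetical, not taken from the study.

```python
from itertools import product
from typing import Sequence


def cliffs_d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Cliff's d: P(x > y) - P(x < y) over all cross-group pairs (ties count as zero)."""
    if len(xs) == 0 or len(ys) == 0:
        raise ValueError("both groups must be non-empty")
    greater = 0
    less = 0
    for x, y in product(xs, ys):
        if x > y:
            greater += 1
        elif x < y:
            less += 1
    return (greater - less) / (len(xs) * len(ys))


if __name__ == "__main__":
    # Hypothetical estimates (0-100 scale) for two groups of respondents.
    group_a = [40, 20, 60, 6, 90]
    group_b = [20, 6, 20, 3, 50]
    print(cliffs_d(group_a, group_b))
```

The statistic ranges from -1 to 1, with 0 indicating complete overlap between the two distributions, which is why values such as .045 are read as negligible differences.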

2.2.3. Pinc estimates for minorities and majorities

For the pinc problems (Q1–Q4) of all the conditions, pinc was underestimated compared with the normative pinc (signed-rank tests, ps < $10^{-10}$). Figure 2(b) shows the results of the 3%–30 problems for the 348 participants in the four minority conditions (Negative, Positive, Visible, and Neutral). Most of the estimates (85.1%) were less than the normative pinc of 59.9. Both the constant heuristic (estimated “3”) and the EV heuristic (“1” ≈ 0.03 × 30; note that the estimates were integers) were evident. The results for the other settings of q and n (3%–80, 7%–30, and 7%–80) showed a virtually identical pattern (Figure 2(c–e)). As shown in Table 2, when q was 3% or 7%, roughly 90% of the estimates were underestimations, while 40–70% were equal to or less than q.
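To make the gap between the normative values and the heuristic-based answers concrete, the following sketch tabulates both for the four minority settings. The function names are illustrative, and rounding the EV to the nearest integer is an assumption made here to mirror the integer response format; it is not a claim about how individual participants rounded.

```python
def p_inc(q: float, n: int) -> float:
    """Normative probability that a group of n includes at least one person with the trait."""
    return 1.0 - (1.0 - q) ** n


def summarise(q: float, n: int) -> dict:
    """Normative value and heuristic-based answers on the 0-100 response scale."""
    return {
        "normative": round(100 * p_inc(q, n), 1),
        "constant": round(100 * q),  # answer equals the stated prevalence
        "ev": round(q * n),          # expected number of people, rounded (assumption)
    }


if __name__ == "__main__":
    # The four (q, n) settings used in Q1-Q4 of the minority conditions.
    for q, n in [(0.03, 30), (0.03, 80), (0.07, 30), (0.07, 80)]:
        print(f"{q:.0%}-{n}: {summarise(q, n)}")
```

In every setting, both heuristic answers fall far below the normative value, which is consistent with the large share of estimates at or below q.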

Table 2. Descriptive statistics of the estimated probability of inclusion in Experiments 1 and 2. Prevalence (q) and group size (n) were presented in the problems. Underestimation was defined as estimates less than the normative pinc (probability of inclusion) for each problem.

The participants estimated pinc differently for minorities and majorities. In the Majority condition, the use of the constant heuristic was still apparent, whereas the EV heuristic was not used (Figure 2(f–i)). About 75% of the estimates in the Majority condition were underestimations (Table 2). However, this was a lower percentage than for the Negative condition in which the participants estimated pinc for the counterpart minorities (Q2–Q3, χ²s(1) > 7.43, ps < .006, Cohen’s ws > .210), except for Q1 (χ²(1) = 2.72, p = .099, w = .127).

2.2.4. Possible determinants of the pinc estimations

Why was pinc underestimated? Did the participants believe that the prevalence of the minorities in question was much less than the q values presented? This was unlikely, as people often overestimate minority prevalence (Citrin & Sides, 2008; Ipsos, 2015; Newport, 2015; Wong, 2007). In fact, the prevalence estimates by the participants were comparable to or larger than the qs presented (Table 1).

Are negative stereotypes (if any) about minorities relevant? Affect and motivation may distort people’s probability judgements (Keller et al., 2006; Knäuper et al., 2005; Slovic & Peters, 2006; Weinstein, 1989). However, the pinc underestimation occurred as frequently in the Neutral condition (Table 2). Further, the estimates in the Negative condition were comparable with those in the Positive condition. The median estimates did not differ between these conditions (Wilcoxon rank-sum tests, ps > .192, Cliff’s ds < .116). The proportion of the underestimation did not differ either (Fisher’s exact probability tests, ps > .575, ws < .050). As an unexpectedly large number of estimates were below q, I examined the frequency of the estimates ≤ q in an ad-hoc analysis (Table 2). The results of χ² tests on the four pinc problems (3%–30, 3%–80, 7%–30, and 7%–80) showed no significant difference between the Negative and Positive conditions (χ²s(1) < 2.40, ps > .122, ws < .120).

Would the underestimation be eliminated if relatively visible minority traits were used? The comparisons between the Visible and Neutral conditions revealed that this was partly the case, but only for the 7%–80 problem (Q4). The median estimate was significantly larger in the Visible condition Q4 than in the Neutral condition Q4 (W = 3392.5, p = .036, Cliff's d = .180). For the other settings of q and n (3%–30, 3%–80, and 7%–30), there were no such differences (ps > .139, Cliff's ds < .125). The proportion of estimates ≤ q was significantly lower in the Visible condition than in the Neutral condition for 7%–80 (χ²(1) = 10.54, p = .001, w = .241), but not for 3%–30, 3%–80, and 7%–30 (ps > .060, ws < .140). The frequency of underestimation did not differ between the conditions (ps > .253, ws < .100). Although a partial contribution of visibility to the pinc estimations was found, pinc was greatly underestimated in the Visible condition as well.

3. Experiment 2

The purpose of Experiment 2 was threefold. First, it tested the replicability of the pinc underestimation with a student sample. Second, to find out what heuristics were used, the participants were asked to verbally report how they made their estimates. Third, to rule out the effect of computational difficulty, they were asked to describe a solution without actually calculating it and to calculate the normative pinc using computers.

Experiment 2 was not preregistered because its primary purpose was to examine qualitatively the participants’ verbal reports on heuristics. The experimental procedure was approved in advance by the Niigata University Ethical Review Board for Human Research (2022-0030).

3.1. Method

3.1.1. Participants

Forty-eight undergraduate students completed Experiment 2 (30 women and 18 men, age M = 19.4, SD = 0.8). See the Supplementary Materials for the other demographic data. They received a voucher worth JPY 300 as a reward. They were instructed in advance that they would need a computer to participate in this experiment. All the participants provided consent to participate in advance.

3.1.2. The questionnaire

The participants were not divided into conditions and they completed the same online questionnaire. See the Supplementary Materials for the entire list of the questions and the texts.

On the cover page, the participants reported their age, gender identity, and the faculty to which they belonged. Then, pages 2 and 3 showed four pinc estimation problems (3%–30, 3%–80, 7%–30, 7%–80), which were identical to those in the Negative condition in Experiment 1. On page 4, the comprehension rating item was shown, which was also identical to that in Experiment 1. Thereafter, the participants retrospectively reported how they made their estimation in a text field (verbal report, “How did you determine your answer to the question on the previous page?”). Experiment 2 did not incorporate an attention check. On page 5, the maths solution question was presented: the participants were asked to provide a mathematical solution for estimating pinc (“If you had to solve the following problem as a maths problem, how would you solve it? Please describe the solution in formulae or words. You do not have to provide the actual answer”). Then, the fictional gene problem of 7%–30 used in the Neutral condition of Experiment 1 was provided as the problem. On the next page, the mathematically normative solution for the fictional gene problem of the previous page was provided to the participants. They were asked to calculate it using a computer and any other devices (the calculation task).

Similar to Experiment 1’s debriefing, the actual percentages of colour deficiency and gay/bisexual people known from surveys were given on page 7. Further, the participants were debriefed that the gene “Cg-1X” of the maths solution question was fictional.

3.2. Results and discussion

The general tendency to underestimate as well as the use of constant and EV heuristics were apparent (Figure 3, Table 2). For all the pinc estimation problems, the median estimates were significantly lower than the respective normative pinc (signed-rank tests, ps < .001). The percentage of underestimation did not differ from that of the Negative condition in Experiment 1, which used identical problems (Fisher’s exact probability tests, ps > .220, ws < .127). However, the extent of the underestimation was smaller in Experiment 2. The median estimate was significantly larger than in the Negative condition in Experiment 1 in all the pinc problems (Wilcoxon rank-sum tests, ps < .001, Cliff’s ds > .340). The estimates ≤ q were also less frequent in Experiment 2 than in the Negative condition in Experiment 1 (χ²s(1) > 10.6, ps < .001, ws > .279). The students provided relatively more accurate estimates than did the online workers in Experiment 1.

Figure 3. Results of Experiment 2, the estimated probability of inclusion for the student sample. Students (N = 48) with PCs estimated the probabilities of inclusion for four problems (Q1–Q4) identical to those in the Negative condition of Experiment 1 (colour deficiency problems, gay/bisexual student problems). Estimates larger than 100 are omitted in this figure. EV, expected value. q, prevalence presented in the problems. n, group size presented in the problems.


Analysis of the verbal reports given after the pinc estimation problems identified only four students (8.3%) as having used the normative solution in any of the pinc estimation problems. This low percentage was not due to computational difficulty. For the maths solution question, only 12.5% of the students described the normative solution and an additional 8.3% reported partially normative solutions. However, when the normative solution was given (the calculation task), the majority of the students (60.4%) reported the correct answer (“88” or “89”) and an additional 14.6% reported its complementary probability (e.g. “11”).
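The answers accepted in the calculation task for the 7%–30 problem can be checked as follows. This is an illustrative worked calculation under the binomial model used throughout the article, with $p_{\mathrm{inc}}$ standing for pinc.

```latex
\begin{align*}
p_{\mathrm{inc}} &= 1 - (1 - 0.07)^{30} = 1 - 0.93^{30} \approx 1 - 0.113 = 0.887, \\
1 - p_{\mathrm{inc}} &= 0.93^{30} \approx 0.113 .
\end{align*}
```

Truncating or rounding 88.7% gives “88” or “89”, whereas an answer of “11” corresponds to the complementary probability, that is, the probability that the group includes no one with the trait.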

In summary, the pinc underestimation was not due solely to computational difficulty. Even when no actual calculation was required, most of the participants could not find the normative solution on their own. The results suggested that people have difficulty understanding the nature of pinc.

4. Experiment 3

How can pinc estimations be improved? In Experiment 3, online workers estimated pinc for several modified versions of the 7%–30 problem in the Negative condition of Experiment 1.

Experiment 3 was preregistered. Four conditions (Control, Hint, Complementary, and No Group Size; see Section 4.1) were preregistered first (https://doi.org/10.17605/OSF.IO/JCPVG). After obtaining the data for these conditions, two additional conditions (Frequency and Cumulative Risk) were preregistered (https://doi.org/10.17605/OSF.IO/HMZB2) and conducted. The experimental procedures were approved in advance by the Niigata University Ethical Review Board for Human Research (2022-0170).

4.1. Method

4.1.1. Participants

Online workers were recruited in the same way as in Experiment 1. A sample size of 90 was planned for each of six conditions. As a result, 589 participants completed Experiment 3 in exchange for JPY 110. The author excluded the data from the 42 participants who had already participated in Experiment 1 or 3, and the 21 participants who failed to follow the instructions of the attention check (see Figure 4). The remaining 526 participants comprised the final sample (329 women and 197 men, age 18–74 years [M = 39.7, SD = 10.4]). See the Supplementary Materials for the other demographic data and determination of the sample size. All the participants provided consent to participate in advance.

Figure 4. Overview of the conditions and procedures in Experiment 3. The content of the critical question (Q2) varied between the conditions. The original questions were in Japanese. q, prevalence presented in the problems. n, group size presented in the problems. N, sample size.

4.1.2. Design and conditions

Each participant was randomly assigned to one of six conditions: Control, Hint, Complementary, No Group Size, Frequency, and Cumulative Risk. The key difference between the conditions was Q2 of the questionnaires (Figure 4), which was the critical question to compare.

4.1.3. The questionnaires

Six online questionnaires were prepared corresponding to the six conditions. Figure 4 illustrates an overview of the conditions. All the instructions and questions were written in Japanese. See the Supplementary Materials for the original texts. On the cover page, the participants reported their age, gender identity, and highest level of education in the same manner as in Experiment 1. As shown below, the contents of the subsequent pages varied by condition.

4.1.3.1. Control condition

In the Control condition, the following question (Q2, critical question) was shown on page 2. Q1 was not shown (see Figure 4). The critical question was an improved version of the gay/bisexual student problem (7%–30) used in Experiment 1. For clarity, the phrase hitori demo (“even one”) in Experiment 1 was replaced by sukunakutomo hitori (“at least one”). Furthermore, “%” was displayed next to the response field to emphasise that a percentage, not a number of people, should be estimated.

(Q2) Some people are gay, where they are attracted to the same sex, and some people are bisexual, where they are attracted to both the same and the opposite sex. It is said that 7% of college students are gay/bisexual.

What do you think the percent probability is that there is at least one gay/bisexual student among 30 students? [  ] %

On page 3, the confidence rating (1: not confident at all–4: very confident, Q4), attention check (Q5), and verbal report task (Q6) were presented. Q3 was not presented in the Control condition (see Figure 4). The attention check was identical to that in Experiment 1. The verbal report task asked the participants to report how they determined the answer to the critical question. Finally, similar to the Experiment 1 debriefing, the actual percentage prevalence of gay/bisexual people known from surveys was described for the participants (page 4).

4.1.3.2. Hint condition

On page 2 of the Hint condition, the gender problem (Q1) and the critical question (Q2) were presented. The gender problem was used only in this condition, and the critical question was identical to that in the Control condition. In the gender problem, the participants estimated the pinc of men/women (i.e. q = 50%) for n = 30:

(Q1) About 50% of the population is [male/female]. Imagine you are on a train. There are 30 passengers in the train carriage beside you.

What do you think the probability is that there is at least one [male/female] person among the passengers? [  ] %

The target gender (male/female) was randomly determined and displayed by a built-in function of the online questionnaire. This question was expected to serve as a hint that pinc is higher than q. Then, the critical question followed. Pages 3 (confidence rating, attention check, and verbal report) and 4 (debriefing) were identical to those in the Control condition.

4.1.3.3. Complementary condition

This condition tested whether people could estimate the complementary probability of pinc, namely, the probability that there is no minority member in a group (100% − pinc). On page 2 of this condition, the following critical question was presented:

(Q2) Some people are gay, where they are attracted to the same sex, and some people are bisexual, where they are attracted to both the same and the opposite sex. It is said that 7% of college students are gay/bisexual.

What do you think the percent probability is that there is no gay/bisexual student among 30 students? [  ] %

The rest of the questionnaire (pages 3 and 4) was identical to that in the Control condition.

4.1.3.4. No group size condition

The EV heuristic observed in Experiment 1 may have been due to the participants’ simple strategy to use the given numbers in a calculation without a specific aim (e.g. calculating the EV). Hence, it was hypothesised that if the group size n were not given, the EV heuristic would not be used, the constant heuristic would be used more, and thus the estimated pinc would be larger than that in the Control condition. To test this hypothesis, the participants of this condition were given q, but not n, in the critical question:

(Q2) Some people are gay, where they are attracted to the same sex, and some people are bisexual, where they are attracted to both the same and the opposite sex. It is said that 7% of college students are gay/bisexual.

A company held a job fair for college students. Imagine what the venue looks like. Students are gathered at the venue.

What do you think the percent probability is that the students in the venue include at least one gay/bisexual student? [  ] %

On page 3, the participants were first asked to report the imagined number of students in the critical question (“In the previous question, how many students did you imagine in the venue?”). This “imagined n” question (Q3) was presented only in this condition. The responses to this question were expected to be correlated with the estimated pinc in the critical question if the participants accounted for the imagined group size in the pinc estimations. Following the imagined n question, the confidence rating, attention check, and verbal report questions were presented in the same way as in the Control condition. The debriefing (page 4), which was also identical to that in the Control condition, followed.

4.1.3.5. Frequency condition

In the critical question (Q2) of this condition, the participants were asked to estimate how many classes out of 100 classes included at least one gay/bisexual student, where each class consisted of 30 students:

(Q2) Some people are gay, where they are attracted to the same sex, and some people are bisexual, where they are attracted to both the same and the opposite sex. It is said that 7% of college students are gay/bisexual.

In one college, there are 100 classes with 30 students per class. How many of these classes do you think include at least one gay/bisexual student? [  ] classes

The expected frequency was 88.7 classes, which was equivalent to the normative pinc. This condition was investigated because risk communication studies have found that risks expressed in a frequency format are perceived differently from risks expressed in a percentage or decimal format (Visschers et al., Citation2009), while a frequency format may also improve probability reasoning (Gigerenzer & Hoffrage, Citation1995; Girotto & Gonzalez, Citation2001; Hoffrage et al., Citation2000; but see also McCloy et al., Citation2010). The rest of the questionnaire (pages 3 and 4) was identical to that in the Control condition.
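As a minimal sketch of where the 88.7 figure comes from, the snippet below computes the probability of at least one minority member under the assumption that each of the n group members has the trait independently with probability q. The function name p_inc is hypothetical and is introduced here only for illustration; this section does not restate the formula itself.

```python
def p_inc(q: float, n: int) -> float:
    """Probability that a group of n people includes at least one person
    with a trait of prevalence q, assuming independent membership."""
    return 1.0 - (1.0 - q) ** n


q, n = 0.07, 30  # the 7%-30 problem used in the critical question
p = p_inc(q, n)

print(f"normative pinc: {100 * p:.1f}%")                  # about 88.7
print(f"complement (no member): {100 * (1 - p):.1f}%")    # about 11.3
print(f"expected classes out of 100: {100 * p:.1f}")      # same value, as a frequency
```

Under this assumption, the percentage format and the frequency format ask for the same number, which is why the two conditions can be compared directly.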

4.1.3.6. Cumulative risk condition

For comparison purposes, in the critical question (Q2), the participants of the Cumulative Risk condition were asked to estimate the percentage cumulative risk isomorphic to pinc. They were informed of the elementary risk and asked to estimate the cumulative risk:

(Q2) A certain medicine has the side effect of causing a mild headache. If you take this medicine once, there is a 7% probability that you will experience the side effect.

What do you think the percent probability is that you will experience the side effect at least once if you take this medicine once a day for 30 days? [  ] %

Page 3 (confidence rating, attention check, and verbal report) was identical to that in the Control condition. The debriefing (page 4) was not included because the topic of gay/bisexual students was not used in this condition.

4.2. Results and discussion

Even with the improved wording, the participants still showed a prominent underestimation in the Control condition critical question (Mdn = 7, 85.1% underestimation; see Figure 5(a) and Table A1). In the Hint condition, the estimates for the critical question (Mdn = 7) did not differ from those in the Control condition (W = 3512.5, p = .863, Cliff’s d = .015; see Figure 5(c)). The percentage underestimation (86.6%) did not differ from that in the Control condition, either (χ2 (1) = 0.1, p = .776, w = .022). Hence, providing the gender problem as a hint had no effect.

Figure 5. Results of Experiment 3 critical question (Q2). In the Control condition (a), the participants estimated pinc for q = 7% and n = 30 (Q2, gay/bisexual student problem with revised wording). For reference, (b) shows the result of the gender problem (Q1), which appeared only in the Hint condition. The results of the Control condition are plotted together with the results of the other conditions (c, e, f, and g) for comparison purposes. EV, expected value. q, prevalence presented in the problems. n, group size presented in the problems.

In the Complementary condition, the median estimate for the critical question was 40, which was significantly higher than the normative percent probability of 11.3 (signed-rank test, T = 3478, p < .001; see Figure 5(d)). The participants overestimated the probability that there is no minority member in a group, consistent with the pinc underestimation.

In the No Group Size condition, the estimates for the critical question were comparable with those in the Control condition (Figure 5(e)) even though n was not given. The median estimate was 9.5, which was not statistically different from that in the Control condition (W = 3458.5, p = .179, Cliff's d = .117). It was expected that the constant heuristic would be more frequently used since q was the only numerical information given. However, the proportion of estimate “7” ( = q) did not differ between this condition (11.1%) and the Control condition (5.7%; Fisher’s exact probability test, p = .281, w = .096). The imagined ns (Q3) ranged from 1 to 10,000 and the median was 100. No clear correlation between the imagined n and pinc estimate was found (ρ = .17, p = .115). These results suggest that it is difficult to account for group size when estimating pinc.

The estimates in the Frequency condition critical question were larger (Mdn = 10) than those in the Control condition (W = 2827, p = .001, Cliff's d = .278), suggesting the partial mitigation of the underestimation. As shown in Figure 5(f), the estimates below q decreased compared with the Control condition, while the “7” estimates increased. The frequency format thus produced a clear difference in the participants’ responses. Nevertheless, many of the participants (76.7%) in this condition still provided an underestimation and this proportion did not differ from that in the Control condition (χ2 (1) = 2.0, p = .157, w = .106).

The estimates of the Cumulative Risk condition critical question (Mdn = 11) were higher than those in the Control condition (W = 2755, p = .002, Cliff's d = .272). Consistent with previous studies, the constant heuristic (“7”) was often used to estimate the cumulative risk (Figure 5(g)). The estimates below q were infrequent. Hence, people estimate pinc and the cumulative risk differently. However, the proportion of underestimation (78.2%) was comparable with that in the Control condition (85.1%, χ2 (1) = 1.4, p = .240, w = .089).

The results of confidence rating (1: not confident at all–4: very confident) showed that the participants were not confident in any of the conditions. The average confidence rating for each condition ranged from 1.66 to 2.07.

5. Heuristics

5.1. Heuristics for estimating pinc and the cumulative risk

Did the heuristics used by the participants differ between estimating pinc and the cumulative risk? Based on the participants’ verbal reports (Experiments 2 and 3), I identified seven heuristics (Table 3). The composition of the heuristics used to estimate pinc significantly differed from that used to estimate the cumulative risk. A χ2 test of independence revealed that the frequency of the heuristics used significantly differed between the Control and Cumulative Risk conditions in Experiment 3 (χ2 (7) = 43.8, p < .001, with continuity correction, w = .502). Further, a post-hoc analysis of the residuals (α = .05) found that the frequencies for five of the seven heuristics were significantly different between the conditions (Table 3). The normative solution was more frequently observed in the Cumulative Risk condition than in the Control condition.

Table 3. Percentage frequency of the heuristics used in Experiments 2 and 3. The analyses of the estimates and participants’ retrospective verbal reports identified seven heuristics used in the estimation tasks in Experiments 2 and 3. Raw frequencies are in parentheses. In Experiment 3, the frequency of the heuristics was compared between the Control (pinc estimation) and Cumulative Risk conditions.

5.2. Use of the EV

A relatively common heuristic for estimating pinc was to calculate the EV (e.g. 0.07 × 30 = 2.1) and somehow translate it into a probability (EV-to-probability translation). Those participants using this heuristic typically stated that an EV of one or more indicated very high pinc (e.g. “90” and “100”). Although such a translation method is not mathematically normative, it is a relatively reasonable way to estimate pinc without the normative calculation. Indeed, high estimates (≥ 90) were often provided by using this heuristic, although some of the participants translated the EV into low probabilities (e.g. “10”). Importantly, this heuristic was infrequently used in the Cumulative Risk condition.

The EV heuristic was also frequently used to estimate pinc. It was apparent in Experiment 1 ((a–e)) and was used by 25% of the students in Experiment 2 (Table 3). It was relatively rare in the Cumulative Risk condition in Experiment 3, although no significant difference from the Control condition was found. Interestingly, those participants who tried to calculate the EV (2.1) as a pinc estimate in Experiment 3 actually reported two separate heuristics. Most of them calculated 0.07 × 30, probably because they simply confused pinc with the EV. Others calculated 0.3 × 7. They seemed to posit that a q of 7% meant that pinc for a group of 100 was also 7%, and assumed that pinc should be proportional to n. Like the additive heuristic, the EV heuristic implies that the participants correctly thought that pinc increases as n increases; nevertheless, it still yielded underestimations.

5.3. Constant and additive heuristics

Constant and additive heuristics were used less often for estimating pinc than for estimating the cumulative risk (Table 3), consistent with previous studies of cumulative risks (De La Maza et al., Citation2019; Doyle, Citation1997; Fuller et al., Citation2004). Figure 5(g) also shows the predominant use of the constant heuristic (“7”) for estimating the cumulative risk. The additive heuristic typically multiplies q by n (7 × 30, i.e. multiplicative heuristic). Since the result exceeded 100, an additional heuristic was often used (e.g. to truncate it to 100), as also reported in previous studies (Doyle, Citation1997; Fuller et al., Citation2004).

Given the above findings, the pinc estimation task was likely to have encouraged the participants to focus on the EV. By contrast, they were more likely to focus on q to estimate the cumulative risk.

5.4. Other heuristics

A heuristic of calculating 1/n was unexpected. The participants using this heuristic seemed to confuse the pinc of group size n with the proportion of one person in n (“the probability of one person out of 30 is 3%”). The confusion between the proportion and pinc was similar to that when using the constant heuristic. Because of the 1/n heuristic, “3” was the most frequent estimate in the Control condition of Experiment 3 (Figure 5(a)). However, it was not used in the Cumulative Risk condition () or in the pinc estimations in Experiments 1 and 2 ((d), ). These observations suggest that the 1/n heuristic is likely to be used when facing a single pinc estimation problem. Since two problems using different ns were given simultaneously on the questionnaires of Experiments 1 and 2, the participants easily noticed that the 1/n heuristic contradicts the intuition that increasing n must increase pinc.
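To make the contrast between the computational heuristics concrete, the following sketch evaluates each of them for the 7%–30 problem. It is an illustrative reading of the verbal descriptions in this section, not the coding scheme used for the verbal reports. The function name and dictionary keys are hypothetical, and the normative entry assumes independent group membership.

```python
def heuristic_estimates(q_percent: float, n: int) -> dict:
    """Percent estimates produced by the heuristics described in the text."""
    q = q_percent / 100.0
    return {
        # Normative value, assuming independent membership
        "normative": 100 * (1 - (1 - q) ** n),
        # EV heuristic: the expected number of members is reported as a percent
        "expected_value": q * n,
        # Constant heuristic: the prevalence itself is reported
        "constant": q_percent,
        # Additive (multiplicative) heuristic, truncated to 100
        "additive_truncated": min(q_percent * n, 100.0),
        # 1/n heuristic: the share of one person in the group
        "one_over_n": 100.0 / n,
    }


for name, value in heuristic_estimates(7, 30).items():
    print(f"{name:>20}: {value:.1f}")
```

Apart from the truncated additive heuristic, every shortcut here lands far below the normative value, which is in line with the underestimation pattern reported above.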

Finally, approximately 30% of the participants in Experiments 2 and 3 simply guessed the estimate without making a calculation. Of these, many referred to their own experiences, beliefs, and knowledge from the media (e.g. “In my life, I have met those people at about that probability”, “According to the TV and Internet, gay and bisexual people are all around us more than I thought”, and “I estimated it based on the fact that my body often experiences side effects”).

While the EV-related heuristics and constant heuristic were consistently observed in the present study, other heuristics (e.g. 1/n heuristic) were less common. Future studies are needed to examine these heuristics in more detail.

6. Experiments 4a and 4b

Would the pinc underestimation be replicated in a more realistic, ecologically valid situation? Experiments 1–3 may have had low ecological validity because they were conducted online and the participants estimated pinc for a fictitious group. By contrast, the participants in Experiments 4a and 4b were asked to estimate pinc for the group of people in the classroom with them. Each experiment was a one-shot experiment conducted as a group. Experiment 4a was conducted for a relatively small group, whereas Experiment 4b used a relatively large group.

Experiments 4a and 4b were not preregistered because even if the sample size had been determined in advance, it would have been difficult to adhere to it. The experimental procedure was approved in advance by the Niigata University Ethical Review Board for Human Research (2023-0220).

6.1. Method

6.1.1. Participants

Students were recruited after a class. The experimenter asked them to remain in the classroom if they wanted to participate and those who remained became participants. This procedure was performed once for a small class (Experiment 4a) and once for a large class (Experiment 4b). As a result, 16 individuals (10 men and 6 women) participated in Experiment 4a. They comprised 15 undergraduates (19–21 years) and one student from the local community (65 years). In Experiment 4b, 50 undergraduates (26 men and 24 women, age 18–22 years, M = 19.1) participated. See the Supplementary Materials for the other demographic data. In both experiments, all the participants took part without receiving monetary rewards or course credits. They provided consent to participate in advance. Each experiment took several minutes.

6.1.2. Procedure and the questionnaire

The procedure was the same in both experiments. First, questionnaire booklets were distributed to the participants. They were instructed to read the instructions on the cover page and respond to the demographic items (age, gender identity, and the faculty to which they belonged). Then, the experimenter asked them to proceed to the next page, which was as follows (all the content of the questionnaire was written in Japanese).

Do not fill this out until instructed.

There is a gene called ALDH2*2/*2. It is said that 7% of the population have this gene in Japan.

Now, there are [  ] people in this classroom.

What do you think the percent probability is that there is at least one person having the gene ALDH2*2/*2 in this classroom? [  ] %

Once you have completed your answers, proceed to the next page.

The experimenter instructed the participants to fill in the first blank space with the number of people in the classroom, which the experimenter had counted in advance. The number was 17 (16 participants and one experimenter) in Experiment 4a and 51 (50 participants and one experimenter) in Experiment 4b. Next, the participants were instructed to answer the pinc estimation problem (i.e. complete the second blank space).

After the pinc estimation, the participants proceeded to the knowledge check (page 3):

Do you know about the gene ALDH2*2/*2 (or ALDH2)? (Check one that applies)

[  ] I know

[  ] I’ve heard of it, but I don’t know much

[  ] I don’t know

Then, similar to other experiments, a description of the gene and its actual percentage prevalence was given as debriefing (page 4). After reading the debriefing, the participants submitted their questionnaires to the experimenter and the experiment finished.

The prevalence value (q) of the pinc estimation problem was set to 7% since it was also used in Experiments 1–3. The topic of the problem was a real genotype of the aldehyde dehydrogenase gene ALDH2. This was adopted because it is not fictional and the 7% prevalence value is realistic in Japan (Eng et al., Citation2007).

6.2. Results and discussion

The underestimation of pinc was replicated in both experiments (Figure 6). In Experiment 4a (7%–17), the normative pinc was 70.9 and most of the participants (15 out of 16) estimated below it. The median estimate was 3.9, which was significantly lower than the normative pinc (signed-rank test, T = 2, p < .001). In Experiment 4b (7%–51), the normative pinc was 97.5 and the median estimate (7) was significantly lower than it (T = 3, p < .001). In Experiment 4b, 96.0% of the participants underestimated pinc, while 52.0% reported estimates equal to or less than q. Only one participant (Experiment 4b) reported prior knowledge of the gene.
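For reference, the two normative values can be reproduced with the standard at-least-one formula. This assumes that each person in the classroom carries the genotype independently with probability q = 0.07, an assumption that this section does not state explicitly:

```latex
p_{\mathrm{inc}} = 1 - (1 - q)^{n}
\qquad
\begin{aligned}
n = 17 &: \quad 1 - 0.93^{17} \approx 0.709 \\
n = 51 &: \quad 1 - 0.93^{51} \approx 0.975
\end{aligned}
```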

Figure 6. Results of Experiments 4a and 4b. Each participant estimated pinc for a real group of people in the classroom. The only difference between the two experiments was n (i.e. the group size presented in the problems). EV, expected value. q, prevalence presented in the problems. N, sample size.

The use of the EV and constant heuristics was again evident, as in Experiments 1–3. Four participants (25.0%) in Experiment 4a responded with the EV or a near-EV value (“1.19”, “1.2”, and “1”). In Experiment 4b, the EV was 3.57, and 22.0% of the participants estimated a value between 3 and 4. The constant heuristic (a response of “7”) was used by four of the 16 participants (25.0%) in Experiment 4a and by five of the 50 participants (10.0%) in Experiment 4b.

The results were comparable to those observed in the online experiment with a student sample (Experiment 2). The pinc underestimation and use of the heuristics were replicated even in the more realistic situation.

7. Experiment 5

The purpose of Experiment 5 was twofold. First, it examined whether the pinc underestimation is related to less inclusive attitudes towards minorities. Second, it examined whether being given information on the normative pinc changes people’s attitudes towards minorities to be more inclusive. The participants first rated the extent to which they agreed with inclusive statements on colour deficiency (first attitude rating), followed by a middle task comprising three conditions. In the pinc estimation condition, the participants estimated pinc for colour deficiency as the middle task. In the EV estimation condition, the middle task was to estimate the number of people with colour deficiency in a group. The participants in the Normative pinc condition were given the normative pinc and rated how believable that pinc was. Finally, the participants in all the conditions again rated the extent to which they agreed with the inclusive statements (second attitude rating). If the pinc underestimation were related to less inclusive attitudes towards minorities, the estimated pinc of the pinc estimation condition middle task would be correlated with the first attitude rating. If the information on the normative pinc changes participants’ attitudes and simply estimating pinc or EV without knowing the normative pinc does not, there would be a difference between the first and second attitude ratings in the Normative pinc condition, but no difference in the other conditions.

Experiment 5 was preregistered (https://doi.org/10.17605/OSF.IO/MG629). The experimental procedure was approved in advance by the Niigata University Ethical Review Board for Human Research (2023-0220).

7.1. Method

7.1.1. Participants

Online workers were recruited in the same way as in Experiments 1 and 3. A sample size of 100 per condition was planned. As a result, 301 participants completed the experiment. They received a monetary reward of JPY 100. The data from 12 participants who did not pass the attention check (see Sections 7.1.2 and 7.2.1) were excluded from the analyses. The remaining 289 participants comprised the final sample (145 women, 143 men, and 1 unknown; age 20–77 years [M = 41.5, SD = 9.5]). Ninety-nine were assigned to the pinc estimation condition, 96 to the EV estimation condition, and 94 to the Normative pinc condition. See the Supplementary Materials for the other demographic data and determination of the sample size. All the participants provided consent to participate in advance.

7.1.2. Conditions and the questionnaires

As in Experiments 1 and 3, online questionnaires were designed, one for each of the three conditions. The only difference between the conditions was the middle task. See the Supplementary Materials for the complete list of the questions and texts. All the instructions and questions were in Japanese.

On the cover page, the participants reported their age, gender identity, and highest level of education, in the same manner as in Experiment 3. On the next page, they reported the extent to which they agreed with each of the three statements on colour deficiency (first attitude rating). A visual analogue scale (VAS) ranging from 0 (disagree) to 100 (agree) was presented under each statement.

Some people have difficulty distinguishing between reddish and greenish colours. Medically, this is called colour deficiency. It is said that 3% of the population has colour deficiency.

How much do you agree with the following statements? Please answer by moving the slider.

Displays and information boards in public facilities should have colour schemes that are easy for people with colour deficiency to see.

School teachers should always consider that there may be students with colour deficiency in their classes.

Company management should take responsibility for creating a workspace in which people with colour deficiency can work comfortably.

Then, the participants performed the middle task (page 3). As shown below, the content of this task varied by condition.

(Middle task, pinc estimation condition)

It is said that 3% of the population has colour deficiency. What do you think the percent probability is that there is at least one person having colour deficiency among 80 people? [  ] %

(Middle task, EV estimation condition)

It is said that 3% of the population has colour deficiency. How many of 80 people do you think have colour deficiency? [  ] people

(Middle task, Normative pinc condition)

It is said that 3% of the population has colour deficiency. According to mathematical calculations, the probability of having at least one person with colour deficiency among 80 people is 91%. How believable is this 91% probability to you?

1: Completely unbelievable

2: Unbelievable

3: A little unbelievable

4: Neither believable nor unbelievable

5: A little believable

6: Believable

7: Completely believable

The second attitude rating (page 4) followed the middle task. The participants again reported their attitudes towards the three statements on colour deficiency as in the first attitude rating. In addition, an attention check item was presented at the end of this page (“To ensure that you are reading the questions, please move the slider below to the leftmost position labelled disagree”).

On page 5, the actual percentage prevalence of colour deficiency known from surveys was described as a debriefing in the same way as in Experiment 1. This experiment took 5–10 min.

7.2. Results and discussion

7.2.1. Data exclusion

To exclude inattentive responses, data from 12 participants who responded 6–100 on the VAS in the attention check item were excluded from the analysis. The remaining 289 who responded 0–5 were considered to have passed the attention check.

7.2.2. Middle task results

In the pinc estimation condition, the participants underestimated pinc as in the previous experiments. The median estimate was 4, which was significantly lower than the normative pinc of 91.3 (q = 3%, n = 80) (signed-rank test, T = 230, p < .001). Eighty-five participants (85.9%) underestimated pinc and 49 (49.5%) reported estimates equal to or less than q. See Figure A1 for the histogram.

In the EV estimation condition, most of the participants reported estimates very close to the EV of 2.4. Among the 96 participants, 72.9% responded “2” and 18.8% responded “3” (the online questionnaire only accepted integer responses). The minimum and maximum estimates were 1 and 5, respectively.

In the Normative pinc condition, the mean rated believability of the normative pinc (91.3) was 4.4 (SD = 1.69). On average, the participants’ ratings fell between “neither believable nor unbelievable” (4) and “a little believable” (5). This result was unexpected given the strong underestimation of pinc in the previous experiments. Perhaps because the participants were not so confident in their pinc estimates, they were comfortable believing the normative pinc, which was inconsistent with their intuitive estimates.

7.2.3. Correlation of the initial attitudes with the pinc estimates

As an index for the initial attitude of each participant, the responses to the three items of the first attitude rating were averaged (first attitude score, see Figure 7). The higher the attitude score, the more inclusive the participant’s attitude. Likewise, the second attitude score was calculated by averaging the three items of the second attitude rating.

Figure 7. Results of Experiment 5. The participants provided their attitude ratings twice, namely, before (blue plot) and after (red plot) a middle task. The vertical axis represents the attitude score (the average of the three rating items). A high attitude rating score indicated strong agreement with the inclusive statements on colour deficiency. The middle task varied by condition: pinc estimation (a), EV estimation (b), and reading about the normative pinc value and rating how believable the value was (c). The open dots with thin lines represent the participants’ scores and the filled dots with thick lines represent the averages. The p-values were adjusted for the triplet of t-tests. EV, expected value.

Contrary to the hypothesis, the first attitude scores in the pinc estimation condition did not correlate with the pinc estimates (Spearman’s ρ = .025, p = .810). The pinc estimates were not related to the second attitude score either (ρ = .046, p = .653).

7.2.4. Attitude change

To examine whether the middle tasks caused attitude changes, the second attitude scores were compared with the first attitude scores for each of the conditions. Each comparison was made using a within-participant t-test, and the p-values were adjusted using Holm’s procedure so that the overall alpha was .05 for the triplet of t-tests.
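Holm’s step-down adjustment for a small family of tests can be sketched as follows. This is a minimal illustration rather than the author’s analysis code; the function name `holm_adjust` and the raw p-values in the usage line are hypothetical and are not taken from the study.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p-values must not decrease along the sorted sequence.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for a triplet of tests (illustration only).
print(holm_adjust([0.01, 0.04, 0.03]))
```

With three tests, the smallest raw p-value is multiplied by 3, the next by 2, and the largest by 1, which keeps the family-wise error rate at the nominal level while being less conservative than a Bonferroni correction.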

As shown in Figure 7, the second attitude scores were significantly higher than the first attitude scores in the Normative pinc condition (t(93) = 5.79, padj < .001, dz = .597). As hypothesised, being given information about the normative pinc changed the participants’ attitude in favour of the inclusive statements. In the pinc estimation condition, the second attitude scores were also significantly higher than the first attitude scores (t(98) = 2.60, padj = .021), although the effect size was relatively small (dz = .261). In the EV estimation condition, there was no such difference (t(95) = 0.71, padj = .482, dz = .072). In summary, either knowing the normative pinc or estimating the pinc significantly changed the attitude scores, whereas estimating the EV did not.
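The reported effect sizes can be cross-checked against the t statistics. The sketch below assumes the usual relation for a within-participant t-test, dz = t / √n, with n taken as the degrees of freedom plus one; the function name `cohens_dz` is hypothetical, and because the published t values are rounded, the results match the reported dz values only approximately.

```python
import math


def cohens_dz(t, df):
    """Cohen's dz for a paired t-test, assuming n = df + 1 participants."""
    n = df + 1
    return t / math.sqrt(n)


# (t, df) pairs as reported for the three conditions.
reported = {
    "Normative pinc": (5.79, 93),
    "pinc estimation": (2.60, 98),
    "EV estimation": (0.71, 95),
}
for condition, (t, df) in reported.items():
    print(f"{condition}: dz = {cohens_dz(t, df):.3f}")
```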

Did the magnitude of attitude changes differ across the conditions? As an ad-hoc analysis, I compared the attitude score change (the second attitude score minus the first attitude score) between the three conditions. The mean attitude score change was 2.08 (SD = 7.97), 0.55 (SD = 7.67), and 5.23 (SD = 8.76) in the pinc estimation, EV estimation, and Normative pinc conditions, respectively. A pairwise Welch test with Holm’s correction revealed that the attitude score change was significantly larger for the Normative pinc condition than for the EV estimation condition (t(183.7) = 3.91, padj < .001, Cohen’s d = .568) and the pinc estimation condition (t(187.0) = 2.60, padj = .020, d = .376), while there was no significant difference between the pinc estimation and EV estimation conditions (t(193.0) = 1.37, padj = .173, d = .196).

These results suggested that knowing the normative pinc changed the participants’ attitudes towards being more inclusive. The attitude changes caused by simply estimating pinc or the number of minority members in a group without knowing the normative pinc were small or negligible.

8. General discussion

The presented results suggest a cognitive bias specific to estimating pinc. There were significant differences in the participants’ use of heuristics when estimating pinc and the cumulative risk, even though they require the same calculation. In particular, while the participants’ frequent focus on the EV suggested that it served as an “anchor” for estimating pinc, q was primarily used as an “anchor” for estimating the cumulative risk. However, both these anchors are much smaller than the normative probabilities. Therefore, pinc and the cumulative risk are often underestimated.

Another intriguing finding was that the EV heuristic was not used when the participants estimated pinc for majorities ((f–i)). Specifically, pinc estimates approximately equal to the EV became infrequent as q increased. For the cases in which n = 30, the proportion of estimates approximately equal to the EV was 35.9% for q = 3% ((b)) and 30.2% for q = 7% ((d)), whereas it dropped to 16.3% as q rose to 20% ((a)). For q = 50% (Experiment 3 gender problem, (b)), only one participant (1.2%) reported the EV (15) as the pinc estimate. Hence, the EV heuristic was specific to the pinc estimation with low q values, namely, the pinc of minorities.

The presented results do not indicate that the participants were irrational or intolerant towards minorities. The pinc underestimation was not because of negative stereotypes, but rather a common characteristic of human cognition. The participants often used heuristics that seemed reasonable, at least partially, in a situation where the normative calculation could not be made without using electronic devices. It has been reported that people make normative judgements on the cumulative risk when the normative calculation is easy (Pelham et al., Citation1994).

By reframing the problem when estimating difficult-to-calculate probabilities, people can translate such problems into mental models that can be processed with limited cognitive capacity. However, unsuitable translations lead to erroneous estimations. In the cases of estimating pinc and the cumulative risk, people often fail to translate the problem properly. In studies on conditional probability judgements, methods have been suggested to help participants build appropriate mental models, such as using natural frequency formats (Cosmides & Tooby, Citation1996; Gigerenzer & Hoffrage, Citation1995), adopting visual presentations (Brase, Citation2009; Tubau et al., Citation2019), and partitioning cases into subsets (Girotto & Gonzalez, Citation2001; Sloman et al., Citation2003), although the effects can be small or even absent (Evans et al., Citation2000; Stengård et al., Citation2022). Partitioning cases into subsets also improves the accuracy of estimating the cumulative risk (McCloy et al., Citation2010), whereas using the other two methods does not (Bar-Hillel, Citation1973; McCloy et al., Citation2010). Similarly, Experiment 3 showed a very slight improvement in the pinc estimation by reframing the problem in terms of the natural frequency of classes. Further studies are needed to assess how to improve the pinc estimation.

The present study clearly showed the difference between estimating pinc and the cumulative risk, despite arithmetic isomorphism. This implies that the mental models used to understand disjunctive probability over a population may differ from those used to understand disjunctive probability over time. Empirically, the additive and multiplicative heuristics are the most frequently used to judge the cumulative risk (Doyle, Citation1997; Fuller et al., Citation2004) and the major source of the overestimation of such risks. However, the present study showed that the EV and constant heuristics were the most frequently used for estimating pinc, which yielded an underestimation.

How did these differences arise? Doyle (Citation1997) analysed how the participants used additive heuristics (“multiplicative” in his paper) for cumulative risk judgements and found that they are likely to notice the flaws of the heuristic when the result of the simple multiplication exceeds 100% (e.g. annual risk 5% × 25 years = 125%). Thus, one of the reasons for the infrequent use of the additive heuristic in the present study may be that q and n exceeded 100% when multiplied (e.g. 7% and 30). However, even for the 3%–30 problem ((b)), the additive heuristic (3% × 30 = 90%) was rarely used, while the use of the EV heuristic was the most frequent. Furthermore, even with the same setting of q and n, pinc was underestimated more than the cumulative risk, while the EV heuristic was frequently used to estimate pinc but not to estimate the cumulative risk (Experiment 3). Interestingly, Doyle (Citation1997) noted that participants using the additive heuristic to estimate the cumulative risk of flooding frequently referred to the EV (“Your home … will be hit 2 1/2 times in 50 years [by flooding]”, p. 520). Since the expected number of hits by flooding is proportional to years, Doyle (Citation1997) suggested that such a focus on the EV led participants to use the additive heuristic. For the pinc estimation, however, my participants often reported the EV as the probability; the additive heuristic was rarely used. Hence, while people are likely to reframe both pinc and the cumulative risk in terms of the EV, the resulting mental models seem to differ considerably. The causes of such a difference must be clarified in future research using a wider range of q and n.
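To make the gap between the heuristics and the normative answer concrete, the following sketch compares them for the two q–n settings mentioned above. It assumes that group members are sampled independently, so that the normative probability of inclusion is 1 − (1 − q)^n; the function names are hypothetical and the code is an illustration, not the study’s analysis script.

```python
def normative_pinc(q, n):
    """Probability that at least one of n independent members has the trait."""
    return 1.0 - (1.0 - q) ** n


def expected_value(q, n):
    """Expected number of members with the trait (the EV heuristic reports this)."""
    return q * n


def additive_heuristic(q, n):
    """Simple multiplication of prevalence by group size, read as a probability."""
    return q * n  # may exceed 1.0 (i.e. 100%), which exposes the flaw


for q, n in [(0.03, 30), (0.07, 30)]:
    print(
        f"q = {q:.0%}, n = {n}: "
        f"normative pinc = {normative_pinc(q, n):.1%}, "
        f"EV = {expected_value(q, n):.1f} persons, "
        f"additive = {additive_heuristic(q, n):.0%}"
    )
```

Under this independence assumption, the normative pinc is roughly 60% for the 3%–30 problem and roughly 89% for the 7%–30 problem. The EV and the additive heuristic are numerically the same product of q and n; they differ only in whether the result is read as a head count (0.9 persons) or as a probability (90%).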

9. Conclusions

People greatly underestimate the probability of inclusion. Some of the participants in this study made relatively reasonable estimates using the EV-to-probability translation heuristic, but many showed underestimations comparable to or greater than those in the case of the cumulative risk. This cognitive fallacy seemed to be due to confusion between prevalence, EV, and the probability of inclusion. As the probability of inclusion may be unfamiliar to many people, it is difficult to translate this concept into suitable mental models.

One might assume that if strong incentives exist (e.g. formal achievement tests), people provide more accurate estimates (Kruglanski & Freund, Citation1983). However, in everyday situations, there is no strong incentive to spend considerable effort making more accurate estimates. Hence, pinc estimations in everyday situations are likely to be comparable to those in the experiments reported herein, suggesting that people may be unaware of the relevance of minorities in their lives. Fortunately, as Experiment 5 suggested, knowing the normative pinc may help to reduce such unawareness. Further research is needed to determine how and to what extent pinc guides decision making in real-world settings.

Supplemental material

Supplemental material for this article is available online (MS Word, 56.7 KB).

Acknowledgment

The article publishing charge for this article was raised through crowdfunding. The author deeply thanks all 92 crowdfunders. The cost of the experiments was supported by Niigata University, Japan. This work did not receive any grant from funding agencies.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

Anonymized data, questionnaires, and analysis code are publicly available at the Open Science Framework repository, https://doi.org/10.17605/OSF.IO/GXWN7.

References

  • Alba, R., Rumbaut, R. G., & Marotz, K. (2005). A distorted nation: Perceptions of racial/ethnic group sizes and attitudes toward immigrants and other minorities. Social Forces, 84(2), 901–919. https://doi.org/10.1353/sof.2006.0002
  • Bar-Hillel, M. (1973). On the subjective probability of compound events. Organizational Behavior and Human Performance, 9(3), 396–406. https://doi.org/10.1016/0030-5073(73)90061-5
  • Bermudez, P., & Zatorre, R. J. (2009). A distribution of absolute pitch ability as revealed by computerized testing. Music Perception, 27(2), 89–101. https://doi.org/10.1525/mp.2009.27.2.89
  • Birch, J. (2012). Worldwide prevalence of red-green color deficiency. Journal of the Optical Society of America A, 29(3), 313–320. https://doi.org/10.1364/JOSAA.29.000313
  • Bosten, J. (2019). The known unknowns of anomalous trichromacy. Current Opinion in Behavioral Sciences, 30, 228–237. https://doi.org/10.1016/j.cobeha.2019.10.015
  • Brase, G. L. (2009). Pictorial representations in statistical reasoning. Applied Cognitive Psychology, 23(3), 369–381. https://doi.org/10.1002/acp.1460
  • Casscells, W., Schoenberger, A., & Graboys, T. B. (1978). Interpretation by physicians of clinical laboratory results. New England Journal of Medicine, 299(18), 999–1001. https://doi.org/10.1056/NEJM197811022991808
  • Cheng, P. W., & Holyoak, K. J. (1985). Pragmatic reasoning schemas. Cognitive Psychology, 17(4), 391–416. https://doi.org/10.1016/0010-0285(85)90014-3
  • Citrin, J., & Sides, J. (2008). Immigration and the imagined community in Europe and the United States. Political Studies, 56(1), 33–56. https://doi.org/10.1111/j.1467-9248.2007.00716.x
  • Cohen, J., Chesnick, E., & Haran, D. (1971). Evaluation of compound probabilities in sequential choice. Nature, 232(5310), 414–416. https://doi.org/10.1038/232414a0
  • Cohen, J., & Hansel, C. E. M. (1957). The nature of decisions in gambling. Acta Psychologica, 13, 357–370. https://doi.org/10.1016/0001-6918(57)90031-8
  • Cosmides, L., & Tooby, J. (1996). Are humans good intuitive statisticians after all? Rethinking some conclusions from the literature on judgment under uncertainty. Cognition, 58(1), 1–73. https://doi.org/10.1016/0010-0277(95)00664-8
  • De La Maza, C., Davis, A., Gonzalez, C., & Azevedo, I. (2019). Understanding cumulative risk perception from judgments and choices: An application to flood risks. Risk Analysis, 39(2), 488–504. https://doi.org/10.1111/risa.13206
  • Doyle, J. K. (1997). Judging cumulative risk. Journal of Applied Social Psychology, 27(6), 500–524. https://doi.org/10.1111/j.1559-1816.1997.tb00644.x
  • Eng, M. Y., Luczak, S. E., & Wall, T. L. (2007). ALDH2, ADH1B, and ADH1C genotypes in Asians: A literature review. Alcohol Research & Health, 30(1), 22–27.
  • Epstein, R., McKinney, P., Fox, S., & Garcia, C. (2012). Support for a fluid-continuum model of sexual orientation: A large-scale internet study. Journal of Homosexuality, 59(10), 1356–1381. https://doi.org/10.1080/00918369.2012.724634
  • European Union Agency for Fundamental Rights. (2020). A long way to go for LGBTI equality. https://doi.org/10.2811/7746
  • Evans, J. S. B. T., Handley, S. J., Perham, N., Over, D. E., & Thompson, V. A. (2000). Frequency versus probability formats in statistical word problems. Cognition, 77(3), 197–213. https://doi.org/10.1016/S0010-0277(00)00098-6
  • Fuller, R., Dudley, N., & Blacktop, J. (2004). Older people’s understanding of cumulative risks when provided with annual stroke risk information. Postgraduate Medical Journal, 80(949), 677–678. https://doi.org/10.1136/pgmj.2004.019489
  • Gallagher, C. (2003). Miscounting race: Explaining whites’ misperceptions of racial group size. Sociological Perspectives, 46(3), 381–396. https://doi.org/10.1525/sop.2003.46.3.381
  • Gigerenzer, G., & Hoffrage, U. (1995). How to improve Bayesian reasoning without instruction: Frequency formats. Psychological Review, 102(4), 684–704. https://doi.org/10.1037/0033-295X.102.4.684
  • Girotto, V., & Gonzalez, M. (2001). Solving probabilistic and statistical problems: A matter of information structure and question form. Cognition, 78(3), 247–276. https://doi.org/10.1016/S0010-0277(00)00133-5
  • Gummer, T., Roßmann, J., & Silber, H. (2021). Using instructed response items as attentional checks in web surveys: Properties and implementation. Sociological Methods & Research, 50(1), 238–264. https://doi.org/10.1177/0049124118769083
  • Henninger, F., Shevchenko, Y., Mertens, U. K., Kieslich, P. J., & Hilbig, B. E. (2022). lab.js: A free, open, online study builder. Behavior Research Methods, 54(2), 556–573. https://doi.org/10.3758/s13428-019-01283-5
  • Hoffrage, U., Lindsey, S., Hertwig, R., & Gigerenzer, G. (2000). Communicating statistical information. Science, 290(5500), 2261–2262. https://doi.org/10.1126/science.290.5500.2261
  • Ikuta, N., Koike, Y., Aoyagi, N., Matsuzaka, A., Fuse-Nagase, Y., Kogawa, K., & Takizawa, T. (2017). Prevalence of lesbian, gay, bisexual, and transgender among Japanese university students: A single institution survey. International Journal of Adolescent Medicine and Health, 29, Article 20150113. https://doi.org/10.1515/ijamh-2015-0113
  • Ipsos. (2015, December 2). Perils of perception 2015. https://www.ipsos.com/en-uk/perils-perception-2015
  • Juslin, P., Lindskog, M., & Mayerhofer, B. (2015). Is there something special with probabilities? – Insight vs. computational ability in multiple risk combination. Cognition, 136, 282–303. https://doi.org/10.1016/j.cognition.2014.11.041
  • Kahneman, D., & Tversky, A. (1973). On the psychology of prediction. Psychological Review, 80(4), 237–251. https://doi.org/10.1037/h0034747
  • Kardosh, R., Sklar, A. Y., Goldstein, A., Pertzov, Y., & Hassin, R. R. (2022). Minority salience and the overestimation of individuals from minority groups in perception and memory. Proceedings of the National Academy of Sciences, 119(12), Article e2116884119. https://doi.org/10.1073/pnas.2116884119
  • Keller, C., Siegrist, M., & Gutscher, H. (2006). The role of the affect and availability heuristics in risk communication. Risk Analysis, 26(3), 631–639. https://doi.org/10.1111/j.1539-6924.2006.00773.x
  • Khaw, M. W., Kranton, R., & Huettel, S. (2021). Oversampling of minority categories drives misperceptions of group compositions. Cognition, 214, 104756. https://doi.org/10.1016/j.cognition.2021.104756
  • Knäuper, B., Kornik, R., Atkinson, K., Guberman, C., & Aydin, C. (2005). Motivation influences the underestimation of cumulative risk. Personality and Social Psychology Bulletin, 31(11), 1511–1523. https://doi.org/10.1177/0146167205276864
  • Kruglanski, A. W., & Freund, T. (1983). The freezing and unfreezing of lay-inferences: Effects on impressional primacy, ethnic stereotyping, and numerical anchoring. Journal of Experimental Social Psychology, 19(5), 448–468. https://doi.org/10.1016/0022-1031(83)90022-7
  • Landy, D., Guay, B., & Marghetis, T. (2018). Bias and ignorance in demographic perception. Psychonomic Bulletin & Review, 25(5), 1606–1618. https://doi.org/10.3758/s13423-017-1360-2
  • Lee, E., Karimi, F., Wagner, C., Jo, H.-H., Strohmaier, M., & Galesic, M. (2019). Homophily and minority-group size explain perception biases in social networks. Nature Human Behaviour, 3(10), 1078–1087. https://doi.org/10.1038/s41562-019-0677-4
  • Martinez, M. D., Wald, K. D., & Craig, S. C. (2008). Homophobic innumeracy? Estimating the size of the gay and lesbian population. Public Opinion Quarterly, 72(4), 753–767. https://doi.org/10.1093/poq/nfn049
  • McCloy, R., Byrne, R. M. J., & Johnson-Laird, P. N. (2010). Understanding cumulative risk. Quarterly Journal of Experimental Psychology, 63(3), 499–515. https://doi.org/10.1080/17470210903024784
  • Miyazaki, K., Makomaska, S., & Rakowski, A. (2012). Prevalence of absolute pitch: A comparison between Japanese and Polish music students. The Journal of the Acoustical Society of America, 132(5), 3484–3493. https://doi.org/10.1121/1.4756956
  • National Police Agency of Japan. (2021). Reiwa 3 nen ban keisatsu hakusho [The white paper on police 2021].
  • Newport, F. (2015, May 21). Americans greatly overestimate percent gay, lesbian in U.S. Gallup. https://www.gallup.com/poll/183383/americans-greatly-overestimate-percent-gay-lesbian.aspx
  • Pelham, B. W., Sumarta, T. T., & Myaskovsky, L. (1994). The easy path from many to much: The numerosity heuristic. Cognitive Psychology, 26(2), 103–133. https://doi.org/10.1006/cogp.1994.1004
  • Saitama Prefecture. (2021). Tayosei o soncho-suru kyoseishakai-zukuri ni kansuru chosa hokokusho [Report of the survey for a coexisting society that respects diversity]. https://www.pref.saitama.lg.jp/documents/183194/lgbtqchousahoukokusho.pdf.
  • Shaklee, H., & Fischhoff, B. (1990). The psychology of contraceptive surprises: Cumulative risk and contraceptive effectiveness. Journal of Applied Social Psychology, 20(5), 385–403. https://doi.org/10.1111/j.1559-1816.1990.tb00418.x
  • Simner, J. (2012). Defining synaesthesia. British Journal of Psychology, 103(1), 1–15. https://doi.org/10.1348/000712610X528305
  • Simner, J., Mulvenna, C., Sagiv, N., Tsakanikos, E., Witherby, S. A., Fraser, C., Scott, K., & Ward, J. (2006). Synaesthesia: The prevalence of atypical cross-modal experiences. Perception, 35(8), 1024–1033. https://doi.org/10.1068/p5469
  • Sloman, S. A., Over, D., Slovak, L., & Stibel, J. M. (2003). Frequency illusions and other fallacies. Organizational Behavior and Human Decision Processes, 91(2), 296–309. https://doi.org/10.1016/S0749-5978(03)00021-9
  • Slovic, P. (1969). Manipulating the attractiveness of a gamble without changing its expected value. Journal of Experimental Psychology, 79(1, Pt.1), 139–145. https://doi.org/10.1037/h0026970
  • Slovic, P., & Peters, E. (2006). Risk perception and affect. Current Directions in Psychological Science, 15(6), 322–325. https://doi.org/10.1111/j.1467-8721.2006.00461.x
  • Statistics Bureau of Japan. (2021). 2020 population census. https://www.stat.go.jp/english/data/kokusei/2020/summary.html
  • Stengård, E., Juslin, P., Hahn, U., & van den Berg, R. (2022). On the generality and cognitive basis of base-rate neglect. Cognition, 226, 105160. https://doi.org/10.1016/j.cognition.2022.105160
  • Tubau, E., Rodríguez-Ferreiro, J., Barberia, I., & Colomé, À. (2019). From reading numbers to seeing ratios: A benefit of icons for risk comprehension. Psychological Research, 83(8), 1808–1816. https://doi.org/10.1007/s00426-018-1041-4
  • Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185(4157), 1124–1131. https://doi.org/10.1126/science.185.4157.1124
  • Visschers, V. H. M., Meertens, R. M., Passchier, W. W. F., & De Vries, N. N. K. (2009). Probability information in risk communication: A review of the research literature. Risk Analysis, 29(2), 267–287. https://doi.org/10.1111/j.1539-6924.2008.01137.x
  • Weinstein, N. D. (1989). Optimistic biases about personal risks. Science, 246(4935), 1232–1233. https://doi.org/10.1126/science.2686031
  • Wong, C. J. (2007). “Little” and “big” pictures in our heads: Race, local context, and innumeracy about racial groups in the United States. Public Opinion Quarterly, 71(3), 392–412. https://doi.org/10.1093/poq/nfm023

Appendix

Table A1. Results of Experiment 3 critical questions (Q2). Prevalence (q) and group size (n) were presented in the problems. Underestimation was defined as estimates less than the normative answer for each problem.

Figure A1. Results of the middle task of the pinc estimation condition (N = 99), Experiment 5.
