Comment

F*, an interpretable transformation of the F measure, equates to the critical success index

Article: 2326685 | Received 12 Feb 2024, Accepted 27 Feb 2024, Published online: 07 May 2024

ABSTRACT

When analysing large datasets, suitable outcome measures are needed. For example, in clinical medicine, we often want to know how well a test identifies people with or without a particular diagnosis. Traditionally, this evaluation has been calculated in terms of Sensitivity and Specificity, and Positive and Negative Predictive Values (PPV, NPV), but many other test outcome measures are available. Recently, a measure termed F* was described as an interpretable transformation of the F measure. The F measure is a unitary metric calculated as the harmonic mean of Sensitivity and PPV. We show that F* is in fact identical to a previously described measure which is monotonically related to the F measure, and which has variously been named in previous publications the ratio of verification, the Jaccard similarity measure or index, the threat score, the Tanimoto index, and the critical success index (CSI). The origins of these different terms in different scientific disciplines (weather forecasting, ecology, machine learning), dating from the late 19th to the 20th century, may explain the repeated independent redescription of this measure. More recently, the advantages of applying CSI, a metric routinely used to verify weather forecast accuracy, have been demonstrated in medical diagnostic accuracy studies.

PLAIN LANGUAGE SUMMARY

When medical studies have collected large amounts of health data on people classified in the data as having a particular disease, it may often be difficult to “see the wood for the trees” in terms of being sure that the disease classification is actually correct for each person in the dataset. Therefore, it is important that researchers first verify the disease classification accuracy of the data against an external “gold standard” source of information holding the correct diagnosis, usually a patient’s medical records. A comparison of the health data disease classification against the medical records diagnosis generates various measures of disease classification accuracy, all of which have advantages and disadvantages as verification measures. Here, we examine a verification measure which has recently been characterised as “F*”. It turns out, in fact, as shown by elementary mathematical techniques, that F* is not novel, having been previously described under different names for more than a century in various scientific disciplines (weather forecasting, ecology, machine learning). Our preference is to call this measure the “critical success index” (CSI). It is a single measure which, in clinical medicine, combines information about the classification performance with respect to both test and patient. It is easy to calculate, simple to use, and easy to interpret (higher values are better).

1. Introduction

We live in the age of “Big Data”. However, the question of how best to “mine” these data to extract what “really matters”, in terms of correct identification of items being searched for, or guiding future actions or interventions based on the aggregate data, remains to be resolved. Is there a single measure, or a minimum set of measures, which can meaningfully evaluate large amounts of data? What methods should be used?

For example, in medical practice, diagnostic and screening accuracy studies of clinical investigations or, increasingly, of diagnostic algorithms are often reported using some of the many measures which may be derived from a standard 2×2 contingency table. This cross-classifies actual and predicted classes of the diagnosis [Citation1]. Typically, these outcomes include paired measures of diagnostic or screening classification by the test, Sensitivity (Sens) and Specificity (Spec), and paired measures of classification of the patient, Positive and Negative Predictive Values (PPV, NPV). All of these are recognised to have potential shortcomings as outcome measures. A single, unitary or global measure which combines information relevant to both test and patient classification may thus be desirable.

One such measure is the F measure, defined as the harmonic mean of PPV (also known as precision) and Sens (also known as recall) [Citation2]. This measure corresponds to the coefficient previously described by Dice [Citation3], and independently by Sørensen [Citation4], hence sometimes known as the Dice coefficient or the Sørensen-Dice coefficient, and to the approach advocated by van Rijsbergen [Citation5].

In terms of the base data from a 2×2 contingency table containing N elements with four degrees of freedom, the following four classifications may be made: TP = true positive (or a “hit” or “true hit”), FP = false positive (or “false hit” or “false alarm”), FN = false negative (or a “miss”), and TN = true negative (or a “non-event” or “correct rejection”). From these:

\[ F = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}} \]

This may also be expressed in terms of PPV and Sens, as:

\[ F = \frac{2\,\mathrm{PPV}\cdot\mathrm{Sens}}{\mathrm{PPV} + \mathrm{Sens}} = \frac{2}{\frac{1}{\mathrm{Sens}} + \frac{1}{\mathrm{PPV}}} \]

that is, F is the harmonic mean of PPV and Sens.
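
To make the arithmetic concrete, here is a minimal sketch in Python (with hypothetical example counts, not data from any cited study) computing F both from the four table cells and as the harmonic mean of PPV and Sens:

```python
# Minimal sketch (hypothetical counts, not data from any cited study):
# the F measure from the four cells of a 2x2 contingency table.

def f_measure(tp: int, fp: int, fn: int) -> float:
    """F = 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

def f_from_ppv_sens(ppv: float, sens: float) -> float:
    """Equivalent form: the harmonic mean of PPV (precision) and Sens (recall)."""
    return 2 * ppv * sens / (ppv + sens)

tp, fp, fn, tn = 80, 20, 10, 890           # hypothetical example counts
ppv = tp / (tp + fp)                       # 0.800
sens = tp / (tp + fn)                      # 0.889
print(f_measure(tp, fp, fn))               # 0.8421...
print(f_from_ppv_sens(ppv, sens))          # same value
```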

More recently, Hand et al. [Citation6] have described “F*” as “an interpretable transformation of the F measure” [Citation6], where:

\[ F^{*} = \frac{F}{2 - F} \]

They acknowledge that “researchers may recognise this as the Jaccard coefficient widely used in areas where TN may not be relevant”. As will be shown, these authors have indeed redescribed an already existing binary classification measure, first reported in the late nineteenth century as the ratio of verification in the context of forecasting tornadoes [Citation7], and subsequently as the Jaccard index or similarity coefficient (J) [Citation8], the threat score [Citation9], the Tanimoto index [Citation10], and later still as the critical success index (CSI) [Citation11,Citation12]. Here we use the latter terminology.
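
As a quick illustration (again with hypothetical counts, and as a sketch rather than any author’s implementation), F* obtained by transforming F and CSI computed directly from the table cells give the same number:

```python
# Sketch only (not the authors' code): F* via the Hand et al. transformation
# and CSI directly from the table cells give the same number.

def f_star(f: float) -> float:
    """F* = F / (2 - F)."""
    return f / (2 - f)

def csi(tp: int, fp: int, fn: int) -> float:
    """Critical success index (= Jaccard index = threat score)."""
    return tp / (tp + fp + fn)

tp, fp, fn = 80, 20, 10                    # hypothetical counts
f = 2 * tp / (2 * tp + fp + fn)            # F measure
print(f_star(f))                           # 0.7272...
print(csi(tp, fp, fn))                     # 0.7272...
```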

2. Mathematical proofs of identity of F* and CSI

The identity of F* and CSI may be shown in several ways using elementary mathematical methods.

2.1. From the base data of a 2 × 2 contingency table

Hand et al. [Citation6] showed that:

\[ F^{*} = \frac{\mathrm{TP}}{N - \mathrm{TN}} \]

This also holds for CSI, since in terms of the base data:

\[ \mathrm{CSI} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}} = \frac{\mathrm{TP}}{N - \mathrm{TN}} \]

Hence F* = CSI, QED.
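
A numerical spot-check of this identity, reusing the hypothetical counts from the earlier sketches, might look like:

```python
# Numerical spot-check (hypothetical counts): F* = TP/(N - TN) = CSI.

tp, fp, fn, tn = 80, 20, 10, 890
n = tp + fp + fn + tn

f = 2 * tp / (2 * tp + fp + fn)
f_star = f / (2 - f)

assert abs(f_star - tp / (n - tn)) < 1e-12
assert abs(f_star - tp / (tp + fp + fn)) < 1e-12
print(f_star)                              # 0.7272...
```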

2.2. From the monotonic relationship of F to CSI

The monotonic relationship between F and CSI, as shown for example by Jolliffe [Citation13] (modified), is given by:

\[ F = \frac{2\,\mathrm{CSI}}{1 + \mathrm{CSI}} \]

The equivalence of F* and CSI may thus be shown. Since Hand et al. [Citation6] showed that:

\[ F^{*} = \frac{F}{2 - F} \]

then rearranging:

\[ F = F^{*}(2 - F) = 2F^{*} - F^{*}F \]

Dividing through by F and rearranging:

\[ \frac{2F^{*}}{F} - F^{*} = 1 \quad\Longrightarrow\quad F^{*} + 1 = \frac{2F^{*}}{F} \]

Hence:

\[ F = \frac{2F^{*}}{F^{*} + 1} \]

This has exactly the same form as the Jolliffe relationship between F and CSI; since the mapping x ↦ 2x/(1 + x) is one-to-one, F* = CSI, QED.
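
The relationship can also be checked programmatically: composing the map CSI → F (the Jolliffe relationship) with the transformation F → F* returns the starting value. The following sketch uses arbitrary test values:

```python
# Sketch: composing CSI -> F (Jolliffe's relationship) with F -> F*
# (the Hand et al. transformation) returns the starting value, so F* = CSI.

def csi_to_f(c: float) -> float:
    return 2 * c / (1 + c)

def f_to_f_star(f: float) -> float:
    return f / (2 - f)

for c in [0.1, 0.25, 0.5, 0.7272, 0.9]:    # arbitrary test values
    assert abs(f_to_f_star(csi_to_f(c)) - c) < 1e-12
```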

2.3. From the combination of PPV (or precision) and Sens (or recall)

Like F, CSI may be characterised in terms of PPV and Sens:

\[ \mathrm{CSI} = \frac{1}{\frac{1}{\mathrm{PPV}} + \frac{1}{\mathrm{Sens}} - 1} \]

Again, the equivalence of F* and CSI may be shown. Hand et al. [Citation6] found that:

\[ F^{*} = \frac{\mathrm{PPV}\cdot\mathrm{Sens}}{\mathrm{PPV} + \mathrm{Sens} - \mathrm{PPV}\cdot\mathrm{Sens}} \]

Dividing the numerator and denominator by (PPV · Sens) gives:

\[ F^{*} = \frac{1}{\frac{1}{\mathrm{Sens}} + \frac{1}{\mathrm{PPV}} - 1} \]

Hence F* = CSI, QED.
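
A brief sketch checking this form against the F* expression, using hypothetical PPV and Sens values, is:

```python
# Sketch (hypothetical PPV and Sens values): CSI from PPV and Sens alone
# coincides with F* = F/(2 - F).

def csi_from_ppv_sens(ppv: float, sens: float) -> float:
    return 1 / (1 / ppv + 1 / sens - 1)

ppv, sens = 0.80, 80 / 90
f = 2 * ppv * sens / (ppv + sens)
assert abs(csi_from_ppv_sens(ppv, sens) - f / (2 - f)) < 1e-12
```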

2.4. From the combination of Sens, PPV, P, and Q

In the 2×2 contingency table, prevalence or base rate P = (TP + FN)/N, and bias or threshold Q = (TP + FP)/N. Thus, from Powers [Citation2]:

\[ F = \frac{2\,\mathrm{Sens}\cdot P}{Q + P} = \frac{2\,\mathrm{PPV}\cdot Q}{Q + P} \]

For CSI the equations are [Citation1]:

\[ \mathrm{CSI} = \frac{1}{\frac{Q + P}{\mathrm{Sens}\cdot P} - 1} = \frac{1}{\frac{Q + P}{\mathrm{PPV}\cdot Q} - 1} \]

Since from Hand et al. [Citation6]:

\[ F^{*} = \frac{F}{2 - F} \]

then substituting and rearranging:

\[ F^{*} = \frac{2\,\mathrm{Sens}\cdot P/(Q + P)}{2 - 2\,\mathrm{Sens}\cdot P/(Q + P)} = \frac{1}{\frac{Q + P}{\mathrm{Sens}\cdot P} - 1} \]

\[ F^{*} = \frac{2\,\mathrm{PPV}\cdot Q/(Q + P)}{2 - 2\,\mathrm{PPV}\cdot Q/(Q + P)} = \frac{1}{\frac{Q + P}{\mathrm{PPV}\cdot Q} - 1} \]

Hence F* = CSI, QED.
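
The same identity can be spot-checked through P and Q; the following sketch reuses the hypothetical counts from the earlier examples:

```python
# Sketch reusing the earlier hypothetical counts: the same identity expressed
# through prevalence P = (TP+FN)/N and bias Q = (TP+FP)/N.

tp, fp, fn, tn = 80, 20, 10, 890
n = tp + fp + fn + tn
p = (tp + fn) / n                          # prevalence (base rate)
q = (tp + fp) / n                          # bias (threshold)
sens = tp / (tp + fn)
ppv = tp / (tp + fp)

f = 2 * sens * p / (q + p)                 # equals 2*PPV*Q/(Q + P)
csi = 1 / ((q + p) / (sens * p) - 1)       # equals 1/((Q+P)/(PPV*Q) - 1)
assert abs(f / (2 - f) - csi) < 1e-12
```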

3. Conclusion

As mentioned above, Hand et al. noted in their formulation of F* that “researchers may recognise this as the Jaccard coefficient widely used in areas where TN may not be relevant” [Citation6] and they cite Jaccard’s 1908 paper [Citation14], although others [Citation2] cite his 1901 paper [Citation15] as the forerunner of the 1912 English translation [Citation8].

We suggest that this is a parameter which, like F [Citation3–5], has undergone periodic redescription. The first report of which we are aware is Gilbert’s “ratio of verification” of 1884 [Citation7], predating the Jaccard similarity coefficient [Citation8]. The latter measure is equivalent in set-theoretic terms to the size of the intersection over the size of the union, a ratio which was also proposed by Tanimoto in 1958 when working for IBM [Citation10], without reference to either Gilbert or Jaccard. The same measure was also described by Palmer & Allen in 1949 as the threat score [Citation9], as the CSI by Donaldson et al. [Citation11] in 1975 and by Schaefer [Citation12] in 1990, and now as F* by Hand et al. [Citation6]. These multiple redescriptions may reflect the use of this measure by researchers in different disciplines (weather forecasting, ecology, machine learning), each unaware of prior authors and unknown to later ones. This case illustrates the potential for convergent evolution of verification metrics over time, and greater cohesion between disciplines may be required to prevent such duplication in the literature. We see little merit in using the F* nomenclature rather than CSI, as the latter makes clear its difference from, and hence avoids any confusion with, F.

The CSI has recently been exported to the domain of clinical medicine. Examples of its use include the evaluation of the accuracy of instruments used in day-to-day clinical practice for screening cognitive function in patients with possible dementia or mild cognitive impairment [Citation16], as well as diagnostic accuracy studies of administrative epilepsy data [Citation17]. In these studies the identity of F* and CSI has been confirmed using the respective datasets. We have also suggested possible applications of CSI in assessing both NICE criteria for 2-week-wait suspected brain and CNS cancer referrals [Citation18] and polygenic hazard scores [Citation19]. These are all situations in which large numbers of TN may complicate the interpretation of more traditional measures such as PPV and Sens. The large number of TN instances makes it difficult to discriminate diagnostic algorithm accuracy on the basis of negative predictive value (NPV) and specificity, as these become inflated (usually approaching 100%). It is not uncommon for diagnostic accuracy studies to rank algorithms from highest to lowest PPV and Sens, with priority given to those with higher values in both estimates [Citation17,Citation20]. This approach is challenged by the lack of a single threshold and by the trade-off relationship between PPV and Sens. A convenient way to combine PPV and Sens into a single metric, making it easier and more objective to rank diagnostic algorithms by their accuracy, is to use CSI [Citation17,Citation20].

Many outcome measures may be quantitatively, semi-quantitatively, or qualitatively classified, for example as poor, fair, good, or excellent. Can CSI values be classified in one or other of these ways? To date, diagnostic accuracy thresholds based on combined PPV and Sens have yet to be determined. However, as for Sens, Spec, PPV, and NPV, all of which are conditional probabilities ranging from 0 to 1, a higher value is evidently better. Empirical recommendations may thus be made, such as the proposal from the prior epilepsy literature that CSI ≥ 0.80 is acceptable as diagnostically accurate, based on the observation that both the underlying PPV and Sens were ≥ 80% at this CSI threshold [Citation17]. Future work should explore and cross-validate this threshold in other datasets within and outside epilepsy. Clearly, absolute values may be used when comparing different testing strategies or algorithms [Citation17,Citation20]. CSI may also be treated as a proportion, and a non-exact binomial method used to calculate its confidence intervals, thereby aiding valid comparisons [Citation1]. As CSI values are derived from both Sens and PPV, they are dependent on both test threshold (Q) and class prevalence (P) [Citation21].
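
As an illustration of the confidence-interval point, a minimal sketch (an assumed implementation, not code from the cited studies) treats CSI as TP “successes” out of TP + FP + FN “trials” and applies the Wilson score interval, one non-exact binomial method:

```python
# Minimal sketch (an assumed implementation, not code from the cited studies):
# treat CSI as TP "successes" out of (TP + FP + FN) "trials" and use the
# Wilson score interval, one non-exact binomial method, for an approximate CI.
from math import sqrt

def csi_wilson_ci(tp: int, fp: int, fn: int, z: float = 1.96):
    n = tp + fp + fn
    p = tp / n                              # the CSI point estimate
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

print(csi_wilson_ci(80, 20, 10))            # approx (0.64, 0.80) for these hypothetical counts
```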

Authors’ contributions

Conceptualization, A.J.L.; Methodology, A.J.L.; Formal Analysis, A.J.L.; Writing – Original Draft, A.J.L.; Writing – Review & Editing, G.K.M., A.J.L.; Supervision, A.J.L.

Availability of data and materials

No datasets were used and/or analysed during the current study.

Declarations

Ethics approval and consent to participate: not applicable as this is a commentary article on a previously published study.

Consent for publication: not applicable.

Acknowledgements

We are grateful to Professor Iain Buchan and Dr Glen Martin for strategic advice on this work.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

GKM is supported by a National Institute for Health and Care Research (NIHR) Clinical Lectureship (CL-2022-07-002).

References

  • Larner AJ. The 2×2 matrix. Contingency, confusion, and the metrics of binary classification. 2nd ed. London: Springer; 2024.
  • Powers DMW. What the F measure doesn’t measure … Features, flaws, fallacies and fixes. arXiv. 2015. doi:10.48550/arXiv.1503.06410
  • Dice LR. Measures of the amount of ecological association between species. Ecology. 1945;26(3):297–302. doi:10.2307/1932409
  • Sørensen T. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Kongelige Danske Videnskabernes Selskab. 1948;5(4):1–34.
  • van Rijsbergen CJ. Foundation of evaluation. J Doc. 1974;30:365–373. doi:10.1108/eb026584
  • Hand DJ, Christen P, Kirielle N. F*: an interpretable transformation of the F measure. Mach Learn. 2021;110(3):451–456. doi:10.1007/s10994-021-05964-1
  • Gilbert GK. Finley’s tornado predictions. Am Meteorol J. 1884;1:166–172.
  • Jaccard P. The distribution of the flora in the alpine zone. New Phytol. 1912;11:37–50. doi:10.1111/j.1469-8137.1912.tb05611.x
  • Palmer WC, Allen RA. Note on the accuracy of forecasts concerning the rain problem. Washington (DC): U.S. Weather Bureau manuscript; 1949.
  • Tanimoto TT. An elementary mathematical theory of classification and prediction. Internal IBM Technical Report; 17 Nov 1958; Available from: http://dalkescientific.com/tanimoto.pdf
  • Donaldson RJ, Dyer RM, Kraus MJ. An objective evaluator of techniques for predicting severe weather events. In: Preprints, 9th Conference on Severe Local Storms; Norman, Oklahoma; 1975. p. 312–326.
  • Schaefer JT. The critical success index as an indicator of warning skill. Weather Forecast. 1990;5:570–575. doi:10.1175/1520-0434(1990)005<0570:TCSIAA>2.0.CO;2
  • Jolliffe IT. The Dice co-efficient: a neglected verification performance measure for deterministic forecasts of binary events. Meteorol Appl. 2016;23(1):89–90. doi:10.1002/met.1532
  • Jaccard P. Nouvelles recherches sur la distribution florale. Bull Soc Vaudoise Sci Nat. 1908;44:223–270.
  • Jaccard P. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bull Soc Vaudoise Sci Nat. 1901;37:547–579.
  • Larner AJ. Assessing cognitive screening instruments with the critical success index. Prog Neurol Psychiatry. 2021;25(3):33–37. doi:10.1002/pnp.719
  • Mbizvo GK, Bennett KH, Simpson CR, et al. Using critical success index or Gilbert skill Score as composite measures of positive predictive value and sensitivity in diagnostic accuracy studies: weather forecasting informing epilepsy research. Epilepsia. 2023;64:1466–1468. doi:10.1111/epi.17537
  • Mbizvo GK, Larner AJ. Isolated headache is not a reliable indicator for brain cancer. Clin Med. 2022;22(1):92–93. doi:10.7861/clinmed.Let.22.1.2
  • Mbizvo GK, Larner AJ. Re: realistic expectations are key to realising the benefits of polygenic scores. BMJ. [cited 2022 Mar 11]. Available from: https://www.bmj.com/content/380/bmj-2022-073149/rapid-responses
  • Mbizvo GK, Simpson CR, Duncan SE, et al. Critical success index or F measure to validate the accuracy of administrative healthcare data identifying epilepsy in deceased adults in Scotland. Epilepsy Res. 2024;199:107275. doi:10.1016/j.eplepsyres.2023.107275
  • Mbizvo GK, Larner AJ. On the dependence of the critical success index (CSI) on prevalence. medRxiv. doi:10.1101/2023.12.03.23299335