ABSTRACT
R2 and adjusted R2 may exaggerate a model’s true ability to predict the dependent variable in the presence of overfitting, whereas leave-one-out R2 (LOOR2) is robust to overfitting. We demonstrate this by replicating 279 regressions from 100 papers in top economics journals, where the median increases of R2 and adjusted R2 over LOOR2 reach 40.2% and 21.4% respectively. The inflation of test errors over training errors increases with the severity of overfitting as measured by the number of regressors and nonlinear terms, and the presence of outliers, but decreases with the sample size. These results are further validated by Monte Carlo simulations.
1. Introduction
In empirical studies, R2 and adjusted R2 are routinely reported as measures of goodness-of-fit for linear regressions. For example, an R2 of 0.8 is usually taken to imply that all explanatory variables jointly explain 80% of the variations in the dependent variable. But how reliable is this interpretation?
It is well known that R2 and adjusted R2 only measure in-sample fit, which may not be a good indicator of the model's true ability to explain or predict out of sample. In particular, it is common sense in the machine learning literature that training errors (as represented by R2 and adjusted R2) could be poor measures of the true test errors when the model is used to predict data that it has not yet seen. Nevertheless, as of today, most economists still happily use R2 and adjusted R2 to measure goodness-of-fit, without worrying about their potential pitfalls.Footnote1
This paper takes this issue seriously. The essential problem is that R2 and adjusted R2 may exaggerate a model's true ability to explain or predict the dependent variable, especially in the presence of overfitting. Overfitting occurs when a model is excessively fit to noisy sample data (e.g., a low degree of freedom resulting from a small sample size or too many covariates, a complicated functional form with many nonlinear terms, or the presence of outliers), which compromises the model's ability to uncover the true relationship between the dependent and explanatory variables, as well as its performance in out-of-sample prediction.
To solve this problem, we recommend using leave-one-out cross-validated R2 (LOOR2 in short) as a better measure of goodness-of-fit for linear regressions. While LOOR2 has been around for some time, this paper takes it seriously and suggests that economists should routinely report LOOR2 in their empirical work alongside R2 and adjusted R2 (if not at the expense of the latter two). There are a number of advantages associated with LOOR2. First, LOOR2 is robust to overfitting, as it measures the true test errors and the model's real ability to explain or predict the dependent variable. Second, while five-fold or ten-fold cross-validation is popular in machine learning for measuring test errors, the results are uncertain due to the random splitting of the sample into five or ten folds (parts) of roughly equal sizes. On the other hand, the results from leave-one-out cross-validation are certain, since one observation is left out at a time, and no random sampling is involved. Last but not least, for linear regressions, there is a short-cut formula for computing LOOR2 such that only one regression is needed, so the computational cost is minimal.
To support the above claims, we replicate 279 regressions from 100 empirical papers in four top economics journals during 2004–2021. In this sample, the median increases of R2 and adjusted R2 over LOOR2 reach 40.2% and 21.4%, respectively, implying that both R2 and adjusted R2 often exaggerate, to a large extent, the estimated model's true ability to explain or predict the variations in the dependent variable. Moreover, we introduce the "error inflation factor" (EIF) and the "adjusted error inflation factor" (adjusted EIF) to measure the inflation of test errors (i.e., 1 − LOOR2) over training errors (i.e., 1 − R2 and 1 − adjusted R2), respectively. The regression results show that both EIF and adjusted EIF increase with the severity of overfitting as measured by the number of regressors and nonlinear terms, and the presence of outliers, but decrease with the sample size. These results are further validated by Monte Carlo simulations.
Statisticians have long recognized that R2 could be deceptively large as a measure of a model's true predictive ability on subsequent data. In fact, this recognition motivated the development of adjusted R2 as a way to shrink R2 by a degree-of-freedom adjustment (Larson, 1931; Wherry, 1931).Footnote2 However, Mayer (1975) demonstrates empirically that even adjusted R2 is a poor guide to the post-sample fit, which may be caused by excessive data mining. An alternative route to a solution relies on cross-validation, including leave-one-out cross-validation (Cochran, 1968; Hills, 1966; Lachenbruch & Mickey, 1968; Mosteller & Tukey, 1968), which turns out to be a more fruitful approach. Moreover, Efron and Morris (1973), Geisser (1974) and Stone (1974) propose to use cross-validation for model selection. For a modern survey on the methodology of cross-validation, see Arlot and Celisse (2010). This paper follows the tradition of cross-validation, as it measures test errors directly.
The rest of the paper is arranged as follows. Section 2 introduces leave-one-out R2 (LOOR2), the error inflation factor (EIF), and the adjusted error inflation factor (adjusted EIF). Section 3 studies the determinants of EIF and adjusted EIF via a meta-analysis replicating 279 regressions from 100 prominent economics papers. Section 4 conducts Monte Carlo simulations for further investigation. Section 5 concludes and offers suggestions for empirical researchers.
2. Leave-one-out R2 and error inflation factor
Consider the following linear regression model with n observations,

(1)  y_i = x_i′β + ε_i,  i = 1, …, n,

where y_i is the dependent variable for an individual i, x_i is a K × 1 vector of explanatory variables, β is the corresponding K × 1 vector of parameters, and ε_i is the error term. The model can be written in a matrix form,

(2)  y = Xβ + ε,

where y = (y_1, …, y_n)′, X = (x_1, …, x_n)′ and ε = (ε_1, …, ε_n)′. The well-known OLS estimator is given by β̂ = (X′X)⁻¹X′y. With β̂ estimated and the fitted values given by ŷ_i = x_i′β̂, we have R2 = 1 − Σ_i e_i² / Σ_i (y_i − ȳ)² in the presence of a constant term,Footnote3 and adjusted R2 given by R̄2 = 1 − [(n − 1)/(n − K)] · Σ_i e_i² / Σ_i (y_i − ȳ)², where ȳ is the sample mean of y_i, and e_i = y_i − ŷ_i is the OLS residual.
To implement leave-one-out regression omitting individual i, we simply run OLS with all but the ith observation. Denoting X_(−i) as the data matrix without the ith row, and y_(−i) as the outcome vector without the ith element, the OLS estimator leaving out the ith observation is simply,

(3)  β̂_(−i) = (X_(−i)′X_(−i))⁻¹ X_(−i)′ y_(−i).

With β̂_(−i) estimated, we can make an out-of-sample prediction for the ith observation as ỹ_i = x_i′β̂_(−i). Repeat the procedure for all observations in the sample to obtain (ỹ_1, …, ỹ_n), and the leave-one-out R2 (LOOR2) is given by

(4)  LOOR2 = ρ̃²,

where ρ̃ is the correlation coefficient between y_i and ỹ_i.
The procedure to compute LOOR2 appears to be cumbersome as it entails running n regressions, which may be computationally costly if the sample size is very large. Fortunately, for linear regressions, there is a short-cut formula for the leave-one-out regression omitting the ith observation (Hansen, 2022, Chapter 3),

(5)  β̂_(−i) = β̂ − (X′X)⁻¹ x_i ẽ_i,

where ẽ_i ≡ e_i/(1 − h_ii) is a scaled version of the OLS residual using the full sample, and h_ii ≡ x_i′(X′X)⁻¹x_i is known as the leverage for the ith observation. Using Equation (5), the leave-one-out coefficient β̂_(−i) can be readily computed with existing information. Therefore, in the case of linear regressions, only one regression is needed to compute LOOR2 after all. Thus, calculating LOOR2 in addition to R2 and adjusted R2 only imposes a minimal computational cost for linear regressions.Footnote4
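As an illustration (not the authors' replication code, which relies on Stata's "cv_regress"), the short-cut can be sketched in Python with NumPy. The function name loo_r2 is ours; LOOR2 is computed as the squared correlation between y_i and the leave-one-out predictions, consistent with the definition above.

```python
import numpy as np

def loo_r2(X, y):
    """Leave-one-out R2 from a single OLS fit, via the leverage short-cut.

    X must contain a constant column. The leave-one-out prediction for
    observation i is y_i - e_i / (1 - h_ii), so no repeated regressions
    are needed.
    """
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)              # full-sample OLS estimate
    e = y - X @ beta                        # full-sample residuals
    h = np.sum((X @ XtX_inv) * X, axis=1)   # leverages h_ii = diag of hat matrix
    y_tilde = y - e / (1.0 - h)             # leave-one-out predictions
    rho = np.corrcoef(y, y_tilde)[0, 1]
    return rho ** 2                         # squared correlation, as in Eq. (4)
```

Because Equation (5) is an exact identity, the output of this one-fit short-cut agrees with literally refitting n separate regressions to machine precision.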
After introducing LOOR2, a natural question arises about the relationship among R2, adjusted R2 and LOOR2. In general, R2 and adjusted R2 are larger than LOOR2, as it is usually more difficult to make out-of-sample predictions than in-sample predictions. For example, as simulations in Section 4.1 show, when noise variables are added to the regression, R2 keeps rising while adjusted R2 remains stable, but LOOR2 declines steadily.
To see it from a different perspective, (1 − R2) and (1 − R̄2) are generally smaller than (1 − LOOR2), as training errors are usually smaller than test errors. To measure the "inflation" of test errors over training errors, we define an error inflation factor (EIF) and an adjusted error inflation factor (adjusted EIF),Footnote5

(6)  EIF = (1 − LOOR2)/(1 − R2),

(7)  adjusted EIF = (1 − LOOR2)/(1 − R̄2),

where R̄2 is adjusted R2.
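In code, the two factors are one-liners; a minimal sketch (the function names are ours):

```python
def eif(r2, loo_r2):
    """Error inflation factor: test error (1 - LOOR2) over training error (1 - R2)."""
    return (1.0 - loo_r2) / (1.0 - r2)

def adjusted_eif(adj_r2, loo_r2):
    """Adjusted error inflation factor: (1 - LOOR2) over (1 - adjusted R2)."""
    return (1.0 - loo_r2) / (1.0 - adj_r2)
```

With no overfitting at all (LOOR2 equal to R2), both factors equal 1; overfitting pushes them above 1.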
We conjecture that both EIF and adjusted EIF increase with the severity of overfitting. Intuitively, when there is severe overfitting, training errors underestimate test errors to a great extent, resulting in large values of EIF and adjusted EIF. In the empirical study in Section 3, we consider three potential factors contributing to overfitting, i.e., the degree of freedom (sample size in excess of the number of regressors), the number of nonlinear terms (such as squared and interactive terms), and the presence of outliers. First, if the degree of freedom is small (e.g., a small sample size, or many regressors, or both), then linear regression is essentially fit to the noisy sample data, resulting in overfitting. Second, the presence of many nonlinear terms would increase the complexity of the regression function,Footnote6 and thus its ability to fit noisy data, which may also result in overfitting. Third, the nature of OLS estimation by minimizing the residual sum of squares implies that it is easily influenced by outliers, which again leads to overfitting.
In summary, based on the fact that overfitting reduces in-sample training errors at the expense of increasing out-of-sample test errors, we hypothesize that overfitting would result in elevated EIF and adjusted EIF. The next section investigates these relationships empirically.
3. A meta-analysis
3.1. Data source and variable definitions
In this section, we empirically compare R2, adjusted R2, and LOOR2, and investigate determinants of their gaps as represented by EIF and adjusted EIF. We focus on linear models in the recent literature where OLS is used for estimation. As a meta-analysis, our sample is compiled by replicating linear regressions from 100 empirical papers selected from American Economic Review (23 papers), Economic Journal (35 papers), European Economic Review (18 papers) and Review of Economic Studies (24 papers) during 2004–2021.Footnote7 Since each paper usually contains multiple OLS regressions, the 100 papers yield 279 regression results, giving our meta-analysis a sample size of 279.
For each of these 279 regressions, we calculate R2, adjusted R2, and LOOR2, as well as the error inflation factor (EIF, denoted as eif) and the adjusted error inflation factor (adjusted EIF, denoted as eif_a). The explanatory variables include the sample size (n), the number of regressors including the constant term (k), the number of nonlinear terms (nonlinear) in each regression, and the maximum value of leverage (lev_max) as well as its variance (lev_var).
An explanation of these two measures of outliers is in order. As mentioned in Section 2, the leverage for the ith observation is given by h_ii = x_i′(X′X)⁻¹x_i, which measures the influence of the ith observation on its own fitted value ŷ_i. Specifically, Equation (5) implies that

(8)  ŷ_i − ỹ_i = [h_ii/(1 − h_ii)] e_i.

It can be shown that 0 ≤ h_ii ≤ 1 with a sample average of K/n (Hansen, 2022, Chapter 3). Therefore, a large h_ii implies a large discrepancy between ỹ_i and ŷ_i according to Equation (8). The variable lev_max is simply the maximum leverage for each regression, which captures the greatest influence of a single observation in a particular regression. In the same spirit, one may consider the second largest leverage, the third largest leverage, and so on. But this approach gets tedious. Instead, we use the variance of leverage (lev_var) as a parsimonious representation. The rationale is that, given that the sum of all leverages is equal to the number of regressors (i.e., Σ_i h_ii = K), when some leverages are very large (i.e., close to the largest possible value of 1), other leverages are squeezed towards their smallest possible value of 0, which results in an increase in the variance of leverage.
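The leverage facts used here — the leverages sum to the number of regressors, so the average leverage is K/n and each h_ii lies in [0, 1] — are easy to verify numerically. A minimal Python sketch (our own illustration, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 5                                # n observations, k regressors incl. constant
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

# Leverages: diagonal of the hat matrix H = X (X'X)^{-1} X'
h = np.sum((X @ np.linalg.inv(X.T @ X)) * X, axis=1)

lev_sum, lev_mean = h.sum(), h.mean()        # equal k and k/n, respectively
lev_max, lev_var = h.max(), h.var()          # the two outlier measures in the text
```

Here lev_sum recovers k exactly, so a single leverage near 1 necessarily crowds the remaining leverages toward 0, raising lev_var.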
Summary statistics of the variables used in this study are presented in . While we focus on EIF (eif) and adjusted EIF (eif_a) in the regression analysis, it is instructive to first look at the ratios (R2/LOOR2) and (adjusted R2/LOOR2) as reported in the first two rows of . The median of (R2/LOOR2) is 1.402, implying that the median increase of R2 over LOOR2 reaches 40.2% in the sample. Similarly, the median of (adjusted R2/LOOR2) is 1.214, implying that the median increase of adjusted R2 over LOOR2 is 21.4%. These figures show that R2 and adjusted R2 often exaggerate, to a large extent, the estimated model's true ability to explain or predict the dependent variable as measured by LOOR2.
The minimum values of (R2/LOOR2) and (adjusted R2/LOOR2) are both above 1, as expected. However, the maximum values of (R2/LOOR2) and (adjusted R2/LOOR2) reach alarming levels of 48,065.68 and 37,483.06, respectively. Therefore, it is instructive to take a closer look at these extreme values, which come from the fifth of five regressions in Dower et al. (2021).
In an effort to estimate the value of a statistical life under Stalin's dictatorship, Dower et al. (2021) ran cross-sectional OLS regressions with 58 regions of the former Soviet Union as the units of observation. The dependent variable is the number of citizens repressed per 1,000 capita during the German and Polish operations of the Great Terror of 1937–1938. As typically done in empirical papers, Dower et al. (2021) report results from five regressions. As more regressors and nonlinear terms are added from regressions (1) through (5), R2 increases steadily from 0.244 to 0.584, while adjusted R2 increases from 0.202 to 0.456, indicating a significant boost to the goodness-of-fit at face value. However, while LOOR2 improves in regression (3), it drops to alarmingly low values of 0.003 and 0.000012 in regressions (4) and (5).Footnote8 Consequently, (R2/LOOR2) and (adjusted R2/LOOR2) reach outrageous levels of 48,065.68 and 37,483.06, respectively. Apparently, regressions (1) and (2) are underfit, whereas regressions (4) and (5) are severely overfit. Moreover, the maximum leverages are close to 1 in all regressions, indicating the presence of outliers.
3.2. Correlation analysis
As a preliminary exploration of the determinants of EIF and adjusted EIF, we present a correlation matrix for the major variables in the study. EIF (eif) is negatively correlated with the sample size (n) at the 5% level, while positively correlated with the number of regressors (k), the number of nonlinear terms (nonlinear), the maximum leverage (lev_max) and the variance of leverage (lev_var) at the 1% level. The correlation pattern between the adjusted EIF (eif_a) and these determinants is qualitatively similar. The only exception is that adjusted EIF (eif_a) is not significantly correlated with the number of regressors (k), perhaps due to the degree-of-freedom adjustment already made in adjusted R2.
3.3. Regression analysis
For the determinants of Log(EIF), we start from the following baseline regressionFootnote9

(9)  Log(EIF) = β_0 + β_1 lnn + β_2 lnk + β_3 nonlinear + β_4 lev_max + β_5 lev_var + ε.
In addition, we also interact lnn and lnk with lev_max and lev_var in Equation (9) to capture possible moderating effects of the sample size and the number of regressors on the two measures of outliers. Our dataset consists of 279 observations (regressions) from 100 papers, where each paper contributes 2.79 regressions on average. Apparently, our data are clustered at the paper level, where observations (regressions) from the same paper are likely correlated. Therefore, we use robust standard errors clustered at the paper level throughout. In addition, we may also control for "paper fixed effects" by giving observations from the same paper a specific intercept. However, since the sample size (n) varies little within a paper,Footnote10 adding the paper fixed effects may reduce our ability to detect the effects of sample size (n). Therefore, we report regression results both with and without the paper fixed effects.
We first report results from OLS regressions with Log(EIF) as the dependent variable. Column (1) reports the results from the baseline regression (9) without the paper fixed effects. The coefficient of lnn is negatively significant at the 1% level, indicating that a large sample size decreases overfitting, thus reducing the EIF. On the other hand, the coefficient of lnk is positively significant at the 1% level, implying that more regressors increase the chance of overfitting, which contributes to increased EIF. The coefficient of lev_var (variance of leverage) is positively significant at the 1% level, as outliers may result in overfitting, whereas the coefficients of lev_max and nonlinear are insignificant.
Column (2) interacts lnn and lnk with lev_max and lev_var. The coefficient of lnn*lev_var is negatively significant at the 1% level, implying that the effect of lev_var on EIF may have been mitigated by increasing the sample size. On the other hand, the coefficient of lnk*lev_var is positively significant at the 1% level, indicating that the effect of lev_var on EIF may have been magnified by increasing the number of regressors. Interestingly, the coefficient of lev_max is now positively significant at the 1% level, whereas the coefficient of lev_var loses significance. Note that these two measures of outliers are somewhat collinear, since lev_max and lev_var are positively correlated at the 1% level with a correlation coefficient of 0.685.
Column (3) adds the paper fixed effects to the baseline regression (9). The results are qualitatively similar to column (1), but with notable differences. In particular, the coefficient of lnn loses significance, perhaps due to too little variation in sample size (n) within the same paper. However, the coefficient of nonlinear (number of nonlinear terms) is now positively significant at the 1% level, as more nonlinear terms increase the model complexity, thus contributing to overfitting.
Column (4) interacts lnn and lnk with lev_max and lev_var while keeping the paper fixed effects. The results in column (4) are mostly similar to column (3). However, the coefficient of lev_var surprisingly becomes negatively significant at the 5% level with an estimate of −26.05. Nevertheless, the coefficient of lnk*lev_var is positively significant at the 5% level with an estimate of 14.75. Overall, since the sample mean of lnk is 2.503, the marginal effect of lev_var evaluated at the sample mean of lnk is (−26.05 + 2.503 × 14.75) = 10.87, which is similar in both magnitude and significance to the estimated coefficient of lev_var in columns (1) and (3) without interaction terms. This shows that lev_var increases overfitting more in high-dimensional data with a large number of covariates. Moreover, the coefficient of lnn*lev_max is negatively significant at the 1% level, implying that the effect of lev_max on overfitting could be mitigated by a large sample size.
We next report regression results for the dependent variable Log(Adjusted EIF). The results largely parallel those for Log(EIF), and the interpretations are also similar. In summary, these empirical results show that both Log(EIF) and Log(Adjusted EIF) increase with the severity of overfitting as measured by the number of regressors (lnk) and nonlinear terms (nonlinear), the maximum value of leverage (lev_max) and its variance (lev_var), but decrease with the sample size (lnn). Moreover, the effects of outliers (lev_max and lev_var) on overfitting could be moderated by the sample size and the number of regressors (lnn and lnk).
4. Monte Carlo simulations
In this section, we conduct Monte Carlo simulations to study the behavior of R2, adjusted R2, LOOR2, EIF, and adjusted EIF as factors related to overfitting change. Overall, the results from simulations are consistent with our findings in the empirical study above.
In the baseline setting, we draw 100 random observations (x_i, y_i) from a bivariate normal distribution. With a correlation coefficient of 0.9 between x_i and y_i, the population R2 is 0.81. The baseline regression is simply,

(10)  y_i = β_0 + β_1 x_i + ε_i.
Throughout, we repeat each simulation 1,000 times, and compute the average values of R2, adjusted R2, LOOR2, EIF, and adjusted EIF. We then investigate their behavior as factors related to overfitting change, including the number of regressors, the sample size, the number of nonlinear terms, and the presence of outliers.
4.1. Number of regressors
In this simulation, we increase the number of regressors by incrementally adding 1–50 noise variables to the baseline regression (10), where all noise variables are drawn independently of x and y. The sample size is kept at 100. The results are presented below.
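A compressed version of this experiment can be sketched in Python (our own illustration: a single batch of 50 noise regressors rather than the incremental 1–50 of the text, and standard-normal noise variables by assumption):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
x = rng.normal(size=n)
y = 0.9 * x + np.sqrt(1 - 0.81) * rng.normal(size=n)   # population R2 = 0.81

def fit_stats(X, y):
    """Return (R2, adjusted R2, LOOR2) for an OLS fit of y on X."""
    XtX_inv = np.linalg.inv(X.T @ X)
    e = y - X @ (XtX_inv @ (X.T @ y))                  # OLS residuals
    r2 = 1 - (e @ e) / np.sum((y - y.mean()) ** 2)
    adj_r2 = 1 - (1 - r2) * (len(y) - 1) / (len(y) - X.shape[1])
    h = np.sum((X @ XtX_inv) * X, axis=1)              # leverages
    loo = np.corrcoef(y, y - e / (1 - h))[0, 1] ** 2   # leave-one-out R2
    return r2, adj_r2, loo

X0 = np.column_stack([np.ones(n), x])                  # constant + signal
X1 = np.column_stack([X0, rng.normal(size=(n, 50))])   # add 50 pure-noise regressors
base, noisy = fit_stats(X0, y), fit_stats(X1, y)
# R2 rises mechanically with the noise variables, while LOOR2 falls
```

The contrast between base and noisy reproduces the qualitative pattern discussed next: in-sample R2 is rewarded for fitting noise, while the leave-one-out measure is penalized for it.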
The first figure graphs R2, adjusted R2 and LOOR2 against the number of regressors, where the gray horizontal line shows the population R2 of 0.81. As the number of regressors increases from 2 to 51, R2 increases steadily to above 0.9, clearly overestimating the ability of the model to explain the variation in y as a result of overfitting. On the other hand, adjusted R2 hovers between 0.8 and 0.81, showing the value of the degree-of-freedom adjustment. Interestingly, LOOR2 actually declines steadily to below 0.65, indicating that adding noise variables actually hurts the model's ability to predict out of sample. Clearly, both R2 and adjusted R2 exaggerate the model's true predictive ability, and the extent of exaggeration increases with the number of noise variables added. On the other hand, LOOR2 is robust to overfitting (at least as far as the model's real predictive ability is concerned), as overfitting resulting from adding noise variables reduces LOOR2. EIF and adjusted EIF, graphed against the number of regressors, admit essentially the same interpretation.

4.2. Sample size
In this simulation, the sample size is increased from 100 to 1000 in increments of 50. Meanwhile, we keep the number of regressors at 27, including the constant term, the signal variable x, and 25 independently drawn noise variables. The results are presented below.
The figure graphs R2, adjusted R2 and LOOR2 against the sample size, where the gray horizontal line again shows the population R2 of 0.81. Apparently, sample size has little effect on adjusted R2, which hovers just below 0.81, as it has already compensated for the changing degree of freedom. On the other hand, when the sample size is relatively small (say, n = 100), R2 is clearly above 0.81, indicating that the model is overfit in the presence of 25 noise variables. However, as the sample size increases towards 1000, the overfitting phenomenon diminishes, and R2 declines towards 0.81 (but stays above it). On the contrary, when the sample size is relatively small, LOOR2 is well below 0.81, as the model's predictive ability suffers in the presence of 25 noise variables. As the sample size increases, LOOR2 climbs up towards 0.81, as a large sample size alleviates overfitting. EIF and adjusted EIF, graphed against the sample size, admit a similar interpretation.

4.3. Number of nonlinear terms
To consider the effect of nonlinear terms, we simply add second through eleventh power terms of x_i to Equation (10),Footnote11

(11)  y_i = β_0 + β_1 x_i + β_2 x_i² + ⋯ + β_11 x_i¹¹ + ε_i.
The sample size is still kept at 100. The figure graphs R2, adjusted R2 and LOOR2 against the number of nonlinear terms. In this simple data generating process, adding more nonlinear terms does not have much effect on either R2 or adjusted R2, although R2 does climb up slightly. However, when more nonlinear terms are added, LOOR2 decreases rapidly, as these nonlinear terms drive up the model's complexity, resulting in overfitting and a reduced ability to predict out of sample. EIF and adjusted EIF, graphed against the number of nonlinear terms, admit a similar interpretation.
4.4. Outliers
In this simulation, we generate outliers simply by multiplying the largest value of x in the sample by 2 through 100. As the multiplier on the largest x grows from 1 to 100, the maximum leverage increases rapidly and approaches its largest possible value of 1.
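The outlier construction can be sketched as follows (our own illustration, using a single multiplier of 100 rather than the full grid of 2 through 100):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x = rng.normal(size=n)

def max_leverage(x):
    """Maximum leverage in a regression on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    h = np.sum((X @ np.linalg.inv(X.T @ X)) * X, axis=1)
    return h.max()

before = max_leverage(x)
x_out = x.copy()
x_out[np.argmax(x_out)] *= 100   # inflate the largest x by a factor of 100
after = max_leverage(x_out)
# the manufactured outlier's leverage is now close to the upper bound of 1
```

With a simple regressor, h_ii = 1/n + (x_i − x̄)²/Σ_j(x_j − x̄)², so a single extreme x_i dominates the sum of squared deviations and drives its own leverage toward 1.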
The figure presents the simulation results as the maximum leverage increases, graphing R2, adjusted R2 and LOOR2 against the maximum leverage. Initially, as the maximum leverage grows, LOOR2 drops much faster than R2 and adjusted R2, as the model's true predictive ability declines while overfitting occurs in the presence of an ever more extreme outlier. However, as LOOR2 drops closer to its lower bound of 0, its rate of decline inevitably falls behind that of R2 and adjusted R2. In the end, as the multiplier on the largest x increases towards 100, the OLS fit becomes very poor, and thus R2, adjusted R2, and LOOR2 all decline towards their common lower bound of 0.
EIF and adjusted EIF, graphed against the maximum leverage, tell a similar story. Initially, both EIF and adjusted EIF increase, but they start to decline when the maximum leverage is around 0.5 (and the multiplier on the largest x is 5), resulting in an inverted U-shape.

5. Conclusion
Goodness-of-fit measures R2 and adjusted R2 are routinely reported in empirical studies with the implicit presumption that they represent the percentage by which the regressors jointly explain or predict the variation of the dependent variable. This paper shows that R2 and adjusted R2 are inaccurate in this regard and often overly optimistic in the presence of overfitting resulting from a small sample size, many regressors and nonlinear terms, and the presence of outliers. As a remedy, leave-one-out R2 (LOOR2) can be readily computed, and used as a reliable measure of the model's true ability to predict out of sample.
Moreover, we introduce the concepts of "error inflation factor" (EIF) and "adjusted error inflation factor" (adjusted EIF) as the degree of inflation of test errors (1 − LOOR2) over training errors represented by (1 − R2) and (1 − adjusted R2), respectively. We then conduct a meta-analysis of the determinants of EIF and adjusted EIF by replicating 279 regressions from 100 papers in four top economics journals during 2004–2021. The median increases of R2 and adjusted R2 over LOOR2 reach 40.2% and 21.4%, respectively, in this sample. The regression results show that both EIF and adjusted EIF increase with the severity of overfitting, as measured by the number of regressors and nonlinear terms, and the presence of outliers, but decrease with the sample size. These results are further validated by Monte Carlo simulations.
For empirical researchers, we recommend that they report LOOR2 alongside R2 and adjusted R2, since LOOR2 is robust to overfitting as a measure of the model's true predictive ability out of sample. Moreover, when LOOR2 diverges from either R2 or adjusted R2, this is a sign of overfitting; empirical researchers should be concerned and look for possible causes, such as a complicated functional form (e.g., too many nonlinear terms) or the presence of outliers (e.g., a maximum leverage close to 1). As a practical matter, while overfitting reduces bias, it usually increases variance to a greater extent, which results in increased mean squared errors of the estimator and reduced significance of the parameters of interest. Therefore, one way to increase parameter significance is to reduce overfitting.Footnote12
As model validation via out-of-sample prediction becomes increasingly common in many disciplines, it is time for economists to honestly embrace LOOR2 as a safeguard against overfitting, which is hard to detect by using conventional R2 and adjusted R2 based on in-sample fit. In this way, economists can more easily avoid the trap of overfitting, and make their empirical findings more robust. Providers of statistical software (e.g., Stata) can also help in this regard by routinely reporting LOOR2 alongside traditional R2 and adjusted R2 in the regression output.
Disclosure statement
No potential conflict of interest was reported by the authors.
Supplementary material
Supplemental data for this article can be accessed online at https://doi.org/10.1080/15140326.2023.2207326.
Additional information
Notes on contributors
Qiang Chen
Qiang Chen is a professor at the School of Economics, Shandong University.
Ji Qi
Ji Qi is a PhD student at the School of Economics, Shandong University.
Notes
1 To be sure, economics is not the only discipline in this regard. For example, Parady et al. (2021) lament the overreliance on statistical goodness-of-fit and under-reliance on model validation in the transportation literature.
2 The original formula for adjusted R2 was first proposed in a paper by M. J. B. Ezekiel, who read it before the Mathematical Society at its annual meeting in 1928, but gave the credit to B. B. Smith.
3 We ignore the case of linear regression without a constant term, as it is rarely encountered in practice.
4 For example, the short-cut algorithm for computing LOOR2 could be implemented in Stata by using the user-written command "cv_regress" (Rios-Avila, 2018) after the usual "regress" command for OLS regression.
5 These terminologies are in the same spirit as “variance inflation factor” (VIF).
6 In fact, the presence of many covariates also increases the complexity of regression function.
7 These four journals are selected partly because their replication data and programs are more easily accessible. See the Appendix for a complete list of these 100 papers.
8 Note that Dower et al. (2021) only report R2.
9 The results of using EIF or adjusted EIF as the dependent variables are qualitatively similar, but the fit is slightly worse. To save space, we only report results using Log(EIF) and Log(Adjusted EIF) as the dependent variables.
10 Typically, the sample sizes of regressions within a paper change because of adding more variables, which may result in missing observations.
11 As pointed out by an anonymous referee, adding nonlinear terms can be viewed as a particular case of including additional correlated covariates.
12 We thank an anonymous referee for useful discussions about the relation between overfitting and parameter significance, and more studies are needed in this direction.
References
- Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40–79. https://doi.org/10.1214/09-SS054
- Cochran, W. G. (1968). Commentary on estimation of error rates in discriminant analysis. Technometrics, 10(1), 204–205. https://doi.org/10.1080/00401706.1968.10490548
- Dower, P. C., Markevich, A., & Weber, S. (2021). The value of a statistical life in a dictatorship: Evidence from Stalin. European Economic Review, 133, 103663. https://doi.org/10.1016/j.euroecorev.2021.103663
- Efron, B., & Morris, C. (1973). Combining possibly related estimation problems (with discussion). Journal of the Royal Statistical Society, Series B, 35, 379–402.
- Geisser, S. (1974). A predictive approach to the random effect model. Biometrika, 61(1), 101–107. https://doi.org/10.1093/biomet/61.1.101
- Hansen, B. E. (2022). Econometrics. Princeton University Press.
- Hills, M. (1966). Allocation rules and their error rates. Journal of the Royal Statistical Society Series B (Methodological), 28(1), 1–31. https://doi.org/10.1111/j.2517-6161.1966.tb00614.x
- Lachenbruch, P. A., & Mickey, M. R. (1968). Estimation of error rates in discriminant analysis. Technometrics, 10(1), 1–11. https://doi.org/10.1080/00401706.1968.10490530
- Larson, S. C. (1931). The shrinkage of the coefficient of multiple correlation. Journal of Educational Psychology, 22(1), 45–55. https://doi.org/10.1037/h0072400
- Mayer, T. (1975). Selecting economic hypotheses by goodness of fit. The Economic Journal, 85(340), 877–883. https://doi.org/10.2307/2230630
- Mosteller, F., & Tukey, J. W. (1968). Data analysis, including statistics. In G. Lindzey & E. Aronson (Eds.), Handbook of social psychology (Vol. 2). Addison-Wesley.
- Parady, G., Ory, D., & Walker, J. (2021). The overreliance on statistical goodness-of-fit and under-reliance on model validation in discrete choice models: A review of validation practices in the transportation academic literature. Journal of Choice Modelling, 38, 100257. https://doi.org/10.1016/j.jocm.2020.100257
- Rios-Avila, F. (2018). CV_REGRESS: Stata module to estimate the leave-one-out error for linear regression models. In Statistical software components, S458469. Boston College Department of Economics. Retrieved June 11, 2020.
- Stone, M. A. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society Series B (Methodological), 36(2), 111–147. https://doi.org/10.1111/j.2517-6161.1974.tb00994.x
- Wherry, R. J. (1931). A new formula for predicting the shrinkage of the coefficient of multiple correlation. Annals of Mathematical Statistics, 2(4), 440–457. https://doi.org/10.1214/aoms/1177732951