Applied Econometrics

How much should we trust R2 and adjusted R2: evidence from regressions in top economics journals and Monte Carlo simulations

Qiang Chen & Ji Qi
Article: 2207326 | Received 16 Nov 2022, Accepted 21 Apr 2023, Published online: 02 May 2023

ABSTRACT

R2 and adjusted R2 may exaggerate a model’s true ability to predict the dependent variable in the presence of overfitting, whereas leave-one-out R2 (LOOR2) is robust to overfitting. We demonstrate this by replicating 279 regressions from 100 papers in top economics journals, where the median increases of R2 and adjusted R2 over LOOR2 reach 40.2% and 21.4% respectively. The inflation of test errors over training errors increases with the severity of overfitting as measured by the number of regressors and nonlinear terms, and the presence of outliers, but decreases with the sample size. These results are further validated by Monte Carlo simulations.

1. Introduction

In empirical studies, R2 and adjusted R2 (denoted as Rˉ2) are routinely reported as measures of goodness-of-fit for linear regressions. For example, an R2 of 0.8 is usually taken to imply that the explanatory variables jointly explain 80% of the variation in the dependent variable. But how reliable is this interpretation?

It is well known that R2 and Rˉ2 only measure in-sample fit, which may not be a good indicator of the model’s true ability to explain or predict out of sample. In particular, it is well understood in the machine learning literature that training errors (as represented by 1 − R2 and 1 − Rˉ2) can be poor measures of the true test errors, which arise when the model is used to predict data it has not yet seen. Nevertheless, as of today, most economists still happily use R2 and Rˉ2 to measure goodness-of-fit, without worrying about their potential pitfalls.Footnote1

This paper takes this issue seriously. The essential problem is that R2 and Rˉ2 may exaggerate a model’s true ability to explain or predict the dependent variable, especially in the presence of overfitting. Overfitting occurs when a model is excessively fit to noisy sample data, for example because of a low degree of freedom (a small sample size or too many covariates), a complicated functional form with many nonlinear terms, or the presence of outliers; this compromises the model’s ability to uncover the true relationship between the dependent and explanatory variables, as well as its performance in out-of-sample prediction.

To address this problem, we recommend leave-one-out cross-validated R2 (LOOR2 for short) as a better measure of goodness-of-fit for linear regressions. While LOOR2 has been around for some time, this paper suggests that economists routinely report LOOR2 in their empirical work alongside R2 and adjusted R2 (if not in place of the latter two). LOOR2 has a number of advantages. First, LOOR2 is robust to overfitting, as it measures the true test errors and thus the model’s real ability to explain or predict the dependent variable. Second, while five-fold or ten-fold cross-validation is popular in machine learning for measuring test errors, its results depend on the random splitting of the sample into five or ten folds (parts) of roughly equal size. In contrast, the result of leave-one-out cross-validation is deterministic, since one observation is left out at a time and no random sampling is involved. Last but not least, for linear regressions there is a short-cut formula for computing LOOR2 such that only one regression is needed, so the computational cost is minimal.

To support the above claims, we replicate 279 regressions from 100 empirical papers published in four top economics journals during 2004–2021. In this sample, the median increases of R2 and Rˉ2 over LOOR2 reach 40.2% and 21.4%, respectively, implying that both R2 and Rˉ2 often greatly exaggerate the estimated model’s true ability to explain or predict the variation in the dependent variable. Moreover, we introduce the “error inflation factor” (EIF) and the “adjusted error inflation factor” (adjusted EIF) to measure the inflation of test errors (i.e., 1 − LOOR2) over the training errors implied by R2 and adjusted R2 (i.e., 1 − R2 and 1 − Rˉ2), respectively. The regression results show that both EIF and adjusted EIF increase with the severity of overfitting as measured by the number of regressors and nonlinear terms, and the presence of outliers, but decrease with the sample size. These results are further validated by Monte Carlo simulations.

Statisticians have long recognized that R2 could be deceptively large as a measure of a model’s true predictive ability on subsequent data. In fact, this recognition motivated the development of adjusted R2 as a way to shrink R2 by a degree-of-freedom adjustment (Larson, Citation1931; Wherry, Citation1931).Footnote2 However, Mayer (Citation1975) demonstrates empirically that even Rˉ2 is a poor guide to post-sample fit, which may be caused by excessive data mining. An alternative route relies on cross-validation, including leave-one-out cross-validation (Cochran, Citation1968; Hills, Citation1966; Lachenbruch & Mickey, Citation1968; Mosteller & Tukey, Citation1968), which turns out to be a more fruitful approach. Moreover, Efron and Morris (Citation1973), Geisser (Citation1974) and Stone (Citation1974) propose using cross-validation for model selection. For a modern survey of the methodology of cross-validation, see Arlot and Celisse (Citation2010). This paper follows the tradition of cross-validation, as it measures test errors directly.

The rest of the paper is arranged as follows. Section 2 introduces leave-one-out R2 (LOOR2), the error inflation factor (EIF), and the adjusted error inflation factor (adjusted EIF). Section 3 studies the determinants of EIF and adjusted EIF via a meta-analysis replicating 279 regressions from 100 prominent economics papers. Section 4 conducts Monte Carlo simulations for further investigation. Section 5 concludes with suggestions for empirical researchers.

2. Leave-one-out R2 and error inflation factor

Consider the following linear regression model with n observations,

(1) yi = xi′β + εi,  i = 1, …, n,

where yi is the dependent variable for an individual i, and xi is a k×1 vector of explanatory variables, β is the corresponding k×1 vector of parameters, and εi is the error term. The model can be written in a matrix form,

(2) y = Xβ + ε,

where y = (y1, …, yn)′, X = (x1, …, xn)′ and ε = (ε1, …, εn)′. The well-known OLS estimator is given by βˆ = (X′X)⁻¹X′y. With βˆ estimated and the fitted values given by yˆi = xi′βˆ, we have R2 = Corr(yi, yˆi)² = 1 − Σᵢ ei² / Σᵢ (yi − yˉ)² in the presence of a constant term,Footnote3 and adjusted R2 given by Rˉ2 = 1 − [Σᵢ ei² / (n − k)] / [Σᵢ (yi − yˉ)² / (n − 1)], where yˉ is the sample mean of yi, ei is the OLS residual, and the sums run over i = 1, …, n.

To implement the leave-one-out regression omitting individual i, we simply run an OLS regression with all but the ith observation. Denoting X(i) as the data matrix X without the ith row, and y(i) as the outcome vector y without the ith element, the OLS estimator leaving out the ith observation is simply,

(3) βˆ(i) = (X(i)′X(i))⁻¹ X(i)′y(i).

With βˆ(i) estimated, we can make an out-of-sample prediction for the ith observation as yˆ(i) = xi′βˆ(i). Repeating the procedure for all observations in the sample yields {yˆ(i)} for i = 1, …, n, and the leave-one-out R2 (LOOR2) is given by

(4) LOOR2 = Corr(yi, yˆ(i))²,  with 0 ≤ LOOR2 ≤ 1,

where Corr(yi,yˆ(i)) is the correlation coefficient between yi and yˆ(i).

The procedure to compute LOOR2 appears to be cumbersome as it entails running n regressions, which may be computationally costly if the sample size n is very large. Fortunately, for linear regressions, there is a short-cut formula for running leave-one-out regression omitting the ith observation (Hansen, Citation2022, Chapter 3),

(5) βˆ(i) = βˆ − (X′X)⁻¹ xi e˜i,

where e˜i = ei / (1 − levi) is a scaled version of the OLS residual ei from the full sample, and levi = xi′(X′X)⁻¹xi is known as the leverage of the ith observation. Using Equation (5), the leave-one-out coefficient βˆ(i) can be readily computed from quantities already available from the full-sample regression. Therefore, in the case of linear regressions, only one regression is needed to compute LOOR2 after all. Thus, calculating LOOR2 in addition to R2 and adjusted R2 imposes only a minimal computational cost for linear regressions.Footnote4
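
To make the short-cut concrete, the following minimal Python sketch (assuming only numpy) computes LOOR2 from a single OLS fit; the function name loo_r2 and the toy data are our own illustration, not part of the paper’s replication code.

```python
import numpy as np

def loo_r2(y, X):
    """Leave-one-out R2 from a single OLS fit, using the leverage short-cut.

    y is an (n,) outcome vector and X an (n, k) regressor matrix including
    the constant term.  The leave-one-out prediction error for observation i
    is e_i / (1 - lev_i), so no additional regressions are needed.
    """
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y                    # full-sample OLS estimate
    e = y - X @ beta_hat                            # full-sample residuals e_i
    lev = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverages lev_i = x_i'(X'X)^{-1} x_i
    y_hat_loo = y - e / (1 - lev)                   # out-of-sample predictions yhat_(i)
    return np.corrcoef(y, y_hat_loo)[0, 1] ** 2     # LOOR2 = Corr(y_i, yhat_(i))^2

# Toy example: one signal regressor plus 20 pure noise regressors, n = 100.
rng = np.random.default_rng(0)
n = 100
x = rng.standard_normal(n)
y = 0.9 * x + np.sqrt(1 - 0.81) * rng.standard_normal(n)
X = np.column_stack([np.ones(n), x, rng.standard_normal((n, 20))])
print(loo_r2(y, X))   # typically noticeably below the in-sample R2
```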

After introducing LOOR2, a natural question arises about the relationship among R2, adjusted R2, and LOOR2. In general, R2 and adjusted R2 are larger than LOOR2, as it is usually more difficult to make out-of-sample predictions than in-sample predictions. For example, as the simulations in Section 4.1 show, when noise variables are added to the regression, R2 keeps rising while adjusted R2 remains stable, but LOOR2 declines steadily.

To see this from a different perspective, (1 − R2) and (1 − adjusted R2) are generally smaller than (1 − LOOR2), as training errors are usually smaller than test errors. To measure the “inflation” of test errors over training errors, we define an error inflation factor (EIF) and an adjusted error inflation factor (adjusted EIF),Footnote5

(6) EIF = (1 − LOOR2) / (1 − R2),
(7) Adjusted EIF = (1 − LOOR2) / (1 − Rˉ2),

where Rˉ2 is adjusted R2.
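
Building on the sketch above (again an illustration rather than the paper’s code), EIF and adjusted EIF follow directly from R2, adjusted R2, and LOOR2:

```python
def error_inflation(y, X):
    """Return (R2, adjusted R2, LOOR2, EIF, adjusted EIF) for one OLS regression."""
    n, k = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta_hat
    rss, tss = np.sum(e ** 2), np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    r2_adj = 1 - (rss / (n - k)) / (tss / (n - 1))
    loor2 = loo_r2(y, X)                  # from the sketch above
    eif = (1 - loor2) / (1 - r2)          # Equation (6)
    eif_adj = (1 - loor2) / (1 - r2_adj)  # Equation (7)
    return r2, r2_adj, loor2, eif, eif_adj
```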

We conjecture that both EIF and adjusted EIF increase with the severity of overfitting. Intuitively, when there is severe overfitting, training errors underestimate test errors to a great extent, resulting in large values of EIF and adjusted EIF. In the empirical study in Section 3, we consider three potential factors contributing to overfitting: the degree of freedom (the sample size in excess of the number of regressors), the number of nonlinear terms (such as squared and interaction terms), and the presence of outliers. First, if the degree of freedom is small (e.g., a small sample size, many regressors, or both), then the linear regression is essentially fitted to noise in the sample data, resulting in overfitting. Second, the presence of many nonlinear terms increases the complexity of the regression function,Footnote6 and thus its ability to fit noisy data, which may also result in overfitting. Third, because OLS estimation minimizes the residual sum of squares, it is easily influenced by outliers, which again leads to overfitting.

In summary, based on the fact that overfitting reduces in-sample training errors at the expense of increasing out-of-sample test errors, we hypothesize that overfitting would result in elevated EIF and adjusted EIF. The next section investigates these relationships empirically.

3. A meta-analysis

3.1. Data source and variable definitions

In this section, we empirically compare R2, adjusted R2, and LOOR2, and investigate the determinants of their gaps as represented by EIF and adjusted EIF. We focus on linear models estimated by OLS in the recent literature. As a meta-analysis, our sample is compiled by replicating linear regressions from 100 empirical papers selected from American Economic Review (23 papers), Economic Journal (35 papers), European Economic Review (18 papers) and Review of Economic Studies (24 papers) during 2004–2021.Footnote7 Since each paper usually contains multiple OLS regressions, the 100 papers yield 279 regression results, giving a sample size of 279.

For each of these 279 regressions, we calculate R2, adjusted R2, and LOOR2, as well as the error inflation factor (EIF, denoted as eif) and the adjusted error inflation factor (adjusted EIF, denoted as eif_a). The explanatory variables include the sample size (n), the number of regressors including the constant term (k), the number of nonlinear terms (nonlinear) in each regression, and the maximum value of leverage (lev_max) as well as its variance (lev_var).

An explanation of these two measures of outliers is in order. As mentioned in Section 2, the leverage of the ith observation is given by levi = xi′(X′X)⁻¹xi, which measures the influence of the ith observation on βˆ. Specifically, Equation (5) implies that

(8) βˆ − βˆ(i) = (X′X)⁻¹ xi ei / (1 − levi).

It can be shown that 0 ≤ levi ≤ 1, with a sample average of k/n (Hansen, Citation2022, Chapter 3). Therefore, a large levi implies a large discrepancy between βˆ and βˆ(i) according to Equation (8). The variable lev_max is simply the maximum leverage in each regression, which captures the greatest influence of a single observation in that regression. In the same spirit, one could consider the second largest leverage, the third largest leverage, and so on, but this approach quickly becomes tedious. Instead, we use the variance of the leverages (lev_var) as a parsimonious summary. The rationale is that, since the leverages sum to the number of regressors (i.e., Σᵢ levi = k), when some leverages are very large (close to their largest possible value of 1), the remaining leverages are squeezed towards their smallest possible value of 0, which increases the variance of the leverages.
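
As an illustration, the two outlier measures could be computed as follows (our own helper, reusing numpy from the sketch in Section 2):

```python
def leverage_stats(X):
    """Maximum and variance of the leverages lev_i = x_i'(X'X)^{-1} x_i."""
    lev = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    assert np.isclose(lev.sum(), X.shape[1])   # leverages sum to k
    return lev.max(), lev.var()                # lev_max and lev_var
```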

Summary statistics of the variables used in this study are presented in Table 1. While we focus on EIF (eif) and adjusted EIF (eif_a) in the regression analysis, it is instructive to first look at the ratios (R2/LOOR2) and (adjusted R2/LOOR2), as reported in the first two rows of Table 1. The median of (R2/LOOR2) is 1.402, implying that the median increase of R2 over LOOR2 reaches 40.2% in the sample. Similarly, the median of (adjusted R2/LOOR2) is 1.214, implying that the median increase of adjusted R2 over LOOR2 is 21.4%. These figures show that R2 and adjusted R2 often greatly exaggerate the estimated model’s true ability to explain or predict the dependent variable, as measured by LOOR2.

Table 1. Summary statistics.

The minimum values of (R2/LOOR2) and (adjusted R2/LOOR2) are both above 1, as expected. However, the maximum values of (R2/LOOR2) and (adjusted R2/LOOR2) reach alarming levels of 48,065.68 and 37,483.06, respectively. Therefore, it is instructive to take a closer look at these extreme values, which come from the fifth of five regressions in Dower et al. (Citation2021), as shown in Table 2.

Table 2. Five regressions in Dower et al. (Citation2021).

In an effort to estimate the value of a statistical life under Stalin’s dictatorship, Dower et al. (Citation2021) ran cross-sectional OLS regressions with 58 regions of the former Soviet Union as the units of observation. The dependent variable is the number of citizens repressed during the German and Polish operations of the Great Terror of 1937–1938, per 1,000 people. As is typical in empirical papers, Dower et al. (Citation2021) report results from five regressions. As more regressors and nonlinear terms are added from regressions (1) through (5), R2 increases steadily from 0.244 to 0.584, while adjusted R2 increases from 0.202 to 0.456, indicating a significant boost to the goodness-of-fit at face value. However, while LOOR2 improves in regression (3), it drops to alarmingly low values of 0.003 and 0.000012 in regressions (4) and (5).Footnote8 Consequently, (R2/LOOR2) and (adjusted R2/LOOR2) reach outrageous levels of 48,065.68 and 37,483.06, respectively. Apparently, regressions (1) and (2) are underfit, whereas regressions (4) and (5) are severely overfit. Moreover, the maximum leverages are close to 1 in all regressions, indicating the presence of outliers.

3.2. Correlation analysis

As a preliminary exploration of the determinants of EIF and adjusted EIF, Table 3 presents a correlation matrix for the major variables in the study. EIF (eif) is negatively correlated with the sample size (n) at the 5% level, while positively correlated with the number of regressors (k), the number of nonlinear terms (nonlinear), the maximum leverage (lev_max) and the variance of leverage (lev_var) at the 1% level. The correlation pattern between the adjusted EIF (eif_a) and these determinants is qualitatively similar. The only exception is that adjusted EIF (eif_a) is not significantly correlated with the number of regressors (k), perhaps due to the degree-of-freedom adjustment already made in adjusted R2.

Table 3. Correlation matrix for major variables in the study.

3.3. Regression analysis

For the determinants of Log(EIF), we start from the following baseline regressionFootnote9

(9) ln(eifi) = β0 + β1 ln(ni) + β2 ln(ki) + β3 nonlineari + β4 lev_maxi + β5 lev_vari + εi.

In addition, we also interact lnn and lnk with lev_max and lev_var in Equation (9) to capture possible moderating effects of the sample size and the number of regressors on the two measures of outliers. Our dataset consists of 279 observations (regressions) from 100 papers, where each paper contributes 2.79 regressions on average. Our data are thus clustered at the paper level, and observations (regressions) from the same paper are likely correlated. Therefore, we use robust standard errors clustered at the paper level throughout. In addition, we may also control for “paper fixed effects” by giving observations from the same paper a paper-specific intercept. However, since the sample size (n) varies little within a paper,Footnote10 adding the paper fixed effects may reduce our ability to detect the effects of sample size (n). Therefore, we report regression results both with and without the paper fixed effects.
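
One way such a specification with paper-clustered standard errors could be estimated is sketched below in Python (assuming pandas and statsmodels); the data frame here is a synthetic placeholder standing in for the 279 replicated regressions, and the column names ln_eif, ln_n, ln_k, nonlinear, lev_max, lev_var and paper_id are ours, not the authors’ replication code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic placeholder data; the real meta-data set is described in Section 3.1.
rng = np.random.default_rng(1)
meta = pd.DataFrame({
    "paper_id": rng.integers(0, 100, size=279),
    "ln_eif": rng.normal(size=279),
    "ln_n": rng.normal(7.0, 1.0, size=279),
    "ln_k": rng.normal(2.5, 0.5, size=279),
    "nonlinear": rng.integers(0, 5, size=279),
    "lev_max": rng.uniform(0, 1, size=279),
    "lev_var": rng.uniform(0, 0.01, size=279),
})

# Baseline specification (Equation (9)) with standard errors clustered at the paper level.
baseline = smf.ols(
    "ln_eif ~ ln_n + ln_k + nonlinear + lev_max + lev_var",
    data=meta,
).fit(cov_type="cluster", cov_kwds={"groups": meta["paper_id"]})
print(baseline.summary())
```

Paper fixed effects could be added by including C(paper_id) in the formula, keeping in mind that ln_n varies little within a paper.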

Table 4 reports results from OLS regressions with Log(EIF) as the dependent variable. Column (1) of Table 4 reports the results from the baseline regression (9) without the paper fixed effects. The coefficient of lnn is negatively significant at the 1% level, indicating that a large sample size decreases overfitting, thus reducing the EIF. On the other hand, the coefficient of lnk is positively significant at the 1% level, implying that more regressors increase the chance of overfitting, which contributes to an increased EIF. The coefficient of lev_var (variance of leverage) is positively significant at the 1% level, as outliers may result in overfitting, whereas the coefficients of lev_max and nonlinear are insignificant.

Table 4. Determinants of log(EIF).

Column (2) of Table 4 interacts lnn and lnk with lev_max and lev_var. The coefficient of lnn*lev_var is negatively significant at the 1% level, implying that the effect of lev_var on EIF may be mitigated by increasing the sample size. On the other hand, the coefficient of lnk*lev_var is positively significant at the 1% level, indicating that the effect of lev_var on EIF may be magnified by increasing the number of regressors. Interestingly, the coefficient of lev_max is now positively significant at the 1% level, whereas the coefficient of lev_var loses significance. Note that these two measures of outliers are somewhat collinear, since lev_max and lev_var are positively correlated at the 1% level with a correlation coefficient of 0.685 (see Table 3).

Column (3) of Table 4 adds the paper fixed effects to the baseline regression (9). The results are qualitatively similar to column (1), but with notable differences. In particular, the coefficient of lnn loses significance, perhaps due to too little variation in the sample size (n) within the same paper. However, the coefficient of nonlinear (number of nonlinear terms) is now positively significant at the 1% level, as more nonlinear terms increase the model complexity, thus contributing to overfitting.

Column (4) of Table 4 interacts lnn and lnk with lev_max and lev_var while keeping the paper fixed effects. The results in column (4) are mostly similar to those in column (3). However, the coefficient of lev_var surprisingly becomes negatively significant at the 5% level with an estimate of -26.05. Nevertheless, the coefficient of lnk*lev_var is positively significant at the 5% level with an estimate of 14.75. Since the sample mean of lnk is 2.503, the marginal effect of lev_var evaluated at the sample mean of lnk is (-26.05 + 2.503×14.75) = 10.87, which is similar in both magnitude and significance to the estimated coefficient of lev_var in columns (1) and (3) without interaction terms. This shows that lev_var increases overfitting more in high-dimensional data with a large number of covariates. Moreover, the coefficient of lnn*lev_max is negatively significant at the 1% level, implying that the effect of lev_max on overfitting could be mitigated by a large sample size.

Table 5 reports regression results for the dependent variable Log(Adjusted EIF). The results in Table 5 largely parallel those in Table 4, and the interpretations are also similar. In summary, these empirical results show that both Log(EIF) and Log(Adjusted EIF) increase with the severity of overfitting as measured by the number of regressors (lnk), the number of nonlinear terms (nonlinear), the maximum value of leverage (lev_max) and its variance (lev_var), but decrease with the sample size (lnn). Moreover, the effects of outliers (lev_max and lev_var) on overfitting could be moderated by the sample size and the number of regressors (lnn and lnk).

Table 5. Determinants of log(adjusted EIF).

4. Monte Carlo simulations

In this section, we conduct Monte Carlo simulations to study the behavior of R2, adjusted R2, LOOR2, EIF, and adjusted EIF as factors related to overfitting change. Overall, the results from simulations are consistent with our findings in the empirical study above.

In the baseline setting, we draw 100 random observations of (Y, X) from a bivariate normal distribution with zero means, unit variances, and a correlation coefficient of 0.9 between Y and X, so that the population R2 is 0.81. The baseline regression is simply,

(10) Yi = β0 + β1Xi + εi,  i = 1, …, 100.

Throughout, we repeat each simulation 1,000 times and compute the average values of R2, adjusted R2, LOOR2, EIF, and adjusted EIF. We then investigate their behavior as factors related to overfitting change, including the number of regressors, the sample size, the number of nonlinear terms, and the presence of outliers.

4.1. Number of regressors

In this simulation, we increase the number of regressors by incrementally adding 1 to 50 noise variables to the baseline regression (10), where all noise variables are independently distributed as N(0,1). The sample size is kept at 100. The results are presented in Figure 1.

Figure 1. The effects of number of regressors.

Figure 1 graphs R2, adjusted R2 and LOOR2 against the number of regressors, where the gray horizontal line shows the population R2 of 0.81. As the number of regressors increases from 2 to 51, R2 increases steadily to above 0.9, clearly overestimating the ability of the model to explain the variation in Y as a result of overfitting. On the other hand, adjusted R2 hovers between 0.8 and 0.81, showing the value of the degree-of-freedom adjustment. Interestingly, LOOR2 actually declines steadily to below 0.65, indicating that adding noise variables hurts the model’s ability to predict out of sample. Clearly, both R2 and adjusted R2 exaggerate the model’s true predictive ability, and the extent of exaggeration increases with the number of noise variables added. In contrast, LOOR2 is robust to overfitting (at least as far as the model’s real predictive ability is concerned), as overfitting resulting from adding noise variables reduces LOOR2. Figure 1 also graphs EIF and adjusted EIF against the number of regressors; the interpretation is essentially the same.
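
This experiment could be reproduced along the following lines, reusing the loo_r2 and error_inflation sketches from Section 2 (our own illustration; the parameter choices follow the text):

```python
def simulate_noise_regressors(n_noise, n=100, reps=1000, seed=0):
    """Average R2, adjusted R2 and LOOR2 when n_noise pure noise regressors are added."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, 0.9], [0.9, 1.0]])   # Corr(Y, X) = 0.9, population R2 = 0.81
    out = np.zeros(3)
    for _ in range(reps):
        yx = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y, x = yx[:, 0], yx[:, 1]
        X = np.column_stack([np.ones(n), x, rng.standard_normal((n, n_noise))])
        r2, r2_adj, loor2, _, _ = error_inflation(y, X)
        out += np.array([r2, r2_adj, loor2])
    return out / reps

for m in (0, 10, 25, 50):   # a few points along the 1-50 noise variables
    print(m, simulate_noise_regressors(m))
```

Holding the number of noise variables at 25 and varying n from 100 to 1000 instead reproduces the sample-size experiment of Section 4.2 in the same way.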

4.2. Sample size

In this simulation, the sample size is increased from 100 to 1000 in increments of 50. Meanwhile, we keep the number of regressors at 27, including the constant term, the signal variable X, and 25 noise variables independently distributed as N(0,1). The results are presented in Figure 2.

Figure 2. The effects of sample size.

Figure 2 graphs R2, adjusted R2 and LOOR2 against the sample size, where the gray horizontal line again shows the population R2 of 0.81. Apparently, the sample size has little effect on adjusted R2, which hovers just below 0.81, as it already compensates for the changing degree of freedom. On the other hand, when the sample size is relatively small (say, n = 100), R2 is clearly above 0.81, indicating that the model is overfit in the presence of 25 noise variables. However, as the sample size increases towards 1000, the overfitting diminishes, and R2 declines towards 0.81 (while remaining above it). By contrast, when the sample size is relatively small, LOOR2 is well below 0.81, as the model’s predictive ability suffers in the presence of 25 noise variables. As the sample size increases, LOOR2 climbs towards 0.81, since a large sample size alleviates overfitting. Figure 2 also graphs EIF and adjusted EIF against the sample size; the interpretation is similar.

4.3. Number of nonlinear terms

To consider the effect of nonlinear terms, we simply add the second through eleventh powers of X to Equation (10),Footnote11

(11) Yi = β0 + β1Xi + β2Xi² + ⋯ + β11Xi¹¹ + εi,  i = 1, …, 100.

The sample size is still kept at 100. The results are presented in Figure 3, which graphs R2, adjusted R2 and LOOR2 against the number of nonlinear terms. In this simple data generating process, adding more nonlinear terms does not have much effect on either R2 or adjusted R2, although R2 does climb slightly. However, when more nonlinear terms are added, LOOR2 decreases rapidly, as these nonlinear terms drive up the model’s complexity, resulting in overfitting and a reduced ability to predict out of sample. Figure 3 also graphs EIF and adjusted EIF against the number of nonlinear terms, and the interpretation is similar.
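
The polynomial design for this simulation could be built as follows, again reusing the error_inflation sketch from Section 2 (an illustration under the same assumptions as above):

```python
def simulate_powers(max_power, n=100, reps=1000, seed=0):
    """Average R2, adjusted R2 and LOOR2 when powers of X up to max_power are included."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, 0.9], [0.9, 1.0]])
    out = np.zeros(3)
    for _ in range(reps):
        yx = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y, x = yx[:, 0], yx[:, 1]
        X = np.column_stack([x ** p for p in range(max_power + 1)])  # 1, X, X^2, ..., X^max_power
        r2, r2_adj, loor2, _, _ = error_inflation(y, X)
        out += np.array([r2, r2_adj, loor2])
    return out / reps

print(simulate_powers(11))   # second through eleventh powers added to Equation (10)
```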

Figure 3. The effects of number of nonlinear terms.


4.4. Outliers

In this simulation, we generate outliers by multiplying the largest value of X in the sample by factors of 2 through 100. As the multiplier on the largest X grows from 1 to 100, the maximum leverage increases rapidly and approaches its largest possible value of 1, as shown in Figure 4.

Figure 4. Maximum leverage and multiplier on the largest X.


Figure 5 presents the simulation results as the maximum leverage increases, graphing R2, adjusted R2 and LOOR2 against the maximum leverage. Initially, as the maximum leverage grows, LOOR2 drops much faster than R2 and adjusted R2, as the model’s true predictive ability declines while overfitting occurs in the presence of an ever more extreme outlier. However, as LOOR2 drops closer to its lower bound of 0, its rate of decline inevitably falls behind that of R2 and adjusted R2. In the end, as the multiplier on the largest X increases towards 100, the OLS fit becomes very poor, so R2, adjusted R2, and LOOR2 all decline towards their common lower bound of 0.

Figure 5. The effect of outliers.

Figure 5 also graphs EIF and adjusted EIF against the maximum leverage, which tells a similar story. Initially, both EIF and adjusted EIF increase, but they start to decline when the maximum leverage is around 0.5 (and the multiplier on the largest X is 5), resulting in an inverted U-shape.
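
The outlier construction could be sketched as follows, reusing leverage_stats from Section 3.1 and error_inflation from Section 2 (again our own illustration, not the authors’ simulation code):

```python
def simulate_outlier(multiplier, n=100, reps=1000, seed=0):
    """Average (max leverage, R2, adjusted R2, LOOR2) when the largest X is inflated."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, 0.9], [0.9, 1.0]])
    out = np.zeros(4)
    for _ in range(reps):
        yx = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y, x = yx[:, 0], yx[:, 1].copy()
        x[np.argmax(x)] *= multiplier          # turn the largest X into an outlier
        X = np.column_stack([np.ones(n), x])
        lev_max, _ = leverage_stats(X)
        r2, r2_adj, loor2, _, _ = error_inflation(y, X)
        out += np.array([lev_max, r2, r2_adj, loor2])
    return out / reps

for m in (1, 5, 20, 100):
    print(m, simulate_outlier(m))
```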

5. Conclusion

Goodness-of-fit measures R2 and adjusted R2 are routinely reported in empirical studies with the implicit presumption that they represent the percentage of the variation in the dependent variable that the regressors jointly explain or predict. This paper shows that R2 and adjusted R2 are inaccurate in this regard and often overly optimistic in the presence of overfitting resulting from a small sample size, many regressors and nonlinear terms, and the presence of outliers. As a remedy, leave-one-out R2 (LOOR2) can be readily computed and used as a reliable measure of the model’s true ability to predict out of sample.

Moreover, we introduce the concepts of the “error inflation factor” (EIF) and the “adjusted error inflation factor” (adjusted EIF) as the degree of inflation of test errors (1 − LOOR2) over the training errors represented by (1 − R2) and (1 − Rˉ2), respectively. We then conduct a meta-analysis of the determinants of EIF and adjusted EIF by replicating 279 regressions from 100 papers in four top economics journals during 2004–2021. The median increases of R2 and adjusted R2 over LOOR2 reach 40.2% and 21.4%, respectively, in this sample. The regression results show that both EIF and adjusted EIF increase with the severity of overfitting, as measured by the number of regressors and nonlinear terms, and the presence of outliers, but decrease with the sample size. These results are further validated by Monte Carlo simulations.

For empirical researchers, we recommend reporting LOOR2 alongside R2 and adjusted R2, since LOOR2 is robust to overfitting as a measure of the model’s true predictive ability out of sample. Moreover, when LOOR2 diverges from either R2 or adjusted R2, this is a sign of overfitting, and empirical researchers should be concerned and look for possible causes, such as a complicated functional form (e.g., too many nonlinear terms) or the presence of outliers (e.g., a maximum leverage close to 1). As a practical matter, while overfitting reduces bias, it usually increases variance to a greater extent, which results in an increased mean squared error of the estimator and reduced significance of the parameter of interest. Therefore, one way to increase parameter significance is to reduce overfitting.Footnote12

As model validation via out-of-sample prediction becomes increasingly common in many disciplines, it is time for economists to honestly embrace LOOR2 as a safeguard against overfitting, which is hard to detect by using conventional R2 and adjusted R2 based on in-sample fit. In this way, economists can more easily avoid the trap of overfitting, and make their empirical findings more robust. Providers of statistical software (e.g., Stata) can also help in this regard by routinely reporting LOOR2 alongside traditional R2 and adjusted R2 in the regression output.


Disclosure statement

No potential conflict of interest was reported by the authors.

Supplementary material

Supplemental data for this article can be accessed online at https://doi.org/10.1080/15140326.2023.2207326.

Additional information

Notes on contributors

Qiang Chen

Qiang Chen is a professor at the School of Economics, Shandong University.

Ji Qi

Ji Qi is a PhD student at the School of Economics, Shandong University.

Notes

1 To be sure, economics is not the only discipline in this regard. For example, Parady et al. (Citation2021) lament the overreliance on statistical goodness-of-fit and under-reliance on model validation in the transportation literature.

2 The original formula for adjusted R2 was first proposed in a paper by M. J. B. Ezekiel, who read it before the Mathematical Society at its annual meeting in 1928, but gave the credit to B. B. Smith.

3 We ignore the case of linear regression without a constant term, as it is rarely encountered in practice.

4 For example, the short-cut algorithm for computing LOOR2 could be implemented in Stata by using the user-written command “cv_regress” (Rios-Avila, Citation2018) after the usual “regress” command for OLS regression.

5 These terminologies are in the same spirit as “variance inflation factor” (VIF).

6 In fact, the presence of many covariates also increases the complexity of the regression function.

7 These four journals are selected partly because their replication data and programs are more easily accessible. See the Appendix for a complete list of these 100 papers.

8 Note that Dower et al. (Citation2021) only report R2.

9 The results of using EIF or adjusted EIF as the dependent variable are qualitatively similar, but the fit is slightly worse. To save space, we only report results using Log(EIF) and Log(Adjusted EIF) as the dependent variables.

10 Typically, the sample sizes of regressions within a paper change only because adding more variables may result in missing observations.

11 As pointed out by an anonymous referee, adding nonlinear terms can be viewed as a particular case of including additional correlated covariates.

12 We thank an anonymous referee for useful discussions about the relation between overfitting and parameter significance, and more studies are needed in this direction.

References

  • Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40–79. https://doi.org/10.1214/09-SS054
  • Cochran, W. G. (1968). Commentary on estimation of error rates in discriminant analysis. Technometrics, 10(1), 204–205. https://doi.org/10.1080/00401706.1968.10490548
  • Dower, P. C., Markevich, A., & Weber, S. (2021). The value of a statistical life in a dictatorship: Evidence from Stalin. European Economic Review, 133, 103663. https://doi.org/10.1016/j.euroecorev.2021.103663
  • Efron, B., & Morris, C. (1973). Combining possibly related estimation problems (with discussion). Journal of the Royal Statistical Society, Series B, 35, 379–402.
  • Geisser, S. (1974). A predictive approach to the random effect model. Biometrika, 61(1), 101–107. https://doi.org/10.1093/biomet/61.1.101
  • Hansen, B. E. (2022). Econometrics. Princeton University Press.
  • Hills, M. (1966). Allocation rules and their error rates. Journal of the Royal Statistical Society Series B (Methodological), 28(1), 1–31. https://doi.org/10.1111/j.2517-6161.1966.tb00614.x
  • Lachenbruch, P. A., & Mickey, M. R. (1968). Estimation of error rates in discriminant analysis. Technometrics, 10(1), 1–11. https://doi.org/10.1080/00401706.1968.10490530
  • Larson, S. C. (1931). The shrinkage of the coefficient of multiple correlation. Journal of Educational Psychology, 22(1), 45–55. https://doi.org/10.1037/h0072400
  • Mayer, T. (1975). Selecting economic hypotheses by goodness of fit. The Economic Journal, 85(340), 877–883. https://doi.org/10.2307/2230630
  • Mosteller, F., & Tukey, J. W. (1968). Data analysis, including statistics. In G. Lindzey & E. Aronson (Eds.), Handbook of social psychology (Vol. 2). Addison-Wesley.
  • Parady, G., Ory, D., & Walker, J. (2021). The overreliance on statistical goodness-of-fit and under-reliance on model validation in discrete choice models: A review of validation practices in the transportation academic literature. Journal of Choice Modelling, 38, 100257. https://doi.org/10.1016/j.jocm.2020.100257
  • Rios-Avila, F. (2018). CV_REGRESS: Stata module to estimate the leave-one-out error for linear regression models. In Statistical software components, S458469. Boston College Department of Economics. Retrieved June 11, 2020.
  • Stone, M. A. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society Series B (Methodological), 36(2), 111–147. https://doi.org/10.1111/j.2517-6161.1974.tb00994.x
  • Wherry, R. J. (1931). A new formula for predicting the shrinkage of the coefficient of multiple correlation. Annals of Mathematical Statistics, 2(4), 440–457. https://doi.org/10.1214/aoms/1177732951