Applied Econometrics

How much should we trust R2 and adjusted R2: evidence from regressions in top economics journals and Monte Carlo simulations

Qiang Chen & Ji Qi
Article: 2207326 | Received 16 Nov 2022, Accepted 21 Apr 2023, Published online: 02 May 2023

ABSTRACT

R2 and adjusted R2 may exaggerate a model’s true ability to predict the dependent variable in the presence of overfitting, whereas leave-one-out R2 (LOOR2) is robust to overfitting. We demonstrate this by replicating 279 regressions from 100 papers in top economics journals, where the median increases of R2 and adjusted R2 over LOOR2 reach 40.2% and 21.4% respectively. The inflation of test errors over training errors increases with the severity of overfitting as measured by the number of regressors and nonlinear terms, and the presence of outliers, but decreases with the sample size. These results are further validated by Monte Carlo simulations.

1. Introduction

In empirical studies, R2 and adjusted R2 (denoted as Rˉ2) are routinely reported as measures of goodness-of-fit for linear regressions. For example, an R2 of 0.8 is usually taken to imply that the explanatory variables jointly explain 80% of the variation in the dependent variable. But how reliable is this interpretation?

It is well known that R2 and Rˉ2 only measure in-sample fit, which may not be a good indicator of the model’s true ability to explain or predict out of sample. In particular, it is well understood in the machine learning literature that training errors (as represented by 1 − R2 and 1 − Rˉ2) can be poor measures of the true test errors, which arise when the model is used to predict data it has not yet seen. Nevertheless, as of today, most economists still happily use R2 and Rˉ2 to measure goodness-of-fit, without worrying about their potential pitfalls.Footnote1

This paper takes this issue seriously. The essential problem is that R2 and Rˉ2 may exaggerate a model’s true ability to explain or predict the dependent variable, especially in the presence of overfitting. Overfitting occurs when a model is excessively fit to noisy sample data, for example because of a low degree of freedom (a small sample size or too many covariates), a complicated functional form with many nonlinear terms, or the presence of outliers; this compromises the model’s ability to uncover the true relationship between the dependent and explanatory variables, as well as its performance in out-of-sample prediction.

To address this problem, we recommend leave-one-out cross-validated R2 (LOOR2 for short) as a better measure of goodness-of-fit for linear regressions. While LOOR2 has been around for some time, this paper suggests that economists routinely report LOOR2 in their empirical work alongside R2 and adjusted R2 (if not in place of the latter two). LOOR2 has a number of advantages. First, LOOR2 is robust to overfitting, as it measures the true test errors and thus the model’s real ability to explain or predict the dependent variable. Second, while five-fold or ten-fold cross-validation is popular in machine learning for measuring test errors, its results depend on the random splitting of the sample into five or ten folds (parts) of roughly equal size. In contrast, the result of leave-one-out cross-validation is deterministic, since one observation is left out at a time and no random sampling is involved. Last but not least, for linear regressions there is a short-cut formula for computing LOOR2 such that only one regression is needed, so the computational cost is minimal.

To support the above claims, we replicate 279 regressions from 100 empirical papers published in four top economics journals during 2004–2021. In this sample, the median increases of R2 and Rˉ2 over LOOR2 reach 40.2% and 21.4%, respectively, implying that both R2 and Rˉ2 often greatly exaggerate the estimated model’s true ability to explain or predict the variation in the dependent variable. Moreover, we introduce the “error inflation factor” (EIF) and the “adjusted error inflation factor” (adjusted EIF) to measure the inflation of test errors (i.e., 1 − LOOR2) over the training errors implied by R2 and adjusted R2 (i.e., 1 − R2 and 1 − Rˉ2), respectively. The regression results show that both EIF and adjusted EIF increase with the severity of overfitting as measured by the number of regressors and nonlinear terms, and the presence of outliers, but decrease with the sample size. These results are further validated by Monte Carlo simulations.

Statisticians have long recognized that R2 could be deceptively large as a measure of a model’s true predictive ability on subsequent data. In fact, this recognition motivated the development of adjusted R2 as a way to shrink R2 by a degree-of-freedom adjustment (Larson, Citation1931; Wherry, Citation1931).Footnote2 However, Mayer (Citation1975) demonstrates empirically that even Rˉ2 is a poor guide to post-sample fit, which may be caused by excessive data mining. An alternative route relies on cross-validation, including leave-one-out cross-validation (Cochran, Citation1968; Hills, Citation1966; Lachenbruch & Mickey, Citation1968; Mosteller & Tukey, Citation1968), which turns out to be a more fruitful approach. Moreover, Efron and Morris (Citation1973), Geisser (Citation1974) and Stone (Citation1974) propose using cross-validation for model selection. For a modern survey of the methodology of cross-validation, see Arlot and Celisse (Citation2010). This paper follows the tradition of cross-validation, as it measures test errors directly.

The rest of the paper is arranged as follows. Section 2 introduces leave-one-out R2 (LOOR2), the error inflation factor (EIF), and the adjusted error inflation factor (adjusted EIF). Section 3 studies the determinants of EIF and adjusted EIF via a meta-analysis replicating 279 regressions from 100 prominent economics papers. Section 4 conducts Monte Carlo simulations for further investigation. Section 5 concludes with suggestions for empirical researchers.

2. Leave-one-out R2 and error inflation factor

Consider the following linear regression model with n observations,

(1) yi = xi′β + εi,  i = 1, …, n,

where yi is the dependent variable for an individual i, and xi is a k×1 vector of explanatory variables, β is the corresponding k×1 vector of parameters, and εi is the error term. The model can be written in a matrix form,

(2) y = Xβ + ε,

where y = (y1, …, yn)′, X = (x1, …, xn)′ and ε = (ε1, …, εn)′. The well-known OLS estimator is given by βˆ = (X′X)⁻¹X′y. With βˆ estimated and the fitted values given by yˆi = xi′βˆ, we have R2 = Corr(yi, yˆi)² = 1 − Σᵢ ei² / Σᵢ (yi − yˉ)² in the presence of a constant term,Footnote3 and adjusted R2 given by Rˉ2 = 1 − [Σᵢ ei² / (n − k)] / [Σᵢ (yi − yˉ)² / (n − 1)], where yˉ is the sample mean of yi, ei is the OLS residual, and the sums run over i = 1, …, n.

To implement the leave-one-out regression omitting individual i, we simply run an OLS regression with all but the ith observation. Denoting X(i) as the data matrix X without the ith row, and y(i) as the outcome vector y without the ith element, the OLS estimator leaving out the ith observation is simply,

(3) βˆ(i) = (X(i)′X(i))⁻¹ X(i)′y(i).

With βˆ(i) estimated, we can make an out-of-sample prediction for the ith observation as yˆ(i) = xi′βˆ(i). Repeating the procedure for all observations in the sample yields {yˆ(i)} for i = 1, …, n, and the leave-one-out R2 (LOOR2) is given by

(4) LOOR2 = Corr(yi, yˆ(i))²,  with 0 ≤ LOOR2 ≤ 1,

where Corr(yi,yˆ(i)) is the correlation coefficient between yi and yˆ(i).

The procedure to compute LOOR2 appears to be cumbersome as it entails running n regressions, which may be computationally costly if the sample size n is very large. Fortunately, for linear regressions, there is a short-cut formula for running leave-one-out regression omitting the ith observation (Hansen, Citation2022, Chapter 3),

(5) βˆ(i) = βˆ − (X′X)⁻¹ xi e˜i,

where e˜i = ei / (1 − levi) is a scaled version of the OLS residual ei from the full sample, and levi = xi′(X′X)⁻¹xi is known as the leverage of the ith observation. Using Equation (5), the leave-one-out coefficient βˆ(i) can be readily computed from quantities already available from the full-sample regression. Therefore, in the case of linear regressions, only one regression is needed to compute LOOR2 after all. Thus, calculating LOOR2 in addition to R2 and adjusted R2 imposes only a minimal computational cost for linear regressions.Footnote4
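
To make the short-cut concrete, the following minimal Python sketch (assuming only numpy) computes LOOR2 from a single OLS fit; the function name loo_r2 and the toy data are our own illustration, not part of the paper’s replication code.

```python
import numpy as np

def loo_r2(y, X):
    """Leave-one-out R2 from a single OLS fit, using the leverage short-cut.

    y is an (n,) outcome vector and X an (n, k) regressor matrix including
    the constant term.  The leave-one-out prediction error for observation i
    is e_i / (1 - lev_i), so no additional regressions are needed.
    """
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y                    # full-sample OLS estimate
    e = y - X @ beta_hat                            # full-sample residuals e_i
    lev = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverages lev_i = x_i'(X'X)^{-1} x_i
    y_hat_loo = y - e / (1 - lev)                   # out-of-sample predictions yhat_(i)
    return np.corrcoef(y, y_hat_loo)[0, 1] ** 2     # LOOR2 = Corr(y_i, yhat_(i))^2

# Toy example: one signal regressor plus 20 pure noise regressors, n = 100.
rng = np.random.default_rng(0)
n = 100
x = rng.standard_normal(n)
y = 0.9 * x + np.sqrt(1 - 0.81) * rng.standard_normal(n)
X = np.column_stack([np.ones(n), x, rng.standard_normal((n, 20))])
print(loo_r2(y, X))   # typically noticeably below the in-sample R2
```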

After introducing LOOR2, a natural question arises about the relationship among R2, adjusted R2, and LOOR2. In general, R2 and adjusted R2 are larger than LOOR2, as it is usually more difficult to make out-of-sample predictions than in-sample predictions. For example, as the simulations in Section 4.1 show, when noise variables are added to the regression, R2 keeps rising while adjusted R2 remains stable, but LOOR2 declines steadily.

To see this from a different perspective, (1 − R2) and (1 − adjusted R2) are generally smaller than (1 − LOOR2), as training errors are usually smaller than test errors. To measure the “inflation” of test errors over training errors, we define an error inflation factor (EIF) and an adjusted error inflation factor (adjusted EIF),Footnote5

(6) EIF = (1 − LOOR2) / (1 − R2),
(7) Adjusted EIF = (1 − LOOR2) / (1 − Rˉ2),

where Rˉ2 is adjusted R2.
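
Building on the sketch above (again an illustration rather than the paper’s code), EIF and adjusted EIF follow directly from R2, adjusted R2, and LOOR2:

```python
def error_inflation(y, X):
    """Return (R2, adjusted R2, LOOR2, EIF, adjusted EIF) for one OLS regression."""
    n, k = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta_hat
    rss, tss = np.sum(e ** 2), np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    r2_adj = 1 - (rss / (n - k)) / (tss / (n - 1))
    loor2 = loo_r2(y, X)                  # from the sketch above
    eif = (1 - loor2) / (1 - r2)          # Equation (6)
    eif_adj = (1 - loor2) / (1 - r2_adj)  # Equation (7)
    return r2, r2_adj, loor2, eif, eif_adj
```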

We conjecture that both EIF and adjusted EIF increase with the severity of overfitting. Intuitively, when there is severe overfitting, training errors underestimate test errors to a great extent, resulting in large values of EIF and adjusted EIF. In the empirical study in Section 3, we consider three potential factors contributing to overfitting: the degree of freedom (the sample size in excess of the number of regressors), the number of nonlinear terms (such as squared and interaction terms), and the presence of outliers. First, if the degree of freedom is small (e.g., a small sample size, many regressors, or both), then the linear regression is essentially fitted to noise in the sample data, resulting in overfitting. Second, the presence of many nonlinear terms increases the complexity of the regression function,Footnote6 and thus its ability to fit noisy data, which may also result in overfitting. Third, because OLS estimation minimizes the residual sum of squares, it is easily influenced by outliers, which again leads to overfitting.

In summary, based on the fact that overfitting reduces in-sample training errors at the expense of increasing out-of-sample test errors, we hypothesize that overfitting would result in elevated EIF and adjusted EIF. The next section investigates these relationships empirically.

3. A meta-analysis

3.1. Data source and variable definitions

In this section, we empirically compare R2, adjusted R2, and LOOR2, and investigate the determinants of their gaps as represented by EIF and adjusted EIF. We focus on linear models estimated by OLS in the recent literature. As a meta-analysis, our sample is compiled by replicating linear regressions from 100 empirical papers selected from American Economic Review (23 papers), Economic Journal (35 papers), European Economic Review (18 papers) and Review of Economic Studies (24 papers) during 2004–2021.Footnote7 Since each paper usually contains multiple OLS regressions, the 100 papers yield 279 regression results, giving a sample size of 279.

For each of these 279 regressions, we calculate R2, adjusted R2, and LOOR2, as well as the error inflation factor (EIF, denoted as eif) and the adjusted error inflation factor (adjusted EIF, denoted as eif_a). The explanatory variables include the sample size (n), the number of regressors including the constant term (k), the number of nonlinear terms (nonlinear) in each regression, and the maximum value of leverage (lev_max) as well as its variance (lev_var).

An explanation of these two measures of outliers is in order. As mentioned in Section 2, the leverage of the ith observation is given by levi = xi′(X′X)⁻¹xi, which measures the influence of the ith observation on βˆ. Specifically, Equation (5) implies that

(8) βˆ − βˆ(i) = (X′X)⁻¹ xi ei / (1 − levi).

It can be shown that 0 ≤ levi ≤ 1, with a sample average of k/n (Hansen, Citation2022, Chapter 3). Therefore, a large levi implies a large discrepancy between βˆ and βˆ(i) according to Equation (8). The variable lev_max is simply the maximum leverage in each regression, which captures the greatest influence of a single observation in that regression. In the same spirit, one could consider the second largest leverage, the third largest leverage, and so on, but this approach quickly becomes tedious. Instead, we use the variance of the leverages (lev_var) as a parsimonious summary. The rationale is that, since the leverages sum to the number of regressors (i.e., Σᵢ levi = k), when some leverages are very large (close to their largest possible value of 1), the remaining leverages are squeezed towards their smallest possible value of 0, which increases the variance of the leverages.
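
As an illustration, the two outlier measures could be computed as follows (our own helper, reusing numpy from the sketch in Section 2):

```python
def leverage_stats(X):
    """Maximum and variance of the leverages lev_i = x_i'(X'X)^{-1} x_i."""
    lev = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    assert np.isclose(lev.sum(), X.shape[1])   # leverages sum to k
    return lev.max(), lev.var()                # lev_max and lev_var
```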

Summary statistics of the variables used in this study are presented in Table 1. While we focus on EIF (eif) and adjusted EIF (eif_a) in the regression analysis, it is instructive to first look at the ratios (R2/LOOR2) and (adjusted R2/LOOR2), as reported in the first two rows of Table 1. The median of (R2/LOOR2) is 1.402, implying that the median increase of R2 over LOOR2 reaches 40.2% in the sample. Similarly, the median of (adjusted R2/LOOR2) is 1.214, implying that the median increase of adjusted R2 over LOOR2 is 21.4%. These figures show that R2 and adjusted R2 often greatly exaggerate the estimated model’s true ability to explain or predict the dependent variable, as measured by LOOR2.

Table 1. Summary statistics.

The minimum values of (R2/LOOR2) and (adjusted R2/LOOR2) are both above 1, as expected. However, the maximum values of (R2/LOOR2) and (adjusted R2/LOOR2) reach alarming levels of 48,065.68 and 37,483.06, respectively. Therefore, it is instructive to take a closer look at these extreme values, which come from the fifth of five regressions in Dower et al. (Citation2021), as shown in Table 2.

Table 2. Five regressions in Dower et al. (Citation2021).

In an effort to estimate the value of a statistical life under Stalin’s dictatorship, Dower et al. (Citation2021) ran cross-sectional OLS regressions with 58 regions of the former Soviet Union as the units of observation. The dependent variable is the number of citizens repressed during the German and Polish operations of the Great Terror of 1937–1938, per 1,000 people. As is typical in empirical papers, Dower et al. (Citation2021) report results from five regressions. As more regressors and nonlinear terms are added from regressions (1) through (5), R2 increases steadily from 0.244 to 0.584, while adjusted R2 increases from 0.202 to 0.456, indicating a significant boost to the goodness-of-fit at face value. However, while LOOR2 improves in regression (3), it drops to alarmingly low values of 0.003 and 0.000012 in regressions (4) and (5).Footnote8 Consequently, (R2/LOOR2) and (adjusted R2/LOOR2) reach outrageous levels of 48,065.68 and 37,483.06, respectively. Apparently, regressions (1) and (2) are underfit, whereas regressions (4) and (5) are severely overfit. Moreover, the maximum leverages are close to 1 in all regressions, indicating the presence of outliers.

3.2. Correlation analysis

As a preliminary exploration of the determinants of EIF and adjusted EIF, Table 3 presents a correlation matrix for the major variables in the study. EIF (eif) is negatively correlated with the sample size (n) at the 5% level, while positively correlated with the number of regressors (k), the number of nonlinear terms (nonlinear), the maximum leverage (lev_max) and the variance of leverage (lev_var) at the 1% level. The correlation pattern between the adjusted EIF (eif_a) and these determinants is qualitatively similar. The only exception is that adjusted EIF (eif_a) is not significantly correlated with the number of regressors (k), perhaps due to the degree-of-freedom adjustment already made in adjusted R2.

Table 3. Correlation matrix for major variables in the study.

3.3. Regression analysis

For the determinants of Log(EIF), we start from the following baseline regressionFootnote9

(9) ln(eifi) = β0 + β1 ln(ni) + β2 ln(ki) + β3 nonlineari + β4 lev_maxi + β5 lev_vari + εi.

In addition, we also interact lnn and lnk with lev_max and lev_var in Equation (9) to capture possible moderating effects of the sample size and the number of regressors on the two measures of outliers. Our dataset consists of 279 observations (regressions) from 100 papers, where each paper contributes 2.79 regressions on average. Our data are thus clustered at the paper level, and observations (regressions) from the same paper are likely correlated. Therefore, we use robust standard errors clustered at the paper level throughout. In addition, we may also control for “paper fixed effects” by giving observations from the same paper a paper-specific intercept. However, since the sample size (n) varies little within a paper,Footnote10 adding the paper fixed effects may reduce our ability to detect the effects of sample size (n). Therefore, we report regression results both with and without the paper fixed effects.
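
One way such a specification with paper-clustered standard errors could be estimated is sketched below in Python (assuming pandas and statsmodels); the data frame here is a synthetic placeholder standing in for the 279 replicated regressions, and the column names ln_eif, ln_n, ln_k, nonlinear, lev_max, lev_var and paper_id are ours, not the authors’ replication code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic placeholder data; the real meta-data set is described in Section 3.1.
rng = np.random.default_rng(1)
meta = pd.DataFrame({
    "paper_id": rng.integers(0, 100, size=279),
    "ln_eif": rng.normal(size=279),
    "ln_n": rng.normal(7.0, 1.0, size=279),
    "ln_k": rng.normal(2.5, 0.5, size=279),
    "nonlinear": rng.integers(0, 5, size=279),
    "lev_max": rng.uniform(0, 1, size=279),
    "lev_var": rng.uniform(0, 0.01, size=279),
})

# Baseline specification (Equation (9)) with standard errors clustered at the paper level.
baseline = smf.ols(
    "ln_eif ~ ln_n + ln_k + nonlinear + lev_max + lev_var",
    data=meta,
).fit(cov_type="cluster", cov_kwds={"groups": meta["paper_id"]})
print(baseline.summary())
```

Paper fixed effects could be added by including C(paper_id) in the formula, keeping in mind that ln_n varies little within a paper.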

Table 4 reports results from OLS regressions with Log(EIF) as the dependent variable. Column (1) of Table 4 reports the results from the baseline regression (9) without the paper fixed effects. The coefficient of lnn is negatively significant at the 1% level, indicating that a large sample size decreases overfitting, thus reducing the EIF. On the other hand, the coefficient of lnk is positively significant at the 1% level, implying that more regressors increase the chance of overfitting, which contributes to an increased EIF. The coefficient of lev_var (variance of leverage) is positively significant at the 1% level, as outliers may result in overfitting, whereas the coefficients of lev_max and nonlinear are insignificant.

Table 4. Determinants of log(EIF).

Column (2) of Table 4 interacts lnn and lnk with lev_max and lev_var. The coefficient of lnn*lev_var is negatively significant at the 1% level, implying that the effect of lev_var on EIF may be mitigated by increasing the sample size. On the other hand, the coefficient of lnk*lev_var is positively significant at the 1% level, indicating that the effect of lev_var on EIF may be magnified by increasing the number of regressors. Interestingly, the coefficient of lev_max is now positively significant at the 1% level, whereas the coefficient of lev_var loses significance. Note that these two measures of outliers are somewhat collinear, since lev_max and lev_var are positively correlated at the 1% level with a correlation coefficient of 0.685 (see Table 3).

Column (3) of Table 4 adds the paper fixed effects to the baseline regression (9). The results are qualitatively similar to column (1), but with notable differences. In particular, the coefficient of lnn loses significance, perhaps due to too little variation in the sample size (n) within the same paper. However, the coefficient of nonlinear (number of nonlinear terms) is now positively significant at the 1% level, as more nonlinear terms increase the model complexity, thus contributing to overfitting.

Column (4) of Table 4 interacts lnn and lnk with lev_max and lev_var while keeping the paper fixed effects. The results in column (4) are mostly similar to those in column (3). However, the coefficient of lev_var surprisingly becomes negatively significant at the 5% level with an estimate of -26.05. Nevertheless, the coefficient of lnk*lev_var is positively significant at the 5% level with an estimate of 14.75. Since the sample mean of lnk is 2.503, the marginal effect of lev_var evaluated at the sample mean of lnk is (-26.05 + 2.503×14.75) = 10.87, which is similar in both magnitude and significance to the estimated coefficient of lev_var in columns (1) and (3) without interaction terms. This shows that lev_var increases overfitting more in high-dimensional data with a large number of covariates. Moreover, the coefficient of lnn*lev_max is negatively significant at the 1% level, implying that the effect of lev_max on overfitting could be mitigated by a large sample size.

Table 5 reports regression results for the dependent variable Log(Adjusted EIF). The results in Table 5 largely parallel those in Table 4, and the interpretations are also similar. In summary, these empirical results show that both Log(EIF) and Log(Adjusted EIF) increase with the severity of overfitting as measured by the number of regressors (lnk), the number of nonlinear terms (nonlinear), the maximum value of leverage (lev_max) and its variance (lev_var), but decrease with the sample size (lnn). Moreover, the effects of outliers (lev_max and lev_var) on overfitting could be moderated by the sample size and the number of regressors (lnn and lnk).

Table 5. Determinants of log(adjusted EIF).

4. Monte Carlo simulations

In this section, we conduct Monte Carlo simulations to study the behavior of R2, adjusted R2, LOOR2, EIF, and adjusted EIF as factors related to overfitting change. Overall, the results from simulations are consistent with our findings in the empirical study above.

In the baseline setting, we draw 100 random observations of (Y, X) from a bivariate normal distribution with zero means, unit variances, and a correlation coefficient of 0.9 between Y and X, so that the population R2 is 0.81. The baseline regression is simply,

(10) Yi = β0 + β1Xi + εi,  i = 1, …, 100.

Throughout, we repeat each simulation 1,000 times and compute the average values of R2, adjusted R2, LOOR2, EIF, and adjusted EIF. We then investigate their behavior as factors related to overfitting change, including the number of regressors, the sample size, the number of nonlinear terms, and the presence of outliers.

4.1. Number of regressors

In this simulation, we increase the number of regressors by incrementally adding 1 to 50 noise variables to the baseline regression (10), where all noise variables are independently distributed as N(0,1). The sample size is kept at 100. The results are presented in Figure 1.

Figure 1. The effects of number of regressors.

Figure 1 graphs R2, adjusted R2 and LOOR2 against the number of regressors, where the gray horizontal line shows the population R2 of 0.81. As the number of regressors increases from 2 to 51, R2 increases steadily to above 0.9, clearly overestimating the ability of the model to explain the variation in Y as a result of overfitting. On the other hand, adjusted R2 hovers between 0.8 and 0.81, showing the value of the degree-of-freedom adjustment. Interestingly, LOOR2 actually declines steadily to below 0.65, indicating that adding noise variables hurts the model’s ability to predict out of sample. Clearly, both R2 and adjusted R2 exaggerate the model’s true predictive ability, and the extent of exaggeration increases with the number of noise variables added. In contrast, LOOR2 is robust to overfitting (at least as far as the model’s real predictive ability is concerned), as overfitting resulting from adding noise variables reduces LOOR2. Figure 1 also graphs EIF and adjusted EIF against the number of regressors; the interpretation is essentially the same.
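
This experiment could be reproduced along the following lines, reusing the loo_r2 and error_inflation sketches from Section 2 (our own illustration; the parameter choices follow the text):

```python
def simulate_noise_regressors(n_noise, n=100, reps=1000, seed=0):
    """Average R2, adjusted R2 and LOOR2 when n_noise pure noise regressors are added."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, 0.9], [0.9, 1.0]])   # Corr(Y, X) = 0.9, population R2 = 0.81
    out = np.zeros(3)
    for _ in range(reps):
        yx = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y, x = yx[:, 0], yx[:, 1]
        X = np.column_stack([np.ones(n), x, rng.standard_normal((n, n_noise))])
        r2, r2_adj, loor2, _, _ = error_inflation(y, X)
        out += np.array([r2, r2_adj, loor2])
    return out / reps

for m in (0, 10, 25, 50):   # a few points along the 1-50 noise variables
    print(m, simulate_noise_regressors(m))
```

Holding the number of noise variables at 25 and varying n from 100 to 1000 instead reproduces the sample-size experiment of Section 4.2 in the same way.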

4.2. Sample size

In this simulation, the sample size is increased from 100 to 1000 in increments of 50. Meanwhile, we keep the number of regressors at 27, including the constant term, the signal variable X, and 25 noise variables independently distributed as N(0,1). The results are presented in Figure 2.

Figure 2. The effects of sample size.

Figure 2 graphs R2, adjusted R2 and LOOR2 against the sample size, where the gray horizontal line again shows the population R2 of 0.81. Apparently, the sample size has little effect on adjusted R2, which hovers just below 0.81, as it already compensates for the changing degree of freedom. On the other hand, when the sample size is relatively small (say, n = 100), R2 is clearly above 0.81, indicating that the model is overfit in the presence of 25 noise variables. However, as the sample size increases towards 1000, the overfitting diminishes, and R2 declines towards 0.81 (while remaining above it). By contrast, when the sample size is relatively small, LOOR2 is well below 0.81, as the model’s predictive ability suffers in the presence of 25 noise variables. As the sample size increases, LOOR2 climbs towards 0.81, since a large sample size alleviates overfitting. Figure 2 also graphs EIF and adjusted EIF against the sample size; the interpretation is similar.

4.3. Number of nonlinear terms

To consider the effect of nonlinear terms, we simply add the second through eleventh powers of X to Equation (10),Footnote11

(11) Yi = β0 + β1Xi + β2Xi² + ⋯ + β11Xi¹¹ + εi,  i = 1, …, 100.

The sample size is still kept at 100. The results are presented in Figure 3, which graphs R2, adjusted R2 and LOOR2 against the number of nonlinear terms. In this simple data generating process, adding more nonlinear terms does not have much effect on either R2 or adjusted R2, although R2 does climb slightly. However, when more nonlinear terms are added, LOOR2 decreases rapidly, as these nonlinear terms drive up the model’s complexity, resulting in overfitting and a reduced ability to predict out of sample. Figure 3 also graphs EIF and adjusted EIF against the number of nonlinear terms, and the interpretation is similar.
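
The polynomial design for this simulation could be built as follows, again reusing the error_inflation sketch from Section 2 (an illustration under the same assumptions as above):

```python
def simulate_powers(max_power, n=100, reps=1000, seed=0):
    """Average R2, adjusted R2 and LOOR2 when powers of X up to max_power are included."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, 0.9], [0.9, 1.0]])
    out = np.zeros(3)
    for _ in range(reps):
        yx = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y, x = yx[:, 0], yx[:, 1]
        X = np.column_stack([x ** p for p in range(max_power + 1)])  # 1, X, X^2, ..., X^max_power
        r2, r2_adj, loor2, _, _ = error_inflation(y, X)
        out += np.array([r2, r2_adj, loor2])
    return out / reps

print(simulate_powers(11))   # second through eleventh powers added to Equation (10)
```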

Figure 3. The effects of number of nonlinear terms.


4.4. Outliers

In this simulation, we generate outliers by multiplying the largest value of X in the sample by factors of 2 through 100. As the multiplier on the largest X grows from 1 to 100, the maximum leverage increases rapidly and approaches its largest possible value of 1, as shown in Figure 4.

Figure 4. Maximum leverage and multiplier on the largest X.


Figure 5 presents the simulation results as the maximum leverage increases, graphing R2, adjusted R2 and LOOR2 against the maximum leverage. Initially, as the maximum leverage grows, LOOR2 drops much faster than R2 and adjusted R2, as the model’s true predictive ability declines while overfitting occurs in the presence of an ever more extreme outlier. However, as LOOR2 drops closer to its lower bound of 0, its rate of decline inevitably falls behind that of R2 and adjusted R2. In the end, as the multiplier on the largest X increases towards 100, the OLS fit becomes very poor, so R2, adjusted R2, and LOOR2 all decline towards their common lower bound of 0.

Figure 5. The effect of outliers.

Figure 5 also graphs EIF and adjusted EIF against the maximum leverage, which tells a similar story. Initially, both EIF and adjusted EIF increase, but they start to decline when the maximum leverage is around 0.5 (and the multiplier on the largest X is 5), resulting in an inverted U-shape.
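
The outlier construction could be sketched as follows, reusing leverage_stats from Section 3.1 and error_inflation from Section 2 (again our own illustration, not the authors’ simulation code):

```python
def simulate_outlier(multiplier, n=100, reps=1000, seed=0):
    """Average (max leverage, R2, adjusted R2, LOOR2) when the largest X is inflated."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, 0.9], [0.9, 1.0]])
    out = np.zeros(4)
    for _ in range(reps):
        yx = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y, x = yx[:, 0], yx[:, 1].copy()
        x[np.argmax(x)] *= multiplier          # turn the largest X into an outlier
        X = np.column_stack([np.ones(n), x])
        lev_max, _ = leverage_stats(X)
        r2, r2_adj, loor2, _, _ = error_inflation(y, X)
        out += np.array([lev_max, r2, r2_adj, loor2])
    return out / reps

for m in (1, 5, 20, 100):
    print(m, simulate_outlier(m))
```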

5. Conclusion

Goodness-of-fit measures R2 and adjusted R2 are routinely reported in empirical studies with the implicit presumption that they represent the percentage of the variation in the dependent variable that the regressors jointly explain or predict. This paper shows that R2 and adjusted R2 are inaccurate in this regard and often overly optimistic in the presence of overfitting resulting from a small sample size, many regressors and nonlinear terms, and the presence of outliers. As a remedy, leave-one-out R2 (LOOR2) can be readily computed and used as a reliable measure of the model’s true ability to predict out of sample.

Moreover, we introduce the concepts of the “error inflation factor” (EIF) and the “adjusted error inflation factor” (adjusted EIF) as the degree of inflation of test errors (1 − LOOR2) over the training errors represented by (1 − R2) and (1 − Rˉ2), respectively. We then conduct a meta-analysis of the determinants of EIF and adjusted EIF by replicating 279 regressions from 100 papers in four top economics journals during 2004–2021. The median increases of R2 and adjusted R2 over LOOR2 reach 40.2% and 21.4%, respectively, in this sample. The regression results show that both EIF and adjusted EIF increase with the severity of overfitting, as measured by the number of regressors and nonlinear terms, and the presence of outliers, but decrease with the sample size. These results are further validated by Monte Carlo simulations.

For empirical researchers, we recommend reporting LOOR2 alongside R2 and adjusted R2, since LOOR2 is robust to overfitting as a measure of the model’s true predictive ability out of sample. Moreover, when LOOR2 diverges from either R2 or adjusted R2, this is a sign of overfitting, and empirical researchers should be concerned and look for possible causes, such as a complicated functional form (e.g., too many nonlinear terms) or the presence of outliers (e.g., a maximum leverage close to 1). As a practical matter, while overfitting reduces bias, it usually increases variance to a greater extent, which results in an increased mean squared error of the estimator and reduced significance of the parameter of interest. Therefore, one way to increase parameter significance is to reduce overfitting.Footnote12

As model validation via out-of-sample prediction becomes increasingly common in many disciplines, it is time for economists to honestly embrace LOOR2 as a safeguard against overfitting, which is hard to detect by using conventional R2 and adjusted R2 based on in-sample fit. In this way, economists can more easily avoid the trap of overfitting, and make their empirical findings more robust. Providers of statistical software (e.g., Stata) can also help in this regard by routinely reporting LOOR2 alongside traditional R2 and adjusted R2 in the regression output.


Disclosure statement

No potential conflict of interest was reported by the authors.

Supplementary material

Supplemental data for this article can be accessed online at https://doi.org/10.1080/15140326.2023.2207326.

Additional information

Notes on contributors

Qiang Chen

Qiang Chen is a professor at the School of Economics, Shandong University.

Ji Qi

Ji Qi is a PhD student at the School of Economics, Shandong University.

Notes

1 To be sure, economics is not the only discipline in this regard. For example, Parady et al. (Citation2021) lament the overreliance on statistical goodness-of-fit and under-reliance on model validation in the transportation literature.

2 The original formula for adjusted R2 was first proposed in a paper by M. J. B. Ezekiel, who read it before the Mathematical Society at its annual meeting in 1928, but gave the credit to B. B. Smith.

3 We ignore the case of linear regression without a constant term, as it is rarely encountered in practice.

4 For example, the short-cut algorithm for computing LOOR2 could be implemented in Stata by using the user-written command “cv_regress” (Rios-Avila, Citation2018) after the usual “regress” command for OLS regression.

5 These terminologies are in the same spirit as “variance inflation factor” (VIF).

6 In fact, the presence of many covariates also increases the complexity of the regression function.

7 These four journals are selected partly because their replication data and programs are more easily accessible. See the Appendix for a complete list of these 100 papers.

8 Note that Dower et al. (Citation2021) only report R2.

9 The results of using EIF or adjusted EIF as the dependent variable are qualitatively similar, but the fit is slightly worse. To save space, we only report results using Log(EIF) and Log(Adjusted EIF) as the dependent variables.

10 Typically, the sample sizes of regressions within a paper change only because adding more variables may result in missing observations.

11 As pointed out by an anonymous referee, adding nonlinear terms can be viewed as a particular case of including additional correlated covariates.

12 We thank an anonymous referee for useful discussions about the relation between overfitting and parameter significance, and more studies are needed in this direction.

References

  • Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40–79. https://doi.org/10.1214/09-SS054
  • Cochran, W. G. (1968). Commentary on estimation of error rates in discriminant analysis. Technometrics, 10(1), 204–205. https://doi.org/10.1080/00401706.1968.10490548
  • Dower, P. C., Markevich, A., & Weber, S. (2021). The value of a statistical life in a dictatorship: Evidence from Stalin. European Economic Review, 133, 103663. https://doi.org/10.1016/j.euroecorev.2021.103663
  • Efron, B., & Morris, C. (1973). Combining possibly related estimation problems (with discussion). Journal of the Royal Statistical Society, Series B, 35, 379–402.
  • Geisser, S. (1974). A predictive approach to the random effect model. Biometrika, 61(1), 101–107. https://doi.org/10.1093/biomet/61.1.101
  • Hansen, B. E. (2022). Econometrics. Princeton University Press.
  • Hills, M. (1966). Allocation rules and their error rates. Journal of the Royal Statistical Society Series B (Methodological), 28(1), 1–31. https://doi.org/10.1111/j.2517-6161.1966.tb00614.x
  • Lachenbruch, P. A., & Mickey, M. R. (1968). Estimation of error rates in discriminant analysis. Technometrics, 10(1), 1–11. https://doi.org/10.1080/00401706.1968.10490530
  • Larson, S. C. (1931). The shrinkage of the coefficient of multiple correlation. Journal of Educational Psychology, 22(1), 45–55. https://doi.org/10.1037/h0072400
  • Mayer, T. (1975). Selecting economic hypotheses by goodness of fit. The Economic Journal, 85(340), 877–883. https://doi.org/10.2307/2230630
  • Mosteller, F., & Tukey, J. W. (1968). Data analysis, including statistics. In G. Lindzey & E. Aronson (Eds.), Handbook of social psychology (Vol. 2). Addison-Wesley.
  • Parady, G., Ory, D., & Walker, J. (2021). The overreliance on statistical goodness-of-fit and under-reliance on model validation in discrete choice models: A review of validation practices in the transportation academic literature. Journal of Choice Modelling, 38, 100257. https://doi.org/10.1016/j.jocm.2020.100257
  • Rios-Avila, F. (2018). CV_REGRESS: Stata module to estimate the leave-one-out error for linear regression models. In Statistical software components, S458469. Boston College Department of Economics. Retrieved June 11, 2020.
  • Stone, M. A. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society Series B (Methodological), 36(2), 111–147. https://doi.org/10.1111/j.2517-6161.1974.tb00994.x
  • Wherry, R. J. (1931). A new formula for predicting the shrinkage of the coefficient of multiple correlation. Annals of Mathematical Statistics, 2(4), 440–457. https://doi.org/10.1214/aoms/1177732951