Research Article

A Plot is Worth a Thousand Tests: Assessing Residual Diagnostics with the Lineup Protocol

Received 04 May 2023, Accepted 25 Mar 2024, Published online: 22 May 2024

Abstract

Regression experts consistently recommend plotting residuals for model diagnosis, despite the availability of many numerical hypothesis test procedures designed to use residuals to assess problems with a model fit. Here we provide evidence for why this is good advice using data from a visual inference experiment. We show how conventional tests are too sensitive, which means that too often the conclusion would be that the model fit is inadequate. The experiment uses the lineup protocol which puts a residual plot in the context of null plots. This helps generate reliable and consistent reading of residual plots for better model diagnosis. It can also help in an obverse situation where a conventional test would fail to detect a problem with a model due to contaminated data. The lineup protocol also detects a range of departures from good residuals simultaneously. Supplemental materials for the article are available online.

1 Introduction

“Since all models are wrong the scientist must be alert to what is importantly wrong.” (Box Citation1976)

Diagnosing a model is an important part of building an appropriate model. In linear regression analysis, studying the residuals from a model fit is a common diagnostic activity. Residuals summarize what is not captured by the model, and thus provide the capacity to identify what might be wrong.

We can assess residuals in multiple ways. To examine the univariate distribution, residuals may be plotted as a histogram or normal probability plot. Using the classical normal linear regression model as an example, if the distribution is symmetric and unimodal, we would consider it to be well-behaved. However, if the distribution is skewed, bimodal, multimodal, or contains outliers, there would be cause for concern. We can also inspect the distribution by conducting a goodness-of-fit test, such as the Shapiro-Wilk normality test (Shapiro and Wilk Citation1965).

Scatterplots of residuals against the fitted values, and against each of the explanatory variables, are commonly used to scrutinize their relationships. If there are any visually discoverable associations, the model is potentially inadequate or incorrectly specified. We can also potentially discover patterns not directly connected to a linear model assumption from these residual plots, such as the discreteness or skewness of the fitted values, and outliers. To read residual plots, one looks for noticeable departures from the model, such as a nonlinear pattern or heteroscedasticity. A nonlinear pattern would suggest that the model needs to have some additional nonlinear terms. Heteroscedasticity suggests that the error is dependent on the predictors, and hence violates the independence assumption. Statistical tests were developed to provide objective assessment, for example, of nonlinear patterns (e.g., Ramsey Citation1969), and heteroscedasticity (e.g., Breusch and Pagan Citation1979).
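As a minimal sketch of the plots described above, the following R code draws residuals against the fitted values and against a predictor; the built-in cars data and this particular fit are placeholders for illustration only, not the article's simulated data.

```r
library(ggplot2)

fit <- lm(dist ~ speed, data = cars)                     # placeholder model and data
diag_df <- data.frame(fitted = fitted(fit),
                      resid  = residuals(fit),
                      speed  = cars$speed)

# Residuals against fitted values, with a reference line at zero
ggplot(diag_df, aes(fitted, resid)) +
  geom_point() +
  geom_hline(yintercept = 0, colour = "red")

# Residuals against a predictor
ggplot(diag_df, aes(speed, resid)) +
  geom_point() +
  geom_hline(yintercept = 0, colour = "red")
```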

The common wisdom of experts is that plotting the residuals is indispensable for diagnosing model fits (Cook and Weisberg Citation1982; Montgomery, Peck, and Vining Citation1982; Draper and Smith Citation1998). The lack of empirical evidence for the ubiquitous advice is curious, and is what this article tackles.

Additionally, relying solely on the subjective assessment of a single plot can be problematic. People will almost always see a pattern (see Kahneman Citation2011), so the question that really needs answering is whether any pattern perceived is consistent with randomness, or sampling variability, or noise. Correctly judging whether no pattern exists in a residual plot is a difficult task. Loy (Citation2021) emphasizes that this is especially difficult to teach to new analysts and students, and advocates for the broader use of the lineup protocol (Buja et al. Citation2009a).

The lineup protocol places a data plot in a field of null plots, allowing for a comparison of patterns due purely to chance with what is perceived in the data plot. For residual analysis this is especially helpful for gauging whether there is no pattern. (Figure 1 shows an example of a lineup of residual plots.) In its strict use, one would insist that the data plot is not seen before seeing the lineup, so that the observer does not know which is the true plot. When used this way, it provides an objective test for data plots. Majumder, Hofmann, and Cook (Citation2013a) validated that results from lineups assessed by human observers performed similarly to conventional tests. One would not use a lineup when a conventional test exists and is adequate, because it is more expensive to conduct manually. However, where no adequate conventional test exists, it is invaluable, as shown by Loy and Hofmann (Citation2013). Here we use the lineup as a vehicle to rigorously explore why experts advise that residual plots are indispensable despite the prevalence of numerical tests.

Fig. 1 Visual testing is conducted using a lineup, as in the example here. The residual plot computed from the observed data is embedded among 19 null plots, where the residuals are simulated from a standard error model. Computing the p-value requires that the lineup be examined by a number of human judges, each asked to select the most different plot. A small p-value would result from a substantial number selecting the data plot (at position 6, exhibiting non-linearity).


The article is structured as follows. Section 2 describes the background on the types of departures that one expects to detect, and outlines a formal statistical process for reading residual plots, called visual inference. Section 3 describes the calculation of the statistical significance and power of the test. Section 4 details the experimental design to compare the decisions made by formal hypothesis testing, and how humans would read diagnostic plots. The results are reported in Section 5. We conclude with a discussion of the presented work, and ideas for future directions.

2 Background

2.1 Departures from Good Residual Plots

Graphical summaries where residuals are plotted against fitted values, or other functions of the predictors (expected to be approximately orthogonal to the residuals), are considered to be the most important residual plots by Cook and Weisberg (1999). Figure 2(A) shows an example of an ideal residual plot where points are symmetrically distributed around the horizontal zero line (red), with no discernible patterns. There can be various types of departures from this ideal pattern. Non-linearity, heteroscedasticity and non-normality, shown in Figure 2(B), (C), and (D), respectively, are three commonly checked departures.

Fig. 2 Example residual versus fitted value plots (horizontal line indicates 0): (A) classically good looking residuals, (B) nonlinear pattern indicates that the model has not captured a nonlinear association, (C) heteroscedasticity indicating that variance around the fitted model is not uniform, and (D) non-normality where the residual distribution is not symmetric around 0. The latter pattern might best be assessed using a univariate plot of the residuals, but patterns B and C need to be assessed using a residual versus fitted value plot.


Model misspecification occurs if the functions of predictors needed to accurately describe the relationship with the response are incorrectly specified. This includes instances where a higher-order polynomial term of a predictor is wrongly omitted. Any nonlinear pattern visible in the residual plot could be indicative of this problem. An example residual plot containing a visual pattern of non-linearity is shown in Figure 2(B). One can clearly observe the "S-shape" in the residual plot, which corresponds to the cubic term that should have been included in the model.

Heteroscedasticity refers to the presence of nonconstant error variance in a regression model. It indicates that the distribution of the residuals depends on the predictors, violating the independence assumption. This can be seen in a residual plot as an inconsistent spread of the residuals relative to the fitted values or predictors. An example is the "butterfly" shape shown in Figure 2(C), or a "left-triangle" or "right-triangle" shape where the smallest variance occurs at one side of the horizontal axis.

Figure 2(D) shows a scatterplot where the residuals have a skewed distribution, as seen by the uneven vertical spread. Unlike non-linearity and heteroscedasticity, non-normality is usually detected with a different type of residual plot: a histogram or a normal probability plot. Because we focus on scatterplots, non-normality is not one of the departures examined in depth in this article. (Loy, Follett, and Hofmann (Citation2016) discuss related work on non-normality checking.)

2.2 Conventionally Testing for Departures

Many different hypothesis tests are available to detect specific model defects. For example, the presence of heteroscedasticity can usually be tested by applying the White test (White Citation1980) or the Breusch-Pagan (BP) test (Breusch and Pagan Citation1979), which are both derived from the Lagrange multiplier test principle (Silvey Citation1959) that relies on the asymptotic properties of the null distribution. To test specific forms of non-linearity, one may apply the F-test as a model structural test to examine the significance of specific polynomial or nonlinear forms of the predictors, or the significance of proxy variables as in the Ramsey Regression Equation Specification Error Test (RESET) (Ramsey Citation1969). The Shapiro-Wilk (SW) normality test (Shapiro and Wilk Citation1965) is the most widely used test of non-normality and is included in many statistical software programs. The Jarque-Bera test (Jarque and Bera Citation1980) is also used to directly check whether the sample skewness and kurtosis match a normal distribution.

Table 1 displays the p-values from the RESET, BP, and SW tests applied to the residual plots in Figure 2. The RESET test and BP test were computed using the resettest and bptest functions from the R package lmtest, respectively. The SW test was computed using the shapiro.test function from the core R package stats (see Note 1). The RESET test requires the selection of a power parameter. Ramsey (Citation1969) recommends a power of four, which we adopted in our analysis.
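The three conventional tests named above can be applied to a fitted linear model as in the sketch below; the model and data here are placeholders, not the article's simulation models.

```r
library(lmtest)

fit <- lm(dist ~ speed, data = cars)   # placeholder fit for illustration

resettest(fit, power = 2:4)            # RESET test, powers of fitted values up to four
bptest(fit)                            # Breusch-Pagan test for heteroscedasticity
shapiro.test(residuals(fit))           # Shapiro-Wilk test on the residuals
```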

Table 1 Statistical significance testing for departures from good residuals for plots in Figure 2.

For the residual plots in Figure 2, we would expect the RESET test for non-linearity to reject residual plot B, the BP test for heteroscedasticity to reject residual plot C, and the SW test for non-normality to reject residual plot D, which they all do; all tests also correctly fail to reject residual plot A. Interestingly, the BP and SW tests also reject residual plots exhibiting structure that they were not designed for. Cook and Weisberg (Citation1982) explain that most residual-based tests for a particular departure from the model assumptions are also sensitive to other types of departures. This could be considered a Type III error (Kimball Citation1957), where the null hypothesis of good residuals is correctly rejected but for the wrong reason. Also, some types of departure can have elements of other types of departure; for example, non-linearity can appear like heteroscedasticity. Additionally, other data problems such as outliers can trigger rejection (or not) of the null hypothesis (Cook and Weisberg 1999).

With large sample sizes, hypothesis tests may reject the null hypothesis when there is only a small effect. (A good discussion can be found in Kirk (Citation1996).) While such rejections may be statistically correct, their sensitivity may render the results impractical. A key goal of residual plot diagnostics is to identify potential issues that could lead to incorrect conclusions or errors in subsequent analyses, but minor defects in the model are unlikely to have a significant impact and may be best disregarded for practical purposes. The experiment discussed in this article specifically addresses this tension between statistical significance and practical significance.

2.3 Visual Test Procedure based on Lineups

The examination of data plots to infer signals or patterns (or lack thereof) is fraught with variation in the human ability to interpret and decode the information embedded in a graph (Cleveland and McGill Citation1984). In practice, over-interpretation of a single plot is common. For instance, Roy Chowdhury et al. (Citation2015) described a published example where authors over-interpreted separation between gene groups from a two-dimensional projection of a linear discriminant analysis even when there were no differences in the expression levels between the gene groups.

One solution to over-interpretation is to examine the plot in the context of the natural sampling variability assumed by the model, called the lineup protocol, as proposed in Buja et al. (2009b). Majumder, Hofmann, and Cook (2013b) showed that the lineup protocol is analogous to the null hypothesis significance testing framework. The protocol consists of m randomly placed plots, where one plot is the data plot, and the remaining m − 1 plots, referred to as the null plots, are constructed using the same graphical procedure as the data plot but with the data replaced by null data generated in a manner consistent with the null hypothesis, H0. Then, an observer who has not seen the data plot is asked to point out the most different plot from the lineup. Under H0, it is expected that the data plot will have no distinguishable difference from the null plots, and the probability that the observer correctly picks the data plot is 1/m. If one rejects H0 when the observer correctly picks the data plot, then the Type I error of this test is 1/m. This protocol requires a priori specification of H0 (or at least a null data generating mechanism), much like the requirement of knowing the sampling distribution of the test statistic in the null hypothesis significance testing framework.

Figure 1 is an example of a lineup. If the data plot at position 6 is identifiable, then it is evidence for the rejection of H0. In fact, the actual residual plot is obtained from a misspecified regression model with missing nonlinear terms.

Data used in the m − 1 null plots needs to be simulated. In regression diagnostics, sampling data consistent with H0 is equivalent to sampling data from the assumed model. As Buja et al. (2009b) suggested, H0 is usually a composite hypothesis controlled by nuisance parameters. Since regression models can have various forms, there is no general solution to this problem, but it can sometimes be reduced to a so-called "reference distribution" by applying one of three methods: (i) sampling from a conditional distribution given a minimal sufficient statistic under H0, (ii) parametric bootstrap sampling with nuisance parameters estimated under H0, and (iii) Bayesian posterior predictive sampling. The conditional distribution given a minimal sufficient statistic is the best justified reference distribution among the three (Buja et al. 2009b). Under this method, the null residuals can essentially be simulated by independently drawing from a standard normal distribution, regressing the draws on the predictors, and then rescaling the resulting residuals by the ratio of the residual sums of squares from the two regressions.
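The sketch below is our reading of that procedure; the function name rotate_residuals is hypothetical, and the authors' own implementation lives in the nullabor and visage packages.

```r
# Simulate one set of null residuals: draw iid N(0, 1) values, regress them on
# the same predictors, and rescale so the residual sum of squares matches that
# of the original fit.
rotate_residuals <- function(fit) {
  X <- model.matrix(fit)
  e_new <- residuals(lm(rnorm(nrow(X)) ~ X - 1))        # regress N(0,1) draws on the predictors
  e_new * sqrt(sum(residuals(fit)^2) / sum(e_new^2))    # rescale by the RSS ratio
}

null_resid <- rotate_residuals(lm(dist ~ speed, data = cars))  # placeholder fit
```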

The effectiveness of the lineup protocol for regression analysis has been validated by Majumder, Hofmann, and Cook (2013b) under relatively simple settings with up to two predictors. Their results suggest that visual tests are capable of testing the significance of a single predictor with a similar power to a t-test, though they note that in general it is unnecessary to use visual inference if there exists a corresponding conventional test, and they do not expect the visual test to perform equally well as the conventional test. In their third experiment, where contamination of the data violates the assumptions of the conventional test, the visual test outperforms the conventional test by a large margin. This supports the use of visual inference in situations where there are no existing numerical testing procedures. Visual inference has also been integrated into diagnostics for hierarchical linear models, where the lineup protocol is used to judge the assumptions of linearity, normality and constant error variance for both the level-1 and level-2 residuals (Loy and Hofmann Citation2013, 2014, 2015).

3 Calculation of Statistical Significance and Test Power

3.1 What is Being Tested?

In diagnosing a model fit using the residuals, we are generally interested in testing whether "the regression model is correctly specified" (H0) against the broad alternative "the regression model is misspecified" (Ha). However, it is practically impossible to test this broad H0 with conventional tests, because they need the specific structure causing the departure to be quantifiable in order to be computable. For example, the RESET test for detecting nonlinear departures is formulated by fitting $$y = \tau_0 + \sum_{i=1}^{p}\tau_i x_i + \gamma_1\hat{y}^2 + \gamma_2\hat{y}^3 + \gamma_3\hat{y}^4 + u, \quad u \sim N(0, \sigma_u^2),$$ in order to test $H_0: \gamma_1 = \gamma_2 = \gamma_3 = 0$ against $H_a: \gamma_1 \neq 0$ or $\gamma_2 \neq 0$ or $\gamma_3 \neq 0$. Similarly, the BP test is designed to specifically test $H_0$: the error variances are all equal ($\zeta_i = 0$ for $i = 1, \dots, p$) against the alternative $H_a$: the error variances are a multiplicative function of one or more variables (at least one $\zeta_i \neq 0$), based on the auxiliary regression $$e^2 = \zeta_0 + \sum_{i=1}^{p}\zeta_i x_i + u, \quad u \sim N(0, \sigma_u^2).$$

While a battery of conventional tests for different types of departures could be applied, this breadth of coverage is intrinsic to the lineup protocol. The lineup protocol operates as an omnibus test, able to detect a range of departures from good residuals in a single application.

3.2 Statistical Significance

In hypothesis testing, a p-value is defined as the probability of observing test results at least as extreme as the observed result assuming H0 is true. Conventional hypothesis tests usually have an existing method to derive or compute the p-value based on the null distribution. The method to estimate a p-value for a visual test essentially follows the process detailed by VanderPlas et al. (Citation2021). Details are given in Appendix A.

3.3 Power of the Tests

The power of a model misspecification test is the probability that H0 is rejected given the regression model is misspecified in a specific way. It is an important indicator when one is concerned about whether model assumptions have been violated. In practice, one might be more interested in knowing how much the residuals deviate from the model assumptions, and whether this deviation is of practical significance.

The power of a conventional hypothesis test is affected by both the true parameters θ and the sample size n. These two can be quantified in terms of an effect size E that measures the strength of the residual departures from the model assumptions. Details about the calculation of effect size are provided in Section 4.2 after the introduction of the simulation model used in our experiment. The theoretical power of a test is often not available in closed form, but it can be estimated if the data generating process is known. We use a predefined model to generate a large set of simulated data under different effect sizes, and record whether the conventional test rejects H0. The probability that the conventional test rejects H0 is then fitted by a logistic regression formulated as $$\Pr(\text{reject } H_0 \mid H_1, E) = \Lambda\left(\log\left(\frac{0.05}{0.95}\right) + \beta_1 E\right), \quad (1)$$ where $\Lambda(z) = \exp(z)/(1 + \exp(z))$ is the standard logistic function. The effect size E is the only predictor and the intercept is fixed at $\log(0.05/0.95)$ so that $\widehat{\Pr}(\text{reject } H_0 \mid H_1, E = 0) = 0.05$, the desired significance level.
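A fixed intercept can be handled in R by moving it into an offset, as in the sketch below; the data frame sim_results and its columns reject (0/1 test decisions) and E (effect sizes) are hypothetical names for the simulated records described above.

```r
# Model (1): logistic regression with the intercept fixed at log(0.05/0.95)
power_fit <- glm(reject ~ E - 1 + offset(rep(log(0.05 / 0.95), length(E))),
                 family = binomial(), data = sim_results)

# Estimated power of the test at a given effect size
predict(power_fit, newdata = data.frame(E = 3), type = "response")
```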

The power of a visual test, on the other hand, may additionally depend on the ability of the particular participant, as the skill of each individual may affect the number of observers who identify the data plot from the lineup (Majumder, Hofmann, and Cook 2013b). To address this issue, Majumder, Hofmann, and Cook (2013b) model the probability of participant j correctly picking the data plot from lineup l using a mixed-effects logistic regression, with participants treated as random effects. Then, the estimated power of a visual test evaluated by a single participant is the predicted value obtained from the mixed-effects model. However, this mixed-effects model does not work in the scenario where participants are asked to select one or more most different plots. In this scenario, knowing the probability of a participant j correctly picking the data plot from a lineup l is insufficient to determine the power of a visual test, because it does not provide information about the number of selections made by the participant for the calculation of the p-value. Therefore, we directly estimate the probability of a lineup being rejected by assuming that individual skill has a negligible effect on the variation of the power. This assumption essentially averages out the subject ability and helps to simplify the model structure, thereby obviating a costly large-scale experiment to estimate complex covariance matrices. The same model given in (1) is applied to model the power of a visual test.

To study various factors contributing to the power of both tests, the same logistic regression model is fit on different subsets of the collated data grouped by levels of factors. These include the distribution of the fitted values, type of the simulation model and the shape of the residual departures.

4 Experimental Design

Our experiment was conducted over three data collection periods to investigate the difference between conventional hypothesis testing and visual inference in the application of linear regression diagnostics. Two types of departures, non-linearity and heteroscedasticity, were collected during data collection periods I and II. Data collection period III was designed primarily to measure human responses to null lineups so that the visual p-values can be estimated. Additional lineups for both non-linearity and heteroscedasticity, using uniform fitted value distributions, were included to provide additional data and to avoid participant frustration from too many difficult tasks.

During the experiment, every participant recruited from the Prolific crowd-sourcing platform (Palan and Schitter Citation2018) was presented with a block of 20 lineups. A lineup consisted of a randomly placed data plot and 19 null plots, which were all residual plots drawn with raw residuals on the y-axis and fitted values on the x-axis. An additional horizontal red line was added at y=0 as a visual reference. The data in the data plot was simulated from one of two models described in Section 4.1, while the data of the remaining 19 null plots were generated by the residual rotation technique discussed in Section 2.3.

In each lineup evaluation, the participant was asked to select one or more plots that are most different from the others, provide a reason for their selections, and evaluate how different they think the selected plots are from the others. If there was no noticeable difference between plots in a lineup, participants had the option to select zero plots without the need to provide a reason. During the process of recording the responses, a zero selection was considered to be equivalent to selecting all 20 plots. No participant was shown the same lineup twice. Information about preferred pronouns, age group, education, and previous experience in visual experiments was also collected. A participant's submission was only included in the analysis if the data plot was identified in at least one attention check.

Overall, we collected 7974 evaluations on 1152 unique lineups performed by 443 participants. A summary of the factors used in the experiment can be found in Table 2. There were four levels of the nonlinear structure, and three levels of heteroscedastic structure. The signal strength was controlled by the error standard deviation (σ) for the nonlinear pattern, and by a ratio parameter (b) for the heteroscedasticity. Additionally, three levels of sample size (n) and four different fitted value distributions were incorporated.

Table 2 Levels of the factors used in data collection periods I, II, and III.

4.1 Simulating Departures from Good Residuals

4.1.1 Non-linearity and Heteroscedasticity

Data collection period I was designed to study the ability of participants to detect non-linearity departures from residual plots. The non-linearity departure was constructed by omitting a jth order Hermite polynomial (Hermite Citation1864; originally by Laplace Citation1820) term of the predictor from the simple linear regression equation. Four different values of j = 2, 3, 6, 18 were chosen so that distinct shapes of non-linearity were included in the residual plots. These include the "U", "S", "M", and "triple-U" shapes shown in Figure 3. A greater value of j results in a curve with more turning points. It is expected that the "U" shape will be the easiest to detect, and as the shape gets more complex it will be harder to perceive in a scatterplot, particularly when there is noise. Figure 4 shows the "U" shape for different amounts of noise (σ).
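A hedged sketch of this construction for j = 3 (the "S" shape) is given below; the intercept, slope and scaling are illustrative choices, not the experiment's exact specification, which is in Appendix B and the visage package (the mpoly package's hermite() can generate He_j for general j).

```r
n <- 300; sigma <- 0.5                         # small sigma so the shape is clearly visible
he3 <- function(x) x^3 - 3 * x                 # probabilists' Hermite polynomial He_3
x <- runif(n, -1, 1)
y <- 1 + x + he3(x) + rnorm(n, sd = sigma)     # data generating process includes He_3(x)
fit <- lm(y ~ x)                               # fitted model omits the nonlinear term
plot(fitted(fit), residuals(fit))              # residuals display the leftover "S" shape
```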

Fig. 3 Polynomial forms generated for the residual plots used to assess detecting non-linearity. The four shapes are generated by varying the order of polynomial given by j in Hej(.).


Fig. 4 Examining the effect of σ on the signal strength in the non-linearity detection, for n = 300, uniform fitted value distribution and the “U” shape. As σ increases the signal strength decreases, to the point that the “U” is almost unrecognizable when σ = 4.


Data collection period II was similar to period I but focused on heteroscedasticity departures. We generated the heteroscedasticity departures by setting the variance-covariance matrix of the error term to be a function of the predictor, but fitted the data with the simple linear regression model, intentionally violating the constant variance assumption. Visual patterns of heteroscedasticity are simulated using three different shapes (a = −1, 0, 1), including the "left-triangle", "butterfly", and "right-triangle" shapes displayed in Figure 5. Figure 6 shows the butterfly shape as the ratio parameter (b) is changed. More details about the simulation process are provided in Appendix B.
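The sketch below illustrates the general idea of such a simulation; the particular variance function used here is an assumption for illustration only, and the authors' exact specification of the shape (a) and ratio (b) parameters is given in Appendix B and the visage package.

```r
n <- 100; a <- 0; b <- 64                          # a = 0 gives a "butterfly"-like spread
x <- runif(n, -1, 1)
v <- 1 + (b - 1) * (x - a)^2 / max((x - a)^2)      # ratio of largest to smallest variance is b
y <- 1 + x + rnorm(n, sd = sqrt(v))                # error variance depends on the predictor
fit <- lm(y ~ x)                                   # fitted model assumes constant variance
plot(fitted(fit), residuals(fit))                  # spread of residuals varies with fitted values
```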


Fig. 5 Heteroscedasticity forms used in the experiment. Three different shapes (a=−1,0,1) are used in the experiment to create “left-triangle”, “butterfly” and “right-triangle” shapes, respectively.

Fig. 6 Five different values of b are used in heteroscedasticity simulation to control the strength of the signal. Larger values of b yield a bigger difference in variation, and thus stronger heteroscedasticity signal.


4.1.2 Factors Common to both Data Collection Periods

Fitted values are a function of the independent variables (or predictors), and the distribution of the observed values affects the distribution of the fitted values. Ideally, we would want the fitted values to have uniform coverage across the range of observed values, or to have a uniform distribution across all of the predictors. This is not always present in the collected data. Sometimes the fitted values are discrete because one or more predictors were measured discretely. It is also common to see a skewed distribution of fitted values if one or more of the predictors has a skewed distribution. This latter problem is usually corrected before modeling, using a variable transformation. Our simulation assesses this by using four different distributions to represent fitted values, constructed by different sampling of the predictor: U(−1, 1) (uniform), N(0, 0.3²) (normal), lognormal(0, 0.6²)/3 (skewed) and U{−1, 1} (discrete).
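In R, draws from these four predictor distributions might look like the following sketch; the lognormal is interpreted as meanlog = 0, sdlog = 0.6, and the discrete case is assumed here to be a uniform draw over {−1, 1} as reconstructed from the text.

```r
n <- 300
x_uniform  <- runif(n, -1, 1)                         # U(-1, 1)
x_normal   <- rnorm(n, mean = 0, sd = 0.3)            # N(0, 0.3^2)
x_skewed   <- rlnorm(n, meanlog = 0, sdlog = 0.6) / 3 # lognormal(0, 0.6^2) / 3
x_discrete <- sample(c(-1, 1), n, replace = TRUE)     # discrete uniform (assumed support)
```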

Figure 7 shows the nonlinear pattern, a "U" shape, with the different fitted value distributions. We would expect the structure in residual plots to be easier to perceive when the fitted values are uniformly distributed.

Fig. 7 Variations in fitted values, that might affect perception of residual plots. Four different distributions are used.


Three different sample sizes were used in our experiment: n = 50, 100, 300. Figure 8 shows the nonlinear "S" shape for different sample sizes. We expect signal strength to decline in the simulated data plots with smaller n. We chose 300 as the upper limit, because it is typically enough for structure to be reliably visible in a scatterplot. Beyond 300, the scatterplot should probably be drawn with transparency or replaced with a density or binned plot, as scatterplots suffer from over-plotting.

Fig. 8 Examining the effect of signal strength for the three different values of n used in the experiment, for nonlinear structure with fixed σ=1.5, uniform fitted value distribution, and “S” shape. For these factor levels, only when n = 300 is the “S” shape clearly visible.


4.2 Effect Size

The lineups are allocated to participants in a manner that uniformly covers the combinations of experimental factors in Table 2. In addition, we use effect size to measure the signal strength, which helps in assigning a set of lineups with a range of difficulties to each participant.

Effect size in statistics measures the strength of the signal relative to the noise. It is surprisingly difficult to quantify, even for simulated data as used in this experiment.

For the non-linearity model, the key items defining effect size are the sample size (n) and the noise level (σ²), and so effect size would be roughly calculated as n/σ. Increasing sample size tends to boost the effect size, while heightened noise diminishes it. However, it is not clear how the additional parameter for the model polynomial order, j, should be incorporated. Intuitively, a larger j means a more complex pattern, which likely means the effect size would decrease. Similarly, in the heteroscedasticity model, effect size relies on the sample size (n) and the ratio of the largest to smallest variance, b. Larger values of both would produce a higher effect size, but the role of the additional shape parameter, a, in this context is unclear.

For the purposes of our calculations we have chosen to use an approach based on Kullback-Leibler divergence (Kullback and Leibler Citation1951). This formulation defines the effect size to be $$E = \frac{1}{2}\left(\log\frac{|\text{diag}(RVR^\top)|}{|\text{diag}(R\sigma^2)|} - n + \text{tr}\left(\text{diag}(RVR^\top)^{-1}\text{diag}(R\sigma^2)\right) + \mu_z^\top \text{diag}(RVR^\top)^{-1}\mu_z\right),$$ where diag(.) is the diagonal matrix constructed from the diagonal elements of a matrix, X is the design matrix, V is the actual covariance matrix of the error term, $R = I_n - X(X^\top X)^{-1}X^\top$ is the residual operator, $\mu_z = RZ\beta_z$ is the vector of expected residuals where Z contains any higher-order terms of X left out of the regression equation and $\beta_z$ contains the corresponding coefficients, and $\sigma^2 I_n$ is the assumed covariance matrix of the error term when H0 is true. More details about the effect size derivation are provided in Appendix A.
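A sketch of this effect size as reconstructed above is given below; V, Z and beta_z come from the (known) simulation design, so this is only usable for simulated data, and the exact form should be checked against Appendix A.

```r
effect_size <- function(X, V, Z, beta_z, sigma2) {
  n  <- nrow(X)
  R  <- diag(n) - X %*% solve(crossprod(X), t(X))   # residual operator I - X(X'X)^{-1}X'
  D1 <- diag(diag(R %*% V %*% t(R)))                # diag(R V R')
  D0 <- diag(diag(sigma2 * R))                      # diag(R sigma^2)
  mu_z <- R %*% Z %*% beta_z                        # expected residuals under misspecification
  0.5 * (sum(log(diag(D1))) - sum(log(diag(D0))) - n +
           sum(diag(solve(D1) %*% D0)) +
           drop(t(mu_z) %*% solve(D1) %*% mu_z))
}
```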

5 Results

Data collection used a total of 1152 lineups, and resulted in a total of 7974 evaluations from 443 participants. Roughly half corresponded to each of the two models, non-linearity and heteroscedasticity, and the three collection periods had similar numbers of evaluations. Each participant received two of the 24 attention check lineups, which were used to filter results of participants who were clearly not making an honest effort (only 11 of 454). To estimate α for calculating statistical significance (see Section A.1 in the Appendix) there were 720 evaluations of 36 null lineups. Neither the attention checks nor the null lineups were used in the subsequent analysis. The de-identified data, vi_survey, is made available in the R package visage.

The data was collected on lineups constructed from four different fitted value distributions that stem from the corresponding predictor distribution: uniform, normal, skewed and discrete. Henceforth, we refer to these four fitted value distributions by their predictor distribution. More data was collected on the uniform distribution (each lineup evaluated by 11 participants) than the others (each evaluated by 5 participants). The analysis in Sections 5.1–5.4 uses only results from lineups with the uniform distribution, for a total of 3069 lineup evaluations. This allows us to compare the conventional and visual test performance in an optimal scenario. Section 5.5 examines how the results may be affected if the fitted value distribution is different.

5.1 Power Comparison of the Tests

Figure 9 presents the power curves of the various tests plotted against the effect size in the residuals, for non-linearity and heteroscedasticity. In each case the power of the visual test is calculated for multiple bootstrap samples, leading to the many (solid orange) curves. Power was computed at a 5% significance level, and effect size is plotted on a natural logarithmic scale. To facilitate visual calibration of effect size values with the corresponding diagnostic plots, a sequence of example residual plots with increasing effect sizes is provided at the bottom of these figures. These plots serve as a visual aid to help readers understand how different effect size values translate to changes in the diagnostic plots. The horizontal lines of dots at 0 and 1 represent the non-rejection and rejection decisions made by visual tests for each lineup.

Fig. 9 Comparison of power between different tests for (A) nonlinear and (B) heteroscedasticity patterns (uniform fitted values only). Main plot shows the power curves, with dots indicating non-reject and reject in visual testing of lineups. The multiple lines for the visual test arise from estimating the power on many bootstrap samples. The row of scatterplots at the bottom are examples of residual plots corresponding to the specific effect sizes marked by vertical lines in the main plot.


Figure 9(A) compares the power of the different tests for nonlinear structure in the residuals. The test with uniformly higher power is the RESET test, the one that specifically tests for non-linearity. Note that the BP and SW tests have much lower power, which is expected because they are not designed to detect non-linearity. The bootstrapped power curves for the visual test are effectively a right shift of the RESET test's curve. This means that the RESET test will reject at a lower effect size (less structure) than the visual test, but otherwise the performance will be similar. In other words, the RESET test is more sensitive than the visual test. This is not necessarily a good feature for the purposes of diagnosing model defects: if we scan the residual plot examples at the bottom, we might argue that the non-linearity is not sufficiently problematic until an effect size of around 3 or 3.5. The RESET test would reject closer to an effect size of 2, but the visual test would reject closer to 3.25, for a significance level of 0.05. The visual test matches the robustness of the model to (minor) violations of assumptions much better.

For the heteroscedasticity pattern (Figure 9(B)), the power of the BP test, designed for detecting heteroscedasticity, is uniformly higher than that of the other tests. The visual test power curve is again shifted to the right. This tells a similar story to the power curves for the non-linearity pattern: the conventional test is more sensitive than the visual test. From the example residual plots at the bottom we might argue that the heteroscedasticity becomes noticeably visible around an effect size of 3 or 3.5. However, the BP test would reject at around effect size 2.5. Interestingly, the power curve for the SW test (for non-normality) is only slightly different to that of the visual test, suggesting that it performs reasonably well for detecting heteroscedasticity, too. The power curve for the RESET test suggests it is not useful for detecting heteroscedasticity, as expected.

Overall, the results show that the conventional tests are more sensitive than the visual test. The conventional tests do have higher power for the patterns they are designed to detect, but they typically fail to detect other patterns unless those patterns are particularly strong. The visual test does not require specifying the pattern ahead of time, relying purely on whether the observed residual plot is detectably different from "good" residual plots. It will perform equally well regardless of the type of model defect. This aligns with the advice of experts on residual analysis, who consider residual plot analysis to be an indispensable tool for diagnosing model problems. What we gain from using a visual test for this purpose is the removal of any subjective arguments about whether a pattern is visible or not. The lineup protocol provides the calibration for detecting patterns: if the pattern in the data plot cannot be distinguished from the patterns in good residual plots, then no discernible problem with the model exists.

5.2 Comparison of Test Decisions Based on p-values

The power comparison demonstrates that the appropriate conventional tests will reject more aggressively than visual tests, but we do not know how the decisions for each lineup would agree or disagree. Here we compare the reject or fail-to-reject decisions of these tests, across all the lineups. Figure 10 shows the agreement of the conventional and visual tests using a mosaic plot for both the non-linearity and heteroscedasticity patterns. For both patterns the lineups resulting in a rejection by the visual test are all also rejected by the conventional test, except for one from the heteroscedasticity model. This reflects exactly the story from the previous section, that the conventional tests reject more aggressively than the visual test.


Fig. 10 Rejection rate (p-value ≤0.05) of visual test conditional on the conventional test decision on non-linearity (left) and heteroscedasticity (right) lineups (uniform fitted values only) displayed using a mosaic plot. The visual test rejects less frequently than the conventional test, and (almost) only rejects when the conventional test does. Surprisingly, one lineup in the heteroscedasticity group is rejected by the visual test but NOT the conventional test.

For non-linearity lineups, conventional tests and visual tests reject 69% and 32% of the time, respectively. Of the lineups rejected by the conventional test, 46% are rejected by the visual test, that is, approximately half as many as the conventional test. There are no lineups that are rejected by the visual test but not by the conventional test.

In heteroscedasticity lineups, 76% are rejected by conventional tests, while 56% are rejected by visual tests. Of the lineups rejected by the conventional test, the visual test rejects more than two-thirds of them, too.

Surprisingly, the visual test rejects 1 of the 33 (3%) lineups where the conventional test does not reject. Figure 11 shows this lineup. The data plot in position 17 displays a relatively strong heteroscedasticity pattern, and has a strong effect size (log_e(E) = 4.02), which is reflected by the visual test p-value of 0.026. But the BP test p-value of 0.056 is slightly above the significance cutoff of 0.05. This lineup was evaluated by 11 participants; it has experimental factors a = 0 ("butterfly" shape), b = 64 (large variance ratio), n = 50 (small sample size), and a uniform distribution for the predictor. It may have been the small sample size and the presence of a few outliers that resulted in the lack of detection by the conventional test.

Fig. 11 The single heteroscedasticity lineup that is rejected by the visual test but not by the BP test. The data plot (position 17) contains a “butterfly” shape. It visibly displays heteroscedasticity, making it somewhat surprising that it is not detected by the BP test.


Because the power curves of the visual tests are a shift to the right of those of the conventional tests (Figure 9), we examined whether adjusting the significance level (to 0.001, 0.0001, 0.00001, …) of the conventional test would generate similar decisions to those of the visual test. Interestingly, it does not: despite resulting in fewer rejections, neither the RESET nor the BP test comes to complete agreement with the visual test (see Appendix A).

5.3 Effect of Amount of Non-linearity

The order of the polynomial is a primary factor contributing to the pattern produced by the non-linearity model. Figure 12 explores the relationship between polynomial order and the power of the tests. The conventional tests have higher power for lower orders of Hermite polynomials, and the power drops substantially for the "triple-U" shape. To understand why, we return to the application of the RESET test, which requires a parameter specifying the highest power of the fitted values to test for, with a generic recommendation of four (Ramsey Citation1969). However, the "triple-U" shape is constructed from Hermite polynomials using powers up to 18. If the RESET test had been applied using a power of no less than six, the power curve for the "triple-U" shape would be closer to the other power curves. This illustrates the sensitivity of the conventional test to the parameter choice, and highlights a limitation: it helps to know the data generating process to set the parameters for the test, which is unrealistic in practice. However, we examined this in more detail (see Appendix A) and found that setting the parameter higher than four does no harm to the test's operation for lower order polynomial shapes. Using a parameter value of six, instead of four, yields higher power regardless of the generating process, and is recommended.
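With lmtest, and assuming the "power parameter" discussed above corresponds to resettest's power argument (the powers of the fitted values included in the auxiliary regression), the two settings compared here would be called as in this sketch; the fit is a placeholder.

```r
library(lmtest)
fit <- lm(dist ~ speed, data = cars)   # placeholder fit

resettest(fit, power = 2:4)            # generic recommendation: powers up to four
resettest(fit, power = 2:6)            # powers up to six, as suggested above
```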

Fig. 12 The effect of the order of the polynomial on the power of conventional and visual tests. Deeper color indicates higher order. The default RESET test under-performs significantly in detecting the "triple-U" shape. To achieve a similar power as for other shapes, a higher order polynomial parameter needs to be used for the RESET test, but this is higher than the recommended value.


For visual tests, we expect the "U" shape to be detected most readily, followed by the "S", "M", and "triple-U" shapes. From Figure 12, it can be observed that the power curves mostly align with these expectations, except for the "M" shape, which is as easily detected as the "S" shape. This suggests a benefit of the visual test: knowing the shape ahead of time is not needed for its application.

5.4 Effect of Shape of Heteroscedasticity

Figure 13 examines the impact of the shape of the heteroscedasticity on the power of both tests. The butterfly shape has higher power for both types of tests. The "left-triangle" and "right-triangle" shapes are functionally identical, and this is observed for the conventional test, where the power curves are identical. Interestingly, there is a difference for the visual test: the power curve of the "left-triangle" shape is slightly higher than that of the "right-triangle" shape. This indicates a bias in perceiving heteroscedasticity depending on the direction, and may be worth investigating further.

Fig. 13 The effect of heteroscedasticity shape (parameter a) on the power of conventional and visual tests. The butterfly has higher power in both tests. Curiously, the visual test has a slightly higher power for the “left-triangle” than the “right-triangle” shape, when it would be expected that they should be similar, which is observed in conventional testing.


5.5 Effect of Fitted Value Distributions

In regression analysis, predictions are conditional on the observed values of the predictors, that is, the conditional mean of the dependent variable Y given the value of the independent variable X, E(Y|X). This is an often forgotten element of regression analysis, but it is important. Where X is observed, the distribution of the X values in the sample, or consequently of Ŷ, may affect the ability to read any patterns in the residual plots. The effect of the fitted value distribution on test performance is assessed using four different distributions of fitted values stemming from the predictor distributions: uniform, normal, discrete and lognormal (skewed). We expect that if all predictors have a uniform distribution, it is easier to read the relationship with the residuals.

Figure 14 examines the impact of the fitted value distribution on the power of conventional (left) and visual (right) tests for both the non-linearity (top) and heteroscedasticity (bottom) patterns. For conventional tests, only the power curves of the appropriate tests are shown: RESET tests for non-linearity and BP tests for heteroscedasticity. For visual tests, more evaluations were collected on lineups with a uniform fitted value distribution, so to have a fair comparison, we randomly sample five of the 11 total evaluations to estimate the power curves, producing the multiple curves for the uniform condition and providing an indication of the variability in the power estimates.

Fig. 14 Comparison of power on lineups with different fitted value distributions for conventional and visual tests (columns) for non-linearity and heteroscedasticity patterns (rows). The power curves of conventional tests for non-linearity and heteroscedasticity patterns are produced by RESET tests and BP tests, respectively. Power curves of visual tests are estimated using five evaluations on each lineup. For lineups with a uniform fitted value distribution, the five evaluations are repeatedly sampled from the total 11 evaluations to give multiple power curves (solid grey). Surprisingly, the fitted value distribution produces more variability in the power of conventional tests than visual tests. Uneven distributions (normal and skewed) tend to yield lower power.


Perhaps surprisingly, the visual tests have more consistent power across the different fitted value distributions: for the nonlinear pattern, there is almost no power difference, and for the heteroscedastic pattern, uniform and discrete have higher power than normal and skewed. The likely reason is that these latter two have fewer observations in the tails where the heteroscedastic pattern needs to be detected.

The variation in power of the conventional tests is, at first sight, shocking. However, it is discussed, albeit rarely, in the testing literature. See, for example, Jamshidian, Jennrich, and Liu (Citation2007), Olvera Astivia, Gadermann, and Guhn (Citation2019), and Zhang and Yuan (Citation2018), which show derivations and use simulation to assess the effect of the observed distribution of the predictors on test power. The big differences in the power curves seen in Figure 14 are echoed in the results reported in these articles.

6 Limitations and Practicality

One of the primary limitations of the lineup protocol lies in its reliance on human judgments. In this context, the effectiveness of a single lineup evaluation can be dependent on the perceptual ability and visual skills of the individual. However, when results from multiple individuals are combined the outcome is encouragingly high-quality and robust. For simple plots and strong patterns just a few individuals are needed to arrive at a clear answer, but more individuals will be needed when the plot design is complex, or the signal strength is weak.

Using a lineup protocol removes subjectivity in interpreting patterns in plots. A plot is compared with draws from a null model, in much the same way as a test statistic is compared to its sampling distribution. It is important to remove plot elements that might introduce bias, such as axis labels, text and legends, or to make them generic.

The lineup protocol can be used cheaply and informally with the R package nullabor. There is evidence that it is being used fairly broadly, based on software download rates and citations of the original papers. For residual plot analysis we recommend that the lineup be the default first plot, so that the data plot is only seen in the context of null plots. When a rigorous test is needed, we recommend using a crowd-sourcing service, as done in the gene expression experiment described in Yin et al. (Citation2013). While it takes extra effort, it is not difficult today, and the costs are tiny compared to the overall costs of conducting a scientific experiment. We also expect that at some point a computer vision model can be developed to take over the task of employing people to evaluate residual plots.
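A minimal, informal lineup with nullabor might look like the sketch below; the model and data are placeholders, and the column names .resid and .sample are those we expect nullabor's null_lm()/lineup() to produce (check the package documentation if they differ across versions). With a single predictor, plotting residuals against that predictor is visually equivalent to plotting against the fitted values.

```r
library(nullabor)
library(ggplot2)

fit <- lm(dist ~ speed, data = cars)                              # placeholder model and data
true_df <- data.frame(cars, .resid = residuals(fit), .fitted = fitted(fit))

ggplot(lineup(null_lm(dist ~ speed, method = "rotate"), true = true_df),
       aes(speed, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, colour = "red") +
  facet_wrap(~ .sample)                                           # 19 null panels plus the data plot
```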

For this study, simulated data was used to provide a precisely controlled environment within which to compare results from conventional testing to those from visual testing. We also explored only the most commonly used residual plot, the residuals versus fitted values plot. However, we expect the behavior of the conventional test and the visual test to be similar when observed residuals are diagnosed with this type of plot or other residual plots. The conventional tests will be more sensitive to small departures from the null. They will also fail to detect departures when residuals have some contamination, like outliers or anomalies, as is often encountered when working with data. The lineup approach is well-suited to generally interpreting data plots, and also to detecting unexpected patterns not related to the model. This is supported by earlier research (e.g., Wickham et al. Citation2010; Roy Chowdhury et al. Citation2015; Loy and Hofmann 2015; VanderPlas and Hofmann Citation2016; Loy, Follett, and Hofmann Citation2016).

7 Conclusions

This article has described experimental evidence providing support for the advice of regression analysis experts that residual plots are indispensable methods for assessing model fit, using the formal framework of the lineup protocol. We conducted a perceptual experiment on scatterplots of residuals versus fitted values, with two primary departures from good residuals: non-linearity and heteroscedasticity. We found that conventional residual-based statistical tests are more sensitive to weak departures from model assumptions than visual tests. That is, a conventional test concludes there are problems with the model fit almost twice as often as a human. Conventional tests often reject the null hypothesis when departures in the form of non-linearity and heteroscedasticity are not visibly different from null residual plots.

While it might be argued that the conventional tests are correctly detecting small but real effects, this can also be seen as the conventional tests rejecting unnecessarily. Many of these rejections happen even when downstream analysis and results would not be significantly affected by the small departures from a good fit. The results from human evaluations provide a more practical solution, which reinforces the statements from regression experts that residual plots are an indispensable method for model diagnostics. Further work would be needed to quantify how much departure from good residuals is too much.

It is important to emphasize that this work also supports a change in common practice, which is to deliver residual plots as a lineup, embedded in a field of null plots, rather than be viewed out of context. A residual plot may contain many visual features, but some are caused by the characteristics of the predictors and the randomness of the error, not by the violation of the model assumptions. These irrelevant visual features have a chance to be filtered out by participants with a comparison to null plots, resulting in more accurate reading. The lineup enables a careful calibration for reading structure in residual plots, and also provides the potential to discover interesting and important features in the data not directly connected to linear model assumptions.

Human evaluation of residuals is expensive, time-consuming and laborious. This is possibly why residual plot analysis is often not done in practice. However, with the emergence of effective computer vision, it is hoped this work helps to lay the foundation for automated residual plot assessment.

The experiment also revealed some interesting results. For the most part, the visual test performed similarly to the appropriate conventional test, with a shift in the power curve. Unlike conventional tests, where one needs to test specifically for non-linearity or heteroscedasticity, the visual test operated effectively across the range of departures from good residuals. If the fitted value distribution is not uniform, there is a small loss of power in the visual test. Surprisingly, there is a big difference in the power of the conventional test across fitted value distributions. Another unexpected finding was that the direction of heteroscedasticity appears to affect the ability to detect it visually: both triangle shapes were more difficult to detect than the butterfly, and there was a small difference in detection between the left- and right-triangle.

Supplementary Materials

Appendix: The appendix includes more details about the experimental setup, the derivation of the effect size, the effect of data collection period, and the estimate of α. (appendix.pdf, Portable Document Format file)

R package “visage”: The R package “visage” containing code to simulate lineups used in the experiment. The package also contains de-identified data collected from the experiment. (visage_0.1.2.tar.gz, GNU zipped tar file)


Acknowledgments

These R packages were used for the work: tidyverse (Wickham et al. 2019), lmtest (Zeileis and Hothorn 2002), mpoly (Kahle 2013), ggmosaic (Jeppson, Hofmann, and Cook 2021), kableExtra (Zhu 2021), patchwork (Pedersen 2022), and rcartocolor (Nowosad 2018). The study website was powered by PythonAnywhere (PythonAnywhere LLP 2023) and the Flask web framework (Grinberg 2018). The jsPsych framework (De Leeuw 2015) was used to create the behavioral experiments that run on the study website. The article was created with the R packages rticles (Allaire et al. 2022), knitr (Xie 2014), and rmarkdown (Xie, Dervieux, and Riederer 2020). The project’s GitHub repository (https://github.com/TengMCing/lineup_residual_diagnostics) contains all materials required to reproduce this article.

Disclosure Statement

No potential conflict of interest was reported by the author(s).

Notes

1 Although we did not use it, it is useful to know that the R package skedastic (Farrar 2020) also contains a large collection of functions to test for heteroscedasticity.

References

  • Allaire, J. J., Xie, Y., Dervieux, C., Wickham, H., Vaidyanathan, R., et al. (2022), rticles: Article Formats for R Markdown, R package version 0.24. Available at https://CRAN.R-project.org/package=rticles.
  • Box, G. E. P. (1976), “Science and Statistics,” Journal of the American Statistical Association, 71, 791–799.
  • Breusch, T. S., and Pagan, A. R. (1979), “A Simple Test for Heteroscedasticity and Random Coefficient Variation,” Econometrica: Journal of the Econometric Society, 47, 1287–1294.
  • Buja, A., Cook, D., Hofmann, H., Lawrence, M., Lee, E.-K., Swayne, D. F., and Wickham, H. (2009a), “Statistical Inference for Exploratory Data Analysis and Model Diagnostics,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367, 4361–4383.
  • Cleveland, W. S., and McGill, R. (1984), “Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods,” Journal of the American Statistical Association, 79, 531–554.
  • Cook, R. D., and Weisberg, S. (1982), Residuals and Influence in Regression, New York: Chapman and Hall.
  • ——(1999), Applied Regression Including Computing and Graphics, New York: Wiley.
  • De Leeuw, J. R. (2015), “jsPsych: A JavaScript Library for Creating Behavioral Experiments in a Web Browser,” Behavior Research Methods, 47, 1–12.
  • Draper, N. R., and Smith, H. (1998), Applied Regression Analysis (Vol. 326), Hoboken, NJ: Wiley.
  • Farrar, T. J. (2020), skedastic: Heteroskedasticity Diagnostics for Linear Regression Models, R package version 1.0.0, Bellville, South Africa.
  • Grinberg, M. (2018), Flask Web Development: Developing Web Applications with Python, Sebastopol, CA: O’Reilly Media, Inc.
  • Hermite, M. (1864), Sur un nouveau développement en série des fonctions, Imprimerie de Gauthier-Villars.
  • Jamshidian, M., Jennrich, R. I., and Liu, W. (2007), “A Study of Partial F tests for Multiple Linear Regression Models,” Computational Statistics & Data Analysis, 51, 6269–6284.
  • Jarque, C. M., and Bera, A. K. (1980), “Efficient Tests for Normality, Homoscedasticity and Serial Independence of Regression Residuals,” Economics Letters, 6, 255–259.
  • Jeppson, H., Hofmann, H., and Cook, D. (2021), ggmosaic: Mosaic Plots in the 'ggplot2' Framework, R package version 0.3.3. Available at https://CRAN.R-project.org/package=ggmosaic.
  • Kahle, D. (2013), “mpoly: Multivariate Polynomials in R,” The R Journal, 5, 162–170.
  • Kahneman, D. (2011), Thinking, Fast and Slow, New York: Farrar, Straus and Giroux.
  • Kimball, A. W. (1957), “Errors of the Third Kind in Statistical Consulting,” Journal of the American Statistical Association, 52, 133–142.
  • Kirk, R. E. (1996), “Practical Significance: A Concept Whose Time Has Come,” Educational and Psychological Measurement, 56, 746–759.
  • Kullback, S., and Leibler, R. A. (1951), “On Information and Sufficiency,” The Annals of Mathematical Statistics, 22, 79–86.
  • Laplace, P.-S. (1820), Théorie analytique des probabilités (Vol. 7), Paris: Courcier.
  • Loy, A. (2021), “Bringing Visual Inference to the Classroom,” Journal of Statistics and Data Science Education, 29, 171–182.
  • Loy, A., Follett, L., and Hofmann, H. (2016), “Variations of Q–Q Plots: The Power of Our Eyes!” The American Statistician, 70, 202–214.
  • Loy, A., and Hofmann, H. (2013), “Diagnostic Tools for Hierarchical Linear Models,” Wiley Interdisciplinary Reviews: Computational Statistics, 5, 48–61.
  • ——(2014), “HLMdiag: A Suite of Diagnostics for Hierarchical Linear Models in R,” Journal of Statistical Software, 56, 1–28.
  • ——(2015), “Are You Normal? The Problem of Confounded Residual Structures in Hierarchical Linear Models,” Journal of Computational and Graphical Statistics, 24, 1191–1209.
  • Majumder, M., Hofmann, H., and Cook, D. (2013a), “Validation of Visual Statistical Inference, Applied to Linear Models,” Journal of the American Statistical Association, 108, 942–956.
  • Montgomery, D. C., Peck, E. A., and Vining, G. G. (1982), Introduction to Linear Regression Analysis, Hoboken, NJ: Wiley.
  • Nowosad, J. (2018), rcartocolor: 'CARTOColors' Palettes, R package version 1.0. Available at https://nowosad.github.io/rcartocolor.
  • Olvera Astivia, O. L., Gadermann, A., and Guhn, M. (2019), “The Relationship between Statistical Power and Predictor Distribution in Multilevel Logistic Regression: A Simulation-based Approach,” BMC Medical Research Methodology, 19, 1–20.
  • Palan, S., and Schitter, C. (2018), “Prolific.ac—A Subject Pool for Online Experiments,” Journal of Behavioral and Experimental Finance, 17, 22–27.
  • Pedersen, T. L. (2022), patchwork: The Composer of Plots, R package version 1.1.2, available at https://CRAN.R-project.org/package=patchwork.
  • PythonAnywhere LLP. (2023), “PythonAnywhere,” available at https://www.pythonanywhere.com.
  • Ramsey, J. B. (1969), “Tests for Specification Errors in Classical Linear Least-Squares Regression Analysis,” Journal of the Royal Statistical Society, Series B, 31, 350–371.
  • Roy Chowdhury, N., Cook, D., Hofmann, H., Majumder, M., Lee, E.-K., and Toth, A. L. (2015), “Using Visual Statistical Inference to Better Understand Random Class Separations in High Dimension, Low Sample Size Data,” Computational Statistics, 30, 293–316.
  • Shapiro, S. S., and Wilk, M. B. (1965), “An Analysis of Variance Test for Normality (Complete Samples),” Biometrika, 52, 591–611.
  • Silvey, S. D. (1959), “The Lagrangian Multiplier Test,” The Annals of Mathematical Statistics, 30, 389–407.
  • VanderPlas, S., and Hofmann, H. (2016), “Spatial Reasoning and Data Displays,” IEEE Transactions on Visualization and Computer Graphics, 22, 459–468. DOI: 10.1109/TVCG.2015.2469125.
  • VanderPlas, S., Röttger, C., Cook, D., and Hofmann, H. (2021), “Statistical Significance Calculations for Scenarios in Visual Inference,” Stat, 10, e337.
  • White, H. (1980), “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity,” Econometrica: Journal of the Econometric Society, 48, 817–838.
  • Wickham, H., Averick, M., Bryan, J., Chang, W., D’Agostino McGowan, L., François, R., Grolemund, G., et al. (2019), “Welcome to the tidyverse,” Journal of Open Source Software, 4, 1686.
  • Wickham, H., Cook, D., Hofmann, H., and Buja, A. (2010), “Graphical Inference for Infovis,” IEEE Transactions on Visualization and Computer Graphics, 16, 973–979. DOI: 10.1109/TVCG.2010.161.
  • Xie, Y. (2014), “knitr: A Comprehensive Tool for Reproducible Research in R,” in Implementing Reproducible Computational Research, eds. V. Stodden, F. Leisch, and R. D. Peng. Boca Raton, FL: Chapman and Hall/CRC. Available at http://www.crcpress.com/product/isbn/9781466561595.
  • Xie, Y., Dervieux, C., and Riederer, E. (2020), R Markdown Cookbook. Boca Raton, FL: Chapman and Hall/CRC. Available at https://bookdown.org/yihui/rmarkdown-cookbook.
  • Yin, T., Majumder, M., Roy Chowdhury, N., Cook, D., Shoemaker, R., and Graham, M. (2013), “Visual Mining Methods for RNA-Seq Data: Data Structure, Dispersion Estimation and Significance Testing,” Journal of Data Mining in Genomics and Proteomics, 4, 137.
  • Zeileis, A., and Hothorn, T. (2002), “Diagnostic Checking in Regression Relationships,” R News, 2, 7–10.
  • Zhang, Z., and Yuan, K.-H. (2018), Practical Statistical Power Analysis Using WebPower and R, Granger, IN: ISDSA Press.
  • Zhu, H. (2021), kableExtra: Construct Complex Table with kable and Pipe Syntax, R package version 1.3.4. Available at https://CRAN.R-project.org/package=kableExtra.