Short Technical Note

Iteratively Reweighted Least Squares Method for Estimating Polyserial and Polychoric Correlation Coefficients

Pages 316-328 | Received 10 Feb 2023, Accepted 01 Sep 2023, Published online: 07 Nov 2023

Abstract

An iteratively reweighted least squares (IRLS) method is proposed for the estimation of polyserial and polychoric correlation coefficients in this article. It calculates the slopes in a series of weighted linear regression models fitted on conditional expected values. For the polyserial correlation, conditional expectations of the latent predictor are derived from the observed ordinal categorical variable, and the regression coefficient is obtained with the weighted least squares method. In estimating the polychoric correlation coefficient, conditional expectations of the response variable and the predictor are updated in turns. Standard errors of the estimators are obtained using the delta method based on data summaries instead of the whole data. The conditional univariate normal distribution is exploited and a single integral is numerically evaluated in the proposed algorithm, compared with the double integral computed numerically based on the bivariate normal distribution in the traditional maximum likelihood (ML) approaches. This renders the new algorithm very fast in the estimation of both polyserial and polychoric correlation coefficients. Thorough simulation studies are conducted to compare the performance of the proposed method with the classical ML methods. Real data analyses illustrate the advantage of the new method in computing speed. Supplementary materials for this article are available online.

1 Introduction

In medical research, as well as in behavioral, educational, and psychological studies, it is common that the observed variables are measured on ordinal scales. For example, measures similar to the Likert scale, proposed in Likert, Roslow, and Murphy (Citation1934), are widely used to assess responses in surveys, allowing respondents to express how much they agree or disagree with a particular statement on a five (or seven) point scale. These categorical variables can be treated as being discretized from an underlying continuous variable for the degree of agreement with the statement. There are also many examples of quantitative variables that are discretized explicitly in social science studies. For instance, when asking questions about sensitive or personal quantitative attributes (income, alcohol consumption), the nonresponse rate may often be reduced by simply asking the respondent to select one of two very broad categories (under $30K/over $30K, etc.). When analyzing this kind of data, as pointed out by Robitzsch (Citation2020), it is common in practice to assign integer values to each category and proceed in the analysis as if the data had been measured on an interval scale with the desired distributional properties.

Yet another popular approach is the latent variable model, see Moustaki (Citation2000), which usually requires a much smaller number of latent variables than the number of observed variables. The most common choice for the distribution of the latent variables is the normal distribution, because all covariances between the latent variables can be fully captured by the covariance matrix and each of its elements can be estimated separately using a bivariate normal distribution. The correlation in the standard bivariate normal distribution underlying a 2×2 contingency table is called the tetrachoric correlation, suggested by Pearson (Citation1900). The tetrachoric correlation was generalized to the polychoric correlation, for the case where the observed variables X and Y have r and s ordinal categories, by Ritchie-Scott (Citation1918) and Pearson and Pearson (Citation1922) in the early 20th century.

The Pearson correlation has been considered a less suitable measure of the degree of association between categorical variables for several reasons. First, from a methodological point of view, these variables imply ordinal scales, whereas the Pearson correlation assumes interval measurement scales. Furthermore, the only information provided by this kind of scale is the number of subjects in each of the categories (cells) of a contingency table. If the Pearson correlation is used in this case, the relationship between measures is artificially restricted by the categorization (Gilley and Uhlig Citation1993): all subjects situated in the interval that defines a category are treated as belonging to the same category and are therefore assigned the same score, with a resulting reduction in data variability.

Holgado-Tello et al. (Citation2010) illustrated the advantages of using polychoric rather than Pearson correlations in exploratory factor analysis (EFA) and confirmatory factor analysis (CFA), taking into account that the latter require quantitative variables measured on interval scales and that the relationship between these variables has to be monotonic. The matrix of estimated polychoric and/or polyserial coefficients is used as input for the typical parameter estimation in structural equation modeling (Muthén Citation1984). More recently, network research, referred to as psychological networks, has gained substantial attention in the psychological sciences and has been used in various fields of psychology (Epskamp, Borsboom, and Fried Citation2018). The Gaussian graphical model (GGM) was proposed by Lauritzen (Citation1996), in which edges are directly interpreted as partial correlation coefficients. The GGM requires an estimate of the covariance matrix as input, for which polyserial and polychoric correlations can also be used when the data are ordinal.

Let X be an observed ordinal variable determined by an underlying latent continuous random variable Z1, and let Y represent another observed continuous variable. It is assumed that the joint distribution of Z1 and Y is bivariate normal. The product moment correlation between X and Y is called the point polyserial correlation, while the correlation between Z1 and Y is called the polyserial correlation. The maximum likelihood estimate (MLE) of the polyserial correlation was derived by Cox (Citation1974). Olsson (Citation1979) presented two approaches to estimating the MLE of the polychoric correlation: the so-called two-step method, which first estimates the unknown thresholds, a and b, from the marginal frequencies of the table and then finds the MLE of ρ conditioning on the estimated thresholds; and the joint MLE of (ρ, a, b) from the likelihood function. Olsson, Drasgow, and Dorans (Citation1982) derived the relationship between the polyserial and point polyserial correlations and compared the MLE of the polyserial correlation with a two-step estimator and with a computationally convenient ad hoc estimator.

The estimation of polychoric and polyserial correlations using the ML methods has been implemented in many popular statistical software packages, such as Mplus by Muthén and Muthén (Citation2005); PRELIS by Jöreskog (Citation2005); PROC CORR in SAS (with the POLYCHORIC or OUTPLC= options), see SAS Institute (2017); the polycor package in R by Fox (Citation2010); the psych package in R by Revelle (Citation2017); and the POLYCORR program by Uebersax (Citation2010). An extensive list of software for computing the polychoric correlation is given by Uebersax (Citation2006).

A Bayesian approach to estimating tetrachoric and polychoric correlation coefficients was proposed by Albert (Citation1992), who used a latent bivariate normal distribution and the Gibbs sampler. One attractive feature of this method is that it can be generalized in a straightforward manner to handle a number of nonnormal latent distributions; the simulations in that work extend the method to bivariate lognormal and bivariate t latent distributions. Chen and Choi (Citation2009) and Choi et al. (Citation2011) proposed two new Bayesian estimators, maximum a posteriori (MAP) and expected a posteriori (EAP), and compared them to the ML method.

The aforementioned methods are all based on likelihood functions assuming a bivariate normal distribution. In order to evaluate the cumulative distribution function (CDF) of the bivariate normal distribution, they inevitably involve numerical computation of double integrals, which requires significantly more computational resources than computing the single integrals in the CDF of a univariate normal distribution.

In this article we propose an iteratively reweighted least squares method to estimate the polyserial and the polychoric correlation coefficients. It is computationally fast because the complex bivariate integrations are circumvented completely. It is motivated by the fact that the Pearson correlation coincides with the slope of the regression line for paired standard normal data. When one of the paired continuous variables is discretized, an unbiased estimator of the slope is derived from the generated categorical data. When both of the paired variables are discretized, the slope of the regression line, that is, the correlation coefficient of the two normal random variables, is obtained iteratively from a series of similar estimation procedures. The details of the algorithm can be found in Section 2. In Sections 3 and 4, we conduct simulation studies and data analyses to compare the proposed algorithm with the ML method. Finally, we conclude with a discussion and directions for future work to improve the proposed method.

2 Iteratively Reweighted Least Squares Algorithm

Assume $(Z_1,Z_2)^T \sim N_2(\mathbf{0}, R)$, where $\mathbf{0}=(0,0)^T$ and $R=\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$, $-1 \le \rho \le 1$. Conditioning on $Z_1$, $Z_2 \mid Z_1 \sim N(\rho Z_1, 1-\rho^2)$. Hence,
$$E(Z_2 \mid Z_1) = \rho Z_1. \tag{1}$$

This represents a simple linear regression model fitting Z2 on Z1 and ρ is the slope of the regression line. Therefore, ρ, the Pearson correlation of Z1 and Z2, can be estimated from such a linear regression model.
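This observation can be checked numerically. The following Python sketch (variable names ours; the paper's own software is in R) simulates standard bivariate normal pairs and compares the no-intercept least squares slope with $\rho$:

```python
import numpy as np

# Sketch: for standard bivariate normal pairs, the no-intercept OLS slope
# of z2 on z1 estimates the Pearson correlation rho.
rng = np.random.default_rng(1)
rho = 0.7                                  # illustrative true correlation
z1 = rng.standard_normal(200_000)
z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.standard_normal(200_000)

# slope of the regression line E(Z2|Z1) = rho * Z1
slope = np.sum(z1 * z2) / np.sum(z1**2)
```

Up to sampling error, `slope` agrees with both the true $\rho$ and the sample Pearson correlation `np.corrcoef(z1, z2)[0, 1]`.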

2.1 Polyserial Correlation Coefficients

Consider the case where one of the paired random variables, namely $Z_1$, is discretized into an ordinal polychotomous variable, $X$, and the other is observed as a continuous variable, $Y$. Let $X$ be an observed ordinal variable with $s$ categories, generated from the latent variable $Z_1$ with $X=i$ if $a_{i-1} < Z_1 \le a_i$, $i=1,\dots,s$, where the $a_i$'s are thresholds with $a_0=-\infty$ and $a_s=\infty$.

If $Z_1$ were observable, it would follow from the regression line that $E(Y \mid Z_1) = \rho Z_1$. Taking expectation with respect to $Z_1$,
$$E\{E(Y \mid Z_1)\} = \rho E(Z_1). \tag{2}$$

It holds for every $Z_1$ such that $a_{i-1} < Z_1 \le a_i$, or correspondingly, $X=i$, for $i=1,2,\dots,s$. That is,
$$E\{E(Y \mid Z_1, a_{i-1} < Z_1 \le a_i)\} = E\{E(Y \mid X=i)\} = \rho E(Z_1 \mid a_{i-1} < Z_1 \le a_i), \tag{2}$$
for $i=1,\dots,s$.

Denote $E(Y \mid X=i)$ by $EY_i$ and $E(Z_1 \mid a_{i-1} < Z_1 \le a_i)$ by $ex_i$; then (2) becomes
$$E\{EY_i\} = \rho\, ex_i, \tag{3}$$
for $i=1,\dots,s$.

This is a regression model without an intercept, in which $EY_i$ is the response variable and $ex_i$ is the explanatory variable, with $\rho$ being the regression coefficient. Because the $EY_i$'s have unequal variances, $\rho$ cannot be estimated with an ordinary least squares method. However, $\rho$ can be estimated using a weighted least squares method with a diagonal weight matrix.

It is easy to show that the density function of $Y$ given $X=i$ is
$$f(y) = \frac{\Phi\!\left(\frac{a_i-\rho y}{\sqrt{1-\rho^2}}\right) - \Phi\!\left(\frac{a_{i-1}-\rho y}{\sqrt{1-\rho^2}}\right)}{P_i}\,\phi(y),$$
where $P_i = \Pr(X=i)$. The mean and variance of this distribution, $\mu_i$ and $\sigma_i^2$, are given by
$$\mu_i = \rho\,\frac{\phi(a_{i-1})-\phi(a_i)}{P_i}, \qquad \sigma_i^2 = 1 + \rho^2\,\frac{a_{i-1}\phi(a_{i-1}) - a_i\phi(a_i)}{P_i} - \rho^2\,\frac{\{\phi(a_{i-1})-\phi(a_i)\}^2}{P_i^2}. \tag{4}$$
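The moments in (4) can be verified by numerical integration against the conditional density $f(y)$. A small Python check (the threshold values and $\rho$ below are arbitrary illustrations):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Numerical check of the conditional moments in eq. (4) for one category,
# with illustrative thresholds a_{i-1} = 0, a_i = 1 and rho = 0.6.
rho, a_lo, a_hi = 0.6, 0.0, 1.0
P_i = norm.cdf(a_hi) - norm.cdf(a_lo)           # P_i = Pr(X = i)
c = np.sqrt(1 - rho**2)

def f(y):
    # density of Y given X = i, as displayed in the text
    return (norm.cdf((a_hi - rho*y)/c) - norm.cdf((a_lo - rho*y)/c)) / P_i * norm.pdf(y)

mu_num = quad(lambda y: y * f(y), -10, 10)[0]
m2_num = quad(lambda y: y**2 * f(y), -10, 10)[0]
var_num = m2_num - mu_num**2

# closed forms from eq. (4)
mu_eq4 = rho * (norm.pdf(a_lo) - norm.pdf(a_hi)) / P_i
var_eq4 = (1 + rho**2 * (a_lo*norm.pdf(a_lo) - a_hi*norm.pdf(a_hi)) / P_i
           - rho**2 * ((norm.pdf(a_lo) - norm.pdf(a_hi)) / P_i)**2)
```

The quadrature values match the closed forms to numerical precision.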

Let $y_{i1}, y_{i2}, \dots, y_{in_i}$ be the observed response variables associated with $X=i$ and $a_{i-1} < Z_{1j} \le a_i$ for $j=1,\dots,n_i$, where $n_i$ is the number of observations with $X=i$. $EY_i$ is estimated by
$$\widehat{EY}_i = \bar{y}_{X=i} = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}. \tag{5}$$

Since $Z_1$ has a truncated normal distribution with lower and upper limits $a_{i-1}$ and $a_i$, respectively, $ex_i$ is the expected value of the truncated normal distribution, given by
$$ex_i = \frac{\phi(a_{i-1})-\phi(a_i)}{P_i}, \tag{6}$$
where $\phi(\cdot)$ is the density function of the standard normal distribution. Let $CP_i = \Pr(X \le i) = \Phi(a_i) = \sum_{j=1}^{i} P_j$, $i=1,\dots,s$. Then $\hat{a}_i = \Phi^{-1}(\widehat{CP}_i)$, and $ex_i$ in (6) is estimated by
$$\widehat{ex}_i = \frac{\phi(\hat{a}_{i-1})-\phi(\hat{a}_i)}{\hat{P}_i} = \frac{\phi\{\Phi^{-1}(\widehat{CP}_{i-1})\} - \phi\{\Phi^{-1}(\widehat{CP}_i)\}}{\hat{P}_i}. \tag{7}$$

Let $\widehat{ex} = (\widehat{ex}_1, \widehat{ex}_2, \dots, \widehat{ex}_s)^T$, $\widehat{EY} = (\widehat{EY}_1, \widehat{EY}_2, \dots, \widehat{EY}_s)^T$, and $\hat{\Sigma} = \mathrm{diag}(\hat{\sigma}_1^2/n_1, \hat{\sigma}_2^2/n_2, \dots, \hat{\sigma}_s^2/n_s)$. The regression coefficient is given by the weighted least squares method as
$$\hat{\rho} = (\widehat{ex}^T \hat{\Sigma}^{-1} \widehat{ex})^{-1} \widehat{ex}^T \hat{\Sigma}^{-1} \widehat{EY} = \frac{\sum_{i=1}^{s} \frac{n_i}{\hat{\sigma}_i^2}\, \widehat{ex}_i\, \widehat{EY}_i}{\sum_{i=1}^{s} \frac{n_i}{\hat{\sigma}_i^2}\, \widehat{ex}_i^2}. \tag{8}$$

Algorithm 1

IRLS algorithm to estimate the polyserial correlation coefficient

Input: observed continuous variable y and ordinal variable x

Output: $\hat{\rho}$, polyserial correlation coefficient of $Y$ and $X$; $\sqrt{\mathrm{var}(\hat{\rho})}$, the standard error of $\hat{\rho}$

1: Compute P̂i,CP̂i from data x. Estimate the thresholds âi=Φ1(CP̂i),i=1,,s

2: Compute êxi,Êyi using the formula in (7) and (5) for i=1,,s

3: Compute the initial ρ̂ using the Pearson correlation coefficient between y and x

4: Set iter $=0$, diff $=1$, $n=100$, $\epsilon=10^{-8}$ and $\hat{\rho}_0=\hat{\rho}$

5: while iter<n & diff>ϵ do

6: Compute σ̂i2 using the formula in (4) with ρ̂0, for i=1,,s

7: Update ρ̂ using the formula in (8)

8: Compute diff=|ρ̂ρ̂0|

9: Update ρ̂0=ρ̂ and iter = iter + 1

10: end while

11: Compute var(ρ̂) using the formula in (9)

12: return $\hat{\rho}$ and $\sqrt{\mathrm{var}(\hat{\rho})}$

While $\sigma_i^2$ in (4) depends on $\rho$, $\hat{\rho}$ can be obtained iteratively using the formula in (8), with the Pearson correlation coefficient as the initial value. The variance of $\hat{\rho}$ is given by
$$\mathrm{var}(\hat{\rho}) = (\widehat{ex}^T \hat{\Sigma}^{-1} \widehat{ex})^{-1} = \left(\sum_{i=1}^{s} \frac{n_i}{\hat{\sigma}_i^2}\, \widehat{ex}_i^2\right)^{-1}, \tag{9}$$
and the standard error of $\hat{\rho}$ is $\sqrt{\mathrm{var}(\hat{\rho})}$.

The details of the IRLS method for estimating the polyserial correlation coefficient are summarized in Algorithm 1.
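A minimal Python sketch of Algorithm 1 follows (the authors' implementation is the R package IRLSpoly; the function name and the assumption that y is approximately standard normal are ours):

```python
import numpy as np
from scipy.stats import norm

def polyserial_irls(y, x, max_iter=100, eps=1e-8):
    """Sketch of Algorithm 1: polyserial correlation of continuous y and
    ordinal x (coded 1..s). Assumes Y is (approximately) standard normal."""
    y = np.asarray(y, float)
    y = (y - y.mean()) / y.std()                    # put y on the N(0,1) scale
    x = np.asarray(x)
    cats = np.unique(x)
    n_i = np.array([(x == c).sum() for c in cats], float)
    P_i = n_i / n_i.sum()
    cp = np.cumsum(P_i); cp[-1] = 1.0               # guard against rounding
    a = norm.ppf(np.concatenate(([0.0], cp)))       # thresholds a_0..a_s
    phi = norm.pdf(a)                               # phi(+-inf) = 0
    aphi = np.where(np.isfinite(a), a, 0.0) * phi   # a*phi with inf*0 := 0
    ex = (phi[:-1] - phi[1:]) / P_i                 # eq. (7)
    Ey = np.array([y[x == c].mean() for c in cats]) # eq. (5)
    rho = np.corrcoef(x.astype(float), y)[0, 1]     # initial value
    for _ in range(max_iter):
        # eq. (4): conditional variances, used as weights
        s2 = (1 + rho**2 * (aphi[:-1] - aphi[1:]) / P_i
              - rho**2 * (phi[:-1] - phi[1:])**2 / P_i**2)
        w = n_i / s2
        rho_new = np.sum(w * ex * Ey) / np.sum(w * ex**2)  # eq. (8)
        if abs(rho_new - rho) < eps:
            rho = rho_new
            break
        rho = rho_new
    se = np.sqrt(1.0 / np.sum(n_i / s2 * ex**2))    # eq. (9)
    return rho, se
```

On data simulated from a bivariate normal model, the estimate converges in a handful of iterations.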

2.2 Tetrachoric and Polychoric Correlation Coefficients

When both of the paired normal variables are discretized into ordinal variables, neither the response nor the predictor variable is observable. Both the weights and the response parts in (8) have to be updated after $\hat{\rho}$ is obtained in each iteration; hence we call it an iteratively reweighted least squares algorithm. The correlation coefficient between the observed categorical variables is smaller in magnitude than that between the latent continuous variables. However, the same procedure given in Section 2.1 will update the estimate of the correlation coefficient, making it closer to the true parameter than the previous one. Therefore, the polychoric correlation coefficient can be estimated iteratively by updating the conditional expectations of the two latent variables in turn. As shown subsequently, the estimates are obtained by replacing the conditional expectations with contingency table statistics, that is, sample moments. It is therefore a moment method whose consistency follows from the law of large numbers, and the sequence of estimates converges to the polychoric correlation coefficient.

Let $(Z_1,Z_2)^T \sim N_2(\mathbf{0}, R)$ as above. $Z_1$ and $Z_2$ are categorized into $X$ and $Y$ with $s$ and $r$ categories, respectively. Assume that $X$ is generated from the latent variable $Z_1$ with $X=i$ if $a_{i-1} < Z_1 \le a_i$, $i=1,\dots,s$, where the $a_i$'s are thresholds with $a_0=-\infty$ and $a_s=\infty$. Similarly, $Y$ is generated from $Z_2$ with $Y=j$ if $b_{j-1} < Z_2 \le b_j$, $j=1,\dots,r$, where the $b_j$'s are thresholds with $b_0=-\infty$ and $b_r=\infty$. The data usually consist of an array of observed frequencies $n_{ij}$, the number of observations with $X=i, Y=j$, for $i=1,\dots,s$; $j=1,\dots,r$. Let $P_{ij} = P(X=i, Y=j) = P(a_{i-1} < Z_1 \le a_i,\, b_{j-1} < Z_2 \le b_j)$ be the proportion of data in cell $(i,j)$. The cumulative marginal proportions of the table are
$$P_{i\cdot} = \sum_{k=1}^{i}\sum_{j=1}^{r} P_{kj},\quad i=1,\dots,s, \qquad P_{\cdot j} = \sum_{i=1}^{s}\sum_{k=1}^{j} P_{ik},\quad j=1,\dots,r.$$

The IRLS method for estimating the tetrachoric correlation coefficient ($s=r=2$) is derived first. Suppose that $X$ and $Y$ are two dichotomous variables, generated from $Z_1$ and $Z_2$, respectively. Let
$$N = \begin{pmatrix} N_{11} & N_{12} \\ N_{21} & N_{22} \end{pmatrix}$$
be the contingency table with $N_{ij}$, $i,j=1,2$, the frequencies of $X$ by $Y$ at categories $i$ and $j$, respectively, and $N = N_{11} + N_{21} + N_{12} + N_{22}$.

Since $Z_2$ is not observable, the regression model in (2) becomes
$$E\{E(Z_2 \mid X=i)\} = \rho\, E(Z_1 \mid a_{i-1} < Z_1 \le a_i), \quad i=1,2. \tag{10}$$

Denote the explanatory variables on the right-hand side of (10) by $ex_i$. They are given by
$$ex_1 \,\hat{=}\, E(Z_1 \mid Z_1 \le a) = -\frac{\phi(a)}{P_{1\cdot}} = -\frac{\exp(-a^2/2)}{\sqrt{2\pi}\,P_{1\cdot}}, \qquad ex_2 \,\hat{=}\, E(Z_1 \mid Z_1 > a) = \frac{\phi(a)}{1-P_{1\cdot}} = \frac{\exp(-a^2/2)}{\sqrt{2\pi}\,(1-P_{1\cdot})}, \tag{11}$$
where $a=\Phi^{-1}(P_{1\cdot})$ and $b=\Phi^{-1}(P_{\cdot 1})$ are the cutoff points, $\hat{P}_{1\cdot} = (N_{11}+N_{12})/N$ is the proportion of $X=1$, and $\hat{P}_{\cdot 1} = (N_{11}+N_{21})/N$ is the proportion of $Y=1$. Similarly, $\hat{P}_{2\cdot}$ and $\hat{P}_{\cdot 2}$ are the proportions of $X=2$ and $Y=2$, respectively. $ex_1$ and $ex_2$ are estimated with the formulas in (11) by plugging in the observed frequencies of the contingency table, and are used as the initial values of the predictors in the regression model (10).

Denote the response variable of the regression model in (10) by $EY_i = E(Z_2 \mid X=i)$, $i=1,2$. It cannot be calculated directly. But because $Z_2$ is dichotomized into two categories by whether $Z_2 \le b$ or $Z_2 > b$, indicated by $Y=1$ or $Y=2$, $EY_i$ can be obtained based on the binomial distribution of $Y$, that is, on which of the two intervals $Z_2$ falls into. This procedure is presented in the following two steps:

First, since $Z_1$ and $Z_2$ are jointly normally distributed, $Z_2 \mid Z_1 \sim N(\rho Z_1, 1-\rho^2)$. Denote $(b-\rho Z_1)/\sqrt{1-\rho^2}$ by $Z_b$, and write $ex_i^b = (b-\rho\, ex_i)/\sqrt{1-\rho^2}$. The conditional means of $Z_2$ given $Z_1$ at $ex_1$ and $ex_2$ for the different categories of $Y$ are
$$\begin{aligned}
e_{11} &= E(Z_2 \mid Z_2 \le b, Z_1 = ex_1) = \rho\, ex_1 - \sqrt{1-\rho^2}\;\phi(ex_1^b)/\Phi(ex_1^b),\\
e_{21} &= E(Z_2 \mid Z_2 \le b, Z_1 = ex_2) = \rho\, ex_2 - \sqrt{1-\rho^2}\;\phi(ex_2^b)/\Phi(ex_2^b),\\
e_{12} &= E(Z_2 \mid Z_2 > b, Z_1 = ex_1) = \rho\, ex_1 + \sqrt{1-\rho^2}\;\phi(ex_1^b)/\{1-\Phi(ex_1^b)\},\\
e_{22} &= E(Z_2 \mid Z_2 > b, Z_1 = ex_2) = \rho\, ex_2 + \sqrt{1-\rho^2}\;\phi(ex_2^b)/\{1-\Phi(ex_2^b)\},
\end{aligned} \tag{12}$$
where $\phi(\cdot)$ and $\Phi(\cdot)$ are the density function and the cumulative distribution function of the standard normal distribution, respectively.

Second, $EY_i$ is obtained based on the conditional expected values in (12). Because $Z_2$ is dichotomized into $Y$ with two mass probabilities,
$$EY_1 = \frac{P_{11}}{P_{11}+P_{12}}\,e_{11} + \frac{P_{12}}{P_{11}+P_{12}}\,e_{12}, \qquad EY_2 = \frac{P_{21}}{P_{21}+P_{22}}\,e_{21} + \frac{P_{22}}{P_{21}+P_{22}}\,e_{22}, \tag{13}$$
where $P_{ij}$, $i=1,2$; $j=1,2$, is the proportion in cell $(i,j)$ of the probability table
$$P = \begin{pmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{pmatrix}.$$

As in model (2), the $EY_i$'s have unequal variances; moreover, they are not independent. Yet the regression coefficient can be obtained with a weighted least squares method. Denote the covariance matrix of $\hat{P}$ by $B$, which is given by
$$B = \frac{1}{N}\begin{pmatrix}
P_{11}(1-P_{11}) & -P_{11}P_{12} & -P_{11}P_{21} & -P_{11}P_{22}\\
-P_{12}P_{11} & P_{12}(1-P_{12}) & -P_{12}P_{21} & -P_{12}P_{22}\\
-P_{21}P_{11} & -P_{21}P_{12} & P_{21}(1-P_{21}) & -P_{21}P_{22}\\
-P_{22}P_{11} & -P_{22}P_{12} & -P_{22}P_{21} & P_{22}(1-P_{22})
\end{pmatrix}. \tag{14}$$

$EY_i$ is a differentiable function of $P$. The matrix of partial derivatives of $EY_i$ with respect to $P$ is denoted by $D$, which is presented in the appendix.

The covariance matrix of the vector $EY = (EY_1, EY_2)^T$ is given by $\Sigma = DBD^T$, using the delta method. $\hat{\Sigma}$ can be obtained by replacing all $P_{ij}$ with the observed proportions $\hat{P}_{ij}$. Let $ex = (ex_1, ex_2)^T$; the regression coefficient in model (10) can be calculated using the weighted least squares method as
$$\hat{\rho} = (\widehat{ex}^T \hat{\Sigma}^{-1} \widehat{ex})^{-1} \widehat{ex}^T \hat{\Sigma}^{-1} \widehat{EY}, \tag{15}$$

and
$$\mathrm{var}(\hat{\rho}) = (\widehat{ex}^T \hat{\Sigma}^{-1} \widehat{ex})^{-1}. \tag{16}$$

The standard error of $\hat{\rho}$ is $\sqrt{\mathrm{var}(\hat{\rho})}$.

The predictors in (10) are not constants; they have to be updated with the improved estimate $\hat{\rho}$ from (15). This also consists of two steps. First, we obtain the conditional expected values of $Z_2$ for the different categories of $Y$, using a derivation similar to (12)-(13),
$$EYx_1 = \frac{P_{11}}{P_{11}+P_{21}}\,e_{11} + \frac{P_{21}}{P_{11}+P_{21}}\,e_{21}, \qquad EYx_2 = \frac{P_{12}}{P_{12}+P_{22}}\,e_{12} + \frac{P_{22}}{P_{12}+P_{22}}\,e_{22}. \tag{17}$$

Second, we update $ex_1$ and $ex_2$ by taking expectations over $Z_2$,
$$ex_1 = \frac{P_{11}}{P_{11}+P_{12}}\,ex_{11} + \frac{P_{12}}{P_{11}+P_{12}}\,ex_{12}, \qquad ex_2 = \frac{P_{21}}{P_{21}+P_{22}}\,ex_{21} + \frac{P_{22}}{P_{21}+P_{22}}\,ex_{22}, \tag{18}$$
where, writing $EYx_j^a = (a - \rho\, EYx_j)/\sqrt{1-\rho^2}$,
$$\begin{aligned}
ex_{11} &\,\hat{=}\, E(Z_1 \mid Z_1 \le a, Z_2 = EYx_1) = \rho\, EYx_1 - \sqrt{1-\rho^2}\;\phi(EYx_1^a)/\Phi(EYx_1^a),\\
ex_{12} &\,\hat{=}\, E(Z_1 \mid Z_1 \le a, Z_2 = EYx_2) = \rho\, EYx_2 - \sqrt{1-\rho^2}\;\phi(EYx_2^a)/\Phi(EYx_2^a),\\
ex_{21} &\,\hat{=}\, E(Z_1 \mid Z_1 > a, Z_2 = EYx_1) = \rho\, EYx_1 + \sqrt{1-\rho^2}\;\phi(EYx_1^a)/\{1-\Phi(EYx_1^a)\},\\
ex_{22} &\,\hat{=}\, E(Z_1 \mid Z_1 > a, Z_2 = EYx_2) = \rho\, EYx_2 + \sqrt{1-\rho^2}\;\phi(EYx_2^a)/\{1-\Phi(EYx_2^a)\}.
\end{aligned} \tag{19}$$

Then we go back to (12) and repeat the previous procedure until $\hat{\rho}$ converges. Because both the weight matrix and the response vector in (15) have been updated, we call this an iteratively reweighted least squares algorithm. The proposed IRLS algorithm proceeds in the sequence $ex \to EY, \Sigma \to \rho \to ex \to EY, \Sigma \to \rho \to \cdots \to \rho, \mathrm{var}(\rho)$. Figure 1 shows the change of the values used in (15).

Fig. 1. The track of the estimating process of the tetrachoric correlation coefficient.


The red star points are $(\widehat{ex}_1, \widehat{EY}_1)$ and the blue points are $(\widehat{ex}_2, \widehat{EY}_2)$. It can be seen that the algorithm converges fast, stopping within a few iterations. The slope of the red line is the direct Pearson correlation, which is clearly smaller than the true $\rho$ (the green line). The slope of the black line is the estimated tetrachoric correlation coefficient from the proposed IRLS algorithm. Clearly, as the algorithm proceeds, the estimate gets closer to the true value. The details of the IRLS algorithm for estimating the tetrachoric correlation coefficient are given in Algorithm 2.

Algorithm 2

IRLS method to estimate the tetrachoric correlation coefficient

Input: observed ordinal data y and x or contingency table

Output: tetrachoric correlation coefficient ρ̂ of Y and X, s.e.(ρ̂)

1: Calculate the table of proportions $\hat{P}=(\hat{P}_{ij})$, $i,j=1,2$, and the covariance $\hat{B}$ using (14).

2: Estimate the thresholds from $\hat{P}$: $\hat{a}=\Phi^{-1}(\hat{P}_{1\cdot})$, $\hat{b}=\Phi^{-1}(\hat{P}_{\cdot 1})$.

3: Initialization:

4: ρ̂0 = Pearson correlation coefficient of x and y (ordinal data), 0(contingency table)

5: êx1 and êx2 using the formulas in (11)

6: Set iter $=0$, diff $=1$, $n=100$ and $\epsilon=10^{-8}$

7: while iter<n & diff>ϵ do

8: Compute Êy1,Êy2 using (12) and (13)

9: Compute the derivative matrix $\hat{D}$ using (24), and $\hat{\Sigma}=\hat{D}\hat{B}\hat{D}^T$

10: Compute ρ̂ using the formula in (15)

11: Compute diff=|ρ̂ρ̂0|,

12: Update êx1 and êx2 using the formula in (17), (19), and (18)

13: Update ρ̂0=ρ̂ and iter = iter + 1

14: end while

15: Compute var(ρ̂) using the formula in (16)

16: return $\hat{\rho}$ and $\sqrt{\mathrm{var}(\hat{\rho})}$

Figure 2 illustrates the whole process of estimating the tetrachoric correlation coefficient with Algorithm 2.

Fig. 2. Process of estimating the tetrachoric correlation coefficient.


Next, we generalize Algorithm 2 to the case where $X$ and $Y$ have $s$ and $r$ categories, respectively, that is, to estimate the polychoric correlation coefficient. As in the dichotomous case, the thresholds are $a_i = \Phi^{-1}(P_{i\cdot})$, $i=1,\dots,s$, and $b_j = \Phi^{-1}(P_{\cdot j})$, $j=1,\dots,r$.

Initially, define
$$ex_i \,\hat{=}\, E(Z_1 \mid a_{i-1} < Z_1 \le a_i) = \frac{\phi(a_{i-1})-\phi(a_i)}{P_{i\cdot}} \tag{20}$$
for $i=1,\dots,s$.

Conditional on $Z_1=ex_i$ and $b_{j-1} < Z_2 \le b_j$, and writing $ex_i^{b} = (b - \rho\, ex_i)/\sqrt{1-\rho^2}$,
$$e_{ij} = E(Z_2 \mid b_{j-1} < Z_2 \le b_j, Z_1 = ex_i) = \rho\, ex_i + \sqrt{1-\rho^2}\;\frac{\phi(ex_i^{b_{j-1}}) - \phi(ex_i^{b_j})}{\Phi(ex_i^{b_j}) - \Phi(ex_i^{b_{j-1}})} \tag{21}$$
for $i=1,\dots,s$; $j=1,\dots,r$.

Then the conditional expected value of $Z_2$ given $Z_1=ex_i$ is
$$EY_i = E(Z_2 \mid Z_1 = ex_i) = \sum_{j=1}^{r} \frac{P_{ij}}{P_{i\cdot}}\, e_{ij} \tag{22}$$
for $i=1,\dots,s$.

Let the table of proportions be $\hat{P}_{s\times r} = (\hat{P}_{ij})$, $i=1,\dots,s$; $j=1,\dots,r$. Then the covariance matrix of the vector $\hat{P}$, obtained by stacking $\hat{P}$ by row, is $B = (B_{ij})_{sr \times sr}$ with
$$B_{ij} = \begin{cases} \dfrac{1}{N}\, P_i(1-P_i), & i=j,\\[4pt] -\dfrac{1}{N}\, P_i P_j, & i \ne j, \end{cases} \tag{23}$$
where $P_i$ denotes the $i$th element of the stacked vector.
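Equation (23) is the usual multinomial covariance of the stacked cell proportions. A small helper (function name ours) also illustrates its singularity: the rows of B sum to zero because the proportions sum to one.

```python
import numpy as np

def cov_of_proportions(P, N):
    """Covariance matrix of the stacked cell proportions, eq. (23)."""
    p = np.asarray(P, float).ravel()        # stack the table by row
    return (np.diag(p) - np.outer(p, p)) / N

# illustrative 2x2 probability table with N = 100 observations
B = cov_of_proportions([[0.2, 0.3], [0.1, 0.4]], N=100)
```

Here `B[i, i] = p_i(1 - p_i)/N` and `B[i, j] = -p_i p_j / N` for `i != j`, matching (23).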

For simplicity, we ignore the derivatives with respect to the thresholds $b$ (this leads to a slightly smaller final variance). The partial derivatives can be calculated as
$$\frac{\partial EY_k}{\partial P_{ij}} = \begin{cases} 0, & k \ne i,\\[4pt] \left\{(P_{k\cdot} - P_{kj})\, e_{kj} - \displaystyle\sum_{n=1, n\ne j}^{r} P_{kn}\, e_{kn}\right\}\Big/ P_{k\cdot}^2, & k=i. \end{cases} \tag{24}$$

Let $EY = (EY_i)_{s\times 1}$, $i=1,\dots,s$. Then $D = \partial EY/\partial P = (D_{ij})_{s \times sr}$ is a block diagonal matrix,
$$D = \begin{pmatrix} D_1 & 0 & \cdots & 0\\ 0 & D_2 & \cdots & 0\\ \vdots & & \ddots & \vdots\\ 0 & 0 & \cdots & D_s \end{pmatrix},$$
where $D_i$ is the row vector of nonzero elements given by (24), of length $r$, and the covariance matrix is $\Sigma = DBD^T$. As in estimating the tetrachoric correlation coefficient, $\hat{\rho}$ can be calculated with (15), its variance with (16), and its standard error by taking the square root of the variance.
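The derivative in (24) simplifies algebraically to $(e_{kj} - EY_k)/P_{k\cdot}$ and can be checked by finite differences, holding the $e_{ij}$ fixed as the text does (the table and e values below are arbitrary illustrations):

```python
import numpy as np

# Finite-difference check of eq. (24), treating the conditional means e_ij
# as fixed. P and e are arbitrary illustrative values for a 2x2 table.
P = np.array([[0.2, 0.1], [0.3, 0.4]])
e = np.array([[-0.9, 0.5], [-0.6, 0.8]])

def EY(P):
    # EY_k = sum_j P_kj e_kj / P_k. , with P_k. the row sum (cf. eq. (22))
    return (P * e).sum(axis=1) / P.sum(axis=1)

k, j = 0, 1
row = P[k].sum()
analytic = ((row - P[k, j]) * e[k, j]
            - sum(P[k, n] * e[k, n] for n in range(2) if n != j)) / row**2

h = 1e-6
Ph = P.copy()
Ph[k, j] += h
numeric = (EY(Ph)[k] - EY(P)[k]) / h
```

The analytic value from (24) equals the simplified form `(e[k, j] - EY(P)[k]) / row` and matches the numerical derivative.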

Then the predictors $ex_i$ are updated in the following two steps. First, compute
$$EYx_j = E(Z_2 \mid b_{j-1} < Z_2 \le b_j) = \sum_{i=1}^{s} \frac{P_{ij}}{P_{\cdot j}}\, e_{ij}.$$

Second, writing $EYx_j^{a} = (a - \rho\, EYx_j)/\sqrt{1-\rho^2}$, obtain
$$ex_i = E(Z_1 \mid a_{i-1} < Z_1 \le a_i) = \sum_{j=1}^{r} \frac{P_{ij}}{P_{i\cdot}}\, ex_{ij} \tag{25}$$
for $i=1,\dots,s$, where
$$ex_{ij} = E(Z_1 \mid a_{i-1} < Z_1 \le a_i, Z_2 = EYx_j) = \rho\, EYx_j + \sqrt{1-\rho^2}\;\frac{\phi(EYx_j^{a_{i-1}}) - \phi(EYx_j^{a_i})}{\Phi(EYx_j^{a_i}) - \Phi(EYx_j^{a_{i-1}})} \tag{26}$$
for $j=1,\dots,r$.

The procedure is repeated until ρ̂ converges.

The details of the IRLS algorithm for estimating the polychoric correlation coefficient are given in Algorithm 3.

Algorithm 3

IRLS method to estimate the polychoric correlation coefficient

Input:

observed ordinal data y and x or contingency table

Output:

polychoric correlation of Y and X, s.e.(ρ̂)

1: Calculate P̂=(P̂ij),i=1,,s;j=1,,r, the covariance B̂ using (23)

2: Estimate the thresholds âi and b̂i from the cumulative marginal proportions of P̂.

3: Initialization:

4: ρ̂0 = Pearson correlation coefficient of x and y (ordinal data), 0(contingency table)

5: Initialize êxi using the formula in (20) for i=1,,s

6: Set iter $=0$, diff $=1$, $n=100$ and $\epsilon=10^{-8}$

7: while iter<n & diff>ϵ do

8: Compute eij for different categories using the formula in (21) for all i, j

9: Compute the derivative matrix $\hat{D}$ using (24), and $\hat{\Sigma}=\hat{D}\hat{B}\hat{D}^T$

10: Compute Êyi using the formula in (22) for i=1,,s

11: Compute ρ̂ using formula in (15)

12: Compute diff=|ρ̂ρ̂0|,

13: Compute EYxj,exij using the formula in (26) for all i, j

14: Update exi using the formula in (25) for i=1,,s

15: Update ρ̂0=ρ̂ and iter = iter + 1

16: end while

17: Compute var(ρ̂) using the formula in (16)

18: return the estimate $\hat{\rho}$ and $\sqrt{\mathrm{var}(\hat{\rho})}$ as the s.e. of $\hat{\rho}$.
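A Python sketch of Algorithm 3 follows (it reduces to Algorithm 2 when the table is 2×2). The function name is ours, the paper's package is in R, and the point in the loop at which the updated estimate enters (25)-(26) is one reading of step 13:

```python
import numpy as np
from scipy.stats import norm

def polychoric_irls(table, max_iter=100, eps=1e-8):
    """Sketch of Algorithm 3: polychoric (tetrachoric if 2x2) correlation
    from an s x r contingency table of observed frequencies."""
    Nmat = np.asarray(table, float)
    N = Nmat.sum()
    P = Nmat / N
    s, r = P.shape
    row, col = P.sum(axis=1), P.sum(axis=0)
    a = norm.ppf(np.concatenate(([0.0], np.cumsum(row))))  # thresholds a_0..a_s
    b = norm.ppf(np.concatenate(([0.0], np.cumsum(col))))  # thresholds b_0..b_r
    phi_a = norm.pdf(a)
    ex = (phi_a[:-1] - phi_a[1:]) / row                    # eq. (20)
    p = P.ravel()
    B = (np.diag(p) - np.outer(p, p)) / N                  # eq. (23)
    rho = 0.0                                              # initial value
    for _ in range(max_iter):
        c = np.sqrt(1 - rho**2)
        zb = (b[None, :] - rho * ex[:, None]) / c          # (b_j - rho*ex_i)/c
        e = rho * ex[:, None] + c * (norm.pdf(zb[:, :-1]) - norm.pdf(zb[:, 1:])) \
            / (norm.cdf(zb[:, 1:]) - norm.cdf(zb[:, :-1]))  # eq. (21)
        EY = (P * e).sum(axis=1) / row                      # eq. (22)
        # delta method: the nonzero entries of D in eq. (24) equal (e_ij - EY_i)/P_i.
        D = np.zeros((s, s * r))
        for i in range(s):
            D[i, i*r:(i+1)*r] = (e[i] - EY[i]) / row[i]
        Sig = D @ B @ D.T
        Si_ex = np.linalg.solve(Sig, ex)
        rho_new = (Si_ex @ EY) / (Si_ex @ ex)               # eq. (15)
        if abs(rho_new - rho) < eps:
            rho = rho_new
            break
        # update the predictors, eqs. (25)-(26); the new rho-hat is used in
        # (26), which is one reading of step 13
        EYx = (P * e).sum(axis=0) / col
        rho = rho_new
        c = np.sqrt(1 - rho**2)
        za = (a[:, None] - rho * EYx[None, :]) / c
        exij = rho * EYx[None, :] + c * (norm.pdf(za[:-1, :]) - norm.pdf(za[1:, :])) \
               / (norm.cdf(za[1:, :]) - norm.cdf(za[:-1, :]))
        ex = (P * exij).sum(axis=1) / row
    se = np.sqrt(1.0 / (Si_ex @ ex))                        # eq. (16)
    return rho, se
```

As a check, feeding in the population cell proportions of a standard bivariate normal with $\rho=0.5$ and both thresholds at zero ($P_{11} = 1/4 + \arcsin(0.5)/(2\pi) = 1/3$), the iteration settles near 0.5.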

An R package IRLSpoly is developed to implement Algorithms 1, 2, and 3. It can be installed in R using the command install_github("encoreus/IRLSpoly").

3 Simulation Study

In this section, we conduct a series of simulation studies to compare the proposed algorithm with the standard maximum likelihood method. All tests are carried out on a computer with an Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz and 64GB memory. Following Choi et al. (Citation2011), the sample size is set to N = 30, 50, 100, 500, 1000. These numbers were chosen to reflect small to moderate sample sizes that might be commonly encountered. The population correlation coefficient is set to ρ = 0, 0.2, 0.4, 0.6, and 0.8, ranging from null to moderately high. The number of categories for each ordinal variable is set to 2 (binary responses), 3, 5, and 7, that is, r = s = 2, 3, 5, 7, respectively. Each setting is replicated n = 1000 times.

We generate data from a bivariate normal distribution with correlation ρ and discretize them into categorical data, following the same procedure as described in Bollen and Barb (Citation1981), where the mean of the distribution is taken as a reference point and the variable is divided into equally spaced intervals that move away from the mean of the distribution toward the extremes. Therefore, the distribution of the categorized data gets closer to normal as the number of categories increases.
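A sketch of this discretization step with equally spaced cutpoints centered at the mean (the function name and the spacing of one standard deviation are assumptions of ours):

```python
import numpy as np

def discretize(z, s, width=1.0):
    """Cut a (standard normal) variable z into s ordinal categories 1..s,
    using equally spaced interior thresholds centred at the mean, in the
    spirit of the Bollen-Barb scheme. `width` is an assumed spacing."""
    k = s - 1                                   # number of interior cutpoints
    cuts = (np.arange(k) - (k - 1) / 2) * width # e.g. s=5 -> [-1.5,-0.5,0.5,1.5]
    return np.digitize(z, cuts) + 1
```

For example, with `s=3` the cutpoints are -0.5 and 0.5, so values well below, near, and well above the mean fall into categories 1, 2, and 3.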

To measure the performance of the two methods, we use the following criteria:
$$\begin{aligned}
\mathrm{MEAN} &= \sum_{i=1}^{n} \hat{\rho}_i / n, & \mathrm{MRB} &= \sum_{i=1}^{n} \{(\hat{\rho}_i - \rho)/\rho\}/n,\\
\mathrm{RMSE} &= \sqrt{\sum_{i=1}^{n} (\hat{\rho}_i - \rho)^2 / n}, & \mathrm{SD} &= \sqrt{\sum_{i=1}^{n} (\hat{\rho}_i - \mathrm{MEAN})^2 / n},\\
\mathrm{MSD} &= \sum_{i=1}^{n} \sqrt{\mathrm{var}(\hat{\rho}_i)} / n, &&
\end{aligned} \tag{27}$$
which, respectively, represent the Monte Carlo mean, mean relative bias, root mean squared error, and standard deviation of the estimates, and the mean of the estimated standard deviations of $\hat{\rho}$. We compare MSD with SD to verify the accuracy of the variance formula of the algorithm.
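The criteria in (27) translate directly into code (function name ours; MSD here averages the estimated standard deviations, the square roots of the estimated variances):

```python
import numpy as np

def summarize(est, var_est, rho):
    """Monte Carlo summaries of eq. (27).
    est: array of estimates rho_hat_i; var_est: their estimated variances;
    rho: the true parameter value."""
    mean = est.mean()
    mrb = np.mean((est - rho) / rho) if rho != 0 else np.nan  # undefined at rho = 0
    rmse = np.sqrt(np.mean((est - rho)**2))
    sd = np.sqrt(np.mean((est - mean)**2))
    msd = np.mean(np.sqrt(var_est))   # mean of estimated standard deviations
    return mean, mrb, rmse, sd, msd
```

For example, two estimates 0.4 and 0.6 of a true ρ = 0.5 have zero bias, RMSE and SD equal to 0.1, and MSD equal to the average reported standard error.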

For polyserial correlation, the results with true values of ρ = 0, 0.2, 0.4, 0.6, 0.8 are shown in Table 1. The function polyserial in the polycor package in R by Fox (Citation2010) is invoked to obtain the MLEs.

Table 1. Simulation results of polyserial correlation comparing the IRLS algorithm with the ML method.

It can be seen that the results of the two methods are similar, especially for ρ = 0, 0.2, 0.4. When ρ = 0.6, 0.8, for small sample sizes such as N = 30, 50, the bias of the new method is slightly larger than that of the ML method. However, as the sample size increases, it performs as well as the ML method. In addition, the mean standard deviation calculated by the new method (MSD) is very close to the Monte Carlo standard deviation (SD).

The slightly larger biases of the IRLS are mainly due to the fact that only the means of the data, instead of the whole data, are used in estimating polyserial correlations. On the other hand, the advantages of the proposed method for polyserial correlation include: (a) it is obtained from data summaries; if only summaries are reported, for example, in a report or a paper, the proposed method can still be applied to estimate the polyserial correlation, whereas the traditional ML method cannot; and (b) in the simulations, the ML method may fail to compute when n and s are small (leading to an empty category), while the proposed method still works.

For the polychoric correlation simulations, a quick two-step maximum likelihood procedure (Olsson Citation1979) is adopted by calling the polychor function in the polycor package of R. The regular ML method, in which the thresholds are estimated simultaneously, is too slow for the simulations. We use the same settings as in the simulations for polyserial correlations (Table 2).

Table 2. Simulation results of polychoric correlation comparing the regression method with the ML method.

The MEANs of the estimators of both IRLS and ML are close to the true values of ρ; IRLS can be considered unbiased. The SD of IRLS is slightly larger than that of ML, and the difference decreases as ρ, s, r, and N increase, by the law of large numbers. The difference between these SDs is negligible when ρ ≥ 0.4, N ≥ 100, or s, r ≥ 5.

The estimated standard deviation from the IRLS is smaller than the true value. However, the bias decreases as ρ, s, r, and N increase.

Because the estimator of the polychoric correlation from the IRLS algorithm is determined by the proportions, not the frequencies, of the contingency table, increasing the sample size does not reduce the speed of the estimation. In comparison, the log-likelihood function in the traditional ML methods is calculated by summing over all data points. Additionally, only single integrals are evaluated in the IRLS when the cumulative distribution function Φ is invoked, whereas the traditional ML methods need to evaluate double integrals in calculating the likelihood functions. Therefore, the IRLS is a much faster algorithm for estimating the polychoric correlation than the ML methods. Our simulations show that IRLS requires significantly less running time than the ML methods. Table 3 presents the running times of the simulation studies comparing the IRLS with the ML and two-step methods (the IRLS computes the variance but the others do not, because they are too slow), with ρ = 0.4, N = 500, and 1000 replicates. The user time is the CPU time charged for the execution of user instructions of the calling process. The system time is the CPU time charged for execution by the system on behalf of the calling process.

Table 3. Running time of IRLS and ML methods.

These results show that the IRLS method spends less running time than the ML and two-step methods. In addition, as the number of categories gets larger, the ML and two-step methods require significantly more running time, while the proposed IRLS algorithm does not slow down much.

4 Data Analysis

In this section, two real data analyses are conducted to compare the performance of the proposed method and the ML method, with running times recorded. The first dataset was obtained from Kirk (Citation1973) and is also included in the psych package as an example. The Big Five Inventory (bfi) data include 25 personality self-report items taken from 2800 subjects. The items were collected using a 6-point response scale: 1 for very inaccurate, 2 for moderately inaccurate, 3 for slightly inaccurate, 4 for slightly accurate, 5 for moderately accurate, and 6 for very accurate. Three additional demographic variables (sex, education, and age) are also included. Figure 3 shows a scatterplot matrix (SPLOM) of the first five variables, with bivariate scatterplots below the diagonal, histograms on the diagonal, and the polychoric correlation coefficients with standard errors above the diagonal. Correlation ellipses are drawn in the same graph. The red lines below the diagonal are the LOESS smoothed lines, fitting a smooth curve between two variables.

Fig. 3. Polychoric correlation coefficients of the bfi data.


It can be seen from Figure 3 that the estimates from the proposed method are very close to those from the ML method, with the largest difference being 0.016, smaller than its standard error estimated with either method.

The running times are listed in Table 4. The user time is the CPU time spent in the execution of user instructions of the calling process. The system time is the CPU time spent in execution by the system on behalf of the calling process. Clearly, the proposed method requires significantly less CPU time than the ML method in estimating the polychoric correlation coefficients.

Table 4. Running times of the IRLS, ML, and 2-steps methods.

The second dataset came from an investigation of parenteral nutrition in hospitalized patients with digestive tract tumors in China, initiated by the Committee of Cancer Rehabilitation and Palliative Care of China in 2017. In this survey, the nutritional risk and the incidence of malnutrition among newly admitted hospitalized cancer patients were assessed through nutritional risk screening and assessment. The survey consists of several questionnaires, including Quality of Life, KPS, Nutritional Risk Screening (NRS-2002), and Patient-Generated Subjective Global Assessment (PG-SGA). The data contain 23 categorical variables from 1086 patients. The proposed IRLS method and the traditional ML method were applied to these data.

Figure 4 presents a pairwise scatterplot matrix (SPLOM) of five randomly selected variables from the data, similar to the above. The two sets of estimates are again very close.

Fig. 4. Polychoric correlation coefficients of the Parenteral Nutrition data.

Table 5 presents the running times of the three competing methods. With a moderate sample size (1086) and number of variables (23), the advantage of the proposed method in speed is manifest: the running times of the ML and 2-steps methods are much longer than that of the IRLS.

Table 5. Running times of the IRLS, ML, and 2-steps methods.

5 Conclusion and Discussion

In this article, we propose a new approach to estimating the polyserial and polychoric correlation coefficients. Simulation studies and data analyses show that the proposed IRLS algorithm estimates polyserial and polychoric correlations consistently and efficiently, and it requires much less computing time than the traditional ML methods. This speed advantage may help in modern research with huge datasets to analyze, making studies of big data using polyserial or polychoric correlations feasible, such as in network data and text mining.

The basic idea of the new method is that the correlation coefficient of two continuous variables can be obtained as the slope of a derived regression model. The regression coefficient need not be estimated from the whole dataset; instead, it can be calculated from the conditional expected values, through which the regression line passes in theory. Hence, the regression coefficient, that is, the correlation coefficient, can be estimated with a derived regression model, with weights updated for the different response values. An iteratively reweighted least squares method is therefore proposed to estimate the regression coefficient. This method is in effect a generalized method of moments based on summary statistics, which always generates consistent estimators.
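A minimal numerical sketch of this idea (not the authors' implementation): discretize one of two standardized bivariate normal variables into ordinal categories, compute the conditional means of both variables within each category, and fit a weighted least squares line through those few points, with the category counts as weights. Because both variables are standardized, the slope recovers the underlying correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 100_000, 0.6

# Standardized bivariate normal pair with correlation rho
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Discretize x into 4 ordinal categories via fixed thresholds
thresholds = np.array([-1.0, 0.0, 1.0])
cat = np.digitize(x, thresholds)  # categories 0..3

# Summary statistics: conditional means and counts per category
ex = np.array([x[cat == k].mean() for k in range(4)])
ey = np.array([y[cat == k].mean() for k in range(4)])
w = np.array([(cat == k).sum() for k in range(4)])

# Weighted least squares through the 4 conditional-mean points:
# slope = sum w (ex - xbar)(ey - ybar) / sum w (ex - xbar)^2
xbar = np.average(ex, weights=w)
ybar = np.average(ey, weights=w)
slope = np.sum(w * (ex - xbar) * (ey - ybar)) / np.sum(w * (ex - xbar) ** 2)
print(round(slope, 3))  # close to rho = 0.6
```

Note that the slope is computed from the category counts and conditional means alone, which is why the approach extends to summary data; the actual polyserial/polychoric algorithms additionally replace these sample conditional means with model-based conditional expectations and iterate the weights.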

The advantage of this method is clear. It is applicable to summary data, so it can be used in meta-analyses that combine studies reporting only data summaries. Meta-analysis of polyserial and polychoric correlations will be another line of future work.

One limitation of the proposed method is that it only considers the estimation and inference of a single polychoric/polyserial correlation. In some situations, it is common to use a matrix of estimated polychoric and/or polyserial correlation coefficients as input for estimating model parameters. Additionally, an estimate of the asymptotic covariance matrix of the estimated polychorics/polyserials is used for inference, for example, to construct an overall test statistic for the correctness of the model. The methods presented in this article can certainly be extended to the multivariate case, which will be one of our future works. Another aspect of the proposed method left to future study is its robustness to the distributional assumption. We assumed that the latent variables follow an underlying bivariate normal distribution. Whether this holds is testable, but that is beyond the scope of this article. It is worthwhile to examine in future work to what degree departures from the normality assumption affect the correlation estimation.

Supplemental material

ucgs_a_2257251_sm4650.r

Download R Objects File (3.1 KB)

ucgs_a_2257251_sm4649.pdf

Download PDF (131.7 KB)

Disclosure Statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

The authors gratefully acknowledge the Natural Science Foundation of Zhejiang Province (Project No. LDT23E05015A0, titled Isogeometric Analysis and Nonlinear Time Series Modeling of Error Field in Processing Domain of CNC Machine Tool).

References

  • Albert, J. H. (1992), “Bayesian Estimation of the Polychoric Correlation Coefficient,” Journal of Statistical Computation and Simulation, 44, 47–61. DOI: 10.1080/00949659208811448.
  • Bollen, K. A., and Barb, K. H. (1981), “Pearson’s R and Coarsely Categorized Measures,” American Sociological Review, 46, 232–239. DOI: 10.2307/2094981.
  • Chen, J., and Choi, J. (2009), “A Comparison of Maximum Likelihood and Expected A Posteriori Estimation for Polychoric Correlation Using Monte Carlo Simulation,” Journal of Modern Applied Statistical Methods, 8, 32. DOI: 10.22237/jmasm/1241137860.
  • Choi, J., Kim, S., Chen, J., and Dannels, S. (2011), “A Comparison of Maximum Likelihood and Bayesian Estimation for Polychoric Correlation Using Monte Carlo Simulation,” Journal of Educational and Behavioral Statistics, 36, 523–549. DOI: 10.3102/1076998610381398.
  • Cox, N. (1974), “Estimation of the Correlation between a Continuous and a Discrete Variable,” Biometrics, 30, 171–178. DOI: 10.2307/2529626.
  • Epskamp, S., Borsboom, D., and Fried, E. I. (2018), “Estimating Psychological Networks and their Accuracy: A Tutorial Paper,” Behavior Research Methods, 50, 195–212. DOI: 10.3758/s13428-017-0862-1.
  • Fox, J. (2010), polycor: Polychoric and Polyserial Correlations, R package version 0.7-8. Online: http://www.cran.r-project.org/web/packages/polycor/index.html (accessed 31 August 2010).
  • Gilley, W. F., and Uhlig, G. E. (1993), “Factor Analysis and Ordinal Data,” Education, 114, 258–265.
  • Holgado-Tello, F. P., Chacón-Moscoso, S., Barbero-García, I., and Vila-Abad, E. (2010), “Polychoric versus Pearson Correlations in Exploratory and Confirmatory Factor Analysis of Ordinal Variables,” Quality & Quantity, 44, 153–166.
  • SAS Institute (2017), Base SAS 9.4 Procedures Guide: Statistical Procedures, SAS Institute.
  • Jöreskog, K. G. (2005), “Structural Equation Modeling with Ordinal Variables Using LISREL,” Technical report.
  • Kirk, D. B. (1973), “On the Numerical Approximation of the Bivariate Normal (tetrachoric) Correlation Coefficient,” Psychometrika, 38, 259–268. DOI: 10.1007/BF02291118.
  • Lauritzen, S. L. (1996), Graphical Models (Vol. 17), Oxford: Clarendon Press.
  • Likert, R., Roslow, S., and Murphy, G. (1934), “A Simple and Reliable Method of Scoring the Thurstone Attitude Scales,” The Journal of Social Psychology, 5, 228–238. DOI: 10.1080/00224545.1934.9919450.
  • Moustaki, I. (2000), “A Latent Variable Model for Ordinal Variables,” Applied Psychological Measurement, 24, 211–223. DOI: 10.1177/01466210022031679.
  • Muthén, B. O. (1984), “A General Structural Equation Model with Dichotomous, Ordered Categorical, and Continuous Latent Variable Indicators,” Psychometrika, 49, 115–132. DOI: 10.1007/BF02294210.
  • Muthén, L. K., and Muthén, B. O. (2005), Mplus: Statistical Analysis with Latent Variables: User’s Guide, Los Angeles: Muthén & Muthén.
  • Olsson, U. (1979), “Maximum Likelihood Estimation of the Polychoric Correlation Coefficient,” Psychometrika, 44, 443–460. DOI: 10.1007/BF02296207.
  • Olsson, U., Drasgow, F., and Dorans, N. J. (1982), “The Polyserial Correlation Coefficient,” Psychometrika, 47, 337–347. DOI: 10.1007/BF02294164.
  • Pearson, K. (1900), “Mathematical Contribution to the Theory of Evolution. VII. On the Correlation of Characters not Quantitatively Measurable,” Philosophical Transactions of the Royal Society of London, 195, 1–47.
  • Pearson, K., and Pearson, E. S. (1922), “On Polychoric Coefficients of Correlation,” Biometrika, 14, 127–156. DOI: 10.1093/biomet/14.1-2.127.
  • Revelle, W. R. (2017), “psych: Procedures for Personality and Psychological Research.” Software.
  • Ritchie-Scott, A. (1918), “The Correlation Coefficient of a Polychoric Table,” Biometrika, 12, 93–133. DOI: 10.1093/biomet/12.1-2.93.
  • Robitzsch, A. (2020), “Why Ordinal Variables Can (Almost) Always Be Treated as Continuous Variables: Clarifying Assumptions of Robust Continuous and Ordinal Factor Analysis Estimation Methods,” Frontiers in Education, 5, 589965. DOI: 10.3389/feduc.2020.589965.
  • Uebersax, J. S. (2006), “Introduction to the Tetrachoric and Polychoric Correlation Coefficients,” Retrieved from http://www.john-uebersax.com/stat/tetra.htm.
  • Uebersax, J. S. (2010), “User Guide for polycorr 1.1” (advanced version).