
An Examination of the Linking Error Currently Used in PISA


ABSTRACT

Educational large-scale assessment (LSA) studies like the Programme for International Student Assessment (PISA) provide important information about trends in educational performance indicators in cognitive domains. The change in the country means in a cognitive domain like reading between two successive assessments is an example of a trend estimate. The uncertainty of trend estimates includes sampling and linking errors, which are regularly reported in the PISA study. This article focuses on the linking error, which assesses the variability in trend estimation with respect to the choice of items. With PISA 2015, the linking error estimation method changed. This article compares the statistical behavior and the underlying concept of the new PISA linking error with those of the PISA linking error utilized until PISA 2012. It turns out that the newly proposed linking error is not a generalization of the old PISA linking error but rather reflects different aspects of uncertainty and model error.

Introduction

One major aim of international large-scale assessments (ILSAs) is to monitor changes in the levels of educational outcomes (e.g., student performance). Every three or four years, for example, the Programme for International Student Assessment (PISA) provides international comparisons of student performance in three content areas (reading, mathematics, and science; OECD, 2014). The successive assessment of these content domains makes it possible to estimate national trends within each participating country, which provides policymakers with important information for the evaluation of educational reforms. In order to be able to estimate these trends in student performance, repeated assessments need to be reported on a common scale that is comparable across time (Mazzeo & von Davier, 2014). To accomplish this task, a set of common items is repeatedly administered in each assessment, and linking methods are used to align the results from the different assessments on a common scale. When calculating trend estimates, however, it is important to identify the source and magnitude of errors affecting national trend estimation (Wu, 2010).

In order to estimate country trends in student achievement, the results from different assessments need to be linked so that the achievement scores in a respective cognitive domain can be directly compared. The general idea of a linking approach is to use a set of common items administered in more than one assessment in order to establish a common metric that makes it possible to compare the test results across the different assessments (Dorans et al., 2007; von Davier & Sinharay, 2014). Figure 1 illustrates a typical linking design used in two assessments of an ILSA study. In both assessments, a set of $I_0$ link items (also referred to as common or anchor items) is administered to a cohort of students (e.g., 15-year-old students in a country; OECD, 2014). In addition, $I_1$ and $I_2$ unique items are presented in only one of the two assessments. One advantage of including unique items in an ILSA is that they can be made publicly available for secondary analysis, and the item pool can be renewed in later assessments (Mazzeo & von Davier, 2014).

Figure 1. Linking in two PISA assessments.

In the present article, we are dealing with the quantification of uncertainty in trend estimates of country means. It has been argued that trend estimates are prone to sampling errors of persons and to linking errors due to the selection of items in successive assessments (Wu, 2010). PISA changed its linking error estimation approach with PISA 2015 (OECD, 2014, 2018). There is a lack of research on how the newly proposed PISA linking error compares to the old linking error employed until PISA 2012. The present article has two main goals. First, we clarify with simulations and analytical derivations the error concept behind the new PISA linking error. It is shown that the new PISA linking error mainly quantifies error due to country differential item functioning. Second, we demonstrate in various settings that the new PISA linking error is not a generalization of the old linking error. This message is important for practitioners because official reporting suggests that the new linking error could replace the old linking error without essentially changing the interpretation of the linking error concept.

Scaling models for trend estimation

In this section, alternative item response models (i.e., scaling models) for trend estimation are introduced. In the one-parameter logistic (1PL; Rasch, 1960) model, the item response function for item i (i = 1, …, I) in country c (c = 1, …, C) at time t = 1, 2 is given by

(1) $P_{ict}(\theta) = \Psi(\theta - b_{ict,\mathrm{model}})$, $\quad \theta \sim N(\mu_{ct}, \sigma_{ct}^2)$, $\quad t = 1, 2$.

The assumed item difficulties $b_{ict,\mathrm{model}}$ can be time point-specific or country-specific. However, item parameters can also be assumed invariant across countries, across time points, or both. Furthermore, the latent ability variable $\theta$ is normally distributed with mean $\mu_{ct}$ and standard deviation $\sigma_{ct}$. In empirical applications like the PISA study, the assumed item difficulties in the scaling model will differ to some extent from the true data-generating item difficulties $b_{ict,\mathrm{true}}$. This difference can introduce additional variability or bias in country means and country standard deviations. The 1PL model for dichotomous data or the partial credit model (Masters & Wright, 1997) for polytomous data was used in PISA until PISA 2012 (OECD, 2014).
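
To make the scaling model concrete, the following minimal R sketch simulates dichotomous responses from the 1PL model in (1) for one country at one time point. The sample size, item difficulties, and ability distribution are illustrative assumptions, not PISA specifications.

```r
# Simulate dichotomous responses from the 1PL model (illustrative values)
set.seed(1)
N <- 1000                                # persons in one country
b <- seq(-2, 2, length.out = 30)         # item difficulties b_i
theta <- rnorm(N, mean = 0, sd = 1.18)   # abilities theta ~ N(mu_ct, sigma_ct^2)
prob <- plogis(outer(theta, b, "-"))     # P(X = 1 | theta) = Psi(theta - b_i)
X <- matrix(rbinom(length(prob), 1, prob), nrow = N)  # N x 30 response matrix
```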

Since PISA 2015, the 2PL model (Birnbaum, 1968) for dichotomous items or the generalized partial credit model (Muraki, 1997) for polytomous items has been utilized. The item response function for the 2PL model is given by

(2) $P_{ict}(\theta) = \Psi\left(a_{ict,\mathrm{model}}(\theta - b_{ict,\mathrm{model}})\right)$, $\quad \theta \sim N(\mu_{ct}, \sigma_{ct}^2)$, $\quad t = 1, 2$,

where $a_{ict,\mathrm{model}}$ are item discrimination parameters.

The partial invariance assumption (Oliveri & von Davier, 2011, 2014; von Davier et al., 2019) relies on the idea that most parameters are invariant across countries and time points. That is, it holds in the 1PL model that $b_{ict,\mathrm{model}} = b_i$ with common item difficulties $b_i$. In PISA, for most of the items, item discriminations are assumed to be invariant across countries and time in the 2PL model (Joo et al., 2021; OECD, 2018, 2020).

Item fit statistics are utilized to assess whether the invariance assumption of item parameters can be justified. The mean deviation (MD; OECD, 2018) statistic for item i in a particular group g is defined as

(3) $\mathrm{MD}_{ig} = \int \left(P_{i,\mathrm{obs}}(\theta) - P_{i,\mathrm{model}}(\theta)\right) f_g(\theta) \, d\theta$,

where $P_{i,\mathrm{model}}$ is the model-implied item response function, $P_{i,\mathrm{obs}}$ is the empirical (i.e., observed) item response function, which estimates the true item response function, and $f_g$ denotes the empirical ability density in group g. The MD statistic in (3) can be interpreted as a weighted average of the difference $P_{i,\mathrm{obs}}(\theta) - P_{i,\mathrm{model}}(\theta)$ in item response functions. Note that positive and negative deviations in $P_{i,\mathrm{obs}}(\theta) - P_{i,\mathrm{model}}(\theta)$ can cancel out in the MD statistic.

In PISA, the root mean square deviation (RMSD; OECD, 2018) statistic is frequently used and defined as

(4) $\mathrm{RMSD}_{ig} = \sqrt{\int \left(P_{i,\mathrm{obs}}(\theta) - P_{i,\mathrm{model}}(\theta)\right)^2 f_g(\theta) \, d\theta}$.

The RMSD statistic only takes non-negative values and quantifies the weighted variability of the difference $P_{i,\mathrm{obs}}(\theta) - P_{i,\mathrm{model}}(\theta)$. One can show that $|\mathrm{MD}_{ig}| \le \mathrm{RMSD}_{ig}$ (Robitzsch & Lüdtke, 2020). If the true item response function is given by $\Psi(\theta - b_{i,\mathrm{true}})$ and there were no sampling errors, we would have $\mathrm{RMSD}_{ig} = |\mathrm{MD}_{ig}|$, which is the reason why we only rely on the MD statistic in this article. Computational details regarding the MD and RMSD statistics can be found in Köhler et al. (2020), Robitzsch (2022), or Tijmstra et al. (2020).
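
As a minimal sketch of how the integrals in (3) and (4) can be evaluated, the following R code approximates MD and RMSD for a 1PL item with a misspecified difficulty by summing over a discretized normal density; the function name and all numeric values are our own illustrative choices.

```r
# MD and RMSD for a 1PL item with misspecified difficulty (sketch)
md_rmsd <- function(b_model, b_true, mu_g = 0, sigma_g = 1) {
  theta <- seq(-6, 6, length.out = 601)          # quadrature grid
  w <- dnorm(theta, mu_g, sigma_g)
  w <- w / sum(w)                                # discretized density f_g
  d <- plogis(theta - b_true) - plogis(theta - b_model)  # P_obs - P_model
  c(MD = sum(d * w), RMSD = sqrt(sum(d^2 * w)))
}
md_rmsd(b_model = 0.5, b_true = 1.0)             # MD is negative here
```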

We now discuss the behavior of the MD statistic in a misspecified 1PL model. Assume that $b_{i,\mathrm{model}}$ denotes the assumed item difficulty used for scaling and $b_{i,\mathrm{true}}$ is the true data-generating item difficulty. If $f_g$ is the normal density with mean $\mu_g$ and standard deviation $\sigma_g$, Robitzsch and Lüdtke (2020) showed that the MD statistic can be approximated by

(5) $\mathrm{MD}_{ig} = \Phi\!\left(\frac{\mu_g - b_{i,\mathrm{true}}}{\sqrt{D^2 + \sigma_g^2}}\right) - \Phi\!\left(\frac{\mu_g - b_{i,\mathrm{model}}}{\sqrt{D^2 + \sigma_g^2}}\right)$, where $D = 1.701$.

Applying a Taylor expansion of (5) with respect to the item difficulty, we obtain

(6) $\mathrm{MD}_{ig} \approx \frac{1}{\sqrt{D^2 + \sigma_g^2}} \, \phi\!\left(\frac{\mu_g - b_{i,\mathrm{model}}}{\sqrt{D^2 + \sigma_g^2}}\right) \left(b_{i,\mathrm{model}} - b_{i,\mathrm{true}}\right)$.

Assuming $\sigma_g = 1$, we can determine an upper bound for $|\mathrm{MD}_{ig}|$ in (6) by evaluating $(D^2 + \sigma_g^2)^{-1/2}$ and using $\phi(x) \le \phi(0) = 0.399$:

(7) $|\mathrm{MD}_{ig}| \le 0.202 \, |b_{i,\mathrm{model}} - b_{i,\mathrm{true}}|$.

For standard deviations $\sigma_g$ larger than 1, the multiplication factor in (7) is smaller than 0.202; for standard deviations smaller than 1, it is slightly larger than 0.202. The bound in (7) can only be attained if the group density $f_g$ matches the item difficulty, that is, if $\mu_g - b_{i,\mathrm{model}} = 0$. Formulas (5), (6), and (7) can be used to translate the size of unmodelled DIF or IPD effects, expressed as the discrepancy $b_{i,\mathrm{model}} - b_{i,\mathrm{true}}$, into the size of the MD statistic. Hence, an MD statistic of 0.20 requires a deviation $|b_{i,\mathrm{model}} - b_{i,\mathrm{true}}|$ of at least 1.00. Thus, using a large MD cutoff of 0.20 is frequently very similar, if not identical, to the full invariance approach in which all parameters are assumed to be invariant across groups or countries, because very large deviations $b_{i,\mathrm{model}} - b_{i,\mathrm{true}}$ rarely occur. It can be argued that model deviations represented by the difference $b_{i,\mathrm{model}} - b_{i,\mathrm{true}}$ become less important for items with extremely low or high difficulties relative to the center of the $\theta$ distribution (see Tijmstra et al., 2020).
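
The following short R check with illustrative values confirms the approximation in (5) and the bound in (7): at $\mu_g = b_{i,\mathrm{model}}$ and $\sigma_g = 1$, the bound is approximately attained.

```r
# Numeric check of approximation (5) and bound (7) with D = 1.701
D <- 1.701; sigma_g <- 1; mu_g <- 0
b_model <- 0; b_true <- 0.2
s <- sqrt(D^2 + sigma_g^2)
MD_approx <- pnorm((mu_g - b_true) / s) - pnorm((mu_g - b_model) / s)
bound <- 0.399 / s * abs(b_model - b_true)     # ~ 0.202 * |b_model - b_true|
c(MD_approx = abs(MD_approx), bound = bound)   # both approximately 0.040
```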

Linking errors for trend estimation

In this section, a variance component model for item difficulties in the 1PL model is presented that allows the derivation of linking errors. The model follows the framework presented by Robitzsch and Lüdtke (2019). The item response function in the 1PL model for item i in study t = 1, 2 for country c is given by

(8) $\mathrm{logit} \, P(X_{ict} = 1 \,|\, \theta) = \theta - b_{ict}$,

where the item difficulties $b_{ict}$ are centered within each country c at each time point t. The country mean $\mu_{ct}$ and the country standard deviation $\sigma_{ct}$ of country c can be estimated. In most applications, statistical inference (i.e., standard errors or, more generally, uncertainty quantification) for $\mu_{ct}$ and $\sigma_{ct}$ should be conducted.

The consequences of the violation of measurement invariance (i.e., assuming item difficulties $b_{i,\mathrm{model}}$ in the scaling model that differ from the data-generating item difficulties $b_{i,\mathrm{true}}$) can be distinguished regarding three different aspects. First, items can function differently across assessments, known as item parameter drift (IPD; Hanson & Béguin, 2002; Meade et al., 2005). Second, items can function differently (differential item functioning; DIF) across countries at one time point, indicating that an item is relatively easier or more difficult for a specific country than at the international level (Camilli, 2006). These cross-national differences have been studied extensively and termed country DIF (Oliveri & von Davier, 2011, 2014). Third, country DIF can vary across time points, meaning that the relative difficulty changes across assessments (DIF × IPD; Carstensen, 2013).

We now specify a variance component model for the national item parameters $b_{ict}$ that parametrizes DIF, IPD, and DIF × IPD effects (see Robitzsch & Lüdtke, 2019):

(9) $b_{ict} = b_i + u_{ic} + v_{it} + w_{ict}$.

International item parameters $\beta_{it}$ that are used for linking the scales at the two time points (see the linking INT1–INT2 in Figure 1) are given by

(10) $\beta_{it} = b_i + v_{it}$,

where the item effects $u_{ic}$, $v_{it}$, and $w_{ict}$ have zero means and variances $\sigma_{\mathrm{DIF},c}^2 = \mathrm{Var}(u_{ic})$, $\sigma_{\mathrm{IPD}}^2 = \mathrm{Var}(v_{it})$, and $\sigma_{\mathrm{DIF}\times\mathrm{IPD},c}^2 = \mathrm{Var}(w_{ict})$. Note that the variance components of the DIF and DIF × IPD effects can be country-specific. In the following, the different approaches for assessing the linking error of the country mean trend estimate $\Delta\hat{\mu}_c = \hat{\mu}_{c2} - \hat{\mu}_{c1}$ are discussed.
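
A minimal R sketch of the variance component model in (9) and (10) follows; the numbers of items and countries and the sizes of the variance components are illustrative assumptions.

```r
# Draw national item difficulties b_ict = b_i + u_ic + v_it + w_ict (sketch)
set.seed(2)
I <- 30; C <- 30
b <- seq(-2, 2, length.out = I)                    # common difficulties b_i
u <- matrix(rnorm(I * C, 0, 0.30), I, C)           # DIF effects u_ic
v <- matrix(rnorm(I * 2, 0, 0.15), I, 2)           # IPD effects v_it
w <- array(rnorm(I * C * 2, 0, 0.10), c(I, C, 2))  # DIF x IPD effects w_ict
b_ict <- array(NA, c(I, C, 2))
for (t in 1:2) b_ict[, , t] <- b + u + v[, t] + w[, , t]
beta_it <- b + v                                   # international parameters (10)
```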

Old PISA linking error until 2012: assessment of IPD effects

Until PISA 2012, the old linking error only addressed the effects of IPD in the link items. The linking procedure in PISA used mean–mean linking (Kolen & Brennan, 2014; Monseur & Berezner, 2007), which equates the means of the international item parameters $\beta_{it}$ at t = 1 (i.e., T1) and t = 2 (i.e., T2). It should be emphasized that the item parameters $\beta_{it}$ are the same for each country c. The old PISA linking error LE that was operationally used until PISA 2012 is estimated from the empirical variance of the differences between the international item parameters at the two successive time points:

(11) $\widehat{\mathrm{LE}}{}^2 = \frac{1}{I_0(I_0 - 1)} \sum_{i=1}^{I_0} \left(\beta_{i2} - \beta_{i1}\right)^2$.

It can be shown that the expected value of the linking error estimate in (11) is given by (see Monseur & Berezner, 2007; Wu, 2010)

(12) $\mathrm{LE}^2 = E\!\left(\widehat{\mathrm{LE}}{}^2\right) = \frac{2}{I_0}\,\sigma_{\mathrm{IPD}}^2$.

Thus, the old PISA linking error is a function of the IPD variance (i.e., $\sigma_{\mathrm{IPD}}^2$) and the number of link items (i.e., $I_0$). However, it does not take into account variation in trend estimates that is due to country DIF effects (i.e., $\sigma_{\mathrm{DIF},c}^2$) or to relative difficulty changes across assessments due to DIF × IPD effects (i.e., $\sigma_{\mathrm{DIF}\times\mathrm{IPD},c}^2$). In former PISA studies, the linking error estimate in (11) was also generalized to accommodate the testlet structure of items (Monseur & Berezner, 2007; Robitzsch & Lüdtke, 2019). The reason for preferring the more complex estimate is that IPD or DIF effects are often positively correlated for items within a testlet. Hence, ignoring the testlet structure will typically result in underestimated linking errors.
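
A minimal R sketch of the estimator in (11), assuming mean–mean linked international item parameters at the two time points, could look as follows.

```r
# Old PISA linking error (11): squared standard error of the mean shift
old_linking_error <- function(beta1, beta2) {
  d <- beta2 - beta1                      # shifts of the link item parameters
  I0 <- length(d)
  sqrt(sum((d - mean(d))^2) / (I0 * (I0 - 1)))
}
# Example with simulated IPD effects: expect roughly sqrt(2/30) * 0.3 = 0.077
set.seed(3)
old_linking_error(rnorm(30, 0, 0.3), rnorm(30, 0, 0.3))
```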

A linking error formula that accounts for DIF, IPD, and DIF×IPD effects

It has been shown in a simulation study by Sachse et al. (2016) that the old PISA linking error substantially underestimates the true linking error of country mean trend estimates. Robitzsch and Lüdtke (2019) analytically derived the linking error for the trend estimate in the country means for the 1PL model based on the variance component model defined in (9) and (10). The analytical derivation assumed mean–mean linking steps, although it can be expected that the linking errors based on (concurrent) marginal maximum likelihood estimation are very similar. The linking error of the trend estimate $\Delta\hat{\mu}_c = \hat{\mu}_{c2} - \hat{\mu}_{c1}$ in the country means is given by

(13) $\mathrm{LE}^2 = \mathrm{Var}(\Delta\hat{\mu}_c) = \frac{2}{I_0}\,\sigma_{\mathrm{IPD}}^2 + \frac{I_1 + I_2}{(I_0 + I_1)(I_0 + I_2)}\,\sigma_{\mathrm{DIF},c}^2 + \frac{2I_0 + I_1 + I_2}{(I_0 + I_1)(I_0 + I_2)}\,\sigma_{\mathrm{DIF}\times\mathrm{IPD},c}^2$.

Importantly, only the first term in (13) (i.e., the variance due to IPD) is reflected in the old PISA linking error (see Equation (12)). In empirical applications, the DIF and DIF × IPD variance components also contribute substantially to the true linking error, which explains the underestimation reported by Sachse et al. (2016). It should be emphasized that the middle term in (13), which refers to the DIF variance component, disappears if there are only link items, that is, if $I_1 = I_2 = 0$. It can also be seen from the linking error formula in (13) that the linking error is a country-specific quantity. The variance components in (13) can be estimated for each country through linear mixed-effects models (Robitzsch & Lüdtke, 2019).
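
The formula in (13) can be evaluated directly. The following R sketch with illustrative inputs computes, for instance, the country-specific linking error for a pure-DIF major-minor design similar to the one used later in Study 2.

```r
# Country-specific linking error from Equation (13) (sketch)
linking_error_trend <- function(I0, I1, I2, var_ipd, var_dif, var_difipd) {
  le2 <- 2 / I0 * var_ipd +
    (I1 + I2) / ((I0 + I1) * (I0 + I2)) * var_dif +
    (2 * I0 + I1 + I2) / ((I0 + I1) * (I0 + I2)) * var_difipd
  sqrt(le2)
}
# Major-minor design with only DIF: I0 = I1 = 30, I2 = 0, sigma_DIF = 0.30
linking_error_trend(30, 30, 0, var_ipd = 0, var_dif = 0.3^2, var_difipd = 0)
```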

New PISA linking error since 2015: assessment of variability due to noninvariance

Since PISA 2015, a new linking error method for trend estimates in the country means has been established (OECD, 2018, 2020). The change in the linking error method was motivated by the assumption that the newly proposed linking error includes additional aspects in the uncertainty quantification of trend estimates (OECD, 2018). First, it assesses the effects of the choice of link items and unique items on trend estimates. In contrast, the old PISA linking error solely quantifies the IPD effects of link items. Second, it is argued that model error such as DIF that is not fully accounted for by modeling is more adequately reflected in the new linking error. Third, the new linking error assesses the effects of using different calibration samples for obtaining item parameters utilized in the scaling models. Fourth, changes in the scaling model, such as the switch from the 1PL to the 2PL model in PISA, were also claimed to be more adequately reflected in the new PISA linking error. Moreover, the new linking error is also intended to be applicable to the partial invariance approach that has been implemented since PISA 2015 (von Davier et al., 2019).

The linking error estimation relies on the recalibration (RC) approach, in which the data set of the first time point is rescaled using a different set of item parameters (Martin et al., 2012). We formally denote by $\hat{\mu}_{c1} = C(\boldsymbol{b}_{c1}, \boldsymbol{D}_{c1})$ the original estimate of the country mean for country c at T1, where $\boldsymbol{b}_{c1}$ are the item parameters for country c originally employed in the estimation of the country mean, $\boldsymbol{D}_{c1}$ denotes the data set of item responses for country c at T1, and the label C indicates the application of a calibration procedure. The recalibration typically involves the item parameters used in the scaling models at T2 (OECD, 2018; see also Martin et al., 2012). Moreover, the country mean estimate under recalibration can be based on only the link items that appear at both T1 and T2 (i.e., item set $I_0$ in Figure 1) by removing the unique items at T1 from the estimation. We formally denote by $\hat{\mu}_{c1,\mathrm{RC}} = C(\boldsymbol{b}_{c2}, \tilde{\boldsymbol{D}}_{c1})$ the country mean estimate at T1 under recalibration, where $\boldsymbol{b}_{c2}$ are the country-specific item parameters used at T2 and $\tilde{\boldsymbol{D}}_{c1} \subseteq \boldsymbol{D}_{c1}$ is a subset or the full set of item responses for country c at T1.¹ The consequence of the choice of link items and unique items is thus quantified in the difference $\hat{\mu}_{c1,\mathrm{RC}} - \hat{\mu}_{c1}$. The new PISA linking error estimate is given by (see OECD, 2018, 2020)

(14) $\widehat{\mathrm{LE}}{}^2 = \frac{1}{C}\sum_{c=1}^{C}\left(\hat{\mu}_{c1,\mathrm{RC}} - \hat{\mu}_{c1}\right)^2$.

It summarizes the differences in the country means due to different assumed item parameters and different item sets across all countries. Hence, the linking error defined in (14) applies to all countries, and no country-specific linking error estimates can be obtained with the new PISA linking error method. Notably, PISA replaces the estimated variance (or standard deviation, respectively) in (14) with a robust standard deviation estimator (OECD, 2018). In the rest of this article, we only consider the nonrobust variant in (14) because using a robust estimator in the simulation studies and analytical derivations would not change the expected values of the linking errors. Robust standard deviation estimates are only expected to substantially increase the efficiency of linking error estimates if the differences $\hat{\mu}_{c1,\mathrm{RC}} - \hat{\mu}_{c1}$ strongly deviate from normality.
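
A minimal sketch of the nonrobust variant of (14); `mu_orig` and `mu_rc` stand for hypothetical vectors of original and recalibrated country means.

```r
# New PISA linking error (14), nonrobust variant (sketch)
new_linking_error <- function(mu_orig, mu_rc) {
  sqrt(mean((mu_rc - mu_orig)^2))
}
# PISA itself replaces this nonrobust variance with a robust standard
# deviation estimator (OECD, 2018); only the nonrobust variant is used here.
```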

Assessing the total error for statistical inference of trend estimates

In all PISA studies, statistical inference for trend estimates in the country means included two error components, namely, the sampling error SE and the linking error LE (OECD, 2014, 2018; Wu, 2010). The total error TE is used to assess statistical uncertainty in trend estimates using the following formula (OECD, 2014):

(15) $\mathrm{TE} = \sqrt{\mathrm{SE}^2 + \mathrm{LE}^2}$.

Note that (15) presupposes that the two error component estimates are uncorrelated. However, the linking error can also be affected by sampling error. Additional variability in the linking error due to sampling error can be eliminated by analytical or resampling approaches (Robitzsch, 2021; Robitzsch & Lüdtke, 2019).

In this article, we ignore the impact of sampling error on linking error estimation and apply the total error formula (15). Note that PISA also ignores potential biases in linking error estimates due to sampling error, which is why we follow the approaches implemented in operational practice in this article.

Purpose

To our knowledge, there is currently no published literature on the performance of the newly proposed PISA linking error. Therefore, this article aims at understanding the behavior of the new PISA linking error compared to the old linking error used in PISA until 2012. Because it was claimed that the new PISA linking error is a generalization of the old one, we restrict ourselves to the simpler 1PL model instead of the 2PL model. It was shown in the Section “A Linking Error Formula That Accounts for DIF, IPD, and DIF × IPD Effects” that the true linking error for the country mean trend estimate depends on the three variance components DIF, IPD, and DIF × IPD and on the numbers of link and unique items. In the following two studies, we restrict the investigation of the new PISA linking error to the effects of IPD and DIF studied separately. The advantage of this approach is that the behavior of the newly proposed linking error can be understood in idealized settings. We believe that it is advantageous to try to understand the behavior of a new estimator in clearly defined but idealized settings instead of investigating complex settings that mimic real data applications because the latter confound several aspects. In our view, it is frequently impossible to gain deep insight into the performance of a statistical approach by studying complex simulation findings. In Study 1, we investigate the behavior of the new linking error under IPD with only link items, utilizing a simulation study and analytical arguments. In Study 2, the behavior of the new linking error is investigated under DIF in a test design with link items and unique items, again utilizing a simulation study and analytical arguments.

Study 1: item parameter drift in minor-minor design

The old PISA linking error only addressed IPD in link items. Study 1 investigates the performance of the new PISA linking error in minor–minor designs based on the 1PL scaling model, in which only link items and no unique items exist (see Figure 2). Only IPD and no DIF effects were considered. First, a simulation study was carried out to study the behavior of the new linking error empirically. Afterward, analytical arguments are provided to deepen the understanding of the empirical findings.

Figure 2. Minor-minor design.

Method

Our simulation study comprised C = 30 countries at two time points. We took the country means and standard deviations from the PISA reading trend between 2006 and 2009 reported by Robitzsch and Lüdtke (2019). The country means at T1 (i.e., PISA 2006) had an average of 0.00 (SD = 0.21, Min = −0.32, Max = 0.46). The country means at T2 (i.e., PISA 2009) were slightly larger on average (M = 0.05, SD = 0.25, Min = −0.59, Max = 0.52). The trend estimates in the country means, defined as the difference between the means at the two time points, had an average of 0.05 (SD = 0.17, Min = −0.53, Max = 0.34). We assumed equal country standard deviations across the two assessments for all countries (M = 1.18, SD = 0.09, Min = 0.99, Max = 1.40). The data-generating parameters can be found at https://osf.io/vq5pn/.

All countries had the same sample size N, which was varied in the simulation with factor levels N = 1000, 2500, and 5000. Item responses were simulated from the 1PL model. In total, $I_0 = 30$ link items were used, with common item difficulties $b_i$ defined equispaced in the interval [−2.0, 2.0]. IPD effects $v_{it}$ were added to the common item difficulties (see Equation (9)). A mixture distribution was chosen to simulate the IPD effects. The motivation for choosing a mixture distribution was to simultaneously accommodate unsystematic, normally distributed IPD effects as well as large IPD effects that resemble a data-generating model fulfilling the partial invariance assumption. The data-generating model for the IPD effects $v_{it}$ consists of a subset of items $J_{\mathrm{random}}$ that have unsystematic (and small) IPD effects (random IPD) and a distinct set $J_{\mathrm{bias}}$ of items with biased (and large) IPD effects (biased IPD; see Robitzsch & Lüdtke, 2022b, for a similar distinction):

(16) $v_{it} \sim (1 - \pi_{\mathrm{bias}}) \, N(0, \sigma_{\mathrm{IPD}}^2) + \pi_{\mathrm{bias}} \, F_{\mathrm{bias}}$.

Hence, unsystematic IPD is reflected in the normal mixture component $N(0, \sigma_{\mathrm{IPD}}^2)$. Outlying IPD effects are represented by a contaminating distribution $F_{\mathrm{bias}}$ that has thicker tails than $N(0, \sigma_{\mathrm{IPD}}^2)$. In order to avoid confounding bias in estimation with the assessment of variability in the linking error, we assume that the IPD effects $v_{it}$ have zero means. Hence, the mean of the distribution $F_{\mathrm{bias}}$ is zero (i.e., $\int x \, dF_{\mathrm{bias}}(x) = 0$).

In the simulation study, the proportion $\pi_{\mathrm{bias}}$ of biased IPD items was either 0 or 0.15. The distribution of biased IPD effects was chosen as a uniform distribution on $[-0.8, -0.5] \cup [0.5, 0.8]$. In addition, the standard deviation $\sigma_{\mathrm{IPD}}$ of the IPD effects was either 0.00 or 0.30. In total, there were four conditions in which random IPD and biased IPD effects were crossed (i.e., no IPD with $\pi_{\mathrm{bias}}$ = 0 and $\sigma_{\mathrm{IPD}}$ = 0.00; IPD random with $\pi_{\mathrm{bias}}$ = 0 and $\sigma_{\mathrm{IPD}}$ = 0.30; IPD biased items with $\pi_{\mathrm{bias}}$ = 0.15 and $\sigma_{\mathrm{IPD}}$ = 0.00; IPD random and biased items with $\pi_{\mathrm{bias}}$ = 0.15 and $\sigma_{\mathrm{IPD}}$ = 0.30). A sketch of how such IPD effects can be drawn is given below.
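
For illustration, the following R sketch draws IPD effects from the mixture in (16) using the simulation settings above; the function name rIPD is ours.

```r
# Draw IPD effects v_it from the mixture distribution (16) (sketch)
rIPD <- function(I, pi_bias = 0.15, sigma_ipd = 0.30) {
  v <- rnorm(I, 0, sigma_ipd)                # random IPD component
  biased <- runif(I) < pi_bias               # items receiving biased IPD
  n_b <- sum(biased)
  v[biased] <- sample(c(-1, 1), n_b, replace = TRUE) * runif(n_b, 0.5, 0.8)
  v
}
set.seed(4)
v_it <- rIPD(I = 30)                         # IPD effects for 30 link items
```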

International item parameters at T1 were obtained by applying the 1PL model to the pooled sample comprising all students in all C = 30 countries. In the next step, the pooled sample of all students at T2 was scaled using item difficulties fixed to the values obtained at T1. It was assessed by means of the MD statistic whether the item parameters at T2 could be regarded as invariant. Items that showed misfit received a unique item parameter at T2 in the scaling model, while all other items had item difficulties fixed to the corresponding estimates at T1. Four cutoff values for the absolute value of the MD statistic were used in this simulation study: 0.20, 0.12, 0.08, and 0.05. The condition with the MD cutoff of 0.20 mimics the full invariance condition because items were flagged only in rare cases. Note that in this simulation study, we only assessed item misfit at the international level (i.e., the pooled sample) and not at the country level for each of the two time points because only IPD and no DIF effects were simulated.

The standard error referring to person sampling was estimated as $\hat{\sigma}_{ct}/\sqrt{N}$, where $\hat{\sigma}_{ct}$ is the estimated standard deviation in the scaling model.² The new PISA linking error uses the item parameters from T2 for recalibrating the data at T1. With a larger MD cutoff value, more item parameters in the recalibration were equal to those used in the original calibration at T1. The total error (TE) was assessed with formula (15). We computed the mean and the standard deviation of the country mean trend estimates as well as the average of the estimated linking errors using the new PISA method. Means and standard deviations (SDs) were averaged across the results of all 30 countries.

In each of the conditions, 300 replications were carried out. The statistical software R was used for implementing the entire simulation. The R package TAM (Robitzsch et al., 2022) was employed for estimating the multiple-group 1PL models and computing the MD statistic.

Results

Across all conditions, the means of the country mean trend estimates were essentially zero. Hence, only the assessment of variability (i.e., the SD) and the estimation of the TEs are of interest in this simulation study.

In Table 1, the empirical standard deviations and the average estimated total errors are compared as a function of sample size for the different MD cutoff values of the partial invariance scaling approach. In the absence of IPD (i.e., the “No IPD” condition), the average of the estimated TEs was slightly smaller than the SDs of the trend estimates. Due to the very large sample size of the pooled sample comprising all countries, item parameters were estimated practically without sampling error, and no items were flagged under partial invariance for any of the cutoff values 0.20, 0.12, 0.08, and 0.05. Hence, no differences in SDs and estimated TEs between the cutoff values occurred.

Table 1. Simulation study 1: standard deviations and average estimated total errors for country mean trend estimates in the presence of item parameter drift ($\sigma_{\mathrm{IPD}}$ = 0.30) and no country differential item functioning ($\sigma_{\mathrm{DIF}}$ = 0.00, $\sigma_{\mathrm{DIF}\times\mathrm{IPD}}$ = 0.00) for a minor–minor design, $I_0$ = 30 link items, and C = 30 countries as a function of sample size per country (N).

Under the condition “IPD Random,” only a few items were flagged with an MD cutoff value of 0.20. In this case, the total error mostly reflects sampling error and not linking error. It can be seen that the SD of the trend estimates was smallest with the MD cutoff of 0.20. Because the IPD effects were normally distributed, some items with large IPD effects were treated as outliers. It is known from robust statistics that removing outliers from a normal distribution results in less efficient estimates. This is also demonstrated by the larger standard deviations obtained with smaller MD cutoff values. As can be seen, the variability in trend estimates is grossly underestimated by the newly proposed PISA linking error. Hence, if there is only random and unsystematic IPD, the newly proposed PISA linking error is not a generalization of the old PISA linking error, which can be estimated without bias (see Monseur & Berezner, 2007; Robitzsch & Lüdtke, 2019).³

In the condition that only involved biased IPD items (i.e., condition “IPD Biased Items”), it turned out that specifying unique item parameters for biased items increased the precision of the country mean trend estimates. For example, with N = 5000, the standard deviation decreased from 0.070 with an MD cutoff of 0.20 to 0.026 with an MD cutoff of 0.05. Notably, the smaller variability in trend estimates obtained with stricter MD cutoff values in the partial invariance approach was not reflected in the PISA linking error. The PISA linking error was low when the SD was high and vice versa. Hence, the new PISA linking error does not reflect variability in trend estimates under repeated sampling but rather the number of items that are assumed noninvariant. Consequently, the error concept behind the new PISA linking error since 2015 differs from that of the old PISA linking error used until PISA 2012.

In the presence of both random IPD and IPD due to biased items (i.e., condition “IPD Random and Biased Items”), the linking error strongly underestimated the true variability in the trend estimates of the country means. It can be concluded that the currently used linking error shows very distinct behavior from the linking error previously used in PISA. While the old linking error correctly quantifies the variability in trend estimates due to IPD variability (Monseur & Berezner, 2007), the newly proposed linking error quantifies the extent of items that are assumed noninvariant. Therefore, the new PISA linking error does not adequately capture variability in trend estimates due to IPD but constitutes a summary measure of the extent of noninvariant items. It has been argued that the latter concept refers to the comparability of scaling results across groups (Joo et al., 2021; von Davier et al., 2019). In this sense, the new PISA linking error contributes to the TE to the extent of limited comparability.

Analytical findings

We now try to gain further insight into the behavior of the new PISA linking error by providing analytical arguments. The derivations rely on substituting marginal maximum likelihood estimation of the 1PL model with unweighted least squares estimation (Cai & Moustaki, 2018; Robitzsch & Lüdtke, 2020). Hence, the findings are only approximately generalizable to operational practice, which uses maximum likelihood estimation. The estimated country mean $\hat{\mu}_{ct}$ for country c at time point t for infinite sample sizes of persons can be written as

(17) $\hat{\mu}_{ct} = \mu_{ct} - \frac{1}{I}\sum_{i=1}^{I}\left(b_{ict,\mathrm{true}} - b_{ict,\mathrm{model}}\right)$,

where $b_{ict,\mathrm{model}}$ is the assumed item difficulty in the scaling model and $b_{ict,\mathrm{true}}$ refers to the true data-generating item difficulty. Note that additional variability is introduced in (17) if the differences $b_{ict,\mathrm{true}} - b_{ict,\mathrm{model}}$ differ from zero. Linking errors quantify the variability of these differences.

We now derive the expected value of the new PISA linking error. As in the simulation study, we rely on the mixture distribution for the IPD effects (see (16)). Assume that there is no random IPD (i.e., $\sigma_{\mathrm{IPD}}^2 = 0$) and that only biased IPD effects exist. Furthermore, assume that these biased items are detected in the partial invariance approach using appropriate cutoff values for the MD statistic. The set of all items is partitioned into a set $J_{\mathrm{random}}$ that has no IPD effects and a set $J_{\mathrm{bias}}$ that has (large) IPD effects. By defining the variance of $F_{\mathrm{bias}}$ as $\kappa_{\mathrm{IPD}}^2$ (i.e., $\int x^2 \, dF_{\mathrm{bias}}(x) = \kappa_{\mathrm{IPD}}^2$), we get the true linking error due to IPD

(18) $\mathrm{LE}^2 = \frac{2\pi_{\mathrm{bias}}}{I_0}\,\kappa_{\mathrm{IPD}}^2$.

We now evaluate the new PISA linking error. The difference between the recalibrated country mean μˆc1,RC and the original country mean μˆc1 can be calculated as

(19) $\hat{\mu}_{c1,\mathrm{RC}} - \hat{\mu}_{c1} = \frac{1}{I_0}\sum_{i \in J_{\mathrm{random}}}\left(v_{i2} - v_{i1}\right) + \frac{1}{I_0}\sum_{i \in J_{\mathrm{bias}}}\left(v_{i2} - v_{i1}\right)$.

The first term on the right-hand side of (19) equals zero because, by assumption, only the items in $J_{\mathrm{bias}}$ have (large) IPD effects. For the second term, which refers to the set of biased items $J_{\mathrm{bias}}$ whose effects differ from zero, we get

(20) $E\left[\left(\hat{\mu}_{c1,\mathrm{RC}} - \hat{\mu}_{c1}\right)^2\right] = \frac{2\pi_{\mathrm{bias}}}{I_0}\,\kappa_{\mathrm{IPD}}^2$.

Hence, the result in (20) coincides with the true linking error computed in (18). As shown in the simulation study, the new PISA linking error does not assess the variability in trend estimates of the partial invariance approach under repeated sampling of items. Instead, it quantifies the extent of comparability, which coincides with the old PISA linking error only in the particular data-generating model of partial invariance. Note that this derivation rests on the strong assumptions that the IPD effects fulfill the partial invariance assumption and that the noninvariant items are all correctly flagged by the item fit statistic. A small Monte Carlo check of this result is sketched below.
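
The agreement of (20) with (18) can be checked with a small Monte Carlo experiment in R; the settings mirror the simulation above, and the number of replications is arbitrary.

```r
# Monte Carlo check that E[(mu_RC - mu_orig)^2] in (20) equals (18) (sketch)
set.seed(5)
I0 <- 30; pi_bias <- 0.15
kappa2 <- mean(runif(1e6, 0.5, 0.8)^2)     # variance of F_bias (zero mean)
diff2 <- replicate(20000, {
  biased <- runif(I0) < pi_bias            # set J_bias, fixed per replication
  v1 <- ifelse(biased, sample(c(-1, 1), I0, TRUE) * runif(I0, 0.5, 0.8), 0)
  v2 <- ifelse(biased, sample(c(-1, 1), I0, TRUE) * runif(I0, 0.5, 0.8), 0)
  mean(v2 - v1)^2                          # (mu_RC - mu_orig)^2, cf. (19)
})
c(simulated = mean(diff2), analytic = 2 * pi_bias * kappa2 / I0)
```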

We now apply mean–mean linking of the item parameters at the international level obtained from separate scaling models at T1 and T2. This approach was operationally implemented in PISA until PISA 2012 (OECD, 2014). Again, we assume infinite sample sizes of persons in order to address only the consequences of the linking error. At T1, the estimated item difficulties at the international level are $b_i + v_{i1}$. At T2, item difficulties are used that are, on average, the same as those at T1 because the mean–mean linking method is utilized. Hence, the scaling uses item difficulties $b_i + v_{i2} - \frac{1}{I_0}\sum_{j=1}^{I_0}(v_{j2} - v_{j1})$. Using (17), we get

(21) $\hat{\mu}_{c1,\mathrm{RC}} - \hat{\mu}_{c1} = \frac{1}{I_0}\sum_{i=1}^{I_0}\left[\left(b_i + v_{i2} - \frac{1}{I_0}\sum_{j=1}^{I_0}(v_{j2} - v_{j1})\right) - (b_i + v_{i1})\right] = 0$.

Consequently, under the simplifying assumptions of mean–mean linking and unweighted least squares estimation instead of maximum likelihood estimation, recalibrated country means coincide with the original country means. With maximum likelihood estimation and mean–mean linking, our empirical experience is that the difference between the two estimates also turns out very close to zero. Hence, the new PISA linking error is approximately zero under mean–mean linking, while the true linking error can differ substantially from zero.

Study 2: cross-sectional differential item functioning

In the second study, the impact of cross-sectional country DIF on the new PISA linking error is investigated. It is known that country DIF introduces additional variability in trend estimates if there are unique items in the two successive assessments. Only cross-sectional DIF with a common SD $\sigma_{\mathrm{DIF}}$ across countries and no IPD effects are considered in the simulation study and the analytical derivations.

Method

In this study, we consider a major-minor test design (see Figure 3) in which there are $I_0$ link items, $I_1$ unique items at T1, and no unique items at T2. We used the same data-generating parameters for the country means and standard deviations (see https://osf.io/vq5pn/). In total, $I_0 = 30$ link items and $I_1 = 30$ unique items were chosen. Both item sets had common equispaced item difficulties on the interval [−2, 2]. DIF effects were added to the common item difficulties. The same mixture distribution as for the IPD effects in Simulation Study 1 was utilized. Random DIF was simulated with an SD $\sigma_{\mathrm{DIF}}$ of either 0 or 0.30. Again, biased DIF effects were uniformly drawn from $[-0.8, -0.5] \cup [0.5, 0.8]$ with a proportion $\pi_{\mathrm{bias}}$ of either 0 or 0.15. Item misfit was detected at the country level separately at T1 and T2 using the four cutoff values 0.20, 0.12, 0.08, and 0.05. As in Simulation Study 1, there are four types of DIF conditions in which the occurrence of random DIF and biased DIF was crossed.

Figure 3. Major-minor design.

First, the item responses at T1 were scaled with the 1PL model using a pooled sample comprising all students. Next, the MD item fit statistics were assessed for each country at T1. If the absolute value of the MD statistic exceeded the prespecified cutoff, a unique item parameter for this country was introduced. All remaining items had item difficulties fixed to the values obtained on the international metric. Second, the item responses at T2 were scaled separately with the 1PL model using the pooled sample. The two assessments were linked onto a common international metric using the mean–mean linking approach. As at T1, the linked item parameters at T2 were utilized for assessing item fit at the country level, and items whose MD statistic exceeded the prespecified cutoff value received unique item parameters. In assessing the new PISA linking error, the recalibrated country means relied only on the 30 link items; the 30 unique items at T1 were omitted from the computation of the recalibrated country mean.

As in Simulation Study 1, the mean and the SD of the trend estimates and the average TE were computed. In total, 300 replications were conducted, using the R software (R Core Team, 2022) and the R package TAM (Robitzsch et al., 2022).

Results

In Table 2, the empirical standard deviations and the average estimated total errors are presented as a function of sample size N. The condition in which DIF was absent (i.e., condition “No DIF”) showed the same pattern as the one observed in the absence of IPD (see Table 1).

Table 2. Simulation study 2: standard deviations and average estimated total errors for country mean trend estimates in the presence of country differential item functioning ($\sigma_{\mathrm{DIF}}$ = 0.50) and no item parameter drift ($\sigma_{\mathrm{IPD}}$ = 0.00, $\sigma_{\mathrm{DIF}\times\mathrm{IPD}}$ = 0.00) for a major-minor design, $I_0$ = 30 link items, $I_1$ = 30 unique items at T1, and C = 30 countries as a function of sample size per country (N).

In the presence of random cross-sectional DIF (i.e., condition “DIF Random”), it can be seen that it is useful to flag items with large DIF effects and assign them unique item parameters in order to reduce the variability in trend estimates. Note that this property only applies to major-minor designs in which unique items exist. Interestingly, the newly proposed PISA linking error approximately reflects the observed variability in trend estimates due to country DIF in major-minor designs.

If there is only DIF due to biased items (i.e., condition “DIF Biased Items”), the new PISA linking error captures the variability due to DIF even better. Items with large cross-sectional DIF effects are essentially removed from the analysis, which resulted in decreased variability of the trend estimates.

Finally, if both random DIF and biased DIF were present (i.e., condition “DIF Random and Biased Items”), using stricter MD cutoff values again reduced the variability in trend estimates. Also, the new PISA linking error successfully assessed the observed variability in the country mean trend estimates.

It should be emphasized that the successful behavior of the new PISA linking error depends on the decision to base the recalibrated country means only on link items and not on unique items. Linking errors were underestimated when unique items were also used in the recalibration approach.

Analytical findings

We now validate the findings of the simulation study with analytical results. We assume that only country DIF effects and no IPD effects exist. The difference between the recalibrated and the original country mean can be determined as

(22) $\hat{\mu}_{c1,\mathrm{RC}} - \hat{\mu}_{c1} = \frac{1}{I_0}\sum_{i=1}^{I_0} u_{ic} - \frac{1}{I_0 + I_1}\sum_{i=1}^{I_0 + I_1} u_{ic} = \frac{I_1}{I_0(I_0 + I_1)}\sum_{i=1}^{I_0} u_{ic} - \frac{1}{I_0 + I_1}\sum_{i=I_0+1}^{I_0+I_1} u_{ic}$.

By taking the variance in (22), we get

(23) $E\left[\left(\hat{\mu}_{c1,\mathrm{RC}} - \hat{\mu}_{c1}\right)^2\right] = \frac{I_1}{I_0(I_0 + I_1)}\,\sigma_{\mathrm{DIF},c}^2$.

The expected value of the square of the linking error estimate is given by

(24) $E\!\left(\widehat{\mathrm{LE}}{}^2\right) = \frac{1}{C}\sum_{c=1}^{C} E\left[\left(\hat{\mu}_{c1,\mathrm{RC}} - \hat{\mu}_{c1}\right)^2\right] = \frac{I_1}{I_0(I_0 + I_1)}\,\bar{\sigma}_{\mathrm{DIF}}^2$,

where $\bar{\sigma}_{\mathrm{DIF}}^2 = C^{-1}\sum_{c=1}^{C}\sigma_{\mathrm{DIF},c}^2$ denotes the average DIF variance across countries. The quantity in (24) corresponds to the linking error obtained from the variance component model in Robitzsch and Lüdtke (2019) (see Equation (13)) if $I_2 = 0$, no IPD and DIF × IPD variance components exist, and the DIF variance components are equal across countries. Thus, the new PISA linking error captures variability in trend estimates due to country DIF. Notably, the DIF variance component is not reflected in the old PISA linking error, so including this property is an advantage of the new PISA linking error. A Monte Carlo check of (23) is sketched below.
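
Equation (23) can also be verified numerically. The following R sketch simulates DIF effects for a single country under the Study 2 design with illustrative values.

```r
# Monte Carlo check of Equation (23) for one country (sketch)
set.seed(6)
I0 <- 30; I1 <- 30; sigma_dif <- 0.30
diff2 <- replicate(50000, {
  u <- rnorm(I0 + I1, 0, sigma_dif)        # DIF effects u_ic at T1
  (mean(u[1:I0]) - mean(u))^2              # cf. Equation (22)
})
c(simulated = mean(diff2),
  analytic = I1 / (I0 * (I0 + I1)) * sigma_dif^2)
```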

The analytical finding in (24) can also be used to investigate the adequacy of the new PISA linking error in general test designs in which unique items exist at both T1 and T2 (i.e., $I_1 > 0$ and $I_2 > 0$). The multiplication factor of the DIF variance in the new PISA linking error in (24) is given by $I_1 I_0^{-1}(I_0 + I_1)^{-1}$. The true linking error (see Equation (13)) has a multiplication factor of $(I_1 + I_2)(I_0 + I_1)^{-1}(I_0 + I_2)^{-1}$. The multiplication factor of the difference between the true linking error and the new PISA linking error is given by

(25) $\frac{I_2(I_0 - I_1)}{I_0(I_0 + I_1)(I_0 + I_2)}$.

Hence, the new PISA linking error correctly captures DIF effects if there are no unique items at T2 (i.e., $I_2 = 0$) or if the number of link items $I_0$ equals the number of unique items $I_1$ at T1.

Discussion

In this article, we studied the behavior of the newly proposed linking error using two simulation studies and analytical arguments that rely on the 1PL model. It turned out that the new linking error does not satisfactorily assess variability in trend estimates due to IPD effects. Hence, it does not provide a generalization of the old PISA linking error, which (only) assesses variability due to IPD. In the case of IPD, the only exception in which the new PISA linking error would prove useful might be the rare constellation in which the IPD effects exactly follow the partial invariance assumption and the items with IPD effects are correctly flagged by item fit statistics.

Interestingly, the new PISA linking error satisfactorily captures variability in trend estimates due to country DIF effects. This finding was confirmed in the simulation study and an analytical derivation. It should be emphasized that the procedure must be adapted for minor-major test designs in which unique items appear only at T2. In this case, the linking error must be based on comparing recalibrated and originally estimated country means at T2 instead of T1.

If both IPD and DIF are present, it can be concluded that the new PISA linking error results in biased estimates of the variability. However, in our experience, DIF effects are much more important than IPD effects in empirical studies like PISA, so the new method will, fortunately, probably not result in strongly biased linking error estimates in practice. Nevertheless, practitioners and applied researchers should be aware that the newly proposed linking error depends on the extent of items assumed noninvariant. Hence, the estimated linking error depends on the arbitrarily chosen cutoff for the item fit statistic (see Robitzsch, 2022). Consequently, the new PISA linking error might be regarded as a measure of comparability rather than of the variability of scale scores in trend estimates.

We want to reiterate that the new PISA linking error estimates an error of zero if all items are assumed invariant. It is hard to see why a linking error should be exactly zero simply because invariance is assumed. In contrast, we are convinced that linking errors also exist in scaling models that assume invariant item parameters. We would like to point out that the extent of linking error is unrelated to whether one uses concurrent calibration, mean–mean linking, or scaling under partial invariance. Linking errors refer to variability regarding the choice of items. It is not essential whether deviations are modeled using the partial invariance approach or mean–mean linking, or remain unmodelled in a misspecified concurrent calibration.

The recalibration approach is computationally feasible for trend estimates in the country means. However, reporting linking errors for trend estimates in other parameters of interest, such as standard deviations, percentiles, or proportions, could also be warranted. In this case, it might be more general to apply resampling methods over item groups, such as the jackknife (Monseur & Berezner, 2007; Robitzsch & Lüdtke, 2022a; Xu & von Davier, 2010) or balanced half-sampling (Robitzsch, 2021); a generic sketch of such a jackknife is given below. Plausible values could be drawn for each subsample of items in the resampling approach. This approach would provide a simple computational device for calculating linking errors for arbitrary parameters of interest. We have argued in Robitzsch and Lüdtke (2019) that linking errors should also be reported for cross-sectional statistics because country DIF also impacts cross-sectional quantities such as means, standard deviations, or percentiles. We find it inconsistent that linking errors in PISA are only (or primarily) reported for trend estimates in country means but are ignored for all other trend estimate parameters and cross-sectional descriptive statistics.
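
As a generic illustration of the suggested item-group jackknife, consider the following R sketch; `estimate_stat` is a hypothetical user-supplied function that returns the statistic of interest (e.g., a trend estimate) for a given subset of item indices.

```r
# Jackknife linking error over item groups (testlets) (sketch)
jackknife_le <- function(item_groups, estimate_stat) {
  groups <- unique(item_groups)
  G <- length(groups)
  est_jk <- sapply(groups, function(g) {
    estimate_stat(which(item_groups != g))   # leave one item group out
  })
  sqrt((G - 1) / G * sum((est_jk - mean(est_jk))^2))
}
```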

Notably, we restricted ourselves to analytical and simulation evidence on the different performance of the two linking errors based on the 1PL model. However, the conceptual differences also transfer to the 2PL model. In Robitzsch (2023), a linking error formula similar to, but more intricate than, Equation (13) was obtained for the trend estimate with log-mean-mean linking based on the 2PL model, utilizing M-estimation theory. Further simulation research could be carried out to demonstrate that the newly proposed PISA linking error also differs substantially from the 2PL linking error in Robitzsch (2023), which includes DIF in item intercepts and item slopes (i.e., situations of uniform or nonuniform DIF). Nevertheless, we believe that linking errors for any statistic of interest should more generally be obtained through resampling methods over items or item groups (i.e., testlets).

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1. If all items were used for computing the recalibrated country mean, linking errors would become very small whenever the number of link items is much smaller than the number of unique items at T1. The reason is that the same item parameters are used for the unique items. Hence, only the link items can receive different item parameters in the recalibration, and only these items can affect the estimated linking error.

2. Using $\hat{\sigma}_{ct}/\sqrt{N}$ instead of the log-likelihood-based standard error results in a slight underestimation of the true standard error regarding person sampling. However, this article mainly addresses the assessment of linking errors, and the bias in standard errors vanishes for large sample sizes N of persons per country.

3. We do not present the old PISA linking error in this simulation.

References

  • Birnbaum, A. (1968). Some latent trait models. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–479). Addison-Wesley.
  • Cai, L., & Moustaki, I. (2018). Estimation methods in latent variable models for categorical outcome variables. In P. Irwing, T. Booth, & D. J. Hughes (Eds.), The Wiley handbook of psychometric testing: A multidisciplinary reference on survey, scale and test (pp. 253–277). Wiley. https://doi.org/10.1002/9781118489772.ch9
  • Camilli, G. (2006). Test fairness. In R. L. Brennan (Ed.), Educational measurement (pp. 221–256). Praeger Publisher.
  • Carstensen, C. H. (2013). Linking PISA competencies over three cycles – results from Germany. In M. Prenzel, M. Kobarg, K. Schöps, & S. Rönnebeck (Eds.), Research on PISA (pp. 199–213). Springer. https://doi.org/10.1007/978-94-007-4458-5_12
  • Dorans, N. J., Pommerich, M., & Holland, P. W. (Eds.). (2007). Linking and aligning scores and scales. Springer. https://doi.org/10.1007/978-0-387-49771-6
  • Hanson, B. A., & Béguin, A. A. (2002). Obtaining a common scale for item response theory item parameters using separate versus concurrent estimation in the common-item equating design. Applied Psychological Measurement, 26(1), 3–24. https://doi.org/10.1177/0146621602026001001
  • Joo, S. H., Khorramdel, L., Yamamoto, K., Shin, H. J., & Robin, F. (2021). Evaluating item fit statistic thresholds in PISA: Analysis of cross-country comparability of cognitive items. Educational Measurement: Issues and Practice, 40(2), 37–48. https://doi.org/10.1111/emip.12404
  • Köhler, C., Robitzsch, A., & Hartig, J. (2020). A bias-corrected RMSD item fit statistic: An evaluation and comparison to alternatives. Journal of Educational and Behavioral Statistics, 45(3), 251–273. https://doi.org/10.3102/1076998619890566
  • Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking. Springer. https://doi.org/10.1007/978-1-4939-0317-7
  • Martin, M. O., Mullis, I. V. S., Foy, P., Brossman, B., & Stanco, G. M. (2012). Estimating linking error in PIRLS. IERI Monograph Series: Issues and Methodologies in Large-Scale Assessments, 5, 35–47. https://bit.ly/2Vx3el8
  • Masters, G. N., & Wright, B. D. (1997). The partial credit model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 101–121). Springer. https://doi.org/10.1007/978-1-4757-2691-6_6
  • Mazzeo, J., & von Davier, M. (2014). Linking scales in international large-scale assessment. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment (pp. 229–258). CRC Press.
  • Meade, A. W., Lautenschlager, G. J., & Hecht, J. E. (2005). Establishing measurement equivalence and invariance in longitudinal data with item response theory. International Journal of Testing, 5(3), 279–300. https://doi.org/10.1207/s15327574ijt0503_6
  • Monseur, C., & Berezner, A. (2007). The computation of equating errors in international surveys in education. Journal of Applied Measurement, 8, 323–335. https://bit.ly/2WDPeqD
  • Muraki, E. (1997). A generalized partial credit model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 153–164). Springer. https://doi.org/10.1007/978-1-4757-2691-6_9
  • OECD. (2014). PISA 2012 technical report. OECD Publishing. https://www.oecd.org/pisa/pisaproducts/pisa2012technicalreport.htm
  • OECD. (2018). PISA 2015 technical report. OECD Publishing. https://www.oecd.org/pisa/data/2015-technical-report/
  • OECD. (2020). PISA 2018 technical report. OECD Publishing. https://www.oecd.org/pisa/data/pisa2018technicalreport/
  • Oliveri, M. E., & von Davier, M. (2011). Investigation of model fit and score scale comparability in international assessments. Psychological Test and Assessment Modeling, 53(3), 315–333. https://bit.ly/3k4K9kt
  • Oliveri, M. E., & von Davier, M. (2014). Toward increasing fairness in score scale calibrations employed in international large-scale assessments. International Journal of Testing, 14, 1–21. https://doi.org/10.1080/15305058.2013.825265
  • Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Danish Institute for Educational Research.
  • R Core Team. (2022). R: A language and environment for statistical computing. https://www.R-project.org/
  • Robitzsch, A. (2021). Robust and nonrobust linking of two groups for the Rasch model with balanced and unbalanced random DIF: A comparative simulation study and the simultaneous assessment of standard errors and linking errors with resampling techniques. Symmetry, 13(11), 2198. https://doi.org/10.3390/sym13112198
  • Robitzsch, A. (2022). Statistical properties of estimators of the RMSD item fit statistic. Foundations, 2(2), 488–503. https://doi.org/10.3390/foundations2020032
  • Robitzsch, A. (2023). Linking error in the 2PL model. J, 6(1), 58–84. https://doi.org/10.3390/j6010005
  • Robitzsch, A., Kiefer, T., & Wu, M. (2022). TAM: Test analysis modules. R package version 4.1-4. http://CRAN.R-project.org/package=TAM
  • Robitzsch, A., & Lüdtke, O. (2019). Linking errors in international large-scale assessments: Calculation of standard errors for trend estimation. Assessment in Education: Principles, Policy & Practice, 26(4), 444–465. https://doi.org/10.1080/0969594X.2018.1433633
  • Robitzsch, A., & Lüdtke, O. (2020). A review of different scaling approaches under full invariance, partial invariance, and noninvariance for cross-sectional country comparisons in large-scale assessments. Psychological Test and Assessment Modeling, 62(2), 233–279. https://bit.ly/3ezBB05
  • Robitzsch, A., & Lüdtke, O. (2022a). Comparing different trend estimation approaches in international large-scale assessment studies. OSF Preprints. Retrieved November 12, 2022, from https://doi.org/10.31219/osf.io/u8kf5
  • Robitzsch, A., & Lüdtke, O. (2022b). Mean comparisons of many groups in the presence of DIF: An evaluation of linking and concurrent scaling approaches. Journal of Educational and Behavioral Statistics, 47(1), 36–68. https://doi.org/10.3102/10769986211017479
  • Sachse, K. A., Roppelt, A., & Haag, N. (2016). A comparison of linking methods for estimating national trends in international comparative large-scale assessments in the presence of cross-national DIF. Journal of Educational Measurement, 53(2), 152–171. https://doi.org/10.1111/jedm.12106
  • Tijmstra, J., Bolsinova, M., Liaw, Y. L., Rutkowski, L., & Rutkowski, D. (2020). Sensitivity of the RMSD for detecting item-level misfit in low-performing countries. Journal of Educational Measurement, 57(4), 566–583. https://doi.org/10.1111/jedm.12263
  • von Davier, M., & Sinharay, S. (2014). Analytics in international large-scale assessments: Item response theory and population models. In L. Rutkowski, L. M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment: Background, technical issues, and methods of data analysis (pp. 155–174). CRC Press. https://doi.org/10.1201/b16061-12
  • von Davier, M., Yamamoto, K., Shin, H. J., Chen, H., Khorramdel, L., Weeks, J., Davis, S., Kong, N., & Kandathil, M. (2019). Evaluating item response theory linking and model fit for data from PISA 2000–2012. Assessment in Education: Principles, Policy & Practice, 26(4), 466–488. https://doi.org/10.1080/0969594X.2019.1586642
  • Wu, M. (2010). Measurement, sampling, and equating errors in large-scale assessments. Educational Measurement: Issues and Practice, 29, 15–27. https://doi.org/10.1111/j.1745-3992.2010.00190.x
  • Xu, X., & von Davier, M. (2010). Linking errors in trend estimation in large-scale surveys: A case study (ETS Research Report No. RR-10-10). ETS. https://doi.org/10.1002/j.2333-8504.2010.tb02217.x