Methodological Studies

Empirical Benchmarks to Interpret Intervention Effects on Student Achievement in Elementary and Secondary School: Meta-Analytic Results from Germany

Pages 119-157 | Received 02 Apr 2022, Accepted 03 Jan 2023, Published online: 21 Feb 2023

Abstract

To assess the meaningfulness of an intervention effect on students’ achievement, researchers may apply empirical benchmarks as standards for comparisons, involving normative expectations for students’ academic growth as well as performance gaps between weak and average schools or policy-relevant groups (e.g., male and female students, students from socioeconomically advantaged or disadvantaged families, students with or without a migration background). Previous research made these empirical benchmarks available for students in the United States. We expand this line of research by providing novel meta-analytic evidence on these empirical benchmarks for students attending elementary and secondary schools in Germany for a broad variety of achievement outcomes. Drawing on the results obtained for large probability samples, we observed variations in each kind of benchmark across countries as well as across and within domains and student subpopulations within Germany. Thus, the assessment of the very same intervention effect may depend on the target population and outcome of the intervention. We offer guidelines and illustrations for applying empirical benchmarks to assess the magnitude of intervention effects.

Imagine a research team conducted a randomized experiment to assess the effect of a whole-school intervention on fourth graders’ achievement in elementary school. Drawing on a pre-posttest design with data from 2,000 students from 50 schools, the intervention was shown to improve students’ mathematics achievement with a standardized mean difference of d = 0.15 (95% CI [0.05, 0.25]) relative to a practice-as-usual control group. How should the team assess the magnitude of this effect? Is it small, medium, or even large? Such questions are challenging for every researcher who takes reporting standards seriously and aims to convey the meaningfulness of findings to a broader audience comprising other researchers, practitioners, and policymakers. For example, the American Educational Research Association (AERA) standards recommend that “for each of the statistical results that is critical to the logic of the design and analysis, there should be included […] a qualitative interpretation of the index of the effect that describes its meaningfulness in terms of the questions the study was intended to answer” (American Educational Research Association [AERA], 2006, p. 37).

In the opening example, the standardized mean difference is an “index of the effect” and its 95% CI provides a set of plausible values for the corresponding population value (Cumming, 2014). However, neither the value d = 0.15 itself nor its confidence interval is informative about the meaningfulness of the intervention effect. Unfortunately, as shown by the review by Peng et al. (2013), many researchers do not interpret the meaningfulness of their effect sizes at all or they draw on the conventional guidelines by Cohen (1988) to interpret values of d = 0.20 as “small,” values of d = 0.50 as “medium,” and values of d = 0.80 as “large” effects. Researchers also often uncritically apply the threshold value of d ≥ 0.25 (Tallmadge, 1977) to evaluate intervention effects as demonstrating “educational significance” (Bloom et al., 2008, p. 295). However, such guidelines or threshold values are not universally applicable but rather—as emphasized by Cohen (1988) himself and several other scholars (Bloom et al., 2008; Konstantopoulos & Hedges, 2008; Kraft, 2020; Lipsey et al., 2012)—need to be qualified by the research context in question.

Several approaches have been developed that help put educational intervention effects into context (Baird & Pane, 2019; Konstantopoulos & Hedges, 2008; Kraft, 2020; Lipsey et al., 2012; Valentine et al., 2019). One approach is to compare the intervention effects with regard to cost and scalability (e.g., Kraft, 2020; Lipsey et al., 2012). Another broad class of approaches—which is also the focus of the present paper—compares the effects of educational interventions with empirical benchmarks that serve as reference values. Several types of empirical benchmarks have been provided, involving (a) normative expectations for academic growth in students’ achievement (Bloom et al., 2008; Lee et al., 2019; Scammacca et al., 2015), (b) typical effects of educational interventions (Hill et al., 2008; Kraft, 2020; Lipsey et al., 2012), (c) performance gaps between policy-relevant groups (e.g., male vs. female students, socioeconomically disadvantaged vs. privileged students, students with different racial and ethnic identities), and (d) performance gaps between low- and average-performing schools (Bloom et al., 2008; Hill et al., 2008; Konstantopoulos & Hedges, 2008; Lipsey et al., 2012).

Putting intervention effects into context requires matching the empirical benchmarks not only to the nature of the intervention, but also to its target population and outcome (Bloom et al., 2008; Hill et al., 2008; Lipsey et al., 2012). Previous studies that provided empirical benchmarks have focused on student populations in the United States. It is unknown how well extant empirical benchmarks generalize to student populations in other countries. Important research objectives are therefore not only to learn about general trends of empirical benchmark values, but also to offer benchmarks that are specifically tailored to the educational contexts of other countries. To this end, the present paper provides new meta-analytic evidence on empirical benchmarks for students attending elementary and secondary schools in Germany for a broad variety of outcome domains (e.g., mathematics, science, information and communication technology [ICT], and first and second language skills) (Footnote 1). To obtain reliable, precise, and generalizable benchmarks (Findley et al., 2021; Shadish et al., 2002), we meta-analytically integrate the results from analyses with individual participant data (IPD) and/or published research reports obtained from large probability samples.

This paper is structured as follows. We begin by describing key characteristics of the German school system to contextualize the empirical benchmark values that we provide for general and more specific student populations in Germany. The paper is then organized in three parts, elaborating on empirical benchmarks that are based on (1) students’ academic growth, (2) performance gaps between student demographic groups, and (3) performance gaps between weak and average schools. In each part, we delineate the rationale for the respective benchmark, key findings from previous research, and the analysis methods applied, and we conclude with a short presentation and discussion of the results obtained with student samples in Germany. Finally, we discuss and illustrate several guidelines to align the benchmarks with the nature as well as the population and outcome of the intervention. Our paper is accompanied by extensive online supplementary materials (OSM) in the Open Science Framework (Footnote 2), including further illustrations of how to apply the empirical benchmarks (supplemental OSM A.1), details on the applied methods (e.g., treatment of missing data, estimation of standard errors for effect sizes; supplemental OSM A.2, A.3, A.4), the R code (R Core Team, 2021) for reproducibility and replicability, and detailed tables of empirical benchmarks (e.g., from previous research, for specific school types and federal states) that are not presented in the main text (supplemental OSM B, C, and D).

Key Characteristics of the German School System

Empirical benchmarks to interpret the effect sizes of educational interventions should be tied to the target population (Bloom et al., 2008; Lipsey et al., 2012). In educational research and policy analyses, the target population is often defined by key characteristics of the school system where the intervention is carried out. In the present study we focus on the German school system, which is strongly characterized by Germany’s constitution as a federal republic with 16 federal states. Each state has the primary responsibility for legislation and administration for its educational system (Secretariat of the Standing Conference of the Ministers of Education and Cultural Affairs of the Länder in the Federal Republic of Germany [KMK], 2019). Consequently, there are important differences between the school systems across the 16 states, but there are also several commonalities. In particular, at the elementary level, all students attend the same type of elementary school across all 16 federal states (“Grundschule”). At the secondary level, Germany’s school system is characterized—like that of many other countries (Salchegger, 2016)—by tracking into different school types that cater to students with different performance levels. Typically, five major school types are distinguished to categorize the landscape of schools in Germany: the academic track school (“Gymnasium”; up to Grade 12 or 13), vocational school (“Hauptschule”; up to Grade 9 or 10), intermediate school (“Realschule”; up to Grade 10), multitrack school (“Schulen mit mehreren Bildungsgängen”; up to Grade 9, 10, 12, or 13), and comprehensive school (“Gesamtschule”; up to Grade 12 or 13). Notably, all federal states offer schools in the academic track, but they vary with respect to the other school types. In the remainder of this article, we will therefore subsume the other school types under the umbrella term “nonacademic track” to refer to this broad class of schools.

Furthermore, the educational system in Germany can be characterized by four major educational stages. Elementary education spans grade levels 1–4 for most states (except in Berlin and Brandenburg, where it includes grade levels 1–6). At the secondary level, three educational stages underlie different educational paths with their respective leaving certificates (KMK, 2019). Attending school up to grade level 9 or 10 in lower secondary level education prepares students to obtain the lower secondary school leaving certificate and the intermediate secondary school leaving certificate, respectively. Attending upper secondary education (which spans grade levels 10–12 in most states) prepares students to obtain the general higher education entrance qualification. Of note, all students in upper secondary education are taught in core subjects involving mathematics, German, and one of the natural sciences (i.e., biology, chemistry, or physics). However, in many federal states, students in upper secondary education can specialize and choose to learn these subjects in courses that are taught at either a basic level or an increased level of academic standards (KMK, 2019).

Part 1: Normative Expectations for Students’ Academic Growth

Theoretical Background and Previous Research

Rationale for Using This Benchmark

Student achievement is the outcome of long-term, cumulative domain-specific processes of knowledge and skill acquisition (Baumert et al., 2009). Growth in student achievement results from students’ engagement in learning opportunities in school and—in addition—from their learning experiences outside of school (e.g., with friends and family) as well as natural cognitive development (Baumert et al., 2009; Bloom et al., 2008; Lipsey et al., 2012). Growth estimates of students’ achievement therefore provide a good approximation of how much students gain in knowledge and skills in a certain domain during a certain time span without any additional educational intervention (Hill et al., 2008). Thus, normative expectations of students’ academic growth are very helpful in assessing the meaningfulness of intervention effects: they indicate how much an intervention improves students’ academic growth over and above what would have occurred during the year without it.
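To make the arithmetic concrete, consider the opening example (d = 0.15) against a purely hypothetical annual growth benchmark; the value of 0.40 below is illustrative only and is not one of the benchmarks reported in this paper.

```r
d <- 0.15          # intervention effect from the opening example
es_growth <- 0.40  # hypothetical annual growth benchmark (illustrative only)
d / es_growth      # = 0.375: about 37.5% of a typical school year's learning
```

Read this way, the very same d = 0.15 would look far more impressive against a grade transition with small normal growth than against one where students typically gain a full standard deviation.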

Previous Research

Drawing on samples with US students, three studies provided comprehensive collections of empirical benchmarks for normative expectations for academic growth using standardized achievement tests (Bloom et al., 2008; Lee et al., 2019; Scammacca et al., 2015). The major results of these studies can be summarized as follows. First, all three studies found the same basic growth pattern irrespective of whether they used cross-sectional or longitudinal data. More specifically, average standardized learning gains (ESGrowth; see Equation 1) decreased in all domains as students moved from Kindergarten to Grade 12 (see also Dadey & Briggs, 2012; Kuhfeld & Soland, 2021). Supplemental Figure A.4 illustrates this basic pattern for the year-to-year growth in achievement (estimated with cross-sectional data) as reported in the widely cited papers by Hill et al. (2008) and Bloom et al. (2008). Second, the basic pattern that was observed for the average student was also found for students scoring at the low end of the achievement distribution (i.e., below the 10th, 25th, and 50th percentiles; Bloom et al., 2008; Scammacca et al., 2015) and for students with low socioeconomic status (SES; Bloom et al., 2008). Third, the basic pattern also showed some variation across student subpopulations, domains, and domain-specific tests. For example, in the study by Bloom et al. (2008), average learning gains from Grade 3 to 4 in mathematics were ESGrowth = 0.52 and thus larger than in reading (ESGrowth = 0.36) or science (ESGrowth = 0.37). In higher grade levels, average learning gains were comparable across domains (Bloom et al., 2008; Lee et al., 2019; Scammacca et al., 2015), but growth estimates obtained for different tests showed notable variation around the corresponding average. For example, the average growth estimate in science from Grade 10 to 11 was ESGrowth = 0.15, with the estimates obtained for individual tests ranging from a loss of achievement of ESGrowth = −0.22 to a gain in achievement of ESGrowth = 0.33 (Bloom et al., 2008).

What do we know about normative expectations of academic growth in Germany? First, the research evidence from studies using probability samples that are representative of the total student population in Germany is scattered (see Supplemental Table B.4) and has not yet been synthesized quantitatively through meta-analysis (Footnote 3; see Supplemental Figure A.4 and Table B.4). Consistent with results from the US, learning gains have been found to decrease with increasing grade levels in all three domains. However, there is also considerable variation across estimates. For example, the estimated year-to-year gain for mathematics from Grade 9 to Grade 10 varies between ESGrowth = 0.24 (Baumert & Artelt, 2002) and ESGrowth = 0.50 (KMK, 2012). Importantly, the research evidence for Germany is much weaker than for the US, as few or no estimates were available for many grade transitions. Second, previous research provided mixed results as to whether growth trajectories differ between school types. Some studies demonstrated larger learning gains for students attending the (highest) academic track (Becker et al., 2006; Pfost et al., 2010), while other studies found little support for differential learning gains (Retelsdorf & Möller, 2008; Schneider & Stefanek, 2004) or even stronger learning gains at nonacademic track schools (Autorengruppe Bildungsberichterstattung, 2020). Finally, estimates of students’ academic growth for the same student sample differed by test (Supplemental Table B.4). For example, growth estimates in mathematics (Lehner et al., 2017) and science (Schiepe-Tiska et al., 2017) were found to be larger when the tests were aligned to the national standards that underlie school curricula than when using tests that assessed domain-specific literacy.

Cross-Sectional vs. Longitudinal Growth Estimates

To estimate students’ academic growth, researchers can apply two alternative study designs. One can apply a cross-sectional study design to compare the average achievement levels of independent, representative student samples attending, for example, two consecutive grade levels (e.g., students attending grade levels 9 and 10). Alternatively, one can apply a longitudinal study design to compare the average competence levels of a representative student sample that is followed across two or more consecutive waves of measurements (e.g., the same students measured in both Grade 9 and Grade 10).

The large majority of normative expectations for students’ academic growth in the United States were developed using cross-sectional designs (Bloom et al., 2008; Lee et al., 2019; Scammacca et al., 2015). Bloom et al. (2008) also showed that differences in academic growth trajectories between study designs were typically small (i.e., less than 0.10 of a standard deviation) for most grade transitions. However, when students reached the legal age to drop out of school (i.e., for the transition from Grade 9 to 10), considerably larger growth rates (i.e., about 0.25 standard deviations) were observed when using a cross-sectional design.

One major reason for the differences in results between research designs is selective dropout. The mean competence level in the lower grade is based on both the students who drop out and the students who continue their school career; the mean competence level in the upper grade is based only on the students who continue their school career. The average level of achievement of students who drop out of school is lower than that of students who continue their school career (Gubbels et al., 2019). Thus, in cross-sectional designs, attrition due to student dropout leads to a positive selection of students with better achievement in higher grade levels. In longitudinal designs, estimates of academic growth are computed as the mean change in students’ achievement scores. The change score estimates the academic growth of the students who continue their school career but not of those who drop out. Thus, growth estimates of achievement are less affected by selection effects due to attrition when using longitudinal designs. In summary, estimates of academic growth using cross-sectional designs may be larger than estimates of academic growth in longitudinal designs when data are subject to student attrition. Thus, longitudinal designs are preferable for estimating students’ growth in achievement (Bloom et al., 2008).
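The following toy simulation (not data from any study cited here) illustrates this mechanism: when the weakest students leave after Grade 9, a cross-sectional comparison overstates the growth of those who stay, while the longitudinal change score does not.

```r
set.seed(1)
n   <- 1e5
g9  <- rnorm(n)            # Grade-9 achievement (z-scores)
g10 <- g9 + 0.30           # true annual growth of 0.30 SD for every student
stays <- g9 > qnorm(0.10)  # bottom 10% drop out after Grade 9

# Longitudinal estimate: mean change among students observed at both waves
mean(g10[stays] - g9[stays])  # ~0.30, unbiased for the continuing students

# Cross-sectional estimate: Grade-10 mean (continuers only) minus Grade-9 mean (all)
mean(g10[stays]) - mean(g9)   # ~0.50, inflated by selective dropout
```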

Research Objectives

Normative expectations for academic growth serve as vital empirical benchmarks to assess the meaningfulness of results from educational interventions. Such benchmarks should be tied to the target population and outcome of the intervention. However, for intervention researchers in Germany the relevant knowledge base on this kind of effect size benchmark is limited in several ways. First, relative to the US, considerably less is known about normative expectations for students’ academic growth in Germany at the national level for the general student population. Second, the available growth estimates focus on the total student population, but not on student populations attending certain types of schools or in certain educational stages. Third, most effect size benchmarks for students’ academic growth are based on cross-sectional designs, which may overestimate growth rates when student attrition is likely to occur (Bloom et al., 2008). This is the case in Germany because every school year a considerable proportion of students (i.e., up to 5%) repeat a class (Statistisches Bundesamt [Destatis], 2021, supplemental Tab. 3.8). To address these gaps in the literature we take advantage of individual participant data (IPD) obtained from all major longitudinal studies that are representative of the German student population in Grades 1–12. Drawing on these IPD, we analyze and meta-analytically integrate empirical benchmarks for students’ annual academic growth in vital outcome domains—German as first language, English as second language, ICT, mathematics, and science—for (a) the total student population and (b) students attending different types of schools.

Method

To obtain empirical benchmarks for students’ academic growth, we applied the two-stage strategy for IPD meta-analyses: we first estimated effect sizes for IPD in Stage 1 and combined them meta-analytically in Stage 2 (Brunner et al., 2023; Burke et al., 2017; Morris et al., 2018). We only briefly describe the applied methods here (see supplemental OSM A.2 for details).

Large-Scale Assessment Data

To obtain effect size estimates of students’ academic growth, we sought IPD that fulfilled the following inclusion criteria (Footnote 4). The datasets should (a) be representative of the German student population, (b) provide longitudinal data for the same sample of individual students, and (c) provide scores obtained from standardized achievement tests. To identify such datasets we carried out a systematic search in two electronic databases. Our search identified four large-scale assessments comprising longitudinal data from six independent national probability samples: the samples of three starting cohorts (SC2, SC3, and SC4; NEPS Network, 2019, 2020a, 2020b) from the National Educational Panel Study (NEPS; Blossfeld & Rossbach, 2019), the sample of the Assessment of Student Achievements in German and English as a Foreign Language (DESI; Klieme, 2012), and the samples of the longitudinal extensions of the year 2003 (PISA-I+03) and 2012 (PISA-I+12) cycles of the Programme for International Student Assessment (Prenzel et al., 2013; Reiss et al., 2020). Supplemental Table A.1 provides an overview of the socio-demographics for these samples. The sample sizes that were available for the statistical analyses varied from N = 1,868 students (NEPS-SC3, Grade 12) to N = 10,543 students (DESI; see Supplemental Table A.2) (Footnote 5).

Measures

We examined students’ growth in achievement using a broad spectrum of measures that provided a commensurable (i.e., vertical) metric across time (see Supplemental Table A.8). The datasets included measures for achievement in various domains: mathematics, science, ICT, specific verbal skills in German as a first language (reading comprehension, grammar, writing), and specific verbal skills in English as a foreign language (text reconstruction, language awareness, listening). Assessments were conducted in all grades from 1 to 12 except Grade 8. All tests were administered using a paper-and-pencil format.

Two-Stage IPD Meta-Analyses

To obtain effect sizes for students’ growth in achievement in Stage 1, we used the R package lavaan (version 0.6-9; Rosseel, 2012) to specify latent change score models (McArdle, 2009) following the guidelines provided in Kievit et al. (2018). Growth was defined as the mean difference (Δ) in the latent change score that was obtained for the measures of two successive waves of measurement. We followed Bloom and colleagues (2008) and computed a standardized effect size for academic growth by dividing Δ by the average standard deviation of achievement measures across two successive waves of measurement. Specifically, we used the average standard deviations as obtained for (a) the total student population (SDAve) and (b) the student population attending a certain school type (SDAve.ST) to compute ESGrowth and ESGrowth.ST, respectively:

ESGrowth = Δ/SDAve (1)

ESGrowth.ST = Δ/SDAve.ST (2)

To facilitate the comparison of effect sizes within and across studies we scaled each standardized effect size to represent annual growth (i.e., growth in 12 months; see Lee et al., 2019). All growth estimates are presented in Supplemental Table B.0.
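As a minimal sketch of Stage 1, the following R code specifies a two-wave latent change score model in lavaan and applies Equation 1 and the annual rescaling. The data frame dat, the variables ach_t1 and ach_t2, and the 24-month interval are hypothetical placeholders; the models actually fitted (see supplemental OSM A.2) are more elaborate, for example regarding missing data.

```r
library(lavaan)

# Two-wave latent change score model (McArdle, 2009; Kievit et al., 2018):
# ach_t2 = ach_t1 + delta, with 'delta' as the latent change factor.
lcs_model <- '
  ach_t2 ~ 1 * ach_t1      # autoregressive path fixed to 1
  delta =~ 1 * ach_t2      # latent change factor, loading fixed to 1
  ach_t2 ~ 0 * 1           # wave-2 intercept fixed to 0
  ach_t2 ~~ 0 * ach_t2     # wave-2 residual variance fixed to 0
  delta ~ 1                # mean of the latent change score (Delta)
  delta ~~ delta           # variance of the latent change score
  delta ~~ ach_t1          # change may covary with the initial level
'
fit <- sem(lcs_model, data = dat, missing = "fiml", fixed.x = FALSE)

# Equation 1: standardize Delta by the average SD across both waves
est       <- parameterEstimates(fit)
delta     <- est$est[est$lhs == "delta" & est$op == "~1"]
sd_ave    <- mean(c(sd(dat$ach_t1, na.rm = TRUE), sd(dat$ach_t2, na.rm = TRUE)))
es_growth <- delta / sd_ave

# Rescale to annual growth, assuming linear growth across the interval
months_between_waves <- 24  # hypothetical interval between the two waves
es_growth_annual <- es_growth * 12 / months_between_waves
```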

In Stage 2, we used the R package metafor (version 3.0; Viechtbauer, 2010) to carry out the meta-analyses. Because random-effects models cannot be expected to reliably gauge the heterogeneity of (true) effect sizes when fewer than k = 10 effect sizes are available for the meta-analytic integration (Langan et al., 2019, p. 95), we applied (multivariate) fixed-effects models (Rice et al., 2018) when 2 ≤ k < 10, and (multivariate) random-effects models (Hedges, 2019) when k ≥ 10. The random-effects model provides several measures to assess the heterogeneity of the effect sizes (Borenstein et al., 2017). The standard deviation (σ) depicts the average deviation of the (true) effect sizes around the (true) average effect size. The I2 statistic depicts the proportion of observed heterogeneity in effect sizes that is real and not due to random noise; I2 values falling in the intervals 30% ≤ I2 ≤ 60%, 50% ≤ I2 ≤ 90%, and 75% ≤ I2 ≤ 100% are often considered to represent moderate, substantial, and considerable heterogeneity, respectively (Higgins et al., 2021). Finally, the 95% prediction interval (95% PI) provides a plausible range of values in which the true effect sizes of about 95% of all relevant populations will fall. To take within-sample dependencies among effect sizes into account, we used the R package clubSandwich (version 0.5.3; Pustejovsky, 2021) to impute a working covariance matrix for the observed effect sizes (Hedges, 2019). We used r = .90 as a reasonable upper-bound estimate for the within-sample correlation among effect sizes. Because we used an estimated working covariance matrix rather than an empirical one, we conducted sensitivity analyses (Hedges, 2019; Mavridis & Salanti, 2013). These analyses corroborated that the meta-analytic statistics (i.e., averages, standard errors, and σ) were fairly robust against the different values chosen for the correlation among effect sizes (see supplemental OSM A.2).
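A minimal sketch of Stage 2 under the assumptions just described (working correlation r = .90; fixed- vs. random-effects depending on k); the data frame es and its columns are hypothetical placeholders rather than the authors' actual objects.

```r
library(metafor)       # meta-analytic models (version 3.0 in the paper)
library(clubSandwich)  # working covariance matrix and cluster-robust inference

# 'es' is a hypothetical data frame: one row per effect size, with columns
# yi (effect size), sei (standard error), sample (sample ID), es_id (unique ID).
V <- impute_covariance_matrix(vi = es$sei^2, cluster = es$sample, r = 0.90)

if (nrow(es) >= 10) {
  # multivariate random-effects model
  res <- rma.mv(yi, V, random = ~ 1 | sample / es_id, data = es)
} else {
  # multivariate fixed-effects model (no random effects)
  res <- rma.mv(yi, V, data = es)
}

summary(res)                                       # average ES, sigma, etc.
coef_test(res, vcov = "CR2", cluster = es$sample)  # robust sensitivity check
predict(res)  # under the random-effects model, includes the 95% PI (pi.lb, pi.ub)
```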

Results and Discussion

Normative Expectations for Students’ Annual Academic Growth

We found considerable variation in students’ annual academic growth across domains, grade levels, and school types (see Table 1 as well as Supplemental Tables B.1 and B.2). This heterogeneity in effect sizes implies that the magnitude of the very same intervention effect would be assessed differently depending on the target outcome and target population. The heterogeneity becomes particularly apparent when looking at the results obtained for the random-effects models. For example, the standard deviation of the true effect sizes around the meta-analytic average learning gain from Grade 8 to 9 was 0.16 (I2 = 99%), with a 95% PI ranging from ESGrowth = −0.11 to ESGrowth = 0.55.

Table 1. Normative expectations of students’ annual academic growth: total student population.

Second, consistent with previous research, average learning gains in the total student population decreased in all domains as students moved from Kindergarten (i.e., German preschool) to Grade 12 (see Supplemental Table B.1 and Figure A.4). However, this basic pattern in the total student population showed variation within and across domains. For example, learning gains in mathematics from Kindergarten to Grade 1 were considerably larger (ESGrowth = 1.16) than from Grade 1 to 2 (ESGrowth = 0.68; see Table 1 and Supplemental Figure A.4). In higher grade levels, learning gains became more similar, although considerable variation remained (see the “Average” section in Table 1, Supplemental Table B.0, and Figure A.4). Third, the basic pattern for students’ academic growth that was observed for the total student population was also found for most school types and domains (see Supplemental Table B.1). Yet, there were again some deviations from this pattern within school types. For example, annual academic growth in mathematics for students attending vocational school was somewhat larger in Grades 7–9 (ESGrowth = 0.39) than in Grades 5–7 (ESGrowth = 0.29). Fourth, for many domains and grade transitions, standardized effect sizes (in terms of ESGrowth) were very similar across school types, but some growth estimates varied across school types (e.g., verbal skills in English; see Supplemental Table B.1).

We can only speculate about possible reasons why we observed this variation in students’ academic growth within and across domains as well as across student populations. For example, these deviations may reflect differences in the extent to which school curricula in different grade levels and school types emphasize the content that is assessed with a certain test. Moreover, the deviations could also arise from increased student learning (e.g., because students are more motivated to learn) when a certain educational path leads to an important transition point (e.g., from elementary to secondary school) or to a certain school leaving certificate (e.g., the lower secondary school leaving certificate). These factors and their interactions may help explain the deviations from the basic pattern that we observed, for example, in academic growth in mathematics during elementary school, or for vocational school students approaching Grade 9 (i.e., the year they may obtain the lower secondary school leaving certificate).

Limitations

Two caveats should be borne in mind when applying the present findings on students’ annual academic growth as empirical benchmarks. First, the time span between two successive waves of measurements underlying the annual growth estimates was in some cases shorter than 12 months (e.g., in the DESI study), and in some cases (considerably) longer than 12 months (e.g., 24 or 36 months for NEPS-SC2/-SC3). Consistent with the approach used by Lee et al. (2019), we assumed linear growth during the observed time interval to provide an estimate of students’ annual growth for certain grade transitions. However, previous research with student samples in the United States showed that assuming linear growth for students’ learning may not provide the best fit to the empirical data, because the development of students’ achievement follows to some extent non-linear trajectories both within (Kuhfeld & Soland, 2021) and across grade levels (Bloom et al., 2008; Kuhfeld & Soland, 2021; Lee et al., 2019; Scammacca et al., 2015). Thus, when growth estimates for a certain grade transition are based on linear projections, these estimates should be interpreted with care. For example, when annual growth estimates are based on a time interval of 24 months, the growth estimate reflects students’ learning gains across the lower and the higher grade. Because learning gains typically decrease with increasing grade levels, academic growth as estimated for the lower and upper grade should be considered lower and upper bound estimates of students’ true learning gains for these grade transitions, respectively. Results obtained for US students suggest that this caveat especially concerns the learning gains of students in elementary school (see Supplemental Figure A.4; Bloom et al., 2008; Kuhfeld & Soland, 2021; Lee et al., 2019; Scammacca et al., 2015): academic growth rates from Kindergarten to Grade 4 decreased considerably with increasing grade levels, whereas from Grade 5 to Grade 11, growth estimates were found to be much more consistent across grade levels.

Second, the estimates for students’ annual academic growth refer to the average growth observed for the total student population or the population of students attending a certain school type. When educational interventions target other student populations (e.g., students scoring at the bottom end of the achievement distribution or students with low SES), the present growth estimates may not match the target population of the intervention well. To address this limitation, the latent change score models that we applied in the present paper can be extended to latent change regression models (McArdle, 2009) that include students’ achievement in the first wave of measurement and other individual characteristics (e.g., their SES) as covariates. This allows the estimation of students’ growth as a function of the covariates and, thus, the estimation of empirical benchmarks for more specific student populations (a sketch of such an extension follows below). Moreover, consistent with previous studies in the United States (Bloom et al., 2008; Lee et al., 2019; Scammacca et al., 2015), we provided growth estimates for the total student population in Germany but not for specific federal states. When benchmarks on students’ academic growth in certain states are needed, researchers may want to search electronic databases (e.g., those that we applied in the present study) to identify longitudinal samples that are representative for these states.
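As a sketch of the covariate extension mentioned above, the earlier hypothetical lavaan model can be turned into a latent change regression model by regressing the latent change score on baseline achievement and, for example, a centered SES measure (hisei_c below is a hypothetical variable name):

```r
library(lavaan)

# Latent change regression model (McArdle, 2009): same change-score core as
# before, but the latent change is regressed on baseline achievement and SES.
lcr_model <- '
  ach_t2 ~ 1 * ach_t1
  delta =~ 1 * ach_t2
  ach_t2 ~ 0 * 1
  ach_t2 ~~ 0 * ach_t2
  delta ~ 1 + ach_t1 + hisei_c  # conditional growth given the covariates
  delta ~~ delta
'
fit_lcr <- sem(lcr_model, data = dat, missing = "fiml", fixed.x = FALSE)
```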

Part 2: Performance Gaps Between Demographic Student Groups

Theoretical Background and Previous Research

Rationale for Using This Benchmark

Educational policies in many countries (Organisation for Economic Co-operation and Development [OECD], 2003; Stanat & Christensen, 2006) as well as many educational interventions (Lipsey et al., 2012) aim to reduce the performance gaps consistently found between vital student demographic groups, such as male and female students, socioeconomically advantaged and disadvantaged students, and students without and with migration background. As pointed out by Konstantopoulos and Hedges (2008), performance gaps between these student subgroups may therefore serve as important empirical benchmarks to assess the meaningfulness of the effects of policy efforts or educational interventions aimed at reducing or eliminating these gaps.

Previous Research

Konstantopoulos and Hedges (2008) as well as Bloom and colleagues (Bloom et al., 2008; Hill et al., 2008; Lipsey et al., 2012) provided empirical benchmarks for performance gaps between student demographic groups based on data from the US National Assessment of Educational Progress (NAEP). Building on the work of these authors, we computed these effect sizes for the most recent NAEP assessment in 2019 (see Supplemental Table A.15) and found that female students outperformed male students in reading, whereas gender differences in mathematics were considerably smaller or negligible. Further, socioeconomically advantaged students (i.e., students ineligible for free/reduced-price lunch) outperformed socioeconomically disadvantaged students in both outcome domains. Finally, White students outperformed both Black and Hispanic students across all grade levels in both reading and mathematics.

In Germany, performance gaps between three demographic student groups have received a great deal of attention: (a) male and female students, (b) socioeconomically advantaged and disadvantaged students, and (c) students without and with migration background. The scientific, public, and political discussion of these performance gaps was informed by reliable evidence obtained from several large-scale assessments. First, in findings from elementary and lower secondary education, female students consistently outperformed male students in verbal skills (Böhme et al., 2016; McElvany et al., 2017; Schipolowski et al., 2017; Weis et al., 2019a) and ICT skills (Gerick et al., 2019), whereas male students outperformed female students in mathematics (Robitzsch et al., 2020; Schipolowski et al., 2017, 2019). Gender differences in (general) science and most science domains (except for biology) were small to negligible (Robitzsch et al., 2020; Schipolowski et al., 2019). Second, socioeconomically advantaged students outperformed socioeconomically disadvantaged students in all outcome domains in elementary and lower secondary education (Haag et al., 2017; Hußmann et al., 2017; Kuhl et al., 2016; Mahler & Kölm, 2019; Senkbeil et al., 2019; Weis et al., 2019b). Third, students without migration background (i.e., whose parents were both born in Germany) outperformed students with migration background in all grade levels and outcome domains, especially when both of their parents were born outside Germany (Haag et al., 2016; Henschel et al., 2019; Rjosk et al., 2017; Vennemann et al., 2019; Weis et al., 2019b; Wendt et al., 2020; Wendt & Schwippert, 2017).

Research Objectives

Performance gaps between student demographic groups may serve as vital empirical benchmarks to assess the meaningfulness of findings from educational policies or interventions that aim to reduce or eliminate these gaps (Bloom et al., 2008; Hill et al., 2008; Konstantopoulos & Hedges, 2008; Lipsey et al., 2012). Such benchmarks should be tied to the target population and outcome of the intervention. Thus, educational researchers in Germany may benefit from having benchmarks for performance gaps between student demographic groups that are based on reliable research evidence for the total student population in Germany as well as student populations in each federal state. However, the relevant knowledge base on this kind of effect size benchmark for Germany is limited in several ways. First, the published research evidence on performance gaps between student demographic groups obtained from major large-scale assessments is scattered across individual study reports, and has not yet been mapped onto a common effect size metric and meta-analytically integrated across large-scale assessments. Second, reliable and generalizable results on performance gaps between demographic student groups in Germany have been published for students in elementary education and lower secondary education, but not for students in upper secondary education.

The overarching goal of Part 2 of the present paper is therefore to provide empirical benchmarks for performance gaps between student demographic groups by meta-analyzing the results for performance gaps in the German student population between (a) male and female students, (b) socioeconomically advantaged and disadvantaged students, and (c) students without and with migration background. Taking advantage of the results obtained from all major German large-scale assessment programs, we provide reliable and generalizable benchmark values for performance gaps between student demographic groups for (a) the total student population in Germany in elementary, lower, and upper secondary education and (b) student populations in each of the 16 federal states for students in elementary and lower secondary education.

Method

Search for Results on Performance Gaps Between Student Demographic Groups from Large-Scale Assessments

The scientific, public, and political discussion of performance gaps between student demographic groups is based on reliable evidence obtained from several large-scale assessments. We therefore sought published results (see OSM A.3 for details) on performance gaps between student demographic groups obtained from all major national and international large-scale assessment programs in Germany, including the National Assessment Studies (NAS; Grades 4 and 9), the Trends in International Mathematics and Science Study (TIMSS; Grade 4), the Progress in International Reading Literacy Study (PIRLS; Grade 4), the Programme for International Student Assessment (PISA; 15-year-olds, mostly in Grade 9 or 10), and the International Computer and Information Literacy Study (ICILS; Grade 8). Because these studies did not cover student populations in upper secondary education, we added the results on performance gaps obtained from analyzing the IPD of students in SC3 and SC4 who participated in NEPS in grade levels 11 and 12 (see Part 1). Finally, we could only use the results from the NAS in Grades 4 and 9 to meta-analyze performance gaps in each of the 16 federal states because neither the international large-scale assessment studies nor the NEPS provide reliable results for individual federal states.

Computation of Standardized Effect Sizes

We computed standardized mean differences ESGender and ESMig to depict (a) the performance gap between male and female students and (b) the performance gap between students whose parents were both born in Germany (i.e., students without migration background) and students whose parents were both born outside Germany (i.e., students with migration background). To this end, we divided the raw mean-level difference in achievement (ΔGender or ΔMig) by an estimate of the standard deviation of the achievement measure as obtained for the total student population in Germany. Positive values of ESGender indicate that female students outperformed male students, and positive values of ESMig indicate that students without migration background outperformed students with migration background. Further, we computed ESSES, which depicts the correlation between students’ achievement and their SES as measured by the highest occupational status of the parents (HISEI; Ganzeboom & Treiman, 1996). The HISEI is a well-established, internationally comparable measure that refers to the occupational status of the parent who is higher on this measure or to the only available parent. To facilitate the broad application of the results obtained for students’ SES as effect size benchmarks, we drew on Pustejovsky (2014, pp. 96–97) to convert the meta-analytic averages obtained for ESSES into standardized mean differences (ESSES.SMD). Specifically, ESSES.SMD approximates the mean difference in achievement (in standard deviation units) between students whose HISEI is above and below the median HISEI level in the total student population. The effect sizes for the total student population and the federal states can be found in OSM C.
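For intuition, the conversion from ESSES to ESSES.SMD can be approximated in closed form under bivariate normality with a median split on HISEI. The sketch below is our reconstruction under these assumptions, not necessarily the exact computation in Pustejovsky (2014); it closely reproduces the ESSES/ESSES.SMD pairs reported later in Part 2 (e.g., 0.32/0.53).

```r
# Approximate a standardized mean difference between students above vs. below
# the median of a (standard normal) covariate from the correlation r = ES_SES.
# Assumes bivariate normality; a sketch, not necessarily the paper's exact formula.
r_to_smd <- function(r) {
  gap <- 2 * sqrt(2 / pi) * r            # mean achievement gap between the halves
  sd_within <- sqrt(1 - (2 / pi) * r^2)  # pooled within-half SD of achievement
  gap / sd_within
}

r_to_smd(c(0.32, 0.43))  # approx. 0.53 and 0.73, close to the reported values
```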

Meta-Analytic Integration

We applied separate meta-analytic models to integrate the effect sizes ESGender, ESSES, and ESMig—at the level of subdomains within domains, of whole domains, and across domains—as obtained for (a) the total student population and (b) the student populations in each federal state. To this end, we applied the same meta-analytic procedures and set of assumptions as in Part 1. Sensitivity analyses corroborated that the meta-analytic averages and the associated standard errors and confidence limits were fairly robust against the different values chosen for the correlation among observed effect sizes (see supplemental OSM A.3).

Results and Discussion

Normative Expectations for Performance Gaps Between Student Demographic Groups

The observed variations in the size of performance gaps between student demographic groups across and within domains, grade levels, and federal states imply that the magnitude of the very same intervention effect would be assessed differently depending on the target outcome and target population. First, as shown by Tables 2–4, the performance of male and female students in the total student population was very similar for 10 out of 34 outcomes (i.e., −0.10 ≤ ESGender ≤ 0.10). However, there were also several outcomes with larger gender differences in achievement. For example, female students in lower secondary education outperformed male students in verbal skills (ESGender = 0.32), particularly in spelling in German (ESGender = 0.46), as well as in ICT skills (ESGender = 0.20). The heterogeneity in the magnitude of gender differences also becomes evident when looking at the standard deviation of (true) effect sizes (Tables 2–4) and the other heterogeneity measures (see Supplemental Table C.1). Moreover, relative to elementary and lower secondary education, gender differences increased considerably in upper secondary education in favor of male students in mathematics, ICT, and science. One plausible reason why gender performance gaps may increase in upper secondary education is that male students are more likely than female students to choose courses in mathematics (e.g., school year 2018/19: male/female students: 52%/48%), informatics (85%/15%), physics (74%/26%), and chemistry (56%/44%) that are taught with more lessons per week and at an increased level of academic standards (Servicestelle der Initiative Klischeefrei, 2020). Male students’ increased learning opportunities relative to female students in these subjects may be a major factor in the larger observed gender differences in these grades.

Table 2. Elementary education (Grade 4): normative expectations for performance gaps between student demographic groups.

Table 3. Lower secondary education (Grades 8, 9, and 10): normative expectations for performance gaps between student demographic groups.

Table 4. Upper secondary education (Grades 11 and 12): normative expectations for performance gaps between student demographic groups.

Second, socioeconomically advantaged students consistently outperformed socioeconomically disadvantaged students across all outcome domains and grade levels in the total student population (Tables 2–4). Consistent with previous research in the US (see Supplemental Table A.15 and Sirin, 2005), we found some (albeit small) variation across, but also within, domains within grade levels. In particular, the standard deviations of (true) effect sizes around the average effect size across domains were σ = 0.04/0.02/0.03 in elementary/lower secondary/upper secondary education (Tables 2–4). Further, we found profound differences between grade levels. Performance gaps in upper secondary education were considerably smaller (0.13 ≤ ESSES ≤ 0.18) than in elementary (0.29 ≤ ESSES ≤ 0.40) or lower secondary education (0.26 ≤ ESSES ≤ 0.38). One major reason for these differences in performance gaps is that students with higher levels of SES and achievement are more likely to enter upper secondary education (Footnote 6). These selection processes also lead to a restriction in the range of both the SES measure and the achievement outcomes for the subpopulation of students who entered upper secondary education. All else being equal, these variance restrictions in turn may have reduced the size of the social gradient (see Cohen et al., 2003) that we applied to estimate the effect size ESSES (Footnote 7).

Third, students without migration background outperformed students with migration background across all outcome domains (except for reading in English) and grade levels in the total student population. However, performance gaps varied widely across achievement outcomes (see the heterogeneity measures reported in Tables 2–4 and Supplemental Table C.1). For example, the smallest performance gaps in elementary education were found for spelling in German (ESMig = 0.28) and the largest in science (ESMig = 0.82). Moreover, performance gaps in upper secondary education were smaller (average achievement: ESMig = 0.36) than in elementary (average achievement: ESMig = 0.63) or lower secondary education (average achievement: ESMig = 0.63). One reason for these differences is that students who enter upper secondary education demonstrate higher levels of achievement irrespective of their migration background (see Footnote 6). Thus, performance gaps related to students’ migration background can be expected (and were also found) to be smaller in magnitude in upper secondary than in lower secondary or elementary education.

Fourth, the pattern of results on performance gaps between student demographic groups in elementary and lower secondary education that was found for the total student population was also found for all federal states (Supplemental Tables C.2–C.4). However, the magnitude of performance differences between student demographic groups varied across federal states. For example, for reading in German in elementary school, gender differences ranged between ESGender = 0.15 (Bremen) and ESGender = 0.33 (Lower Saxony), the social gradient varied between ESSES/ESSES.SMD = 0.32/0.53 (Schleswig-Holstein) and ESSES/ESSES.SMD = 0.43/0.72 (Lower Saxony), and performance differences between students without and with migration background ranged between ESMig = 0.19 (Mecklenburg-Western Pomerania) and ESMig = 0.78 (Berlin). Differences between federal states likely reflect a complex interaction of many factors, including differences between federal states in educational policies and practices that aim at reducing or eliminating performance gaps between the student demographic groups under investigation.

Limitations

Two caveats should be borne in mind when applying the present findings on performance gaps between student demographic groups as empirical benchmarks. First, the performance gaps depict differences in the mean level of achievement between student demographic groups. However, the achievement of students within all subgroups under investigation was very heterogeneous, and the achievement distributions of these groups overlapped substantially. Thus, the mean-level differences found between student demographic groups cannot be used to reliably predict or assess the performance of individual students within these groups. For example, there were many female students, students from socioeconomically disadvantaged families, and students with migration background who outperformed male students, students from socioeconomically advantaged families, and students without migration background, respectively.

Second, consistent with previous studies with US student samples (Bloom et al., 2008; Hill et al., 2008; Konstantopoulos & Hedges, 2008; Lipsey et al., 2012), we provided empirical benchmarks for students in elementary, lower secondary, and upper secondary education for selected grade levels. It was not possible for us to provide these benchmarks for other grade levels because they were not targeted by the major national and international large-scale assessments. Further, we could not provide these benchmarks for different types of schools because such results have not been published in the available study reports.

Part 3: Performance Gaps Between Weak and Average Schools

Theoretical Background and Previous Research

Rationale for Using This Benchmark

Many educational policies and interventions are conceptualized as school-level interventions or whole-school reforms, rather than targeting individual students (Lipsey et al., 2012). Such policies and interventions typically aim to improve learning outcomes for all students at a certain school by systematically addressing vital learning-related factors, such as the quality of instruction, parent involvement, student assessment, teachers’ professional development, or school management (Borman et al., 2003; Cheung et al., 2021). As pointed out by Konstantopoulos and Hedges (2008), a natural—and arguably the most appropriate—empirical benchmark to assess the effects of such school-level interventions is the performance gap between schools. Because it does not seem realistic for a single intervention to make a weak school perform like the best schools, a more realistic, but still ambitious, goal of school-level interventions might be to make weak schools perform like average schools (Hill et al., 2008; Konstantopoulos & Hedges, 2008). The observed performance gap between weak and average-performing schools has therefore been proposed as an empirical benchmark (Bloom et al., 2008; Hill et al., 2008; Lipsey et al., 2012).

Previous Research with Schools from the United States

Both student-level (e.g., prior knowledge, parental support) and school-level factors (e.g., quality of instruction) contribute to students’ learning in school (Atteberry & McEachin, 2020; Wang et al., 1993). Previous research provided empirical benchmarks for the performance gaps between schools that estimated the joint impact of school-level factors on students’ achievement outcomes by taking into account between-school differences in student-level factors, including students’ level of prior achievement, SES, gender, and migration background (Bloom et al., 2008; Hill et al., 2008; Konstantopoulos & Hedges, 2008; Lipsey et al., 2012). This yields a distribution of school performance (see Figure 1) that depicts how much schools would differ in their mean level of achievement (i.e., the regression-adjusted school-average achievement) if all schools had students with similar prior achievement and socio-demographic background (Bloom et al., 2008; Lipsey et al., 2012). Using this distribution of school performance, weak and average schools were defined as those schools scoring at the 10th and 50th percentile, respectively (Bloom et al., 2008; Hill et al., 2008; Lipsey et al., 2012). The performance gap between weak and average schools was then divided by the student-level standard deviation of the achievement outcome to estimate the standardized effect size ESSchool (see Equation 3). Using this methodological approach, previous studies estimated ESSchool for reading and mathematics achievement of students attending Grades 3, 5, 7, and 10 in four school districts in the US (see Supplemental Figure A.14). Performance gaps between weak and average schools varied considerably across districts, with 0.16 ≤ ESSchool ≤ 0.43 (Mdn ESSchool = 0.27) in elementary education (i.e., Grade 3) and 0.11 ≤ ESSchool ≤ 0.41 (Mdn ESSchool = 0.26) in lower secondary education. Performance gaps between schools were more homogeneous in upper secondary education (i.e., Grade 10), with 0.07 ≤ ESSchool ≤ 0.17 (Mdn ESSchool = 0.12).

Figure 1. Performance Gap Between Weak and Average Schools (adapted from Bloom et al., 2008, p. 314). Note. To estimate performance gaps between schools, we adapted the approach by Bloom et al. (2008). Specifically, we drew on the standard assumption of multilevel models that the random coefficients (u0j) that depict regression-adjusted mean-level differences between schools (i.e., the regression-adjusted school-average achievement levels) are normally distributed with mean zero and standard deviation τ. To obtain an estimate of τ we used a two-level random-intercept model (Level 1: students; Level 2: schools) in which students’ (grand-mean centered) prior achievement, their (grand-mean centered) SES, and information on students’ gender and migration background were entered as predictors at Level 1 for each achievement outcome. Schools with a regression-adjusted school-average achievement level below/at/above zero score worse/same/better than other schools with students of the same prior achievement level and socio-demographic background characteristics. The figure shows the distribution for τ = 1.


Research Objectives

Performance gaps between weak and average schools may serve as vital empirical benchmarks to assess the effects of educational policies or interventions at the school level. However, this kind of benchmark has not yet been provided for German schools. To address this gap in the literature, we take advantage of IPD obtained from all major longitudinal studies that are representative of the German student population in Grades 1 to 12. Drawing on these IPD, we analyze and meta-analytically integrate empirical benchmarks for performance gaps between weak and average schools in vital outcome domains—German as first language, English as second language, ICT, mathematics, and science—for (a) the total student population and (b) students attending different types of schools.

Methods

Large-Scale Assessment Data

To obtain empirical benchmarks for performance gaps between schools we again applied the two-stage strategy for IPD meta-analyses (see supplemental OSM A.4 for details). To this end, we took advantage of the IPD from all major longitudinal studies that are representative of the German student population in elementary and secondary school (see Part 1).

Measures

We examined performance gaps between schools by using a broad spectrum of outcome measures (for an overview, see Supplemental Table A.8). To estimate the regression-adjusted school-average achievement for each school we used domain-identical pretest measures and socio-demographic student characteristics as covariates. Specifically, we used two measures of students’ SES, namely the HISEI (Ganzeboom & Treiman, Citation1996) and an indicator of the highest educational attainment within the family. Moreover, we used two indicator variables to represent students’ gender and migration background.

Two-Stage IPD Meta-Analyses

Drawing on the approach by Bloom et al. (Citation2008), we used two-level random-intercept models in Stage 1 to estimate performance gaps between schools with similar student backgrounds (see Figure 1). In these models, the random coefficient for the intercept u0j (as obtained for a certain school j) depicts the regression-adjusted school-average achievement, that is, the difference between the average achievement level of a certain school and the grand mean while controlling for students’ prior achievement and socio-demographic characteristics. Following Konstantopoulos and Hedges (Citation2008) and Bloom et al. (Citation2008), we drew on the standard assumption of multilevel models that u0j is normally distributed with mean zero and standard deviation τ (see Figure 1). Using this set of assumptions, we computed the performance gap ESSchool between weak schools (with u0j located at the 10th percentile) and average schools (with u0j located at the 50th percentile) as follows (Bloom et al., Citation2008, p. 315):

ESSchool = 1.285 ⋅ τ / SDPopulation (3)

The multiplier 1.285 in Equation 3 denotes the (absolute) difference between the 10th and 50th percentile in a standard normal distribution (i.e., |−1.285 − 0| = 1.285), and SDPopulation represents the student-level standard deviation of the achievement outcome as obtained for the total student population. ESSchool was computed for (a) the total student population as well as (b) the population of students attending a certain school type. We also computed an effect size ESSchool.ST in which we divided the τ obtained for each school type by the student-level standard deviation of the achievement outcome (SDPopulation.ST) obtained for the student population attending that school type:

ESSchool.ST = 1.285 ⋅ τ / SDPopulation.ST (4)
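To make Stage 1 concrete, the following R sketch fits such a two-level random-intercept model and computes ESSchool for a single outcome. It is a minimal illustration rather than our analysis code: the data set dat and its variable names (achievement, pretest, ses, female, migration, school) are hypothetical, we use the lme4 package here although any multilevel modeling software that reports τ would do, and survey-design features (e.g., sampling weights) are omitted.

```r
library(lme4)

# Grand-mean center prior achievement and SES (hypothetical variable names).
dat$pretest_c <- dat$pretest - mean(dat$pretest, na.rm = TRUE)
dat$ses_c     <- dat$ses     - mean(dat$ses,     na.rm = TRUE)

# Two-level random-intercept model: students (Level 1) nested in schools
# (Level 2), with centered covariates and dummy-coded gender and migration
# background at Level 1.
fit <- lmer(achievement ~ pretest_c + ses_c + female + migration +
              (1 | school), data = dat)

# tau: SD of the random intercepts u0j, i.e., of the regression-adjusted
# school-average achievement levels.
vc  <- as.data.frame(VarCorr(fit))
tau <- vc$sdcor[vc$grp == "school"]

# Equation 3: standardized gap between weak (10th percentile) and average
# (50th percentile) schools, using the student-level SD of the outcome in
# the total student population.
sd_pop    <- sd(dat$achievement, na.rm = TRUE)
es_school <- 1.285 * tau / sd_pop

# Equation 4 replaces tau and sd_pop with their school-type-specific
# counterparts, with the model fitted separately within each school type.
```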

All values of ESSchool and ESSchool.ST are presented in Supplemental Table D.0.

In Stage 2, we applied separate meta-analytic models to integrate the effect sizes ESSchool and ESSchool.ST at the level of subdomains within domains, of whole domains, across domains, and across grade levels, as obtained for (a) the total student population and (b) different school types (including average estimates for nonacademic track schools). To this end, we applied the same meta-analytic procedures and set of assumptions as in Part 1. Sensitivity analyses empirically corroborated that most meta-analytic averages and the associated standard errors and confidence limits were fairly robust against the different values chosen for the correlation among observed effect sizes (see supplemental OSM A.4).
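As a schematic illustration of Stage 2, the following sketch pools a set of Stage-1 effect sizes with a random-effects model in the metafor package. The vectors es and se hold invented placeholder values for ESSchool estimates and their standard errors; the actual analyses additionally handle dependencies among effect sizes as described in supplemental OSM A.4.

```r
library(metafor)

# Hypothetical Stage-1 results: ES_School estimates and standard errors.
es <- c(0.21, 0.34, 0.28, 0.25)
se <- c(0.04, 0.05, 0.03, 0.06)

# Random-effects model; REML estimation of the between-study variance.
res <- rma(yi = es, sei = se, method = "REML")
summary(res)  # meta-analytic average, 95% CI, heterogeneity statistics
```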

Results and Discussion

Normative Expectations for Performance Gaps Between Weak and Average Schools

The observed variations of performance gaps between weak and average schools across and within educational stages imply that the magnitude of the very same intervention effect may be assessed quite differently depending on the target population and outcome. First, consistent with results from the US, Table 5 shows that the performance gaps between weak and average schools were larger in elementary (average ESSchool = 0.32) and lower secondary education (e.g., average ESSchool = 0.41 in Grades 5–9) than in upper secondary education (average ESSchool = 0.23). One plausible reason for the increase in performance gaps between schools in lower secondary education is that students in Germany are tracked into different school types that cater to their performance levels when they enter secondary school. Thus, mean-level differences between school types add to performance differences between schools even when controlling for students’ prior achievement and socio-demographic characteristics (see Footnote 8). Further, students need to qualify to enter upper secondary education irrespective of the type of school that they attend (Secretariat of the Standing Conference of the Ministers of Education and Cultural Affairs of the Länder in the Federal Republic of Germany [KMK], Citation2019). Students at this level therefore demonstrate considerably higher levels of achievement than the other students, and the student body becomes more homogeneous with respect to achievement (see also Footnotes 6 and 7). Consequently, performance gaps between schools become considerably smaller in upper secondary education. Second, performance gaps between weak and average schools in the total student population varied considerably across domains within lower secondary education, but much less so in elementary and upper secondary education (see section “Average” in Table 5 and Supplemental Table D.1). For example, the standard deviation of the (true) effect sizes for the meta-analytic integration across domains was σ = .13 in Grades 5 to 9 and σ = .12 in Grades 5 to 10, but σ = .08 in Grades 1 to 4 and σ = .05 in Grades 10 to 12. Third, several performance gaps between schools in the total student population varied considerably within domains (see Supplemental Table D.0). For example, in lower secondary education (i.e., grade levels 5 to 10) the largest ranges of performance gaps between schools were found for verbal skills in English (0.26 ≤ ESSchool ≤ 0.75) and German (0.26 ≤ ESSchool ≤ 0.58).

Table 5. Performance gaps between weak and average schools: total student population.

Strengths and Limitations

Drawing on the work by Bloom et al. (Citation2008), the effect size ESSchool was computed as the model-implied difference between weak and average schools, assuming that all schools had students with the same level of prior achievement and the same socio-demographic background characteristics. This approach offers the advantage that performance gaps between schools can easily be computed for other reference values. Thus, it is possible to obtain empirical benchmark values for schools at different points on the performance continuum (e.g., closing the performance gap between average and excellent schools). Drawing on the cumulative standard normal distribution, the raw performance gap between schools located, for example, at the 50th percentile (i.e., average schools) and the 95th percentile (i.e., excellent schools) is 1.645 ⋅ τ, which can be standardized by dividing by SDPopulation as in Equation 3.
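The general recipe is a one-liner: the standardized gap between schools at any two percentiles of the (assumed normal) distribution of regression-adjusted school means is the absolute difference of the corresponding standard normal quantiles, multiplied by τ and divided by the student-level standard deviation. A minimal R sketch, with a function name and input values of our own choosing for illustration:

```r
# Standardized gap between schools at percentiles p1 and p2 of the (assumed
# normal) distribution of regression-adjusted school-average achievement.
# tau: SD of the random intercepts; sd_pop: student-level SD of the outcome.
es_gap <- function(p1, p2, tau, sd_pop) {
  abs(qnorm(p1) - qnorm(p2)) * tau / sd_pop
}

# Average (50th) vs. excellent (95th percentile) schools: qnorm(0.95) is
# 1.645, reproducing the multiplier in the text (tau value is hypothetical).
es_gap(0.50, 0.95, tau = 0.30, sd_pop = 1)
```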

Despite this strength, two caveats should be borne in mind when applying the present findings on performance gaps between schools as empirical benchmarks. First, as pointed out by Bloom et al. (Citation2008), ESSchool approximates the “net effect” of factors related to school effectiveness that may be targeted by school-level interventions. However, without adjusting for the covariates, the mean-level differences between weak and average schools are (considerably) larger than ESSchool because the student characteristics that were included in the multi-level model as covariates also contribute substantially to students’ achievement outcomes over and above the school effects. Second, in accordance with the seminal studies by Konstantopoulos and Hedges (Citation2008) and Bloom et al. (Citation2008), we computed ESSchool by assuming that the regression-adjusted school-average achievement scores (i.e., the random coefficients u0j) were normally distributed. This allows us to compare the benchmark values of the present study to those obtained in prior research. Further, additional analyses by Bloom et al. (Citation2008, p. 314) showed that similar benchmark values were obtained when using the empirical (and not a model-implied) distribution of the regression-adjusted school-average achievement scores. Nevertheless, future research may benefit from estimating ESSchool by invoking different distributional assumptions for the random coefficients, or by obtaining estimates of the regression-adjusted school-average achievement level for each school (e.g., empirical Bayes estimates; see Raudenbush & Bryk, Citation2002, p. 47) and using the empirical distribution of these estimates.

Application Guidelines and Example

Empirical effect size benchmarks facilitate assessing the meaningfulness of educational intervention effects. This requires (a) finding a good match between the benchmark values and the intervention with respect to its nature and its target population and outcome (Bloom et al., Citation2008; Hill et al., Citation2008; Lipsey et al., Citation2012), (b) taking into account the statistical characteristics of the benchmark values with respect to standardization and statistical precision, and possibly (c) translating the intervention effects into intuitive metrics. In the following, we elaborate on each of these requirements. Moreover, to illustrate the application of the guidelines and benchmarks presented in this paper, we developed several example scenarios. We outline one of them at the end of this section; the remaining scenarios, as well as flow diagrams (Supplemental Figures A.1–A.3) that help researchers choose the appropriate empirical benchmarks for various application scopes, can be found in OSM A.1.

Finding a Good Match Between the Benchmark Values and the Intervention

Nature of the Intervention

How can one find a good match between empirical benchmarks and the nature of the intervention? Normative expectations of performance gaps between student demographic groups are arguably the most helpful empirical benchmarks for educational interventions that aim to reduce or eliminate these gaps (Bloom et al., Citation2008; Konstantopoulos & Hedges, Citation2008). Normative expectations of performance gaps between weak and average schools are arguably the most appropriate empirical benchmarks when the target intervention is conceptualized as a school-level intervention or whole-school reform that aims to improve learning outcomes for all students at a certain school (Konstantopoulos & Hedges, Citation2008). In all other cases, notably when an intervention aims to improve students’ academic growth over and above what would have occurred during the school year, normative expectations of students’ academic growth are very helpful for assessing the meaningfulness of the intervention (Bloom et al., Citation2008). For example, normative expectations of students’ academic growth may serve as vital empirical benchmarks for interventions that compare the intervention group to a practice-as-usual control group. Expectations regarding students’ academic growth may also be used as empirical benchmarks in non-experimental research or for descriptive analyses. For example, this kind of benchmark was used to evaluate the impact of teachers’ pedagogical content knowledge on student learning (Baumert et al., Citation2010), and to provide context for data on performance gaps in achievement between countries (OECD, Citation2014, p. 46) and between students with and without migration background (OECD, Citation2007, p. 175).

Target Population and Target Outcome

Once an appropriate type of empirical benchmark is chosen for an educational intervention, researchers should use effect size estimates that were obtained from the same student population and from the same (or very similar) outcome domain or subdomain (Bloom et al., Citation2008; Hill et al., Citation2008; Lipsey et al., Citation2012). Supplemental Figures A.14–A.16 (in supplemental OSM A.4) help researchers select an appropriate value for each type of benchmark as a function of key characteristics (i.e., target population and outcome domain/subdomain) of the intervention (see Footnote 9). In addition, supplemental OSM A.4 provides five scenarios in which we illustrate the application of our guidelines. In particular, we recommend applying the benchmark values obtained for nationally representative samples when the intervention may be generally applicable or when a single benchmark value needs to be chosen to assess the meaningfulness of the intervention effect (e.g., in a journal publication or study report). Further, we provided empirical benchmark values for student subpopulations that were defined by key characteristics of the German school system, involving different school types and federal states in Germany. These more local benchmarks may be particularly helpful (in addition to, or as an alternative to, the national values) to convey the results of intervention effects to researchers, practitioners, or policymakers who are interested in the local rather than national context (Bloom et al., Citation2008; Hill et al., Citation2008; Lipsey et al., Citation2012). Finally, because several effect sizes obtained for certain subdomains demonstrated considerable variation within domains (e.g., performance gaps between demographic student groups in German), we recommend using benchmark values obtained for the subdomain that is targeted by the intervention. However, when benchmark values are not available for a certain subdomain (e.g., geometry), we recommend using the benchmark value for the corresponding domain (e.g., mathematics) as an estimate.

Taking Into Account the Statistical Characteristics of the Benchmark Values

Standardization

The effect size estimates that we provided in the present paper are most applicable as empirical benchmarks when the intervention effect is standardized by an estimate of the student-level standard deviation of the outcome obtained for the total student population (Bloom et al., Citation2008). This also facilitates comparing the intervention effect across different student subpopulations because the intervention effect (as observed in the raw metric) is not confounded with differences in the variability of students’ achievement across subpopulations. We also provided ESGrowth.ST and ESSchool.ST for the various school types. These effect sizes are especially useful for putting an intervention effect into the context of a certain school type, or when no estimate of the total-population standard deviation is available for a certain outcome measure to standardize the intervention effect.

Statistical Precision

When choosing a (meta-analytic) effect size as an empirical benchmark value, it is important to consider the statistical precision (i.e., the standard error and 95% confidence limits) with which it was estimated. Because we took advantage of results obtained from large probability samples, most effect sizes and meta-analytic averages were (very) precisely estimated. However, some standard errors were relatively large (e.g., larger than .05 for mean-level differences [ESGrowth, ESGender, ESMig, and ESSchool] or .03 for the correlational benchmark ESSES). In such cases, two strategies may be helpful. First, researchers can use meta-analytic average values that were obtained for corresponding higher aggregate levels as empirical benchmarks (e.g., nonacademic track schools or the total student population rather than specific school types; domain-specific rather than subdomain-specific averages), because (in most cases) these meta-analytic averages could be estimated with higher statistical precision (see Scenario 4). Second, researchers can use the lower and upper bound values of the 95% confidence interval of the selected effect size/meta-analytic average, which provides a plausible range of benchmark values for assessing the intervention effect (see Scenario 5).

Using Empirical Benchmarks to Translate Intervention Effects

Empirical benchmarks help provide an intuition about the magnitude of the intervention effect. This goal may be further supported when empirical benchmark values are used to translate the intervention effect size d onto more familiar scales (Bloom et al., Citation2008; Konstantopoulos & Hedges, Citation2008; Kraft, Citation2020) by dividing d by a certain benchmark value. For example, intervention effects can be translated into years-of-learning by dividing d by a value of ESGrowth that matches the student population and outcome targeted by the intervention. Years-of-learning is widely used because it provides an intuitive metric that can easily be communicated to researchers, practitioners, and policymakers (Lortie-Forgues et al., Citation2021). Despite its strengths, the conversion into years-of-learning has been criticized. One major point of criticism is that this conversion is not bounded within a reasonable set of values (Baird & Pane, Citation2019). Specifically, when students’ academic growth approaches zero, estimated years-of-learning become very large and, thus, implausible. The same criticism applies to all conversions that divide an intervention effect size by an empirical benchmark value. Several statistical conversions of the intervention effect have been developed to address these shortcomings, involving percentile gains or improvements in the number of proficient students (see OSM A.1 and Baird & Pane, Citation2019).
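As a toy illustration of both the translation and its failure mode, consider dividing a hypothetical intervention effect by an annual-growth benchmark (all numbers invented for illustration):

```r
# Hypothetical numbers, for illustration only.
d         <- 0.15  # standardized intervention effect
es_growth <- 0.40  # matched annual-growth benchmark ES_Growth

d / es_growth  # 0.375, i.e., roughly 0.4 of a typical school year of learning

# Failure mode: near-zero annual growth makes the translation explode.
d / 0.02       # 7.5 "years of learning" from the same d = 0.15
```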

Example Application

In the opening example, a whole-school intervention to improve students’ mathematics achievement yielded an effect of d = 0.15. The research team applies ESSchool = 0.24 (from Table 5) as the benchmark value. This benchmark was chosen because the intervention is conceptualized as a school-level intervention that targeted fourth graders’ mathematics achievement (see Supplemental Figure A.3). Using this value of ESSchool as the benchmark allows the researchers to conclude that the intervention helps to close about 63% (i.e., 0.15/0.24 ≈ 0.63) of the performance gap between weak and average schools in elementary education.

General Discussion

The present paper provided novel meta-analytic evidence on empirical benchmarks to support researchers in their assessment of intervention or policy effects on students’ achievement in elementary and secondary schools. To this end, we built on and expanded the seminal work from the United States (e.g., Bloom et al., Citation2008; Hill et al., Citation2008; Konstantopoulos & Hedges, Citation2008; Lipsey et al., Citation2012) by taking advantage of the IPD and results from all major longitudinal and cross-sectional large-scale assessment programs that are representative of the German student population. We examined three vital empirical benchmarks for a broad variety of outcome domains/subdomains, namely students’ academic growth in achievement as well as performance gaps between policy-relevant demographic student groups and between weak and average schools. The pattern of results obtained for these benchmarks for students in Germany generally follows the pattern found in the United States. However, for each kind of benchmark, the magnitude of effect sizes varied across countries. We also observed substantial variations across and within outcome domains as well as across types of schools and federal states within Germany. To conclude, the observed variations in empirical benchmark values imply that the very same intervention or policy effect would be assessed (quite) differently depending on the target population and outcome. These results empirically underscore how important it is to find a good match between an empirical benchmark and both the target population and outcome. We therefore provided guidelines to help find such matches and illustrated the application of these guidelines for assessing the effects of educational interventions with several scenarios (see OSM A.1).

Limitations

The present paper has some general limitations. First, generalizations beyond the tests and samples included in the present analyses should be made with care. The generalizability of the present findings is most plausible when the samples and outcome measures to which generalizations are sought are sufficiently similar in sample composition and test content (see Hedges & Vevea, Citation1998). In particular, most of the achievement measures applied here, although not all (e.g., some measures of verbal skills in English in lower secondary education), covered a broad spectrum of content or skills. We also provided empirical benchmarks for the total student population in Germany as well as for student subpopulations attending different school types and federal states. Thus, it remains an open question how well the present results generalize to achievement tests that assess more specific skills or areas of knowledge, or to more specific student subpopulations within Germany (e.g., students attending a certain school type in a certain federal state). In this respect, the results obtained from the random-effects models may offer preliminary answers, because the 95% PIs provide a plausible range of values for the (true) empirical benchmark values. Moreover, generalizations to student populations in other countries are most likely to hold when the student population (e.g., its socio-demographic composition; see Supplemental Tables A.1 and A.16) and school system (e.g., curricula or tracking into school types) are substantially similar to those in Germany.

Second, to assess the effect of a certain target intervention, average effect sizes obtained from similar educational interventions may serve as vital additional empirical benchmarks (Hill et al., Citation2008; Kraft, Citation2020; Lipsey et al., Citation2012). Given the relatively small number of randomized intervention studies that have been carried out in Germany so far, we did not provide these kinds of benchmarks for students in Germany. For the same reason, we did not provide guidelines for how to assess intervention effects regarding the cost per student or the scalability of the intervention (Kraft, Citation2020; Lipsey et al., Citation2012). When the number of (randomized) intervention studies in Germany grows, future research may add these benchmark values to the body of knowledge on how to assess the effects of educational interventions in Germany.

Conclusion

The present paper provided novel, reliable, and generalizable meta-analytic results for three vital types of empirical benchmarks: students’ academic growth, performance gaps between student demographic groups, and performance gaps between weak and average schools. These benchmarks help to contextualize the standardized effect sizes obtained for the effects of educational interventions and policies on students’ achievement in elementary and secondary schools. We hope that the guidance provided in this paper helps researchers in Germany, and perhaps also other countries, to assess the meaningfulness of intervention and policy effects, and to convey the findings to a broader audience comprising other researchers, practitioners, and policymakers.

Open Scholarship

This article has earned the Center for Open Science badges for Open Data and Open Materials through Open Practices Disclosure. The data and materials are openly accessible at https://osf.io/g4nad/.

Open Research Statements

Study and Analysis Plan Registration

There is no study and analysis plan registration associated with this manuscript.

Data, Code, and Materials Transparency

The data on effect sizes and R code that support the findings of this study are openly available on the Open Science Framework at https://osf.io/g4nad.

Design and Analysis Reporting Guidelines

There is not a completed reporting guideline checklist included as a supplementary file for this manuscript.

Transparency Declaration

The lead author (the manuscript’s guarantor) affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.

Replication Statement

This manuscript reports an original study.

Author Note

This paper uses data from the National Educational Panel Study (NEPS; see Blossfeld & Roßbach, 2019). The NEPS is carried out by the Leibniz Institute for Educational Trajectories (LIfBi, Germany) in cooperation with a nationwide network. Datasets for the Assessment of Student Achievements in German and English as a Foreign Language (DESI) and the longitudinal extensions of the year 2003 (PISA-I+03) and 2012 (PISA-I+12) cycles of the Programme for International Student Assessment (PISA) were made available by the Research Data Center at the Institute for Educational Quality Improvement (FDZ at IQB). Permission from the dataset owners was granted to use these datasets for the research objectives of the present paper. Further, we used the public use files for the German student samples of the year 2000–2018 cycles of PISA, and the year 2013 and 2018 cycles of the International Computer and Information Literacy Study (ICILS), which are publicly available and whose re-use is permitted via an open license for the research objectives of the present paper. The R code for reproducing all results as well as the data with the effect sizes used in the present paper can be accessed via the Open Science Framework at https://osf.io/x4erk/. We made a preprint of our paper available on edarxiv at https://edarxiv.org/39gbq/. We did not preregister the analyses presented in this paper.


Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Grant 392108331.

Notes

1 We do not provide empirical benchmarks that are based on educational intervention effects due to the very small number of randomized trials that have been carried out with student samples in Germany.

2 https://osf.io/x4erk/. Tables and Figures presented in the OSM are indicated by corresponding letters (e.g., Table B.0 in OSM B etc.).

3 It was not possible for us to integrate the effect sizes from previous research by means of meta-analytic models because many original studies did not report the necessary information on sampling variances/standard errors of the effect sizes. We therefore provide minimum, median, and maximum values to summarize these effect sizes in Table B.3.

4 Some of these data have also been used in previous studies to estimate students’ growth in achievement (see Table B.ES). However, these studies used different statistical procedures to estimate effect sizes for students’ growth. Further, none of these studies provided standard errors for these effect sizes, which precludes their meta-analytic integration with fixed-effect or random-effects models.

5 We applied several exclusion criteria to derive the samples for the present analyses (see OSM A.1). Table A.3 itemizes the number of excluded students. Sensitivity analyses showed no systematic differences in the study measures between students that were included and those that were excluded (see Tables A.4–A.7).

6 For example, for SC-3 in Grade 9 (at the end of lower secondary education), the standardized mean differences in HISEI/reading/mathematics/science/ICT achievement between the students who did and did not enter upper secondary education were 0.66/0.83/0.91/0.80/0.81, respectively.

7 For example, for SC-3 in Grade 9 (at the end of lower secondary education), the student-level variance of HISEI/reading/mathematics/science/ICT achievement for the students entering upper secondary education was 24%/17%/21%/23%/31% smaller than the variance obtained for the total student population, respectively.

8 This argument was empirically corroborated by the values of ESSchool that were obtained for specific school types. Because the underlying analyses were carried out separately for each school type, mean-level differences between school types could not affect performance gaps between schools. Relative to the performance gaps between schools that were observed in the total student population, the values of ESSchool for specific school types were all smaller in size (see Table D.1 in OSM D).

9 Of note, growth estimates that are based on longitudinal studies should generally be preferred over those based on cross-sectional studies because the latter are more strongly affected by selection effects due to student attrition. Further, the growth estimates provided in the present study should be preferred over longitudinal estimates from previous research (provided in the upper panel of Table B.3) because we applied a standardized analysis protocol to control the quality of the data and statistical analyses, thereby mitigating bias and unwanted heterogeneity in estimates of students’ academic growth (see Riley et al., Citation2010). When no longitudinal estimate is available, cross-sectional estimates of students’ annual academic growth obtained from previous research (provided in the lower panel of Table B.3) may serve as useful empirical benchmarks.

References

  • American Educational Research Association (AERA). (2006). Standards for reporting on empirical social science research in AERA publications. Educational Researcher, 35, 33–40.
  • Atteberry, A. C., & McEachin, A. J. (2020). Not where you start, but how much you grow: An addendum to the Coleman report. Educational Researcher, 49(9), 678–685. https://doi.org/10.3102/0013189X20940304
  • Autorengruppe Bildungsberichterstattung. (2020). Bildung in Deutschland 2020. wbv Media. https://doi.org/10.3278/6001820gw
  • Baird, M. D., & Pane, J. F. (2019). Translating standardized effects of education programs into more interpretable metrics. Educational Researcher, 48(4), 217–228. https://doi.org/10.3102/0013189X19848729
  • Baumert, J., & Artelt, C. (2002). Bereichsübergreifende Perspektiven [Taking perspectives from different domains]. In J. Baumert, C. Artelt, E. Klieme, M. Neubrand, M. Prenzel, U. Schiefele, W. Schneider, K.-J. Tillmann, & M. Weiß (Eds.), PISA 2000—Die Länder der Bundesrepublik im Vergleich (pp. 219–236). Leske + Budrich.
  • Baumert, J., Kunter, M., Blum, W., Brunner, M., Voss, T., Jordan, A., Klusmann, U., Krauss, S., Neubrand, M., & Tsai, Y. M. (2010). Teachers’ mathematical knowledge, cognitive activation in the classroom, and student progress. American Educational Research Journal, 47, 133–180.
  • Baumert, J., Lüdtke, O., Trautwein, U., & Brunner, M. (2009). Large-scale student assessment studies measure the results of processes of knowledge acquisition: Evidence in support of the distinction between intelligence and student achievement. Educational Research Review, 4, 165–176.
  • Becker, M., Lüdtke, O., Trautwein, U., & Baumert, J. (2006). Leistungszuwachs in Mathematik. Evidenz für einen Schereneffekt im mehrgliedrigen Schulsystem? Zeitschrift Für Pädagogische Psychologie, 20(4), 233–242. https://doi.org/10.1024/1010-0652.20.4.233
  • Bloom, H. S., Hill, C. J., Black, A. R., & Lipsey, M. W. (2008). Performance trajectories and performance gaps as achievement effect-size benchmarks for educational interventions. Journal of Research on Educational Effectiveness, 1(4), 289–328. https://doi.org/10.1080/19345740802400072
  • Blossfeld, H.-P., & Rossbach, H.-G. (Eds.). (2019). Education as a lifelong process: The German national educational panel study (NEPS) (2nd ed.). Springer VS. https://doi.org/10.1007/978-3-658-23162-0
  • Böhme, K., Sebald, S., Weirich, S., & Stanat, P. (2016). Geschlechtsbezogene Disparitäten [Gender differences]. In P. Stanat, K. Böhme, S. Schipolowski, & N. Haag (Eds.), IQB-Bildungstrend 2015. Sprachliche Kompetenzen am Ende der 9. Jahrgangsstufe im zweiten Ländervergleich (pp. 377–408). Waxmann.
  • Borenstein, M., Higgins, J. P. T., Hedges, L. V., & Rothstein, H. R. (2017). Basics of meta-analysis: I2 is not an absolute measure of heterogeneity. Research Synthesis Methods, 8(1), 5–18. https://doi.org/10.1002/jrsm.1230
  • Borman, G. D., Hewes, G. M., Overman, L. T., & Brown, S. (2003). Comprehensive school reform and achievement: A meta-analysis. Review of Educational Research, 73(2), 125–230. https://doi.org/10.3102/00346543073002125
  • Brunner, M., Keller, L., Stallasch, S. E., Kretschmann, J., Hasl, A., Preckel, F., Lüdtke, O., & Hedges, L. V. (2023). Meta-analyzing individual participant data from studies with complex survey designs: A tutorial on using the two-stage approach for data from educational large-scale assessments. Research Synthesis Methods, 14(1), 5–35. https://doi.org/10.1002/jrsm.1584
  • Burke, D. L., Ensor, J., & Riley, R. D. (2017). Meta‐analysis using individual participant data: One‐stage and two‐stage approaches, and why they may differ. Statistics in Medicine, 36(5), 855–875. https://doi.org/10.1002/sim.7141
  • Cheung, A. C. K., Xie, C., Zhuang, T., Neitzel, A. J., & Slavin, R. E. (2021). Success for all: A quantitative synthesis of U.S. evaluations. Journal of Research on Educational Effectiveness, 14(1), 90–115. https://doi.org/10.1080/19345747.2020.1868031
  • Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Lawrence Erlbaum.
  • Cohen, J., Cohen, P., Aiken, L. S., & West, S. G. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Lawrence Erlbaum Associates.
  • Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966
  • Dadey, N., & Briggs, D. C. (2012). A meta-analysis of growth trends from vertically scaled assessments. Practical Assessment, Research, and Evaluation, 17. https://doi.org/10.7275/F2BM-6R59
  • Findley, M. G., Kikuta, K., & Denly, M. (2021). External validity. Annual Review of Political Science, 24(1), 365–393. https://doi.org/10.1146/annurev-polisci-041719-102556
  • Ganzeboom, H. B. G., & Treiman, D. J. (1996). Internationally comparable measures of occupational status for the 1988 International Standards Classification of Occupations. Social Science Research, 25, 201–239.
  • Gerick, J., Massek, C., Eickelmann, B., & Labusch, A. (2019). Computer- und informationsbezogene Kompetenzen von Mädchen und Jungen im zweiten internationalen Vergleich [Computer and information literacy: Second international comparison of female and male students]. In B. Eickelmann, W. Bos, J. Gerick, F. Goldhammer, H. Schaumburg, K. Schwippert, M. Senkbeil, & J. Vahrenhold (Eds.), ICILS 2018 #Deutschland: Computer- und informationsbezogene Kompetenzen von Schülerinnen und Schülern im zweiten internationalen Vergleich und Kompetenzen im Bereich Computational Thinking (pp. 271–300). Waxmann.
  • Gubbels, J., Put, C., & Assink, M. (2019). Risk factors for school absenteeism and dropout: A meta-analytic review. Journal of Youth and Adolescence, 48, 1637–1667. https://doi.org/10.1007/s10964-019-01072-5
  • Haag, N., Böhme, K., Rjosk, C., & Stanat, P. (2016). Zuwanderungsbezogene Disparitäten [Immigration-related disparities]. In P. Stanat, K. Böhme, S. Schipolowski, & N. Haag (Eds.), IQB-Bildungstrend 2015. Sprachliche Kompetenzen am Ende der 9. Jahrgangsstufe im zweiten Ländervergleich (pp. 431–480). Waxmann.
  • Haag, N., Kocaj, A., Jansen, M., & Kuhl, P. (2017). Soziale Disparitäten [Social inequalities]. In P. Stanat, S. Schipolowski, C. Rjosk, S. Weirich, & N. Haag (Eds.), IQB-Bildungstrend 2016: Kompetenzen in den Fächern Deutsch und Mathematik am Ende der 4. Jahrgangsstufe im zweiten Ländervergleich (pp. 213–236). Waxmann.
  • Hedges, L. V. (2019). Stochastically dependent effect sizes. In H. M. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), Handbook of research synthesis and meta-analysis (3rd ed., pp. 245–280). Russell Sage Foundation.
  • Hedges, L. V., & Vevea, J. L. (1998). Fixed- and random-effects models in meta-analysis. Psychological Methods, 3(4), 486–504. https://doi.org/10.1037/1082-989X.3.4.486
  • Henschel, S., Heppt, B., Weirich, S., Edele, A., Schipolowski, S., & Stanat, P. (2019). Zuwanderungsbezogene Disparitäten [Immigration-related disparities]. In P. Stanat, S. Schipolowski, N. Mahler, S. Weirich, & S. Henschel (Eds.), IQB-Bildungstrend 2018. Mathematische und naturwissenschaftliche Kompetenzen am Ende der Sekundarstufe I im zweiten Ländervergleich (pp. 265–294). Waxmann. https://www.content-select.com/index.php?id=bib_view&ean=9783830990444
  • Higgins, J. P. T., Thomas, J., Chandler, J., Cumpston, M., Li, T., Page, M. J., & Welch, V. A. (Eds.). (2021). Cochrane Handbook for systematic reviews of interventions. Cochrane. www.training.cochrane.org/handbook
  • Hill, C. J., Bloom, H. S., Black, A. R., & Lipsey, M. W. (2008). Empirical benchmarks for interpreting effect sizes in research. Child Development Perspectives, 2, 172–177.
  • Hußmann, A., Stubbe, T. C., & Kasper, D. (2017). Soziale Herkunft und Lesekompetenzen von Schülerinnen und Schülern [Social origin and students’ reading competences]. In A. Hussmann, H. Wendt, W. Bos, A. Bremerich-Vos, D. Kasper, E.-M. Lankes, N. McElvany, T. C. Stubbe, & R. Valtin (Eds.), IGLU 2016: Lesekompetenzen von Grundschulkindern in Deutschland im internationalen Vergleich (pp. 195–218). Waxmann.
  • Kievit, R. A., Brandmaier, A. M., Ziegler, G., van Harmelen, A.-L., de Mooij, S. M. M., Moutoussis, M., Goodyer, I. M., Bullmore, E., Jones, P. B., Fonagy, P., Lindenberger, U., & Dolan, R. J. (2018). Developmental cognitive neuroscience using latent change score models: A tutorial and applications. Developmental Cognitive Neuroscience, 33, 99–117. https://doi.org/10.1016/j.dcn.2017.11.007
  • Klieme, E. (2012). Assessment of student achievements in German and English as a foreign language (DESI) (version 1) [data set]. IQB - Institute for Educational Quality Improvement. https://doi.org/10.5159/IQB_DESI_V1
  • KMK. (2012). Kompetenzstufenmodell zu den Bildungsstandards für den Hauptschulabschluss und den Mittleren Schulabschluss im Fach Mathematik. KMK. https://www.iqb.hu-berlin.de/bista/ksm/Kompetenzstufenm.pdf
  • Konstantopoulos, S., & Hedges, L. V. (2008). How large an effect can we expect from school reforms? Teachers College Record, 110(8), 1611–1638.
  • Kraft, M. A. (2020). Interpreting effect sizes of education interventions. Educational Researcher, 49(4), 241–253. https://doi.org/10.3102/0013189X20912798
  • Kuhfeld, M., & Soland, J. (2021). The Learning curve: Revisiting the assumption of linear growth during the school year. Journal of Research on Educational Effectiveness, 14(1), 143–171. https://doi.org/10.1080/19345747.2020.1839990
  • Kuhl, P., Haag, N., Federlein, F., Weirich, S., & Schipolowski, S. (2016). Soziale Disparitäten [Social inequalities]. In P. Stanat, K. Böhme, S. Schipolowski, & N. Haag (Eds.), IQB-Bildungstrend 2015. Sprachliche Kompetenzen am Ende der 9. Jahrgangsstufe im zweiten Ländervergleich (pp. 409–430). Waxmann.
  • Langan, D., Higgins, J. P. T., Jackson, D., Bowden, J., Veroniki, A. A., Kontopantelis, E., Viechtbauer, W., & Simmonds, M. (2019). A comparison of heterogeneity variance estimators in simulated random-effects meta-analyses. Research Synthesis Methods, 10(1), 83–98. https://doi.org/10.1002/jrsm.1316
  • Lee, J., Finn, J., & Liu, X. (2019). Time-indexed effect size for educational research and evaluation: Reinterpreting program effects and achievement gaps in K–12 reading and math. The Journal of Experimental Education, 87(2), 193–213. https://doi.org/10.1080/00220973.2017.1409183
  • Lehner, M. C., Heine, J.-H., Sälzer, C., Reiss, K., Haag, N., & Heinze, A. (2017). Veränderung der mathematischen Kompetenz von der neunten zur zehnten Klassenstufe. Zeitschrift für Erziehungswissenschaft, 20(2), 7–36. https://doi.org/10.1007/s11618-017-0746-2
  • Lipsey, M. W., Puzio, K., Yun, C., Hebert, M. A., Steinka-Fry, K., Cole, M. W., Roberts, M., Anthony, K. S., & Busick, M. D. (2012). Translating the statistical representation of the effects of education interventions into more readily interpretable forms. National Center for Special Education Research. http://eric.ed.gov/?id=ED537446
  • Lortie-Forgues, H., Sio, U. N., & Inglis, M. (2021). How should educational effects be communicated to teachers? Educational Researcher, 50(6), 345–354. https://doi.org/10.3102/0013189X20987856
  • Mahler, N., & Kölm, J. (2019). Soziale Disparitäten [Social inequalities]. In P. Stanat, S. Schipolowski, N. Mahler, S. Weirich, & S. Henschel (Eds.), IQB-Bildungstrend 2018. Mathematische und naturwissenschaftliche Kompetenzen am Ende der Sekundarstufe I im zweiten Ländervergleich (pp. 265–294). Waxmann. https://www.content-select.com/index.php?id=bib_view&ean=9783830990444
  • Mavridis, D., & Salanti, G. (2013). A practical introduction to multivariate meta-analysis. Statistical Methods in Medical Research, 22(2), 133–158. https://doi.org/10.1177/0962280211432219
  • McArdle, J. J. (2009). Latent variable modeling of differences and changes with longitudinal data. Annual Review of Psychology, 60, 577–605. https://doi.org/10.1146/annurev.psych.60.110707.163612
  • McElvany, N., Kessels, U., Schwabe, F., & Kasper, D. (2017). Geschlecht und Lesekompetenz [Gender and reading competence]. In A. Hussmann, H. Wendt, W. Bos, A. Bremerich-Vos, D. Kasper, E.-M. Lankes, N. McElvany, T. C. Stubbe, & R. Valtin (Eds.), IGLU 2016: Lesekompetenzen von Grundschulkindern in Deutschland im internationalen Vergleich (pp. 177–194). Waxmann.
  • Morris, T. P., Fisher, D. J., Kenward, M. G., & Carpenter, J. R. (2018). Meta-analysis of Gaussian individual patient data: Two-stage or not two-stage? Statistics in Medicine, 37(9), 1419–1438. https://doi.org/10.1002/sim.7589
  • NEPS Network. (2019). National educational panel study, scientific use file of starting cohort grade 9 (10.0.0) [data set]. Leibniz Institute for Educational Trajectories (LIfBi). https://doi.org/10.5157/NEPS:SC4:10.0.0
  • NEPS Network. (2020a). National educational panel study, scientific use file of starting cohort grade 5 [data set]. Leibniz Institute for Educational Trajectories (LIfBi). https://doi.org/10.5157/NEPS:SC3:10.0.0
  • NEPS Network. (2020b). National educational panel study, scientific use file of starting cohort kindergarten (8.0.1) [data set]. Leibniz Institute for Educational Trajectories (LIfBi). https://doi.org/10.5157/NEPS:SC2:8.0.1
  • Organisation for Economic Co-operation and Development (OECD). (2003). Literacy skills for the world of tomorrow: Further results from PISA 2000. OECD.
  • Organisation for Economic Co-operation and Development (OECD). (2007). PISA 2006. Science competencies for tomorrow’s world. Volume 1: Analysis. OECD.
  • Organisation for Economic Co-operation and Development (OECD). (2014). PISA 2012 results. What students know and can do. Student performance in mathematics, reading, and science (Vol. I). OECD.
  • Peng, C.-Y J., Chen, L.-T., Chiang, H.-M., & Chiang, Y.-C. (2013). The impact of APA and AERA guidelines on effect size reporting. Educational Psychology Review, 25(2), 157–209. https://doi.org/10.1007/s10648-013-9218-2
  • Pfost, M., Karing, C., Lorenz, C., & Artelt, C. (2010). Schereneffekte im ein- und mehrgliedrigen Schulsystem. Zeitschrift Für Pädagogische Psychologie, 24(3–4), 259–272. https://doi.org/10.1024/1010-0652/a000025
  • Prenzel, M., Baumert, J., Blum, W., Lehmann, R., Leutner, D., Neubrand, M., Pekrun, R., Rost, J., & Schiefele, U. (2013). Programme for international student assessment plus 2003, 2004 (PISA-I-Plus) (version 1) [data set]. IQB - Institute for Educational Quality Improvement. https://doi.org/10.5159/IQB_PISA_I_PLUS_V1
  • Pustejovsky, J. E. (2014). Converting from d to r to z when the design uses extreme groups, dichotomization, or experimental control. Psychological Methods, 19(1), 92–112. https://doi.org/10.1037/a0033788
  • Pustejovsky, J. E. (2021). ClubSandwich: Cluster-robust (sandwich) variance estimators with small-sample corrections. R package version 0.5.3. https://CRAN.R-project.org/package=clubSandwich
  • R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
  • Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models (2nd ed.). Sage.
  • Reiss, K., Heine, J.-H., Klieme, E., Köller, O., & Stanat, P. (2020). Programme for international student assessment—plus 2012, 2013 (PISA-Plus 2012, 2013) (version 2) [data set]. IQB - Institute for Educational Quality Improvement. https://doi.org/10.5159/IQB_PISA_PLUS_2012-13_V2
  • Retelsdorf, J., & Möller, J. (2008). Entwicklungen von Lesekompetenz und Lesemotivation: Schereneffekte in der Sekundarstufe? Zeitschrift Für Entwicklungspsychologie Und Pädagogische Psychologie, 40(4), 179–188. https://doi.org/10.1026/0049-8637.40.4.179
  • Rice, K., Higgins, J. P. T., & Lumley, T. (2018). A re-evaluation of fixed effect(s) meta-analysis. Journal of the Royal Statistical Society: Series A (Statistics in Society), 181(1), 205–227. https://doi.org/10.1111/rssa.12275
  • Riley, R. D., Lambert, P. C., & Abo-Zaid, G. (2010). Meta-analysis of individual participant data: Rationale, conduct, and reporting. BMJ, 340, c221. https://doi.org/10.1136/bmj.c221
  • Rjosk, C., Haag, N., Heppt, B., & Stanat, P. (2017). Zuwanderungsbezogene Disparitäten [Immigration-related disparities]. In P. Stanat, S. Schipolowski, C. Rjosk, S. Weirich, & N. Haag (Eds.), IQB-Bildungstrend 2016: Kompetenzen in den Fächern Deutsch und Mathematik am Ende der 4. Jahrgangsstufe im zweiten Ländervergleich (pp. 213–236). Waxmann.
  • Robitzsch, A., Lüdtke, O., Schwippert, K., Goldhammer, F., Kroehne, U., & Köller, O. (2020). Leistungsveränderungen in TIMSS zwischen 2015 und 2019: Die Rolle des Testmediums und des methodischen Vorgehens bei der Trendschätzung. In K. Schwippert, D. Kasper, O. Köller, N. McElvany, C. Selter, M. Steffensky, & H. Wendt (Eds.), TIMSS 2019. Mathematische und naturwissenschaftliche Kompetenzen von Grundschulkindern in Deutschland im internationalen Vergleich (pp. 169–186). Waxmann Verlag GmbH. https://doi.org/10.31244/9783830993193
  • Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(1), 1–36. https://doi.org/10.18637/jss.v048.i02
  • Salchegger, S. (2016). Selective school systems and academic self-concept: How explicit and implicit school-level tracking relate to the big-fish-little-pond effect across cultures. Journal of Educational Psychology, 108(3), 405–423. https://doi.org/10.1037/edu0000063
  • Scammacca, N. K., Fall, A.-M., & Roberts, G. (2015). Benchmarks for expected annual academic growth for students in the bottom quartile of the normative distribution. Journal of Research on Educational Effectiveness, 8(3), 366–379. https://doi.org/10.1080/19345747.2014.952464
  • Schiepe-Tiska, A., Rönnebeck, S., Heitmann, P., Schöps, K., Prenzel, M., & Nagy, G. (2017). Die Veränderung der naturwissenschaftlichen Kompetenz von der 9. zur 10. Klasse bei PISA und den Bildungsstandards unter Berücksichtigung geschlechts- und schulartspezifischer Unterschiede sowie der Zusammensetzung der Schülerschaft. Zeitschrift für Erziehungswissenschaft, 20(2), 151–176. https://doi.org/10.1007/s11618-017-0754-2
  • Schipolowski, S., Wittig, J., Mahler, N., & Stanat, P. (2019). Geschlechtsbezogene Disparitäten [Gender Differences]. In P. Stanat, S. Schipolowski, N. Mahler, S. Weirich, & S. Henschel (Eds.), IQB-Bildungstrend 2018. Mathematische und naturwissenschaftliche Kompetenzen am Ende der Sekundarstufe I im zweiten Ländervergleich (pp. 237–264). Waxmann. https://www.content-select.com/index.php?id=bib_view&ean=9783830990444
  • Schipolowski, S., Wittig, J., Weirich, S., & Böhme, K. (2017). Geschlechtsbezogene Disparitäten [Gender Differences]. In P. Stanat, S. Schipolowski, C. Rjosk, S. Weirich, & N. Haag (Eds.), IQB-Bildungstrend 2016: Kompetenzen in den Fächern Deutsch und Mathematik am Ende der 4. Jahrgangsstufe im zweiten Ländervergleich (pp. 187–212). Waxmann.
  • Schneider, W., & Stefanek, J. (2004). Entwicklungsveränderungen allgemeiner kognitiver Fähigkeiten und schulbezogener Fertigkeiten im Kindes- und Jugendalter. Zeitschrift Für Entwicklungspsychologie Und Pädagogische Psychologie, 36, 147–159.
  • Secretariat of the Standing Conference of the Ministers of Education and Cultural Affairs of the Länder in the Federal Republic of Germany (KMK). (2019). The Education System in the Federal Republic of Germany 2017/2018. KMK. https://www.kmk.org/fileadmin/Dateien/pdf/Eurydice/Bildungswesen-engl-pdfs/dossier_en_ebook.pdf
  • Senkbeil, M., Drossel, K., Eickelmann, B., & Vennemann, M. (2019). Soziale Herkunft und computer- und informationsbezogene Kompetenzen von Schülerinnen und Schülern im zweiten internationalen Vergleich [Second international comparison: Social inequalities in students’ computer and information literacy]. In B. Eickelmann, W. Bos, J. Gerick, F. Goldhammer, H. Schaumburg, K. Schwippert, M. Senkbeil, & J. Vahrenhold (Eds.), ICILS 2018 #Deutschland: Computer- und informationsbezogene Kompetenzen von Schülerinnen und Schülern im zweiten internationalen Vergleich und Kompetenzen im Bereich Computational Thinking (pp. 301–334). Waxmann.
  • Servicestelle der Initiative Klischeefrei. (2020). Fächerwahl und Schulleistungen [Course selection and academic achievement]. klischee-frei.de - Klischeefrei-Faktenblatt: Fächerwahl und Schulleistungen. https://www.klischee-frei.de/de/klischeefrei_101751.php
  • Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Houghton Mifflin Company.
  • Sirin, S. R. (2005). Socioeconomic status and academic achievement: A meta-analytic review of research. Review of Educational Research, 75, 417–453.
  • Stanat, P., & Christensen, G. (2006). Where immigrant students succeed: A comparative review of performance and engagement in PISA 2003. Organisation for Economic Co-operation and Development.
  • Statistisches Bundesamt (Destatis). (2021). Schuljahr 2019/20 [School year 2019/2020]. Fachserie/11/1. https://www.statistischebibliothek.de/mir/receive/DEHeft_mods_00133256
  • Tallmadge, G. K. (1977). The joint dissemination review panel IDEABOOK. U. S. Office of Education.
  • Valentine, J. C., Aloe, A. M., & Wilson, S. J. (2019). Interpreting effect sizes. In H. M. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), Handbook of research synthesis and meta-analysis (3rd ed., pp. 433–452). Russell Sage Foundation.
  • Vennemann, M., Schwippert, K., Eickelmann, B., & Massek, C. (2019). Computer- und informationsbezogene Kompetenzen von Schülerinnen und Schülern mit und ohne Migrationshintergrund im zweiten internationalen Vergleich [Computer and Information Literacy: Second international comparison of students with and without immigrant background]. In B. Eickelmann, W. Bos, J. Gerick, F. Goldhammer, H. Schaumburg, K. Schwippert, M. Senkbeil, & J. Vahrenhold (Eds.), ICILS 2018 #Deutschland: Computer- und informationsbezogene Kompetenzen von Schülerinnen und Schülern im zweiten internationalen Vergleich und Kompetenzen im Bereich Computational Thinking (pp. 335–366). Waxmann Verlag GmbH.
  • Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48. https://doi.org/10.18637/jss.v036.i03
  • Wang, M. C., Haertel, G. D., & Walberg, H. J. (1993). Toward a Knowledge Base for School Learning. Review of Educational Research, 63(3), 249–294. https://doi.org/10.3102/00346543063003249
  • Weis, M., Doroganova, A., Hahnel, C., Becker-Mrotzek, M., Lindauer, T., Artelt, C., & Reis, K. (2019a). Lesekompetenz in PISA 2018. Ergebnisse in einer digitalen Welt [Reading competence in PISA 2018. Results in a digital world]. In K. Reiss, M. Weis, E. Klieme, & O. Köller (Eds.), PISA 2018 Grundbildung im internationalen Vergleich (pp. 47–80). Waxmann Verlag.
  • Weis, M., Doroganova, A., Hahnel, C., Becker-Mrotzek, M., Lindauer, T., Artelt, C., & Reis, K. (2019b). Soziale Herkunft, Zuwanderungshintergrund und Lesekompetenz [Social origin, immigrant background, and reading competence]. In K. Reiss, M. Weis, E. Klieme, & O. Köller (Eds.), PISA 2018 Grundbildung im internationalen Vergleich (pp. 129–162). Waxmann Verlag.
  • Wendt, H., & Schwippert, K. (2017). Lesekompetenzen von Schülerinnen und Schülern mit und ohne Migrationshintergrund [Reading competences of students with and without immigrant background]. In A. Hussmann, H. Wendt, W. Bos, A. Bremerich-Vos, D. Kasper, E.-M. Lankes, N. McElvany, T. C. Stubbe, & R. Valtin (Eds.), IGLU 2016: Lesekompetenzen von Grundschulkindern in Deutschland im internationalen Vergleich (pp. 195–218). Waxmann.
  • Wendt, H., Schwippert, K., Stubbe, T. C., & Jusufi, D. (2020). Mathematische und naturwissenschaftliche Kompetenzen von Schülerinnen und Schülern mit und ohne Migrationshintergrund [Mathematical and science competences of students with and without immigrant background]. In K. Schwippert, D. Kasper, O. Köller, N. McElvany, C. Selter, M. Steffensky, & H. Wendt (Eds.), TIMSS 2019. Mathematische und naturwissenschaftliche Kompetenzen von Grundschulkindern in Deutschland im internationalen Vergleich (pp. 291–314). Waxmann Verlag GmbH. https://doi.org/10.31244/9783830993193