Editorial

Developments in the Reporting of Score Reliability within Counseling Assessment, Research, and Evaluation

At the turn of the century, there was considerable conversation in Measurement and Evaluation in Counseling and Development (MECD) and other periodicals regarding approaches to reporting internal consistency reliability data within scientific reports (Fan & Thompson, 2001; Henson, 2001; Henson & Thompson, 2002; Onwuegbuzie & Daniel, 2002; Thompson & Snyder, 1998; Thompson & Vacha-Haase, 2000). The collective position was that authors had made strides in the inclusion of these data; however, small distinctions related to characterization and interpretation were obscuring the nature of relationships between variables and intervention outcomes within the broader literature base. Taken together, the recommendations across these commentaries provided encouragement and technical guidance for authors to:

  1. Recognize reliability data as a function of test scores, not the tests themselves;

  2. Report sample-specific reliability estimates rather than relying on inductions from other studies;

  3. Include confidence intervals (CIs) for the overall sample reliability estimate, as well as for key subgroups when indicated by the analytic plan; and

  4. Use visual displays to complement textual reporting of reliability estimates and CIs.

A cursory review of the literature in our field indicates that we have made substantial progress in the conceptualization of reliability metrics as data metrics rather than a property of tests. Evidence for this position can be found upstream in the many textbooks used in counseling and related educational programs (Balkin & Kleist, 2022; Barrio Minton & Lenz, 2019; Sheperis et al., 2023; Wester & Wachter Morris, 2018), as well as in guiding documents for the use of assessments and their scores (American Educational Research Association et al., 2014; Lenz et al., 2022). Thus, it is reasonable to conjecture that, as a profession, our current position along the continuum ranging from "this test is reliable" to "these test scores provide evidence for a defensible degree of reliability" lies nearer to the latter than the former. While the work may not be done, it is clear that the charge has been accepted and that a paradigm shift is in process.

It can also be argued that a similar degree of progress has been made in the reporting of sample-specific reliability estimates within primary studies. Early reviews of sample-specific score reliability reporting revealed strikingly high rates of non-reporting, ranging from 56% to 80% of published reports (Hanson et al., 2002; Miller et al., 2009; Yin & Fan, 2000). The review by Vacha-Haase et al. (2003) also highlighted the prevalence of reliability induction practices, wherein researchers attribute the psychometric properties of test scores from a normative or alternative sample to their own sample even when the two do not reflect a preponderance of overlap among reported intersecting identities. It is a subtle distinction, but this induction is misleading because it disregards the interaction among sample characteristics, score variability, and related internal consistency estimates that can lead to the attenuation of observed effects, as illustrated below. More recent reliability generalization reviews have indicated a greater presence of sample-specific reporting, ranging from 57% to 71% of published reports, and thus a 29–43% rate of non-reporting or reliability induction (Lenz et al., in press; McKay et al., 2021; Vincent et al., 2019). Given the American Psychological Association's (2019) Journal Article Reporting Standards guidance to include sample-specific reliability estimates within primary study reports and the related observation of these practices by peer-reviewed outlets, including MECD, it is possible that rates of omissions and inductions will continue to decrease over time.
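To make the attenuation mechanism concrete, the classic correction for attenuation (the issue taken up by Baugh, 2002) shows how unreliability in two sets of scores shrinks their observed correlation; the numeric values below are hypothetical and chosen only for illustration:

$$r_{x'y'} = r_{xy}\sqrt{r_{xx'}\,r_{yy'}} \quad\Longleftrightarrow\quad \hat{r}_{xy} = \frac{r_{x'y'}}{\sqrt{r_{xx'}\,r_{yy'}}}$$

For example, if observed scores yield $r_{x'y'} = .40$ with internal consistency estimates of $r_{xx'} = .70$ and $r_{yy'} = .80$, the disattenuated estimate is $.40/\sqrt{.70 \times .80} \approx .53$, a markedly different picture of the underlying relationship.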

In contrast to these developments, a review of the professional counseling literature indicates modest instances of CIs reported for total sample and subgroup score reliability estimates and minimal cases of visual representations of the related data. The increased capacity of statistical software packages such as IBM SPSS Statistics, JASP, and R to compute multiple forms of reliability with CIs across subgroups provides the greatest opportunity to promote transparency in reporting. Table 1 provides an illustrative example of one approach to including these details alongside the descriptive statistics commonly displayed in relation to predictive analyses with a single group of participants. These data depict a scenario wherein the point estimates representing internal consistency reliability are relatively similar across the four predictor variables and fall within a range that may be regarded as good or acceptable by some common interpretive benchmarks. However, the inclusion of the related CIs provides a clearer depiction of the variation or stability of the reliability coefficients. For example, hypothetical scores on the Shame variable are represented by an alpha coefficient of 0.82 with the expectation that we can be 95% confident that the true value falls somewhere between 0.71 and 0.93. Without the reporting and interpretation of these CIs, we may be content with categorizing the internal consistency of scores as good when in fact the true estimate may lie at the thresholds of what is considered acceptable or excellent. The inclusion of visual aids may have the virtue of making these relationships more readily observable to readers. These analyses may bear particular relevance to MECD authors who are investigating multi-factor expressions of test scores or measurement invariance between sample subgroups.

Table 1. Illustrative Display of Descriptive Statistics, Bivariate Correlations, and Internal Consistency Reliability Estimates with 95% Confidence Intervals.
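As one concrete, non-authoritative illustration of how such estimates might be generated, the following Python sketch computes coefficient alpha with percentile-bootstrap 95% CIs for several subscales and renders a simple error-bar display. The data file, the subscale names (including Shame), and the item-column naming convention are hypothetical placeholders rather than values drawn from Table 1 or the studies cited above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def cronbach_alpha(items: pd.DataFrame) -> float:
    """Coefficient alpha for a respondents-by-items matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)


def bootstrap_alpha_ci(items: pd.DataFrame, n_boot: int = 2000,
                       level: float = 0.95, seed: int = 1) -> tuple[float, float]:
    """Percentile-bootstrap CI for alpha, resampling respondents with replacement."""
    rng = np.random.default_rng(seed)
    n = len(items)
    boots = [cronbach_alpha(items.iloc[rng.integers(0, n, n)]) for _ in range(n_boot)]
    lower, upper = np.percentile(boots, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return float(lower), float(upper)


# Hypothetical item-level data with columns such as "shame_1" ... "shame_8";
# the file name and subscale labels are placeholders, not values from Table 1.
data = pd.read_csv("item_responses.csv")
subscales = ["shame", "guilt", "hope", "stress"]

rows = []
for scale in subscales:
    items = data.filter(regex=f"^{scale}_")
    alpha = cronbach_alpha(items)
    lower, upper = bootstrap_alpha_ci(items)
    rows.append({"scale": scale, "alpha": alpha, "lower": lower, "upper": upper})
    print(f"{scale}: alpha = {alpha:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")

# Visual display: point estimates with CI bars, complementing the tabled values.
summary = pd.DataFrame(rows)
plt.errorbar(summary["scale"], summary["alpha"],
             yerr=[summary["alpha"] - summary["lower"],
                   summary["upper"] - summary["alpha"]],
             fmt="o", capsize=4)
plt.ylabel("Coefficient alpha (95% CI)")
plt.ylim(0.0, 1.0)
plt.tight_layout()
plt.savefig("reliability_cis.png", dpi=300)
```

Comparable estimates and intervals can typically be obtained directly within the reliability routines of JASP, R, or IBM SPSS Statistics noted above; the sketch simply makes the computation and its accompanying visual display explicit.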

Overall, it is clear that counseling researchers have made considerable strides in answering the call for more rigorous reporting practices delivered more than two decades ago. The greatest developments have been actualized in the conceptualization of reliability as a property of data rather than tests and in the inclusion of sample-specific reliability estimates within scientific reports. With the availability of current statistical packages, addressing the reporting and interpretation of related confidence intervals may be within our collective developmental potential as well. The use of visual aids can only further support transparency about the nature of these data.

This special issue of MECD dedicated to score reliability is intended to reinvigorate these early conversations and provide a platform for new ones that may advance our current status. Kalkbrenner (2024) provided a non-technical review of several internal consistency estimates and guidelines for selecting those that suit particular data characteristics. Balkin et al. (2024) provided an overview of strategies for reporting reliability across diverse samples of participants. Cook and Wind (2024) reviewed the fundamental assumptions and practices for estimating score reliability and precision based on item response theory. Coleman et al. (2024) described the rationale and computational procedures for using intercoder reliability across several investigative paradigms. The remaining articles within the issue illustrate strategies for exploring the boundaries of score reliability generalization across several measures frequently used in counseling research and practice.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (Eds.). (2014). Standards for educational and psychological testing. American Educational Research Association.
  • American Psychological Association. (2019). Publication manual of the American Psychological Association (7th ed.). American Psychological Association.
  • Balkin, R. S., Hunter, Q., & Erford, B. T. (2024). Reliable for whom? Inferring and reporting reliability across diverse populations. Measurement and Evaluation in Counseling and Development, 57(2), 1–10. https://doi.org/10.1080/07481756.2023.2301286
  • Balkin, R. S., & Kleist, D. (2022). Counseling research: A practitioner-scholar approach (2nd ed.). American Counseling Association.
  • Barrio Minton, C. A., & Lenz, A. S. (2019). Practical approaches to applied research and program evaluation for helping professionals. Routledge.
  • Baugh, F. (2002). Correcting effect sizes for score reliability: A reminder that measurement and substantive issues are linked inextricably. Educational and Psychological Measurement, 62(2), 254–263. https://doi.org/10.1177/0013164402062002004
  • Coleman, M. L., Ragan, M., & Dari, T. (2024). Intercoder reliability for use in qualitative research and evaluation. Measurement and Evaluation in Counseling and Development, 57(2), 1–11. https://doi.org/10.1080/07481756.2024.2303715
  • Cook, R. M., & Wind, S. A. (2024). Item response theory: A modern measurement approach to reliability and precision for counseling researchers. Measurement and Evaluation in Counseling and Development, 57(2), 1–20. https://doi.org/10.1080/07481756.2023.2301284
  • Fan, X., & Thompson, B. (2001). Confidence intervals about score reliability coefficients, please: An EPM Guidelines editorial. Educational and Psychological Measurement, 61(4), 517–531. https://doi.org/10.1177/00131640121971365
  • Henson, R. K. (2001). Understanding internal consistency reliability estimates: A conceptual primer on coefficient alpha. Measurement and Evaluation in Counseling and Development, 34(3), 177–189. https://doi.org/10.1080/07481756.2002.12069034
  • Henson, R. K., & Thompson, B. (2002). Characterizing measurement error in scores across studies: Some recommendations for conducting "reliability generalization" studies. Measurement and Evaluation in Counseling and Development, 35(2), 113–127. https://doi.org/10.1080/07481756.2002.12069054
  • Kalkbrenner, M. T. (2024). Choosing between Cronbach’s coefficient alpha, McDonald’s coefficient omega, and coefficient H: Confidence intervals and the advantages and drawbacks of interpretive guidelines. Measurement and Evaluation in Counseling and Development, 57(2), 1–13. https://doi.org/10.1080/07481756.2023.2283637
  • Lenz, A. S., Ault, H., Balkin, R. S., Barrio Minton, C. A., Erford, B. T., Hays, D. G., Kim, B. S. K., & Li, C. (2022). Responsibilities of users of standardized tests (RUST-4E). Measurement and Evaluation in Counseling and Development, 55(4), 227–235. https://doi.org/10.1080/07481756.2022.2052321
  • Lenz, A. S., Smith, C., & Meegan, A. (in press). A reliability generalization meta-analysis of scores on the Professional Quality of Life (ProQOL) Scale across sample characteristics. Measurement and Evaluation in Counseling and Development. https://doi.org/10.1080/07481756.2024.2320154
  • McKay, M., Healy, C., & O'Donnell, L. (2021). The Adolescent and Adult Time Inventory-Time Attitudes Scale: A comprehensive review and meta-analysis of psychometric studies. Journal of Personality Assessment, 103(5), 576–587. https://doi.org/10.1080/00223891.2020.1818573
  • Miller, C. S., Woodson, J., Howell, R. T., & Shields, A. L. (2009). Assessing the reliability of scores produced by the Substance Abuse Subtle Screening Inventory. Substance Use & Misuse, 44(8), 1090–1100. https://doi.org/10.1080/10826080802486772
  • Onwuegbuzie, A. J., & Daniel, L. G. (2002). A framework for reporting and interpreting internal consistency reliability estimates. Measurement and Evaluation in Counseling and Development, 35(2), 89–103. https://doi.org/10.1080/07481756.2002.12069052
  • Sheperis, C. J., Young, J. S., & Daniels, M. H. (2023). Counseling research: Quantitative, qualitative, and mixed methods (3rd ed.). Pearson.
  • Thompson, B., & Snyder, P. A. (1998). Statistical significance and reliability analyses in recent Journal of Counseling & Development research articles. Journal of Counseling & Development, 76(4), 436–441. https://doi.org/10.1002/j.1556-6676.1998.tb02702.x
  • Thompson, B., & Vacha-Haase, T. (2000). Psychometrics is datametrics: The test is not reliable. Educational and Psychological Measurement, 60(2), 174–195. https://doi.org/10.1177/00131640021970448
  • Vacha-Haase, T., Kogan, L. R., & Thompson, B. (2003). Sample compositions and variabilities in published studies versus those in test manuals. In B. Thompson (Ed.), Score reliability: Contemporary thinking on reliability issues (pp. 157–172). Sage Publications, Inc.
  • Vincent, M., Rubio-Aparicio, M., Sánchez-Meca, J., & Gonzálvez, C. (2019). A reliability generalization meta-analysis of the Child and Adolescent Perfectionism Scale. Journal of Affective Disorders, 245, 533–544. https://doi.org/10.1016/j.jad.2018.11.049
  • Wester, K. L., & Wachter Morris, C. A. (2018). Making research relevant: Applied research designs for the mental health practitioner. Routledge.
  • Yin, P., & Fan, X. (2000). Assessing the reliability of the Beck Depression Inventory scores: Reliability generalization across studies. Educational and Psychological Measurement, 60(2), 201–223. https://doi.org/10.1177/00131640021970466
