Research Article

Improving the Precision of Classroom Observation Scores Using a Multi-Rater and Multi-Timepoint Item Response Theory Model


