Research Article

Improving the Precision of Classroom Observation Scores Using a Multi-Rater and Multi-Timepoint Item Response Theory Model


