Research Article

Latent-Variable Modelling of Ordinal Outcomes in Language Data Analysis

Received 28 Jul 2023, Accepted 01 Mar 2024, Published online: 08 Apr 2024
 

ABSTRACT

In empirical work, ordinal variables are typically analysed using means based on numeric scores assigned to categories. While this strategy has met with justified criticism in the methodological literature, it also generates simple and informative data summaries, a standard often not met by statistically more adequate procedures. Motivated by a survey of how ordered variables are dealt with in language research, we draw attention to an un(der)used latent-variable approach to ordinal data modelling, which constitutes an alternative perspective on the most widely used form of ordered regression, the cumulative model. Since the latent-variable approach does not feature in any of the studies in our survey, we believe it is worthwhile to promote its benefits. To this end, we draw on questionnaire-based preference ratings by speakers of Maltese English, who indicated on a 5-point scale which of two synonymous expressions (e.g. package-parcel) they (tend to) use. We demonstrate that a latent-variable formulation of the cumulative model affords nuanced and interpretable data summaries that can be visualized effectively, while at the same time avoiding limitations inherent in mean response models (e.g. distortions induced by floor and ceiling effects). The online supplementary materials include a tutorial for its implementation in R.

Acknowledgement

We would like to thank Santiago Barreda and an anonymous reviewer for their constructive comments on an earlier version of this paper.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Supplemental data

Supplemental data for this article can be accessed online at https://doi.org/10.1080/09296174.2024.2329448

Correction Statement

This article has been corrected with minor changes. These changes do not impact the academic content of the article.

Notes

1. All images in this paper have been published under the Creative Commons Attribution 4.0 licence (CC BY 4.0, http://creativecommons.org/licenses/by/4.0) in the accompanying OSF project (https://osf.io/jnv27).

3. No ethics approval was obtained for this study, since – due to its design and participant characteristics – this was (and is, as of 2023) not required by European or Maltese national regulations or policies. For a careful weighing of research-ethical considerations, please refer to the data protection impact assessment in the TROLLing post (Krug et al., 2023).

4. In the notation used here, the term in brackets denotes the random intercepts for the variable informant.

5. These are the fitted values obtained through appropriate combination of the regression coefficients (i.e. the fixed intercept and slopes).

6. This is the regression line fitted to the (unjittered) data.

7. For the regression analysis reported in this paper, the predictor year of birth was centred and rescaled to range from −1 (1950) to +1 (2000). The slope coefficients (and their CIs) are therefore reported multiplied by 2, to obtain the difference associated with a 2-unit (i.e. 50-year) rather than a 1-unit (25-year) change in the predictor.
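
The rescaling described in this note can be sketched as follows. This is a minimal Python illustration rather than the R code used for the paper; the function name `rescale_year` and the slope value are hypothetical and serve only to show the arithmetic.

```python
def rescale_year(year, low=1950, high=2000):
    """Map year of birth linearly so that `low` becomes -1 and `high` becomes +1."""
    midpoint = (low + high) / 2      # 1975 for the default range
    half_range = (high - low) / 2    # 25 years correspond to one rescaled unit
    return (year - midpoint) / half_range


# Hypothetical slope on the rescaled predictor (change per 1 unit = 25 years).
slope_per_unit = 0.4

# Difference associated with the full 50-year span (2 rescaled units).
slope_per_50_years = 2 * slope_per_unit

print(rescale_year(1950), rescale_year(1975), rescale_year(2000))
print(slope_per_50_years)
```

Doubling the coefficient is simply a change of reporting unit; the fitted model itself is unaffected.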

8. The parameterization of the model determines the centre of the latent scale and is therefore one of the features affecting the values taken on by the latent variable. A model feature that influences the variability of the latent variable is the link function chosen (e.g. probit or logit): for the logit link, the latent-variable variance is π²/3, i.e. about 3.3 times larger than for the probit link, where it is 1. Finally, the data will affect the value of the latent variable (only) if the model is identified by setting the intercept to 0. The centre of the latent scale will then depend on the scaling of the predictor variables (i.e. which condition is referenced when setting all predictors to 0) and the association between predictors and outcome (i.e. the expected distribution of the outcome variable for the condition denoted by the intercept).
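
The variance comparison between the two links can be checked with a few lines of Python. This is only a sketch of the arithmetic, using the standard logistic and standard normal distributions that underlie the logit and probit links; the variable names are our own.

```python
import math
from statistics import NormalDist

# Variance of the standard logistic distribution (logit link).
logit_var = math.pi ** 2 / 3

# Variance of the standard normal distribution (probit link).
probit_var = NormalDist().variance

# Ratio of variances, and the corresponding ratio of standard deviations.
var_ratio = logit_var / probit_var
sd_ratio = math.sqrt(var_ratio)

print(round(var_ratio, 2), round(sd_ratio, 2))
```

On the standard-deviation scale the logit latent variable is therefore less than twice as spread out as the probit one, which is why coefficients from the two links differ by a roughly constant factor.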

9. As we will see shortly, this involves defining the thresholds (i.e. parameterizing the model) in such a way that the mean of the threshold parameters is zero.

10. As explained in more detail in the tutorial that is included in the supplementary materials (https://osf.io/jnv27), in our analyses, this constraint is actually enacted at the post-processing stage. Thus, the parameterization we used when fitting the model using the R package ‘ordinal’ does not define thresholds to be centred. It is the package we use to construct model predictions (emmeans; Lenth, 2023) that (by default) returns predictions (or estimates) on a latent scale that is centred at the average over thresholds.
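
Why centring can be deferred to post-processing is easy to see in a small sketch: shifting the thresholds and the latent mean by the same constant leaves the predicted category probabilities unchanged. The Python code below assumes a cumulative probit model with made-up threshold values; it does not reproduce the internals of ‘ordinal’ or emmeans, and the helper names are hypothetical.

```python
from statistics import NormalDist


def category_probs(latent_mean, thresholds):
    """Category probabilities under a cumulative probit model."""
    cdf = NormalDist().cdf
    # Cumulative probabilities P(Y <= k) at each threshold, plus 1 for the top category.
    cum = [cdf(t - latent_mean) for t in thresholds] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]


def centre_at_threshold_mean(latent_mean, thresholds):
    """Shift the latent scale so that the thresholds average to zero."""
    shift = sum(thresholds) / len(thresholds)
    return latent_mean - shift, [t - shift for t in thresholds]


# Hypothetical uncentred thresholds for a 5-point scale and one latent mean.
thresholds = [0.2, 1.1, 1.9, 3.0]
latent_mean = 1.4

centred_mean, centred_thresholds = centre_at_threshold_mean(latent_mean, thresholds)

before = category_probs(latent_mean, thresholds)
after = category_probs(centred_mean, centred_thresholds)

print(centred_mean, centred_thresholds)
print(before)
print(after)
```

Only the location of the latent scale changes, so the centred estimates can be read relative to the midpoint of the thresholds without refitting the model.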

11. In the R package ‘ordinal’, this can be achieved by setting the argument ‘threshold’ to ‘symmetric’. This is illustrated in the short R tutorial that accompanies the present paper (https://osf.io/78ezm).

12. While the custom thresholds we are using in our analysis make sense in light of our ordinal response scale and our interpretative preferences, the basic point the current paper is trying to make does not hinge on how thresholds are parameterized. Thus, if they are unstructured (i.e. estimated flexibly, with no a-priori constraints on their relative spacing or midpoint), the same methods of interpretation can still be applied. As a technical aside, however, we note that a benefit of constraining thresholds in this way is that it reduces the number of parameters that must be estimated, thus yielding a more parsimonious model structure.

13. We thank a perceptive reviewer for the following hint: Since the logistic distribution (i.e. the probability distribution underlying the logit link) is well-approximated by a Student-t distribution with about 8 degrees of freedom (see Agresti, 2010, p. 330), the logit link is also more robust to aberrant data points.

14. These scores are in fact only partly standardized, since they are not centred about their mean.

15. This is due to the fact that the MRM relies on a combined (i.e. pooled) estimate of residual variation around the conditional averages. Since the variation of scores is smaller near the endpoints of the scale (see Section 3), the residual standard deviation is downwardly biased for conditional means near the scale midpoint. The opposite is true, of course, for estimates near the bounds of the scale.
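
The dependence of score variability on the location of the conditional mean can be made concrete with a small sketch. The Python code below assumes a cumulative probit model with hypothetical, evenly spaced thresholds and computes the mean and standard deviation of the 1-to-5 scores implied by two latent means; the values are illustrative and are not taken from the paper's data.

```python
from statistics import NormalDist

# Hypothetical thresholds for a 5-point scale, centred at 0.
THRESHOLDS = [-1.5, -0.5, 0.5, 1.5]


def score_mean_sd(latent_mean):
    """Mean and SD of the numeric scores 1..5 implied by a latent mean."""
    cdf = NormalDist().cdf
    cum = [cdf(t - latent_mean) for t in THRESHOLDS] + [1.0]
    probs = [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]
    scores = range(1, 6)
    mean = sum(s * p for s, p in zip(scores, probs))
    var = sum((s - mean) ** 2 * p for s, p in zip(scores, probs))
    return mean, var ** 0.5


# A condition at the scale midpoint and one close to the upper bound.
mid_mean, mid_sd = score_mean_sd(0.0)
end_mean, end_sd = score_mean_sd(2.5)

print(mid_mean, mid_sd)
print(end_mean, end_sd)
```

A single pooled residual standard deviation lies between these two values, which is why it is too small for conditions near the midpoint and too large for conditions near the bounds.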

Additional information

Funding

This work was supported by the German Humboldt Foundation, the Spanish Ministry of Education and Science with the European Regional Development Fund (Grant ID: HUM2007-60706/FILO), and the Bavarian Ministry for Science, Research and the Arts.
