Research Article

Text Complexity of Chinese Elementary School Textbooks: Analysis of Text Linguistic Features Using Machine Learning Algorithms

Pages 235-255 | Published online: 14 Aug 2023
 

ABSTRACT

Purpose

This study sought to 1) identify linguistic features important for Chinese text complexity with a theory-based and systematic approach, and 2) address how feature sets and algorithms affect the performance of Chinese text complexity models.

Method

Texts from Chinese language arts textbooks for Grades 1 to 6 (N = 1,478) in Mainland China were analyzed. The predictor variables were 265 linguistic features of texts: 154 lexical features and 111 sentence and discourse features. The outcome variable was the complexity level of texts, measured on a one-semester scale with 12 levels in total (two semesters per grade).

Results

Features in the categories of character and word frequency, character and word semantic features, lexical diversity, part-of-speech syntactic categories, and referential cohesion were found to be the most important. Using the important features identified, we found that text complexity models with features at all levels outperformed those with features at only one level. Models using the two machine learning algorithms (Random Forest Regression and Support Vector Regression) outperformed those using Linear Regression.

Conclusion

This work clarifies important linguistic features for Chinese text complexity, and points to the necessity of considering features across levels and using machine learning algorithms in future text complexity research.

Acknowledgments

We thank Hailey Gibbs at the University of Maryland, College Park, for her kind help with proofreading.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1. There are two scripts in the modern Chinese language, the Traditional Chinese script used in Hong Kong, Taiwan, and Macau, and the Simplified Chinese script mainly used in Mainland China. Although visually distinct, the two scripts carry the characteristics of the Chinese writing system in the same manner. Thus, we found it feasible to consider findings from both scripts in the context of text complexity research.

2. We used regression models on the premise that the complexity levels of texts increase continuously throughout elementary school, without a clear boundary between two adjacent semester levels as claimed in Phani et al. (2019).

3. We acknowledge that absolute accuracy may not be an appropriate way to evaluate regression models (François & Miltsakaki, 2012). We include it here only to compare our results with previous Chinese text complexity studies, some of which reported only the absolute accuracy of their models (Sung et al., 2016; Tseng et al., 2019; Wu et al., 2020). Following previous practice (François & Miltsakaki, 2012), we rounded continuous estimated values to categorical levels; e.g., an estimated value between 3.5 and 4.4 was considered a complexity level of 4.

4. We employed 5-fold cross-validation, so there were five data points for each evaluation index (e.g., R²) under each of the nine conditions (the combinations of three feature sets and three algorithms).

5. Both of our models would have achieved an absolute accuracy of .76 if we had used a two-grade-level scale like the existing models (.59–.64, Wu et al., 2020). Our models would have achieved absolute accuracies of .49 (RFR) and .51 (SVR) if we had used a one-grade-level scale like the existing models (.44–.72, Sung et al., 2016; .49–.76, Tseng et al., 2019).
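
The evaluation setup described in Notes 3 and 4 can be illustrated with a minimal Python sketch. It is not the authors' code. The feature-set labels, the contiguous unshuffled fold split, the clamping of estimates to the 1 to 12 scale, and all function names are assumptions made for illustration; the page does not say how the folds were drawn or which software was used.

```python
import math
from itertools import product

# Assumed labels for the three feature sets; the three algorithms are named in the abstract.
FEATURE_SETS = ["lexical", "sentence_discourse", "all_levels"]
ALGORITHMS = ["LR", "RFR", "SVR"]


def k_fold_indices(n_items, k=5):
    """Yield (train, test) index lists for k contiguous folds (no shuffling; an assumption)."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_items))
        yield train, test
        start = stop


def to_level(estimate, lowest=1, highest=12):
    """Round half-up to the nearest level (3.5 to 4.4 gives 4).

    Clamping to the 1-12 scale is an assumption, not stated in the notes.
    """
    return max(lowest, min(highest, math.floor(estimate + 0.5)))


def absolute_accuracy(estimates, true_levels):
    """Share of texts whose rounded estimate equals the true level."""
    hits = sum(to_level(e) == t for e, t in zip(estimates, true_levels))
    return hits / len(true_levels)


# Nine conditions (3 feature sets x 3 algorithms), each evaluated on five folds
# of the 1,478 texts, so each evaluation index has five values per condition.
conditions = list(product(FEATURE_SETS, ALGORITHMS))
folds = list(k_fold_indices(1478, k=5))
```

The sketch uses `math.floor(estimate + 0.5)` rather than Python's built-in `round`, because `round` sends ties to the nearest even integer (`round(4.5)` is 4), which would not match the 3.5 to 4.4 example in Note 3.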

Additional information

Funding

This research was supported by grants from the Ministry of Education of the People’s Republic of China [17YJA190009] to Hong Li. The writing of this paper was partially supported by a Seed Funding Grant at The Education University of Hong Kong [RG 37/2021-2022 R] to Yixun Li.
