ABSTRACT
Purpose
This study sought to 1) identify linguistic features important for Chinese text complexity with a theory-based and systematic approach, and 2) address how feature sets and algorithms affect the performance of Chinese text complexity models.
Method
Texts from Chinese language arts textbooks for Grades 1 to 6 (N = 1,478) in Mainland China were analyzed. The predictor variables were 265 linguistic features of the texts: 154 lexical features and 111 sentence and discourse features. The outcome variable was the complexity level of each text; a one-semester scale was applied, yielding 12 levels in total (two semesters per grade).
Results
Features in the categories of character and word frequency, character and word semantic features, lexical diversity, part-of-speech syntactic categories, and referential cohesion were found to be the most important. With the important features identified, we found that text complexity models with features at all levels outperformed those with features at only one level. Models using the two machine learning algorithms (Random Forest Regression and Support Vector Regression) outperformed those using Linear Regression.
Conclusion
This work clarifies important linguistic features for Chinese text complexity, and points to the necessity of considering features across levels and using machine learning algorithms in future text complexity research.
Acknowledgments
We thank Hailey Gibbs at the University of Maryland, College Park, for her kind help with proofreading.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Notes
1. There are two scripts in the modern Chinese language: the Traditional Chinese script, used in Hong Kong, Taiwan, and Macau, and the Simplified Chinese script, used mainly in Mainland China. Although visually distinct, the two scripts carry the characteristics of the Chinese writing system in the same manner. Thus, we found it feasible to consider findings from both scripts in the context of text complexity research.
2. We used regression models on the consideration that the complexity levels of texts increase continuously throughout elementary school, without a clear boundary between two adjacent semester levels, as claimed in Phani et al. (2019).
3. We acknowledge that using absolute accuracy to evaluate regression models may not be appropriate (François & Miltsakaki, 2012). We include absolute accuracy here only to compare our results with previous Chinese text complexity studies, some of which reported only the absolute accuracy of their models (Sung et al., 2016; Tseng et al., 2019; Wu et al., 2020). We used a rounding method to convert continuous estimated values to categorical levels; e.g., an estimated value between 3.5 and 4.4 was considered a complexity level of 4, following previous practice (François & Miltsakaki, 2012).
4. We employed 5-fold cross-validation, and thus there were five data points for each evaluation index (e.g., R²) under each of the nine conditions (the combinations of three feature sets and three algorithms).
5. Both of our models would have achieved an absolute accuracy of .76 if we had used a two-grade-level scale like the existing models (.59–.64, Wu et al., 2020). Our models would have achieved absolute accuracies of .49 (RFR) and .51 (SVR) if we had used a one-grade-level scale like the existing models (.44–.72, Sung et al., 2016; .49–.76, Tseng et al., 2019).
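
The rounding step described in Note 3 can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `to_level` and `absolute_accuracy`, the half-up rounding via `math.floor(x + 0.5)`, the clamping of estimates to the 1–12 range, and the sample values are all assumptions made for illustration.

```python
import math


def to_level(estimate, lowest=1, highest=12):
    """Round a continuous estimate half-up and clamp it to the level range.

    Half-up rounding maps 3.5 to 4 and 4.4 to 4. Python's built-in round()
    rounds halves to the nearest even integer, so it is avoided here.
    Clamping to [lowest, highest] is an assumption, not stated in the notes.
    """
    level = math.floor(estimate + 0.5)
    return max(lowest, min(highest, level))


def absolute_accuracy(estimates, true_levels):
    """Proportion of texts whose rounded estimate equals the true level."""
    hits = sum(to_level(e) == t for e, t in zip(estimates, true_levels))
    return hits / len(true_levels)


# Made-up regression outputs and semester levels, for illustration only.
estimates = [3.5, 4.4, 4.5, 0.2, 12.9]
true_levels = [4, 4, 4, 1, 12]
print(absolute_accuracy(estimates, true_levels))
```

In this toy example, four of the five rounded estimates match the true level; only 4.5 rounds to level 5 rather than 4.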