Research Article

CFSE: a Chinese short text classification method based on character frequency sub-word enhancement

Article: 2263663 | Received 08 Jun 2023, Accepted 21 Sep 2023, Published online: 06 Oct 2023
 

Abstract

As a foundational task in natural language processing, text classification is widely used in information retrieval, public opinion analysis, and related tasks. Chinese short texts have sparse features, which lowers classification accuracy. To address this problem, this paper proposes a Chinese short text classification method based on Character Frequency Sub-word Enhancement (CFSE). First, the initial Chinese-character sequence is mapped to the corresponding Character Frequency Sub-word (CFS) sequence based on global character frequency information (see Note 1). Second, relationship features among the data are extracted by processing the CFS sequence with BiLSTM-Att, and the semantic features of the initial Chinese-character sequence are obtained through ERNIE. Finally, the two kinds of features are fused and fed into the text classifier to obtain the classification result. Experimental results show that the proposed method improves the classification accuracy of Chinese short texts.
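
The first step of the method, mapping characters to frequency-based tokens, can be illustrated with a minimal sketch. The abstract does not specify how a CFS is actually constructed, so everything here is an assumption made for illustration: the function names `build_char_frequency` and `to_cfs_sequence`, the `num_buckets` parameter, the bucketing rule, and the `F0`...`F3` token names are all hypothetical. The sketch only shows the general idea of replacing each character with a token derived from its global frequency in the corpus.

```python
from collections import Counter


def build_char_frequency(corpus):
    """Count how often each character occurs across the whole corpus."""
    counts = Counter()
    for text in corpus:
        counts.update(text)  # iterating a str yields single characters
    return counts


def to_cfs_sequence(text, counts, num_buckets=4):
    """Map each character to a hypothetical frequency-bucket token.

    Assumption: a "sub-word" is modelled here as a bucket label, where
    bucket 0 holds unseen or rare characters and the highest bucket holds
    the most frequent ones. The paper's real mapping may differ.
    """
    max_count = max(counts.values()) if counts else 1
    tokens = []
    for ch in text:
        freq = counts.get(ch, 0)
        # Scale the count into [0, num_buckets - 1] with integer arithmetic.
        bucket = min(num_buckets - 1, freq * num_buckets // (max_count + 1))
        tokens.append(f"F{bucket}")
    return tokens


corpus = ["今天天气好", "天气不好"]
counts = build_char_frequency(corpus)
cfs = to_cfs_sequence("天气好", counts)
```

In the pipeline described above, a sequence like `cfs` would be fed to the BiLSTM-Att branch, while the original character sequence would go to ERNIE. Those model components are not sketched here.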

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1 Character in this paper refers to a single Chinese character.

Additional information

Funding

This work was supported by the Graduate Innovation Fund project of Anhui University of Science and Technology [grant number 2022CX2127]; the National Natural Science Foundation of China [grant number 62076006]; the Anhui Province University Natural Science Research Project [grant number 2023AH050846]; the Opening Foundation of the State Key Laboratory of Cognitive Intelligence [grant number COGOS-2023HE02]; and the University Synergy Innovation Program of Anhui Province [grant number GXXT-2021-008].