A novel framework for Chinese personal sensitive information detection

Chenglong Ren, School of Cyber Science and Engineering, Sichuan University, Chengdu, People’s Republic of China

Xiao Lan (corresponding author), Cyber Science Research Institute, Sichuan University, Chengdu, People’s Republic of China

Xingshu Chen, School of Cyber Science and Engineering, Sichuan University, Chengdu, People’s Republic of China; Cyber Science Research Institute, Sichuan University, Chengdu, People’s Republic of China

Yonggang Luo, Cyber Science Research Institute, Sichuan University, Chengdu, People’s Republic of China

Shuhua Ruan (corresponding author), School of Cyber Science and Engineering, Sichuan University, Chengdu, People’s Republic of China; Cyber Science Research Institute, Sichuan University, Chengdu, People’s Republic of China

Abstract

With the rapid development of social networks, the harm caused by leaks of personal sensitive information is becoming increasingly serious. To detect and identify personal sensitive information, existing methods build matching rules for specific sensitive entities and use machine learning to classify sensitive text. These methods struggle with context analysis and with adapting to the characteristics of the Chinese language. This paper proposes CPSID, a method for detecting Chinese personal sensitive information. On the one hand, CPSID uses rule matching to detect specific personal sensitive information that contains only letters and numbers. More importantly, CPSID constructs a sequence labelling model named EBC (ELECTRA-BiLSTM-CRF) to detect more complex personal sensitive information that consists of Chinese characters. The EBC model uses ELECTRA to implement word embedding and uses BiLSTM and CRF models to extract personal sensitive information, so it can detect Chinese sensitive entities accurately by analysing context information. The model achieves an F1 score of 94.09% on Chinese datasets, outperforming other similar models. Additionally, experiments on real data show that CPSID achieves better detection results than either individual method (rule matching or sequence labelling).

KEYWORDS:

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the National Natural Science Foundation of China [grant number U19A2081], the Fundamental Research Funds for the Central Universities [grant number 2022SCU12116] and the Science and Engineering Connotation Development Project of Sichuan University [grant number 2020SCUNG129].
