Research Article

Improved Word Segmentation System for Chinese Criminal Judgment Documents

Article: 2297524 | Received 04 Oct 2023, Accepted 15 Dec 2023, Published online: 21 Dec 2023

ABSTRACT

In this paper, a system for automatic word segmentation of Chinese criminal judgment documents is proposed. The system uses a hybrid model composed of fine-tuned BERT (Bidirectional Encoder Representations from Transformers), BiLSTM (Bidirectional Long Short-Term Memory) and CRF (Conditional Random Field) for named entity recognition, and introduces a custom dictionary of common professional terms in Chinese criminal trial documents, as well as a rule system derived from regulations on the judicial system and litigation procedure, to further improve the accuracy of word segmentation. BERT uses a deep bidirectional Transformer encoder to pre-train general language representations from large-scale unlabeled text corpora. BiLSTM uses two LSTM networks, one for the forward direction and one for the backward direction, to capture the context from both sides of the input sequence. CRF uses a set of features and weights to define a log-linear distribution over the output sequence. Experimental results show that the proposed system achieves significantly higher word segmentation accuracy than commonly used Chinese word segmentation models: on the test data, the F1 scores for jieba, THULAC, and the proposed segmentation system are 85.59%, 87.94%, and 94.82%, respectively.

Introduction

Chinese word segmentation (CWS) is the task of splitting Chinese text (a sequence of Chinese characters) into words. Unlike English and other languages that use spaces or punctuation marks to separate words, Chinese text does not have explicit word boundaries. Therefore, as a necessary preprocessing step for many natural language processing (NLP) applications that operate on the word level, such as machine translation, information retrieval, sentiment analysis, and named entity recognition (NER), CWS plays a crucial role.

CWS is not a trivial task, as there are many ambiguities and variations in Chinese word formation and usage. For example, the same character sequence may be segmented into different words depending on the context or domain. Moreover, different corpora and tasks may adopt different segmentation standards and criteria. Hence, sophisticated models and algorithms are required for CWS to capture the linguistic and statistical properties of Chinese text.

In recent years, significant progress has been made in CWS algorithms, with various models developed based on different approaches. Based on published results in academic papers and related websites, many of these models have demonstrated relatively good performance on specific CWS tasks. Nevertheless, these models still exhibit limitations, especially in particular CWS applications. I am currently involved in an NLP project focused on Chinese criminal judgment documents, which demand high accuracy in language use and consequently in NLP results. As a fundamental NLP task, the accuracy of CWS can directly impact the correctness of final outcomes. However, criminal judgment documents contain many domain-specific, less common words and syntactic structures. Existing CWS models still struggle with these, producing erroneous segmentations that propagate errors to later stages. Take two commonly used open-source tools for CWS, Jieba and THULAC (THU Lexical Analyzer for Chinese), as examples. Jieba uses a prefix dictionary structure to achieve efficient word graph scanning and builds a directed acyclic graph (DAG) (Wieczorek 2016) of all possible word combinations. It then employs dynamic programming to identify the most probable combination based on word frequency. For unknown words, Jieba uses a Hidden Markov Model (HMM) (Mor, Garhwal, and Loura 2020) with the Viterbi algorithm to recognize new words by character-based tagging. THULAC implements a word lattice re-ranking method (Jiang, Mi, and Liu 2008) for segmentation and tagging. It is trained on integrated segmentation and part-of-speech labeled corpora, achieving relatively high accuracy and fast processing speed.

However, some problems arose when using these models for CWS on Chinese criminal judgments. For example, the segmentation of personal names was not accurate enough: in the same criminal judgment, the same defendant’s name could be segmented differently depending on the context. For a defendant named “龚某 (Gong Mou)” in a criminal case, there were two segmentation results, “龚某 (Gong Mou)” and “龚某犯 (Gong Mou Fan),” where the latter mistakenly attached “犯 (Fan),” the first character of the following phrase “犯交通肇事罪 (Fan Jiao Tong Zhao Shi Zui)” (meaning “committed the crime of traffic accident”), to the defendant’s name. Another example is that organizational names were not segmented accurately: the procuratorate name “相城区人民检察院 (Xiang Cheng Qu Ren Min Jian Cha Yuan)” (meaning “Xiangcheng District People’s Procuratorate”) was segmented as “相 (adverb) 城区 (noun) 人民检察院 (noun),” incorrectly splitting the complete name of a basic-level people’s procuratorate into an adverb and two nouns. These results indicate that there is still room for improvement in current CWS models on specific segmentation tasks such as Chinese criminal judgments. The purpose of this paper is to enhance the word segmentation accuracy of Chinese criminal judgments by proposing an improved scheme.
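The context-dependent name segmentation described above is easy to observe directly. The following minimal sketch runs jieba on two sentences containing the same name; the exact output depends on the jieba version and its bundled dictionary, so the erroneous split may or may not reproduce verbatim.

```python
# Minimal sketch: observing context-dependent segmentation with jieba.
# Output depends on the jieba version and dictionary; the erroneous
# split described in the text may or may not reproduce exactly.
import jieba

sentences = [
    "被告人龚某犯交通肇事罪",  # "defendant Gong Mou committed the crime of traffic accident"
    "龚某对指控事实无异议",    # "Gong Mou raised no objection to the alleged facts"
]

for s in sentences:
    # jieba.lcut returns the segmentation as a list of words; HMM-based
    # new-word discovery can place the name boundary differently in
    # different contexts, e.g. "龚某" vs. "龚某犯".
    print(jieba.lcut(s))
```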

The results above demonstrate that existing commonly used CWS models cannot adequately segment criminal judgment documents and other professional-domain texts. The limitation is primarily their inability to accurately recognize and segment professional vocabulary. Given the accuracy requirements of professional applications, these defects are unacceptable. The reason is that these models were designed mainly for nonprofessional texts, such as news articles, prose, novels, and comments; the algorithms and training data they employ are not optimized for specific professional domains. To address these deficiencies, the CWS system proposed in this paper adopts a hybrid model that integrates multiple advanced algorithms, incorporates fine-tuning with professional texts, and introduces a custom dictionary and rule system. The objective is to overcome the common challenges faced by conventional CWS models in specific professional domains and enhance segmentation accuracy specifically for Chinese criminal judgment documents.

Related Work

According to recent related research, CWS methods can be roughly categorized into dictionary-based methods, statistical-based methods, and deep learning-based methods.

Dictionary-based methods rely on a predefined lexicon to match the input text with the words in the dictionary. They are simple and fast, but suffer from the problems of out-of-vocabulary words, ambiguity resolution, and domain adaptation. Some approaches have also been proposed to improve dictionary-based methods or combine them with other methods. For example, Liu et al. proposed two methods to exploit dictionary information for improving the performance of neural network based CWS (Liu et al. 2019). The first method is based on pseudo labeled data generation, which uses a dictionary-based CWS model to segment unlabeled sentences and then uses the pseudo labeled sentences to augment the training data for the neural CWS model. The second method is based on multi-task learning, which jointly trains the neural CWS model and a dictionary-based word prediction model, and uses an attention mechanism to transfer the dictionary knowledge from the word prediction model to the CWS model. Xiong et al. introduced a word segmentation algorithm that better covers the ancient Chinese vocabulary used in imperial edicts, and applied it to analyze the psycholinguistic words in the edicts of the Western and Eastern Jin Dynasties (Xiong et al. 2021). By dividing the words into different categories and comparing them, the paper demonstrated the applicability and feasibility of the dictionary-based classical CWS algorithm for studying classical Chinese literature and culture. Tang et al. proposed a new method for CWS based on a novel dictionary mechanism and a forward maximum matching algorithm (Tang, Wu, and Li 2015). The paper provided simulation results showing that the proposed method achieves higher segmentation accuracy and better performance than traditional methods.
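As a concrete reference point for the dictionary-based family, the sketch below implements forward maximum matching, the greedy longest-match scan used by methods such as that of Tang et al.; the toy lexicon and maximum word length are purely illustrative.

```python
# Self-contained sketch of forward maximum matching (FMM): scan left to
# right, always taking the longest dictionary match; single characters
# fall through as possible OOV items. Lexicon and max_len are toys.
def fmm_segment(text: str, lexicon: set, max_len: int = 6) -> list:
    words, i = [], 0
    while i < len(text):
        match = None
        # Try the longest candidate first, shrinking the window on failure.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # fall back to a single character (possible OOV)
        words.append(match)
        i += len(match)
    return words

lexicon = {"人民检察院", "相城区", "被告人"}
print(fmm_segment("相城区人民检察院", lexicon))  # ['相城区', '人民检察院']
```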

Statistical-based methods use probabilistic models to learn the word boundaries from a large corpus of annotated data. They can handle unknown words and ambiguity better than dictionary-based methods, but require sufficient training data and feature engineering. Zhang et al. proposed a novel approach to CWS that treated it as a character-based tagging problem (Zhang et al. 2010). It used a six-tag set, six n-gram feature templates, a conditional random field (CRF) model, and an assistant segmenter that leverages additional linguistic resources, and it adapted to different segmentation standards by using the assistant segmenter as a bridge. Liu et al. proposed a method to improve the performance of CWS on different domains by using free data from the Internet that contain partial annotation information (Liu et al. 2014). The paper transformed various sources of free data, such as domain-specific lexicons and semi-annotated web pages, into a unified form of partial annotation, which indicated the word boundary status of some characters in a sentence. The paper then used a variant of the CRF model that could leverage both fully and partially annotated data for training and inference. Li et al. proposed a unified, character-based, generative model that could incorporate additional resources to handle different types of out-of-vocabulary (OOV) words in CWS (Li, Zong, and Su 2015). The model consisted of several submodels, each of which could use a different type of additional information, such as dictionary, named entity, or suffix information. Xu et al. proposed a unified model for cross-domain and semi-supervised NER in Chinese social media (Xu et al. 2018). The model consisted of two parts: one for cross-domain learning and one for semi-supervised learning. Cross-domain learning could learn useful information from formal domain corpora based on domain similarity, while semi-supervised learning could learn relevant information from unlabeled social media text by self-training.
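The character-based tagging formulation underlying these statistical methods can be made concrete with a small conversion routine. The sketch below uses the common four-tag BMES scheme (the six-tag set of Zhang et al. extends this idea); a CRF is then trained to predict one such tag per character.

```python
# Sketch of the character-based tagging view of CWS: words become
# per-character boundary tags (B/M/E/S) and back.
def words_to_bmes(words: list) -> list:
    tagged = []
    for w in words:
        if len(w) == 1:
            tagged.append((w, "S"))                    # single-character word
        else:
            tagged.append((w[0], "B"))                 # begin
            tagged.extend((c, "M") for c in w[1:-1])   # middle
            tagged.append((w[-1], "E"))                # end
    return tagged

def bmes_to_words(chars: list, tags: list) -> list:
    words, buf = [], ""
    for c, t in zip(chars, tags):
        buf += c
        if t in ("E", "S"):                            # word closes here
            words.append(buf)
            buf = ""
    if buf:                                            # tolerate a dangling B/M
        words.append(buf)
    return words

pairs = words_to_bmes(["相城区", "人民检察院"])
print(pairs)
print(bmes_to_words([c for c, _ in pairs], [t for _, t in pairs]))
```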

Deep learning-based methods use neural networks to automatically learn the features and representations for CWS. They can achieve state-of-the-art performance on various benchmarks and domains, but need more computational resources and may suffer from overfitting. Kong et al. proposed a combination of representation learning and structured prediction to model the segmentation and labeling of sequential data (Kong, Dyer, and Smith 2016). The paper introduced segmental recurrent neural networks (SRNNs), which define a joint probability distribution over segmentations of the input and labelings of the segments, given an input sequence. The key idea is to use bidirectional recurrent neural networks (RNNs) to embed every possible segment of the input in a continuous space, and then use these segment embeddings to compute compatibility scores with output labels. These local scores are integrated using a global semi-Markov CRF, which allows explicit modeling of dependencies between adjacent labels and segment lengths. Zhang and Yang proposed a lattice-structured LSTM model to perform NER on Chinese text (Zhang and Yang 2018). The model took as input a sequence of characters and all potential words that matched a lexicon, and output a sequence of entity labels. It could leverage both character-level and word-level information, avoided the errors caused by word segmentation, and used gated recurrent cells to select the most relevant features from the lattice input for better NER performance. Ma et al. proposed a stacked bidirectional LSTM model with two input features (character and bigram) to perform CWS (Ma, Ganchev, and Weiss 2018). The model used pre-trained embeddings, dropout, and hyperparameter tuning to achieve better accuracy than more complex neural network architectures, and output four labels (Begin, Inside, End, Single) for each character position to indicate word boundaries. Diao et al. proposed a BERT-based Chinese text encoder that incorporated n-gram features into the pre-training and fine-tuning process (Diao et al. 2020). The model, called ZEN, took as input a sequence of characters together with the n-grams that match a lexicon, and integrated the n-gram representations into the character encoder, allowing it to capture both character-level and word-level information for various natural language processing tasks. Wang et al. proposed a graph neural network model to augment the vanilla sequence labeling model output with similar tagging examples retrieved from the whole training set (Wang et al. 2023). The model, called GNN-SL, constructed a heterogeneous graph between the input word sequence and the retrieved tagging examples, and used graph message passing to transfer information between them. The model then used the augmented node that aggregated information from its neighbors to make predictions. This method enabled the model to handle long-tail cases and minority categories better by referring to similar training examples.
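To illustrate the basic shape of these neural approaches, the following is a minimal PyTorch sketch of a BiLSTM character tagger in the spirit of Ma et al., emitting scores over four boundary labels per character; the vocabulary size and hyperparameters are placeholders, and real systems add bigram features, pre-trained embeddings, and dropout.

```python
# Minimal BiLSTM sequence labeler for CWS: one of four boundary labels
# (Begin, Inside, End, Single) is predicted per character.
import torch
import torch.nn as nn

class BiLSTMSegmenter(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 64,
                 hidden_dim: int = 128, num_tags: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True runs one LSTM forward and one backward and
        # concatenates their hidden states at every time step.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(self.embed(char_ids))
        return self.out(h)  # (batch, seq_len, num_tags) emission scores

model = BiLSTMSegmenter(vocab_size=5000)
logits = model(torch.randint(0, 5000, (2, 20)))  # toy batch of 2 sentences
print(logits.shape)  # torch.Size([2, 20, 4])
```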

Each type of method has its own advantages and disadvantages. Dictionary-based methods rely on a predefined lexicon to segment words based on the longest matching principle. They are simple and fast, but they cannot handle unknown words or ambiguous cases well. They also need to maintain and update the dictionary regularly. Statistical-based methods can deal with unknown words and ambiguity better than dictionary-based methods, but they require a large amount of annotated data and feature engineering. They also suffer from data sparsity and domain adaptation issues. Deep learning-based methods can capture complex and nonlinear features automatically, and achieve state-of-the-art performance on CWS. However, they also need a lot of labeled data and computational resources. They are less interpretable and robust than statistical-based methods.

Moreover, automatic language detection and classification can aid word segmentation by providing context about the language being processed, allowing language-specific segmentation rules to be applied and improving segmentation accuracy. For example, Tuncer et al. proposed a novel method for identifying the language of a speech signal using two new techniques: polymer pattern (PP) and tent maximum absolute pooling (TMAP) (Tuncer et al. 2022). PP extracts low-frequency features from the speech signal by applying a polynomial function to the spectrogram, while TMAP extracts high-frequency features by applying a tent-shaped filter to the spectrogram. The paper also introduced a new corpus for language identification, called LI45, which contains 45 languages. Barua et al. introduced a novel handcrafted machine learning framework that uses the graph of the favipiravir molecule as a feature extractor for vowel-based specific language impairment (SLI) diagnosis (Barua et al. 2022). The framework consists of five components: the favipiravir molecular structure pattern, the statistical feature extractor, wavelet packet decomposition, iterative neighborhood component analysis, and a support vector machine classifier. Kirik et al. proposed a novel machine learning framework for identifying the language spoken by a person using electroencephalography (EEG) signals (Kirik et al. 2023). The framework used four feature extraction functions based on Feynman graph patterns, inspired by the diagrams used in physics to represent particle interactions, together with wavelet transform, neighborhood component analysis, a k-nearest neighbor classifier, and iterative majority voting to achieve high accuracy in language detection. Omar et al. proposed a method to detect hate speech and abusive content in Arabic text from different online social networks (Omar, Mahmoud, and El‐Hafeez 2020). A standard Arabic dataset was constructed from data collected on four internet platforms. Features were extracted using three methods: term frequency-inverse document frequency (TF-IDF), word embeddings, and character embeddings. To validate the effectiveness of the proposed dataset, 12 machine learning algorithms and two deep learning architectures were applied to classify the text into seven categories of hate speech and abusive content. Khairy et al. reviewed existing machine learning and deep learning methods for identifying abusive and harmful messages in Arabic social media (Khairy et al. 2021). The paper summarized the gaps and limitations of the existing methods, concluding that automatic detection of abusive language and cyberbullying in Arabic content is a challenging and under-researched task that requires more attention and effort from the research community. Another method proposed by Omar et al. filtered social network content based on topic classification, sentiment analysis, and multilabel classification (Omar et al. 2021); machine learning models were trained and tested on a standard multi-label Arabic dataset, and the relationship between topics and hate speech was examined. Another method proposed by Khairy et al. used different single and ensemble machine learning algorithms to automatically detect offensive language and cyberbullying in Arabic text (Khairy et al. 2023). The paper compared the performance of three single classifiers (Naive Bayes, Support Vector Machine, and Random Forest) and three ensemble models (Bagging, Boosting, and Voting) on three Arabic datasets, and applied hyperparameter tuning to the Voting technique to improve performance. Finally, Omar and El‐Hafeez compared quantum computing and machine learning for sentiment analysis of Arabic tweets (Omar and El‐Hafeez 2023). The paper used two datasets of Arabic tweets and applied both classic machine learning and quantum computing approaches, evaluating their performance in terms of accuracy, precision, recall, F1 score, and processing time.

Method and Experiment

As mentioned, current commonly used CWS models exhibit relatively high error rates when applied to criminal judgment documents. The main reason lies in the highly specialized nature of the language used in criminal judgment documents, which presents challenges for general-purpose CWS models. The dictionaries and training data for building general-purpose CWS models typically come from internet media platforms, newspaper articles, and electronic versions of literary works. As such, they mainly cover content related to current affairs news, commentaries, and fiction targeting a general audience. These texts have different vocabularies, styles, and structures from criminal judgment documents, which are formal, legal, and standardized. Therefore, when general segmentation models are used to segment criminal judgment documents, many OOV words are generated, which affects segmentation accuracy. To improve the accuracy of CWS on criminal judgments, several approaches can be considered. First, the model architecture itself can be optimized: more advanced architectures or a combination of multiple models can be used to improve segmentation accuracy. Second, training the model on textual data that is more relevant to the legal domain, such as statutes, regulations, and a subset of manually annotated judgment documents, can help enrich the vocabulary and capture linguistic patterns characteristic of this genre. Third, instead of treating judgments as generic texts, specific stylistic features of judgment documents can be incorporated, such as the use of terminology, archaic adjectival modifiers, and standard templates. For example, NER can be used to identify legal entities, and keywords that frequently appear in judgments but rarely occur in general texts can be added to the model’s vocabulary. Leveraging such in-domain information is likely to yield more robust word segmentation performance on criminal judgments.
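As a simple illustration of the third approach, domain keywords can be injected into an off-the-shelf segmenter. The sketch below uses jieba’s user-dictionary API with example terms from this paper; how strongly an added entry influences the output depends on the jieba version and the frequency assigned to the added word.

```python
# Sketch: protecting domain terms with jieba's user-dictionary API so
# the segmenter keeps them intact. jieba also accepts a user dictionary
# file via jieba.load_userdict(path). The effect of an added entry
# depends on the jieba version and the word frequency it is given.
import jieba

for term in ["相城区人民检察院", "认罪认罚具结书", "交通肇事罪"]:
    jieba.add_word(term)

print(jieba.lcut("相城区人民检察院指控被告人犯交通肇事罪"))
```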

NER is a task of identifying and classifying proper nouns and other specific terms in a text, such as person names, locations, organizations, dates, etc. NER can help to correctly handle OOV words in CWS. In the CWS system proposed in this paper, NER is performed by a module that integrates three models: BERT (Devlin et al. 2019), BiLSTM (Hochreiter and Schmidhuber 1997; Schuster and Paliwal 1997) and CRF (Lafferty, McCallum, and Pereira 2001). BERT (Bidirectional Encoder Representations from Transformers) uses a deep bidirectional Transformer encoder to pre-train general language representations from large-scale unlabeled text corpora. It can capture both left and right context in all layers of the encoder, which enables it to learn rich and expressive representations for words and sentences. BERT uses two pre-training objectives: masked language modeling (MLM) and next sentence prediction (NSP). MLM randomly masks some tokens in the input sentence and predicts the original tokens based on the context. NSP predicts whether two sentences are consecutive or not based on the classification and separator tokens. These two objectives allow BERT to learn both syntactic and semantic information from the text. After pre-training, BERT can be fine-tuned with just one additional output layer for various downstream tasks, such as classification, regression, or sequence labeling. BERT can also be adapted to different domains or languages by using domain-specific or multilingual corpora for pre-training. BiLSTM (Bidirectional Long Short-Term Memory) uses two LSTM networks, one for the forward direction and one for the backward direction, to capture the context from both sides of the input sequence. By concatenating or summing the outputs of the two LSTMs, BiLSTM can produce a richer representation of each element in the sequence. BiLSTM uses the same structure as LSTM, but with two hidden layers instead of one. LSTM consists of a memory cell and three gates: input gate, forget gate, and output gate. The memory cell can store and update information over time, and the gates can control the flow of information into and out of the cell. BiLSTM uses two sets of memory cells and gates, one for each direction, and combines their outputs at each time step. CRF models the conditional probability of the output sequence given the input sequence, rather than the joint probability of both sequences. This allows CRF to avoid the label bias problem that affects locally normalized models such as maximum entropy Markov models (MEMMs). CRF uses a set of features and weights to define a log-linear distribution over the output sequence. The features can capture both local and global dependencies among the input and output variables, and the weights can be learned from the training data using maximum likelihood estimation or maximum a posteriori estimation.
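For orientation, the sketch below shows how per-character contextual representations can be obtained from a pre-trained Chinese BERT with the Hugging Face transformers library; the bert-base-chinese checkpoint is used purely as an example and is not necessarily the checkpoint used in this work.

```python
# Sketch: per-character contextual representations from a pre-trained
# Chinese BERT. "bert-base-chinese" is a publicly available checkpoint
# used here only for illustration.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

text = "相城区人民检察院"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)

# last_hidden_state holds one contextual vector per token; Chinese BERT
# tokenizes at the character level, plus the [CLS] and [SEP] tokens.
print(outputs.last_hidden_state.shape)  # torch.Size([1, 10, 768])
```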

The main idea of using a hybrid model of BERT, BiLSTM, and CRF for NER is to leverage the advantages of each component to achieve better performance. First, the model uses BERT to obtain the embeddings for each token in a sentence. Second, the model feeds these embeddings into a BiLSTM network to get the hidden states of each token. Third, the model adds a CRF layer on top of the BiLSTM network to output the optimal label sequence for the sentence. The structure of this hybrid model is shown in Figure 1.

Figure 1. Structure of the hybrid model of BERT, BiLSTM and CRF for NER.


The advantages of using a hybrid model over using BERT, BiLSTM, or CRF alone are as follows. BERT is a powerful pre-trained language model that can capture the semantic and syntactic information of natural language. However, BERT alone cannot effectively model the sequential dependencies and label transitions of NER tasks. Therefore, adding a BiLSTM layer after BERT helps to encode the contextual information and capture the long-term dependencies of the input sequence. BiLSTM can process the input sequence from both directions and generate hidden states that contain information from the past and future context. However, BiLSTM alone cannot enforce label consistency or use prior knowledge of the label structure. Therefore, adding a CRF layer after BiLSTM helps to model the label transitions and find the optimal label sequence. CRF is a probabilistic graphical model that can jointly decode the label sequence by maximizing the conditional probability of the output given the input. However, CRF alone cannot learn rich representations of the input sequence and relies on hand-crafted features. Therefore, adding a BERT layer before CRF helps to learn contextualized embeddings and reduce the feature engineering effort.
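A condensed sketch of this BERT + BiLSTM + CRF pipeline is given below, assuming the third-party pytorch-crf package for the CRF layer and the bert-base-chinese checkpoint; the layer sizes are illustrative and the actual system’s configuration may differ.

```python
# Sketch of the Figure 1 pipeline: BERT embeddings -> BiLSTM encoding
# -> CRF decoding. Assumes the pytorch-crf package (torchcrf).
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF

class BertBiLstmCrf(nn.Module):
    def __init__(self, num_tags: int, hidden_dim: int = 128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        # Step 1: contextual embeddings from BERT.
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        # Step 2: BiLSTM re-encodes the sequence in both directions.
        h, _ = self.lstm(h)
        emissions = self.emit(h)
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence
            # (in practice [CLS]/[SEP]/padding positions carry an O-style tag).
            return -self.crf(emissions, tags, mask=mask)
        # Inference: Viterbi decoding of the best tag sequence.
        return self.crf.decode(emissions, mask=mask)

# Example instantiation, e.g. BIO tags for three entity types plus O:
# model = BertBiLstmCrf(num_tags=7)
```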

Pre-trained models such as BERT combined with BiLSTM and CRF have been shown to achieve state-of-the-art results on various Chinese NER tasks (Dai et al. 2019; Hu, Zhang, and Hu 2022; Yang, Gan, and Zhang 2022). However, to improve NER accuracy on Chinese criminal judgment documents specifically, further fine-tuning using labeled Chinese criminal judgment documents as training data is needed. The labeled documents used for fine-tuning can also be leveraged to build a custom dictionary of common domain-specific terminology, and incorporating this dictionary can further improve CWS accuracy.

Additionally, there exist certain “rules” within Chinese criminal judgment documents that can be exploited to boost segmentation performance. These include common fixed phrases as well as regulations originating from the judicial system and litigation procedures. By encoding these rules, the segmentation component can better handle legal jargon and unfamiliar entities not learned during training. For fixed phrases such as the example mentioned in the introduction, “defendant name + crime name,” when the segmentation model detects this pattern, it can separate the defendant name from the following verb-object phrase that states the crime, instead of mistakenly identifying the following text as part of the defendant’s name. Similarly, other fixed phrases are often mis-segmented in Chinese criminal judgment documents, such as “并处罚金” (meaning “and impose a fine”; correct segmentation: “并” (conjunction) + “处” (verb) + “罚金” (noun), instead of “并” (conjunction) + “处罚金” (noun)) and “认罪认罚具结书” (meaning “recognizance to admit guilt and accept punishment”; correct segmentation: the whole phrase as a single noun, instead of “认罪” (verb) + “认罚” (verb) + “具结书” (noun)). For regulations on the judicial system and litigation procedures, the example in the introduction shows the mis-recognition of procuratorate names. Under the relevant legal regulations, China’s procuratorates are divided into the supreme people’s procuratorate, local people’s procuratorates at various levels, and special people’s procuratorates such as military procuratorates. Local people’s procuratorates are further divided into provincial, municipal, and basic-level people’s procuratorates. In Chinese criminal judgment documents, basic-level people’s procuratorates are generally expressed as “province name + district name or county name, etc. + people’s procuratorate,” while municipal people’s procuratorates are generally expressed as “province name + city name + people’s procuratorate.” If these regulations are encoded as segmentation rules, segmentation accuracy can be significantly improved.
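A hedged sketch of how a few such rules could be encoded as regular-expression patterns that force a segmentation for matched spans is shown below; the patterns are simplified illustrations of the rules just described, not the system’s actual rule set.

```python
# Sketch: hand-written segmentation rules as regex patterns. Each rule
# maps a matched span to a forced segmentation; patterns are simplified
# illustrations only.
import re

RULES = [
    # Keep "认罪认罚具结书" as one noun instead of splitting it.
    (re.compile(r"认罪认罚具结书"), lambda m: [m.group(0)]),
    # "并处罚金" -> 并 / 处 / 罚金 (conjunction + verb + noun).
    (re.compile(r"并处罚金"), lambda m: ["并", "处", "罚金"]),
    # "<name>犯<crime>罪": 犯 opens the verb-object phrase and must not
    # be attached to the preceding defendant's name.
    (re.compile(r"([\u4e00-\u9fff]{2,3})犯([\u4e00-\u9fff]{2,8})罪"),
     lambda m: [m.group(1), "犯", m.group(2) + "罪"]),
    # "<place>人民检察院" is a single organization name.
    (re.compile(r"[\u4e00-\u9fff]{2,6}人民检察院"), lambda m: [m.group(0)]),
]

def apply_rules(text: str):
    """Yield (span, forced_segmentation) for every rule match."""
    for pattern, split in RULES:
        for m in pattern.finditer(text):
            yield m.span(), split(m)

for span, seg in apply_rules("龚某犯交通肇事罪，相城区人民检察院指控。"):
    print(span, seg)  # e.g. (0, 8) ['龚某', '犯', '交通肇事罪']
```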

The main processing steps of the hybrid model are as follows.

The hybrid model is pre-trained on the Microsoft Research Asia Chinese named entity recognition dataset (MSRA CN NER Dataset), a general corpus containing 46,364 sentences for training and 4,365 sentences for validation (Levow 2006). The model learns general language knowledge and features from this corpus. Then, the model is fine-tuned on a manually annotated dataset of 300 Chinese criminal judgment documents, which reflects the characteristics and requirements of the target domain and task. The fine-tuning process updates the parameters of the pre-trained model using the domain-specific dataset, with the goal of adapting the model to the target domain and task without forgetting the general language knowledge and features learned during pre-training. The performance of the fine-tuned model is evaluated on the test set of the domain-specific dataset, using metrics such as precision, recall, and F1 score to measure how well it performs on the target task.
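The fine-tuning step can be pictured as a standard supervised training loop over the annotated documents; the sketch below assumes the BertBiLstmCrf class sketched earlier and a DataLoader of annotated judgment sentences, and its optimizer choice and learning rate are illustrative only (the actual training parameters are those listed in Table 1).

```python
# Schematic fine-tuning loop for the hybrid model. Assumes the model's
# forward returns the CRF negative log-likelihood when tags are given.
# Optimizer and learning rate are illustrative placeholders.
import torch

def fine_tune(model, train_loader, epochs: int = 3, lr: float = 2e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for epoch in range(epochs):
        for batch in train_loader:
            # Each batch supplies token ids, a padding mask, and gold tags.
            loss = model(batch["input_ids"], batch["attention_mask"],
                         tags=batch["tags"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch}: loss {loss.item():.4f}")
```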

In this experiment, 300 manually annotated Chinese criminal judgment documents were used to fine-tune the pre-trained hybrid model. The hybrid model was trained on an NVIDIA GeForce RTX 4070 graphics card and the training time was recorded. The parameters used for training are listed in Table 1. During training, the loss was recorded every 25 steps. Furthermore, commonly used professional terms in these 300 criminal judgment documents were statistically analyzed and added to the custom dictionary of the segmentation system. With reference to laws and regulations related to the judicial system and litigation procedures, such as the Criminal Procedure Law of the People’s Republic of China, the Organic Law of the People’s Procuratorates of the People’s Republic of China, and the Organic Law of the People’s Courts of the People’s Republic of China, as well as relevant judicial interpretations, supplementary provisions, etc., relevant provisions involved in criminal judgments were converted into segmentation rules. During segmentation, the application priority of the three components was: custom dictionary, custom segmentation rules, then hybrid model. The experiment used 100 Chinese criminal judgment documents as test data, and calculated the precision, recall, and F1 scores of the segmentation results produced by common Chinese segmentation models such as jieba and THULAC, as well as by the segmentation system proposed in this paper.

Table 1. Training parameters.
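The priority order among the three components (custom dictionary first, then rules, then the hybrid model) can be pictured as follows; every helper passed in here is an assumed interface introduced for illustration, not the system’s actual code.

```python
# Sketch of the three-stage priority pipeline: spans covered by the
# custom dictionary are fixed first, then spans matched by the rule
# subsystem, and only the remaining text goes to the hybrid model.
# dictionary_match/rule_match yield (start, end, words); model_segment
# returns a word list. All three are assumed interfaces.
def segment(text, dictionary_match, rule_match, model_segment):
    pinned = {}  # start index -> (end index, forced word list)
    for start, end, words in dictionary_match(text):          # priority 1
        pinned[start] = (end, words)
    for start, end, words in rule_match(text):                # priority 2
        if not any(s < end and start < e
                   for s, (e, _) in pinned.items()):          # no overlap
            pinned[start] = (end, words)
    result, i = [], 0
    while i < len(text):
        if i in pinned:
            end, words = pinned[i]
            result.extend(words)                              # forced span
            i = end
        else:
            j = i                                             # free span up
            while j < len(text) and j not in pinned:          # to next pin
                j += 1
            result.extend(model_segment(text[i:j]))           # priority 3
            i = j
    return result
```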

Results and Discussion

The loss variation during the training of the proposed model is shown in Figure 2. The training time was 4,338 seconds. The precision, recall, and F1 score of the test results for the different models are listed in Table 2. Using the proposed segmentation model, professional terms in the original text were correctly identified and retained, whereas the other models usually split long professional terms into multiple words. Therefore, segmentation precision improved greatly with the system proposed in this paper. Meanwhile, as overall recognition improved, the recall rate also increased accordingly, leading to a significantly higher overall F1 score.

Figure 2. Loss variation during training.


Table 2. Experimental results.
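For reference, the word-level precision, recall, and F1 reported here follow the standard span-matching computation for CWS evaluation, sketched below; the example segmentation is the procuratorate case from the introduction.

```python
# Sketch of word-level P/R/F1 for CWS evaluation: both segmentations
# are converted to character spans; spans present in both count as
# correct.
def to_spans(words):
    spans, i = set(), 0
    for w in words:
        spans.add((i, i + len(w)))
        i += len(w)
    return spans

def prf(gold_words, pred_words):
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = ["相城区人民检察院", "指控"]
pred = ["相", "城区", "人民检察院", "指控"]
print(prf(gold, pred))  # only "指控" aligns: (0.25, 0.5, 0.333...)
```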

In general, the performance of a pre-trained and fine-tuned NER model depends on several factors, such as the quality and quantity of the data used in both steps, the similarity between the pre-training and fine-tuning objectives, the choice of pre-trained language model, and the hyperparameters of the training process. If the data used for fine-tuning is insufficient, that is, if the entity mentions and contexts have low coverage, diversity, or regularity, the NER model may be affected in different ways. Low coverage may impair the model’s ability to generalize to unseen mentions or rare types, especially if the pre-trained language model does not represent them well. Low diversity may hinder the model from handling complex or ambiguous cases, such as nested entities, cross-sentence entities, or entities with multiple types. Low regularity may cause the model to rely too heavily on the pre-trained language model’s tokenization or generation, which may not be optimal for the NER task.

Specifically, in this experiment, due to some limitations, the number of Chinese criminal judgment documents used for fine-tuning was still insufficient, and the types of crimes covered in the judgment documents were relatively limited. However, even in this case, the proposed method demonstrates its effectiveness and validity. First, the dataset used for fine-tuning covers many high-frequency and common crimes, and these cases are relatively representative. This ensures that the hybrid model can learn the most important and frequent patterns and features of the NER task, and achieve relatively high precision and recall on common entities. Second, the dataset reflects some characteristics and difficulties of the NER task, such as entity diversity, entity nesting, and entity ambiguity. These stimulate the hybrid model’s learning ability and improve its generalization and adaptation to rare and complex entities. Third, the dataset has a degree of similarity and complementarity with the data used for pre-training, such as language similarity, domain relevance, and style difference. These enhance the hybrid model’s knowledge transfer ability, and improve its efficiency and flexibility on the NER task. Last but not least, the custom dictionary and the rule-based segmentation subsystem provide useful guidance and assistance to the word segmentation system, such as term recognition, term annotation, and phrase segmentation. These improve the word segmentation system’s accuracy and robustness on Chinese criminal judgment documents, and thus improve the overall performance.

Another limitation of the proposed model is that it takes more time to complete the word segmentation task for the input texts compared to models with simpler structures. However, for criminal judgment documents, the accuracy of word segmentation is more important than the efficiency. For a court, the number of judgment documents produced daily is limited, so the word segmentation system can process the accumulated criminal judgment documents of a single court incrementally. If the system is applied in practice, it can be deployed to different courts to process the judgment documents in parallel, so that the time cost will be reduced significantly. Moreover, the word segmentation system is not the only component of the natural language processing pipeline for criminal judgment documents, and the time spent on other tasks such as information extraction, summarization, or analysis may be even more considerable. The quality of the word segmentation results also affects the performance of those subsequent tasks, and thus the overall value of the NLP system for criminal judgment documents. Therefore, it is worth investing more time in word segmentation to ensure high-quality outputs.

Conclusion

This paper proposes an automatic system for segmenting Chinese criminal judgment documents. The system mainly consists of three components: a hybrid model, a custom dictionary, and a rule-based subsystem. The hybrid model is composed of BERT, BiLSTM and CRF; after fine-tuning using Chinese criminal judgment documents as training data, it performs NER. The custom dictionary includes common professional terms in Chinese criminal judgment documents, which enhances the system’s ability to recognize professional terms. The rule-based subsystem presets segmentation rules based on relevant provisions of the judicial system and litigation procedures, as well as common fixed phrases and expressions in Chinese criminal judgment documents, to avoid segmentation errors that may occur when segmenting judgment documents. Experimental results show that compared with commonly used CWS models, the proposed method improves the F1 score by 6.8% to 9.2%. The improved word segmentation accuracy for Chinese criminal judgment documents can facilitate subsequent semantic analysis and understanding, and prevent the quality issues in downstream legal text processing and application that arise from semantic deviation and loss. It lays a foundation for further processing and application of legal texts, such as legal text classification, summarization, retrieval, comparison, and reasoning, and enhances the ability and level of artificial intelligence applied to the legal professional field. The method can also be applied to other domains, such as word segmentation for medical or financial texts, thereby supporting tasks such as information extraction and sentiment analysis.

In terms of future work, several aspects are under consideration. First, the accuracy of automatic word segmentation can be enhanced by improving the quantity and quality of the training data, and by further refining the custom dictionary and rule-based subsystem. Second, structural improvements to the model are being explored: one option is to incorporate BART (Bidirectional and Auto-Regressive Transformers) for named entity recognition and to test its impact on the overall system’s segmentation effectiveness. Additionally, building upon the automatic word segmentation system for Chinese criminal judgment documents, research will be conducted into subsequent NLP tasks such as key information extraction.

Acknowledgements

This research is financially supported by the scientific research foundation from Hubei University of Technology.

Disclosure Statement

No potential conflict of interest was reported by the author(s).

Data Availability Statement

The data that support the findings of this study are openly available in ModelScope at https://www.modelscope.cn/datasets/damo/msra_ner/summary, reference number (Levow 2006).

Additional information

Funding

The work was supported by the Scientific research foundation from Hubei University of Technology.

References

  • Barua, P. D., E. Aydemir, S. Dogan, M. Erten, F. Kaysi, T. Tuncer, H. Fujita, E. E. Palmer, and U. R. Acharya. 2022. Novel favipiravir pattern-based learning model for automated detection of specific language impairment disorder using vowels. Neural Computing and Applications 35 (8):6065–17. doi:10.1007/s00521-022-07999-4.
  • Dai, Z., X. Wang, P. Ni, Y. Li, G. Li, and X. Bai. 2019. Named entity recognition using BERT BiLSTM CRF for Chinese Electronic Health Records. 12th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics, Suzhou, China, 1–5.
  • Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, United States 1:4171–86.
  • Diao, S., J. Bai, Y. Song, T. Zhang, and Y. Wang. 2020. ZEN: Pre-training Chinese text encoder enhanced by N-Gram representations. Findings of the Association for Computational Linguistics: EMNLP 2020:4729–40.
  • Hochreiter, S., and J. Schmidhuber. 1997. Long short-term memory. Neural Computation 9 (8):1735–80. doi:10.1162/neco.1997.9.8.1735.
  • Hu, X., H. Zhang, and S. Hu. 2022. Chinese named entity recognition based on BERTbased-BiLSTM-CRF model. IEEE/ACIS 22nd International Conference on Computer and Information Science (ICIS), Zhuhai, China, 100–04.
  • Jiang, W., H. Mi, and Q. Liu. 2008. Word lattice reranking for Chinese word segmentation and part-of-speech tagging. Proceedings of the 22nd International Conference on Computational Linguistics, Manchester, United Kingdom, 385–92.
  • Khairy, M., T. M. Mahmoud, and T. Abd El‐Hafeez. 2021. Automatic detection of cyberbullying and abusive language in Arabic Content on Social Networks: A survey. Procedia Computer Science 189:156–66. doi:10.1016/j.procs.2021.05.080.
  • Khairy, M., T. M. Mahmoud, A. Omar, and T. A. El‐Hafeez. 2023. Comparative performance of ensemble machine learning for Arabic cyberbullying and offensive language detection. Language Resources and Evaluation 2023. doi:10.1007/s10579-023-09683-y.
  • Kirik, S., S. Dogan, M. Baygin, P. D. Barua, C. F. Demir, T. Keles, A. M. Yildiz, N. Baygin, I. Tuncer, T. Tuncer, et al. 2023. FGPat18: Feynman graph pattern-based language detection model using EEG signals. Biomedical Signal Processing and Control 85:104927. doi:10.1016/j.bspc.2023.104927.
  • Kong, L., C. Dyer, and N. A. Smith. 2016. Segmental recurrent neural networks. International Conference on Learning Representations, San Juan, Puerto Rico.
  • Lafferty, J., A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. International Conference on Machine Learning, Williamstown, United States, 282–89.
  • Levow, G. 2006. The third International Chinese Language Processing BakeOFF: Word segmentation and named entity recognition. Meeting of the Association for Computational Linguistics, Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, Sydney, Australia, 108–17.
  • Liu, J., F. Wu, C. Wu, Y. Huang, and X. Xie. 2019. Neural Chinese word segmentation with dictionary. Neurocomputing 338:46–54. doi:10.1016/j.neucom.2019.01.085.
  • Liu, Y., Y. Zhang, W. Che, T. Liu, and F. Wu. 2014. Domain adaptation for CRF-Based Chinese word segmentation using free annotations. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 864–74.
  • Li, X., C. Zong, and K.-Y. Su. 2015. A unified model for solving the OOV problem of Chinese word segmentation. ACM Transactions on Asian and Low-Resource Language Information Processing 14 (3):1–29. doi:10.1145/2699940.
  • Ma, J., K. Ganchev, and D. Weiss. 2018. State-of-the-art Chinese word segmentation with bi-LSTMs. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 4902–08.
  • Mor, B., S. Garhwal, and A. Loura. 2020. A systematic review of hidden Markov models and their applications. Archives of Computational Methods in Engineering 28 (3):1429–48. doi:10.1007/s11831-020-09422-4.
  • Omar, A., and T. A. El‐Hafeez. 2023. Quantum computing and machine learning for Arabic language sentiment classification in social media. Scientific Reports 13 (1):17305. doi:10.1038/s41598-023-44113-7.
  • Omar, A., T. M. Mahmoud, and T. A. El‐Hafeez. 2020. Comparative performance of machine learning and deep learning algorithms for Arabic hate speech detection in OSNs. Advances in Intelligent Systems & Computing 1153:247–57.
  • Omar, A., T. M. Mahmoud, T. A. El‐Hafeez, and A. Mahfouz. 2021. Multi-label Arabic text classification in online social networks. Information Systems 100:101785. doi:10.1016/j.is.2021.101785.
  • Schuster, M., and K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45 (11):2673–81. doi:10.1109/78.650093.
  • Tang, J., Q. Wu, and Y. Li. 2015. An optimization algorithm of Chinese word segmentation based on dictionary. International Conference on Network and Information Systems for Computers, Wuhan, China, 259–62.
  • Tuncer, T., S. Dogan, E. Akbal, A. Cicekli, and U. R. Acharya. 2022. Development of accurate automated language identification model using polymer pattern and tent maximum absolute pooling techniques. Neural Computing and Applications 34 (6):4875–88. doi:10.1007/s00521-021-06678-0.
  • Wang, S., Y. Meng, R. Ouyang, J. Li, T. Zhang, L. Lyu, and G. Wang. 2023. GNN-SL: Sequence labeling based on nearest examples via GNN. Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, 12679–92.
  • Wieczorek, W. 2016. An algorithm based on a directed acyclic word graph. Studies in Computational Intelligence 673:77–81.
  • Xiong, H., G. Wu, S. Xue, H. Li, and T. Zhu. 2021. Dictionary-based classical Chinese word segmentation and its application on imperial edicts of Jin Dynasties. Human Centered Computing, Virtual Event, 153–60.
  • Xu, J., H. He, X. Sun, X. Ren, and S. Li. 2018. Cross-domain and semisupervised named entity recognition in Chinese social Media: A unified model. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (11):2142–52. doi:10.1109/TASLP.2018.2856625.
  • Yang, R., Y. Gan, and C. Zhang. 2022. Chinese named entity recognition based on BERT and lightweight feature extraction model. Information 13 (11):515. doi:10.3390/info13110515.
  • Zhang, H., C. Huang, M. Li, and B.-L. Lu. 2010. A unified character-based tagging framework for Chinese word segmentation. ACM Transactions on Asian Language Information Processing 9 (2):1–32. doi:10.1145/1781134.1781135.
  • Zhang, Y., and J. Yang. 2018. Chinese NER using lattice LSTM. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia 1:1554–64.