Research Article

Hierarchical multi-instance multi-label learning for Chinese patent text classification

Article: 2295818 | Received 09 Apr 2023, Accepted 12 Dec 2023, Published online: 03 Jan 2024

Abstract

To further enhance the accuracy of Chinese patent classification, this paper proposes a model that is based on the patent structure, takes the patent claims as its subject, and uses multi-instance multi-label learning as the main method. Firstly, the patent claims are divided into multiple independent texts using the sequence number as the splitting token. For each patent, the multiple claims are regarded as multiple instances, and the corresponding IPCs serve as its multiple labels. Next, the concept of secondary_label is introduced following the composition rules of IPC, and the relationships between instances and multiple secondary_labels are mined through the construction of fully-connected layers. To capture more comprehensive semantic information of instances, Bi-GRU and self-attention are employed to enhance semantics and reduce information loss during the training process. Finally, max-pooling operations are utilised to obtain the predicted categories of patents by capturing the relationships between instances and labels at different hierarchical levels. Experimental results on the '2017 Chinese patent dataset' demonstrate that the multi-instance multi-label approach can effectively mine deeper relationships between patents and labels in classification tasks. As a result, our model significantly improves the accuracy of patent text classification.

1. Introduction

As a basic technology in the patent management process, patent text classification plays an important role in the organisation, analysis, retrieval, and other tasks of patent documents. With the development of technology, there has been a continuous increase in the number of patent applications, posing a significant challenge to the patent management process.

Current research on patent classification aims to reduce reliance on manual methods and enhance the efficiency of patent document management. Some progress has been made using abstracts and titles as the basis. Previous researchers proposed an approach that combined transformers with hierarchical algorithms (Huang et al., 2019) and a model that constructed a semantic network (Sarica et al., 2020) to solve the patent text classification task. However, the specific writing structure of patents and the abundance of technical information contained in patent claims provide the potential for improving classification accuracy. For example, in the patent “关系抽取模型的建立方法以及关系抽取方法 (method for establishing relation extraction model and method for relation extraction)”, a key sentence included in the second claim, “基于知识图谱的实体概念结构构建 … (the construction of entity concept structure based on knowledge graph …)”, is ignored when only the title and abstract are used for classification, which may cause incomplete prediction results. Besides that, the structural characteristic of IPC is often overlooked in related studies, which may lead to low classification accuracy.

Based on the considerations above, a model is needed that both exploits the technical information in patent claims and leverages the specific structure of IPC. To achieve this, the following two issues should be taken into account. (1) How to make the most of the technical information hidden in patent claims and improve the representation ability of input claims for better training effectiveness. (2) In view of the structural characteristic of IPC, how to construct deeper connections between patent claims and IPCs, and further improve the classification accuracy.

Given the two considerations above, a hierarchical multi-instance multi-label learning method for Chinese patent text classification, abbreviated as HMM-CPT, is proposed. Unlike previous work, our study focuses on the patent claims and the IPC structure. Our research group has carried out various studies on text analysis tasks, which provide strong support for the work of this paper (Zhang et al., 2023; Zhang et al., 2022).

The overall framework of the HMM-CPT method is shown in Figure 1. Our contributions are summarised as follows.

  • Construct a relationship mining model with the aim of establishing associations between claims and IPCs, thereby obtaining a more comprehensive connection between instances and labels.

  • Propose a hierarchical multi-instance multi-label learning framework, utilising the hierarchical structure of IPC. By leveraging the information association at the current level, more accurate predictions can be obtained for the prior level categories.

Figure 1. The framework of the HMM-CPT method.


In this article, the framework of hierarchical multi-instance multi-label learning for Chinese patent classification is composed of three parts: data preprocessing, instance-label connection construction, and bag-label relationship obtaining. The novelties and advantages of our model are demonstrated as follows:

  1. The training data is prepared in the following way: each patent is treated as an independent 'bag', and the patent claims in each 'bag' are divided into multiple texts, which enables the information in patent claims to be comprehensively mined.

  2. Multiple fully-connected layers are constructed to establish the connections between each claim and secondary_labels. This kind of mapping connection makes relationship mining more comprehensive.

  3. We add a max-pooling layer composed of two max-pooling operations in different directions, which results in a more accurate prediction at the prior level.

Overall, the method we proposed not only enriches the research content of patent text classification but also establishes a deeper connection between patent texts and labels, which helps enhance the effectiveness of patent text classification.

The remainder of this paper starts with a brief review of related literature on patent text classification tasks. Then we present the process of experiment data preparation in section 3, where the model construction process, training process, and other details of the HMM-CPT are demonstrated completely. In the fourth section, several groups of experiments are conducted to verify the effectiveness of our proposed method. A general review of this paper and the prospect of sustainable research work in the future are summarised in the last section.

2. Related work

The existing patent classification methods can be roughly divided into two categories: methods based on machine learning and methods based on deep learning. Current research focuses on improving and optimising these two kinds of methods.

2.1. The methods based on machine learning

The methods based on machine learning use prior knowledge to obtain a model through extensive learning and then use the obtained model to predict the data with unknown results (Choi et al., 2019). Many classical models and algorithms have been applied to text classification tasks, such as k-NN (Wang et al., 2019), naïve Bayes (NB) (Chen et al., 2019), support vector machine (SVM) (Goudjil et al., 2018), the bag-of-words model (cBOW) (Soumya George & Joseph, 2014), etc. Based on the naïve Bayes method, Xiao et al. (2018) established a text classification model, which realised the effective classification of patent texts in the security field. For the structural characteristics of patent text, Bao et al. (2018) put forward a framework based on multi-instance learning, in which the patent text is taken as a bag, and the abstract and title are taken as instances. In another work, the semantic information of labels is used to define the specific representation between text and labels, and the self-attention mechanism is introduced to enhance the text representation ability, which verifies the advantage of the proposed multi-label classifier (Xiao et al., 2019).

To address the problem that some samples belonging to a single category were ignored in the text feature acquisition process, Hu et al. (2020) improved the information gain function and introduced a weight coefficient to adjust the information gain value of important features, thus improving the accuracy of patent classification. Another open problem is that automatic classification methods could not identify unknown words. To address this issue, Xiao et al. (2021) fused sentence vectors with proper noun vectors and constructed a patent classification model.

2.2. The methods based on deep learning

Deep learning is an overall framework that combines mathematical knowledge with computer algorithms and acquires the preset results through large-scale data training and calculation. Deep learning is characterised by strong learning ability, adaptability, portability, and good performance in large-scale data processing. Thus, it is commonly introduced into massive text-processing tasks. Typical deep learning algorithms include Convolutional Neural Networks (CNN) (Jiang et al., 2022; Wang et al., 2019), Recurrent Neural Networks (RNN) (Mansueli et al., 2022), Generative Adversarial Networks (GANs) (Croce et al., 2020), Deep Reinforcement Learning (DRL) (Keneshloo et al., 2019), LSTM (Huan et al., 2022; Su, 2018; Zhang et al., 2023), etc. Lee et al. (2022) proposed a multi-model deep learning framework that uses quantitative information and texts of patents to improve the patent classification effectiveness. The employment of LSTM in patent classification tasks is also very common. It can be used to construct a corpus training model to predict the patent categories (Ma et al., 2018). Other scholars verified that CNN performs well on automatic patent classification in multiple fields (Wang et al., 2022; Hu et al., 2018). In response to the demand for massive patent classification, Lyu et al. (2020) designed seven kinds of automatic methods based on deep learning, combining the word sequence features, context features, and key classification features of patent texts, and further validated the effectiveness of the deep learning method in CPT classification tasks.

In addition to the scholars mentioned above, other researchers also worked to improve the effectiveness of patent classification. Yu et al. (2020) put forward a model that fused two-channel features generated by processed text mapping, to improve the accuracy of the automatic classification for CPT. In the case of patent text on a large scale, difficulties arise due to similar content in CPT. For that problem, Wu et al. (2020) introduced an attention mechanism into the bidirectional long short-term memory network and further distinguished similar texts by weighting them differently. Several optimised BERT models were used to classify multi-label patents automatically, with the first four digits of IPC numbers selected as classification labels (Xinyu et al., 2022; Zhang et al., 2022). In order to achieve better classification results, Wen et al. (2021) proposed a multi-level model that combined ALBERT with bidirectional gating units, which can enhance the semantic representation of long-distance keywords to some extent. Similar work was carried out by Zhang et al. (2021), who attached great importance to the semantic meaning of the technical text in the patent.

2.3. Research on multi-instance multi-label learning for text classification

The concept of MIML (Multi-Instance Multi-Label learning) is proposed to address the multi-classification problem in texts or images (Li et al., 2017). Its basic idea is described as follows. The data in the dataset are organised as a number of bags, and each bag contains several instances; a bag can possess several labels, and the numbers of instances and labels are not necessarily equal (Xiang et al., 2018). MIML can construct a learning model from the labelled bags, which have been classified correctly, to predict the unclassified bags. Through the analysis of the above concepts and the structure of patent texts, Feng and Zhou (2017) put forward the deep MIML network, which can conduct the generation and follow-up learning of instances. Further, by establishing the relation between instances and the corresponding labels, the deep MIML network has shown good performance on text classification tasks. In another idea, the multi-label problem in patent texts is transformed into a label sorting problem, and the Fast MIML model is constructed to mine the labels in the unpredicted texts (Huang et al., 2018). In order to conduct a more systematic analysis of the application of the MIML method in patent text classification, Bao et al. (2021) applied seven methods based on the MIML structure to the classification task for CPT, and validated the effectiveness of MIML through many experiments.

The above research indicates the suitability of the MIML method for the CPT classification task. However, most of those methods took abstracts and titles as research objects, and lacked mining and analysis of the abundant technical information in patent claims. Besides, most of these studies have not considered the hierarchical structure characteristics of IPC, which leads to inaccurate predictions of patent labels and thus low classification accuracy. To address this, a hierarchical multi-instance multi-label learning method for Chinese patent texts, abbreviated to HMM-CPT, is put forward in this paper, in which the Bi-GRU network (Lu et al., 2020) and the self-attention mechanism (Sivakumar & Rajalakshmi, 2021) are employed to improve the effectiveness of the classification. The data composition and methodology are elaborated on in the following sections.

3. Preparatory work

The related work reviewed above shows that it is feasible to apply MIML methods to the patent text classification problem and that they achieve good results. Further, the composition of the patent text and the formal structure of claims make it feasible to use the MIML network to mine deeper content.

3.1. Dataset

In this part, the format and processing of the patent claim, as well as the IPC structure, will be described in detail.

C1. The patent claim

The patent claim covers all innovative features and contents described in the specification, including independent and subordinate claims, numbered with Arabic numerals. An example is given in Table 1. Claim 2 is a subordinate claim of Claim 1, and these two claims are divided by sequence numbers. All IPCs belonging to this patent are listed in the third column. This kind of structure provides the potential for the application of the multi-instance multi-label approach.

Table 1. The claims and IPCs of a patent.

C2. The preprocessing of the CPT

The patent claim text is usually longer than the abstract, and both follow a similar writing format, such as '根据权利要求1所述的 …  … (according to the first claim that …),其特征在于 …  … (the characteristic is …), 一种 …  … 装置 (a device that …) , 特征在于 …  … (and the characteristic is …)'. These words or phrases, which are used to organise the text structure, do not provide much useful information in the training process. Therefore, the dataset needs to be preprocessed, and the steps are as follows:

  1. Stop word removal. Summarise the writing formats of the claims to construct specific stop word lists for the dataset, and remove meaningless words and phrases that have structural functions but no useful information.

  2. Text division. Take the single patent as a unit, and divide the claims in the dataset with the sequence number as the splitting token.

  3. Word segmentation. The Jieba tool is utilised to segment the patent claims in the CPT.
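
As a rough illustration of step 1, the sketch below strips structural phrases from a single claim with regular expressions. The pattern list is a hypothetical example assembled from the phrases quoted above, not the stop word list used in the experiments, and the function name `remove_stop_phrases` is illustrative only.

```python
import re

# Hypothetical stop patterns assembled from the structural phrases quoted above;
# the stop word list actually used in the experiments is not given in the paper.
STOP_PATTERNS = [
    r"根据权利要求\d+所述的",  # "according to claim N"
    r"其特征在于",  # "characterised in that"
    r"特征在于",  # shorter variant, tried after the longer one
]


def remove_stop_phrases(claim: str) -> str:
    """Delete structural phrases that carry no technical information."""
    for pattern in STOP_PATTERNS:
        claim = re.sub(pattern, "", claim)
    return claim.strip()


if __name__ == "__main__":
    print(remove_stop_phrases("根据权利要求1所述的装置，其特征在于还包括活塞。"))
```

Word segmentation (step 3) would then be applied to each cleaned claim; it is left out here so that the sketch depends only on the standard library.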

C3. The IPC

The IPC structure is a hierarchical structure, composed of five levels: section, class, subclass, main group, and subgroup. According to statistical data from the website 'incoPAT', the IPC system possesses 8 sections, 131 classes, and over 630 subclasses. An example is given in Table 2. In the practical case study, the first three layers are our main focus.

Table 2. An example of IPC hierarchical structure.

Definition 1. (secondary_label)

In our proposed method, the concept of the secondary_label is defined as follows:

\[ \text{for IPC } Q:\quad \begin{cases} Q_{\text{section-level category}}: & \text{label} \\ Q_{\text{class-level category}}: & \text{secondary\_label of } Q \end{cases} \tag{1} \]

Formula (1) is explained by taking the data in Table 2 as an example. If the section level is the final classification result we want, the class level will be treated as the research object. That is, F is used as the label, and F02 is used as a secondary_label of F.
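
The definition can be made concrete with a small sketch that derives a label and its secondary_label from an IPC symbol. The helper names and the example symbol 'F02B 37/00' are illustrative assumptions; the class-level case follows the later statement that subclass-level digits serve as the secondary_label when the class level is the target.

```python
import re


def ipc_levels(ipc: str):
    """Split an IPC symbol such as 'F02B 37/00' into section, class and subclass."""
    match = re.match(r"^([A-H])(\d{2})([A-Z])", ipc.strip())
    if match is None:
        raise ValueError(f"not an IPC symbol: {ipc!r}")
    section = match.group(1)
    ipc_class = section + match.group(2)
    subclass = ipc_class + match.group(3)
    return section, ipc_class, subclass


def label_and_secondary(ipc: str, level: str = "section"):
    """Return (label, secondary_label) for the requested target level."""
    section, ipc_class, subclass = ipc_levels(ipc)
    if level == "section":
        return section, ipc_class  # e.g. label F, secondary_label F02
    if level == "class":
        return ipc_class, subclass  # e.g. label F02, secondary_label F02B
    raise ValueError(f"unsupported level: {level!r}")


if __name__ == "__main__":
    print(label_and_secondary("F02B 37/00"))
    print(label_and_secondary("F02B 37/00", level="class"))
```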

3.2. The multi-instance multi-label structure of patent texts

The idea of the multi-instance multi-label deep learning network can be described as follows: the dataset is regarded as a set of bags, and each bag is composed of multiple instances and possesses multiple labels. In this paper, as shown in Figure 2, one patent is regarded as a bag, each claim of this patent is regarded as an instance, and all IPCs belonging to this patent are used as multi-labels. The section level and class level of the IPC numbers are taken as labels in this paper, as shown in Table 3.

Figure 2. The MIML structure for patent text.


Table 3. Example of text division and labels of Chinese patent.

4. Methodology

In order to mine the technical information in patent claims and thereby improve the classification accuracy of CPT, MIML learning was introduced into the patent text classification task to predict the categories for unclassified patents by training a model based on the information of classified patents.

According to the idea of MIML learning, the mathematical expression in this paper is as follows: the CPT dataset is $D=\{(X_1,L_1),(X_2,L_2),\ldots,(X_p,L_p)\}$, where $X$ and $L$ represent the instance set and label set of each bag, respectively. $W_i=\{x_1,x_2,\ldots,x_n\}$ and $L_i=\{l_1,l_2,\ldots,l_m\}$, with $n,m\geq 1$, represent the instance set and label set of $X_i$. $sl=\{sl_1,sl_2,\ldots,sl_t\}$ represents the secondary_label set of $X_i$.
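
One way to picture this notation is as a small data structure, sketched below for the section-level task. The class name `PatentBag`, the helper `make_bag`, and the example IPC symbols are hypothetical and serve only to show how instances, labels, and secondary_labels sit side by side in one bag.

```python
from dataclasses import dataclass


@dataclass
class PatentBag:
    """One bag: the claims of a patent plus its labels (section-level task)."""

    instances: list  # claim texts x_1 .. x_n
    labels: frozenset  # section-level labels l_1 .. l_m
    secondary_labels: frozenset  # class-level secondary_labels sl_1 .. sl_t


def make_bag(claims, ipcs):
    """Build a bag from claim texts and IPC symbols such as 'F02B 37/00'."""
    labels = frozenset(ipc[0] for ipc in ipcs)  # first character: section
    secondary = frozenset(ipc[:3] for ipc in ipcs)  # first three characters: class
    return PatentBag(list(claims), labels, secondary)


if __name__ == "__main__":
    bag = make_bag(
        ["claim 1 text", "claim 2 text", "claim 3 text"],
        ["F02B 37/00", "F01N 3/20", "B60K 6/24"],
    )
    print(len(bag.instances), sorted(bag.labels), sorted(bag.secondary_labels))
```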

4.1. Hierarchical multi-instance multi-label learning method for CPT classification

Based on the idea of MIML learning, the HMM-CPT (Hierarchical Multi-instance Multi-label learning for Chinese Patent Text classification) is proposed. The overall process structure is shown in Figure 3, which consists of the input layer, modelling layer, and pooling layer. The function and mechanism of each layer are elaborated on in detail successively.

Figure 3. Hierarchical multi-instance multi-label learning method for CPT classification.


The process of this model is described as follows: Firstly, the BERT model is employed to obtain the text feature for input patent claims. Then, label features and text features are fed into the multiple fully-connected layers to establish connections between claims and secondary_labels. Further, the self-attention mechanism and Bi-GRU are introduced to enhance the text feature representation and the reservation of context semantics. Finally, two max-pooling operations with different aims are conducted to obtain the relation between bags and labels. The whole algorithm process can be shown by Algorithm 1.

The algorithm of the HMM-CPT structure is shown as follows:

Algorithm 1 describes the overall process of the HMM-CPT. Steps 1–4 preprocess the patent text dataset, and the preprocessed data are fed into the next step; steps 5–9 prepare the text data for subsequent input; steps 10–17 mine the instance-label relation; the prediction results are worked out in steps 17–21. For ease of understanding, in this algorithm, n represents the number of instances, and t represents the number of secondary_labels. Two loop statements are included in this algorithm, and thus the time complexity of this algorithm is O(nt).

4.2. Input layer

For a given Chinese patent text in our dataset, $X=\{x_1,x_2,\ldots,x_i,\ldots,x_n\}$, $1\leq i\leq n$, where $x_i$ represents the i-th claim, and $L=\{l_1,l_2,\ldots,l_j,\ldots,l_m\}$, $1\leq j\leq m$, where $l_j$ represents the j-th label. The input layer aims to preprocess the input text to generate the instances for the following training. Before obtaining the text representation, the patent claims of each patent are divided into multiple independent texts, as shown in Algorithm 2.

The algorithm of the text division is shown as follows:

Algorithm 2 describes the text division process in the front part of the HMM-CPT. The patent claim texts are read in steps 1–2; steps 3–11 describe the division operation on claim texts, in which the specific symbol '//', obtained by replacing the sequence numbers, is used as the splitting token; in steps 12–13, the divided claim texts are output for the subsequent process. One loop statement and one judgment statement are included in this algorithm, and thus the time complexity of the algorithm is O(n), where n represents the number of claims.
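
The division step can be sketched as follows. This is a minimal regular-expression version rather than a transcription of Algorithm 2, and it assumes that each claim begins with an Arabic sequence number followed by '.', '、' or '．', located at the start of the text or right after whitespace or a sentence-ending mark. The function name and the pattern are illustrative assumptions.

```python
import re

SPLIT_TOKEN = "//"  # the splitting token named in the text

# Assumed shape of a sequence number: digits plus '.', '、' or '．', found at the
# start of the text or directly after whitespace or a '。' / ';' / '；' mark.
SEQ_NUMBER = re.compile(r"(?:^|(?<=[\s。；;]))\d+\s*[.、．]\s*")


def divide_claims(claims_text: str):
    """Replace each sequence number with '//' and split the text on that token."""
    marked = SEQ_NUMBER.sub(SPLIT_TOKEN, claims_text)
    return [part.strip() for part in marked.split(SPLIT_TOKEN) if part.strip()]


if __name__ == "__main__":
    for claim in divide_claims("1.一种装置。2.根据权利要求1所述的装置。"):
        print(claim)
```

The claim reference "权利要求1" is left intact because its digit is not preceded by whitespace or a sentence-ending mark. Text that already contains '//' or decimal numbers in those positions would need extra handling.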

The text feature of input claims is obtained by the BERT model. The process is formulated as follows:

\[ \text{text\_feature}(x_i) = \mathrm{BERT}(\text{input\_claim}\ x_i) \tag{2} \]

The mechanism of the BERT model is shown in Figure 4.

Figure 4. The structure of the BERT model.


4.3. Modelling layer

The purpose of the modelling layer is to model the connection between instances (that is, the claim texts) and secondary_labels. The fully-connected layers are employed to establish the mapping between instances and secondary_labels through multiple convolutional operations, and what we get is the instance-label layer shown in Figure 5.

Figure 5. The construction process of the instance-label layer.


A two-dimensional fully-connected layer, which can be used to deal with single-instance multi-label classification, is proposed first, aiming to build the connections between one claim and all secondary_labels. That is, once the representation of claim x is obtained, a two-dimensional matrix of size t × t is constructed, where t represents the number of secondary_labels. The purpose of the fully-connected layer is to construct the mapping from a single claim to all secondary_labels of the corresponding patent.

In such a fully-connected layer, the (i, j)-th node represents the matching score between the claim x and the i-th secondary_label of the j-th label. Here, we choose the softmax function as the activation function. The form of activation is as follows:

\[ c_{i,j} = f(\omega_{i,j} + b_{i,j}) \tag{3} \]

\[ \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}, \quad i = 1, \ldots, N \tag{4} \]

In the above formulas, f(•) is the activation function, and the weight vector $\omega_{i,j}$ is the matching template for the i-th secondary_label of the j-th label. $z$ represents the vector mapping of instance i, and N represents the dimension of the vector.
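
To make the role of the matching scores more tangible, the following sketch computes a small instance-label score matrix in plain Python. It rests on two assumptions that the text does not state explicitly: the pre-activation of node (i, j) is taken to be the inner product of the weight vector with the claim representation plus the bias, as in a standard fully-connected layer, and the softmax is applied across the secondary_labels of each label. All names and numbers are illustrative.

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of floats."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def instance_label_layer(x, weights, biases):
    """Return scores c[i][j] for secondary_label i of label j, given claim vector x.

    weights[i][j] is a vector with the same length as x; biases[i][j] is a float.
    Assumption: softmax is taken over index i (the secondary_labels of label j).
    """
    n_secondary, n_labels = len(weights), len(weights[0])
    pre = [
        [sum(w * v for w, v in zip(weights[i][j], x)) + biases[i][j]
         for j in range(n_labels)]
        for i in range(n_secondary)
    ]
    scores = [[0.0] * n_labels for _ in range(n_secondary)]
    for j in range(n_labels):
        column = softmax([pre[i][j] for i in range(n_secondary)])
        for i in range(n_secondary):
            scores[i][j] = column[i]
    return scores


if __name__ == "__main__":
    x = [1.0, 0.0]
    weights = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [2.0, 0.0]]]
    biases = [[0.0, 0.0], [0.0, 0.0]]
    for row in instance_label_layer(x, weights, biases):
        print([round(v, 3) for v in row])
```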

Further, the Bi-GRU and self-attention mechanism are introduced to enhance the text representation of instances. Finally, two max-pooling operations in different directions are applied. The process can be expressed as:

\[ \operatorname{mapping}_{i=1}^{n}(X_i, L_i) = \operatorname{convo}\left(\{(x_i, sl_j)\}_{j=1}^{t}\right) \tag{5} \]

The self-attention mechanism computing process is shown in Figure 6.

Figure 6. Self-attention mechanism computing process.


According to Figure 6, this computing process can be formulated as follows:

\[ s_{1,i} = \frac{\exp(h_{1,i})}{\sum_{j}\exp(h_{1,j})} \tag{6} \]

4.4. Max-pooling layer

In this layer, the relation between instances and labels is finally obtained by conducting two max-pooling operations on the instance-label layer constructed in the previous step. Concretely, the first pooling operation is a max-pooling on every instance-secondary_label layer, which aims to pick out the highest-scored secondary_label for every label; the second one is a max-pooling on the instance level, by which the sorted matching scores between labels and the corresponding patent are obtained. The whole process is described in Figure 7 and formula (7):

\[ \text{hier\_miml} = \operatorname{maxpooling}^{\text{bag}}_{i=1,\ldots,n}(X_i, L_j)\{M_1\}, \qquad M_1 = \operatorname{maxpooling}^{\text{label}}\ \operatorname{mapping}\left(\{(x_i, sl_j)\}_{j=1}^{t}\right) \tag{7} \]

Figure 7. The max-pooling process.
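
The two pooling steps can be sketched on a nested list of scores. The layout `scores[i][j][k]` (instance i, label j, secondary_label k of label j) and the function name are assumptions made for illustration; the numbers are made up.

```python
def hierarchical_max_pool(scores):
    """Reduce scores[i][j][k] to one score per label for the whole bag.

    Step 1 (label direction): for every instance and label, keep the
    highest-scored secondary_label, giving M1[i][j].
    Step 2 (bag direction): for every label, keep the highest score over
    all instances, giving one bag-level score per label.
    """
    m1 = [[max(secondary) for secondary in instance] for instance in scores]
    n_labels = len(m1[0])
    return [max(row[j] for row in m1) for j in range(n_labels)]


if __name__ == "__main__":
    # Two instances (claims), two labels, two secondary_labels per label.
    scores = [
        [[0.1, 0.7], [0.2, 0.3]],
        [[0.4, 0.5], [0.9, 0.1]],
    ]
    print(hierarchical_max_pool(scores))
```

In this toy case the first label is supported most strongly by the first claim and the second label by the second claim, which is the behaviour the bag-level pooling is meant to capture.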


5. Experiments

In this section, the experimental details are elaborated on, including the experimental data, setting up, results, and analysis.

5.1. Experimental data

A total of 2461 patents in the mechanical engineering field were selected from the first 15,000 patents of the '2017 Chinese patent dataset' as our experimental data. The selection principles of patent texts are as follows:

  1. The number of claims is greater than or equal to 6 (according to the average number of claims calculated from the first 15,000 patents), which is intended to ensure the sufficiency of information and the presence of multi-instance characteristics.

  2. The number of IPCs is greater than or equal to 2, which aims to make sure the patents have multi-label attributes.

Some examples of IPC deconstruction are listed in Table 4.

Table 4. Examples of IPC deconstruction.

Dataset 1: title-abstract-IPC. The patent titles and abstracts are taken as the main experimental objects. The dataset is divided into the training set and testing set at the ratio of 4:1.

Dataset 2: title-abstract-claims-IPC. The patent titles, abstracts, and claims are taken as the main experimental objects. The dataset is divided into the training set and testing set at the ratio of 4:1 as well.
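
The selection principles and the 4:1 division can be sketched as below. The dictionary keys `claims` and `ipcs`, the function names, and the use of a seeded random shuffle are assumptions for illustration; the text does not say how the patents were ordered before the split.

```python
import random


def select_patents(patents, min_claims=6, min_ipcs=2):
    """Keep patents with at least min_claims claims and at least min_ipcs IPCs."""
    return [
        p for p in patents
        if len(p["claims"]) >= min_claims and len(p["ipcs"]) >= min_ipcs
    ]


def split_4_to_1(items, seed=0):
    """Shuffle with a fixed seed and split into training and testing parts (4:1)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = len(items) * 4 // 5
    return items[:cut], items[cut:]


if __name__ == "__main__":
    train, test = split_4_to_1(range(2461))
    print(len(train), len(test))
```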

Some samples in our datasets are listed in Table 5.

Table 5. Samples in the patent dataset.

5.2. Evaluation index

Three indexes are used to evaluate the experimental results, that is, Hamming Loss (HL), Average Precision (AP, denoted Acc in the formulas and results below), and F1-score. The explanations and formulas of the three indexes are as follows:

\[ HL = \frac{1}{M}\sum_{i=1}^{M}\frac{|Y_i \oplus Z_i|}{L} \tag{8} \]

\[ Acc = \frac{TP}{TP+FP} \tag{9} \]

\[ F1\text{-}score = \frac{2 \cdot Acc \cdot Recall}{Acc + Recall}, \quad Acc = \frac{TP}{TP+FP}, \quad Recall = \frac{TP}{TP+FN} \tag{10} \]

HL indicates the misclassification degree of the claim on a certain label in the test set. The smaller the value of HL, the higher the prediction accuracy of a certain label.

AP indicates the accuracy of the patent text classification prediction, that is, the proportion of the correctly predicted samples in the total patents. The larger the AP value, the higher the prediction accuracy.

F1-score indicates the accuracy of the model prediction. The higher the value, the better the performance of the model.

In formulas (8), (9), and (10), M and L represent the total number of patents and labels respectively; $Y_i$ and $Z_i$ represent the actual and predicted label sets of the i-th patent; TP, FP, and FN denote the numbers of true positives, false positives, and false negatives. The specific symbol ⊕ represents the exclusive-or operation (XOR), and f(xi, yj) indicates the matching score between instance xi and label yj.
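
As a small worked example of these indexes, the sketch below represents the actual and predicted labels of each patent as Python sets, so that the XOR in formula (8) becomes a symmetric difference. Counting TP, FP, and FN over all patents together (micro-averaging) is an assumption, since the text does not state how the counts are aggregated; the function names and the toy data are illustrative.

```python
def hamming_loss(true_sets, pred_sets, num_labels):
    """Formula (8): mean size of the symmetric difference, scaled by the label count."""
    mismatches = sum(len(y ^ z) for y, z in zip(true_sets, pred_sets))
    return mismatches / (len(true_sets) * num_labels)


def precision_recall_f1(true_sets, pred_sets):
    """Formulas (9) and (10), with TP, FP and FN pooled over all patents."""
    tp = sum(len(y & z) for y, z in zip(true_sets, pred_sets))
    fp = sum(len(z - y) for y, z in zip(true_sets, pred_sets))
    fn = sum(len(y - z) for y, z in zip(true_sets, pred_sets))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    actual = [{"F", "B"}, {"F"}]
    predicted = [{"F"}, {"F", "G"}]
    print(hamming_loss(actual, predicted, num_labels=8))
    print(precision_recall_f1(actual, predicted))
```

With eight section-level labels, one missed label and one spurious label across two patents give HL = 2 / (2 × 8) = 0.125, while precision, recall, and F1 all equal 2/3.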

5.3. Experimental methods

To verify the effectiveness of our proposed method based on the hierarchical MIML learning oriented to Chinese patent texts, the experiment is carried out in the following ways.

  1. Data processing: taking a single patent as a unit, the patent claims are divided with the sequence number as the splitting token; the stop words are removed; the jieba tool is used for word segmentation; finally, the input text is obtained.

  2. Model training: for each patent text, its title, abstract, and multiple claims serve as multiple instances. The text features are obtained by the BERT model and then fed into the convolutional neural network for training together with the labels. Concretely, the Bi-GRU and self-attention mechanism are used to enhance the representation of instances.

  3. Model testing: taking the patent texts in the test set as the object, several groups of experiments are set to test the classification model constructed after the above two steps.

5.4. Experimental settings

Several models with good performance were selected as the comparison objects, based on the reference of multiple literature sources. The models mentioned in our experiments are listed in Table 6.

Table 6. Models and the corresponding descriptions.

It should be noted that, in our experiments, HMM-CPT is a multi-instance multi-label method. The patent title, abstract, and multiple claims are taken as the multi-instance input of this model. As for the five models from BiLSTM + ATT to WBAT in Table 6, the patent title, abstract, and patent claim are input into these models as a whole.

The models listed in Table 6 are experimentally evaluated on two datasets, respectively. The two datasets are Dataset 1: title-abstract-IPC, and Dataset 2: title-abstract-claims-IPC.

5.5. Experimental results and analysis

The experiments are carried out at two levels of the IPC: the section level and the class level. At the section level, the class-level digits of IPC are taken as the secondary_label, while at the class level, the subclass-level digits of IPC are taken as the secondary_label.

The experimental results are listed in Table 7 and Table 8.

Table 7. The experimental results at the section level.

Table 8. The experimental results at the class level.

Based on the experimental results in Tables 7 and 8, we have made the following analysis.

  1. From a vertical perspective, as shown in Tables 7 and 8, our proposed method HMM-CPT outperforms other methods on Acc and F1-score. We have analysed the reasons and summarised them as follows:

    1. In the experiment, except for HMM-CPT, all other models treat the title, abstract, and patent claim as a single instance. Although the amount of information is the same, the connection established between a single instance and multiple labels is relatively limited. Therefore, the accuracy of label prediction is lower.

    2. The HMM-CPT method treats the title, abstract, and each claim as separate instances, and then establishes the relationship between each instance and each label. Consequently, the instance-label relationships obtained through training are more comprehensive, leading to better performance in the label prediction step.

    3. The structure of both the text and labels can also be utilised to obtain better classification results. By establishing connections between instances and secondary_labels, the knowledge scope associated with the labels is expanded. This allows for more accurate predictions when classifying unclassified patents.

  2. From a horizontal perspective, the experimental accuracy of each model on Dataset 2 (title-abstract-claims-IPC) is higher than that on Dataset 1 (title-abstract-IPC).

    1. Based on our analysis, we believe that a patent abstract generally encompasses the overall content of a patent. However, due to its concise language, some technical details may not be fully reflected. The patent claims contain more information compared to the abstract, and this information helps classify and predict patent texts.

    2. Compared to the patent specification, the advantage of the patent claim lies in its textual structure. The patent claim is composed of multiple independent claims, and the semantics of each claim are relatively complete. This structure provides us with an opportunity to use a multi-instance multi-label learning approach.

In summary, our proposed model takes the patent claim, which contains more useful information, as the main research object. In addition, our proposed method combines multi-instance multi-label learning with the IPC hierarchical structure, which enables a more comprehensive connection established between labels and instances. That is, the HMM-CPT constructs the relationships between claims and multiple secondary_labels, thereby obtaining more comprehensive and accurate results in label-level prediction.

As can be seen from the experimental data, current deep learning models can indeed achieve good classification performance, which provides a significant reference for this research. It is worth noting that the WBAT model (Word2Vec + Bi-GRU + ATT + TextCNN) did not perform well on the HL value, possibly because this method concentrates on text feature mining rather than on the IPC structure. In our experiment, whole IPCs are treated as labels for WBAT. As shown in Table , the Acc value of WBAT is the highest except for our proposed method. Figures represent the results more intuitively.

Figure 8. HL results on Datasets 1 and 2 at the section level (left) and class level (right).

Figure 9. Acc (%) results on Datasets 1 and 2 at the section level (left) and class level (right).

Figure 10. F1-score results on Datasets 1 and 2 at the section level (left) and class level (right).
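
For readers who wish to reproduce the kind of comparison plotted in these figures, the following sketch computes HL (Hamming loss), Acc, and F1 on binary label matrices. The exact definitions of Acc and F1 used in the experiments are not restated in this section, so subset accuracy and micro-averaged F1 are assumed here; the function names and the toy matrices are illustrative only.

```python
import numpy as np


def hamming_loss(y_true, y_pred):
    """Fraction of (sample, label) cells that are predicted incorrectly."""
    return float(np.mean(y_true != y_pred))


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label set is predicted exactly."""
    return float(np.mean(np.all(y_true == y_pred, axis=1)))


def micro_f1(y_true, y_pred):
    """F1 computed from true/false positives pooled over all labels."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example: 3 patents, 4 candidate labels (rows = patents, columns = labels).
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 0, 0, 1]])

print(hamming_loss(y_true, y_pred))     # 3 wrong cells out of 12
print(subset_accuracy(y_true, y_pred))  # only the second patent matches exactly
print(micro_f1(y_true, y_pred))
```

A lower HL is better, whereas higher Acc and F1 are better, which is why a model can rank well on Acc while still ranking poorly on HL.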

5.6. Ablation study

An ablation experiment was designed to verify the rationality of each module that constitutes our proposed model. The results of the ablation experiment are shown in Tables and .

Table 9. Ablation results at the section level.

Table 10. Ablation results at the class level.

From the results of the ablation experiments in Tables and , it can be observed that adding ATT and Bi-GRU improves the accuracy of the Chinese patent text classification method proposed in this paper. Figures and show the same trend. In terms of the Acc and F1 values, the difference between MIML + ATT and MIML + Bi-GRU is not significant, but there is an obvious improvement when both ATT and Bi-GRU are combined in the MIML + ATT + Bi-GRU model. It is worth noting that ATT emphasises the semantic relevance of keywords in sentences, whereas Bi-GRU enhances the overall semantic coherence of sentences. Therefore, integrating the multi-instance multi-label framework with ATT and Bi-GRU for text feature representation is a rational approach for improving the accuracy of Chinese patent text classification.

5.7. Other cases

In the previous set of experiments, the selection criterion 'the number of claims is greater than or equal to 6' was based on the average number of claims observed in the first 15,000 patents of the 2017 Chinese patent dataset. Another question that needs to be considered is whether a patent that contains more claims is classified more accurately.

To address this question, another set of experiments was designed. Patents with varying numbers of claims, ranging from 2 to 15, were selected from the first 15,000 patents in our dataset, and each group consisted of approximately 100 patents. The experimental results are presented in the form of a line graph, as shown in Figure .
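
The grouping step of this experiment can be sketched as follows. This is an illustration under stated assumptions, not the preprocessing code used in the experiments: it assumes that each claim starts on a new line with its sequence number followed by '.', '．' or '、', and the names `split_claims` and `group_by_claim_count`, the fixed random seed, and the sample texts are hypothetical.

```python
import random
import re
from collections import defaultdict

# Assumed claim format: a sequence number at the start of a line, followed by
# '.', '．' or '、'. A line that begins with a decimal such as "0.5" would be
# split incorrectly, so real data may need a stricter pattern.
CLAIM_START = re.compile(r"^\s*\d+\s*[.．、]", re.MULTILINE)


def split_claims(claims_text):
    """Split a claims section into one string per numbered claim."""
    starts = [m.start() for m in CLAIM_START.finditer(claims_text)]
    if not starts:
        stripped = claims_text.strip()
        return [stripped] if stripped else []
    bounds = starts + [len(claims_text)]
    return [claims_text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def group_by_claim_count(patents, min_claims=2, max_claims=15,
                         per_group=100, seed=0):
    """Group patent ids by number of claims and sample up to per_group each.

    patents: iterable of (patent_id, claims_text) pairs.
    """
    groups = defaultdict(list)
    for patent_id, claims_text in patents:
        n = len(split_claims(claims_text))
        if min_claims <= n <= max_claims:
            groups[n].append(patent_id)
    rng = random.Random(seed)
    return {n: rng.sample(ids, min(per_group, len(ids)))
            for n, ids in sorted(groups.items())}


text = ("1. A device comprising a sensor.\n"
        "2. The device of claim 1, further comprising a display.\n"
        "3. The device of claim 2, wherein the display is a touch screen.")
groups = group_by_claim_count([("p1", text), ("p2", "1. Only one claim.")])
print(groups)
```

Each resulting group can then be evaluated separately, giving one accuracy value per claim count.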

Figure 11. The impact of the number of claims on the classification accuracy.

From the two lines in Figure , it can be observed that as the number of claims increases from 2 to 6, the classification accuracy gradually improves. However, beyond 6 claims, the accuracy levels off without significant improvement. After analysing the experimental results and the patent claim text, it can be concluded that there is redundancy in the information contained in patent claims. When more than 8 claims are introduced, the supplementary content increases but the usable information does not increase significantly. This indicates that the inclusion of patent claims does provide additional information for patent classification, but a higher quantity does not necessarily result in better classification performance. Based on the accuracy at the section level and class level, it can be concluded that the optimal number of claims for patent classification is between 6 and 8.

6. Conclusions

To fully mine the information in Chinese patent texts and improve the accuracy of classification, HMM-CPT, a method based on multi-instance multi-label learning, is proposed. Concretely, the method is grounded on a full consideration of patent text structure, and the patent claim is taken as the main research object. The hierarchical structure of the labels is also taken into consideration. Specifically, the IPCs are deconstructed into several parts for different requirements. By deeply mining the technical information hidden in patent claims and constructing the relationship between claims and IPCs, the classification accuracy for Chinese patent texts is improved.

Although our proposed method can improve the classification accuracy for Chinese patent texts to some extent, there are still limitations to consider. The technical information in patent claims can provide supplementary information for the construction of patent text classification models. However, due to the writing style of patent claims, they also contain some repetitive text, which unavoidably leads to information redundancy in the classification process and wastes computational resources. In the future, more consideration can be given to the structure and writing style of the patent claim in order to explore ways to reduce information redundancy in patent claim texts, with the aim of providing underlying theoretical support for Chinese patent text classification.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the National Natural Science Foundation of China (Grant number 62076006), the University Synergy Innovation Program of Anhui Province (Grant number GXXT-2021-008), the Opening Foundation of State Key Laboratory of Cognitive Intelligence (Grant number COGOS-2023HE02), and the University Natural Science Research Project of Anhui Province (Grant number 2023AH050846).

References

  • Bao, X., Liu, G. F., & Cui, J. H. (2021). Application of multi instance multi label learning in Chinese patent automatic classification. Library and Information Service, 65(8), 107–113. https://doi.org/10.16353/j.cnki.1000-7490.2018.11.026
  • Chen, J., Dai, Z., Duan, J., Matzinger, H., & Popescu, I. (2019). Naive Bayes with correlation factor for text classification problem. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA) (pp. 1051–1056). IEEE. https://doi.org/10.1109/ICMLA.2019.00177.
  • Choi, S., Lee, H., Park, E. L., & Choi, S. (2019). Deep patent landscaping model using transformer and graph embedding. arXiv preprint arXiv:1903, 05823. https://doi.org/10.48550/arXiv.1903.05823
  • Croce, D., Castellucci, G., & Basili, R. (2020, July). GAN-BERT: Generative adversarial learning for robust text classification with a bunch of labeled examples. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 2114–2119). https://doi.org/10.18653/v1/2020.acl-main.191.
  • Feng, J., & Zhou, Z. H. (2017, February). Deep MIML network. In Proceedings of the AAAI conference on artificial intelligence (Vol. 31, No. 1). https://doi.org/10.1609/aaai.v31i1.10890.
  • Goudjil, M., Koudil, M., Bedda, M., & Ghoggali, N. (2018). A novel active learning method using SVM for text classification. International Journal of Automation and Computing, 15(3), 290–298. http://doi.org/10.1007/s11633-015-0912-z
  • Hu, Y., Qiu, Q., Yu, X., & Wu, J. (2020). Semi-supervised patent text classification method based on improved tri-training algorithm. J. Zhejiang Univ.(Eng. Sci.), 54, 331–339. https://doi.org/10.3785/j.issn.1008-973X.2020.02.014
  • Huan, H., Guo, Z., Cai, T., & He, Z. (2022). A text classification method based on a convolutional and bidirectional long short-term memory model. Connection Science, 34(1), 2108–2124. https://doi.org/10.1080/09540091.2022.2098926
  • Huang, S. J., Gao, W., & Zhou, Z. H. (2018). Fast multi-instance multi-label learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(11), 2614–2627. https://doi.org/10.1109/TPAMI.2018.2861732
  • Huang, W., Chen, E., Liu, Q., Chen, Y., Huang, Z., Liu, Y., … Wang, S. (2019). Hierarchical multi-label text classification: An attention-based recurrent network approach. In Proceedings of the 28th ACM international conference on information and knowledge management (pp. 1051–1060). https://doi.org/10.1145/3357384.3357885.
  • Jiang, S., Hu, J., Magee, C. L., & Luo, J. (2022). Deep learning for technical document classification. IEEE Transactions on Engineering Management, 71, 1163–1179. https://doi.org/10.1109/TEM.2022.3152216
  • Jie, H., Shaobo, L. I., Liya, Y. U., & Guanci, Y. A. N. G. (2018). A patent classification model based on convolutional neural networks and rand forest. Science Technology and Engineering, 18(6), 268–272. https://doi.org/10.3969/j.issn.1671-1815.2018.06.042
  • Keneshloo, Y., Ramakrishnan, N., & Reddy, C. K. (2019, May). Deep transfer reinforcement learning for text summarization. In Proceedings of the 2019 SIAM International Conference on Data Mining (pp. 675–683). Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9781611975673.76.
  • Lee, J., Kang, J., Kim, Y., Jang, D., & Park, S. (2022). Multimodal Deep Learning for Patent Classification. In Proceedings of Sixth International Congress on Information and Communication Technology: ICICT 2021, London, Volume 4 (pp. 281–289). Springer Singapore. https://doi.org/10.1007/978-981-16-2102-4_26.
  • Li, X., Wan, S., Zou, C., & Yin, B. (2017). Multi-instance Multi-label Learning for Image Categorization Based on Integrated Contextual Information. In Image and Graphics: 9th International Conference, ICIG 2017, Shanghai, China, September 13–15, 2017, Revised Selected Papers, Part I 9 (pp. 639–650). Springer International Publishing. https://doi.org/10.1007/978-3-319-71607-7_56.
  • Lu, Q., Zhu, Z., Xu, F., Zhang, D., Wu, W., & Guo, Q. (2020). Bi-GRU sentiment classification for Chinese based on grammar rules and bert. International Journal of Computational Intelligence Systems, 13(1), 538–548. https://doi.org/10.2991/ijcis.d.200423.001
  • Lucheng, L., Tao, H., Jian, Z., & Yajuan, Z. (2020). Research on the method of Chinese patent automatic classification based on deep learning. Library and Information Service, 64(10), 75. https://doi.org/10.13266/j.issn.0252-3116.2020.10.009
  • Ma, J. H., Wang, R. Y., & Yao, S. (2018). Patent classification method based on depth learning. Computer Engineering, 44(10), 209–214. https://doi.org/10.19678/j.issn.1000-3428.0048159
  • Mansueli, R., Domingues, M. A., & Feltrim, V. D. (2022, November). Improving Multilabel Text Classification with Stacking and Recurrent Neural Networks. In Proceedings of the Brazilian Symposium on Multimedia and the Web (pp. 117–122). https://doi.org/10.1145/3539637.3557000.
  • Sarica, S., Luo, J., & Wood, K. L. (2020). Technet: Technology semantic network based on patent data. Expert Systems with Applications, 142, 112995. https://doi.org/10.1016/j.eswa.2019.112995
  • Shunxiang, Z., Aoqiang, Z., Guangli, Z., Zhongliang, W., & KuanChing, L. (2023). Building fake review detection model based on sentiment intensity and PU learning. IEEE Transactions on Neural Networks and Learning Systems. https://doi.org/10.1016/j.jnca.2020.102815
  • Sivakumar, S., & Rajalakshmi, R. (2021). Self-attention based sentiment analysis with effective embedding techniques. International Journal of Computer Applications in Technology, 65(1), 65–77. https://doi.org/10.1504/IJCAT.2021.113651
  • Soumya George, K., & Joseph, S. (2014). Text classification by augmenting bag of words (BOW) representation with co-occurrence feature. IOSR Journal of Computer Engineering, 16(1), 34–38. https://doi.org/10.9790/0661-16153438
  • Su, H. (2018). Patent Technology Classification and Citation Level Projection. Deep Learning, Winter 2018, Stanford University, CA.
  • Wang, F., Liu, Z., & Wang, C. (2019). An improved kNN text classification method. International Journal of Computational Science and Engineering, 20(3), 397–403. https://doi.org/10.1504/IJCSE.2019.103944
  • Wang, R., Li, Z., Cao, J., Chen, T., & Wang, L. (2019, July). Convolutional recurrent neural networks for text classification. In 2019 International Joint Conference on Neural Networks (IJCNN) (pp. 1–6). IEEE. https://doi.org/10.1109/IJCNN.2019.8852406.
  • Wang, T., Zhu, X. F., & Tang, G. (2022). Knowledge-enhanced graph convolutional neural networks for text classification. Journal of Zhejiang University ( Engineering), 002. 056.10.3785j.issn.1008-973X.2022.02.013
  • Wen, C., Zeng, C., Ren, J., & Zhang, Y. (2021). Patent text classification based on ALBERT and bidirectional gated recurrent unit. Journal of Computer Applications, 41(2), 407. https://doi.org/10.11772/j.issn.1001-9081.2020050730
  • Wu, J., Huang, C., & Chen, Y. (2020, October). Patent Text Classification Study Based on Bi-LSTM-A Model. In 2020 5th International Conference on Control, Robotics and Cybernetics (CRC) (pp. 1–5). IEEE. https://doi.org/10.1109/CRC51253.2020.9253461.
  • Xiang, B., Feng, L. G., & Li, Y. G. (2018). Patent text classification method based on multi-instance learning. Information Studies: Theory & Application, 41(11), 144–148. https://doi.org/10.16353/j.cnki.1000—7490.2018.11.026
  • Xiao, L., Huang, X., Chen, B., & Jing, L. (2019). Label-specific document representation for multi-label text classification. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP) (pp. 466–475). https://doi.org/10.18653/v1/D19-1044.
  • Xiao, L., Wang, G., & Liu, Y. (2018, December). Patent text classification based on naive bayesian method. In 2018 11th International Symposium on Computational Intelligence and Design (ISCID) (Vol. 1, pp. 57–60). IEEE. https://doi.org/10.1109/ISCID.2018.00020.
  • Xiao, Y., Li, H., Zhang, L., Lv, X., & You, X. (2021). Research on Chinese patent text classification method based on feature fusion. Data Analysis and Knowledge Discovery, 1. https://kns.cnki.net/kcms/detail/10.1478.G2.20211104.1706.002.html.
  • Xinyu, T., Ruijie, Z., & Yonghe, L. (2022). Multi-label patent classification with pre-training model. Data Analysis and Knowledge Discovery, 6(2/3), 129–137. https://doi.org/10.11925/infotech.2096-3467.2021.0930
  • Yu, B., & Zhang, P. (2020). WPOS-GRU patent classification method based on two-channel feature fusion. Application Research of Computer, 37(3), 655–658. https://doi.org/10.19734/j.issn.1001-3695.2018.08.0628
  • Zhang, S., Xu, H., Zhu, G., Chen, X., & Li, K. (2022). A data processing method based on sequence labeling and syntactic analysis for extracting new sentiment words from product reviews. Soft Computing, 26, 1–14. https://doi.org/10.1007/s00500-021-06228-9
  • Zhang, S., Yu, H., & Zhu, G. (2022). An emotional classification method of Chinese short comment text based on ELECTRA. Connection Science, 34(1), 254–273. https://doi.org/10.1080/09540091.2021.1985968
  • Zhang, S., Zhao, T., Wu, H., Zhu, G., & Li, K. (2023). TS-GCN: Aspect-level sentiment classification model for consumer reviews. Computer Science and Information Systems, 20(1), 117–136. https://doi.org/10.2298/csis220325052z
  • Zhang, Z., Xu, T., Zhang, L., Du, Y., Xiong, H., & Chen, E. (2021). Knowledge powered cooperative semantic fusion for patent classification. In Artificial Intelligence: First CAAI International Conference, CICAI 2021, Hangzhou, China, June 5–6, 2021, Proceedings, Part I 1 (pp. 111–122). Springer International Publishing. https://doi.org/10.1007/978-3-030-93046-2_10.